Wells Fargo is seeking a Lead Systems Operations Engineer
Job Responsibilities:
Lead complex, broad-impact initiatives, including provision of high-level systems consultation for the technology teams
Work as a key participant in large-scale planning of computer systems and network infrastructure for the Systems Operations functional area
Review and analyze complex technical challenges, as well as escalated support issues related to core business solutions that require in-depth evaluation of multiple factors, such as alternatives, enhancements, periodic systems reviews, or improvements to existing systems
Make decisions on technical changes and enhancements
Consult with the engineering team on change design requiring a solid understanding of technical process controls or standards that influence and drive new initiatives
Collaborate and consult with technical peers, colleagues, and mid-level to more experienced managers to resolve systems support issues and achieve goals
Lead the transformation of traditional platform operations into a modern Site Reliability Engineering (SRE) model: driving reliability by design, elevating SLIs/SLOs, automating operational toil, strengthening observability, and maturing incident and problem management
Be hands-on while mentoring Ops and Engineering teams to adopt SRE practices at scale across the platform ecosystem
Define and implement SLIs/SLOs and error budgets for critical platform services; drive SLO adoption across product and operations teams
Build, enhance, and tune end-to-end observability (metrics, logs, traces) with a focus on the golden signals: latency, traffic, errors, and saturation
Partner with performance engineering teams to run load, stress, soak, and failover tests; identify and eliminate performance bottlenecks
Identify and eliminate operational toil; implement automation and AI-driven workflows for reliability and operational excellence
Generate AI-based observability assessments, maturity scoring, and gap analysis for all platform applications
Build self-service reliability tooling: automated runbooks, readiness checkers, golden paths, and standard reliability patterns
Lead major incidents as Incident Commander; ensure clear communication, rapid triage, and timely restoration
Facilitate blameless postmortems, document corrective actions, and ensure follow-through
Strengthen platform-level problem management through trend analysis, recurring issue elimination, and proactive risk reduction
Coach and mentor platform engineering, ops, and product teams on SRE principles and a reliability-first mindset
Define and maintain SRE maturity models, track adoption, and provide continuous improvement recommendations
Requirements:
5+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years in large-scale distributed systems; 5+ years of hands-on experience in SRE, DevOps, or Platform Engineering
Cloud: Expertise in one or more: AWS, Azure, GCP (cloud certifications preferred)
IaC & Automation: Terraform, Ansible/Chef; strong Git and GitOps practices
Observability: Hands-on experience with Prometheus, Grafana, OpenTelemetry, ThousandEyes, AppDynamics, Aternity
CI/CD: Azure DevOps, GitHub Actions, Jenkins, or GitLab CI; strong understanding of artifact management and environment promotion workflows
Programming: Proficiency in Python/Go/Java for scripting, automation, and API integrations