Wells Fargo is seeking a Senior Lead Systems Operations Engineer.
Job Responsibilities:
Act as an advisor to senior leadership to develop or influence platform support solutions for highly complex business and technical needs or technology initiatives
Lead highly complex, broad-impact initiatives, including provision of high-level systems consultation for the technology teams related to large-scale planning of computer systems and network infrastructure for Systems Operations functional areas
Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
Translate advanced technology experience, in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
Provide vision, direction and expertise to senior leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Provide training and mentoring to less experienced team members on guidebook changes and lead the team to meet technical deliverables, while leveraging a solid understanding of technical process controls or standards
Act as a Platform Reliability Engineering (PRE) subject matter expert, providing deep technical leadership in one core domain (Database, Cloud, Network, Compute/Storage, Middleware, or Application Support)
Lead analysis and resolution of complex, systemic production reliability issues, translating recurring incidents into long-term engineering solutions
Apply SRE principles including SLIs, SLOs, error budgets, and incident-driven engineering improvements to both new and legacy platforms
Define and drive enterprise observability standards, including metrics, logs, traces, alerting, and service health dashboards
Design and implement automation-first solutions to reduce operational toil, improve MTTR, and enable self-healing and self-service
Partner with application, infrastructure, cloud, and support teams to improve availability, performance, capacity, and resiliency
Lead or contribute to blameless post-mortems, ensuring measurable and sustained reduction of repeat incidents
Translate complex technical and operational risks into clear, data-driven guidance for senior leadership
Mentor engineers and support staff on reliability engineering, observability, and automation best practices
Requirements:
7+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
7+ years of experience in Systems Operations, SRE, Platform Engineering, or Production Support with deep expertise in at least one platform domain: Database, Cloud, Network, Compute/Storage, Middleware, or Enterprise Application Support
Strong hands-on experience applying SRE practices, including SLI/SLO definition, error budgets, and reliability metrics
Proven experience troubleshooting and resolving issues in large-scale, distributed production systems
Hands-on experience with observability and monitoring tools such as Grafana, Splunk, Prometheus, Cribl, ThousandEyes, AppDynamics, or equivalent, including dashboards, alerting, logs, and metrics
Strong scripting and automation skills using Python, Bash, and/or PowerShell to reduce operational toil
Experience building automation or reliability tooling using APIs, Git-based workflows, and modern engineering practices
Solid understanding of incident, problem, and change management in enterprise production environments
Strong communication and influencing skills across engineering teams and senior leadership
Experience with capacity management, performance engineering, and resiliency design (HA, fault tolerance, RTO/RPO)
Experience operating in hybrid environments (on‑prem + cloud) with complex enterprise dependencies
Familiarity with infrastructure automation / IaC tools such as Ansible or Terraform
Ability to drive technical debt remediation for critical legacy platforms using structured backlogs
Experience mentoring or leading senior engineers in reliability, operations, or SRE-focused roles
Nice to have:
Prior project or initiative leadership experience is highly desirable