Wells Fargo is seeking a Lead Systems Operations Engineer for Platform Reliability Engineering (PRE) within the CTO Platform organization. This role is aligned to modern Site Reliability Engineering (SRE) practices and is responsible for driving reliability, resiliency, observability, and operational excellence across critical platform and application services. The role is intended for senior engineers with deep expertise in one core platform domain, applying that expertise to proactively improve platform stability, scalability, and availability.
Job Responsibility:
Lead complex, broad-impact initiatives, including providing high-level systems consultation for the technology teams
Act as a key participant in large-scale planning of computer systems and network infrastructure for the Systems Operations functional area
Review and analyze complex technical challenges, as well as escalated support issues related to core business solutions that require in-depth evaluation of multiple factors, such as alternatives, enhancements, periodic systems reviews, or improvements to existing systems
Make decisions on technical changes and enhancements
Consult with engineering team on change design requiring solid understanding of technical process controls or standards that influence and drive new initiatives
Collaborate and consult with technical peers, colleagues, and mid-level to more experienced managers to resolve systems support issues and achieve goals
Requirements:
5+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of experience in Systems Operations, SRE, Platform Engineering, or Production Support with deep expertise in at least one platform domain: Database, Cloud, Network, Compute/Storage, Middleware, or Enterprise Application Support
Strong hands-on experience applying SRE practices, including SLI/SLO definition, error budgets, and reliability metrics
Proven experience troubleshooting and resolving issues in large-scale, distributed production systems
Hands-on experience with observability and monitoring tools such as Grafana, Splunk, Prometheus, Cribl, ThousandEyes, AppDynamics, or equivalent, including dashboards, alerting, logs, and metrics
Strong scripting and automation skills using Python, Bash, and/or PowerShell to reduce operational toil
Experience building automation or reliability tooling using APIs, Git-based workflows, and modern engineering practices
Solid understanding of incident, problem, and change management in enterprise production environments
Strong communication and influencing skills across engineering teams and senior leadership
Experience with capacity management, performance engineering, and resiliency design (HA, fault tolerance, RTO/RPO)
Experience operating in hybrid environments (on‑prem + cloud) with complex enterprise dependencies
Familiarity with infrastructure automation / IaC tools such as Ansible or Terraform
Ability to drive technical debt remediation for critical legacy platforms using structured backlogs
Experience mentoring or leading senior engineers in reliability, operations, or SRE-focused roles
Strong collaboration and partnering skills across platform, application, and support teams
Ability to manage multiple priorities in a fast-paced, high-impact production environment
Consistent delivery of high-quality reliability outcomes within expected timelines
High attention to detail, data-driven problem-solving, and operational rigor
Nice to have:
Prior project or initiative leadership experience is highly desirable