Wells Fargo is seeking a Lead Systems Operations Engineer – Platform Reliability Engineering (PRE) within the CTO Platform organization. This role is aligned to modern Site Reliability Engineering (SRE) practices and is responsible for driving reliability, resiliency, observability, and operational excellence across critical platform and application services. The role is intended for senior engineers with deep expertise in one core platform domain, applying that expertise to proactively improve platform stability, scalability, and availability.
Job Responsibilities
Act as a Platform Reliability Engineering (PRE) subject matter expert, providing deep technical leadership in one core domain (Database, Cloud, Network, Compute/Storage, Middleware, or Application Support)
Lead analysis and resolution of complex, systemic production reliability issues, translating recurring incidents into long-term engineering solutions
Apply SRE principles including SLIs, SLOs, error budgets, and incident-driven engineering improvements to both new and legacy platforms
Define and drive enterprise observability standards, including metrics, logs, traces, alerting, and service health dashboards
Design and implement automation-first solutions to reduce operational toil, improve MTTR, and enable self-healing and self-service
Partner with application, infrastructure, cloud, and support teams to improve availability, performance, capacity, and resiliency
Lead or contribute to blameless post-mortems, ensuring measurable and sustained reduction of repeat incidents
Translate complex technical and operational risks into clear, data-driven guidance for senior leadership
Mentor engineers and support staff on reliability engineering, observability, and automation best practices
Requirements
5+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of experience in Systems Operations, SRE, Platform Engineering, or Production Support with deep expertise in at least one platform domain: Database, Cloud, Network, Compute/Storage, Middleware, or Enterprise Application Support
Nice to have
Strong hands-on experience applying SRE practices, including SLI/SLO definition, error budgets, and reliability metrics
Proven experience troubleshooting and resolving issues in large-scale, distributed production systems
Hands-on experience with observability and monitoring tools such as Grafana, Splunk, Prometheus, Cribl, ThousandEyes, AppDynamics, or equivalent, including dashboards, alerting, logs, and metrics
Strong scripting and automation skills using Python, Bash, and/or PowerShell to reduce operational toil
Experience building automation or reliability tooling using APIs, Git-based workflows, and modern engineering practices
Solid understanding of incident, problem, and change management in enterprise production environments
Strong communication and influencing skills across engineering teams and senior leadership
Experience with capacity management, performance engineering, and resiliency design (HA, fault tolerance, RTO/RPO)
Experience operating in hybrid environments (on‑prem + cloud) with complex enterprise dependencies
Familiarity with infrastructure automation / IaC tools such as Ansible or Terraform
Ability to drive technical debt remediation for critical legacy platforms using structured backlogs
Experience mentoring or leading senior engineers in reliability, operations, or SRE-focused roles
What we offer
Health benefits
401(k) Plan
Paid time off
Disability benefits
Life insurance, critical illness insurance, and accident insurance