Senior Systems Operations Engineer - SRE and AIOps Job at Wells Fargo (Hyderabad)

Job Description

Wells Fargo is seeking a Senior Systems Operations Engineer within the Enterprise Functions Technology, Center of Excellence platform engineering team to deliver and support cloud workloads and services, provide engineering support and drive modernization of critical cloud capabilities.

Job Responsibility

Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
Contribute to increasing system efficiency and reducing human intervention time on related tasks
Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
Work with vendors and other technical personnel for problem resolution
Lead the team to meet technical deliverables while leveraging a solid understanding of technical process controls or standards
Collaborate with vendors and other technical personnel to resolve technical issues and achieve highest levels of systems and infrastructure availability

Requirements

4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Strong Java / backend service development experience
Distributed systems and API-based service design
CI/CD pipelines and Git-based workflows
3+ years of experience with scripting and infrastructure automation using Terraform
3+ years of hands-on experience with OpenShift, GCP, or Azure platform enablement and application migrations, and build-out of complex programmable infrastructure patterns using Infrastructure as Code (IaC)
2+ years of knowledge and understanding of cloud service offerings such as data, analytics, and AI/ML on GCP or Azure
2+ years of experience with key services provided by Azure and/or GCP such as BigQuery, Vertex AI, Dataproc, Functions, AKS, and Service Fabric
2+ years working in a globally distributed team to provide innovative and robust cloud centric solutions
2+ years gathering and analyzing data to diagnose the root cause of cloud workload issues, and recommending and implementing solutions to resolve issues in a timely manner
Exposure to cloud governance and logging/monitoring tooling
Experience with Agile concepts and Site Reliability Engineering (SRE) Principles
Experience understanding, engineering, and implementing disaster recovery and business continuity playbooks
Proficiency in container-based solutions and services, with experience handling large-scale Kubernetes-based infrastructure build-out and provisioning on OpenShift, Azure, or GCP
Knowledge and understanding of Cloud Service offerings on OpenShift, Azure or GCP related to security, data protection, and policy implementations
Ability to articulate technical solutions to both technical and business partners
Good understanding of networking, firewall, and load balancing concepts (IP, DNS, guardrails, VNets) and exposure to databases, cloud security, Active Directory, authentication methods, and RBAC
SRE / Reliability
Production support mindset (incident response, on-call readiness)
Observability: logging, metrics, tracing (Splunk, AppDynamics, or similar tools)
Performance, availability, and reliability engineering concepts
Experience partnering with SRE or platform teams
Platform / Cloud
Kubernetes/OpenShift (deployments, troubleshooting, scaling)
Infrastructure-as-Code exposure (Terraform/Helm is a plus)

Desired Qualifications

Set and evangelize the SRE and AIOps technical strategy for EFT, establishing reference architectures, standards, and guardrails (service tiering, onboarding criteria, SLO/error budget governance) and holding teams accountable through transparent executive-level reporting
Own the reliability and observability architecture across hybrid/multi-cloud, driving standardization of monitoring, logging, tracing, synthetics, and resilience/chaos testing; define platform patterns that teams can adopt with minimal friction
Design and implement AIOps and automation platforms (event correlation, anomaly detection, runbook automation, self-healing) with strong engineering discipline (testability, auditability, change safety) and prioritize initiatives that materially reduce incident volume, toil, and MTTR
Define the reliability measurement system (SLIs/SLOs, error budgets, customer impact, MTTR/MTBF, change failure rate) and build reusable dashboards and alerts that drive consistent prioritization, investment decisions, and engineering behavior across teams
Provide technical leadership during major incidents for critical services, driving rapid triage, clear stakeholder communications, and cross-domain coordination; institutionalize blameless post-incident reviews and engineering mechanisms that eliminate systemic causes
Partner with application, platform, and architecture leaders to embed reliability into planning and delivery (design and architecture reviews, operational readiness gates, non-functional requirements, capacity/performance engineering), influencing roadmaps based on quantified risk and customer impact
Lead multi-quarter, cross-organization reliability transformations (e.g., platform modernization, resilience programs, observability convergence), delivering reusable capabilities and operating mechanisms that improve reliability posture and reduce operational risk at scale
