Senior Systems Operations Engineer - SRE and AIOps

India, Hyderabad · Job Posted June 17, 2026

Job Description

Wells Fargo is seeking a Senior Systems Operations Engineer within the Enterprise Functions Technology, Center of Excellence platform engineering team to deliver and support cloud workloads and services, provide engineering support and drive modernization of critical cloud capabilities.

Job Responsibility

  • Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
  • Contribute to increasing system efficiencies and reducing the human intervention time on related tasks
  • Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
  • Work with vendors and other technical personnel for problem resolution
  • Lead the team to meet technical deliverables while leveraging a solid understanding of technical process controls or standards
  • Collaborate with vendors and other technical personnel to resolve technical issues and achieve highest levels of systems and infrastructure availability

Requirements

  • 4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Strong Java / backend service development experience
  • Distributed systems and API-based service design
  • CI/CD pipelines and Git-based workflows
  • 3+ years of experience with scripting and infrastructure automation using Terraform
  • 3+ years of hands-on experience with OpenShift, GCP or Azure platform enablement and application migrations, build out of complex infrastructure programmable patterns using Infrastructure as Code (IaC)
  • 2+ years of knowledge and understanding of Cloud service offerings such as data, analytics, and AI/ML on GCP or Azure
  • 2+ years of experience with key services provided by Azure and/or GCP such as BigQuery, Vertex AI, Dataproc, Functions, AKS, and Service Fabric
  • 2+ years working in a globally distributed team to provide innovative and robust cloud centric solutions
  • 2+ years gathering and analyzing data to diagnose the root cause of cloud workload issues, recommending and implementing solutions to resolve issues in a timely manner
  • Exposure to cloud governance and logging/monitoring tooling
  • Experience with Agile concepts and Site Reliability Engineering (SRE) Principles
  • Understanding, engineering and implementing disaster recovery and business continuity playbooks
  • Proficiency in container-based solutions and services, with experience handling large-scale Kubernetes-based infrastructure build-out and provisioning on OpenShift, Azure or GCP
  • Knowledge and understanding of Cloud Service offerings on OpenShift, Azure or GCP related to security, data protection, and policy implementations
  • Ability to articulate technical solutions to both technical and business partners
  • Good understanding of networking, firewalls, and load balancing concepts (IP, DNS, Guardrails, VNets) and exposure to databases, cloud security, Active Directory, authentication methods, and RBAC
  • SRE / Reliability:
      • Production support mindset (incident response, on-call readiness)
      • Observability: logging, metrics, tracing (Splunk, AppDynamics (AppD), or similar tools)
      • Performance, availability, and reliability engineering concepts
      • Experience partnering with SRE or platform teams
  • Platform / Cloud:
      • Kubernetes/OpenShift (deployments, troubleshooting, scaling)
      • Infrastructure-as-Code exposure (Terraform/Helm is a plus)

Desired Qualifications

  • Set and evangelize the SRE and AIOps technical strategy for EFT, establishing reference architectures, standards, and guardrails (service tiering, onboarding criteria, SLO/error budget governance) and holding teams accountable through transparent executive-level reporting
  • Own the reliability and observability architecture across hybrid/multi-cloud, driving standardization of monitoring, logging, tracing, synthetics, and resilience/chaos testing; define platform patterns that teams can adopt with minimal friction
  • Design and implement AIOps and automation platforms (event correlation, anomaly detection, runbook automation, self-healing) with strong engineering discipline (testability, auditability, change safety) and prioritize initiatives that materially reduce incident volume, toil, and MTTR
  • Define the reliability measurement system (SLIs/SLOs, error budgets, customer impact, MTTR/MTBF, change failure rate) and build reusable dashboards and alerts that drive consistent prioritization, investment decisions, and engineering behavior across teams
  • Provide technical leadership during major incidents for critical services, driving rapid triage, clear stakeholder communications, and cross-domain coordination; institutionalize blameless post-incident reviews and engineering mechanisms that eliminate systemic causes
  • Partner with application, platform, and architecture leaders to embed reliability into planning and delivery (design and architecture reviews, operational readiness gates, non-functional requirements, capacity/performance engineering), influencing roadmaps based on quantified risk and customer impact
  • Lead multi-quarter, cross-organization reliability transformations (e.g., platform modernization, resilience programs, observability convergence), delivering reusable capabilities and operating mechanisms that improve reliability posture and reduce operational risk at scale

Similar Jobs for Senior Systems Operations Engineer - SRE and AIOps

8 matching positions

Senior AIOps Engineer (Platform & Infrastructure)

Groupon is moving beyond "experimenting" with AI to running it at massive scale....
Location: Prague; Warsaw; Valencia; Madrid
Salary: Not provided
Company: Groupon
Expiration Date: Until further notice
Requirements
  • 5+ years in Platform Engineering, SRE, or DevOps within a cloud-native environment
  • Deep experience managing stateful and stateless workloads (Helm, Istio, Docker)
  • Hands-on experience deploying and operating AI/ML tools or data-intensive systems in production
  • Strong skills in Python or Go to build custom API wrappers and automate operational tasks
  • Expertise in Prometheus, Grafana, and ELK stack to ensure end-to-end observability of complex AI requests
Job Responsibility
  • Architect the AI Stack: Design and operate core infrastructure on Kubernetes, including Vector Databases, LLM Gateways (LiteLLM), and workflow automation tools (n8n)
  • Enable at Scale: Drive AI adoption by creating self-service "Golden Paths" using Terraform and Helm, allowing engineering teams to deploy RAG pipelines with one click
  • Operational Excellence: Implement centralized observability, tracing (Langfuse), and governance to ensure our AI systems are reliable, auditable, and secure
  • Fiscal Discipline: Own the "AI Bill"—monitoring token usage and latency to optimize spend while maintaining high performance
What we offer
  • End-to-end Ownership: Real authority to standardize how a global company builds with AI
  • Career Growth: This is a high-visibility role within a new, strategic team with potential for leadership progression

Senior Software Engineer, AI

LogicMonitor is advancing observability through AI‑driven data intelligence, con...
Location: India, Pune
Salary: Not provided
Company: LogicMonitor
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Data Engineering, or a related field
  • 4-5 years of experience in backend or data systems engineering
  • Experience building streaming data pipelines (Kafka / Spark or any similar technology)
  • Strong programming background in Java and Python, including microservice design
  • Experience with ETL, data modeling, and distributed storage systems
  • Familiarity with LLM pipelines, embeddings, and vector retrieval
  • Understanding of Kubernetes, containerization, and CI/CD workflows
  • Awareness of data governance, validation, and lineage best practices
  • Strong communication and collaboration across AI, Data, and Platform teams
Job Responsibility
  • Design and build streaming and batch data pipelines that process metrics, logs, and events for AI workflows
  • Develop ETL and feature‑extraction pipelines using Python and Java microservices
  • Integrate data ingestion and enrichment from multiple observability sources into AI‑ready formats
  • Build resilient data orchestration using Kafka, Airflow, and Redis Streams
  • Develop data indexing and semantic search for large‑scale observability and operational data
  • Work with structured and unstructured data lakes and warehouses (Delta Lake, Iceberg, ClickHouse)
  • Collaborate with the AI Platform team to manage embeddings, metadata, and model context storage
  • Optimize latency and throughput for retrieval, query expansion, and AI response generation
  • Build and maintain Java microservices (Spring Boot) that serve AI and analytics data to Edwin and AIOps applications
  • Develop Python APIs (FastAPI / LangGraph) for LLM orchestration, summarization, and correlation reasoning

Senior Ansible Automation & Platform Engineer

The Senior Ansible Automation & Platform Engineer is a strategic member of the o...
Location: United States, Austin; Mountain View; Warren
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 7–12+ years in Architecture, DevOps, SRE, Platform Engineering, or Infrastructure Engineering
  • Expert-level proficiency with Ansible (playbooks, roles, collections, Jinja2, modules)
  • Hands-on experience designing and operating Ansible Automation Platform (AAP)
  • Strong experience with Terraform, Chef, or other IaC tools
  • Deep Linux engineering background and configuration management expertise
  • Expert in integrating automation with ServiceNow (CMDB, ITSM, workflows)
  • Exceptional scripting skills (Python, Bash, PowerShell)
  • Experience with AWS/Azure/GCP automation
  • Experience with Kubernetes, containerization, and orchestration
  • Experience with CI/CD pipelines (GitHub Actions, GitLab, Jenkins, Azure DevOps)
Job Responsibility
  • Architect, design, and operate the Ansible Automation Platform (AAP) including controller, execution environments, mesh architecture, and collections strategy
  • Define and maintain the Ansible Platform roadmap, including feature evolution, lifecycle management, scalability planning, and enterprise adoption milestones
  • Establish platform governance: coding standards, role/playbook patterns, collections, testing frameworks, and security guardrails
  • Build and maintain Execution Environments (EEs) optimized for performance, security, and dependency management
  • Lead platform upgrades, migrations, and cross-environment standardization
  • Design enterprise-grade Ansible automation frameworks with reusable roles, collections, and modular playbooks
  • Build automation for provisioning, configuration management, patching, compliance, and cloud infrastructure
  • Integrate Ansible with Terraform, CI/CD pipelines, GitOps workflows, and event-driven automation systems
  • Implement self-service automation capabilities for developers, operations, and business teams
  • Integrate agentic AI systems to enhance automation workflows, including AI-driven playbook generation and validation, automated remediation recommendations, intelligent change-impact analysis, and AI-assisted troubleshooting and root-cause analysis
What we offer
  • Relocation benefits (may be eligible)
  • Full-time

Lead Software Engineer

Location: India, Hyderabad
Salary: Not provided
Company: Wells Fargo
Expiration Date: Until further notice
Requirements
  • 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Experience in Software Engineering, SRE, DevOps, or Platform Engineering
  • Strong proficiency in Python for automation and tooling
  • Hands‑on experience with Grafana, Prometheus, and Splunk in production environments
  • Solid understanding of SLIs, SLOs, dashboards, alerting, and observability best practices
  • Experience applying AI/ML concepts to monitoring, alerting, or operational analytics
  • Strong knowledge of Linux, networking, and distributed systems
  • Experience with Cloud platforms and Kubernetes/OpenShift
  • Proven experience leading incidents, RCAs, and reliability initiatives
  • Experience building custom Prometheus exporters or advanced Grafana dashboards
Job Responsibility
  • Lead complex technology initiatives including those that are companywide with broad impact
  • Act as a key participant in developing standards and companywide best practices for engineering complex and large scale technology solutions for technology engineering disciplines
  • Design, code, test, debug, and document for projects and programs
  • Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
  • Make decisions in developing standard and companywide best practices for engineering and technology solutions requiring understanding of industry best practices and new technologies, influencing and leading technology team to meet deliverables and drive new initiatives
  • Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
  • Lead projects, teams, or serve as a peer mentor
  • Own and improve availability, performance, scalability, and resilience of production systems
  • Define, monitor, and manage SLIs/SLOs and error budgets to guide reliability investments
  • Lead capacity planning, performance testing, failover readiness, and disaster‑recovery design
  • Full-time

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: India, Bengaluru
Salary: Not provided
Company: AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3+ of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: India, Delhi
Salary: Not provided
Company: AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3+ of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: India, Pune
Salary: Not provided
Company: AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3+ of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services in a “You Build It, You Run It” culture
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location: Finland, Helsinki
Salary: Not provided
Company: AlphaSense
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • At least 3+ of those years operating in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing