Lead Systems Operations Engineer - Platform Reliability Engineering, SRE, Observability and Monitoring, Platform Support Job at Wells Fargo (Bengaluru)

Staff Observability Operations Engineer

We are currently seeking several experienced and highly skilled Staff Observabil...

Location

United States, Hartford

Salary:

130295.00 - 260590.00 USD / Year

CVS Health

Expiration Date

Until further notice

Requirements

7+ Years of experience in IT operations, with significant responsibilities in system monitoring, performance tuning, and troubleshooting enterprise applications
5+ Years in a Site Reliability Engineering (SRE) role deploying and managing modern observability solutions
5+ Years managing and implementing observability and event management platforms (e.g., AppDynamics, Splunk, Prometheus, Grafana)
Experience developing and administering ServiceNow ITOM event management solutions
Experience deploying and managing service reliability platforms (e.g., xMatters, OpsGenie, PagerDuty)
Experience with and deep knowledge of cloud environments, cloud monitoring platforms, and container orchestration tools (e.g., AWS/CloudTrail, Azure/Monitor, GCP/GCM, Kubernetes, OpenShift)
Proficiency in Python and in other scripting and automation tools (e.g., Ansible, PowerShell, Bash) for automation and configuration
Hands-on experience deploying, managing, and administering observability platforms
Hands-on experience leading, coordinating, and performing migration of application, platform, and infrastructure observability solutions
Proven ability to troubleshoot and resolve complex technical issues

Job Responsibility

Deploy and implement modern observability solutions
Manage and administer observability and event management platforms
Coordinate and manage release cycles for observability platforms
Troubleshoot and resolve incidents related to observability platforms
Continuously monitor and enhance platform performance
Collaborate with cross-functional stakeholders
Provide training and mentoring to junior engineers
Ensure compliance and security of observability platforms
Maintain documentation of observability platform configurations
Generate and analyze reports on platform performance and capacity

What we offer

Affordable medical plan options
401(k) plan (including matching company contributions)
Employee stock purchase plan
No-cost programs for all colleagues, including wellness screenings, tobacco cessation, and weight management programs
Confidential counseling and financial coaching
Paid time off
Flexible work schedules
Family leave
Dependent care resources
Colleague assistance programs

Full-time

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs.

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and increasing the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs.

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace.

Full-time

Platform Engineer

Motorica is at a breakthrough moment. We’ve built a generative AI animation plat...

Location

Sweden, Stockholm

Salary:

Not provided

Motorica

Expiration Date

Until further notice

Requirements

Proven experience in Platform Engineering, SRE, or DevOps, ideally in high-growth or AI/ML-heavy environments
Strong grasp of CI/CD systems, cloud infrastructure (AWS/GCP), and containerization (Docker/Kubernetes)
Familiarity with observability, monitoring, and incident response best practices
Security mindset with hands-on experience in audits, compliance (ISO 27001, SOC2, etc.), and vulnerability management
Strong communication skills: you’ll be interfacing with developers daily and need to translate infrastructure into clarity, not complexity
A proactive, solution-oriented mindset: you anticipate friction before others feel it

Job Responsibility

Provide common infrastructure guidance, reusable patterns, and automated tooling to engineering teams
Own the “paved road” for developers, reducing friction and cognitive load
Champion and implement security best practices across the entire platform
Play a key role in achieving ISO 27001 certification through technical implementation and evidence gathering
Build and operate a highly reliable and cost-efficient platform, with particular focus on optimizing GPU-heavy AI/ML workloads
Manage CI/CD systems (GitHub Actions, GitLab CI) and track key metrics like build times, deployment frequency, and failure rates
Oversee cloud environments (AWS, GCP), including health, security, and cost reporting
Lead security scans, audits, and vulnerability remediation
Maintain observability stack (Prometheus, Grafana, Datadog, GCP Logging), ensuring meaningful dashboards and alerts
Act as point-of-contact for ML Research team’s infra requests (GPU access, specialized pipelines)

What we offer

Stock Options program
Retirement Plan
Health Benefits (5000 SEK/year)
Life Insurance / Health Insurance / Injury Insurance
Competitive compensation

Full-time

Software Engineer, Site Reliability

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical r...

Location

United States, San Mateo

Salary:

Not provided

Fireworks AI

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
Proven ability to troubleshoot complex issues across the entire stack
Excellent communication, collaboration, and problem-solving skills

Job Responsibility

Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts

Full-time

Senior Security Operations Engineer II

As a Senior Security Operations Engineer, you’ll play a key role in ensuring the...

Location

United States, Scottsdale

Salary:

Not provided

Axon

Expiration Date

Until further notice

Requirements

7+ years of experience in operations, site reliability, or infrastructure engineering roles
Strong experience securing and managing cloud environments (e.g., AWS, Azure) and containerized workloads
Deep understanding of Linux systems, networking, distributed systems, and their associated security controls
Proficiency in automation, scripting, and security tooling integration to streamline operations and enforcement
Experience with security monitoring, alerting, SIEM platforms, and observability tools
Solid grasp of CI/CD practices with integrated security testing and compliance checks
Experience managing Kubernetes clusters and running containerized workloads in production
Experience deploying and administering any of the following: scalable cloud-native secrets solutions such as AWS KMS or Azure Key Vault; PKI solutions such as EJBCA, Smallstep, or Venafi; or vaulting solutions such as HashiCorp Vault

Job Responsibility

Implementing and improving automated security checks in CI/CD pipelines to prevent vulnerabilities from reaching production
Writing, reviewing, and maintaining security-focused infrastructure-as-code for scalable and compliant deployments
Investigating security incidents, performing root cause analysis, and implementing long-term mitigation strategies
Collaborating with developers to develop new features, services, and infrastructure requirements
Enhancing security observability through improved log collection, metrics, and alerting configurations
Maintaining and improving security runbooks, incident response playbooks, and internal security tooling for operational efficiency
Resolving security and infrastructure incidents, taking part in high-impact, high-visibility incidents as a participant and, ideally, as an incident commander
Maintaining and securing critical infrastructure components such as PKI (Public Key Infrastructure) and IAM (Identity & Access Management) systems, ensuring reliability, scalability, and compliance with organizational and industry security standards
Building and maintaining secure, reliable, and scalable infrastructure that protects core services and sensitive data
Troubleshooting and resolving complex operational and system-level issues across environments

What we offer

Competitive salary and 401k with employer match
Discretionary paid time off
Paid parental leave for all
Medical, Dental, Vision plans
Fitness Programs
Emotional & Mental Wellness support
Learning & Development programs
Snacks in our offices

Full-time

Director SRE & Operations

Director SRE & Operations for E-business / Digital at PUMA in Herzogenaurach, Ge...

Location

Germany, Herzogenaurach

Salary:

Not provided

Puma Group

Expiration Date

Until further notice

Requirements

10–15 years of experience in technology operations, site reliability engineering, or platform engineering within large-scale digital or eCommerce environments
Proven track record owning platform reliability, availability, and operational performance for consumer-facing systems
Strong experience with cloud infrastructure, incident management, observability, and operational readiness in high-traffic, peak-driven environments
Demonstrated ability to embed SRE practices (SLOs, SLIs, incident response, automation) across engineering teams
Experienced leader of global operations or SRE teams, comfortable working in on-call and 24/7 operational models
Calm, decisive leader with a strong focus on stability, resilience, and continuous operational improvement

Job Responsibility

Leadership: Responsible for all aspects of the performance management and professional development of the team, including recruitment, development plans, providing constructive feedback, appraisals and exit processes
Foster a positive and inclusive team culture by actively engaging team members, promoting open communication, and implementing initiatives that enhance employee satisfaction and well-being
Compliance with and implementation of legal and operational requirements regarding occupational health and safety within your own area of responsibility
Global Site Reliability & Operations Strategy: Define and execute a global Site Reliability Engineering (SRE) and Technology Operations strategy aligned with PUMA’s D2C growth, peak trading demands, and omnichannel ambitions
Establish reliability, availability, performance, and scalability targets across all D2C platforms (eCommerce, in-store integrations, APIs, data platforms)
Own the end-to-end operational health of consumer-facing and business-critical platforms
Platform Reliability, Resilience & Performance: Drive a reliability-first mindset across engineering, embedding SRE principles such as SLIs, SLOs, SLAs, error budgets, and resilience-by-design
Ensure platforms are engineered to handle peak events (campaigns, drops, seasonal peaks) with minimal risk and rapid recovery
Lead incident management, major incident response, root cause analysis, and post-incident reviews with a strong focus on learning and prevention
Continuously improve platform observability, monitoring, alerting, and performance management

Full-time

Platform Engineer DevOps

We are looking for an experienced Platform Engineer DevOps to ensure that the fo...

Location

France, Paris

Salary:

Not provided

cozycozy

Expiration Date

Until further notice

Requirements

5+ years of hands-on experience in Platform Engineering, Infrastructure or DevOps
Expertise in operating and scaling Kubernetes and Docker in production environments
Proven experience managing hybrid cloud / on-premises infrastructure for high-traffic applications
A strong background in designing and implementing robust CI/CD pipelines (GitLab CI, Jenkins, etc.)
Experience with Infrastructure as Code (Terraform, Ansible, etc.)
Experience with monitoring, alerting, and reliability practices (SRE principles)
The mindset to mentor and guide other engineers, fostering a culture of automation and operational excellence
Excellent communication skills in English
The demonstrated ability to drive complex projects

Job Responsibility

Implement, maintain and secure infrastructure (cloud, bare-metal, Kubernetes clusters)
Automate environment configuration using Infrastructure as Code (e.g., Terraform, Ansible) and adhere to GitOps principles
Implement full-stack observability (metrics, logs, traces), sophisticated alerting, and participate in the incident management lifecycle
Ensure compliance with Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all managed services
Implement and manage secrets management systems
Contribute to the design and evolution of hybrid infrastructure
Define, lead, and maintain engineering standards for security, reliability, and technology selection across the organization, supporting the Head of Engineering in defining the platform roadmap
Drive continuous improvement initiatives for cloud cost optimization, scalability, performance, and platform security posture
Maintain comprehensive, up-to-date documentation and best practices to foster self-service and cross-team enablement
Design, implement, and maintain CI/CD pipelines (using GitLab CI, GitHub, and/or Jenkins) tailored for microservice architectures built with Node.js

What we offer

Competitive salary
Stock options
Alan health insurance
Swile card
Unlimited coffee, tea, snacks, and drinks in the office
