Lead SRE

United States, St Louis · Job Posted January 29, 2026

Job Description

We have a 6-month contract-to-hire opening for a senior, hands-on Site Reliability Engineer who blends deep AWS and Kubernetes production experience with strong leadership in reliability strategy, incident response, and observability. The ideal candidate brings expert-level skills in modern monitoring platforms (especially Dynatrace), CI/CD, and infrastructure-as-code, and can partner with application teams to drive SLOs, reduce downtime, and scale highly reliable systems in a regulated enterprise environment.

100% remote. We are forming new teams focused on the Adobe stack to enhance the scalability of the Adobe platform. This initiative aims to align with a unified technology strategy that supports evolving business needs.

The Lead SRE uses advanced experience to lead more complex projects end to end, focused on managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes, and continuous delivery (CI/CD) tools, processes, and designs. The role leads the development and delivery of complex services that automate monitoring activities and provide critical information to facilitate response to and resolution of performance and availability issues and incidents. It also leads the delivery of standardized and scalable software tools to ensure that systems operate without interruption at optimum performance, and leads project teams throughout the deployment process. Finally, the Lead SRE troubleshoots and analyzes service disruptions to determine the root cause of issues and develops solutions for improved reliability.

Job Responsibility

  • Drive reliability, scalability, observability (monitoring and alerting), and performance across the production platforms as Lead SRE
  • Own the SLO/SLI strategy, modernize observability and incident response, and partner with application teams to deliver resilient systems
  • Define and govern SLOs/SLIs/error budgets for critical services; enforce guardrails and drive reliability roadmaps
  • Lead performance tuning collaboration with application teams to ensure high availability and low latency
  • Define and own infrastructure tuning to ensure scalability leading to high availability
  • Lead metrics- and automation-driven reliability
  • Debug systems across layers
  • Architect and evolve CI/CD and infrastructure-as-code (IaC, Terraform)
  • Design and build serverless APIs (Lambda, API Gateway, SQS, SNS, DynamoDB, etc.)
  • Build scalable Kubernetes/container platforms, service meshes, and developer self-service workflows
  • Mature observability (metrics, logs, traces, RUM, synthetic checks) and AIOps/alert hygiene to reduce noise and MTTR
  • Produce actionable dashboards at team and exec levels
  • Lead incident management (on-call rotations, triage, comms, postmortems)
  • Partner with Security to embed shift-left practices, secure defaults, and policy-as-code (RBAC, secrets)
  • Ensure compliance with SOC2 / HIPAA / PCI (as applicable) in production operations
  • Mentor partner teams; establish runbooks, standards, and golden paths
  • Influence architecture decisions, participate in design reviews, and evangelize reliability best practices
  • Optimize cloud spend via right sizing, autoscaling, workload placement, and utilization insights
  • Lead the team in identifying problems with systems and services, and drive regular deployment of new versions of the systems and their subcomponents
  • Lead projects from end-to-end that are focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
  • Drives decisions around periodic system validation and testing, service monitoring, and standing up new services/tools
  • Uses advanced knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization
  • Leads post incident reviews and documents findings for future informed decision making
  • Drives implementation of approved proposals to optimize Software Development Life Cycle (SDLC) to boost service reliability
  • Leads functional and development teams to investigate and document issues and leads internal team to develop solutions to mitigate them
  • Leads root cause and problem solving initiatives
  • Understand and adapt new technologies, tools, methods, and processes from Microsoft and industry
  • Coaches and mentors team
  • Designs and implements key performance indicators
  • Contributes to engineering and organization success by welcoming related, different, and new requests and helping others accomplish job results
  • Trains the engineering team on new systems, protocols, and best practices
  • Drive and coach others through reviews of design, code, and test cases

Requirements

  • Bachelor's degree
  • AWS Certified DevOps Engineer – Professional
  • Dynatrace Professional
  • One SaaS tool certification (Prometheus Certified Associate (PCA), Datadog, or New Relic)
  • 7+ years in SRE/Production Engineering/Platform roles
  • 2+ years leading initiatives or teams
  • Strong in Linux, networking fundamentals (HTTP, TLS, DNS, TCP), and distributed systems concepts
  • Proficiency with Go, Python, Shell Scripting, SQL, Java or JVM, JavaScript/TypeScript, YAML/HCL/JSON
  • Hands-on with IaC (Terraform) and CI/CD (GitLab CI, GitHub Actions, AWS/Azure DevOps)
  • Deep experience in AWS Cloud infrastructure
  • Deep experience operating Kubernetes on AWS (or equivalent orchestration) and AWS Lambda in production
  • Deep expertise in the monitoring and observability stack (e.g., Dynatrace, Prometheus/Grafana, OpenTelemetry, ELK, Datadog, New Relic)
  • Demonstrated leadership in incident response, postmortems, and reliability governance (SLOs/error budgets)

Nice to have

  • Healthcare Experience
  • AWS Certified Solutions Architect – Professional
  • Dynatrace Master
  • Azure DevOps Engineer Expert
  • Certified Kubernetes Administrator (CKA)
  • Splunk Core Certified Power User / Admin
  • Experience with multi cloud or hybrid: Azure, AWS
  • Experience with API gateways, and edge/CDN (CloudFront/Akamai/Azure Front Door)
  • Message streaming and storage: Kafka, AWS EDA
  • Security automation: Vault, SOPS, supply chain security (SLSA, Sigstore)
  • Performance engineering (profiling, p99 latency, load testing: k6)
  • Experience in regulated environments (e.g., SOX, HIPAA, PCI)

What we offer

  • Weekly Direct Deposit
  • 401K Matching
  • Competitive medical, dental and vision insurance
  • Consistent communication throughout your project
  • ZeekTek Referral Program

Similar Jobs for Lead SRE

8 matching positions

Lead SRE

About the Role Deliver cloud-native solutions and patterns that are highly elast...
Location: United States of America, Fort Lauderdale
Salary: Not provided
Company: Beacon Hill (bhsg.com)
Expiration Date: Until further notice

Requirements

  • Deliver cloud-native solutions and patterns that are highly elastic
  • Empower stakeholders and reduce toil through self-service pipelines
  • Mentor your team in solving deep technical issues, advanced cloud infrastructure topics, and complex coding problems
  • Set an example of methodical, systematic task execution for your team
  • Work with project managers and stakeholders to provide status and reporting
  • Act as an ambassador to other teams, finding common ground and defining clear agreements
  • Drive projects to schedule
  • Perform code reviews with an eye toward rigor and best practice
  • Apply continuous process improvement techniques across the operation
  • Automate everything

Job Responsibility

  • Deliver cloud-native solutions and patterns that are highly elastic
  • Empower stakeholders and reduce toil through self-service pipelines
  • Mentor your team in solving deep technical issues, advanced cloud infrastructure topics, and complex coding problems
  • Set an example of methodical, systematic task execution for your team
  • Work with project managers and stakeholders to provide status and reporting
  • Act as an ambassador to other teams, finding common ground and defining clear agreements
  • Drive projects to schedule
  • Perform code reviews with an eye toward rigor and best practice
  • Apply continuous process improvement techniques across the operation
  • Automate everything

Employment type: Full-time

Lead SRE

We are looking for a Lead SRE to join our Inetum Team and be part of a work cult...
Location: Portugal, Lisbon
Salary: Not provided
Company: Inetum (https://www.inetum.com)
Expiration Date: Until further notice

Requirements

  • SRE IT production processes
  • Agile / DevOps mindset; problem solving
  • Scripting: Python, YAML, Shell
  • Monitoring: Dynatrace, Nagios
  • Linux
  • Admin Network (DNS, Firewall, Switch)
  • DevOps stack: Git & Git Flow, Artifactory, Jenkins or Gitlab CI, Ansible Tower, Digital ai Release
  • Cloud: Kubernetes, Docker, Argo CD, Vault, Helm
  • End-to-end IT organization and processes (from development to run / operate)
  • Technical Architecture

Job Responsibility

  • Train SREs and their managers on SRE practices
  • Co-construct the transformation strategy and the support plan by participating in workshops, brainstorming with the transformation team and producing training content
  • Coach and support

Employment type: Full-time

Credit Risk Support Lead- SRE

Join Barclays as a Credit Risk Support Lead- SRE role, where to effectively moni...
Location: India, Pune
Salary: Not provided
Company: Barclays (barclays.co.uk)
Expiration Date: Until further notice

Requirements

  • 14+ years’ experience in production support
  • High energy, hands-on and results & goal-oriented
  • Expertise in log debugging, root cause analysis and troubleshooting live issues
  • Experience with observability tools such as ESaaS, AppD/ITRS, and Netcool
  • Experience in data analysis to identify underlying themes impacting stability, performance, and customer experience
  • Ensures and promotes ITIL best practices for Incident, Problem, Change, and Release management (including managing and running triages, conducting root cause analysis, post-incident reviews, etc.)
  • Strong Credit Risk business knowledge
  • Negotiate SLAs/OLAs with customer and other support elements
  • Business (IT) Continuity Management
  • KPI reporting and monitoring

Job Responsibility

  • Provision of technical support for the service management function to resolve more complex issues for a specific client or group of clients. Develop the support model and service offering to improve the service to customers and stakeholders.
  • Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics to identify, prevent and address potential issues and ensure optimal performance.
  • Maintenance of a knowledge base containing detailed documentation of resolved cases for future reference, self-service opportunities and knowledge sharing.
  • Analysis of system logs, error messages and user reports to identify the root causes of hardware, software and network issues, and providing a resolution to these issues by fixing or replacing faulty hardware components, reinstalling software, or applying configuration changes.
  • Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management.
  • Identification and remediation or raising, through appropriate process, of potential service impacting risks and issues.
  • Proactively assess support activities implementing automations where appropriate to maintain stability and drive efficiency. Actively tune monitoring tools, thresholds, and alerting to ensure issues are known when they occur.

What we offer

  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution

Employment type: Full-time

Credit Risk Support Lead- SRE

Embark on a transformative journey as a Credit Risk Support Lead-SRE. At Barclay...
Location: United States, Whippany
Salary: 150000.00 - 215000.00 USD / Year
Company: Barclays (barclays.co.uk)
Expiration Date: Until further notice

Requirements

  • Good domain knowledge with end-to-end responsibility of IT services, including day-to-day operations, incidents and changes
  • Robust understanding of regulatory compliance, risk frameworks, audit, and metric monitoring of service health and control effectiveness
  • Overseeing support teams, effective delegation, and communication with business users and senior stakeholders
  • Ability to prioritize issues, refine support procedures, and drive continuous improvement across RTB and support processes
  • Solid understanding of the software development lifecycle and how application support integrates to enhance delivery, stability, and reliability

Job Responsibility

  • Provision of technical support for the service management function to resolve more complex issues for a specific client or group of clients
  • Develop the support model and service offering to improve the service to customers and stakeholders
  • Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics to identify, prevent and address potential issues and ensure optimal performance
  • Maintenance of a knowledge base containing detailed documentation of resolved cases for future reference, self-service opportunities and knowledge sharing
  • Analysis of system logs, error messages and user reports to identify the root causes of hardware, software and network issues, and providing a resolution to these issues
  • Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management
  • Identification and remediation or raising, through appropriate process, of potential service impacting risks and issues
  • Proactively assess support activities implementing automations where appropriate to maintain stability and drive efficiency
  • Actively tune monitoring tools, thresholds, and alerting to ensure issues are known when they occur

What we offer

  • Medical coverage
  • Dental coverage
  • Vision coverage
  • 401(k)
  • Life insurance
  • Paid leave
  • Incentive award
  • Competitive holiday allowance
  • Life assurance
  • Private medical care

Employment type: Full-time

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA (nttdata.com)
Expiration Date: Until further notice

Requirements

  • 8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills

Job Responsibility

  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Employment type: Full-time

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...
Location: Mexico, Miguel Hidalgo
Salary: Not provided
Company: PepsiCo (pepsico.com)
Expiration Date: Until further notice

Requirements

  • 8+ years of work experience evolving into an SRE role
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing event diagnostics, performance measures, and alert solutions to meet SLAs/SLOs/SLIs
  • Highly quantitative, with great judgment and the ability to connect dots across ecosystems and work efficiently across teams
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)

Job Responsibility

  • Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design and project roadmap to enable resilient outcomes
  • Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration to connect all components of the ecosystem, diagnose anomalies before users are affected, and remediate through automation
  • Ensure ecosystem availability and performance in production environments, proactively preventing P1s, P2s, and potential P3s
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
  • Accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early in the product’s offerings as part of the engineering solution
  • Lead the team in diagnosing anomalies before any user is affected and drive the necessary remediations across the teams involved in the end-to-end availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e., a technical understanding of the interactions of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation

What we offer

  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Lead Mainframe SRE

Location: Greece, Athens
Salary: Not provided
Company: NTT DATA (nttdata.com)
Expiration Date: Until further notice

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, Information Systems, or related field (or equivalent experience)
  • 6-8 years of experience in relevant roles
  • Extensive experience managing enterprise middleware and batch-processing platforms within large-scale production environments
  • Strong expertise in re-hosted middleware services and production-control ecosystems, including technologies such as MQ, CICS/TX, Stonebranch UAC, scheduling platforms, and cross-platform operational tooling
  • Strong experience supporting Linux-hosted middleware environments, DB2 LUW interfaces, queue and log monitoring, and disaster recovery operations
  • Strong understanding of operational standards, root cause analysis, change-window governance, service continuity, and platform resilience practices
  • Experience leading modernization, migration, or decommissioning initiatives within critical enterprise platforms
  • Proven ability to lead small technical teams and coordinate multiple stakeholders, vendors, and operational groups
  • Strong stakeholder-management and communication skills, with the ability to gather requirements and align technical solutions with business needs
  • Comfortable operating within highly critical, governance-driven, and cross-functional enterprise environments

Job Responsibility

  • Own the end-to-end technical service for re-platformed middleware and production-control platforms across production, non-production, and disaster recovery environments
  • Lead and mentor a small team of engineers, providing technical leadership, operational guidance, and coordination of day-to-day platform activities
  • Act as the primary technical point of contact for stakeholders, client representatives, and operational teams, gathering requirements and translating business needs into reliable technical solutions
  • Lead Stonebranch UAC and batch operations activities, including scheduling standards, post-batch support, bundle promotion governance, and resolution of cross-domain batch-processing issues
  • Drive service improvements, platform upgrades, cluster and scheduler health initiatives, and operational standards across highly critical enterprise services
  • Act as the command point for high-severity incidents, out-of-hours change windows, root cause analysis activities, and cross-team operational coordination
  • Coordinate closely with IBM, Stonebranch, infrastructure teams, and application owners to ensure operational continuity and service resilience
  • Support disaster recovery readiness, service recovery testing, and operational governance activities across the platform landscape
  • Guide the controlled migration, modernization, or decommissioning of legacy batch-processing services while protecting production stability
  • Define and maintain standards for availability, operational excellence, monitoring, service governance, and platform reliability

What we offer

  • Health insurance for the employee and one dependent family member (100% paid by NTT DATA)
  • Meal vouchers of 120€ per month (x12)
  • Corporate mobile phone: subscription & device
  • Teleworking equipment allowance
  • Internal Trainings Platform Account
  • Access to Open Up mental health service
  • 28 days of paid annual leave consisting of your legal holidays and compensation days

SRE Team Lead (FedRAMP / Security)

Coralogix is a modern, full-stack observability platform transforming how busine...
Location: United States, Los Angeles
Salary: 230000.00 - 270000.00 USD / Year
Company: Coralogix (coralogix.com)
Expiration Date: Until further notice

Requirements

  • 2+ years of experience as a Team Lead / Tech Lead
  • At least 5 years of experience as a DevOps Engineer/ SRE in production environments
  • At least 2 years of experience with FedRAMP compliance (High/Moderate levels), vulnerability management, and continuous monitoring, including scanning, patching, and reporting (an advantage)
  • In-depth experience with Kubernetes - operating & monitoring are key parts
  • High familiarity with monitoring tools such as Coralogix, Grafana, Prometheus
  • Experience in AWS or other cloud providers
  • Experience with infrastructure as code (Terraform, Crossplane, etc.)
  • Understanding of networking, from networking layers to different networking protocols (HTTP, gRPC, SSL)
  • Some software engineering experience, preferably in Golang
  • An advantage - operating data pipelines

Job Responsibility

  • Lead and mentor a team of engineers, including hiring, onboarding, and performance management
  • Work in high scale environments - Coralogix data pipeline processes 55Tb of data each day
  • Adopt cutting edge technologies with end-to-end responsibility
  • Building internal tools to expand our platform capabilities
  • Collaborate with R&D to improve stability & reliability of the system
  • Lead the product roadmap - our product is designed for engineers. Therefore, our engineers promote, enhance, and take a crucial part in influencing the product roadmap
  • Perform operational duties for FedRAMP cloud products, including deployments, on-call support, and incident management

What we offer

  • comprehensive and inclusive employee benefits for healthcare, dental, and mental health benefits
  • 401(k) plan and match
  • paid sick time and paid time off

Employment type: Full-time