Lead SRE

Zeektek

Location:
United States, St. Louis

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We have a 6-month contract-to-hire opening for a senior, hands-on Site Reliability Engineer who blends deep AWS and Kubernetes production experience with strong leadership in reliability strategy, incident response, and observability. The ideal candidate brings expert-level skills in modern monitoring platforms (especially Dynatrace), CI/CD, and infrastructure-as-code, and can partner with application teams to drive SLOs, reduce downtime, and scale highly reliable systems in a regulated enterprise environment. The role is 100% remote.

We are forming new teams focused on the Adobe stack to enhance the scalability of the Adobe platform. This initiative aims to align with a unified technology strategy that supports evolving business needs.

The Lead SRE uses advanced experience to lead more complex projects end to end. These projects focus on managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes, and continuous delivery (CI/CD) tools, processes, and designs. The role leads the development and delivery of complex services that automate monitoring activities and provide critical information to facilitate response to and resolution of performance and availability issues and incidents. It also leads the delivery of standardized and scalable software tools to ensure that systems operate without interruption at optimum performance, and leads project teams throughout the deployment process. The Lead SRE troubleshoots and analyzes service disruptions to determine the root cause of issues and develops solutions for improved reliability.

Job Responsibility:

  • Lead SRE to drive reliability, scalability, observability (monitoring & alerts) and performance across the production platforms
  • Own the SLO/SLI strategy, modernize observability and incident response, and partner with application teams to deliver resilient systems
  • Define and govern SLOs/SLIs/Error Budgets for critical services; enforce guardrails and drive reliability roadmaps
  • Lead performance tuning collaboration with application teams to ensure high availability and low latency
  • Define and own infrastructure tuning to ensure scalability leading to high availability
  • Lead metrics- and automation-driven reliability
  • Debug systems across layers
  • Architect and evolve CI/CD and infrastructure-as-code (IaC with Terraform)
  • Design and build serverless APIs (Lambda, API Gateway, SQS, SNS, DynamoDB, etc.)
  • Build scalable Kubernetes/container platforms, service meshes, and developer self-service workflows
  • Mature observability (metrics, logs, traces, RUM, synthetic checks) and AIOps/alert hygiene to reduce noise and MTTR
  • Produce actionable dashboards at team and exec levels
  • Lead incident management (on-call rotations, triage, comms, postmortems)
  • Partner with Security to embed shift-left practices, secure defaults, and policy-as-code (RBAC, secrets)
  • Ensure compliance with SOC2 / HIPAA / PCI (as applicable) in production operations
  • Mentor partner teams; establish runbooks, standards, and golden paths
  • Influence architecture decisions, participate in design reviews, and evangelize reliability best practices
  • Optimize cloud spend via right-sizing, autoscaling, workload placement, and utilization insights
  • Lead the team to identify problems with systems and services and drive regular deployment of new versions of the systems and their subcomponents
  • Lead projects from end-to-end that are focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
  • Drives decisions around periodic system validation and testing, service monitoring, and standing up new services/tools
  • Uses advanced knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization
  • Leads post incident reviews and documents findings for future informed decision making
  • Drives implementation of approved proposals to optimize Software Development Life Cycle (SDLC) to boost service reliability
  • Leads functional and development teams to investigate and document issues and leads internal team to develop solutions to mitigate them
  • Leads root cause and problem solving initiatives
  • Understand and adapt new technologies, tools, methods, and processes from Microsoft and the industry
  • Coaches and mentors team
  • Designs and implements key performance indicators
  • Contributes to engineering and organization success by welcoming related, different, and new requests and helping others accomplish job results
  • Trains the engineering team on new systems, protocols, and best practices
  • Drive and coach others through reviews of design, code, and test cases

Requirements:

  • Bachelor's degree
  • AWS Certified DevOps Engineer – Professional
  • Dynatrace Professional
  • One SaaS tool certification (Prometheus Certified Associate (PCA), Datadog, or New Relic)
  • 7+ years in SRE/Production Engineering/Platform roles
  • 2+ years leading initiatives or teams
  • Strong in Linux, networking fundamentals (HTTP, TLS, DNS, TCP), and distributed systems concepts
  • Proficiency with Go, Python, Shell Scripting, SQL, Java or JVM, JavaScript/TypeScript, YAML/HCL/JSON
  • Hands-on with IaC (Terraform) and CI/CD (GitLab CI, GitHub Actions, AWS/Azure DevOps)
  • Deep experience in AWS Cloud infrastructure
  • Deep experience operating Kubernetes on AWS (or equivalent orchestration) and AWS Lambda in production
  • Deep expertise in monitoring and observability stacks (e.g., Dynatrace, Prometheus/Grafana, OpenTelemetry, ELK, Datadog, New Relic)
  • Demonstrated leadership in incident response, postmortems, and reliability governance (SLOs/error budgets)

Nice to have:

  • Healthcare Experience
  • AWS Certified Solutions Architect – Professional
  • Dynatrace Master
  • Azure DevOps Engineer Expert
  • Certified Kubernetes Administrator (CKA)
  • Splunk Core Certified Power User / Admin
  • Experience with multi-cloud or hybrid environments (Azure, AWS)
  • Experience with API gateways and edge/CDN (CloudFront/Akamai/Azure Front Door)
  • Message streaming and storage: Kafka, AWS EDA
  • Security automation: Vault, SOPS, supply chain security (SLSA, Sigstore)
  • Performance engineering (profiling, p99 latency, load testing: k6)
  • Healthcare industry experience and experience in regulated environments (e.g., SOX, HIPAA, PCI)

What we offer:
  • Weekly Direct Deposit
  • 401K Matching
  • Competitive medical, dental and vision insurance
  • Consistent communication throughout your project
  • ZeekTek Referral Program

Additional Information:

Job Posted:
January 29, 2026

Employment Type:
Fulltime
Work Type:
Remote work
Similar Jobs for Lead SRE

Lead SRE

We are looking for a Lead SRE to join our Inetum Team and be part of a work cult...
Location:
Portugal, Lisbon
Salary:
Not provided
Inetum
Expiration Date:
Until further notice
Requirements:
  • SRE IT production processes
  • Agile / DevOps Mindset Problem Solving
  • Scripting: Python, YML, Shell
  • Monitoring: Dynatrace, Nagios
  • Linux
  • Admin Network (DNS, Firewall, Switch)
  • DevOps stack: Git & Git Flow, Artifactory, Jenkins or Gitlab CI, Ansible Tower, Digital ai Release
  • Cloud: Kubernetes, Docker, Argo CD, Vault, Helm
  • End-to-end IT organization and processes (from development to run / operate)
  • Technical Architecture
Job Responsibility:
  • Train SREs and their managers on SRE practices
  • Co-construct the transformation strategy and the support plan by participating in workshops, brainstorming with the transformation team and producing training content
  • Coach and support
  • Fulltime

Internal Kubernetes Platform Lead SRE

HSBC is seeking an IKP Support Engineer (SRE) to join the IKP Team within the Hy...
Location:
Poland
Salary:
Not provided
HSBC
Expiration Date:
February 17, 2026
Requirements:
  • Solid technical knowledge and experience with Kubernetes administration
  • 3+ years of hands-on experience with Kubernetes administration
  • Strong knowledge of Kubernetes concepts and operations and troubleshooting tools
  • Understanding of containerization and orchestration
  • Experience with Unix administration
  • Experience with Service Meshes is a plus
  • Understanding of ITIL processes and automation skills
  • Familiarity with infrastructure as code
  • Strong analytical and communication skills
  • Proficiency in English.
Job Responsibility:
  • Ensure the reliability, availability, and performance of the infrastructure platform
  • Collaborate in diagnosing and resolving IKP infrastructure issues
  • Support the deployment, configuration, and maintenance of Kubernetes platform
  • Troubleshoot and resolve incidents, performance issues, and integration failures
  • Perform root cause analysis and implement reliability improvements
  • Provide 24x7 support as part of an on-call rota
  • Plan duties and other administrative tasks for a team in line with the Polish Labor Code.
What we offer:
  • Competitive salary
  • Annual performance-based bonus
  • Additional bonuses for recognition awards
  • Multisport card
  • Private medical care
  • Life insurance
  • One-time reimbursement of home office set-up (up to 800 PLN)
  • Corporate parties & events
  • CSR initiatives
  • Nursery discounts
  • Fulltime

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Location:
Ireland, Dublin
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Solid SRE process experience
  • 5+ years of leading a high-performance, 24x7 DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
  • Experience with Microsoft OS Windows & Server
  • Experience in ticket tracking and resolving on time
  • Hands-on experience on ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation and interpersonal communication skills
  • Ability to make complex technical matters easy-to-comprehend for non-technical persons.
Job Responsibility:
  • Taking end-to-end Ownership of Application Support for Production Systems Issues resolution
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
  • Influencing a culture of Site Reliability Engineering. Engaging in training and mentoring to help develop other engineers with an SRE mindset
  • Providing the first line of after-deployment technical support at L1 and L2 level for applications and/or associated production systems, diagnostics, and network health monitoring
  • Coordinating and/or deploying hands-on fixes, patches, and software updates at the application level and, as appropriate, at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within organization, along with observations from investigative and diagnostic assessments
  • Coordinating the investigation of repeated technical issues affecting user systems and seeing them through to resolution
  • Escalating, resolving, guiding team, and tracking production incidents to closure
What we offer:
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
  • Fulltime

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location:
India, Bangalore
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in systems engineering, with at least 5 years in SRE or DevOps roles
  • expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • proficiency in programming and scripting languages like Python, Go, and Bash
  • advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • deep understanding of networking, DNS, load balancing, and security principles
  • proven track record of managing high-availability systems in demanding environments
  • exceptional analytical and problem-solving skills
Job Responsibility:
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • guide architectural decisions that drive innovation and enhance system reliability
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • a collaborative and innovative work environment that values your expertise and contributions
  • professional growth and leadership development pathways tailored to your aspirations
  • a chance to leave a lasting impact by shaping the future of reliable and scalable systems

Engineering Lead Analyst

The Engineering Lead Analyst is a senior level position responsible for leading ...
Location:
India, Pune
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • 6-10 years of relevant experience in an Engineering role
  • Experience working in Financial Services or a large complex and/or global environment
  • Project Management experience
  • Consistently demonstrates clear and concise written and verbal communication
  • Comprehensive knowledge of design metrics, analytics tools, benchmarking activities and related reporting to identify best practices
  • Demonstrated analytic/diagnostic skills
  • Ability to work in a matrix environment and partner with virtual teams
  • Ability to work independently, multi-task, and take ownership of various parts of a project or initiative
  • Ability to work under pressure and manage to tight deadlines or unexpected changes in expectations or requirements
  • Proven track record of operational process change and improvement
Job Responsibility:
  • Serve as a technology subject matter expert for internal and external stakeholders
  • Provide direction for all firm mandated controls and compliance initiatives
  • Lead projects within the group and create a technology domain roadmap
  • Ensure that all integration of functions meet business goals
  • Define necessary system enhancements to deploy new products and process enhancements
  • Recommend product customization for system integration
  • Identify problem causality, business impact and root causes
  • Exhibit knowledge of how own specialty area contributes to the business
  • Apply knowledge of competitors, products and services
  • Advise or mentor junior team members
  • Fulltime

Director, Service Reliability Engineering

As Director of SRE, you will lead the team responsible for accelerating and auto...
Location:
United States, Bethesda
Salary:
125600.00 - 203700.00 USD / Year
Marriott Bonvoy
Expiration Date:
Until further notice
Requirements:
  • Undergraduate degree in computer science, software engineering, or a related field (or equivalent experience)
  • 10+ years of experience in SRE, devsecops or IT operations
  • At least 5 years’ experience in a previous leadership role within SRE, devsecops or IT Operations
  • At least five years of experience in the following technologies: Presentation Management (HTML, CSS, JS, Backbone, Node JS, Android, iOS); Application Platforms (NGINX, Java, Akana, Play Framework, Tomcat, Docker, Openshift); Application Data (PostgreSQL, Couchbase, Cassandra); Integration Services (Apache Kafka, Apache Spark, Akana); Analytics Platforms (Hadoop, dashDB, Cognos, Tableau); Security (Forgerock, OpenID, OAUTH, Ping Identity); Public Cloud (Azure, Google Cloud, AliCloud, Amazon Web Services); CI/CD (Harness)
  • Experience with test automation
  • Working knowledge and proven track record of implementing disaster indifferent architecture
  • Experience with CDN and Akamai tools
  • Linux/Unix system administration experience
  • Proficient in scripting and programming languages (like Python, Go, Bash, Shell)
  • Hands on experience with infrastructure as code (like Terraform), container orchestration (like Kubernetes), and reliability automation
Job Responsibility:
  • Define and execute Marriott’s SRE vision, aligning with business objectives and technology roadmaps
  • Build, mentor and lead a high-performing SRE team, fostering a culture of collaboration and innovation
  • Establish reliability, observability and automation goals to improve system uptime, performance and scalability
  • Partner with engineering, operations and security teams to drive best practices and continuous improvement
  • Implement reliability-focused engineering practices, including SLAs, SLOs/SLIs and error budgets
  • Design and maintain resilient, scalable and fault-tolerant architectures across cloud and hybrid environments
  • Develop strategies to proactively identify and mitigate risks to system performance and availability
  • Drive root cause analysis (RCA) and post-mortem processes to prevent recurring incidents
  • Champion automation in monitoring, deployment and incident resolution to reduce toil and enhance efficiency
  • Lead and optimize incident response processes, ensuring rapid detection, diagnosis, and resolution of system failures
What we offer:
  • Bonus program
  • comprehensive health care benefits
  • 401(k) plan with up to 5% company match
  • employee stock purchase plan at 15% discount
  • accrued paid time off (including sick leave where applicable)
  • life insurance
  • group disability insurance
  • travel discounts
  • adoption assistance
  • paid parental leave
  • Fulltime

Engineering Manager for Observability/CI/CD and Cloud

Lead the AI-Driven Evolution of Groupon’s Global Engineering Platform. At Groupo...
Location:
Dublin; Madrid; Prague; Valencia; Warsaw
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 5+ years’ experience leading infrastructure, DevOps, or SRE teams (5+ people), ideally in high-change, scale-up environments
  • Deep technical expertise in cloud-native platforms, observability, infrastructure as code, and CI/CD tooling
  • Proven success operationalizing AI tools within engineering workflows
  • Strategic, resilient, and pragmatic approach: ready to own results and thrive under shifting priorities
  • Exceptional communication: able to simplify complexity and effectively partner with C-level and global teams
  • Bachelor’s or Master’s in Computer Science (or similar)—or equivalent industry experience
Job Responsibility:
  • Lead & Inspire: Build and mentor a high-performing, globally distributed team of CI/CD and Observability engineers (5-10 direct reports), coaching them in cutting-edge AI-assisted workflows and best practices
  • Modernize Core Infrastructure: Spearhead the migration from legacy platforms (Jenkins, ELK) to cloud-native solutions (GitHub Actions, Google Cloud Logging, GCP Prometheus/Grafana). Eliminate “straggler” pipelines and drive cost-efficient, reliable operations
  • AI-First Engineering: Operationalize AI tools (Claude Code, Copilot, ChatGPT, etc.) for everything from log analysis and incident summaries to automated infrastructure as code, making AI-augmented engineering a daily norm
  • Architect & Optimize: Oversee a hybrid tech stack (Kubernetes, Envoy, Terraform, GCP, AWS), ensuring platforms are fast, scalable, and “self-healing” via LLM integrations
  • Collaborate Globally: Act as a thought leader and cross-functional partner, advocating for AI-driven developer experience and collaborating with leaders in SRE, Product, and Cloud
  • Drive Transformation: Deliver strategic projects with tight deadlines and direct business impact, such as the Jenkins-to-GHA and ELK-to-GCP migrations, while maintaining a high standard of technical excellence and cost efficiency
What we offer:
  • Drive real, high-visibility change at the heart of a company undergoing major transformation
  • Work on complex technical and operational challenges in a fast-paced, AI-first environment
  • Accelerate your impact—and your team’s—using industry-leading AI and automation tools
  • Influence engineering practices across a global platform impacting millions of users

Senior Site Reliability Manager

RUCKUS Networks is seeking an experienced Site Reliability Engineering (SRE) Man...
Location:
United States, Sunnyvale
Salary:
135600.00 - 200000.00 USD / Year
CommScope
Expiration Date:
Until further notice
Requirements:
  • 12+ years in Site Reliability Engineering (SRE), with 6+ years leading SRE, DevOps, or infrastructure teams
  • Proven experience mentoring engineering managers and developing leadership talent
  • Track record of transforming traditional operations or NOC teams into modern SRE organizations
  • Strong project management skills with Agile/Kanban experience and JIRA proficiency
  • Excellent communication skills, including executive-level presentations
  • Deep SRE expertise: incident management, on-call systems, monitoring, and reliability engineering
  • Infrastructure automation experience with Terraform, Kubernetes, Docker, and CI/CD pipelines
  • Cloud platform proficiency (GCP/AWS), including networking, security, and cost optimization
  • Monitoring and observability experience with Prometheus, Grafana, APM tools, and log aggregation
  • 24/7 operations experience with global team coordination and escalation management
Job Responsibility:
  • Lead and develop engineering managers and technical operations engineers across India and APAC time zones
  • Build a collaborative team culture that emphasizes knowledge sharing, automation, and operational excellence
  • Mentor engineering managers to strengthen leadership capabilities and technical expertise
  • Set clear performance expectations and provide ongoing coaching for growth
  • Partner cross-functionally with Product, Security, Development, and global operations teams
  • Own 24/7 operational stability for India/APAC, including incident response, escalation, and post-incident reviews
  • Drive comprehensive incident management: alert handling, outage response, and root cause analysis (RCA/CAR)
  • Transform traditional operations into modern SRE practices using SLOs, error budgets, and reliability engineering
  • Implement robust monitoring and alerting with APM tools, dashboards, and automation frameworks
  • Lead technical project delivery with clear timelines, resource planning, and stakeholder communication
What we offer:
  • medical, dental, and vision plans
  • life and accidental death insurance
  • a 401(k) plan
  • participation in the Company’s Incentive Plan
  • eleven paid holidays in a full calendar year
  • two weeks of paid vacation (prorated based on start date)
  • other leave options
  • Fulltime