Lead SRE Job at Zeektek (St Louis)

Job Description

We have a 6-month contract-to-hire opening for a senior, hands-on Site Reliability Engineer who blends deep AWS and Kubernetes production experience with strong leadership in reliability strategy, incident response, and observability. The ideal candidate brings expert-level skills in modern monitoring platforms (especially Dynatrace), CI/CD, and infrastructure-as-code, and can partner with application teams to drive SLOs, reduce downtime, and scale highly reliable systems in a regulated enterprise environment. This position is 100% remote.

We are forming new teams focused on the Adobe stack to enhance the scalability of the Adobe platform. This initiative aims to align with a unified technology strategy that supports evolving business needs.

The Lead SRE uses advanced experience to lead more complex projects end to end, focused on managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes, and continuous delivery (CI/CD) tools, processes, and designs. The role leads the development and delivery of complex services that automate monitoring activities and provide critical information to facilitate response to and resolution of performance and availability issues and incidents. It also leads the delivery of standardized and scalable software tools to ensure that systems operate without interruption at optimum performance, and leads project teams throughout the deployment process. The Lead SRE troubleshoots and analyzes service disruptions to determine the root cause of issues and develops solutions for improved reliability.

Job Responsibility

Lead SRE to drive reliability, scalability, observability (monitoring & alerts) and performance across the production platforms
Own the SLO/SLI strategy, modernize observability and incident response, and partner with application teams to deliver resilient systems
Define and govern SLOs/SLIs/error budgets for critical services; enforce guardrails and drive reliability roadmaps
Lead performance tuning collaboration with application teams to ensure high availability and low latency
Define and own infrastructure tuning to ensure scalability leading to high availability
Lead metrics- and automation-driven reliability
Debug systems across layers
Architect and evolve CI/CD and infrastructure-as-code (IaC, Terraform)
Design and build serverless APIs (Lambda, API Gateway, SQS, SNS, DynamoDB, etc.)
Build scalable Kubernetes/container platforms, service meshes, and developer self-service workflows
Mature observability (metrics, logs, traces, RUM, synthetic checks) and AIOps/alert hygiene to reduce noise and MTTR
Produce actionable dashboards at team and exec levels
Lead incident management (on-call rotations, triage, comms, postmortems)
Partner with Security to embed shift-left practices, secure defaults, and policy-as-code (RBAC, secrets)
Ensure compliance with SOC2 / HIPAA / PCI (as applicable) in production operations
Mentor partner teams; establish runbooks, standards, and golden paths
Influence architecture decisions, participate in design reviews, and evangelize reliability best practices
Optimize cloud spend via right-sizing, autoscaling, workload placement, and utilization insights
Lead the team in identifying problems with systems and services, and drive regular deployment of new versions of the systems and their subcomponents
Lead projects from end-to-end that are focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
Drives decisions around periodic system validation and testing, service monitoring, and standing up new services/tools
Uses advanced knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization
Leads post incident reviews and documents findings for future informed decision making
Drives implementation of approved proposals to optimize Software Development Life Cycle (SDLC) to boost service reliability
Leads functional and development teams to investigate and document issues and leads internal team to develop solutions to mitigate them
Leads root cause and problem solving initiatives
Understand and adopt new technologies, tools, methods, and processes from Microsoft and the industry
Coaches and mentors team
Designs and implements key performance indicators
Contributes to engineering and organization success by welcoming related, different, and new requests; helping others accomplish job results
Trains the engineering team on new systems, protocols, and best practices
Drive and coach others through reviews of design, code, and test cases

Requirements

Bachelor's degree
AWS Certified DevOps Engineer – Professional
Dynatrace Professional
One SaaS tool certification (Prometheus Certified Associate (PCA), Datadog, or New Relic)
7+ years in SRE/Production Engineering/Platform roles
2+ years leading initiatives or teams
Strong in Linux, networking fundamentals (HTTP, TLS, DNS, TCP), and distributed systems concepts
Proficiency with Go, Python, Shell Scripting, SQL, Java or JVM, JavaScript/TypeScript, YAML/HCL/JSON
Hands-on with IaC (Terraform) and CI/CD (GitLab CI, GitHub Actions, AWS/Azure DevOps)
Deep experience in AWS Cloud infrastructure
Deep experience operating Kubernetes on AWS (or equivalent orchestration) and AWS Lambda in production
Deep expertise in the monitoring and observability stack (e.g., Dynatrace, Prometheus/Grafana, OpenTelemetry, ELK, Datadog, New Relic)
Demonstrated leadership in incident response, postmortems, and reliability governance (SLOs/error budgets)

Nice to have

Healthcare Experience
AWS Certified Solutions Architect – Professional
Dynatrace Master
Azure DevOps Engineer Expert
Certified Kubernetes Administrator (CKA)
Splunk Core Certified Power User / Admin
Experience with multi-cloud or hybrid environments (Azure, AWS)
Experience with API gateways and edge/CDN (CloudFront/Akamai/Azure Front Door)
Message streaming and storage: Kafka, AWS EDA
Security automation: Vault, SOPS, supply chain security (SLSA, Sigstore)
Performance engineering (profiling, p99 latency, load testing: k6)
Healthcare industry experience and experience in regulated environments (e.g., SOX, HIPAA, PCI)

What we offer

Weekly Direct Deposit
401K Matching
Competitive medical, dental and vision insurance
Consistent communication throughout your project
ZeekTek Referral Program
