Lead Infrastructure Engineer (SRE)

United States, Charlotte · Employment contract · 119000.00 - 224000.00 USD / Year · Job Posted May 27, 2026

Job offer has expired

Job Responsibility

  • Drive and lead Site Reliability Engineering capabilities at Wells Fargo Banking Operations, igniting the practice, principles, and culture and leading by example. Mentor and coach engineers while scaling the SRE practice within Banking Operations and partnering with peer platform-embedded SRE teams
  • Leverage enterprise capabilities, tools, and innovation to improve availability in a complex ecosystem by maturing observability practices including monitoring, logging, distributed tracing, synthetic monitoring, and chaos engineering with a focus on actionable insights and proactive detection
  • Lead the evolution of our environment by introducing self-healing and autonomic capabilities and solving complex operational and systemic issues with precision, including building and training models, automating cognitive processes, and leveraging telemetry to improve the availability and reliability of the products we provide to customers
  • Own and automate key SRE metrics and IT Service Operations processes including customer impact, golden signals and critical user journeys, % availability of critical business flows, SLO/SLI definition and adherence, error budget management, and real-time observability dashboards
  • Automate incident response processes through data integration with unified communications and alerting/notification systems
  • Provide leadership in support responsibilities for critical applications and customer journeys onboarded to SRE, including rapid remediation of issues through Agile practices, conducting blameless postmortems, driving root cause analysis, and implementing durable solutions through continuous improvement with the goal of eliminating repeat incidents

Requirements

  • 5+ years of Technology Infrastructure Engineering and Solutions experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience using Observability Tools with hands-on implementation of monitoring, logging, or tracing solutions utilizing Grafana, ThousandEyes, Prometheus, AppDynamics, or Splunk
  • 3+ years of application production support experience in complex, high-availability environments
  • 2+ years of experience with Confluence or Jira

Nice to have

  • Experience with Site Reliability Engineering (SRE), including SLO/SLI frameworks, error budgets, toil reduction, and production reliability engineering practices
  • Experience with database logging and monitoring concepts
  • Experience with Application performance monitoring and optimization using BlazeMeter, JMeter, Splunk, AppDynamics, or similar observability platforms
  • Experience with scripting or programming languages such as Bash, PowerShell, Python, Shell, VBScript, or JavaScript for automation and reliability engineering use cases
  • Experience and understanding of AIOps and related tools such as MoogSoft or Big Panda, including event correlation and noise reduction
  • Experience with one or more automation tools such as Ansible or similar infrastructure-as-code/configuration management tools
  • Experience with Container technologies: Kubernetes, Docker, PKS, with focus on observability and reliability patterns in distributed systems

What we offer

  • Health benefits
  • 401(k) Plan
  • Paid time off
  • Disability benefits
  • Life insurance, critical illness insurance, and accident insurance
  • Parental leave
  • Critical caregiving leave
  • Discounts and savings
  • Commuter benefits
  • Tuition reimbursement
  • Scholarships for dependent children
  • Adoption reimbursement

Similar Jobs for Lead Infrastructure Engineer (SRE)

8 matching positions

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location: Canada; United States
Salary: 195000.00 - 285000.00 USD / Year
Company: Apollo.io
Expiration Date: Until further notice
Requirements
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, New Relic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
  • Full-time

Senior Software Engineer, Infrastructure

You’ll help shape the future of infrastructure automation for law enforcement sy...
Location: United States, Seattle; Boston
Salary: 141000.00 - 225600.00 USD / Year
Company: Axon
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
  • 8+ years of professional software development experience
  • Strong background building cloud-native, distributed solutions
  • Experience designing tooling and automation to simplify the operational management of SaaS/PaaS systems
  • Proficiency in backend services with multiple managed languages (e.g., Java, Scala, Go, C#, or similar)
  • Expertise with Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation) and building modular, reusable, testable components
  • Familiarity with Kubernetes platforms (e.g., AKS, EKS, or similar)
  • Hands-on experience with CI/CD platforms for automating infrastructure, builds, testing, and releases
  • Strong collaboration and communication skills, with empathy for the needs of engineering teams
Job Responsibility
  • Lead engineering architecture design reviews
  • Set a high technical bar for the team through code and architecture design reviews
  • Mentor engineers
  • Work across teams with Product, Design, and Engineering to create integrated solutions that delight our customers
  • Improve our Engineering process, including long-term thinking, sprint planning, and stand-ups
  • Build services that adhere to our high bar on availability and latency in this mission-critical space
  • Work with the latest open source technologies
What we offer
  • Competitive salary and 401k with employer match
  • Discretionary paid time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Mental Wellness support
  • Learning & Development programs
  • Snacks in our offices
  • Full-time

Senior Infrastructure Engineer - Postgres

ClickHouse is expanding its cloud data platform across AWS, GCP, and Azure—addin...
Location: United States
Salary: 140000.00 - 208000.00 USD / Year
Company: ClickHouse
Expiration Date: Until further notice
Requirements
  • 7+ years in SRE, DevOps, or infrastructure engineering, with a track record of running distributed, production-grade systems
  • Solid understanding of Postgres operations, scaling, and performance tuning
  • Deep hands-on experience across AWS, with exposure to GCP and Azure; comfortable navigating multi-cloud topologies
  • Proficient with Terraform, Kubernetes, and container-based infrastructure
  • Strong Go development skills (or willingness to write and own production Go code)
  • Familiar with tools like Prometheus, Grafana, Loki, OpenTelemetry, or equivalents
  • Deep understanding of SLOs, incident response, and continuous improvement in service reliability
  • You operate with a founder’s mentality — hands-on, resourceful, and willing to dive deep to get things done. You take pride in hard work, autonomy, and shipping impactful systems
Job Responsibility
  • Lead reliability and operations for ClickHouse’s Postgres integration — upgrades, patching, maintenance, and scaling
  • Design and implement automation for provisioning, deployments, and service lifecycle management across AWS, GCP, and Azure
  • Develop infrastructure-as-code using Terraform and modern CI/CD tooling to ensure consistent, repeatable deployments
  • Contribute Go-based tooling and services that improve automation, observability, and developer experience
  • Own observability and monitoring, ensuring robust alerting, metrics, and tracing across environments
  • Drive incident management and postmortem practices that strengthen reliability and learning loops
  • Collaborate cross-functionally with platform, networking, and product teams to improve service operability
  • Mentor and enable engineers, helping the team scale effectively as customer adoption grows
What we offer
  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 Home office setup if you’re a remote employee
  • Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites
  • Full-time

Senior Infrastructure Engineer - Postgres

ClickHouse is expanding its cloud data platform across AWS, GCP, and Azure—addin...
Location: India
Salary: Not provided
Company: ClickHouse
Expiration Date: Until further notice
Requirements
  • 7+ years in SRE, DevOps, or infrastructure engineering, with a track record of running distributed, production-grade systems
  • Solid understanding of Postgres operations, scaling, and performance tuning
  • Deep hands-on experience across AWS, with exposure to GCP and Azure; comfortable navigating multi-cloud topologies
  • Proficient with Terraform, Kubernetes, and container-based infrastructure
  • Strong Go development skills (or willingness to write and own production Go code)
  • Familiar with tools like Prometheus, Grafana, Loki, OpenTelemetry, or equivalents
  • Deep understanding of SLOs, incident response, and continuous improvement in service reliability
  • You operate with a founder’s mentality — hands-on, resourceful, and willing to dive deep to get things done. You take pride in hard work, autonomy, and shipping impactful systems
Job Responsibility
  • Lead reliability and operations for ClickHouse’s Postgres integration — upgrades, patching, maintenance, and scaling
  • Design and implement automation for provisioning, deployments, and service lifecycle management across AWS, GCP, and Azure
  • Develop infrastructure-as-code using Terraform and modern CI/CD tooling to ensure consistent, repeatable deployments
  • Contribute Go-based tooling and services that improve automation, observability, and developer experience
  • Own observability and monitoring, ensuring robust alerting, metrics, and tracing across environments
  • Drive incident management and postmortem practices that strengthen reliability and learning loops
  • Collaborate cross-functionally with platform, networking, and product teams to improve service operability
  • Mentor and enable engineers, helping the team scale effectively as customer adoption grows
What we offer
  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 Home office setup if you’re a remote employee
  • Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites

Engineering Lead Analyst

Engineering Lead Analyst position in Citi's Cloud Technology Services (CTS) team...
Location: India, Pune
Salary: Not provided
Company: Citi
Expiration Date: Until further notice
Requirements
  • 12+ years of relevant experience in an Engineering role
  • Deep understanding of public cloud services adoption at scale
  • Expert-level understanding of AWS/GCP Cloud Network across Internet Application Hosting, B2B Connectivity, and Application Resiliency
  • Hands-on Infrastructure as Code (IaC) expertise with Python and Go
  • CI/CD experience with Terraform, Harness, Tekton, Jenkins, etc.
  • Test automation experience with Terratest, Cucumber, pytest-bdd, AWS Fault Injection Simulator (FIS), Chaos Mesh, etc.
  • Familiarity with Agile Development, DevOps, and SRE practices
  • Demonstrated ability to quickly learn new technologies and adapt to changing project requirements
  • Experience evaluating complex requirements and rationalizing them into a consistent service offering
  • Excellent communication skills
Job Responsibility
  • Technical Expertise: hands-on technical contribution within product team focused on public cloud network
  • Collaborative Development: contribute to team of cloud engineers and full-stack software developers
  • Automation: Identify and develop automation initiatives to improve processes related to public cloud services consumption
  • Cross-Functional Partnership: collaborate with teams across Citi's technology landscape
  • Engineering Excellence: contribute to defining and measuring success criteria for service availability and reliability
  • Compliance Advocacy: ensure adherence to relevant standards, policies, and regulations
  • Serve as technology subject matter expert for internal and external stakeholders
  • Provide direction for firm mandated controls and compliance initiatives
  • Define necessary system enhancements to deploy new products and process enhancements
  • Recommend product customization for system integration
What we offer
  • Career growth opportunities
  • Opportunity to give back to community
  • Make real impact
  • Global team environment
  • Well-being support
  • Work-life balance programs
  • Full-time

Intermediate Software Engineer SRE – AI

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more
  • Full-time

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location: India, Bangalore
Salary: Not provided
Company: Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in systems engineering
  • At least 5 years in SRE or DevOps roles
  • Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • Proficiency in programming and scripting languages like Python, Go, and Bash
  • Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • Deep understanding of networking, DNS, load balancing, and security principles
  • Proven track record of managing high-availability systems in demanding environments
  • Exceptional analytical and problem-solving skills
Job Responsibility
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • Mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • Guide architectural decisions that drive innovation and enhance system reliability
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • A collaborative and innovative work environment that values your expertise and contributions
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Senior Infrastructure Engineer

Coralogix is seeking a Senior Infrastructure Engineer to join our Core SRE team ...
Location: Germany, Berlin
Salary: Not provided
Company: Coralogix
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in DevOps, SRE, platform engineering, or infrastructure roles
  • Deep understanding of Kubernetes: API, CNI, scheduling, container runtimes, and so on
  • Strong hands-on experience with Kafka and Istio (or similar technologies), and core networking protocols (HTTP, gRPC, TLS)
  • Proven experience managing large-scale cloud infrastructure (AWS, GCP, etc.)
  • Experience in incident response and troubleshooting complex distributed systems
  • Some software engineering experience, preferably in Golang
  • Passion for automation, performance tuning, and operational excellence
Job Responsibility
  • Act as a hands-on technical leader with deep expertise in modern cloud infrastructure
  • Serve as a go-to person in the team — leading through influence, not hierarchy
  • Collaborate cross-functionally to refine requirements and propose innovative, scalable solutions
  • Drive long-term, high-impact infrastructure projects across multiple teams, from design to implementation, within defined timelines
  • Contribute to improving system reliability, performance, and cost-efficiency at scale
  • Full-time