Senior DevOps / Site Reliability Engineer

Job Posted March 19, 2026

Job Description

Our client is a leader in sustainable packaging solutions, leveraging cutting-edge cloud technologies to enhance production, operational excellence, and innovation. Join our team and contribute to eco-friendly advances with state-of-the-art technology and DevOps practices.

Job Responsibility

  • Cloud Infrastructure: Architect, implement, and manage Microsoft Azure resources including App Services, Virtual Machines, Container Instances, AKS, SQL Server/Instance, and Azure SQL
  • DevOps Automation: Design and maintain CI/CD workflows using Git, GitHub Actions, SonarQube Cloud, Terraform, and Docker
  • SRE Practices: Develop and monitor SLOs, SLIs, and golden signals; instrument applications and infrastructure; build Datadog dashboards for real-time business and incident reporting
  • Incident Management: Lead incident response, root cause analysis, and post-mortem documentation. Maintain high availability and rapid recovery for business-critical systems
  • Monitoring & Observability: Use Datadog extensively for monitoring, logging, and performance analytics
  • Configuration Management: Work with Shell, YAML, JSON, and Python for scripting, automation, and configuration
  • System Administration: Administer Ubuntu, RHEL, CentOS, and (entry-level) Windows Server environments
  • Collaboration: Utilize Atlassian Suite (Jira, Confluence) for documentation, ticketing, and project tracking; contribute to ITSM/ITIL frameworks
  • AI & Productivity Tools: Integrate and leverage tools like Claude, GitHub Copilot, and other AI productivity solutions
  • Reporting: Create dashboards and business reports to provide actionable insights and drive continuous improvement

Requirements

  • Microsoft Azure (App Services, VM, Container Instances, AKS, SQL Server, Azure SQL)
  • Git, GitHub, GitHub Actions
  • SonarQube Cloud, Terraform, Docker
  • Datadog (extensive), SRE concepts (SLOs, SLIs, golden signals, instrumentation)
  • Incident management, dashboard development, business reporting
  • Shell scripting, YAML/JSON configs, Python
  • Ubuntu, RHEL, CentOS, Windows Server (entry-level)
  • Atlassian Suite (Jira/Confluence)
  • ITSM / ITIL familiarity
  • AI tools (Claude, GitHub Copilot, etc.)
  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
  • 5+ years of demonstrated experience in DevOps, SRE, or cloud engineering roles
  • Analytical thinker, problem solver, and proactive communicator
  • Strong collaboration skills, especially across cross-functional and remote teams
  • Ability to thrive in a fast-paced, innovative business environment

Nice to have

  • Microsoft Azure: App Insights, IoT Hub, Azure DevOps, API Management
  • DevOps: Ansible, Argo CD, CodeRabbit, Artifactory
  • SRE Practices: Capacity planning, cost optimization
  • Programming Languages: JavaScript, PowerShell
  • Other Tools: Snowflake, PagerDuty, Salesforce MuleSoft Anypoint

What we offer

  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Similar Jobs for Senior DevOps / Site Reliability Engineer

8 matching positions

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment Type: Full-time

Senior Site Reliability Engineer

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location: United States
Salary: 113082.00 - 175725.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • 6+ years experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting language used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience with distributed caching systems: including their underlying algorithms and how to optimize their performance
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
  • Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure.
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
Employment Type: Full-time

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location: United States
Salary: 113082.00 - 175725.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, share our values and work in accordance with them
Employment Type: Full-time

Senior Site Reliability Engineer (SRE/DevOps)

Build infrastructure using your knowledge of Public Cloud (AWS/GCP) services; Su...
Location: India, Bengaluru
Salary: Not provided
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • 10+ years of experience with programming languages (Python/Go)
  • 5+ years of experience with automation/configuration tools like Terraform, Ansible, or Chef
  • Advanced knowledge of Git
  • Experience with Identity and Access Management (IAM) services in public clouds
  • 5 years of hands-on Linux experience (configuration, troubleshooting, deployment)
  • Hands-on experience with AWS or GCP services; minimum 2 years managing cloud infrastructure
  • Bachelor’s degree in Computer Science, CIS, Engineering, or a related field
  • Experience building and managing CI/CD pipelines
  • Experience with container technologies like Kubernetes or Mesos
Job Responsibility
  • Build infrastructure using your knowledge of Public Cloud (AWS/GCP) services
  • Support cloud infrastructure development using Infrastructure as Code best practices
  • Develop CI/CD pipelines
  • Monitor and maintain production cloud systems
  • Research and implement new cloud technologies using open-source tools
  • Utilize APIs to write DevOps tools for large-scale automation
  • Collaborate with team leads to secure infrastructure in AWS and GCP, including CI/CD pipeline security
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type: Full-time

Senior Site Reliability Engineer

AutoRABIT is the leader in DevSecOps for SaaS platforms such as Salesforce. Its ...
Location: India, Hyderabad
Salary: Not provided
Company: AutoRABIT
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in SRE, DevOps, or related roles
  • Solid hands-on experience with AWS services (EKS, ECS, EC2, RDS, S3, Redis, etc.)
  • Proficient in writing Terraform infrastructure scripts
  • Strong scripting skills in Python using Boto3
  • Deep understanding of monitoring/logging tools (ELK, CloudWatch, TrendMicro)
  • Experience building and managing CI/CD pipelines (CodeBuild, CodePipeline)
  • Knowledge of infrastructure security and incident response practices
  • Willing to work in rotational shifts and rotational week-offs
  • Bachelor’s degree in computing or a related field
  • AWS certifications are preferred
Job Responsibility
  • Provision and manage AWS infrastructure using Terraform
  • Write AWS Lambda functions (Python3 + Boto3) to automate operational tasks
  • Set up monitoring, logging, and alerting with ELK, TrendMicro, and AWS CloudWatch
  • Configure alerts for performance and security anomalies
  • Develop and maintain CI/CD pipelines using AWS CodeBuild and CodePipeline
  • Troubleshoot production issues and contribute to blameless postmortems
  • Contribute to system hardening and security compliance efforts
  • Adhere to established internal controls
Employment Type: Full-time

Senior Site Reliability Engineer

Our client, a leader in the HCM space is in need of a Senior Site Reliability En...
Location: United States, Reston
Salary: 67.50 - 97.50 USD / Hour
Company: ClearBridge Technology Group
Expiration Date: Until further notice
Requirements
  • 5+ years of experience supporting large-scale cloud infrastructure, automation, and DevOps, preferably in an AWS environment
  • Ability to build, maintain, and consume CI/CD pipelines and tools
  • Proficient with Terraform to automate critical infrastructure
  • Experience supporting Kubernetes-based platforms to ensure high availability
  • Active TS/SCI clearance with CI polygraph
Job Responsibility
  • Ensuring the Kubernetes-based platform is maintained and healthy, with high availability, scalability, and security
  • Automating infrastructure provisioning, configuration management, and application deployments using Terraform and Argo CD
  • Handling troubleshooting and documentation associated with the platform
  • Collaborating with multiple cross-functional teams
  • Building, maintaining, and consuming CI/CD pipelines
Employment Type: Full-time

Senior Site Reliability Engineer

We are looking for a highly motivated Site Reliability Engineer to join our grow...
Location: United States, Reston
Salary: 147400.00 - 221200.00 USD / Year
Company: Workday
Expiration Date: Until further notice
Requirements
  • Minimum of 5 years of hands-on experience working with large scale cloud infrastructure, automation, and overall DevOps methodologies
  • Bachelor's degree in a computer related field or equivalent work experience
  • Proficiency in infrastructure automation tools like Terraform
  • Experience with building, maintaining, and consuming CI/CD pipelines and tools like Argo CD
  • Strong analytical and problem-solving skills
  • Excellent communication and collaboration skills
  • Strong understanding of Kubernetes
  • Amazon Web Services proficiency working in a production environment
  • Proficiency in at least one programming language such as C#, Python, Ruby, Rust, or Go
  • Experience with security auditing and compliance frameworks
Job Responsibility
  • Ensuring the Workday Kubernetes-based platform is maintained, healthy, and highly available
  • Maintaining the overall platform, ensuring high availability, scalability, and security
  • Automating infrastructure provisioning, configuration management, and application deployments using tools like Terraform and Argo CD
  • Troubleshooting and support for platform-related issues
  • Implementing and maintaining security standard methodologies for the platform
  • Building and maintaining comprehensive documentation for platform components and processes
  • Collaborating effectively with other engineers and development teams across multiple locations and time zones
What we offer
  • Eligibility for Workday Bonus Plan or commission/bonus
  • Annual refresh stock grants
Employment Type: Full-time

Senior Site Reliability Engineer

We're hiring a Senior Site Reliability Engineer to lead reliability strategy and...
Location: India, Chennai
Salary: Not provided
Company: Zuora
Expiration Date: Until further notice
Requirements
  • 8+ years of hands-on experience in Site Reliability Engineering, DevOps, or large-scale production operations.
  • Advanced expertise in AWS, including architecture design across services such as EC2, EKS, VPC, IAM, RDS, S3, and CloudWatch.
  • Deep experience with Infrastructure-as-Code using Terraform, including complex modules, state management, and governance.
  • Strong programming and automation skills using Python and Shell; experience building production-grade automation systems.
  • Expert-level Linux systems knowledge, including performance tuning, security hardening, and deep troubleshooting.
  • Proven experience operating distributed systems and data streaming platforms such as Kafka in high-throughput environments.
  • Demonstrated ability to work independently on complex, ambiguous problems with broad organizational impact.
  • Proven technical leadership experience driving large, cross-team reliability or infrastructure initiatives, including setting technical direction, influencing design decisions, and mentoring engineers to deliver measurable outcomes at scale.
  • Practical experience designing or implementing AI/ML-driven automation in operations, reliability, or platform engineering.
Job Responsibility
  • Define and evolve SLOs, SLIs, and resilience patterns
  • Build AI-driven automation for detection, remediation, and forecasting
  • Lead cloud infrastructure and Kubernetes platforms
  • Drive incident response and operational excellence
  • Mentor engineers and influence org-wide reliability practices
What we offer
  • Competitive compensation, variable bonus and performance-based reward opportunities, and retirement programs
  • Medical, dental, and vision insurance
  • Generous, flexible time off, plus paid holidays, wellness days, and a company-wide year-end break
  • Paid parental leave (including fully paid leave for eligible ZEOs, subject to local policy)
  • Learning & development stipend to support ongoing growth
  • Opportunities to volunteer and give back, including charitable donation matching where available
  • Mental wellbeing resources and support
Employment Type: Full-time