
Senior Site Reliability Engineer

United States 113082.00 - 175725.00 USD / Year · Job Posted June 02, 2026

Job Description

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to support and develop the platform serving the world’s favorite encyclopedia, Wikipedia, to millions of people around the globe. Wikimedia’s Site Reliability Engineering (SRE) team is principally responsible for ensuring that our global top-10 website and its underlying infrastructure are healthy and continue to develop in support of Wikimedia’s mission: to help everyone share in the sum of all knowledge.

The SRE team at Wikimedia is a globally distributed and diverse team of engineers with a drive to explore, experiment, and embrace new technologies. We work in the open by publishing all of our documentation, code, and configuration as open source, and all our production systems are powered by open source software. We invite you to go through our documentation and code; no login is required.

If you find what we do interesting, if you are up to the challenge of improving the reliability and delivery of one of the Internet’s top websites, and if you enjoy the idea of working in a remote-first role, we may just be the right place for you. If you are interested in this role, we’d expect you to be able to travel 1-2 times a year for in-person events and team meetings. Most importantly, share our values and work in accordance with them!

Job Responsibility

  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement by automating the installation, configuration, and maintenance of services on our platform
  • Working closely with product teams to help them bring scalable functionality to our users, assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure.
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength

Requirements

  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting language used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience with distributed caching systems, including their underlying algorithms and how to optimize their performance
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
  • Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures

Nice to have

  • Experience with Linux kernel tuning
  • Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Prometheus, Grafana, etc.)
  • Developing/contributing to Free and Open Source software, or being part of an open-source community (share your favourite pull requests!)
  • Experience with LAMP stack technologies (PHP/HHVM, memcached/Redis) -- MediaWiki experience is a definite plus
  • Experience with defining cross-team SLOs and their implementation
  • Experience operating an on-premise filesystem or object store at scale, preferably OpenStack Swift or Ceph
  • Experience with other advanced distributed storage and database systems (Cassandra, MariaDB etc.)


Similar Jobs for Senior Site Reliability Engineer

8 matching positions

Senior Site Reliability Engineer

Embark on a transformative journey as a Senior Site Reliability Engineer - AVP. ...
Location: United States, Whippany
Salary: 120000.00 - 175000.00 USD / Year
Company: Barclays (barclays.co.uk)
Expiration Date: Until further notice
Requirements
  • Considerable programming expertise in languages such as Python, Java, and others
  • Practical experience with Infrastructure as Code (IaC) tools, including Ansible, Chef, and Terraform
  • Validated experience with observability and monitoring platforms such as Observe, Elastic, InfluxDB, and Grafana
  • Solid understanding of containerization technologies and Unix/Linux environments
  • Demonstrates a Site Reliability Engineering (SRE) mindset, with good analytical skills, ownership, and a forward-thinking approach to problem-solving
Job Responsibility
  • Build and maintain infrastructure platforms and products that support applications and data systems
  • Ensure the reliability, availability, and scalability of the systems, platforms, and technology
  • Development, delivery, and maintenance of high-quality infrastructure solutions
  • Monitoring of IT infrastructure and system performance to measure, identify, address, and resolve any potential issues, vulnerabilities, or outages
  • Development and implementation of automated tasks and processes to improve efficiency and reduce manual intervention
  • Implementation of a secure configuration and measures to protect infrastructure against cyber-attacks, vulnerabilities, and other security threats
  • Cross-functional collaboration with product managers, architects, and other engineers to define IT Infrastructure requirements
  • Stay informed of industry technology trends and innovations
What we offer
  • medical, dental and vision coverage
  • 401(k)
  • life insurance
  • other paid leave for qualifying circumstances
Employment type: Full-time

Senior Site Reliability Engineer

Are you interested in working on cutting-edge cloud security products? Would you ...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance
  • Ability to meet Microsoft, customer and/or government security screening requirements is required pre-offer and post-hire for this role
  • Failure to maintain or obtain the appropriate clearance and/or customer screening requirements may result in employment action up to and including termination
  • This position requires successful verification of the stated security clearance to meet federal government customer requirements
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Ensure 24x7 Service Reliability: Act as a Designated Responsible Individual (DRI) in an on-call rotation, leading incident response and resolution to maintain uptime and performance for Microsoft's most critical services
  • Support and Automate Deployments: Execute and improve manual operations and deployments for our products, while designing automation to scale and streamline those processes across environments
  • Build Scalable Systems: Develop automation for monitoring, alerting, debugging, and deployment to reduce manual effort and accelerate safe, reliable delivery
  • Drive Compliance and Security: Ensure systems meet Microsoft's standards for security, privacy, and accessibility, especially when onboarding new technologies
  • Lead Post-Incident Learning: Conduct postmortems, share insights, and implement solutions that prevent recurrence—fostering a culture of learning and continuous improvement
  • Collaborate Across Teams: Partner with engineering and product teams to align reliability goals with customer needs and deliver seamless user experiences
  • Stay Ahead Technically: Continuously invest in your technical growth to improve system availability, observability, and performance at scale
  • Embody our company's Culture and Values
Employment type: Full-time

Senior Site Reliability Engineer

Do you want to be at the heart of cloud computing? The Compute team is at the co...
Location: United States, Reston
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Verification of U.S. citizenship due to citizenship-based legal restrictions
Job Responsibility
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate
  • Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of service fabric services while also driving consistency in monitoring and operations at scale
  • Drives development of design documents for a product, application, service, or platform
  • Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI)
  • Leverages subject-matter expertise of product features and partners with appropriate stakeholders to drive a workgroup's project plans, release plans, and work items
  • Take full ownership of assigned services, actively contributing to their enhancement across all cloud environments
  • Identify opportunities for automation and optimization within the cloud to better support customers
What we offer
  • Certain roles may be eligible for benefits and other compensation
Employment type: Full-time

Senior Site Reliability Engineer

Doctolib is looking for a Senior Site Reliability Engineer to keep Doctolib prod...
Location: France, Paris
Salary: Not provided
Company: Doctolib (doctolib.fr)
Expiration Date: Until further notice
Requirements
  • Have strong hands-on experience (6+ years) on a production platform, if possible at scale
  • Have proven experience with cloud platforms such as AWS, Azure or Google Cloud
  • Have proven experience with datastores such as PostgreSQL and/or Kafka and/or Couchbase
  • Have solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
  • Have proficiency in at least one programming language (Ruby, Python, Go, Java, etc.) and understanding of infrastructure as code principles
  • Are fluent in English
Job Responsibility
  • Design, build and maintain core infrastructure databases that allow Doctolib to scale to support hundreds of thousands of concurrent users
  • Automate deployment, scaling, and maintenance of databases to enhance system reliability and operational efficiency
  • Implement and improve monitoring, alerting, and incident response processes to identify and address potential issues before they impact both practitioners and patients
  • Provide documentation and tooling to empower the feature teams in their use of their databases, while ensuring their reliability
  • Mitigate production database issues during working hours when the issue cannot be fixed by the responsible feature team
  • Research and evaluate new technologies, tools, and best practices to continuously improve the reliability and availability of our systems and processes
What we offer
  • Free Health Insurance for you
  • Up to 14 days of RTT
  • A flexible workplace policy offering both hybrid and office-based modes
  • Flexibility days allowing you to work from EU countries and the UK 10 days per year
  • Wellbeing program with free mental health and coaching through moka.care
  • Special support package for caregivers and workers with disabilities
  • Lunch voucher with Swile card
  • Work Council subsidy for sport club membership or creative activities
  • Bicycle subsidy
  • Public transportation reimbursement
Employment type: Full-time

Senior Site Reliability Engineer

Join us as a "Project Manager" at Barclays, where you'll spearhead the evolution...
Location: India, Pune; Bengaluru
Salary: Not provided
Company: Barclays (barclays.co.uk)
Expiration Date: Until further notice
Requirements
  • Proven experience managing business-critical projects of medium to large complexity and cross-functional teams spanning multiple regions and functions, with an accredited project management qualification (e.g. Prince 2, PMI, APM, Agile)
  • Excellent written and verbal communication skills, including presentation to Senior level stakeholders
  • Good IT skills, including proficiency in Microsoft PowerPoint, Word, and Project
Job Responsibility
  • Manage a single project with specific, defined objectives, deadlines, and deliverables
  • Operate more tactically, focusing on day-to-day management of resources, schedules, and deliverables for their individual project
  • Work with a shorter, more defined timeframe as projects have a set beginning and end
  • Primarily manage stakeholders related to their specific project, ensuring communication and expectations are clear for the project’s deliverables
  • Focus on risks and issues specific to their project and work to mitigate them within the project’s scope
  • Manage resources for their individual project, ensuring that the project team has the necessary skills, tools, and time to complete the work
  • Focus on managing the budget of their specific project, ensuring it is completed within the financial constraints
  • Measure success based on the timely completion of project deliverables within scope, time, and budget
  • Manage changes that directly impact their specific project, including scope changes, timelines, or resource allocation adjustments
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
Employment type: Full-time

Senior Site Reliability Engineer

AutoRABIT is the leader in DevSecOps for SaaS platforms such as Salesforce. Its ...
Location: India, Hyderabad
Salary: Not provided
Company: AutoRABIT (autorabit.com)
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in SRE, DevOps, or related roles
  • Solid hands-on experience with AWS services (EKS, ECS, EC2, RDS, S3, Redis, etc.)
  • Proficient in writing Terraform infrastructure scripts
  • Strong scripting skills in Python using Boto3
  • Deep understanding of monitoring/logging tools (ELK, CloudWatch, TrendMicro)
  • Experience building and managing CI/CD pipelines (CodeBuild, CodePipeline)
  • Knowledge of infrastructure security and incident response practices
  • Willing to work in rotational shifts and rotational week-offs
  • Bachelor’s degree in computer science or a related field
  • AWS certification is preferred
Job Responsibility
  • Provision and manage AWS infrastructure using Terraform
  • Write AWS Lambda functions (Python3 + Boto3) to automate operational tasks
  • Set up monitoring, logging, and alerting with ELK, TrendMicro, and AWS CloudWatch
  • Configure alerts for performance and security anomalies
  • Develop and maintain CI/CD pipelines using AWS CodeBuild and CodePipeline
  • Troubleshoot production issues and contribute to blameless postmortems
  • Contribute to system hardening and security compliance efforts
  • Responsibility to adhere to set internal controls
Employment type: Full-time

Senior Site Reliability Engineer

Our client, a leader in the HCM space is in need of a Senior Site Reliability En...
Location: United States, Reston
Salary: 67.50 - 97.50 USD / Hour
Company: ClearBridge Technology Group (clearbridgetech.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience supporting large-scale cloud infrastructure, automation, and DevOps, preferably in an AWS environment
  • Ability to build, maintain, and consume CI/CD pipelines and tools
  • Proficient with Terraform to automate critical infrastructure
  • Experience supporting Kubernetes-based platforms to ensure high availability
  • Active TS/SCI with CI Poly
Job Responsibility
  • Ensuring the Kubernetes-based platform is maintained and healthy, and provides high availability, scalability, and security
  • Automating infrastructure provisioning, configuration management, and application deployments using Terraform and Argo CD
  • Handling troubleshooting and documentation associated with the platform
  • Collaborating with multiple cross functional teams
  • Proficient at building, maintaining and consuming CI/CD pipelines
Employment type: Full-time

Senior Site Reliability Engineer

As a Senior Site Reliability Engineer at Optimizely, you will play a critical ro...
Location: Vietnam, Hanoi
Salary: Not provided
Company: Optimizely (optimizely.com)
Expiration Date: Until further notice
Requirements
  • Proven experience as a Senior Site Reliability Engineer or similar role in a fast-paced environment
  • Strong understanding of cloud computing, networking, and system architecture. AWS is preferred; GCP is a plus
  • Proficiency in scripting and automation tools (e.g., Python, Bash, Terraform, Chef)
  • Experience with observability tools (e.g., Datadog, Prometheus, Grafana, ELK Stack)
  • Kubernetes Expertise: Demonstrated experience in designing, deploying, and managing applications in Kubernetes environments. Proficiency in configuring and optimizing Kubernetes clusters for scalability, reliability, and performance. Hands-on experience with Kubernetes tools and technologies such as Helm, Kustomize, and Kubectl
  • Istio Proficiency (preferred): Familiarity with Istio service mesh architecture and its components is a plus
  • Experience (preferred) with a message broker, preferably Kafka
  • Understanding (preferred) of coordination services such as Zookeeper
  • Proficiency in version control software, particularly Git, is required
  • Excellent problem-solving skills and attention to detail
Job Responsibility
  • Design and implement reliable and scalable systems to support our digital platforms
  • Collaborate with software engineers to integrate reliability into the architecture
  • Develop and maintain monitoring solutions to ensure system performance and availability
  • Identify and resolve performance bottlenecks and optimize system performance
  • Lead incident response efforts, including troubleshooting, root cause analysis, and implementing corrective actions to prevent future incidents
  • Develop and maintain automation tools and scripts to improve system efficiency and reduce manual intervention
  • Implement infrastructure as code practices
  • Work closely with cross-functional teams to align on reliability goals and best practices
  • Communicate effectively with stakeholders to provide updates on system status and improvements
  • Stay updated with the latest industry trends and technologies related to site reliability engineering
Employment type: Full-time