Senior Site Reliability Engineer, Infrastructure Foundations

United States · Employment contract · 113082.00 - 175725.00 USD / Year · Job Posted May 14, 2026

Job Description

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to support and develop the platform serving the world’s favorite encyclopedia, Wikipedia, to millions of people around the globe. Wikimedia’s Site Reliability Engineering (SRE) team is principally responsible for ensuring our global top-10 website and its underlying infrastructure is healthy and developing further in support of Wikimedia’s mission: to help everyone share in the sum of all knowledge. The SRE team at Wikimedia is a globally distributed and diverse team of engineers with a drive to explore, experiment, and embrace new technologies. We work in the open by publishing all of our documentation, code, and configuration as open source, and all our production systems are powered by open source software. We invite you to go through our documentation and code -- no login required. If you find what we do interesting, if you are up to the challenge of improving the reliability and delivery of one of the Internet’s top websites, and you enjoy the idea of working in a remote-first role, we may just be the right place for you.

Job Responsibility

  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams to help them bring scalable functionality to our users, by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, sharing our values and working in accordance with them

Requirements

  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
  • Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures

Nice to have

  • Experience setting and implementing fleet-wide security policies
  • Experience with software supply chain security
  • Awareness of the current open source infrastructure security landscape
  • Experience working together with software security teams
  • Experience with credential management systems
  • Experience implementing immutable logging and auditing
  • Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Prometheus, Grafana, etc.)
  • Developing/contributing to Free and Open Source software, or being part of an open-source community (share your favourite pull requests!)
  • Experience with LAMP stack technologies (PHP/HHVM, memcached/Redis) -- MediaWiki experience is a definite plus
  • Experience with defining cross-team SLOs and their implementation

Similar Jobs for Senior Site Reliability Engineer, Infrastructure Foundations

8 matching positions

Senior Site Reliability Engineer

Join us in shaping the future of infrastructure automation for mission-critical ...
Location: United Kingdom, London
Salary: Not provided
Company: Axon
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in software engineering or site reliability
  • Experience building and scaling complex and impactful software products in a team environment
  • Deep skill in driving technical solutions across multiple teams
  • Strong Experience with Terraform and CI/CD
  • Strong experience managing infrastructure in the cloud (AWS or Azure)
  • Experience using languages such as Go, Python, C#, Java, or similar
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems
  • Empathy to support the needs of software engineers
Job Responsibility
  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision infrastructure rapidly, consistently, and securely across multiple cloud providers
  • Write code in Go that is performant, maintainable, clear, and concise
  • Championing and enforcing Infrastructure as Code (IaC) best practices and coding standards
  • Employ strong problem-solving skills, with the ability to debug problems in cloud native distributed systems
  • Influence and educate the engineering organization to adopt new and improved architectural patterns
  • Provide robust documentation for use by engineers to promote self-service
What we offer
  • Competitive base salary and RSUs
  • Comprehensive pension plan with matching contribution
  • Private health insurance & cash plans
  • 30 days paid holiday + UK public holidays
  • Enhanced maternity/paternity leave
  • GymPass subscription
  • Life assurance & income protection
  • Career growth support and wellness resources
Employment type: Full-time

Site Reliability Engineer

As a Site Reliability Engineer (SRE) at Polygon Labs, you will play a key role i...
Salary: Not provided
Company: Polygon Labs
Expiration Date: Until further notice
Requirements
  • A foundational understanding of Linux systems, processes, and basic networking concepts
  • Familiarity with at least one scripting or programming language, such as Python, Bash, or Go
  • An interest in site reliability, monitoring, and operating production infrastructure
  • Clear written and verbal communication skills, with a willingness to ask questions and learn
  • The ability to remain calm, methodical, and responsive during incidents or operational events
Job Responsibility
  • Monitoring production systems, alerts, dashboards, and logs across Polygon networks, including Polygon PoS and the Agglayer
  • Assisting with incident detection, triage, escalation, and resolution under the guidance of senior engineers
  • Supporting on-call and operational coverage through structured rotations, with training and mentorship
  • Following, maintaining, and improving runbooks and standard operating procedures
  • Assisting with routine operational tasks such as service restarts, upgrades, and configuration changes
  • Helping maintain and improve monitoring, logging, and alerting systems, including dashboards for network health, RPC performance, and node metrics
  • Learning to improve alert signal quality and reduce operational noise
  • Supporting cloud-based and containerized infrastructure, including nodes, RPC endpoints, and supporting services
  • Collaborating with protocol, product, and cross-functional teams to understand production issues and user impact
  • Participating in post-incident reviews and contributing to root-cause analysis documentation
What we offer
  • Remote first global workforce
  • Industry leading Medical, Dental and Vision health insurance
  • Company matching 401k with 3% match
  • $1,500 Home Office Set Up Allowance (life-time max)
  • $75 Monthly internet or phone reimbursement
  • Flexible Time Off
  • Company issued laptop
  • Egg freezing, mental health, and employee wellness benefits
Employment type: Full-time

Senior Site Reliability Manager

RUCKUS Networks is seeking an experienced Site Reliability Engineering (SRE) Man...
Location: United States, Sunnyvale
Salary: 135600.00 - 200000.00 USD / Year
Company: CommScope
Expiration Date: Until further notice
Requirements
  • 12+ years in Site Reliability Engineering (SRE), with 6+ years leading SRE, DevOps, or infrastructure teams
  • Proven experience mentoring engineering managers and developing leadership talent
  • Track record of transforming traditional operations or NOC teams into modern SRE organizations
  • Strong project management skills with Agile/Kanban experience and JIRA proficiency
  • Excellent communication skills, including executive-level presentations
  • Deep SRE expertise: incident management, on-call systems, monitoring, and reliability engineering
  • Infrastructure automation experience with Terraform, Kubernetes, Docker, and CI/CD pipelines
  • Cloud platform proficiency (GCP/AWS), including networking, security, and cost optimization
  • Monitoring and observability experience with Prometheus, Grafana, APM tools, and log aggregation
  • 24/7 operations experience with global team coordination and escalation management
Job Responsibility
  • Lead and develop engineering managers and technical operations engineers across India and APAC time zones
  • Build a collaborative team culture that emphasizes knowledge sharing, automation, and operational excellence
  • Mentor engineering managers to strengthen leadership capabilities and technical expertise
  • Set clear performance expectations and provide ongoing coaching for growth
  • Partner cross-functionally with Product, Security, Development, and global operations teams
  • Own 24/7 operational stability for India/APAC, including incident response, escalation, and post-incident reviews
  • Drive comprehensive incident management: alert handling, outage response, and root cause analysis (RCA/CAR)
  • Transform traditional operations into modern SRE practices using SLOs, error budgets, and reliability engineering
  • Implement robust monitoring and alerting with APM tools, dashboards, and automation frameworks
  • Lead technical project delivery with clear timelines, resource planning, and stakeholder communication
What we offer
  • medical, dental, and vision plans
  • life and accidental death insurance
  • a 401(k) plan
  • participation in the Company’s Incentive Plan
  • eleven paid holidays in a full calendar year
  • two weeks of paid vacation (prorated based on start date)
  • other leave options
Employment type: Full-time

Director, Site Reliability Engineering

We are seeking a Director of Site Reliability Engineering to lead a global organ...
Location: Finland, Helsinki
Salary: Not provided
Company: Aiven Deutschland GmbH
Expiration Date: Until further notice
Requirements
  • Proven experience leading and scaling global SRE or infrastructure organizations through managers, ideally across multiple regions and time zones
  • Strong track record of defining and executing reliability strategy at scale, including ownership of SLIs/SLOs, incident management frameworks, and operational excellence programs
  • Demonstrated ability to build, develop, and mentor senior leaders, creating high-performing, inclusive teams and strong leadership pipelines
  • Experience operating in a 24/7/365 production environment, with deep understanding of follow-the-sun models, on-call design, and large-scale incident response
  • Ability to partner cross-functionally at the executive level (Engineering, Product, Support) to influence architecture, prioritization, and long-term platform investments
  • Strong data-driven leadership approach, with experience defining SLI/SLOs and using metrics to drive prioritization, accountability, and continuous improvement
  • Solid technical foundation in distributed systems, cloud infrastructure, and automation, with the ability to engage credibly with senior engineers and influence technical direction
  • Experience driving large-scale change and organizational design, including scaling teams, evolving operating models, and improving efficiency and reliability at company level
Job Responsibility
  • Define and drive global SRE operating strategy in partnership with regional SRE leaders across EMEA, AMER and APAC, ensuring alignment on reliability goals, operating models, and execution across a 24/7/365 follow-the-sun organization
  • Build and lead a multi-regional SRE organization through managers, developing leadership capability, mentoring team, and ensuring consistent performance, culture, and delivery across geographies
  • Set the vision and roadmap for reliability engineering, enabling teams to deliver high-impact tools, automation, and process initiatives that improve platform resilience, scalability, and efficiency
  • Own global incident management strategy and operating model, including on-call design, coverage, and escalation frameworks, ensuring seamless coordination and high availability across regions
  • Establish a metrics-driven operating cadence, defining KPIs/SLIs/SLOs/Error Budget, driving data-informed prioritization, and embedding operational rigor and continuous improvement across the SRE organization
What we offer
  • Participate in Aiven’s equity plan
  • Balance work and life with our hybrid work policy
  • Choose the equipment you need to set yourself up for success
  • Use your Professional Development Plan budget for learning opportunities
  • Receive holistic wellbeing support through our global Employee Assistance Program
  • Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
  • Enjoy country-specific benefits for our global cast
Employment type: Full-time

Site Reliability Engineering Intern

Join a mission-driven team at the forefront of distributed systems technology to...
Location: Ireland, Dublin
Salary: Not provided
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Current Enrollment: Must be currently pursuing a Bachelor’s or Master’s degree
  • Graduation Timeline: Must have a projected graduation date of either Fall 2026 or Spring 2027
  • Programming Proficiency: Demonstrated ability to code in at least one major language, such as Python, Java, C++, JavaScript, Rust, or Go
  • CS Fundamentals: Strong foundational knowledge of computer science, specifically in data structures and algorithm design
  • Communication Skills: Proactive communication style with a proven ability to collaborate effectively within a team environment
  • Analytical Mindset: A talent for troubleshooting and a curiosity-driven approach to optimizing complex technical systems
Job Responsibility
  • System Development: Build and maintain scalable, highly available, and fault-tolerant distributed systems at scale to support intensive computational workloads
  • Product Creation: Develop innovative products and tools from the ground up that will be utilized by a global customer base
  • Production Support: Triage bugs and resolve complex issues within production environments to ensure platform reliability
  • Feature Iteration: Partner with product owners and stakeholders to design, test, and iterate on new features that drive platform growth
  • Cross-Functional Collaboration: Work closely with senior engineers and teammates to align technical tasks with broader company goals
  • Mentorship Engagement: Participate in dedicated one-on-one mentorship sessions to accelerate your professional growth and technical mastery
  • Strategic Problem Solving: Apply analytical thinking to troubleshoot and optimize complex systems for maximum efficiency
Employment type: Full-time

Senior Software Engineer, Cloud Foundation

Who We Are: SiriusXM and its brands (Pandora, SiriusXM Media, AdsWizz, Simplecas...
Location: United States, New York; Washington, DC
Salary: 105800.00 - 165000.00 USD / Year
Company: SiriusXM
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related field, or an equivalent combination of education and experience
  • 5+ years in Cloud Infrastructure, Platform Engineering Development, DevOps, or Site Reliability Engineering roles
  • 2+ years of experience with AWS services (EC2, S3, ELB, VPC, IAM) or equivalent cloud environments, with a strong understanding of AWS best practices
  • 2+ years of experience running Linux-based systems, with advanced knowledge of Linux operating systems
  • Proficiency in Typescript, Python, Bash, or other common languages used for automation and infrastructure management
  • Expertise with tools like CloudFormation (CDK) or Terraform to deploy and manage infrastructure at scale
  • Proficiency in Git and experience with platforms like GitLab or GitHub for collaborative code management
  • Curiosity and Initiative
  • High Availability Mindset
  • Must have legal right to work in the U.S.
Job Responsibility
  • Build and Manage Foundational Cloud Infrastructure
  • Improve Transparency
  • AWS Expertise
  • Optimize Infrastructure Costs
  • Collaborate with other technical groups
Employment type: Full-time

Senior Growth Engineer

About Buffer: We create social media and brand-building software for small busin...
Salary: 156500.00 - 202300.00 USD / Year
Company: Buffer
Expiration Date: Until further notice
Requirements
  • Strong frontend foundations in React, TypeScript, and modern web development
  • comfortable going beyond the frontend when the work calls for it - writing backend logic, building API routes, connecting third-party integrations
  • understands what it takes to grow something (hands-on experience with SEO, analytics, A/B testing, and conversion optimization or grown own product/startup/side project)
  • thinks beyond the immediate task
  • thrives in remote, async environments
  • communicates clearly, supports teammates, works effectively across marketing, design, and engineering
  • has built AI into user-facing marketing features at scale and uses AI tools daily
  • has a personal stake in the world of content creation
Job Responsibility
  • Own growth engineering projects end-to-end, from implementing a localization framework for international audiences to building referral program logic, attribution systems, and rebuilding tracking infrastructure for accuracy and reliability
  • build features on Buffer's marketing site that help people discover and try Buffer, shipping landing pages, interactive tools, and conversion flows
  • shape our marketing platform capabilities by building the systems and frameworks that help the whole team move faster
  • drive experimentation and optimization, helping the team learn faster through code
  • strengthen our foundations by maintaining integrations with our marketing technology stack (Segment, GTM, Mixpanel, BigQuery)
  • help shape our engineering culture by pairing with other engineers, reviewing code, sharing knowledge about growth systems and marketing engineering patterns
What we offer
  • Competitive salary
  • work remotely
  • 4-day workweeks
  • health insurance
  • home office setup ($1000)
  • growth mindset fund
  • new laptop
  • unlimited free books
  • AI tools stipend
  • flexible time off
Employment type: Full-time

Senior MLOps Engineer - Data Ingestion - Paris

We are looking for a Senior MLOps Engineer to join the Panda Team (Data & ML Ope...
Location: France, Paris
Salary: Not provided
Company: Doctolib
Expiration Date: Until further notice
Requirements
  • You have at least 7 years of experience as an MLOps Engineer or ML Platform Engineer, with proven production model lifecycle management experience
  • You have expert-level experience with ML orchestration tools (MLflow, Braintrust, or similar) for batch processing and inference pipelines
  • You have a strong Site Reliability Engineering (SRE) foundation with focus on operations excellence, reliability, and observability
  • You have expertise in Python for automation and ML pipeline scripting
  • You have strong proficiency with infrastructure-as-code tools such as Terraform and container orchestration (Kubernetes)
  • You have experience with model evaluation frameworks and golden dataset management
  • You have a solid understanding of cloud infrastructure (preferably GCP, AWS, or Azure)
  • You have excellent problem-solving skills with focus on identifying and resolving infrastructure bottlenecks
  • You are fluent in English
Job Responsibility
  • Design and implement end-to-end ML model pipelines in production (LLM and custom models) with robust deployment, evaluation, and monitoring frameworks
  • Own data pseudo-anonymization architecture within ingestion services, converting Tier 0 (personal identifiers) to Tier 1 (anonymized data) while ensuring data quality and model performance
  • Build and maintain secure data export services with ML-based threat detection to prevent attack vectors (SQL injection, etc.) using adaptive models rather than manual rules
  • Manage golden datasets and implement production model evaluation frameworks to ensure anonymization quality and system reliability
  • Build and maintain data pipelines that efficiently extract, transform, and load data from various sources, handling multiple data formats (text, images, audio, video)
  • Implement automation and orchestration tools using ML orchestration platforms (MLflow, Braintrust, or similar) to streamline infrastructure provisioning and reduce manual effort
  • Monitor data and ML platforms for performance, reliability, and security; identify and troubleshoot issues proactively
  • Mentor team members on MLOps expertise and best practices to reduce knowledge silos and build organizational capability
What we offer
  • Free comprehensive health insurance for you and your children
  • 25 days of paid vacation per year, plus up to 14 days of RTT
  • Free mental health and coaching services through our partner Moka.care
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Lunch vouchers (Swile card) worth €8.50 per working day, with €4.50 covered by Doctolib
  • A subsidy from the work council to refund part of the membership to a sport club or a creative class
  • 50% reimbursement of your public transport subscription
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Relocation support in case of international mobility
Employment type: Full-time