Senior Site Reliability Engineer, Infrastructure Foundations

Wikimedia Foundation

Location:
United States

Contract Type:
Employment contract

Salary:

113082.00 - 175725.00 USD / Year

Job Description:

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to support and develop the platform serving the world’s favorite encyclopedia, Wikipedia, to millions of people around the globe. Wikimedia’s Site Reliability Engineering (SRE) team is principally responsible for ensuring our global top-10 website and its underlying infrastructure are healthy and developing further in support of Wikimedia’s mission: to help everyone share in the sum of all knowledge. The SRE team at Wikimedia is a globally distributed and diverse team of engineers with a drive to explore, experiment, and embrace new technologies. We work in the open by publishing all of our documentation, code, and configuration as open source, and all our production systems are powered by open source software. We invite you to go through our documentation and code -- no login required. If you find what we do interesting, if you are up to the challenge of improving the reliability and delivery of one of the Internet’s top websites, and you enjoy the idea of working in a remote-first role, we may just be the right place for you.

Job Responsibility:

  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams to help them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, sharing our values and working in accordance with them

Requirements:

  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
  • Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures

Nice to have:

  • Experience setting and implementing fleet-wide security policies
  • Experience with software supply chain security
  • Awareness of the current open source infrastructure security landscape
  • Experience working together with software security teams
  • Experience with credential management systems
  • Experience implementing immutable logging and auditing
  • Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Prometheus, Grafana, etc.)
  • Developing/contributing to Free and Open Source software, or being part of an open-source community (share your favourite pull requests!)
  • Experience with LAMP stack technologies (PHP/HHVM, memcached/Redis) -- MediaWiki experience is a definite plus
  • Experience with defining cross-team SLOs and their implementation

Additional Information:

Job Posted:
May 14, 2026

Employment Type:
Fulltime
Work Type:
Remote work

Similar Jobs for Senior Site Reliability Engineer, Infrastructure Foundations

Senior Site Reliability Engineer

Join us in shaping the future of infrastructure automation for mission-critical ...
Location:
United Kingdom, London
Salary:
Not provided
Axon
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in software engineering or site reliability
  • Experience building and scaling complex and impactful software products in a team environment
  • Deep skill in driving technical solutions across multiple teams
  • Strong experience with Terraform and CI/CD
  • Strong experience managing infrastructure in the cloud (AWS or Azure)
  • Experience using languages such as Go, Python, C#, Java, or similar
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems
  • Empathy to support the needs of software engineers
Job Responsibility:
  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision infrastructure rapidly, consistently, and securely across multiple cloud providers
  • Write code in Go that is performant, maintainable, clear, and concise
  • Championing and enforcing Infrastructure as Code (IaC) best practices and coding standards
  • Employ strong problem-solving skills, with the ability to debug problems in cloud native distributed systems
  • Influence and educate the engineering organization to adopt new and improved architectural patterns
  • Provide robust documentation for use by engineers to promote self-service
What we offer:
  • Competitive base salary and RSUs
  • Comprehensive pension plan with matching contribution
  • Private health insurance & cash plans
  • 30 days paid holiday + UK public holidays
  • Enhanced maternity/paternity leave
  • GymPass subscription
  • Life assurance & income protection
  • Career growth support and wellness resources
Employment Type:
Fulltime

Senior Site Reliability Manager

RUCKUS Networks is seeking an experienced Site Reliability Engineering (SRE) Man...
Location:
United States, Sunnyvale
Salary:
135600.00 - 200000.00 USD / Year
CommScope
Expiration Date:
Until further notice
Requirements:
  • 12+ years in Site Reliability Engineering (SRE), with 6+ years leading SRE, DevOps, or infrastructure teams
  • Proven experience mentoring engineering managers and developing leadership talent
  • Track record of transforming traditional operations or NOC teams into modern SRE organizations
  • Strong project management skills with Agile/Kanban experience and JIRA proficiency
  • Excellent communication skills, including executive-level presentations
  • Deep SRE expertise: incident management, on-call systems, monitoring, and reliability engineering
  • Infrastructure automation experience with Terraform, Kubernetes, Docker, and CI/CD pipelines
  • Cloud platform proficiency (GCP/AWS), including networking, security, and cost optimization
  • Monitoring and observability experience with Prometheus, Grafana, APM tools, and log aggregation
  • 24/7 operations experience with global team coordination and escalation management
Job Responsibility:
  • Lead and develop engineering managers and technical operations engineers across India and APAC time zones
  • Build a collaborative team culture that emphasizes knowledge sharing, automation, and operational excellence
  • Mentor engineering managers to strengthen leadership capabilities and technical expertise
  • Set clear performance expectations and provide ongoing coaching for growth
  • Partner cross-functionally with Product, Security, Development, and global operations teams
  • Own 24/7 operational stability for India/APAC, including incident response, escalation, and post-incident reviews
  • Drive comprehensive incident management: alert handling, outage response, and root cause analysis (RCA/CAR)
  • Transform traditional operations into modern SRE practices using SLOs, error budgets, and reliability engineering
  • Implement robust monitoring and alerting with APM tools, dashboards, and automation frameworks
  • Lead technical project delivery with clear timelines, resource planning, and stakeholder communication
What we offer:
  • medical, dental, and vision plans
  • life and accidental death insurance
  • a 401(k) plan
  • participation in the Company’s Incentive Plan
  • eleven paid holidays in a full calendar year
  • two weeks of paid vacation (prorated based on start date)
  • other leave options
Employment Type:
Fulltime

Director, Site Reliability Engineering

We are seeking a Director of Site Reliability Engineering to lead a global organ...
Location:
Finland, Helsinki
Salary:
Not provided
Aiven Deutschland GmbH
Expiration Date:
Until further notice
Requirements:
  • Proven experience leading and scaling global SRE or infrastructure organizations through managers, ideally across multiple regions and time zones
  • Strong track record of defining and executing reliability strategy at scale, including ownership of SLIs/SLOs, incident management frameworks, and operational excellence programs
  • Demonstrated ability to build, develop, and mentor senior leaders, creating high-performing, inclusive teams and strong leadership pipelines
  • Experience operating in a 24/7/365 production environment, with deep understanding of follow-the-sun models, on-call design, and large-scale incident response
  • Ability to partner cross-functionally at the executive level (Engineering, Product, Support) to influence architecture, prioritization, and long-term platform investments
  • Strong data-driven leadership approach, with experience defining SLI/SLOs and using metrics to drive prioritization, accountability, and continuous improvement
  • Solid technical foundation in distributed systems, cloud infrastructure, and automation, with the ability to engage credibly with senior engineers and influence technical direction
  • Experience driving large-scale change and organizational design, including scaling teams, evolving operating models, and improving efficiency and reliability at company level
Job Responsibility:
  • Define and drive global SRE operating strategy in partnership with regional SRE leaders across EMEA, AMER and APAC, ensuring alignment on reliability goals, operating models, and execution across a 24/7/365 follow-the-sun organization
  • Build and lead a multi-regional SRE organization through managers, developing leadership capability, mentoring team, and ensuring consistent performance, culture, and delivery across geographies
  • Set the vision and roadmap for reliability engineering, enabling teams to deliver high-impact tools, automation, and process initiatives that improve platform resilience, scalability, and efficiency
  • Own global incident management strategy and operating model, including on-call design, coverage, and escalation frameworks, ensuring seamless coordination and high availability across regions
  • Establish a metrics-driven operating cadence, defining KPIs/SLIs/SLOs/Error Budget, driving data-informed prioritization, and embedding operational rigor and continuous improvement across the SRE organization
What we offer:
  • Participate in Aiven’s equity plan
  • Balance work and life with our hybrid work policy
  • Choose the equipment you need to set yourself up for success
  • Use your Professional Development Plan budget for learning opportunities
  • Receive holistic wellbeing support through our global Employee Assistance Program
  • Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
  • Enjoy country-specific benefits for our global cast
Employment Type:
Fulltime

Senior Platform Engineer

As a Platform Engineer at PEXA, you will be at the heart of our global technolog...
Location:
Australia
Salary:
Not provided
PEXA UK
Expiration Date:
Until further notice
Requirements:
  • 3+ years’ experience in platform, site reliability, DevOps or cloud infrastructure engineering roles within complex or large-scale environments
  • Strong knowledge of AWS including networking, compute, storage and identity services
  • Proficiency with Infrastructure-as-Code tools such as Terraform or CloudFormation
  • Strong automation and scripting skills in Python, NodeJS or Bash
  • Experience designing and maintaining CI/CD pipelines using tools such as GitHub Actions or ArgoCD
  • Hands-on experience with Kubernetes, Helm and service meshes such as Istio
  • Experience working with event streaming platforms such as Kafka
  • Solid understanding of system and application security best practices including IAM, secrets management and compliance frameworks
  • Experience operating Linux-based systems in production at scale
  • Knowledge and hands-on experience with generative and agentic AI tooling
Job Responsibility:
  • Designing and evolving the foundational platform capabilities that power secure, scalable and efficient product delivery
  • Build and automate robust cloud infrastructure across our AWS environments using Infrastructure-as-Code and modern automation frameworks
  • Design and enhance CI and CD pipelines to improve delivery velocity, reliability and observability
  • Partner closely with software delivery squads, security, and resiliency and observability teams to strengthen our platform’s performance, security and developer experience
  • Mentor Associate Platform Engineers and Graduates, contribute to engineering forums and architecture reviews, and help shape the future direction of our platform roadmap
  • Designing, delivering and continuously improving scalable, resilient and secure platform infrastructure across PEXA’s global cloud environments
  • Champion self-service capabilities that empower delivery squads and reduce operational bottlenecks
  • Embed monitoring, alerting and incident response best practices
  • Support strategic initiatives such as cloud cost optimisation, architecture standardisation and technology modernisation
  • Drive continuous improvement across testing, observability and platform performance
What we offer:
  • Quarterly wellness days to recharge
  • Four weeks Workcation per year – work from an approved country
  • Take the opportunity to purchase up to four weeks additional annual leave per year
  • Learn from the best and upskill with PEXA Academy certifications and grow your career
Employment Type:
Fulltime

Site Reliability Engineering Intern

Join a mission-driven team at the forefront of distributed systems technology to...
Location:
Ireland, Dublin
Salary:
Not provided
Crusoe
Expiration Date:
Until further notice
Requirements:
  • Current Enrollment: Must be currently pursuing a Bachelor’s or Master’s degree
  • Graduation Timeline: Must have a projected graduation date of either Fall 2026 or Spring 2027
  • Programming Proficiency: Demonstrated ability to code in at least one major language, such as Python, Java, C++, JavaScript, Rust, or Go
  • CS Fundamentals: Strong foundational knowledge of computer science, specifically in data structures and algorithm design
  • Communication Skills: Proactive communication style with a proven ability to collaborate effectively within a team environment
  • Analytical Mindset: A talent for troubleshooting and a curiosity-driven approach to optimizing complex technical systems
Job Responsibility:
  • System Development: Build and maintain scalable, highly available, and fault-tolerant distributed systems at scale to support intensive computational workloads
  • Product Creation: Develop innovative products and tools from the ground up that will be utilized by a global customer base
  • Production Support: Triage bugs and resolve complex issues within production environments to ensure platform reliability
  • Feature Iteration: Partner with product owners and stakeholders to design, test, and iterate on new features that drive platform growth
  • Cross-Functional Collaboration: Work closely with senior engineers and teammates to align technical tasks with broader company goals
  • Mentorship Engagement: Participate in dedicated one-on-one mentorship sessions to accelerate your professional growth and technical mastery
  • Strategic Problem Solving: Apply analytical thinking to troubleshoot and optimize complex systems for maximum efficiency
Employment Type:
Fulltime

Site Reliability Engineer

As a Site Reliability Engineer (SRE) at Polygon Labs, you will play a key role i...
Salary:
Not provided
Polygon Labs
Expiration Date:
Until further notice
Requirements:
  • A foundational understanding of Linux systems, processes, and basic networking concepts
  • Familiarity with at least one scripting or programming language, such as Python, Bash, or Go
  • An interest in site reliability, monitoring, and operating production infrastructure
  • Clear written and verbal communication skills, with a willingness to ask questions and learn
  • The ability to remain calm, methodical, and responsive during incidents or operational events
Job Responsibility:
  • Monitoring production systems, alerts, dashboards, and logs across Polygon networks, including Polygon PoS and the Agglayer
  • Assisting with incident detection, triage, escalation, and resolution under the guidance of senior engineers
  • Supporting on-call and operational coverage through structured rotations, with training and mentorship
  • Following, maintaining, and improving runbooks and standard operating procedures
  • Assisting with routine operational tasks such as service restarts, upgrades, and configuration changes
  • Helping maintain and improve monitoring, logging, and alerting systems, including dashboards for network health, RPC performance, and node metrics
  • Learning to improve alert signal quality and reduce operational noise
  • Supporting cloud-based and containerized infrastructure, including nodes, RPC endpoints, and supporting services
  • Collaborating with protocol, product, and cross-functional teams to understand production issues and user impact
  • Participating in post-incident reviews and contributing to root-cause analysis documentation
What we offer:
  • Remote first global workforce
  • Industry leading Medical, Dental and Vision health insurance
  • Company matching 401k with 3% match
  • $1,500 Home Office Set Up Allowance (life-time max)
  • $75 Monthly internet or phone reimbursement
  • Flexible Time Off
  • Company issued laptop
  • Egg freezing, mental health, and employee wellness benefits
Employment Type:
Fulltime

Senior MLOps Engineer - Data Ingestion - Paris

We are looking for a Senior MLOps Engineer to join the Panda Team (Data & ML Ope...
Location:
France, Paris
Salary:
Not provided
Doctolib
Expiration Date:
Until further notice
Requirements:
  • You have at least 7 years as an MLOps Engineer or ML Platform Engineer with proven production model lifecycle management experience
  • You have expert-level experience with ML orchestration tools (MLflow, Braintrust, or similar) for batch processing and inference pipelines
  • You have a strong Site Reliability Engineering (SRE) foundation with focus on operations excellence, reliability, and observability
  • You have expertise in Python for automation and ML pipeline scripting
  • You have strong proficiency with infrastructure-as-code tools such as Terraform and container orchestration (Kubernetes)
  • You have experience with model evaluation frameworks and golden dataset management
  • You have a solid understanding of cloud infrastructure (preferably GCP, AWS, or Azure)
  • You have excellent problem-solving skills with focus on identifying and resolving infrastructure bottlenecks
  • You are fluent in English
Job Responsibility:
  • Design and implement end-to-end ML model pipelines in production (LLM and custom models) with robust deployment, evaluation, and monitoring frameworks
  • Own data pseudo-anonymization architecture within ingestion services, converting Tier 0 (personal identifiers) to Tier 1 (anonymized data) while ensuring data quality and model performance
  • Build and maintain secure data export services with ML-based threat detection to prevent attack vectors (SQL injection, etc.) using adaptive models rather than manual rules
  • Manage golden datasets and implement production model evaluation frameworks to ensure anonymization quality and system reliability
  • Build and maintain data pipelines that efficiently extract, transform, and load data from various sources, handling multiple data formats (text, images, audio, video)
  • Implement automation and orchestration tools using ML orchestration platforms (MLflow, Braintrust, or similar) to streamline infrastructure provisioning and reduce manual effort
  • Monitor data and ML platforms for performance, reliability, and security; identify and troubleshoot issues proactively
  • Mentor team members on MLOps expertise and best practices to reduce knowledge silos and build organizational capability
What we offer:
  • Free comprehensive health insurance for you and your children
  • 25 days of paid vacation per year, plus up to 14 days of RTT
  • Free mental health and coaching services through our partner Moka.care
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Lunch vouchers (Swile card) worth €8.50 per working day, with €4.50 covered by Doctolib
  • A subsidy from the work council to refund part of the membership to a sport club or a creative class
  • 50% reimbursement of your public transport subscription
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Relocation support in case of international mobility
Employment Type:
Fulltime

Senior Software Engineer, Managed Services

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’r...
Location:
United States, San Francisco; Sunnyvale
Salary:
166000.00 - 201000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • Cloud Expertise: Proven ability to design and scale fault-tolerant distributed systems and develop managed cloud services
  • Technical Proficiency: Strong fundamentals in microservices and infrastructure technologies like Docker, Kubernetes, Terraform, and CI/CD systems. Experience with observability principles and technologies, e.g., time-series databases, log aggregation, distributed tracing
  • Customer-Centric Mindset: A passion for creating intuitive, high-quality solutions that directly impact customer success and satisfaction
  • Collaboration Skills: Ability to work with cross-functional teams to align priorities and deliver customer-first solutions
  • Communication Skills: Exceptional ability to articulate complex ideas and align technical solutions with customer needs
  • Team Leadership: Mentor engineers, enhance hiring practices, and contribute to building a strong, inclusive engineering culture
  • Professional Experience: 3-5 years of software development experience, including programming with modern compiled languages such as Go, Rust, Java, or C++
Job Responsibility:
  • Building Foundational Infrastructure: Build and scale core infrastructure services that manage critical resources within our cloud platform. This involves designing, developing, and deploying robust and reliable systems from the ground up
  • Scalable Design: Design highly scalable, durable, and reliable platform services that prioritize ease of use
  • Cross Functional Collaboration: Lead projects that require collaborating with engineering, cloud support, site reliability, and product teams to assess tools, frameworks, and solutions that align with both customer and operational needs
  • Innovation: Implement features that differentiate Crusoe Cloud, focusing on operational efficiency, low-touch adoption, turn-key AI services, and scalability
What we offer:
  • Industry competitive pay
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
Employment Type:
Fulltime