Director, Site Reliability Engineering

EarnIn

Location:
United States, Mountain View

Contract Type:
Not provided

Salary:

315000.00 - 385000.00 USD / Year

Job Description:

The Director of Site Reliability Engineering (SRE) will provide strategic leadership and technical direction for the reliability, scalability, and performance of our mission‑critical systems and services. This role combines deep SRE expertise with strong engineering leadership to drive organizational transformation toward reliability-first principles. The ideal candidate brings a strong software engineering foundation, a passion for automation, and a proven ability to develop and lead high‑performing teams. The Director will partner with engineering, product, operations, and business stakeholders to design, deliver, and operate resilient, high‑availability systems that support our customers and business objectives at scale.

Job Responsibility:

  • Drive organizational transformation toward SRE principles and own the strategic direction for reliability maturity, cultivating a culture centered on reliability, efficiency, and continuous improvement
  • Develop and oversee automation strategies, tools, and frameworks that improve system reliability, reduce operational toil, and enhance team productivity
  • Architect and evolve robust observability, monitoring, and alerting systems, and champion chaos engineering and resilience testing practices to proactively validate system behavior under failure conditions
  • Partner with engineering, product, and operations teams to embed SRE practices throughout the development lifecycle and influence architectural decisions for reliability
  • Build, mentor, and develop a high‑performing global SRE organization, fostering technical excellence, career growth, and a strong culture of knowledge sharing
  • Oversee capacity planning, scalability assessments, and future‑state demand forecasting across critical systems
  • Lead and govern high‑severity incident response practices—ensuring rapid triage, thorough root cause analysis, and follow‑through on corrective and preventative actions

Requirements:

  • BS, MS, or PhD degree in Computer Science, Engineering, or related field, or related experience
  • 7+ years of experience in the field, including 3+ years leading SRE teams or a team in a similar role
  • Strong experience with container orchestration (Kubernetes), infrastructure as code (Terraform), and CI/CD pipelines
  • Hands-on experience with observability platforms (e.g., Datadog, Prometheus, Grafana) and incident management tools (e.g., incident.io, PagerDuty)
  • Proficiency in at least one programming language (Python, Go, or Java) with the ability to review code and guide system design decisions
  • Proven experience in architecting and managing highly available, scalable, and fault-tolerant systems
  • Ability to define a clear reliability vision and inspire teams and stakeholders toward long‑term reliability goals
  • Demonstrated sound judgment and calm decision‑making under pressure, particularly during high‑severity incidents
  • Strong people leadership skills, with experience coaching and mentoring engineering talent, developing future leaders, and aligning peer engineering managers and leaders on reliability best practices
  • Strategic planning skills with a track record of aligning technical direction with organizational objectives
  • Excellent communication skills, with the ability to translate complex technical issues into clear, actionable insights for executive and non‑technical audiences
  • Highly collaborative, with the ability to work effectively across engineering, product, operations, and business functions and leaders

What we offer:

Equity and benefits

Additional Information:

Job Posted:
February 17, 2026

Employment Type:
Full-time

Work Type:
Hybrid work

Similar Jobs for Director, Site Reliability Engineering

Director SRE & Operations

Director SRE & Operations for E-business / Digital at PUMA in Herzogenaurach, Ge...
Location:
Germany, Herzogenaurach
Salary:
Not provided
Puma Group
Expiration Date:
Until further notice
Requirements:
  • 10–15 years of experience in technology operations, site reliability engineering, or platform engineering within large-scale digital or eCommerce environments
  • Proven track record owning platform reliability, availability, and operational performance for consumer-facing systems
  • Strong experience with cloud infrastructure, incident management, observability, and operational readiness in high-traffic, peak-driven environments
  • Demonstrated ability to embed SRE practices (SLOs, SLIs, incident response, automation) across engineering teams
  • Experienced leader of global operations or SRE teams, comfortable working in on-call and 24/7 operational models
  • Calm, decisive leader with a strong focus on stability, resilience, and continuous operational improvement
Job Responsibility:
  • Leadership: Responsible for all aspects of the performance management and professional development of the team, including recruitment, development plans, providing constructive feedback, appraisals and exit processes
  • Foster a positive and inclusive team culture by actively engaging team members, promoting open communication, and implementing initiatives that enhance employee satisfaction and well-being
  • Compliance with and implementation of legal and operational requirements regarding occupational health and safety within your own area of responsibility
  • Global Site Reliability & Operations Strategy: Define and execute a global Site Reliability Engineering (SRE) and Technology Operations strategy aligned with PUMA’s D2C growth, peak trading demands, and omnichannel ambitions
  • Establish reliability, availability, performance, and scalability targets across all D2C platforms (eCommerce, in-store integrations, APIs, data platforms)
  • Own the end-to-end operational health of consumer-facing and business-critical platforms
  • Platform Reliability, Resilience & Performance: Drive a reliability-first mindset across engineering, embedding SRE principles such as SLIs, SLOs, SLAs, error budgets, and resilience-by-design
  • Ensure platforms are engineered to handle peak events (campaigns, drops, seasonal peaks) with minimal risk and rapid recovery
  • Lead incident management, major incident response, root cause analysis, and post-incident reviews with a strong focus on learning and prevention
  • Continuously improve platform observability, monitoring, alerting, and performance management
Employment Type:
Full-time

Director of Engineering & Reliability

Crusoe is expanding our hyperscale AI and high-performance computing (HPC) data ...
Location:
United States, San Francisco
Salary:
216000.00 - 260000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 10+ years of engineering experience in mission-critical facilities or hyperscale data centers
  • Strong technical expertise in mechanical and electrical systems (MV distribution, UPS, generators, cooling plants, CRAC/CRAH, liquid cooling)
  • Experience implementing RCM, FMEA, RCA, and reliability engineering programs
  • Ability to govern engineering standards across multi-site portfolios
  • Strong analytical, modeling, and systems-thinking capabilities
Job Responsibility:
  • Build and govern Crusoe’s enterprise engineering design standards for mechanical, electrical, and critical infrastructure systems
  • Lead reliability engineering programs including FMEA, RCM, RCA, uptime strategy, and risk modeling
  • Develop asset lifecycle strategies, predictive maintenance programs, and long-term capital planning
  • Model power, cooling, airflow, and liquid-loop performance to optimize system capacity and readiness
  • Serve as L3 escalation for complex MEP issues and major incidents
  • Lead technical audits, quality assurance programs, and engineering evaluations across all campuses
  • Partner with Construction, Commissioning, and Operations to enable scalable, high-density AI workloads
  • Build and lead a team of MEP and reliability engineers
What we offer:
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment Type:
Full-time

Director, Equipment Reliability Center of Excellence

The Reliability Manager is responsible for developing and implementing reliabili...
Location:
United States, Mapleton
Salary:
119900.00 - 199800.00 USD / Year
Evonik Industries
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Mechanical/Electrical or Chemical Engineering with strong maintenance reliability experience
  • 5-10 years in manufacturing with leadership and general industrial management experience
  • Strong background and broad-based experience in the complex field of maintenance or reliability engineering
  • A thorough knowledge of technical codes, standards and regulations is required
  • This is a self-motivated position that requires excellent leadership, analytical, written and verbal communication skills
  • Responsiveness and professionalism are critical as this position communicates with Site Manager and Engineering and Maintenance Manager frequently
  • Must have the ability to effectively collaborate with senior management, both locally and globally, and positively add value to short term and long-term strategic planning
  • Must be analytical and have the ability to problem solve in a concise and logical manner
  • Ability to communicate effectively, both verbally and in writing, and manage expectations to create trust and credibility across a broad spectrum of the company
  • Ability to effectively articulate and explain market trends internally and externally
Job Responsibility:
  • Develop and implement reliability strategies and asset management strategies to improve equipment performance, optimize asset lifecycle and reduce failure rates for Mapleton Site
  • Lead and mentor plant engineers and reliability engineers, providing guidance on best practices and methodologies
  • Drive continuous improvement initiatives using reliability-centered maintenance and other methodologies
  • Collaborate with cross functional team (maintenance, operations, engineering, safety, etc.) to ensure alignment on reliability goals and improve asset utilization and performance
  • Evaluate and prioritize asset investments based on risk, performance, business impact, ensuring alignment with organizational objectives
  • Monitor important reliability trends and technical developments for development of new applications
  • Communicate and liaise with key Evonik contact personnel in the Care Solutions Business line, as well as Technical Services and Technology and Engineering Americas, to solve reliability issues at Mapleton
What we offer:
  • Medical, dental, and vision benefits
  • Paid time off plan
  • 401(k) savings plans
  • Health Savings Account (HSA)
  • Flexible Spending Accounts (FSAs)
  • Employee Assistance Program
  • Voluntary Benefits and Employee Discounts
  • Disability benefits
  • Life Insurance
  • Parental leave
Employment Type:
Full-time

Director, Site Reliability Engineering

As our Director of Infrastructure platform, you will be a key driver of Doctolib...
Location:
France, Paris
Salary:
Not provided
Doctolib
Expiration Date:
Until further notice
Requirements:
  • 12+ years in software engineering, including 6+ years leading large (30+) distributed, international platform or infrastructure teams
  • Proven experience driving platform-as-a-product transformations and modularizing large monolithic architectures at scale
  • Demonstrated ability to architect, deliver, and operate secure, reliable, and scalable developer platforms in SaaS, multi-product, or regulated environments
  • Strong process orientation: experience implementing OKRs, robust monitoring/observability, and best-in-class incident management
  • Measurable impact on developer productivity, platform adoption, reliability, and cost-efficiency
  • Effective communicator and influencer, with the ability to align and inspire cross-functional stakeholders
  • Experience leading change and building high-performing, people-first engineering cultures
  • Fluent in English and comfortable in fast-paced, international environments
Job Responsibility:
  • Lead and scale a high-performing infrastructure organization of 30+ engineers across Infrastructure, Automation, SRE, and Database teams, while maintaining strong engagement and fostering a culture of excellence and ownership
  • Own the infrastructure platform strategy and roadmap that enables Doctolib's modularization journey, delivers on company OKRs, and ensures predictable execution across all infrastructure and automation initiatives
  • Champion platform-as-a-product by building self-service capabilities (infrastructure provisioning, CI/CD, observability, database management) that transform developer experience and unlock team autonomy across the engineering organization
  • Be the guardian of quality and reliability by establishing world-class incident management, driving measurable improvements in availability and performance, and ensuring infrastructure components operate at the highest standards of security and resilience
  • Accelerate engineering velocity by reducing platform friction, enabling faster modularization, and leveraging AI-augmented development tools to multiply productivity across feature teams
  • Drive the infrastructure transformation from monolith-supporting infrastructure to a modular, multi-service platform architecture - enabling international expansion, product velocity, and operational excellence at scale
  • Act as a senior technical leader within the Platform organization and broader Tech leadership team, bringing strong technical opinions and challenging architectural decisions while clearly articulating how infrastructure investments contribute to company strategy and business outcomes
What we offer:
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive additional leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
Employment Type:
Full-time

Director, Site Reliability Engineering

We are seeking a Director of Site Reliability Engineering to lead a global organ...
Location:
Finland, Helsinki
Salary:
Not provided
Aiven Deutschland GmbH
Expiration Date:
Until further notice
Requirements:
  • Proven experience leading and scaling global SRE or infrastructure organizations through managers, ideally across multiple regions and time zones
  • Strong track record of defining and executing reliability strategy at scale, including ownership of SLIs/SLOs, incident management frameworks, and operational excellence programs
  • Demonstrated ability to build, develop, and mentor senior leaders, creating high-performing, inclusive teams and strong leadership pipelines
  • Experience operating in a 24/7/365 production environment, with deep understanding of follow-the-sun models, on-call design, and large-scale incident response
  • Ability to partner cross-functionally at the executive level (Engineering, Product, Support) to influence architecture, prioritization, and long-term platform investments
  • Strong data-driven leadership approach, with experience defining SLI/SLOs and using metrics to drive prioritization, accountability, and continuous improvement
  • Solid technical foundation in distributed systems, cloud infrastructure, and automation, with the ability to engage credibly with senior engineers and influence technical direction
  • Experience driving large-scale change and organizational design, including scaling teams, evolving operating models, and improving efficiency and reliability at company level
Job Responsibility:
  • Define and drive global SRE operating strategy in partnership with regional SRE leaders across EMEA, AMER and APAC, ensuring alignment on reliability goals, operating models, and execution across a 24/7/365 follow-the-sun organization
  • Build and lead a multi-regional SRE organization through managers, developing leadership capability, mentoring the team, and ensuring consistent performance, culture, and delivery across geographies
  • Set the vision and roadmap for reliability engineering, enabling teams to deliver high-impact tools, automation, and process initiatives that improve platform resilience, scalability, and efficiency
  • Own global incident management strategy and operating model, including on-call design, coverage, and escalation frameworks, ensuring seamless coordination and high availability across regions
  • Establish a metrics-driven operating cadence, defining KPIs/SLIs/SLOs/Error Budget, driving data-informed prioritization, and embedding operational rigor and continuous improvement across the SRE organization
What we offer:
  • Participate in Aiven’s equity plan
  • Balance work and life with our hybrid work policy
  • Choose the equipment you need to set yourself up for success
  • Use your Professional Development Plan budget for learning opportunities
  • Receive holistic wellbeing support through our global Employee Assistance Program
  • Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
  • Enjoy country-specific benefits for our global cast
Employment Type:
Full-time

Director of Engineering, Cloud Availability

As the Director of Engineering, Cloud Availability, you will lead our engineerin...
Location:
Ireland, Dublin
Salary:
Not provided
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 10+ years of engineering leadership experience with a proven track record of managing high-performing technical teams
  • Deep technical knowledge of public cloud infrastructure and experience building or operating large-scale platforms (Public, Private, or Hybrid)
  • Expert-level understanding of availability, observability, SLIs/SLOs, and modern incident management frameworks
  • Proven ability to lead remote teams and successfully collaborate with US-based engineering organizations
  • Demonstrated success navigating and leading within a matrix organizational structure
  • Strong familiarity with virtual and managed Kubernetes platforms, such as EKS, GKE, or AKS
  • The ability to balance long-term organizational strategy with the immediate tactical needs of a fast-growing engineering site
Job Responsibility:
  • Organizational Leadership: Partner closely with Data Center, Network, and SRE teams to build and scale a world-class engineering organization in Dublin
  • Site Leadership & Culture: Serve as the primary point of contact and face of Crusoe leadership in Dublin, proactively managing office sentiment and ensuring the team remains focused on high-impact objectives
  • Global Strategic Alignment: Build high-trust partnerships with US-based leadership to ensure local priorities are perfectly synchronized with the global business roadmap
  • Operational Excellence: Implement and refine "follow-the-sun" protocols to enable smooth hand-offs between time zones, ensuring zero customer disruption and 24/7 reliability
  • Unified Team Vision: Foster a "one-team" mindset across geographic boundaries, breaking down silos and promoting deep collaboration between Dublin and US offices
  • Talent Development: Level up the Dublin engineering team by identifying individual strengths and establishing a culture of mentorship to grow the next generation of Engineering Leads and ICs
  • Reliability Initiatives: Lead the development of SRE functions for IaaS and managed services, including Inference, SLURM, and automated cluster management
What we offer:
  • pension contributions
  • private health and dental insurance
  • income protection
  • life assurance
Employment Type:
Full-time
