Director, Site Reliability Engineering

Aiven Deutschland GmbH

Location:
Finland, Helsinki

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We are seeking a Director of Site Reliability Engineering to lead a global organization responsible for the reliability and operational excellence of the Aiven platform. You will lead a high-performing SRE team, setting the vision and strategy to ensure resilient, scalable, and highly automated systems across our 24/7/365 operations. Your team will proactively manage platform health, lead incident response and cross-functional coordination, and drive continuous improvement in reliability and performance. As a senior leader, you will partner closely with engineering, product, and support teams worldwide, influence system architecture, and invest in tooling and automation to reduce toil and enhance production reliability. This role combines strategic leadership, customer centricity, and deep operational accountability, with a focus on delivering reliable services at global scale while developing strong technical leaders within your organization.

Job Responsibility:

  • Define and drive global SRE operating strategy in partnership with regional SRE leaders across EMEA, AMER and APAC, ensuring alignment on reliability goals, operating models, and execution across a 24/7/365 follow-the-sun organization
  • Build and lead a multi-regional SRE organization through managers, developing leadership capability, mentoring teams, and ensuring consistent performance, culture, and delivery across geographies
  • Set the vision and roadmap for reliability engineering, enabling teams to deliver high-impact tools, automation, and process initiatives that improve platform resilience, scalability, and efficiency
  • Own global incident management strategy and operating model, including on-call design, coverage, and escalation frameworks, ensuring seamless coordination and high availability across regions
  • Establish a metrics-driven operating cadence, defining KPIs, SLIs, SLOs, and error budgets, driving data-informed prioritization, and embedding operational rigor and continuous improvement across the SRE organization

Requirements:

  • Proven experience leading and scaling global SRE or infrastructure organizations through managers, ideally across multiple regions and time zones
  • Strong track record of defining and executing reliability strategy at scale, including ownership of SLIs/SLOs, incident management frameworks, and operational excellence programs
  • Demonstrated ability to build, develop, and mentor senior leaders, creating high-performing, inclusive teams and strong leadership pipelines
  • Experience operating in a 24/7/365 production environment, with deep understanding of follow-the-sun models, on-call design, and large-scale incident response
  • Ability to partner cross-functionally at the executive level (Engineering, Product, Support) to influence architecture, prioritization, and long-term platform investments
  • Strong data-driven leadership approach, with experience defining SLIs/SLOs and using metrics to drive prioritization, accountability, and continuous improvement
  • Solid technical foundation in distributed systems, cloud infrastructure, and automation, with the ability to engage credibly with senior engineers and influence technical direction
  • Experience driving large-scale change and organizational design, including scaling teams, evolving operating models, and improving efficiency and reliability at the company level

What we offer:

  • Participate in Aiven’s equity plan
  • Balance work and life with our hybrid work policy
  • Choose the equipment you need to set yourself up for success
  • Use your Professional Development Plan budget for learning opportunities
  • Receive holistic wellbeing support through our global Employee Assistance Program
  • Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
  • Enjoy country-specific benefits for our global cast

Additional Information:

Job Posted:
April 24, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Director, Site Reliability Engineering

Director SRE & Operations

Director SRE & Operations for E-business / Digital at PUMA in Herzogenaurach, Ge...
Location:
Germany, Herzogenaurach
Salary:
Not provided
Puma Group
Expiration Date:
Until further notice
Requirements:
  • 10–15 years of experience in technology operations, site reliability engineering, or platform engineering within large-scale digital or eCommerce environments
  • Proven track record owning platform reliability, availability, and operational performance for consumer-facing systems
  • Strong experience with cloud infrastructure, incident management, observability, and operational readiness in high-traffic, peak-driven environments
  • Demonstrated ability to embed SRE practices (SLOs, SLIs, incident response, automation) across engineering teams
  • Experienced leader of global operations or SRE teams, comfortable working in on-call and 24/7 operational models
  • Calm, decisive leader with a strong focus on stability, resilience, and continuous operational improvement
Job Responsibility:
  • Leadership: Responsible for all aspects of the performance management and professional development of the team, including recruitment, development plans, providing constructive feedback, appraisals and exit processes
  • Foster a positive and inclusive team culture by actively engaging team members, promoting open communication, and implementing initiatives that enhance employee satisfaction and well-being
  • Compliance with and implementation of legal and operational requirements regarding occupational health and safety within your own area of responsibility
  • Global Site Reliability & Operations Strategy: Define and execute a global Site Reliability Engineering (SRE) and Technology Operations strategy aligned with PUMA’s D2C growth, peak trading demands, and omnichannel ambitions
  • Establish reliability, availability, performance, and scalability targets across all D2C platforms (eCommerce, in-store integrations, APIs, data platforms)
  • Own the end-to-end operational health of consumer-facing and business-critical platforms
  • Platform Reliability, Resilience & Performance: Drive a reliability-first mindset across engineering, embedding SRE principles such as SLIs, SLOs, SLAs, error budgets, and resilience-by-design
  • Ensure platforms are engineered to handle peak events (campaigns, drops, seasonal peaks) with minimal risk and rapid recovery
  • Lead incident management, major incident response, root cause analysis, and post-incident reviews with a strong focus on learning and prevention
  • Continuously improve platform observability, monitoring, alerting, and performance management
Employment Type:
Full-time

Director of Engineering & Reliability

Crusoe is expanding our hyperscale AI and high-performance computing (HPC) data ...
Location:
United States, San Francisco
Salary:
216000.00 - 260000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 10+ years of engineering experience in mission-critical facilities or hyperscale data centers
  • Strong technical expertise in mechanical and electrical systems (MV distribution, UPS, generators, cooling plants, CRAC/CRAH, liquid cooling)
  • Experience implementing RCM, FMEA, RCA, and reliability engineering programs
  • Ability to govern engineering standards across multi-site portfolios
  • Strong analytical, modeling, and systems-thinking capabilities
Job Responsibility:
  • Build and govern Crusoe’s enterprise engineering design standards for mechanical, electrical, and critical infrastructure systems
  • Lead reliability engineering programs including FMEA, RCM, RCA, uptime strategy, and risk modeling
  • Develop asset lifecycle strategies, predictive maintenance programs, and long-term capital planning
  • Model power, cooling, airflow, and liquid-loop performance to optimize system capacity and readiness
  • Serve as L3 escalation for complex MEP issues and major incidents
  • Lead technical audits, quality assurance programs, and engineering evaluations across all campuses
  • Partner with Construction, Commissioning, and Operations to enable scalable, high-density AI workloads
  • Build and lead a team of MEP and reliability engineers
What we offer:
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment Type:
Full-time

Director, Equipment Reliability Center of Excellence

The Reliability Manager is responsible for developing and implementing reliabili...
Location:
United States, Mapleton
Salary:
119900.00 - 199800.00 USD / Year
Evonik Industries
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Mechanical/Electrical or Chemical Engineering with strong maintenance reliability experience
  • 5-10 years in manufacturing with leadership and general industrial management experience
  • Strong background and broad-based experience in the complex field of maintenance or reliability engineering
  • A thorough knowledge of technical codes, standards and regulations is required
  • This is a self-motivated position that requires excellent leadership, analytical, written and verbal communication skills
  • Responsiveness and professionalism are critical as this position communicates with Site Manager and Engineering and Maintenance Manager frequently
  • Must have the ability to effectively collaborate with senior management, both locally and globally, and positively add value to short-term and long-term strategic planning
  • Must be analytical and have the ability to problem solve in a concise and logical manner
  • Ability to communicate effectively, both verbally and in writing, and manage expectations to create trust and credibility across a broad spectrum of the company
  • Ability to effectively articulate and explain market trends internally and externally
Job Responsibility:
  • Develop and implement reliability and asset management strategies to improve equipment performance, optimize asset lifecycle, and reduce failure rates for the Mapleton site
  • Lead and mentor plant engineers and reliability engineers, providing guidance on best practices and methodologies
  • Drive continuous improvement initiatives using reliability-centered maintenance and other methodologies
  • Collaborate with cross-functional teams (maintenance, operations, engineering, safety, etc.) to ensure alignment on reliability goals and improve asset utilization and performance
  • Evaluate and prioritize asset investments based on risk, performance, and business impact, ensuring alignment with organizational objectives
  • Monitor important reliability trends and technical developments for development of new applications
  • Communicate and liaise with key Evonik contact personnel in the Care Solutions business line, as well as Technical Services and Technology and Engineering Americas, to solve reliability issues at Mapleton
What we offer:
  • Medical, dental, and vision benefits
  • Paid time off plan
  • 401(k) savings plans
  • Health Savings Account (HSA)
  • Flexible Spending Accounts (FSAs)
  • Employee Assistance Program
  • Voluntary Benefits and Employee Discounts
  • Disability benefits
  • Life Insurance
  • Parental leave
Employment Type:
Full-time

Director, Site Reliability Engineering

The Director of Site Reliability Engineering (SRE) will provide strategic leader...
Location:
United States, Mountain View
Salary:
315000.00 - 385000.00 USD / Year
EarnIn
Expiration Date:
Until further notice
Requirements:
  • BS, MS, or PhD degree in Computer Science, Engineering, or related field, or related experience
  • 7+ years of experience in the field, including 3+ years leading SRE teams or a team in a similar role
  • Strong experience with container orchestration (Kubernetes), infrastructure as code (Terraform), and CI/CD pipelines
  • Hands-on experience with observability platforms (e.g., Datadog, Prometheus, Grafana) and incident management tools (e.g., incident.io, PagerDuty)
  • Proficiency in at least one programming language (Python, Go, or Java) with the ability to review code and guide system design decisions
  • Proven experience in architecting and managing highly available, scalable, and fault-tolerant systems
  • Ability to define a clear reliability vision and inspire teams and stakeholders toward long‑term reliability goals
  • Demonstrated sound judgment and calm decision‑making under pressure, particularly during high‑severity incidents
  • Strong people leadership skills, with experience coaching and mentoring engineering talent, developing future leaders, and aligning peer engineering managers and leaders on reliability best practices
  • Strategic planning skills with a track record of aligning technical direction with organizational objectives
Job Responsibility:
  • Drive organizational transformation toward SRE principles and own the strategic direction for reliability maturity, cultivating a culture centered on reliability, efficiency, and continuous improvement
  • Develop and oversee automation strategies, tools, and frameworks that improve system reliability, reduce operational toil, and enhance team productivity
  • Architect and evolve robust observability, monitoring, and alerting systems
  • Champion chaos engineering and resilience testing practices to proactively validate system behavior under failure conditions
  • Partner with engineering, product, and operations teams to embed SRE practices throughout the development lifecycle and influence architectural decisions for reliability
  • Build, mentor, and develop a high‑performing global SRE organization, fostering technical excellence, career growth, and a strong culture of knowledge sharing
  • Oversee capacity planning, scalability assessments, and future‑state demand forecasting across critical systems
  • Lead and govern high‑severity incident response practices—ensuring rapid triage, thorough root cause analysis, and follow‑through on corrective and preventative actions
What we offer:
  • Equity and benefits
Employment Type:
Full-time

Director of Engineering, Cloud Availability

As the Director of Engineering, Cloud Availability, you will lead our engineerin...
Location:
Ireland, Dublin
Salary:
Not provided
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 10+ years of engineering leadership experience with a proven track record of managing high-performing technical teams
  • Deep technical knowledge of public cloud infrastructure and experience building or operating large-scale platforms (Public, Private, or Hybrid)
  • Expert-level understanding of availability, observability, SLIs/SLOs, and modern incident management frameworks
  • Proven ability to lead remote teams and successfully collaborate with US-based engineering organizations
  • Demonstrated success navigating and leading within a matrix organizational structure
  • Strong familiarity with virtual and managed Kubernetes platforms, such as EKS, GKE, or AKS
  • The ability to balance long-term organizational strategy with the immediate tactical needs of a fast-growing engineering site
Job Responsibility:
  • Organizational Leadership: Partner closely with Data Center, Network, and SRE teams to build and scale a world-class engineering organization in Dublin
  • Site Leadership & Culture: Serve as the primary point of contact and face of Crusoe leadership in Dublin, proactively managing office sentiment and ensuring the team remains focused on high-impact objectives
  • Global Strategic Alignment: Build high-trust partnerships with US-based leadership to ensure local priorities are perfectly synchronized with the global business roadmap
  • Operational Excellence: Implement and refine "follow-the-sun" protocols to enable smooth hand-offs between time zones, ensuring zero customer disruption and 24/7 reliability
  • Unified Team Vision: Foster a "one-team" mindset across geographic boundaries, breaking down silos and promoting deep collaboration between Dublin and US offices
  • Talent Development: Level up the Dublin engineering team by identifying individual strengths and establishing a culture of mentorship to grow the next generation of Engineering Leads and ICs
  • Reliability Initiatives: Lead the development of SRE functions for IaaS and managed services, including Inference, SLURM, and automated cluster management
What we offer:
  • pension contributions
  • private health and dental insurance
  • income protection
  • life assurance
Employment Type:
Full-time

Digitalization and Technology Director, Chief Engineer

A skilled Software Engineer who will design, build, and maintain software system...
Location:
China, Shanghai; Beijing; Dalian
Salary:
Not provided
Pfizer
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field with 15-20 years of relevant experience
  • Expert-level skills in Business Immersion, Data Integration, Full-Stack Development, Multi-Audience Communication, Problem Discovery, Rapid Prototyping & Validation, Stakeholder Management, Team Collaboration
  • Practitioner-level skills in AI Evaluation & Verification, AI Literacy, AI-Augmented Development, Architecture & Design, Code Quality & Review, Developer Experience, Knowledge Management, Pattern Generalization, Service Management, Site Reliability Engineering, Technical Writing
  • Working-level skills in Cloud Platforms, Data Modeling, DevOps & CI/CD, Lean Thinking & Flow, Technical Debt Management, Time Management & Deep Work
Job Responsibility:
  • Drive delivery of the most critical technical initiatives
  • Establish engineering delivery practices across the business unit
  • Be the technical authority on high-stakes projects
  • Develop technical leaders
  • Shape engineering talent strategy across the business unit
  • Build high-performing engineering teams
  • Shape technology-driven business strategy
  • Represent technical perspective at executive level
  • Be recognized as a bridge between engineering and business
  • Design AI-augmented engineering workflows for your area
Employment Type:
Full-time