The Principal Engineer operates as a hands‑on expert, shaping strategy while directly influencing complex systems, mentoring senior engineers, and solving the hardest reliability and performance challenges. The role serves as a technical authority and reliability architect across critical platforms and applications. It drives reliability by design, sets enterprise SRE standards, and partners with engineering, architecture, and operations leadership to embed resilience, observability, and automation at scale.
Job Responsibilities:
Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
Translate advanced technology experience, an in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organizational structure, and strategic technological opportunities and requirements into technical engineering solutions
Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Define and evolve enterprise‑level SRE strategy, standards, and reference architectures
Establish and govern SLIs, SLOs, and error budgets for Tier‑1 and Tier‑2 services; drive adoption across engineering and operations
Architect highly resilient, scalable, and fault‑tolerant systems across cloud and hybrid platforms
Lead deep‑dive resiliency, capacity, and performance reviews for critical services
Design and mature end‑to‑end observability architectures (metrics, logs, traces) aligned to golden signals
Drive OpenTelemetry‑based standardization and telemetry consistency across platforms
Partner with performance engineering to execute load, stress, soak, failover, and chaos testing
Identify systemic performance bottlenecks and lead remediation across applications, middleware, and infrastructure layers
Lead large‑scale toil identification and elimination initiatives across platforms
Design and implement automation‑first reliability solutions, including self‑healing patterns, auto‑remediation, and AI‑assisted operations
Build reusable golden paths, reliability frameworks, and standardized automation patterns
Champion shift‑left reliability, embedding SRE controls into CI/CD pipelines and design reviews
Serve as senior technical authority during major incidents; provide deep technical triage and architectural guidance
Lead blameless postmortems for high‑impact incidents; ensure systemic fixes over tactical remediation
Drive problem management maturity through trend analysis, recurring issue elimination, and proactive risk reduction
Influence change management practices to ensure safe, predictable, and observable releases
Mentor and coach senior engineers, SREs, and platform teams on advanced reliability practices
Define and maintain SRE maturity models, scorecards, and executive‑level reliability metrics
Partner with architecture, security, and product leaders to align reliability with business outcomes
Establish and review runbooks, readiness checklists, dashboards, and reliability reviews for consistency and effectiveness
Requirements:
7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
7+ years of experience designing and operating large‑scale distributed systems
7+ years of hands‑on experience in SRE, Platform Engineering, or DevOps roles
Proven track record of driving enterprise‑scale reliability transformations
Deep expertise in AWS, Azure, or GCP (multi‑cloud experience preferred)
Strong understanding of container platforms (Kubernetes/OpenShift) and cloud‑native architectures
Strong proficiency in Python and Go for automation, tooling, and platform integrations
Infrastructure as Code expertise: Terraform, Ansible/Chef; strong Git/GitOps practices
CI/CD expertise: Azure DevOps, GitHub Actions, Jenkins, GitLab CI
Advanced hands‑on experience with Prometheus, Grafana, OpenTelemetry, and APM tools (AppDynamics, Aternity, SPLOC, ThousandEyes)
Strong knowledge of capacity planning, DR strategies, chaos engineering, canary and blue‑green deployments
Expert understanding of Incident, Problem, and Change Management
Strong experience with on‑call models, runbook automation, and SRE operational best practices
Excellent communication skills with the ability to influence senior engineering and executive stakeholders