The Sr. Site Reliability Engineer will architect, develop, and maintain cloud environments in both commercial and government clouds. The role works closely with software engineers, architects, and DevOps engineers to architect and maintain a secure, resilient, and high-performance cloud infrastructure.
Job Responsibilities:
Build, maintain, and operate IaaS and PaaS infrastructure in Azure commercial and government clouds
Work closely with dev teams to identify and measure SLOs, SLAs and SLIs
Act as a strong contributor to the development of platform services, including architecture, provisioning, configuration, deployment, and support
Perform integrations with central logging, metrics dashboards, instrumentation, incident monitoring and management
Build/integrate/administer systems and tools that enable engineering teams to observe their applications in production with autonomy (Dashboards, APMs)
Support software and/or cloud infrastructure as part of an on-call rotation
Assist with identifying and remediating technical problems at the root cause by continuously implementing automation, self-healing, and real-time monitoring for production systems
Maintain and improve operational tooling and frameworks, and build frameworks that test the performance and resiliency of our platform services and tools
Automate alerts for metrics on performance, cost, vulnerabilities, risk, and compliance violations
Improve processes and champion automation of any manual items around support
Requirements:
4+ years of experience working in an SRE/cloud platform role
Experience leveraging AI tools in the software development (or product) lifecycle in order to improve quality and efficiency
Expert knowledge of a cloud service provider
Expert knowledge of and hands-on production experience with Kubernetes (bare metal or managed) cluster setup and management required
Experience with infrastructure as code (IaC) tools such as Terraform or Pulumi
Experience with Kubernetes deployment tools such as Helm, ArgoCD, or Flux
Strong awareness of networking and internet protocols
Understanding of identity and access management (IAM)
Experience supporting infrastructure in production cloud environments
Knowledge of encryption and public key infrastructure (PKI), and an understanding of OWASP
Experience working with RESTful services
Some experience with monitoring tools (Azure Monitor, Splunk, Dynatrace, Grafana, Prometheus)
Familiarity with IDEs and source control tools such as Visual Studio Code and Git
Nice to have:
Bachelor’s Degree in Computer Science, Information Technology, Software Engineering, Math, or Physics
Master’s Degree with coursework focused on advanced algorithms, mathematics in computing, data structures, or a related field
Expert knowledge of Azure
Demonstrated passion for infrastructure automation
Ability to prioritize work in a fast-paced environment