Under limited supervision, the Site Reliability Engineer III is responsible for improving system reliability and resilience. This role focuses on building automation to reduce manual effort and prevent service-impacting incidents. The SRE combines software and systems engineering to build and support large-scale, distributed, fault-tolerant systems. This role ensures that critical platforms are available, reliable, and able to support a fast rate of improvement. The SRE relies on monitoring platforms and continually takes a holistic view of system health and performance. The SRE will enhance and support cloud-based transformations, with a focus on pushing capabilities forward, staying ahead of customer needs, and innovating for continuous improvement. The SRE provides operational support and engineering for multiple large-scale distributed software applications.
Job Responsibilities:
Gathers and analyzes metrics from monitoring platforms to assist in performance tuning and fault tolerance
Partners with development teams to improve services through testing and release procedures
Participates in system design, platform management and capacity planning
Balances feature development speed and reliability with service-level objectives
Works closely with the incident response team to restore service to normal operation
Applies debugging and troubleshooting skills
Investigates, blocks and rate-limits unwanted traffic
Utilizes monitoring systems and dashboards for proactive changes and alerting
Establishes continuous process improvement cycles where the process, performance, and supporting technologies are reviewed and enhanced where applicable
Performs other duties as assigned.
Requirements:
Typically requires a bachelor's degree and five (5) or more years of related experience, or an equivalent combination of education and experience
Understanding of Kubernetes, containers, clusters, and elastic scalability
Expertise in SRE principles
Mindset of continually finding ways to drive scalability, stability, and performance
Cloud Services experience with Google Cloud Platform (GCP)
Experience with API, service-based or microservice-based architecture
Proficiency in infrastructure, network, database, operating systems, or security troubleshooting and remediation
Architecture-level knowledge of Windows, Linux, and infrastructure systems
Experience with production deployment, monitoring, and operational support for enterprise-class applications (Dynatrace a plus)
Experience working with Continuous Integration/Continuous Deployment tools
Experience in performance diagnostics, capacity planning, performance architecture design, performance tuning, and performance monitoring
A strong mix of software engineering and operational support skills
Knowledge of web technologies – HTTP, proxies, Java, etc.
Experience with Azure DevOps (ADO), Dynatrace, Prometheus, Terraform and Grafana
You must be eligible to work in the US without Visa Sponsorship.
What we offer:
Options for healthcare coverage, 401(k), tuition reimbursement, vacation, sick, and holiday pay.