We are looking for a talented Site Reliability Engineer (SRE) with a strong background in Google Cloud Platform (GCP) and Kubernetes. The ideal candidate will be responsible for ensuring the reliability, performance, and scalability of our on-premises and cloud-based systems, with a focus on reducing Google Cloud costs.
Job Responsibilities:
Ensure the reliability and uptime of critical services and infrastructure
Design, implement, and manage cloud infrastructure using Google Cloud services
Develop and maintain automation scripts and tools to improve system efficiency and reduce manual intervention
Implement monitoring solutions and respond to incidents to minimize downtime and ensure quick recovery
Work closely with development and operations teams to improve system reliability and performance
Conduct capacity planning and performance tuning to ensure systems can handle future growth
Create and maintain comprehensive documentation for system configurations, processes, and procedures
Requirements:
Bachelor's degree in Computer Science, Engineering, or a related field
4+ years of experience in site reliability engineering or a similar role
Proficiency in Google Cloud services (Compute Engine, Kubernetes Engine, Cloud Storage, BigQuery, Pub/Sub, etc.)
Familiarity with Google BI and AI/ML tools (Looker, BigQuery ML, Vertex AI, etc.)
Experience with automation tools (Terraform, Ansible, Puppet)
Familiarity with CI/CD pipelines and tools (Azure Pipelines, Jenkins, GitLab CI, etc.)
Strong scripting skills (Python, Bash, etc.)
Knowledge of networking concepts and protocols
Service mesh experience a plus
Experience with monitoring tools (Prometheus, Grafana, etc.)