As a Senior DevOps Engineer, you will be a key contributor to our infrastructure strategy, focusing on automation, stability, and performance across both cloud and on-premise environments. You will drive best practices in CI/CD, configuration management, and monitoring, with a specific focus on optimizing the deployment and operation of large language models (LLMs) and related technologies.
Job Responsibilities:
Design, deploy, and manage highly available, scalable infrastructure using Kubernetes and Docker across public cloud (e.g., AWS, GCP, Azure) and on-premise data centers
Develop and maintain robust Infrastructure as Code (IaC) and Configuration Management solutions (e.g., Terraform, Ansible) for consistent environment provisioning and management
Implement and manage CI/CD pipelines to facilitate rapid, reliable, and automated software releases
Administer and troubleshoot operating systems, encompassing both Linux and Windows environments
Implement and optimize observability practices using monitoring tools like Datadog for logging, tracing, and alerting
Spearhead the operational deployment, scaling, and maintenance of LLM infrastructure, leveraging tools like LiteLLM, OpenRouter, or similar LLM orchestration/gateway technologies
Automate repetitive tasks and system operations using scripting languages, primarily Bash and Python
Collaborate closely with development, MLOps, and security teams to ensure infrastructure supports product requirements and compliance standards
Participate in an on-call rotation to ensure service reliability and responsiveness to incidents
Requirements:
5+ years of professional experience in a DevOps, SRE, or infrastructure engineering role
Deep expertise in containerization and orchestration, specifically Kubernetes (design, deployment, and troubleshooting) and Docker
Strong proficiency in managing infrastructure in both Cloud (e.g., AWS, GCP, Azure) and On-Premise environments
Expert-level administration skills in Linux and strong working knowledge of Windows Server environments
Proven experience with Infrastructure as Code (IaC) and Configuration Management tools (e.g., Terraform, Ansible)
High proficiency in scripting and automation using Python and Bash
Extensive experience with monitoring and observability platforms, especially Datadog (or comparable tools like Prometheus/Grafana, New Relic)
Hands-on experience deploying and managing technologies related to Large Language Models (LLMs), for example using LiteLLM or OpenRouter, or setting up and managing LLM serving endpoints
Nice to have:
Experience with specific Kubernetes distributions (e.g., K3s, Rancher, OpenShift)
Familiarity with network configuration, firewalls, and security best practices for hybrid environments
Experience in MLOps workflows and related tools (e.g., MLflow, Kubeflow)
Certifications such as CKA, CKAD, or relevant cloud provider certifications