As an Engineer in Site Reliability Engineering (SRE) for AI Systems, you will help ensure the reliability, scalability, and performance of AI platforms. This role includes participating in on-call rotations, improving system observability, and supporting operations across cloud-native infrastructure. This is a hands-on role, ideal for someone with foundational SRE skills and the growth mindset to expand into GenAI and LLM infrastructure operations.
Job Responsibilities:
Participate in on-call rotations to support AI platforms and respond to production incidents with urgency and precision
Monitor system health and performance using tools like Grafana, Splunk, and PowerBI
Support cloud-native infrastructure deployments, with a focus on Azure (primary), and exposure to AWS or GCP
Implement runbooks and automate repetitive operational tasks to reduce toil
Support CI/CD pipelines and IaC deployments using GitLab pipelines and Databricks
Assist in the development and enforcement of Service Level Objectives (SLOs) and real-time alerts for AI APIs and services
Collaborate with senior engineers to improve platform reliability and scale LLM-based applications
Requirements:
Bachelor's degree in Computer Science, Engineering, or a related field
2–4 years of experience in DevOps, SRE, or cloud platform engineering
Hands-on experience with monitoring/logging systems such as Prometheus, Grafana, Splunk, or OpenSearch
Familiarity with cloud environments (preferably Azure; AWS/GCP a plus)
Experience in scripting or automation using Python, Bash, or PowerShell
Basic understanding of containerization (Docker, Kubernetes) and CI/CD concepts
Willingness to participate in an on-call schedule and incident resolution
Strong problem-solving and root cause analysis skills
Communication
Customer Service
Analytics
Technical Writing
At least 18 years of age
Legally authorized to work in the United States
Nice to have:
Exposure to AI/ML infrastructure or LLM-based systems (e.g., OpenAI, ChatGPT, Azure OpenAI)
Experience with infrastructure-as-code tools like Terraform or ARM templates
Familiarity with LLM observability or API token usage metrics
Passion for learning AI reliability practices and collaborating with cross-functional teams