Explore cutting-edge Engineer, SRE GenAI jobs and launch your career at the intersection of artificial intelligence and rock-solid infrastructure. An Engineer in Site Reliability Engineering (SRE) for Generative AI is a specialist who builds and maintains the reliable, scalable, and performant platforms that power next-generation AI applications. The role merges the proactive, automation-first mindset of classic SRE with the unique challenges of operating large language models (LLMs) and generative AI systems, making these professionals the essential bridge between groundbreaking AI research and stable, user-facing production services.

Engineers in this role typically own the entire lifecycle of AI platform reliability. Common duties include designing and implementing robust monitoring, logging, and alerting systems that give deep observability into model performance, latency, and cost; defining and upholding Service Level Objectives (SLOs) and error budgets tailored to AI APIs and inference services; and managing cloud-native, scalable infrastructure, often containerized with Docker and orchestrated with Kubernetes, on major providers such as AWS, GCP, or Azure. Automating operational procedures, from deployments to incident response, is paramount for reducing manual toil and ensuring consistency. These engineers also participate in on-call rotations, lead swift incident response, and conduct thorough post-mortems to prevent recurrence, keeping AI services within stringent uptime and performance expectations.

Succeeding in SRE GenAI jobs requires a specific blend of skills. Foundational knowledge of DevOps and SRE principles is essential, coupled with hands-on experience in infrastructure-as-code (e.g., Terraform), CI/CD pipelines, and scripting languages such as Python or Bash. A working understanding of generative AI and LLM architecture, including model serving, vector databases, and inference optimization, is increasingly critical. Strong analytical and problem-solving skills are needed for debugging complex distributed systems and performing root cause analysis, while cloud platform expertise, proficiency with observability tools (e.g., Prometheus, Grafana), and experience with container technologies are standard requirements. Excellent collaboration and communication skills are vital, as these engineers work closely with AI research scientists, machine learning engineers, and product teams to align infrastructure capabilities with innovative AI ambitions.

For those passionate about ensuring the future of AI is both powerful and dependable, Engineer, SRE GenAI jobs offer a dynamic and impactful career path. The short sketches that follow illustrate, in broad strokes, the kind of day-to-day work these responsibilities can involve.
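
As one illustration of the observability work described above, here is a minimal sketch of how an LLM inference service might expose latency and token-usage (a common cost proxy) metrics using the prometheus_client library. The metric names, labels, and the call_model stand-in are hypothetical, invented purely for this example; a real service would wire these hooks into its actual serving path.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names and labels for an LLM inference service.
REQUESTS = Counter("llm_requests_total", "Total inference requests", ["model", "status"])
LATENCY = Histogram(
    "llm_request_latency_seconds",
    "End-to-end inference latency",
    ["model"],
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)
TOKENS = Counter("llm_tokens_total", "Tokens processed, as a cost proxy", ["model", "direction"])


def call_model(prompt: str) -> dict:
    """Stand-in for a real model-serving call; returns fake token counts."""
    time.sleep(random.uniform(0.05, 0.5))
    return {"prompt_tokens": len(prompt.split()), "completion_tokens": 42}


def handle_request(prompt: str, model: str = "demo-llm") -> dict:
    start = time.perf_counter()
    try:
        result = call_model(prompt)
        REQUESTS.labels(model=model, status="ok").inc()
        TOKENS.labels(model=model, direction="input").inc(result["prompt_tokens"])
        TOKENS.labels(model=model, direction="output").inc(result["completion_tokens"])
        return result
    except Exception:
        REQUESTS.labels(model=model, status="error").inc()
        raise
    finally:
        LATENCY.labels(model=model).observe(time.perf_counter() - start)


if __name__ == "__main__":
    start_http_server(9100)  # metrics become scrapeable at :9100/metrics
    while True:
        handle_request("summarize this document for me")
```

Metrics like these are what Grafana dashboards and alerting rules are typically built on top of.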
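
The SLO and error-budget work mentioned above often starts with simple arithmetic: given an availability target and counts of failed versus total requests over a window, how much budget has been consumed? The sketch below uses illustrative numbers only; in practice these figures would come from a metrics backend such as Prometheus.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int,
                        window_days: int = 28) -> dict:
    """Summarize error-budget consumption for a request-based availability SLO.

    slo_target: e.g. 0.995 for a 99.5% availability objective.
    """
    allowed_failures = total_requests * (1.0 - slo_target)
    budget_consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    observed_availability = 1.0 - failed_requests / total_requests
    return {
        "window_days": window_days,
        "observed_availability": observed_availability,
        "allowed_failures": allowed_failures,
        "budget_consumed_pct": 100.0 * budget_consumed,
        "budget_remaining_pct": 100.0 * (1.0 - budget_consumed),
    }


if __name__ == "__main__":
    # Illustrative numbers: 4.2M inference calls in 28 days, 12,600 failures,
    # measured against a 99.5% availability SLO (60% of the budget consumed).
    report = error_budget_report(slo_target=0.995, total_requests=4_200_000,
                                 failed_requests=12_600)
    for key, value in report.items():
        print(f"{key}: {value:,.4f}" if isinstance(value, float) else f"{key}: {value}")
```

A burn-rate alert is usually the next step: page when the budget is being consumed faster than the window allows.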
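
Much of the toil reduction described above comes from small automations around the serving stack. As a hedged sketch, the following uses the official kubernetes Python client to report deployments whose ready replica count has fallen below the desired count; the "genai-serving" namespace is an assumption made for this example, and a real team might express the same check as an alerting rule instead.

```python
from kubernetes import client, config


def report_unhealthy_deployments(namespace: str = "genai-serving") -> list[str]:
    """List deployments whose ready replicas lag behind the desired count.

    The namespace name is hypothetical; adjust it to the cluster's layout.
    """
    # Use in-cluster credentials when running as a pod, kubeconfig otherwise.
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()

    apps = client.AppsV1Api()
    unhealthy = []
    for deploy in apps.list_namespaced_deployment(namespace=namespace).items:
        desired = deploy.spec.replicas or 0
        ready = deploy.status.ready_replicas or 0
        if ready < desired:
            unhealthy.append(f"{deploy.metadata.name}: {ready}/{desired} replicas ready")
    return unhealthy


if __name__ == "__main__":
    for line in report_unhealthy_deployments():
        print(line)
```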