
LLM & AI DevOps Engineer

United States, Remote · Job Posted January 29, 2026

Job Description

Join our team as a DevOps Engineer specializing in Artificial Intelligence (AI) and Large Language Model (LLM) infrastructure. You will play a critical role in architecting, deploying, and optimizing scalable AI platforms using modern DevOps practices and state-of-the-art tools.

Job Responsibility

  • Build, automate, and manage CI/CD pipelines for deploying and maintaining AI/LLM workloads
  • Collaborate with AI engineers and data scientists to streamline model deployment, versioning, and monitoring
  • Design and maintain cloud infrastructure using Infrastructure as Code (IaC) platforms such as Terraform and Ansible
  • Orchestrate and manage containerized AI environments using Kubernetes
  • Implement robust monitoring and logging solutions utilizing Grafana and Prometheus
  • Optimize AI model inference and training workloads—especially for NVIDIA GPU-powered environments
  • Apply strict security and compliance standards for all infrastructure components
  • Diagnose and resolve production issues, continuously improving reliability and scalability of AI services

Requirements

  • Proven experience as a DevOps Engineer, preferably supporting AI or machine learning platforms
  • Hands-on expertise with Kubernetes (EKS, AKS, GKE, or on-prem), Docker, Terraform, and Ansible
  • Experience with monitoring/observability tools such as Grafana and Prometheus
  • Familiarity with NVIDIA GPU drivers, CUDA, and hardware provisioning for machine learning tasks
  • Proficiency in at least one scripting language (Python, Bash, etc.)
  • Cloud platform experience (AWS, GCP, Azure); hybrid/on-premise a plus
  • Bachelor’s degree in Computer Science or related field, or equivalent professional experience

Nice to have

  • Previous work with MLOps tools and data pipeline automation is highly desirable

What we offer

  • medical
  • vision
  • dental
  • life and disability insurance
  • 401(k) plan


Similar Jobs for LLM & AI DevOps Engineer

8 matching positions

Sr. Cloud Infrastructure Engineer (AI & LLM Platforms)

We are seeking a specialized Infrastructure Engineer to bridge the gap between o...
Salary: Not provided
Company: Q6 Cyber
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in DevOps, Platform Engineering, or SRE, with at least 1-2 years specifically focused on AI/ML infrastructure
  • Proven track record of building production-grade RAG pipelines or LLM-integrated applications
  • Thrives in 'day zero' environments where the tools and protocols (like MCP) are evolving weekly
  • Deep understanding of the security implications of LLMs (prompt injection, data leakage, and secure tool execution)
  • Experience working with substantial datasets (over 1bn objects, dozens or hundreds of TBs) and the challenges of leveraging AI tools with these data sets
  • Bachelor's degree or equivalent in computer science or related field
  • Cloud & Orchestration: AWS/GCP/Azure, Kubernetes, Terraform, Helm
  • AI Frameworks: LangChain, LlamaIndex, LangGraph
  • Data & Vectors: Pinecone, Milvus, Qdrant, or pgvector
  • Apache Kafka/Pulsar
Job Responsibility
  • Guide the architecture that will allow us to leverage AI tools with our large existing data stores and incoming streams of realtime intelligence
  • Work closely with other infrastructure engineers and software development teams to integrate AI tools into existing systems
  • Design, deploy, and maintain Model Context Protocol (MCP) servers to allow LLMs to securely interact with our internal databases, APIs, and external tooling
  • Build and orchestrate sandboxed, scalable environments (e.g., using Docker or specialized runtimes) where users can safely build and execute AI agents
  • Develop and manage the infrastructure for our internal RAG (Retrieval-Augmented Generation) pipeline, including vector database management (e.g., Pinecone, Weaviate, or pgvector) and automated embedding pipelines
  • Utilize Kubernetes (K8s) and Infrastructure as Code (Terraform/Pulumi) to deploy LLM-related tools, ensuring high availability and low latency for model inference and data retrieval
  • Implement strict guardrails for data privacy within LLM workflows, ensuring internal datasets remain secure while being accessible to authorized AI tools
What we offer
  • A competitive compensation package and comprehensive benefits
  • Full-time

Senior DevOps Engineer, AI

LogicMonitor® is the AI-first hybrid observability platform powering the next ge...
Location: India, Pune
Salary: Not provided
Company: LogicMonitor
Expiration Date: Until further notice
Requirements
  • 4+ years of experience in DevOps or similar roles
  • Proven experience with AWS (preferred) and GCP in production environments
  • Strong expertise in Infrastructure as Code practices
  • Solid knowledge of Kubernetes (EKS), container orchestration, and cluster security
  • Hands-on experience with Grafana, Prometheus, and alerting/monitoring systems
  • Understanding of network connectivity over private link endpoints, VPCs, and cross-account VPC connectivity, including how to make services accessible internally and externally
  • Experience deploying automated canary and integration testing pipelines, CI/CD pipelines, etc.
  • Experience exposing internal self-hosted services such as LangFuse via a web UI for internal users, using Traefik, an Ingress controller, or a similar tool
  • Experience deploying LLM-related solutions that require MCP, LangFuse, Airflow, GraphDB, VectorDB, Redis, etc.
  • Experience working with developers on on-demand JIT access to Prod clusters to troubleshoot and debug issues, using tools such as Teleport or similar
Job Responsibility
  • Multi-Cloud Enablement: Expand and manage application hosting across AWS and Google Cloud, ensuring performance, flexibility, and resilience
  • Infrastructure as Code (IaC): Develop and maintain Terraform or similar installers for Azure and GCP to fully automate infrastructure deployments
  • Cost Optimization: Design and implement AWS cost optimization strategies, including reserved instances, right-sizing, and resource efficiency initiatives
  • Cloud Security: Strengthen infrastructure security with robust access controls, encryption, monitoring, and alerting frameworks
  • Observability: Build and enhance monitoring platforms with Grafana dashboards and Prometheus alerts for real-time performance insights and proactive issue resolution
  • Kubernetes Management: Implement Role-Based Access Control (RBAC) and optimize Ingress controllers (Traefik or similar) for enhanced security and delivery resilience
  • Automation & Scripting: Create Python and Bash scripts to automate repetitive tasks, streamline workflows, and improve operational efficiency

Senior DevOps Engineer (AI & Cloud Infrastructure)

We are seeking a Senior DevOps Engineer to design, deploy, and operate the next ...
Location: United States, Palo Alto
Salary: 175000.00 - 250000.00 USD / Year
Company: Inflection AI
Expiration Date: Until further notice
Requirements
  • 5+ years of hands-on experience in DevOps, Site Reliability Engineering, or ML Infrastructure supporting high-scale, production systems
  • Deep expertise in Azure and AWS, including storage, compute, networking, databases, and cloud-native monitoring services
  • Strong Kubernetes administration experience, including GPU scheduling, operator deployment, and management of core infrastructure components; experience with Slurm is highly desirable
  • Proven experience deploying, scaling, and operating Large Language Models (LLMs) and inference engines such as vLLM, TGI, or Triton
  • Strong experience with modern DevOps tooling: Terraform, Helm, Kustomize, ArgoCD, GitHub Actions or GitLab CI, Prometheus, Grafana, and Clickhouse
  • Advanced scripting and automation skills in Python and Bash, with the ability to debug complex distributed systems and optimize performance at scale
  • Demonstrated ability to troubleshoot LLM servers, Kubernetes workloads, GPU utilization, and cloud infrastructure bottlenecks
  • Bachelor’s degree or equivalent in a field related to the position
Job Responsibility
  • Architect, deploy, and operate large-scale LLM inference servers and AI applications with a focus on low latency, high availability, and production reliability
  • Design, provision, and maintain complex cloud architectures across Azure and AWS, including storage, compute, networking, databases, and native LLM services
  • Manage GPU-enabled Kubernetes clusters and Slurm-based HPC environments, optimizing resource allocation for AI training and inference workloads
  • Deploy and operate core Kubernetes infrastructure components and operators (GPU operators, ingress controllers, service meshes, CNIs, CSIs, and storage drivers)
  • Build scalable infrastructure-as-code and deployment workflows using Terraform, Helm, Kustomize, ArgoCD, and GitOps best practices
  • Design and maintain centralized observability systems using Prometheus, Grafana, Clickhouse, and cloud-native monitoring tools
  • Participate in on-call rotations, lead incident response, perform post-mortems, and continuously improve system reliability and SLAs.
What we offer
  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Meaningful equity component.
  • Full-time

Senior Java/Kotlin Engineer (AI-Driven DevOps & Automation)

We are looking for a Senior Java/Kotlin Engineer who goes beyond traditional dev...
Location: Colombia
Salary: Not provided
Company: Parser Limited
Expiration Date: Until further notice
Requirements
  • Strong experience in Java and/or Kotlin backend development
  • Solid understanding of software design, APIs, and distributed systems
  • Experience with CI/CD pipelines and DevOps practices
  • Hands-on experience with static code analysis tools, dependency management, and security remediation
  • Familiarity with AI-assisted coding tools (e.g., Claude, GitHub Copilot, etc.)
  • Experience working with Git-based workflows and multi-repo environments
Job Responsibility
  • Backend Development: Design, build, and maintain scalable backend services using Java/Kotlin
  • Deliver production-ready features with high quality and performance standards
  • Collaborate with product and engineering teams to translate requirements into technical solutions
  • AI-Driven DevOps & Automation: Use Claude (or similar agentic AI tools) to identify and fix vulnerabilities
  • Automate code improvements across repositories
  • Generate and maintain unit and integration tests using AI from code context and diffs
  • Continuously improve CI/CD workflows using AI-assisted processes
  • AI Readiness & Engineering Enablement: Improve AI readiness of repositories: clean architecture, modular structure, clear interfaces and contracts, type safety and documentation for LLM consumption
  • Build guardrails for AI usage: prompt design and versioning, output validation and consistency checks, safe code generation practices
What we offer
  • The chance to work in innovative projects with leading brands that use the latest technologies that fuel transformation
  • The opportunity to be part of an amazing, multicultural community of tech experts
  • A competitive compensation package and medical insurance
  • A flexible working environment
  • Full-time

DevOps Engineer (Azure | Terraform | Ansible | Agentic AI for Infra/Monitoring/FinOps)

Job Summary: We are looking for a DevOps Engineer with strong hands-on experienc...
Location: India, Bangalore South
Salary: Not provided
Company: Wissen
Expiration Date: Until further notice
Requirements
  • 3–5 years of experience in MLOps / ML Engineering / Cloud Engineering
  • Proficient in designing and deploying end-to-end ML pipelines
  • Terraform for Azure infrastructure automation
  • Python for ML, automation, and GenAI workflows
  • Azure Compute, Storage, Networking, and Identity
  • Running ML & GenAI workloads at scale on Azure
  • Supporting data pipelines for ML and LLM workloads
  • Experience with LangGraph for LLM workflow and agent orchestration
  • Hands-on exposure to Claude models, including skills/plugins integration
  • Understanding of prompt management, agent execution, and orchestration patterns
Job Responsibility
  • Build, deploy, and manage comprehensive MLOps and LLMOps pipelines on Azure
  • Design and oversee CI/CD pipelines for machine learning models and large language model workflows utilizing Harness or Azure DevOps
  • Streamline the promotion of models, prompts, and agent workflows between environments through automation
  • Establish approval gates, implement rollback mechanisms, and facilitate controlled release processes
  • Oversee the lifecycle of ML models and LLM-driven workflows, including their training, assessment, deployment, monitoring, and retraining
  • Administer Azure Machine Learning workspaces, computing resources, environments, model registries, and endpoints
  • Integrate LLM workflows and agent-centric architectures using LangGraph
  • Support the incorporation of Claude-based models, skills, and plugins into enterprise-level applications
  • Operationalize prompt versioning, orchestration strategies, and agent workflows in live production settings
  • Set up and govern Azure ML and Generative AI infrastructure via Terraform as Infrastructure as Code (IaC)
  • Full-time

AI Engineer - Azure & C# .NET

We are seeking a capable and solutions‑focused AI Engineer to join our growing A...
Location: United Kingdom
Salary: 60000.00 GBP / Year
Company: 360 Resourcing Solutions
Expiration Date: Until further notice
Requirements
  • Hands-on experience building solutions with Azure AI Services and integrating them into applications and/or data solutions
  • Working knowledge of Azure OpenAI and common GenAI patterns (prompting, evaluation, basic RAG)
  • Some experience with ML delivery (training, packaging, deployment, monitoring) in a live environment
  • Proficiency in C#/.NET plus SQL fundamentals
  • Understanding of vector search concepts (embeddings, chunking, retrieval) and secure API integration
  • Experience using Git-based workflows and CI/CD pipelines (e.g., GitHub or Azure DevOps)
  • Strong communication and problem-solving skills, including clear technical documentation
Job Responsibility
  • Build and enhance GenAI solutions using Azure OpenAI, Azure AI Services, and Copilot extensibility, following agreed patterns
  • Implement RAG architectures (vector retrieval, embeddings, prompt strategies) with secure LLM integrations
  • Build conversational assistants and workflow automations using Copilot Studio and the Power Platform
  • Contribute to experimentation, prototyping, and PoC development to evaluate AI capabilities and feasibility
  • Support evaluation and integration of third‑party AI tools or APIs where required, working with the team to meet governance and security requirements
  • Develop and operationalise ML models with appropriate review and documentation
  • Translate prototypes into robust services, collaborating with engineering colleagues to meet non-functional requirements
  • Create ML pipelines and automate lifecycle workflows using team tooling and standards
  • Assist with monitoring, optimisation, escalating risks and issues where appropriate
  • Build AI-enabled services and APIs using .NET/C#, Azure Functions, and REST under guidance on patterns and quality
  • Full-time

AI Engineer - Azure & C# .NET or Python

We are seeking a capable and solutions-focused AI Engineer to join our growing A...
Location: United Kingdom
Salary: 60000.00 GBP / Year
Company: 360 Resourcing Solutions
Expiration Date: Until further notice
Requirements
  • Hands-on experience building solutions with Azure AI Services and integrating them into applications and/or data solutions
  • Working knowledge of Azure OpenAI and common GenAI patterns (prompting, evaluation, basic RAG)
  • Some experience with ML delivery (training, packaging, deployment, monitoring) in a live environment
  • Proficiency in either C#/.NET or Python, plus SQL fundamentals
  • Understanding of vector search concepts (embeddings, chunking, retrieval) and secure API integration
  • Experience using Git-based workflows and CI/CD pipelines (e.g., GitHub or Azure DevOps)
  • Strong communication and problem-solving skills, including clear technical documentation
  • Based in the UK with valid right to work
Job Responsibility
  • Build and enhance GenAI solutions using Azure OpenAI, Azure AI Services, and Copilot extensibility
  • Implement RAG architectures (vector retrieval, embeddings, prompt strategies) with secure LLM integrations
  • Build conversational assistants and workflow automations using Copilot Studio and the Power Platform
  • Contribute to experimentation, prototyping, and PoC development to evaluate AI capabilities and feasibility
  • Support evaluation and integration of third-party AI tools or APIs
  • Develop and operationalise ML models with appropriate review and documentation
  • Translate prototypes into robust services, collaborating with engineering colleagues
  • Create ML pipelines and automate lifecycle workflows using team tooling and standards
  • Assist with monitoring, optimisation, escalating risks and issues
  • Build AI-enabled services and APIs using .NET/C#, Python, Azure Functions, and REST
  • Full-time

AWS Agentic Framework Engineer / DevOps (LangGraph Focus)

We are looking for an experienced engineer to build and enhance observability ca...
Salary: Not provided
Company: Intellias
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as a Software / DevOps / Platform Engineer
  • Strong experience with Python (FastAPI, APIs, async workflows)
  • Hands-on experience with LangGraph or similar agent orchestration frameworks
  • Experience working with LLM-based systems (OpenAI, Anthropic, etc.)
  • Strong knowledge of AWS (EKS, Lambda, API Gateway, etc.)
  • Experience with Kubernetes and Terraform
  • Understanding of stateful workflows and distributed systems
  • Experience building and integrating APIs and microservices
  • Familiarity with CI/CD processes and cloud-native development
Job Responsibility
  • Design and implement agent workflows using LangGraph
  • Build stateful, multi-step AI pipelines with complex decision logic
  • Orchestrate interactions between multiple agents and external systems
  • Integrate LLM-based components into production-grade applications
  • Ensure scalability and reliability of agent execution flows
  • Collaborate with platform teams to integrate agent workflows with AWS infrastructure
  • Optimize performance and cost efficiency of agent-based systems
  • Contribute to architecture and best practices for agentic systems