Senior DevOps Engineer (AI & Cloud Infrastructure)

United States, Palo Alto · 175000.00 - 250000.00 USD / Year · Job Posted January 26, 2026

Job Description

We are seeking a Senior DevOps Engineer to design, deploy, and operate the next generation of Inflection AI’s cloud and AI infrastructure. This role sits at the intersection of AI research and production systems, owning the reliability, scalability, and performance of GPU-enabled platforms that power large-scale LLM training and inference. You will work across Azure and AWS to build highly automated, observable, and resilient infrastructure supporting low-latency AI applications in production.

Job Responsibility

  • Architect, deploy, and operate large-scale LLM inference servers and AI applications with a focus on low latency, high availability, and production reliability
  • Design, provision, and maintain complex cloud architectures across Azure and AWS, including storage, compute, networking, databases, and native LLM services
  • Manage GPU-enabled Kubernetes clusters and Slurm-based HPC environments, optimizing resource allocation for AI training and inference workloads
  • Deploy and operate core Kubernetes infrastructure components and operators (GPU operators, ingress controllers, service meshes, CNIs, CSIs, and storage drivers)
  • Build scalable infrastructure-as-code and deployment workflows using Terraform, Helm, Kustomize, ArgoCD, and GitOps best practices
  • Design and maintain centralized observability systems using Prometheus, Grafana, Clickhouse, and cloud-native monitoring tools
  • Participate in on-call rotations, lead incident response, perform post-mortems, and continuously improve system reliability and SLAs.

Requirements

  • 5+ years of hands-on experience in DevOps, Site Reliability Engineering, or ML Infrastructure supporting high-scale, production systems
  • Deep expertise in Azure and AWS, including storage, compute, networking, databases, and cloud-native monitoring services
  • Strong Kubernetes administration experience, including GPU scheduling, operator deployment, and management of core infrastructure components
  • Experience with Slurm is highly desirable
  • Proven experience deploying, scaling, and operating Large Language Models (LLMs) and inference engines such as vLLM, TGI, or Triton
  • Strong experience with modern DevOps tooling: Terraform, Helm, Kustomize, ArgoCD, GitHub Actions or GitLab CI, Prometheus, Grafana, and Clickhouse
  • Advanced scripting and automation skills in Python and Bash, with the ability to debug complex distributed systems and optimize performance at scale
  • Demonstrated ability to troubleshoot LLM servers, Kubernetes workloads, GPU utilization, and cloud infrastructure bottlenecks
  • Bachelor’s degree or equivalent in a field related to the position

What we offer

  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Meaningful equity component.

Similar Jobs for Senior DevOps Engineer (AI & Cloud Infrastructure)

8 matching positions

Senior DevOps AI Engineer

We are seeking a highly experienced and technically proficient Senior DevOps Eng...
Location: United States, Columbia
Salary: 150000.00 - 250000.00 USD / Year
Company: Synergy ECP
Expiration Date: Until further notice
Requirements
  • B.S. in a relevant technical field with 12 years of experience, or M.S. in a relevant technical field with 10 years of experience
  • Advanced proficiency in DevOps principles and practices
  • Demonstrated expertise in containerization using Docker and Kubernetes
  • Proven experience in architecting and managing CI/CD pipelines
  • Extensive experience with AI model lifecycle management and maintenance
  • Familiarity with cloud platforms (AWS, Microsoft Azure) for infrastructure deployment and management
  • Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack)
  • Excellent communication and interpersonal skills, with the ability to effectively collaborate with cross-functional teams
  • Ability to translate complex technical concepts into actionable engineering solutions
  • TS/SCI with CI Poly
Job Responsibility
  • Design, implement, and maintain robust infrastructure for enterprise AI applications in cloud environments (AWS, Microsoft Azure)
  • Develop and optimize engineering workflows and processes to support AI model development, deployment, and maintenance
  • Architect and manage CI/CD pipelines for continuous integration and continuous delivery of AI models and applications
  • Implement and manage containerization solutions using technologies like Docker and Kubernetes
  • Ensure efficient AI model lifecycle management, including versioning, monitoring, and scaling
  • Collaborate with AI/ML engineers and data scientists to streamline deployment processes and optimize resource utilization
  • Oversee system performance, security, and scalability of AI infrastructure
  • Continuously research and implement new DevOps tools and practices to enhance efficiency
What we offer
  • Highly competitive compensation
  • Comprehensive Health Benefits package
  • 401K Retirement plan
  • People Partners to help navigate personal and professional worlds
  • Wellness resources
  • Company-sponsored continuing education program
  • Generous Paid Time Off
  • 11 paid holidays a year
  • Flexible work options
  • Philanthropy program participation
Employment Type: Full-time

Senior DevOps Engineer, AI

LogicMonitor® is the AI-first hybrid observability platform powering the next ge...
Location: India, Pune
Salary: Not provided
Company: LogicMonitor
Expiration Date: Until further notice
Requirements
  • 4+ years of experience in DevOps or similar roles
  • Proven experience with AWS (preferred), and GCP in production environments
  • Strong expertise in Infrastructure as Code practices
  • Solid knowledge of Kubernetes (EKS), container orchestration, and cluster security
  • Hands-on experience with Grafana, Prometheus, and alerting/monitoring systems
  • Understanding of network connectivity over private link endpoints, VPCs, and cross-account VPC connectivity, including how to make services accessible internally and externally
  • Experience deploying automated canary and integration testing pipelines, CI/CD pipelines, etc.
  • Experience exposing internal self-hosted services like LangFuse via a web UI for internal users, using Traefik, an Ingress controller, or a similar tool
  • Experience deploying LLM-related solutions that require MCP, LangFuse, Airflow, GraphDB, VectorDB, Redis, etc.
  • Experience working with developers on on-demand JIT access to Prod clusters to troubleshoot and debug issues, using tools like Teleport or similar
Job Responsibility
  • Multi-Cloud Enablement: Expand and manage application hosting across AWS and Google Cloud, ensuring performance, flexibility, and resilience
  • Infrastructure as Code (IaC): Develop and maintain Terraform or similar installers for Azure and GCP to fully automate infrastructure deployments
  • Cost Optimization: Design and implement AWS cost optimization strategies, including reserved instances, right-sizing, and resource efficiency initiatives
  • Cloud Security: Strengthen infrastructure security with robust access controls, encryption, monitoring, and alerting frameworks
  • Observability: Build and enhance monitoring platforms with Grafana dashboards and Prometheus alerts for real-time performance insights and proactive issue resolution
  • Kubernetes Management: Implement Role-Based Access Control (RBAC) and optimize Ingress controllers (Traefik or similar) for enhanced security and delivery resilience
  • Automation & Scripting: Create Python and Bash scripts to automate repetitive tasks, streamline workflows, and improve operational efficiency

Senior Devops & AI Engineer

This role presents a unique opportunity to contribute to the future of impactful...
Location: India, Hyderabad
Salary: Not provided
Company: Fission Labs
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or related field
  • 6+ years of experience in infrastructure management roles, with a focus on cloud platforms (Azure and AWS preferred)
  • Hands-on experience with operations (DevSecOps) principles and best practices
  • Proficiency in scripting languages such as Python, PowerShell, or Bash
  • Excellent communication and collaboration skills
  • In-depth knowledge of Linux operating systems, including CentOS, Ubuntu, and Red Hat, with expertise in shell scripting, package management, and system administration
  • Hands-on experience with a wide range of AWS and Azure services
  • Experience developing and maintaining Infrastructure as Code (IAC) templates using tools such as Terraform or AWS CloudFormation
  • Experience setting up cloud infrastructure stacks, databases, and service endpoints, as well as GPU and CPU resource scaling and optimization
  • Prior experience working with AIOps/MLOps
Job Responsibility
  • Configure and optimize Linux-based servers for performance, security, and resource utilization, including kernel tuning, file system management, and network configuration
  • Architect cloud solutions leveraging best practices and services offered by AWS and Azure, optimizing for scalability, reliability, and cost-effectiveness
  • Implement and manage hybrid cloud environments, facilitating seamless integration and interoperability between AWS and Azure services
  • Establish version control practices for IAC templates, ensuring traceability, auditability, and reproducibility of infrastructure changes
What we offer
  • Opportunity to work on impactful technical challenges with global reach
  • Vast opportunities for self-development, including online university access and knowledge sharing opportunities
  • Sponsored Tech Talks & Hackathons to foster innovation and learning
  • Generous benefits packages including health insurance, retirement benefits, flexible work hours, and more
  • Supportive work environment with forums to explore passions beyond work
Employment Type: Full-time

Senior DevOps and Cloud Engineer

We are seeking a highly motivated and self-sufficient Senior DevOps / Cloud Engi...
Location: United States, Scottsdale
Salary: Not provided
Company: gate6
Expiration Date: Until further notice
Requirements
  • 5+ years of hands-on experience with AWS
  • Experience with GCP and/or Azure is advantageous
  • Proven ability to work as a self-driven individual contributor with minimal dependency on others
  • Proficiency with cloud infrastructure setup, CI/CD processes, and DevOps toolchains
  • Strong scripting skills (Shell, YAML, Python, etc.)
  • Solid understanding of both Linux and Windows system environments
  • Experience with containerization, orchestration, and cloud automation tools
  • AWS certification (e.g., Solutions Architect, DevOps Engineer) is strongly preferred
  • Familiarity with AI/GenAI technologies, MLOps concepts, or AI-powered cloud solutions is highly desirable
  • Excellent problem-solving, analytical, and communication skills
Job Responsibility
  • Independently design, implement, and manage cloud infrastructure across AWS and GCP (Azure knowledge is a plus)
  • Build, configure, and maintain secure, scalable cloud resources with minimal external support
  • Set up and manage VPCs, IAM, security groups, and access control
  • Lead and execute migration of applications and workloads to the cloud
  • Establish disaster recovery processes, automate cloud management tasks, and maintain best practices
  • Monitor usage and apply tagging strategies for cost control and visibility
  • Use AWS tools (CloudFormation, CloudWatch, Migration Hub, DMS, AWS Transfer for SFTP) for deployments and monitoring
  • Design and manage CI/CD pipelines with Jenkins, GitLab CI, or AWS DevOps
  • Ensure compliance with security policies and industry standards
  • Implement log aggregation and performance monitoring for cloud environments
Employment Type: Full-time

Senior DevOps / Voice Infrastructure Engineer

As we grow and take on exciting new challenges, we’re on the lookout for excepti...
Salary: Not provided
Company: Mad Devs
Expiration Date: Until further notice
Requirements
  • 3+ years of hands-on experience with Asterisk or FreeSWITCH
  • Deep knowledge of SIP, RTP, SRTP protocols
  • Experience with SIP proxies — Kamailio or OpenSIPS
  • WebRTC integrations
  • Trunk configuration, dialplan design, codec negotiation
  • GCP and/or AWS hands-on experience (2+ years)
  • Kubernetes (GKE or EKS) in production environments
  • Terraform — custom modules, multi-environment setups
  • Docker, Docker Compose
  • CI/CD: GitHub Actions, ArgoCD / Flux
Job Responsibility
  • Design, deploy, and maintain SIP/VoIP infrastructure (Asterisk, FreeSWITCH, Kamailio) for AI Agents
  • Integrate voice platforms with cloud services (GCP, AWS) and internal AI pipelines
  • Ensure high availability and low latency of voice services (HA, load balancing, failover)
  • Manage cloud infrastructure via IaC (Terraform) and container orchestration in Kubernetes
  • Set up call quality monitoring (MOS, jitter, packet loss) and alerting with Grafana / Victoria Metrics
  • Build and optimize CI/CD pipelines (GitHub Actions, ArgoCD) for voice services
  • Harden voice infrastructure security: encryption (SRTP, TLS), toll fraud prevention, DoS protection
  • Integrate with PSTN/SIP trunk providers, manage DID numbers and call routing
What we offer
  • Flexible working hours
  • Remote-first culture
  • Long-term projects
  • Salary in dollars
  • Professional communities
  • Onsite business trips
  • Training budget
  • Paid conferences
Employment Type: Full-time

Senior Cloud Platform Engineer with AI Enablement

We are looking for a Senior Cloud Platform Engineer with AI Enablement experienc...
Location: Poland, Warszawa
Salary: Not provided
Company: Algoteque
Expiration Date: Until further notice
Requirements
  • Solid experience as a Cloud, DevOps, Platform, SRE or Infrastructure Engineer
  • Hands-on knowledge of at least one major cloud platform: AWS, Azure or GCP
  • Experience with Kubernetes in production or near-production environments
  • Experience with Infrastructure as Code, especially Terraform
  • Familiarity with CI/CD tools (GitHub Actions, GitLab CI/CD, Jenkins, Azure DevOps or Tekton)
  • Knowledge of observability and monitoring tools (Prometheus, Grafana, ELK, Loki, Datadog, New Relic, CloudWatch or OpenSearch)
  • Experience with production support, incident management, RCA and deployment stability
  • Good scripting or programming skills (Python, Bash, PowerShell or Go)
  • Understanding of security basics, IAM, secrets management and secure cloud delivery
  • Experience working with development teams and improving Developer Experience
Job Responsibility
  • Design and develop cloud-native platforms and internal developer platforms (IDP)
  • Deliver scalable, reliable and secure platform solutions for engineering teams
  • Work with Kubernetes and cloud services on AWS, Azure or GCP
  • Build and maintain Infrastructure as Code (Terraform, Pulumi, CloudFormation)
  • Develop CI/CD pipelines and automate deployments
  • Introduce platform standards, reusable templates, golden paths and best practices
  • Improve observability, monitoring, alerting and incident response
  • Support reliability, high availability, disaster recovery and operational stability
  • Use AI tools and AI-assisted workflows to boost engineering productivity and platform operations
  • Help define safe, controlled usage of AI tools, coding agents and LLM-based workflows
What we offer
  • B2B contract
  • 100% remote work
  • A unique and engaging project in the EdTech space
Employment Type: Full-time

Senior ML Infrastructure / ML DevOps Engineer

We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux,...
Salary: Not provided
Company: Pathway
Expiration Date: Until further notice
Requirements
  • Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
  • 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
  • Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
  • Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
  • Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
  • Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
  • Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
  • Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
  • Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
  • Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer
Job Responsibility
  • Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
  • Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management
  • Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
  • Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
  • Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch)
  • Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
  • Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
  • Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break
What we offer
  • Intellectually stimulating work environment
  • Be a pioneer: you get to work with realtime data processing & AI
  • Work in one of the hottest AI startups, with exciting career prospects
  • Team members are distributed across the world
  • Real responsibility and the ability to make a significant contribution to the company’s success
  • Inclusive workplace culture
Employment Type: Full-time

Senior AI Engineer – Microsoft Fabric & Azure AI Foundry

We are looking for an experienced AI Engineer to lead the implementation of Azur...
Location: United States, New York City
Salary: 160000.00 - 220000.00 USD / Year
Company: Valtech
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in cloud engineering, AI engineering, or data platform architecture
  • Strong hands-on experience with: Microsoft Fabric, Azure AI Foundry, Azure OpenAI, Azure Machine Learning, Azure Data Services
  • Experience integrating AI workloads into enterprise analytics platforms
  • Proficiency in Python and/or C#
  • Experience with REST APIs, SDKs, and AI orchestration frameworks
  • Knowledge of: Vector databases, Retrieval-Augmented Generation (RAG), Prompt engineering, Model evaluation and monitoring
  • Familiarity with DevOps practices including GitHub Actions or Azure DevOps
  • Strong understanding of enterprise security and governance
Job Responsibility
  • Design and implement AI solutions using Microsoft Azure AI Foundry within an existing Microsoft Fabric architecture
  • Integrate AI services with Fabric components including: Data Factory, OneLake, Power BI, Lakehouse and Warehouse environments, Real-Time Analytics
  • Build and operationalize generative AI and machine learning workflows
  • Configure and manage: Azure AI Services, Azure OpenAI, Model deployment pipelines, Prompt orchestration and evaluation
  • Establish secure connectivity between Azure AI Foundry and enterprise data sources
  • Implement governance, RBAC, security, compliance, and cost management controls
  • Develop reusable AI pipelines, APIs, and automation frameworks
  • Collaborate with platform teams to ensure scalability, observability, and production readiness
  • Support CI/CD and Infrastructure-as-Code deployment patterns
  • Provide technical leadership and documentation for AI platform adoption
What we offer
  • Flexibility, with remote and hybrid work options (country-dependent)
  • Career advancement, with international mobility and professional development programs
  • Learning and development, with access to cutting-edge tools, training and industry experts
  • Medical, dental, and vision insurance for you and your family, plus employer contributions to Health Savings Accounts
Employment Type: Full-time