AI Infrastructure Operations Engineer Job at Cerebras Systems (Sunnyvale)

AI Infrastructure Operations Engineer

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...

Location

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing
Strong proficiency in Python scripting for automation and system administration
Deep understanding of Linux-based compute systems and command-line tools
Extensive knowledge of Docker containers and of orchestration and scheduling platforms such as Kubernetes (k8s) and SLURM
Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner
Experience with monitoring and alerting systems
Proven track record of owning challenges and driving them to completion
Excellent communication and collaboration skills
Ability to work effectively in a fast-paced environment
Willingness to participate in a 24/7 on-call rotation

Job Responsibility

Manage and operate multiple advanced AI compute infrastructure clusters
Monitor and oversee cluster health, proactively identifying and resolving potential issues
Maximize compute capacity through optimization and efficient resource allocation
Deploy, configure, and debug container-based services using Docker
Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed
Handle engineering escalations and collaborate with other teams to resolve complex technical challenges
Contribute to the development and improvement of our monitoring and support processes
Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Enjoy a simple, non-corporate work culture that respects individual beliefs

AI Engineer – Intelligent Operations (Infrastructure)

We are seeking an experienced AI Engineer – Intelligent Operations (Infrastructu...

Location

Canada, Toronto

Salary:

129150.00 USD / Year

Realign

Expiration Date

Until further notice

Requirements

Strong experience in Python and AI/ML frameworks (TensorFlow, PyTorch, Scikit-learn)
Experience working with infrastructure monitoring data (logs, metrics, traces)
Knowledge of cloud platforms (AWS, Azure, or GCP)
Experience with Docker and Kubernetes
Understanding of DevOps and CI/CD practices
Strong analytical and problem-solving skills

Job Responsibility

Develop and deploy AI/ML models for infrastructure monitoring and predictive maintenance
Automate incident detection, root cause analysis, and remediation workflows
Integrate AI solutions with cloud and on-prem infrastructure platforms
Build data pipelines for infrastructure logs and telemetry analysis
Collaborate with DevOps, SRE, and Cloud teams
Optimize system performance, scalability, and reliability
Implement MLOps practices for model deployment and lifecycle management
Provide technical leadership and documentation

Fulltime

AI Infrastructure Engineer

We are seeking a DevOps / Platform Engineer to join our team building and operat...

Location

United States, San Jose

Salary:

204000.00 - 306000.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

5+ years of experience in DevOps, Platform, or Infrastructure Engineering
Deep hands-on experience with Kubernetes and container orchestration at scale
Proven ability to design and deliver platform features that serve internal customers or developer teams
Experience building developer-facing platforms or internal developer portals (e.g., custom workflow tooling)

Job Responsibility

Build and extend platform capabilities to enable new classes of workloads (e.g., interactive development pods, CI pipelines, inference services, benchmarking jobs)
Design and operate scalable orchestration systems using Kubernetes across both on-prem and multi-cloud environments
Develop platform features such as secret management, configuration management, and deployment automation for customers
Partner with development teams to extend the GPU developer platform with features, APIs, templates, and self-service workflows that streamline job orchestration and environment management
Manage service lifecycle within Kubernetes using Helm and GitOps workflows (e.g., ArgoCD or Flux)
Apply expertise in storage and networking to design and integrate CSI drivers, persistent volumes, and network policies that enable high-performance GPU workloads

Fulltime

Vice President - Technology (Data & AI Infrastructure Engineer)

Our client's technology team is responsible for creating and continuously improv...

Location

United States, New York

Salary:

175000.00 - 215000.00 USD / Year

Renner Brown

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Computer Engineering, or a related field (Master's degree is a plus)
8+ years in infrastructure engineering, cloud platform engineering, or data engineering
Demonstrated experience building shared platforms or developer services in an enterprise environment
Azure expertise: Azure AI Foundry, Azure Data Factory, Azure Databricks, AKS, Azure API Management, Azure Key Vault, Azure Entra ID
Strong Python skills: backend services, REST APIs (FastAPI or Flask), and automation scripting
PowerShell for infrastructure tasks
Infrastructure-as-Code: Terraform and/or Bicep
Container orchestration with Docker and Kubernetes
Experience integrating LLM APIs (Anthropic Claude, Azure OpenAI) in production including token cost management and observability
RAG pipeline experience: vector search (Azure AI Search or pgvector), document processing, and retrieval patterns

Job Responsibility

Design, build, and operate the firm's AI platform, enabling developers to build and deploy Python-based AI applications
Implement and manage Azure AI Foundry environments: model deployments, AI hubs, project workspaces, and access controls
Integrate and operationalize third-party AI APIs (Anthropic Claude API, Azure OpenAI) with secure access patterns, API gateway controls, rate limiting, and cost monitoring
Build internal developer tooling and SDK scaffolding to accelerate AI application development across the firm
Build and maintain data pipelines using Azure Data Factory and Azure Databricks to serve AI application data needs
Implement vector search and document retrieval infrastructure (Azure AI Search) to support RAG-based applications
Manage structured and unstructured data stores including Azure Data Lake, Azure SQL, and Cosmos DB
Provision and maintain secure, scalable infrastructure on Azure (primary) and AWS using Infrastructure-as-Code (Terraform or Bicep)
Build and maintain CI/CD pipelines for AI application deployment via Azure DevOps or GitHub Actions
Manage containerized workloads using Docker and Kubernetes (AKS) for AI application hosting and API services

Fulltime

Senior AI Infrastructure Engineer - Training Platform

As a Software Engineer on the Machine Learning Infrastructure team, you will bui...

Location

United States, San Francisco; Seattle; New York

Salary:

216000.00 - 270000.00 USD / Year

Scale

Expiration Date

Until further notice

Requirements

5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling
Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
Proven ability to solve complex problems and work independently in fast-moving environments

Job Responsibility

Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
Design and implement scheduling primitives to optimize the lifecycle of training jobs
Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
Work closely with Finance and Procurement teams to drive our capacity planning process
Participate in our team's on call process to ensure the availability of our services
Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment

What we offer

Comprehensive health, dental, and vision coverage
Retirement benefits
A learning and development stipend
Generous PTO
Commuter stipend (may be eligible)

Fulltime

Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI

The AI Platform organization builds the end-to-end Azure AI stack, from the infr...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript, OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large scale AI training and inferencing
Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters
Enhance systems and applications to ensure high stability, efficiency, and maintainability, low latency, and tight cloud security
Provide operational support and DRI (on-call) responsibilities for the service
Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers
Collaborate closely with service engineers, product managers, and internal applied research and data science teams within Microsoft to build better solutions together
Provide vision, expertise, and technical leadership to other team members
Help to grow talent in these areas
Embody our culture and values

Fulltime

Principal AI Operations Engineer

The Security AI Platform team builds and operates production infrastructure that...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
6+ years technical engineering experience in DevOps, SRE, or platform operations
6+ years driving complex operational initiatives across teams; demonstrated success leading without authority
4+ years hands-on experience with Kubernetes in production environments
3+ years building and maintaining CI/CD pipelines at scale
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Define the operational vision, standards, and roadmap for the platform; establish SLOs, error budgets, and reliability targets
Drive technical direction for the AI Operations group: architecture for deployments, pipelines, branch health, and production reliability
Own CI/CD pipeline architecture: Azure DevOps/GitHub Actions pipelines, build optimization, artifact management, and deployment automation
Manage Kubernetes infrastructure: AKS cluster operations, Helm chart management, node pool configuration, GPU resource allocation, and autoscaling (KEDA)
Drive production deployments: canary/ring rollouts, safe deployment practices, rollback procedures, and release coordination with Platform team
Establish and operate first-level on-call: incident response procedures, escalation paths, runbooks, and post-incident reviews
Build and maintain observability infrastructure: Prometheus, Grafana, OpenTelemetry collectors, alerting rules, and dashboard curation
Manage infrastructure as code: Bicep templates for Azure resources, Helm charts for Kubernetes deployments, and environment parity
Ensure branch health and code quality gates: PR validation pipelines, automated testing, security scanning, and merge policies

Fulltime

AI Infrastructure Engineer

Bright Vision Technologies is looking for a skilled AI Infrastructure Engineer t...

Location

United States, Bridgewater

Salary:

Not provided

Bright Vision Technologies

Expiration Date

Until further notice

Requirements

AI/ML Infrastructure
GPU Computing (NVIDIA CUDA)
Python
Linux
Kubernetes
Docker
Cloud Platforms (AWS / Azure / GCP)
AI Workload Orchestration
High-Performance Computing
Distributed Systems

Job Responsibility

Building innovative solutions that help businesses automate and optimize their operations
Leveraging cutting-edge AI infrastructure and cloud technologies to build scalable, secure, and high-performance platforms that support machine learning and AI workloads
Contributing to the mission of transforming business processes through technology

What we offer

H-1B sponsorship for the year 2027 quota
Career growth potential
Equal employment opportunities
Inclusive work environment

Fulltime
