CrawlJobs Logo

AI Infrastructure Operations Engineer

cerebras.net Logo

Cerebras Systems

Location Icon

Location:
United States; Canada , Sunnyvale

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

The AI Infrastructure Operations Engineer (SiteOps) is an entry-level individual contributor role focused on the deployment, bring-up, monitoring, and first-line troubleshooting of Cerebras AI infrastructure in data center environments. The role supports CS systems, cluster server hardware, cluster networking hardware, and hardware telemetry and monitoring tools.

Job Responsibility:

  • Assist with deployment and bring-up of CS-X systems, cluster servers, and networking hardware
  • Execute power-on sequencing, readiness checks, and validation tests
  • Monitor hardware telemetry, alerts, and dashboards
  • Perform first-line troubleshooting and structured escalation
  • Collect logs, telemetry, and observations during incidents
  • Participate in incident response under senior engineer guidance
  • Use existing monitoring, telemetry, and incident tracking tools
  • Provide feedback on tooling and process gaps
  • Build working knowledge of Cerebras system architecture
  • Learn cluster hardware and networking fundamentals
  • Shadow senior engineers during complex debugging
  • Progress toward independent ownership of defined workflows

Requirements:

  • Bachelor’s degree in a relevant engineering field or equivalent experience
  • 0–3 years experience in hardware operations, systems engineering, or datacenter environments
  • basic familiarity with server hardware, networking fundamentals, and Linux systems

Nice to have:

  • Internship or early-career experience in datacenter or hardware lab environments
  • exposure to monitoring or telemetry systems
  • comfort working in data centers
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for AI Infrastructure Operations Engineer

Senior AI Infrastructure Engineer

This role will be responsible for designing, deploying, and maintaining high-per...
Location
Location
United States , Bothell; Overland Park; Bellevue
Salary
Salary:
113600.00 - 205000.00 USD / Year
https://www.t-mobile.com Logo
T-Mobile
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years technical engineering experience, preferably in multiple technology focus areas
  • Expert understanding of AI/ML infrastructure components, or GPU-based systems – preferably in a high-availability, large scale environment
  • Hands-on Experience with NVIDIA DGX servers, BasePOD architectures, and advanced GPU technologies
  • Proficient in Linux/UNIX environments, including scripting/automation tools (Bash, Python, Ansible, Terraform)
  • Understanding of AI infrastructure security best practices
  • Experience with container orchestration (Kubernetes, Docker) and GPU workload management tools
  • Strong knowledge of networking (InfiniBand/Ethernet) and storage solutions in AI/ML contexts
Job Responsibility
Job Responsibility
  • Technical System Expertise: Understands system protocols, how systems operate and data flows
  • Technical Engineering Services: Drives engineering projects by active contribution to the application of engineering techniques
  • Innovation: Contributes to designs to implement new ideas which improve an existing and new system/process/service
  • Technical Writing: Writes basic documentation on how technology works
  • Technical Leadership: Collaborates with technical teams and utilizes system expertise to deliver technical solutions
  • Technology Strategy: Contributes to new and existing technology options that support business goals
What we offer
What we offer
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off
  • Paid holidays
  • Paid parental and family leave
  • Fulltime
Read More
Arrow Right

AI Research Engineer, Data Infrastructure

As a Research Engineer in Infrastructure, you will design and implement a robust...
Location
Location
United States , Palo Alto
Salary
Salary:
180000.00 - 250000.00 USD / Year
1x.tech Logo
1X Technologies
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Strong experience in building data pipelines and ETL systems
  • Ability to design and implement systems for data collection and management from robotic fleets
  • Familiarity with architectures that span on-robot components, on-premise clusters, and cloud infrastructure
  • Experience with data labeling tools or building dataset visualization and annotation tooling
  • Proficiency in creating or applying machine learning models for dataset organization and automated labeling
Job Responsibility
Job Responsibility
  • Optimize operational efficiency of data collection across the NEO robot fleet
  • Design intelligent triggers to determine when and what data should be uploaded from the robots
  • Automate ETL pipelines to make fleet-wide data easily queryable and training-ready
  • Collaborate with external dataset providers to prepare diverse multi-modal pre-training datasets
  • Build frontend tools for visualizing and automating the labeling of large datasets
  • Develop machine learning models for automatic dataset labeling and organization
What we offer
What we offer
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Fulltime
Read More
Arrow Right

Senior Engineering Manager - AI Core Platform

We’re hiring a Senior Engineering Manager (or high-potential EM2) for the Core P...
Location
Location
Ireland , Dublin
Salary
Salary:
Not provided
intercom.com Logo
Intercom
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience leading engineering teams, ideally across infrastructure or platform domains
  • Recent hands-on coding experience — you’ve shipped production code in the last couple of years
  • Strong technical judgment and the ability to coach senior engineers through complex architectural trade-offs
  • Adaptable leadership style suited to a group that will grow quickly, and change shape over time
  • Curiosity and enthusiasm for AI, with a desire to learn how ML systems are developed and operated in production
Job Responsibility
Job Responsibility
  • Lead a high-performing team building the platform and infrastructure that power Intercom’s AI capabilities
  • Contribute directly to production code, staying close to the work and building knowledge & context through first-hand experience
  • Support teams of ML Scientists and Engineers building AI powered capabilities
  • Plan, prioritize, and deliver high-impact roadmaps in partnership with the team’s most senior engineers, balancing delivery, quality, and innovation
  • Improve developer experience across the AI infrastructure stack, ensuring that systems are observable, scalable, and easy to build upon
  • Empower the engineers on the team to act with agency and maximize their impact
  • Expand your scope over time, potentially taking ownership of additional platform domains as the team and AI initiatives grow
What we offer
What we offer
  • Competitive salary and equity in a fast-growing start-up
  • We serve lunch every weekday, plus a variety of snack foods and a fully stocked kitchen
  • Regular compensation reviews - we reward great work
  • Pension scheme & match up to 4%
  • Peace of mind with life assurance, as well as comprehensive health and dental insurance for you and your dependents
  • Flexible paid time off policy
  • Paid maternity leave, as well as 6 weeks paternity leave for fathers, to let you spend valuable time with your loved ones
  • If you’re cycling, we’ve got you covered on the Cycle-to-Work Scheme. With secure bike storage too
  • MacBooks are our standard, but we also offer Windows for certain roles when needed
  • Fulltime
Read More
Arrow Right

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
Location
United States , San Francisco
Salary
Salary:
241200.00 - 400000.00 USD / Year
plaid.com Logo
Plaid
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer, IC/TL
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility
Job Responsibility
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
  • Fulltime
Read More
Arrow Right

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
Location
United States , San Francisco
Salary
Salary:
180000.00 - 270000.00 USD / Year
plaid.com Logo
Plaid
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer
What we offer
  • medical, dental, vision, and 401(k)
  • Fulltime
Read More
Arrow Right

Is Data Center Operations Engineer

Bridging Information Technology (IT) and the Mechanical, Electrical, and Plumbin...
Location
Location
United States , New Albany
Salary
Salary:
91731.00 - 114948.00 USD / Year
amgen.com Logo
Amgen
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Master’s degree
  • Bachelor’s degree and 2 years of data center operations experience
  • Associate’s degree and 6 years of data center operations experience
  • High school diploma / GED and 8 years of data center operations experience
  • Hands-on experience with rack/stack, structured cabling, and IT hardware installation
  • Familiarity with Dell PowerEdge, Nutanix, NetApp, and Cisco platforms
  • Ability to interpret electrical and mechanical drawings (awareness-level competency)
  • Experience using monitoring, alerting, or automation systems (AI-enabled platforms preferred)
  • Solid understanding of IT operations concepts including hardware lifecycle management and disaster recovery
  • Ability to read and update documentation, diagrams, and cable records
Job Responsibility
Job Responsibility
  • Serve as the liaison between IT teams and facilities staff, ensuring flawless communication
  • Interpret electrical one-line diagrams, distribution drawings, and cooling schematics to support incident response and planning
  • Install, rack, cable, and support enterprise IT systems including Dell PowerEdge, Nutanix, NetApp, and Cisco technologies
  • Support day-to-day moves, adds, and changes (MACs) in building IDF and VDER environments
  • Perform fiber and copper patch cabling in data centers, IDFs, and VDER closets
  • Trace and troubleshoot cabling issues to restore connectivity
  • Monitor infrastructure, proactively detect issues, and bring up with urgency to appropriate teams
  • Apply AI-enabled monitoring and automation platforms to enhance data center operations
  • Maintain documentation of infrastructure layouts, procedures, and operational standards
  • Participate in capacity planning, disaster recovery drills, and continuous improvement initiatives
What we offer
What we offer
  • A comprehensive employee benefits package, including a Retirement and Savings Plan with generous company contributions, group medical, dental and vision coverage, life and disability insurance, and flexible spending accounts
  • A discretionary annual bonus program, or for field sales representatives, a sales-based incentive plan
  • Stock-based long-term incentives
  • Award-winning time-off plans
  • Flexible work models, including remote and hybrid work arrangements, where possible
  • Fulltime
Read More
Arrow Right

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location
Location
Canada; United States
Salary
Salary:
195000.00 - 285000.00 USD / Year
apollo.io Logo
Apollo.io
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, NewRelic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility
Job Responsibility
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer
What we offer
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
  • Fulltime
Read More
Arrow Right

Director of AI Engineering

We are entering a hyper-growth phase of AI innovation and are hiring a Director ...
Location
Location
Canada; United States
Salary
Salary:
300000.00 - 450000.00 USD / Year
apollo.io Logo
Apollo.io
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10–15+ years in software engineering, with significant leadership experience owning AI/ML or applied LLM systems at scale
  • Proven history shipping LLM-powered features, agentic workflows, or AI assistants used by real customers in production
  • Deep understanding of LLM orchestration frameworks (LangChain, LlamaIndex), RAG pipelines, vector search, embeddings, and prompt engineering
  • Expert in backend & distributed systems (Python strongly preferred) and cloud infrastructure (AWS/GCP)
  • Strong experience with telemetry, observability, and cost-aware real-time inference optimizations
  • Demonstrated ability to lead senior engineers, define technical roadmaps, and deliver outcomes aligned to business metrics
  • Experience building or scaling teams working on experimentation, optimization, personalization, or ML-powered growth systems
  • Exceptional ability to simplify complex problems, set clear standards, and drive alignment across Product, Data, Design, and Engineering
  • Strong product sense, ability to weigh novelty vs. impact, focus on user value, and prioritize speed with guardrails
  • Fluent in integrating AI tools into engineering workflows for code generation, debugging, delivery velocity, and operational efficiency
Job Responsibility
Job Responsibility
  • Define the multi-year technical vision for Apollo’s AI stack, spanning agents, orchestration, inference, retrieval, and platformization
  • Prioritize high-impact AI investments by partnering with Product, Design, Research, and Data leaders to align engineering outcomes with business goals
  • Establish technical standards, evaluation criteria, and success metrics for every AI-powered feature shipped
  • Lead the architecture and deployment of long-horizon autonomous agents, multi-agent workflows, and API-driven orchestration frameworks
  • Build reusable, scalable agentic components that power GTM workflows like research, enrichment, sequencing, lead scoring, routing, and personalization
  • Own the evolution of Apollo’s internal LLM platform for high-scale, low-latency, cost-optimized inference
  • Oversee model-driven experiences for natural-language interfaces, RAG pipelines, semantic search, personalized recommendations, and email intelligence
  • Partner with Product & Design to build intuitive conversational UX that hides underlying complexity while elevating user productivity
  • Implement rigorous evaluation frameworks, including offline benchmarking, human-in-the-loop review, and online A/B experimentation
  • Ensure robust observability, monitoring, and safety guardrails for all AI systems in production
What we offer
What we offer
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA
  • Fulltime
Read More
Arrow Right