Manager – AI Infrastructure Operations

Cerebras Systems

Location:
United States, Sunnyvale

Contract Type:
Not provided

Salary:
Not provided

Job Description:

As a senior leader on our team, you will be responsible for the overall health, performance, and reliability of our infrastructure, driving initiatives that maximize compute capacity and directly support our critical AI objectives. This role is a blend of strategic leadership and hands-on technical ownership. You will leverage your deep Site Reliability Engineering (SRE) expertise to build robust systems, lead high-stakes technical escalations, and champion customer success.

Job Responsibility:

  • Lead and Manage Infrastructure: Oversee the operation and reliability of our advanced AI compute infrastructure, defining strategy and setting a high bar for operational excellence
  • Drive Technical Ownership: Act as the primary owner for critical infrastructure systems, ensuring uptime, performance, and capacity are consistently optimized
  • Handle High-Stakes Escalations: Serve as the final point of contact for complex customer and engineering escalations, providing expert-level, hands-on support and driving issues to a rapid and complete resolution
  • Champion Reliability and Automation: Leverage your SRE experience to develop and implement robust monitoring, alerting, and automation solutions, reducing manual toil and preventing future issues
  • Collaborate and Strategize: Partner with cross-functional teams, including engineering and product, to align on long-term infrastructure strategy and support future AI initiatives
  • Innovate and Improve: Continuously evaluate and improve existing processes, tools, and technologies to enhance system reliability and operational efficiency

Requirements:

  • Technical Leadership: 15+ years of experience in managing and operating complex compute infrastructure, with a minimum of 5 years in a senior or leadership role
  • SRE and Operations Expertise: A strong background as a Site Reliability Engineer or in a similar role, with a proven track record of managing large-scale, mission-critical systems
  • Deep Systems Knowledge: Expert-level proficiency in Linux-based systems, Python scripting, and command-line tools for system administration and automation
  • Troubleshooting Acumen: Exceptional ability to lead and resolve complex technical challenges under pressure, especially during customer or engineering escalations
  • On-Call Leadership: Proven experience managing an on-call rotation and responding to 24/7 technical incidents
  • Communication: Excellent communication and leadership skills, with the ability to effectively mentor junior team members and communicate complex technical concepts to a diverse audience

Nice to have:

  • Prior experience operating large-scale GPU/accelerator clusters
  • Knowledge of networking protocols (Ethernet, RoCE, TCP/IP)
  • Familiarity with ML frameworks and AI/ML workflows
  • Exposure to cloud platforms (AWS, GCP, Azure)

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Enjoy our simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026

Similar Jobs for Manager – AI Infrastructure Operations

Principal Product Manager, AI Infrastructure

We are seeking an experienced Product Manager to bring our Private Cloud AI prod...
Location:
United States
Salary:
148000.00 - 340500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree or equivalent in computer science, engineering or related field of study
  • MBA or advanced degree in computer science or engineering preferred
  • 10+ years of work experience in AI/ML platform infrastructure and tools
  • Technical understanding and knowledge of the AI/ML tooling and infrastructure industry
  • 10+ years of product management experience in B2B SaaS, with a focus on AI/ML and data infrastructure
  • Demonstrate strong technical acumen across enterprise AI and data workflows
  • Expertise in AI and ML hardware, ecosystem, and software a plus
  • Experience motivating others at all levels by creating a shared sense of vision and purpose
  • Extensive team skills and the ability to drive and influence work cross-functionally through others, and to mentor and lead teams to achieve results on complex, ambiguous projects
  • Have proven experience working in a technical environment with cross-functional teams to drive product vision, define requirements, and guide the team through key milestones
Job Responsibility:
  • Define and execute a product strategy to unlock AI opportunities across the world’s largest organizations
  • Independently lead and drive the end-to-end strategy and operational product roadmap for one or more complex products or a product portfolio
  • Build and deliver the value proposition, target customer segments, and business case to bring innovative and disruptive products to market for a product portfolio, in the context of the whole company product portfolio
  • Synthesize market requirements (MRD) into marketing/customer details by drawing on intimate customer knowledge and business, financial, and industry market acumen
  • Guide key stakeholders on the portfolio strategy across all phases of the lifecycle
  • Create and drive goal alignment, and collaborate across the value chain partners of one or more products to optimize margins and enable product success per plan across the product lifecycle
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Job Type:
Full-time

Technical Program Manager, AI Platform

Figma is growing our team of passionate creatives and builders on a mission to m...
Location:
United States, San Francisco; New York
Salary:
180000.00 - 308000.00 USD / Year
Figma
Expiration Date:
Until further notice
Requirements:
  • 4+ years of technical program management experience (or equivalent) in AI platform, AI research, and AI infrastructure
  • Understand how AI gets built and scaled: model evaluation loops, annotation pipelines, quota limits, and data versioning
  • Have hands-on experience running AI cost/capacity reviews, forecast planning, and vendor oversight. Deep understanding of model cost mechanics, including token burn, cache hit rates, latency, and quota limits
  • Comfort operating in high-ambiguity, high-velocity environments with exec visibility
  • Strong writing and communication skills — you bring structure, clarity, and momentum to complex technical programs
  • Bring a systems-thinking mindset to the AI delivery pipeline, and know where to tighten loops or increase speed
Job Responsibility:
  • Own and drive programs supporting Figma’s AI platform — including annotation velocity, evaluation pipelines, and cost/capacity readiness
  • Partner with Infra and Finance to plan model scaling across providers: track token usage, forecast traffic, manage regional limits, optimize caching strategies, and reduce latency
  • Lead our internal AI Annotation Program: manage vendors and design annotators. Define task priorities, improve quality standards and increase annotator throughput
  • Support internal AIOps initiatives — model go/no-go decision making, monitor model behavior, prevent regressions, and ensure readiness across quality gates
  • Drive cross-functional execution of key AI-powered product features — coordinate scope, risks, comms, and launch checklists
  • Partner with Data Science to maintain and improve internal visibility: annotation metrics, token quotas, reliability dashboards, and evaluation timelines
What we offer:
  • equity to employees
  • health, dental & vision
  • retirement with company contribution
  • parental leave & reproductive or family planning support
  • mental health & wellness benefits
  • generous PTO
  • company recharge days
  • a learning & development stipend
  • a work from home stipend
  • cell phone reimbursement
Job Type:
Full-time

Senior Engineering Manager - AI Core Platform

We’re hiring a Senior Engineering Manager (or high-potential EM2) for the Core P...
Location:
Ireland, Dublin
Salary:
Not provided
Intercom
Expiration Date:
Until further notice
Requirements:
  • Experience leading engineering teams, ideally across infrastructure or platform domains
  • Recent hands-on coding experience — you’ve shipped production code in the last couple of years
  • Strong technical judgment and the ability to coach senior engineers through complex architectural trade-offs
  • Adaptable leadership style suited to a group that will grow quickly, and change shape over time
  • Curiosity and enthusiasm for AI, with a desire to learn how ML systems are developed and operated in production
Job Responsibility:
  • Lead a high-performing team building the platform and infrastructure that power Intercom’s AI capabilities
  • Contribute directly to production code, staying close to the work and building knowledge & context through first-hand experience
  • Support teams of ML Scientists and Engineers building AI powered capabilities
  • Plan, prioritize, and deliver high-impact roadmaps in partnership with the team’s most senior engineers, balancing delivery, quality, and innovation
  • Improve developer experience across the AI infrastructure stack, ensuring that systems are observable, scalable, and easy to build upon
  • Empower the engineers on the team to act with agency and maximize their impact
  • Expand your scope over time, potentially taking ownership of additional platform domains as the team and AI initiatives grow
What we offer:
  • Competitive salary and equity in a fast-growing start-up
  • We serve lunch every weekday, plus a variety of snack foods and a fully stocked kitchen
  • Regular compensation reviews - we reward great work
  • Pension scheme & match up to 4%
  • Peace of mind with life assurance, as well as comprehensive health and dental insurance for you and your dependents
  • Flexible paid time off policy
  • Paid maternity leave, as well as 6 weeks paternity leave for fathers, to let you spend valuable time with your loved ones
  • If you’re cycling, we’ve got you covered on the Cycle-to-Work Scheme. With secure bike storage too
  • MacBooks are our standard, but we also offer Windows for certain roles when needed
Job Type:
Full-time

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
241200.00 - 400000.00 USD / Year
Plaid
Expiration Date:
Until further notice
Requirements:
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer, IC/TL
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility:
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer:
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
Job Type:
Full-time

Senior Manager, Operations Knowledge Systems & Process Design

This isn't traditional knowledge management. You're building the operating syste...
Location:
United States, Nashville
Salary:
Not provided
Checkr
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in operations, process improvement, knowledge management, or related fields
  • 3+ years leading teams or complex cross-functional initiatives
  • Demonstrated expertise in business process design, mapping, and optimization (Lean, Six Sigma, or similar methodologies)
  • Strong systems thinking—ability to see how knowledge, process, technology, and people interconnect
  • Proven ability to write clear, effective operational content that scales across audiences and channels
  • Data fluency: comfortable using metrics and analytics to drive decisions and measure impact
  • Experience building scalable solutions that work across multiple teams or functions
  • Excellent stakeholder management skills with ability to influence without authority
  • Clear, compelling communication—can translate complex systems into understandable frameworks
Job Responsibility:
  • Design and evolve the knowledge infrastructure that powers compliance operations, customer support, and external help center content
  • Write and oversee the creation of content that works—clear, actionable knowledge that scales across channels and use cases
  • Develop and maintain structured taxonomies and leverage AI-powered approaches for organizing and surfacing unstructured content
  • Create systems that enable both human agents and AI systems to leverage knowledge effectively
  • Establish frameworks for knowledge quality, governance, and lifecycle management that scale with business growth
  • Map, document, and optimize cross-functional processes across compliance, support, and supply chain operations
  • Design processes that balance efficiency, quality, and customer experience outcomes
  • Build process frameworks that support continuous improvement and rapid iteration
  • Harness conversation analytics and AI to surface patterns, gaps, and opportunities in knowledge and process performance
  • Use operational data and performance metrics to identify knowledge and process gaps and translate insights into action
What we offer:
  • Lunch four times a week
  • Commuter stipend
  • Snacks and beverages
Job Type:
Full-time

Revenue Operations Manager

This is one of the most critical roles driving the scalability and financial per...
Location:
Sweden, Stockholm
Salary:
Not provided
Mentimeter
Expiration Date:
Until further notice
Requirements:
  • 3+ years of experience in Operations (Revenue, Sales or Marketing Ops), SaaS Sales or Consultancy
  • Highly driven, proactive, and action-oriented with a strong bias toward execution
  • Curious interest in leveraging AI and automation to drive smarter decisions and improve operational effectiveness
  • Excellent communicator with the ability to align and collaborate effectively with senior leadership and cross-functional teams
  • Ability to work cross-functionally and align operational initiatives with business goals
  • Attention to detail and a structured, problem-solving mindset
  • Familiarity with SaaS sales processes and CRM data models
Job Responsibility:
  • Revenue Process Design and Implementation: Responsible for process design and driving scalability within our Enterprise Bow Tie funnel
  • Partnering with Revenue leaders to align Sales Ops initiatives with Mentimeter’s G2M strategy
  • Leading and contributing to cross-functional projects focused on revenue enablement and operational excellence
  • Implement process changes through tooling and data infrastructure, automating workflows where possible to ensure scalability
  • Drive cross-functional alignment and change management to ensure consistent process adoption and scalability
  • Tech Stack & System Enablement: Ownership of tools and systems that are the closest to your specialisation
  • Workflows and automation: Identify and implement workflow improvements that increase productivity and visibility throughout the funnel
  • Ensure data activation within the system
  • Ensure CRM data integrity: Responsible for legal compliance for the data in the tools and maintaining data hygiene
  • Take commercial ownership of the renewal process and negotiations, optimising costs and tool ROI
What we offer:
  • Diverse and inclusive work environment
  • Continuous professional development
  • Access to a leadership program (including external personal coach)
  • Relevant education
  • Competitive compensation and benefits package, including pension contributions

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location:
Canada; United States
Salary:
195000.00 - 285000.00 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, NewRelic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility:
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
Job Type:
Full-time

Senior AI Infrastructure Engineer

This role will be responsible for designing, deploying, and maintaining high-per...
Location:
United States, Bothell; Overland Park; Bellevue
Salary:
113600.00 - 205000.00 USD / Year
T-Mobile
Expiration Date:
Until further notice
Requirements:
  • 5+ years technical engineering experience, preferably in multiple technology focus areas
  • Expert understanding of AI/ML infrastructure components, or GPU-based systems – preferably in a high-availability, large scale environment
  • Hands-on experience with NVIDIA DGX servers, BasePOD architectures, and advanced GPU technologies
  • Proficient in Linux/UNIX environments, including scripting/automation tools (Bash, Python, Ansible, Terraform)
  • Understanding of AI infrastructure security best practices
  • Experience with container orchestration (Kubernetes, Docker) and GPU workload management tools
  • Strong knowledge of networking (InfiniBand/Ethernet) and storage solutions in AI/ML contexts
Job Responsibility:
  • Technical System Expertise: Understands system protocols, how systems operate and data flows
  • Technical Engineering Services: Drives engineering projects by active contribution to the application of engineering techniques
  • Innovation: Contributes to designs that implement new ideas to improve existing and new systems, processes, and services
  • Technical Writing: Writes basic documentation on how technology works
  • Technical Leadership: Collaborates with technical teams and utilizes system expertise to deliver technical solutions
  • Technology Strategy: Contributes to new and existing technology options that support business goals
What we offer:
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off
  • Paid holidays
  • Paid parental and family leave
Job Type:
Full-time