Manager – AI Infrastructure Operations Job at Cerebras Systems (Sunnyvale)

Operations Program Manager, AI Infrastructure

OpenAI’s Hardware organization develops silicon and system-level solutions desig...

Location

United States, San Francisco

Salary:

177000.00 - 285000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

8+ years of experience in Operations, Engineering, Program Management, or equivalent, within hardware development, manufacturing, or supply chain domains (compute, networking, datacenter, or similarly complex systems)
Proven track record leading complex hardware NPI programs end-to-end, from early bring-up through production ramp
Strong understanding of manufacturing and supply chain fundamentals, including BOM management, ECO/MCO processes, build readiness, factory test, quality controls, and material planning
Demonstrated ability to lead cross-functional teams, influence senior stakeholders, and drive decisions in ambiguous, time-compressed environments
Exceptional written and verbal communication skills, with the ability to distill complex issues for executive and external audiences

Job Responsibility

Act as the single-threaded owner for operational readiness across NPI and ramp, accountable for outcomes from early bring-up through sustained production
Translate OpenAI’s infrastructure strategy and engineering objectives into clear operating plans, execution priorities, and decision frameworks
Drive alignment across Engineering, Operations, Strategic Sourcing, Finance, Capacity Planning, and Executive stakeholders by framing tradeoffs, risks, and recommendations
Proactively identify inflection points where decisions or investments are required to protect long-term scale, reliability, or cost targets
Influence operational strategy with manufacturing partners by setting expectations on execution rigor, accountability, and continuous improvement
Drive overall NPI build readiness, including material accountability, manufacturing and test readiness, product data availability, factory infrastructure, and qualification plans
Lead transition activities from NPI to mass production, partnering closely with Sustaining Operations teams to ensure seamless ownership transfer
Translate engineering requirements into actionable, factory-ready plans with tier-1 manufacturing and integration partners
Lead cross-functional build and debug cadences; ensure issues are clearly owned, aggressively driven, and formally closed with root cause and prevention

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime

Senior Manager, AI Infrastructure and Operations

The Sr. Manager/Staff Engineer, AI Infrastructure & MLOps Engineering is a senio...

Location

Japan, Tokyo

Salary:

Not provided

Pfizer

Expiration Date

Until further notice

Requirements

8+ years of hands-on software engineering experience in cloud infrastructure, DevOps, and MLOps
Deep expertise in Python, Kubernetes, Terraform, Helm, and CI/CD pipeline development
Proven experience architecting and operating containerized solutions on AWS, GCP, and Azure
Strong knowledge of Infrastructure-as-Code, distributed systems, and production system reliability
Bachelor’s or Master’s degree in Computer Science, Engineering, or related field

Job Responsibility

Design, implement, and own large-scale cloud-based HPC and MLOps platforms supporting AI model training, genomic sequencing, and precision medicine
Architect multi-environment clusters (AWS, GCP, Azure), enabling GPU/FPGA workloads and advanced observability
Lead the development of developer and cloud platforms, including internal engineering accelerators and reusable toolsets
Design, implement, and manage unified platform catalogs using Backstage, enhancing developer experience and application metadata management
Develop custom plugins and APIs for Backstage to support internal engineering workflows and documentation
Build and maintain Python-based automation frameworks, CI/CD pipelines, and Infrastructure-as-Code (Terraform, Helm, Pulumi, AWS CDK)
Operationalize containerized solutions using Docker and Kubernetes, integrating MLflow, Kubeflow, and other orchestration platforms
Implement robust automation for provisioning, configuring, and managing cloud resources across multiple environments
Lead the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and advanced observability (Prometheus, Grafana, PagerDuty)
Develop and maintain APIs and services for model management, feature stores, and inference pipelines

Fulltime

Senior Technical Program Manager – AI Infrastructure, Site Operations

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. ...

Location

United States, Sunnyvale

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

8+ years in Technical Program Management, Infrastructure Ops, or Data Center Ops
Experience leading large, cross-functional infrastructure programs
Strong understanding of data center power and cooling fundamentals, network and storage basics, and hardware-centric platforms
Proven ability to define and operationalize metrics
Strong written and executive-level communication skills

Job Responsibility

Own end-to-end technical programs for data center and site operations
Act as single-threaded owner across Hardware & Systems Engineering; AI Cloud Infrastructure & Operations; Network & Storage Engineering; and facilities, power, cooling, and colo partners
Drive site readiness for Cerebras Wafer-Scale Engine systems
Partner on installation, commissioning, change management, and break/fix workflows
Lead incident reviews and postmortems; ensure corrective actions are closed
Define and own operational metrics and KPIs, including availability and reliability

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open source their cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Our simple, non-corporate work culture that respects individual beliefs

Senior Business Manager - AI Infrastructure

The AI Infrastructure team builds systems that turn hardware and AI models into ...

Location

United States, Redmond

Salary:

116900.00 - 203600.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in relevant field (e.g., Liberal Arts, Business Administration, Management, Computer Science) AND 6+ years experience in financial management, business planning, operations management, strategy, project management, human resources, or business-related roles OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Own end-to-end financial tracking including spend, forecasts, and variance analysis. Deliver accurate, timely, and actionable reporting that supports leadership decision-making
Establish and maintain strong financial discipline, including data accuracy, consistent processes, and adherence to reporting standards
Lead headcount planning, hiring pacing, and Position Control Number (PCN) management. Support trade-off decisions aligned to organizational priorities and financial targets. Plan and support Early in Profession (EIP) hiring across the team
Design, build, and continuously improve reporting and tooling to move from manual processes to scalable, automated, and reliable systems
Translate data into clear insights, identifying risks and opportunities and providing recommendations to leadership
Partner with Finance, HR, Recruiting, and business leaders to align financials, workforce, and business priorities

Fulltime

Principal Technical Program Manager - AI Infrastructure

Microsoft is developing advanced AI infrastructure platforms that require deep i...

Location

United States, Redmond

Salary:

142800.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree AND 8+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience.
6+ years of experience managing cross-functional and/or cross-team projects.
Ability to meet Microsoft, client, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Own end-to-end delivery from development through production readiness, including integrated planning across the software stack
Drive execution by managing dependencies, risks, and cross-team tradeoffs to keep delivery on track
Ensure platform and performance readiness (bring-up, key workloads, benchmarking, optimization)
Establish strong operating rhythm (reporting, alignment, and clear escalation paths) while improving tools and processes to increase predictability
Identify systemic gaps and act as the bridge across infrastructure, research, and product, driving alignment and translating complexity into clear, actionable updates

Fulltime

AI Operations Manager

The AI Operations Manager will lead the design, deployment, and maintenance of p...

Location

Salary:

Not provided

Solvedex

Expiration Date

Until further notice

Requirements

Bachelor's degree in Statistics, Mathematics, Engineering, Data Science, Computer Science, Economics, or equivalent experience
5+ years of experience in AI Ops, ML Engineering, Data Science, Data Engineering, DevOps, or related fields
Advanced English proficiency (written and spoken)

Job Responsibility

Own the end-to-end delivery of productionized ML/DL/LLM solutions, from design and development to deployment and ongoing performance
Partner with Data Science teams on solution design and delivery; ensure ongoing monitoring, maintenance, and performance of AI solutions in production
Champion compliance, security, and governance best practices across all AI/ML deployments
Coordinate with Release Management, Infrastructure, and DevOps to ensure reliable, smooth deployments

Fulltime

HR Infrastructure & Operations Manager

We're seeking an experienced HR Infrastructure Manager to drive two critical pil...

Location

United States, San Francisco

Salary:

114000.00 - 191400.00 USD / Year

PagerDuty

Expiration Date

Until further notice

Requirements

7+ years of progressive Recruiting or HR Operations experience with demonstrated expertise in recruiting lifecycle management and augmented workforce management
Proven expertise with recruiting systems (ATS platforms such as Greenhouse, Lever, Workday Recruiting, or similar)
Strong technical skills including system configuration, integration management, and process automation
Knowledge of contingent workforce management, including compliance considerations for contractors, EOR arrangements, and intern programs
Demonstrated experience with AI tools and automation, including building AI agents, workflow automation, or applying AI to solve business problems
Experience building programs from the ground up, including policy development, system implementation, and stakeholder enablement
Data-driven mindset with demonstrated ability to build analytics frameworks and translate data into actionable insights
Strong project management skills with ability to manage multiple complex initiatives simultaneously
Effective stakeholder management and communication skills across all organizational levels

Job Responsibility

Own and optimize PagerDuty's recruiting technology ecosystem, ensuring seamless integrations, automation, and scalability across our hiring tools and platforms
Drive recruiting analytics strategy, delivering actionable insights that inform hiring decisions, identify bottlenecks, and measure recruiting effectiveness
Build and maintain integrations between recruiting systems and downstream HR/business tools to ensure data accuracy and process efficiency
Leverage AI and build AI agents to automate recruiting workflows, enhance candidate matching, improve data quality, and surface predictive insights
Partner with Talent Acquisition to identify opportunities for AI-driven automation that reduce time-to-hire and improve candidate experience
Manage vendor relationships for recruiting technology stack, including system upgrades, troubleshooting, and optimization
Design and execute PagerDuty's comprehensive augmented workforce program in partnership with the Director of HR Infrastructure & Operations
Develop clear policies and guidelines defining when and how to engage EOR employees, contractors, and interns, ensuring compliance and business alignment
Configure and optimize systems to support augmented worker lifecycle management, from requisition through offboarding
Build AI-powered tools and agents to streamline augmented workforce processes, automate compliance checks, and provide intelligent recommendations to managers

What we offer

Competitive salary
Comprehensive benefits package
Flexible work arrangements
Company equity
ESPP (Employee Stock Purchase Program)
Retirement or pension plan
Generous paid vacation time
Paid holidays and sick leave
Dutonian Wellness Days & HibernationDuty - companywide paid days off in addition to PTO
Paid parental leave: 22 weeks for pregnant parent, 12 weeks for non-pregnant parent

Fulltime

AI Operations Manager

An exciting new opportunity has arisen with a leading global organisation lookin...

Location

Egypt

Salary:

Not provided

Whitehall Resources Ltd

Expiration Date

Until further notice

Requirements

Experience in AIOps, observability, platform management, or operational technology roles
Experience working with enterprise monitoring or application performance platforms
Knowledge of anomaly detection, event correlation, alert optimisation, and operational automation
Experience integrating with ITSM platforms, cloud environments, or automation tools
Experience with Dynatrace, Datadog, Splunk, or similar tools
Strong understanding of infrastructure, applications, networks, and databases
Experience working across complex enterprise environments

Job Responsibility

Improve service performance
Automate operational processes
Support a better end-user experience
Work on large-scale transformation programmes
Drive automation and operational efficiency
Improve service reliability across enterprise systems
Work with modern cloud technologies
