Software Engineer, Frontier Systems

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

250000.00 - 445000.00 USD / Year

Job Description:

The Frontier Systems team at OpenAI builds, launches, and supports the largest supercomputers in the world that OpenAI uses for its most cutting-edge model training. We take data center designs, turn them into real, working systems, and build any software needed to run large-scale frontier model training. Our mission is to bring up, stabilize, and keep these hyperscale supercomputers reliable and efficient while frontier models are being trained.

Job Responsibility:

  • Own and improve the system health checks that keep our hyperscale supercomputers stable during model training
  • Lead deep dives into hardware failures and system-level bugs to understand how things break at scale
  • Build automation that monitors and fixes issues across thousands of machines, so researchers can keep moving without interruption

Requirements:

  • 7+ years of industry experience in software engineering
  • Proficiency with Python and shell scripting
  • A high degree of comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool necessary
  • Experience developing reproducible analyses
  • A balance of strengths in building and operationalizing

Nice to have:

  • Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel perf tuning)
  • Experience with visualization of large data centers and networks
  • Expertise with network operations and tooling
  • Expertise with power management and stabilization

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Offers Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Software Engineer, Frontier Systems

Senior / Staff Software Engineer (Database)

Our database technology sits at the heart of the Materialize product—a product t...
Location
United States, New York
Salary:
164050.00 - 250000.00 USD / Year
Materialize
Expiration Date
Until further notice
Requirements
  • Several years of experience developing software
  • Passionate about distributed systems and/or databases
  • Excited to learn Rust if not already using it
  • Pride in owning work end-to-end
  • Ability to write clear design docs and well-documented code
  • Love solving hard problems in service of the customer
  • Excited about working at the intersection of frontier academic research and a venture-backed startup
Job Responsibility
  • Design and deliver improvements to the Database, with an eye on correctness, reliability, and performance
  • Own projects end-to-end, from early stage design to holding the pager
  • Debug and resolve complex distributed systems issues, sometimes directly with customers
  • Continually improve system reliability, observability, and automation
  • Collaborate across your team, with Product, with Field Eng, and all other stakeholders to align on direction, carefully prioritize, and build the best product for our users
  • Share your work through mentorship, demos, blog posts, and any other relevant channels
What we offer
  • Equity
Employment Type
Fulltime

Software Engineer, Frontier Systems - Power Management

As a Software Engineer on the Frontier Systems team focused on power management,...
Location
United States, San Francisco
Salary:
295000.00 - 445000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • 7+ years of software engineering experience with a focus on solving large-scale, system-level challenges
  • Strong proficiency in Python and familiarity with automation and scripting tools (e.g., shell scripting)
  • Experience with distributed systems to efficiently aggregate and analyze streaming data
  • Knowledge of electrical engineering concepts including digital signal processing, power systems, Fast Fourier Transforms, or related areas
  • Experience in system-level investigations and development of automated solutions to address power management, fault detection, and remediation
  • Strong analytical skills and the ability to dig into noisy data (experience with SQL, PromQL, Pandas, etc.)
  • Comfort working with both hardware and software teams to solve multidisciplinary problems
Job Responsibility
  • Develop and implement system-level and software-level solutions to optimize power usage in large-scale supercomputers, ensuring efficient and reliable operations
  • Build automation to monitor power consumption patterns during training workloads and design algorithms to stabilize these fluctuations, preventing issues with grid reliability
  • Work with researchers and engineers to design tools for real-time monitoring, detection, and remediation of power-related hardware and system faults
  • Collaborate cross-functionally to translate complex electrical system requirements into code, while driving continuous improvements in power management solutions
  • Drive the development of power throttling mechanisms at the IT system level to dynamically adjust power usage based on workload demands and infrastructure limitations
  • Collaborate with hardware design teams to integrate system-level power control requirements into IT hardware design, ensuring seamless coordination between software-driven power management and hardware capabilities
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type
Fulltime

Senior Software Engineer, Backend — Frontier Data

The Frontier Data team builds the data and systems that power Scale’s most advan...
Location
United States, San Francisco; New York
Salary:
216200.00 - 270250.00 USD / Year
Scale
Expiration Date
Until further notice
Requirements
  • 5+ years of full-time software engineering experience (post-graduation)
  • Strong backend engineering fundamentals: distributed systems, API design, data modeling, and production reliability
  • Strong experience with Docker and containerized development/production environments (building images, debugging, and operating container-based services)
  • Demonstrated ability to ship quickly in ambiguous, fast-changing environments (high-growth startup experience is a plus)
  • Experience building systems that scale: queues, async processing, workflow engines, data pipelines, or similar
  • Comfort working close to AI/ML systems (production experience welcome; curiosity and strong fundamentals also valued)
  • Proficiency with SQL and modern database-backed application development
Job Responsibility
  • Own major backend systems for frontier agentic data products, driving projects from early exploration through production deployment
  • Build scalable services and pipelines that support agent workflows (e.g., coding, tool-use orchestration, GUI/computer-use tasks), with strong reliability and observability
  • Architect modular, reusable backend systems that adapt to evolving product needs while maintaining scalability, reliability, and clean interfaces
  • Operate in high-ambiguity environments: break down open-ended problems, propose approaches, and execute with speed and clarity
  • Partner cross-functionally with product, research/ML, and infrastructure teams to define requirements and ship impactful systems
  • Improve system performance and cost efficiency through thoughtful architecture, profiling, and iterative optimization
  • Raise the engineering bar through design reviews, code reviews, and pragmatic best practices
What we offer
  • Comprehensive health, dental and vision coverage
  • retirement benefits
  • a learning and development stipend
  • generous PTO
  • equity grant
  • commuter stipend
Employment Type
Fulltime

Member of Technical Staff, AI Multimodal - MAI Superintelligence Team

At Microsoft AI, we are on a mission to train the world’s most capable AI fronti...
Location
Switzerland, Zürich
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND experience in business analytics, data science, software development, data modelling or data engineering work
  • OR Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND experience in business analytics, data science, software development, or data engineering work
  • OR equivalent experience
  • Expertise in multimodal research with a strong publishing track record
  • Proven expertise in areas of interest, evidenced by an exceptional publication track record and/or significant technical leadership in high-impact projects
  • Strong analytical skills, attention to detail, and a commitment to data-driven decision-making
  • Experience with and/or in-depth understanding of large-scale distributed systems
  • Ability to work collaboratively in a fast-paced, innovative environment
Job Responsibility
  • Develop algorithms, design model architectures, conduct experiments, champion measurement and evaluation, innovate datasets and data pipelines
  • Improve training and deployment efficiency, paying careful attention to detail, persevering, and learning from everyone’s attempts whether successful or not
  • Follow a rigorous data-driven approach grounded in meticulous ablation studies and scientific analysis
  • Innovate and iterate over ideas, prototypes, and product
  • Collaborate closely with teams on infrastructure, data engineering, pre-training, post-training, and product feedback
  • Advance the AI frontier responsibly
  • Embody our culture and values
Employment Type
Fulltime

Member of Technical Staff, AI Multimodal

At Microsoft AI, we are on a mission to train the world’s most capable AI fronti...
Location
United Kingdom, London
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND experience in business analytics, data science, software development, data modelling or data engineering work
  • OR Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND experience in business analytics, data science, software development, or data engineering work
  • OR equivalent experience
  • Expertise in multimodal research with a strong publishing track record
  • Proven expertise in areas of interest, evidenced by an exceptional publication track record and/or significant technical leadership in high-impact projects
  • Strong analytical skills, attention to detail, and a commitment to data-driven decision-making
  • Experience with and/or in-depth understanding of large-scale distributed systems
  • Ability to work collaboratively in a fast-paced, innovative environment
Job Responsibility
  • Develop algorithms, design model architectures, conduct experiments, champion measurement and evaluation, innovate datasets and data pipelines
  • Improve training and deployment efficiency, paying careful attention to detail, persevering, and learning from everyone’s attempts whether successful or not
  • Follow a rigorous data-driven approach grounded in meticulous ablation studies and scientific analysis
  • Innovate and iterate over ideas, prototypes, and product
  • Collaborate closely with teams on infrastructure, data engineering, pre-training, post-training, and product feedback
  • Advance the AI frontier responsibly
  • Embody our culture and values
Employment Type
Fulltime

Software Engineer, Research - Human Data

OpenAI’s mission is to ensure that artificial general intelligence (AGI) benefit...
Location
United States; United Kingdom, San Francisco; London
Salary:
230000.00 - 385000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • Strong software engineering fundamentals
  • Experience building production systems at scale
  • Enjoy full-stack development with end-to-end ownership
  • Motivated by high-impact collaboration with research teams and solving novel, ambiguous problems
  • Excited to shape how AI systems learn from human preferences and reflect a broad range of human values
  • Care deeply about inclusive tooling and building systems that enhance model safety, reliability, and usefulness
Job Responsibility
  • Build and maintain robust full-stack systems for feedback collection, data labeling, and evaluation pipelines, while maintaining high levels of security
  • Translate experimental alignment research into scalable production infrastructure, including inference and model training stacks
  • Design and iterate on user-facing tools and backend services to support high-quality data workflows
  • Partner with researchers, engineers, and program leads to shape feedback loops and model interaction paradigms
  • Drive infrastructure improvements that enable faster iteration and scaling across OpenAI’s frontier models, from internal research tooling all the way to production ChatGPT
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type
Fulltime

Software Engineer, Frontier Clusters Infrastructure

This role blends distributed systems engineering with hands-on infrastructure wo...
Location
United States, San Francisco
Salary:
230000.00 - 490000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • Experience as an infrastructure, systems, or distributed systems engineer in large-scale or high-availability environments
  • Strong knowledge of Kubernetes internals, cluster scaling patterns, and containerized workloads
  • Proficiency in cloud infrastructure concepts (compute, networking, storage, security) and in automating cluster or data center operations
Job Responsibility
  • Spin up and scale large Kubernetes clusters, including automation for provisioning, bootstrapping, and cluster lifecycle management
  • Build software abstractions that unify multiple clusters and present a seamless interface to training workloads
  • Own node bring-up from bare metal through firmware upgrades, ensuring fast, repeatable deployment at massive scale
  • Improve operational metrics such as reducing cluster restart times (e.g., from hours to minutes) and accelerating firmware or OS upgrade cycles
  • Integrate networking and hardware health systems to deliver end-to-end reliability across servers, switches, and data center infrastructure
  • Develop monitoring and observability systems to detect issues early and keep clusters stable under extreme load
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type
Fulltime

Software Engineer, Platform Systems

The Platform Systems team at OpenAI operates at the intersection of cutting-edge...
Location
United States, San Francisco
Salary:
310000.00 - 460000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • Care deeply about performance, stability, and observability in distributed systems
  • Enjoy finding and fixing issues in large-scale systems and automating operational workflows
  • Have experience writing low-level software where system details matter
  • Understand hardware, operating systems, networking, concurrency, and distributed systems
  • Have a background in high-performance computing or low-level systems engineering
  • Are excited to work on critical infrastructure that powers frontier AI research
Job Responsibility
  • Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs
  • Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior
  • Improve observability, reliability, and performance across OpenAI’s training platform
  • Debug and resolve issues in complex, high-throughput distributed systems
  • Collaborate with systems, infrastructure, and research teams to evolve platform capabilities
  • Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type
Fulltime