This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
The Platform Systems team at OpenAI operates at the intersection of cutting-edge AI and large-scale distributed systems. We build the engineering and research infrastructure required to train OpenAI’s flagship models on some of the world’s largest, custom-built supercomputers. Our team develops core model training software and works deep in the stack - spanning collective communication, compute efficiency, parallelism strategies, fault tolerance, failure detection, and observability. The systems we build are foundational to OpenAI’s research velocity, enabling reliable, efficient training at frontier scale.
Job Responsibility:
Design and build distributed failure detection, tracing, and profiling systems for large-scale AI training jobs
Develop tooling to identify slow, faulty, or misbehaving nodes and provide actionable visibility into system behavior
Improve observability, reliability, and performance across OpenAI’s training platform
Debug and resolve issues in complex, high-throughput distributed systems
Collaborate with systems, infrastructure, and research teams to evolve platform capabilities
Extend and adapt failure detection systems or tracing systems to support new training paradigms and workloads
Requirements:
Care deeply about performance, stability, and observability in distributed systems
Enjoy finding and fixing issues in large-scale systems and automating operational workflows
Have experience writing low-level software where system details matter
Understand hardware, operating systems, networking, concurrency, and distributed systems
Have a background in high-performance computing or low-level systems engineering
Are excited to work on critical infrastructure that powers frontier AI research