Software Engineer, Fleet Infrastructure

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230000.00 - 490000.00 USD / Year

Job Description:

This role will support the fleet infrastructure team at OpenAI. The fleet team focuses on running the world’s largest, most reliable, and frictionless GPU fleet to support OpenAI’s general purpose model training and deployment. Work on this team ranges from:

  • Maximizing GPUs doing useful work by building user-friendly scheduling and quota systems
  • Running a reliable and low-maintenance platform by building push-button automation for Kubernetes cluster provisioning and upgrades
  • Supporting research workflows with service frameworks and deployment systems
  • Ensuring fast model startup times through high-performance snapshot delivery across blob storage down to hardware caching
  • Much more!

As an engineer within Fleet Infrastructure, you will design, write, deploy, and operate infrastructure systems for model deployment and training on one of the world’s largest GPU fleets. The scale is immense, the timelines are tight, and the organization is moving fast; this is an opportunity to shape a critical system in support of OpenAI's mission to advance AI capabilities responsibly.

Job Responsibility:

  • Design, implement and operate components of our compute fleet including job scheduling, cluster management, snapshot delivery, and CI/CD systems
  • Interface with researchers and product teams to understand workload requirements
  • Collaborate with hardware, infrastructure, and business teams to provide a high-utilization, high-reliability service

Requirements:

  • Experience with hyperscale compute systems
  • Strong programming skills
  • Experience working in public clouds (especially Azure)
  • Experience working in Kubernetes
  • Execution-focused mentality paired with a rigorous focus on user requirements

Nice to have:

  • Understanding of AI/ML workloads

What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Software Engineer, Fleet Infrastructure

Software Engineer - Configuration

Figure is an AI robotics company developing autonomous general-purpose humanoid ...
Location:
United States, San Jose
Salary:
180000.00 - 260000.00 USD / Year
Figure
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science or related field
  • At least 5 years of experience writing production software
  • Mastery of designing scalable software systems
  • Experience with modern C++ and Python
  • Experience working with complex configuration systems
Job Responsibility:
  • Architect, design, implement a configuration system for the robot, all of its subsystems and the overall robot fleet
  • Integrate the configuration system into Python and C++ codebases
  • Develop infrastructure and tooling around managing, distributing and verifying the configuration
  • Help us ensure that our robot ecosystem is stable, scalable and well tested in CI in all configuration permutations
Employment Type:
Fulltime

Software Engineer, Build Compute

CI/CD is the beating heart of Vercel. Developers & agents alike create over 1 mi...
Location:
Germany; United Kingdom
Salary:
Not provided
Vercel
Expiration Date:
Until further notice
Requirements:
  • 6+ years of relevant software engineering experience
  • Strong proficiency in at least one of JavaScript/TypeScript/Golang (Golang preferred)
  • Extended experience with Containers, Virtual Machines, Linux
  • Practical experience building, running and debugging distributed systems
  • Excellent problem solving and communication skills
  • An enthusiasm for digging into problems with unknown solutions
Job Responsibility:
  • Manage and improve our fleet of clusters, running hundreds of instances, deployed in every region where our customers deploy code
  • Writing Golang on a daily basis and using Terraform to provision our infrastructure
  • Rethinking the primitives of our infrastructure, working with virtual filesystems and Linux primitives
  • Building the underlying compute infrastructure that powers all of these builds at scale
  • Transforming the performance of builds
  • Working with open source authors to understand the requirements of their frameworks
What we offer:
  • Competitive compensation package, including equity
  • Inclusive Healthcare Package
  • Learn and Grow - we provide mentorship and send you to events that help you build your network and skills
  • Flexible Time Off
  • We will provide you the gear you need to do your role, and a WFH budget for you to outfit your space as needed
Employment Type:
Fulltime

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility:
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer:
  • medical, dental, and vision insurance
  • commuter benefits
  • flexible paid time off (PTO)
  • catered lunch
  • 401(k) matching
  • early-stage equity
Employment Type:
Fulltime

Staff Software Engineer - Backend

As the Staff Software Engineer for our SaaS platform team, you will be crucial i...
Location:
United States, Mountain View
Salary:
198000.00 - 225000.00 USD / Year
Cyngn
Expiration Date:
Until further notice
Requirements:
  • 10+ years of software development experience, with a strong focus on backend systems and distributed architectures
  • Extensive experience in building and scaling cloud-native SaaS platforms, preferably in the IoT or robotics domains
  • Expert-level proficiency in at least one of Python, Go, Java, or C++, with working knowledge of others
  • Deep understanding of cloud technologies and services (AWS, Azure, or GCP)
  • Proven experience with event-driven architectures and message queuing systems (e.g., Kafka, RabbitMQ, Apache Pulsar)
  • Strong background in database design and optimization, including both SQL and NoSQL solutions
  • Proficiency in developing scalable WebSocket-based real-time communication systems
  • Expertise in developing real-time data processing pipelines and analytics systems
  • Proficiency with containerization and orchestration technologies (Docker, Kubernetes)
  • Experience with infrastructure-as-code and CI/CD practices (e.g., Terraform, GitOps)
Job Responsibility:
  • Architect and lead the development of a sophisticated, cloud-native fleet management system capable of real-time control and monitoring of numerous autonomous vehicles
  • Design and implement scalable, distributed systems that can handle high-volume, real-time data processing and decision-making
  • Develop robust APIs and microservices to support integration with various autonomous vehicle platforms and customer systems
  • Create efficient algorithms for route optimization, task scheduling, and resource allocation across vehicle fleets
  • Implement advanced data analytics and machine learning capabilities to provide predictive maintenance, performance optimization, and business intelligence features
  • Ensure system reliability, security, and compliance with industry standards and regulations
  • Lead a team of skilled engineers, fostering a culture of innovation, code quality, and continuous improvement
  • Collaborate with product managers, UX designers, and customers to translate business requirements into technical solutions
  • Mentor junior developers and contribute to the technical growth of the engineering team
  • Participate in the entire software development lifecycle, from concept and design to testing, deployment, and maintenance
What we offer:
  • Health benefits (Medical, Dental, Vision, HSA and FSA (Health & Dependent Daycare), Employee Assistance Program, 1:1 Health Concierge)
  • Life, Short-term, and long-term disability insurance (Cyngn funds 100% of premiums)
  • Company 401(k)
  • Commuter Benefits
  • Flexible vacation policy
  • Remote or hybrid work opportunities
  • Sabbatical leave opportunity after five years with the company
  • Paid Parental Leave
  • Daily lunches for in-office employees
  • Monthly meal and tech allowances for remote employees
Employment Type:
Fulltime

Head of Factory Software & Vehicle Diagnostics

At Mach Industries, we are designing and building the world’s most advanced prod...
Location:
United States, Huntington Beach
Salary:
170000.00 - 250000.00 USD / Year
Mach Industries
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Electrical Engineering, Mechanical Engineering, Robotics, or a related engineering field
  • 10+ years of experience in software engineering, controls engineering, automated testing, manufacturing software, or firmware systems
  • 5+ years of experience leading technical teams or engineering organizations
  • Proven track record of shipping production-critical software or managing large-scale automated test systems
  • Strong systems-level thinking across software, hardware, networks, and manufacturing workflows
  • Deep expertise in one or more of the following areas: Manufacturing Execution Systems (MES); PLCs and industrial controls (Beckhoff, Siemens, B&R, Allen-Bradley); firmware flashing, bootloaders, and secure signing; vehicle or embedded diagnostics (CAN, LIN, Ethernet, UDS, custom protocols); test automation frameworks, HIL systems, or end-of-line validation
Job Responsibility:
  • Build, lead, and develop a cross-functional organization including manufacturing software engineers, controls engineers, firmware-tools engineers, diagnostic engineers, and data platform engineers
  • Own the end-to-end architecture for factory software, including MES-like systems, build tracking, serialization, and production workflow tools
  • Lead the design and implementation of vehicle flashing, commissioning, and diagnostics pipelines inside the factory
  • Define and deliver the vehicle–factory communication framework (CAN, Ethernet, custom protocols, telemetry ingestion, APIs)
  • Oversee all end-of-line (EOL) software, automated test stands, calibration systems, and data acquisition infrastructure
  • Partner with manufacturing engineering, build engineering, design engineering, flight software, and NPI teams to integrate software tools and processes across the vehicle lifecycle
  • Implement highly reliable production-grade software with redundancy, observability, and real-time data health monitoring
  • Drive rapid iteration and continuous improvement of test coverage, automation, and factory efficiency
  • Own uptime, performance, and correctness for all software critical to production and test operations
  • Establish coding standards, architecture strategies, and long-range roadmaps for factory software and diagnostics
What we offer:
  • Equity
  • healthcare
  • dental and vision plans
  • retirement savings
  • paid time off
  • funds for continuing education, training, and career growth
Employment Type:
Fulltime

Software Engineer, Fleet Management

The Fleet team at OpenAI supports the computing environment that powers our cutt...
Location:
United States, San Francisco
Salary:
230000.00 - 490000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • Strong software engineering skills with experience in large-scale infrastructure environments
  • Broad knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers)
  • Deep expertise in server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing)
  • Passionate about optimizing the performance and reliability of large compute fleets
  • Thrive in dynamic environments and are eager to solve complex infrastructure challenges
  • Value automation, efficiency, and continuous improvement in everything you build
Job Responsibility:
  • Design and build systems to manage both cloud and bare-metal fleets at scale
  • Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms
  • Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows
  • Automate infrastructure processes, reducing repetitive toil and improving system reliability
  • Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack
  • Continuously improve tools, automation, processes, and documentation to enhance operational efficiency
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Fulltime

Senior Software Engineer

As a Senior Backend Engineer in the Archer AI team, you will architect, develop,...
Location:
United States, San Jose
Salary:
144000.00 - 180000.00 USD / Year
Archer Aviation
Expiration Date:
Until further notice
Requirements:
  • Kubernetes, Docker, GitOps (Argo CD)
  • Layer 7 mesh networks (e.g. Linkerd or Istio), load balancers, East/West and North/South hot failovers, etc.
  • Infrastructure as code tooling (we use OpenTofu and Ansible)
  • Secrets management
  • Distributed tracing and observability stacks
  • Cost optimization and designing portable cloud-agnostic architectures
Job Responsibility:
  • Architect, develop, and scale the core services and infrastructure that power our cutting-edge AI products
  • Take significant vertical technical ownership of complex problems
  • Design and maintain scalable infrastructure for managing a fleet of IoT sensors and related infrastructure
  • Eventually building infrastructure and processes to target an SLO of 8 9s of availability
  • Design and maintain observability infrastructure
  • Collaborate with AI experts to get their systems deployed into production
  • Consider what we’ve built that might help and solve problems other teams at Archer may have
  • Write clean, maintainable, and well-documented code
  • Mentor and guide junior team members with diverse technical backgrounds, fostering a culture of engineering excellence and software development best practices.
Employment Type:
Fulltime

Member of Technical Staff, Full Stack - ML Efficiency & Observability

Microsoft AI is looking for a Member of Technical Staff - Full Stack Engineer, M...
Location:
United States, Mountain View
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 6+ years experience in business analytics, data science, software development, data modeling or data engineering work
  • OR Master’s Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 4+ year(s) experience in business analytics, data science, software development, or data engineering work
  • OR equivalent experience.
  • Experience with Capacity Management, Efficiency Management, ML Training and/or Inference
  • Solid expertise in JavaScript / TypeScript, React, HTML, CSS and browser internals
  • Solid understanding of web performance, accessibility, and cross‑browser compatibility
  • Experience with Development & Debugging with dev environments like Visual Studio or Visual Studio Code
  • Software development experience with Generative AI tools
  • Experience in leading technical projects and supporting architectural decisions with data.
Job Responsibility:
  • Design and develop features for our capacity management portal
  • Design and develop features to provide visibility into model performance and quality across our fleet
  • Partner with ML researchers and PMs to translate functional requirements into highly functional, intuitive and appealing interfaces
  • Integrate with backend APIs from schedulers to training frameworks to build visibility across the training life cycle
  • Explore, develop, and adapt new innovations to the software development process
  • Contribute to the development of internal tooling and infrastructure
  • Implement best software development practices to ensure code quality. Hold a high quality bar.
  • Embody our culture and values.
Employment Type:
Fulltime