Software Engineer, Fleet Infrastructure

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230000.00 - 490000.00 USD / Year

Job Description:

This role will support the Fleet Infrastructure team at OpenAI. The Fleet team focuses on running the world’s largest, most reliable, and frictionless GPU fleet to support OpenAI’s general-purpose model training and deployment. Work on this team includes:

  • Maximizing the share of GPUs doing useful work by building user-friendly scheduling and quota systems
  • Running a reliable, low-maintenance platform by building push-button automation for Kubernetes cluster provisioning and upgrades
  • Supporting research workflows with service frameworks and deployment systems
  • Ensuring fast model startup times through high-performance snapshot delivery, from blob storage down to hardware caching
  • Much more!

As an engineer within Fleet Infrastructure, you will design, write, deploy, and operate infrastructure systems for model deployment and training on one of the world’s largest GPU fleets. The scale is immense, the timelines are tight, and the organization is moving fast; this is an opportunity to shape a critical system in support of OpenAI's mission to advance AI capabilities responsibly.

Job Responsibility:

  • Design, implement and operate components of our compute fleet including job scheduling, cluster management, snapshot delivery, and CI/CD systems
  • Interface with researchers and product teams to understand workload requirements
  • Collaborate with hardware, infrastructure, and business teams to provide a high-utilization, high-reliability service

Requirements:

  • Experience with hyperscale compute systems
  • Strong programming skills
  • Experience working in public clouds (especially Azure)
  • Experience working in Kubernetes
  • Execution-focused mentality paired with a rigorous focus on user requirements

Nice to have:

  • Understanding of AI/ML workloads

What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Software Engineer, Fleet Infrastructure

Software Engineer - Configuration

Figure is an AI robotics company developing autonomous general-purpose humanoid ...
Location:
United States, San Jose
Salary:
180000.00 - 260000.00 USD / Year
Figure
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science or a related field
  • At least 5 years of experience writing production software
  • Mastery of designing scalable software systems
  • Experience with modern C++ and Python
  • Experience working with complex configuration systems
Job Responsibility:
  • Architect, design, and implement a configuration system for the robot, all of its subsystems, and the overall robot fleet
  • Integrate the configuration system into Python and C++ codebases
  • Develop infrastructure and tooling around managing, distributing and verifying the configuration
  • Help us ensure that our robot ecosystem is stable, scalable and well tested in CI in all configuration permutations
Employment Type:
Full-time

Software Engineer, Build Compute

CI/CD is the beating heart of Vercel. Developers & agents alike create over 1 mi...
Location:
Germany; United Kingdom
Salary:
Not provided
Vercel
Expiration Date:
Until further notice
Requirements:
  • 6+ years of relevant software engineering experience
  • Strong proficiency in at least one of JavaScript, TypeScript, or Golang (Golang preferred)
  • Extended experience with Containers, Virtual Machines, Linux
  • Practical experience building, running and debugging distributed systems
  • Excellent problem-solving and communication skills
  • An enthusiasm for digging into problems with unknown solutions
Job Responsibility:
  • Manage and improve our fleet of clusters, running hundreds of instances, deployed in every region where our customers deploy code
  • Writing Golang on a daily basis and using Terraform to provision our infrastructure
  • Rethinking the primitives of our infrastructure, working with virtual filesystems and Linux primitives
  • Building the underlying compute infrastructure that powers all of these builds at scale
  • Transforming the performance of builds
  • Working with open source authors to understand the requirements of their frameworks
What we offer:
  • Competitive compensation package, including equity
  • Inclusive Healthcare Package
  • Learn and Grow - we provide mentorship and send you to events that help you build your network and skills
  • Flexible Time Off
  • We will provide you the gear you need to do your role, and a WFH budget for you to outfit your space as needed
Employment Type:
Full-time

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility:
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer:
  • Medical, dental, and vision insurance
  • Commuter benefits
  • Flexible paid time off (PTO)
  • Catered lunch
  • 401(k) matching
  • Early-stage equity
Employment Type:
Full-time

Staff Software Engineer - Backend

As the Staff Software Engineer for our SaaS platform team, you will be crucial i...
Location:
United States, Mountain View
Salary:
198000.00 - 225000.00 USD / Year
Cyngn
Expiration Date:
Until further notice
Requirements:
  • 10+ years of software development experience, with a strong focus on backend systems and distributed architectures
  • Extensive experience in building and scaling cloud-native SaaS platforms, preferably in the IoT or robotics domains
  • Expert-level proficiency in at least one of Python, Go, Java, or C++, with working knowledge of others
  • Deep understanding of cloud technologies and services (AWS, Azure, or GCP)
  • Proven experience with event-driven architectures and message queuing systems (e.g., Kafka, RabbitMQ, Apache Pulsar)
  • Strong background in database design and optimization, including both SQL and NoSQL solutions
  • Proficiency in developing scalable WebSocket-based real-time communication systems
  • Expertise in developing real-time data processing pipelines and analytics systems
  • Proficiency with containerization and orchestration technologies (Docker, Kubernetes)
  • Experience with infrastructure-as-code and CI/CD practices (e.g., Terraform, GitOps)
Job Responsibility:
  • Architect and lead the development of a sophisticated, cloud-native fleet management system capable of real-time control and monitoring of numerous autonomous vehicles
  • Design and implement scalable, distributed systems that can handle high-volume, real-time data processing and decision-making
  • Develop robust APIs and microservices to support integration with various autonomous vehicle platforms and customer systems
  • Create efficient algorithms for route optimization, task scheduling, and resource allocation across vehicle fleets
  • Implement advanced data analytics and machine learning capabilities to provide predictive maintenance, performance optimization, and business intelligence features
  • Ensure system reliability, security, and compliance with industry standards and regulations
  • Lead a team of skilled engineers, fostering a culture of innovation, code quality, and continuous improvement
  • Collaborate with product managers, UX designers, and customers to translate business requirements into technical solutions
  • Mentor junior developers and contribute to the technical growth of the engineering team
  • Participate in the entire software development lifecycle, from concept and design to testing, deployment, and maintenance
What we offer:
  • Health benefits (Medical, Dental, Vision, HSA and FSA (Health & Dependent Daycare), Employee Assistance Program, 1:1 Health Concierge)
  • Life, short-term disability, and long-term disability insurance (Cyngn funds 100% of premiums)
  • Company 401(k)
  • Commuter Benefits
  • Flexible vacation policy
  • Remote or hybrid work opportunities
  • Sabbatical leave opportunity after five years with the company
  • Paid Parental Leave
  • Daily lunches for in-office employees
  • Monthly meal and tech allowances for remote employees
Employment Type:
Full-time

Head of Factory Software & Vehicle Diagnostics

At Mach Industries, we are designing and building the world’s most advanced prod...
Location:
United States, Huntington Beach
Salary:
170000.00 - 250000.00 USD / Year
Mach Industries
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Electrical Engineering, Mechanical Engineering, Robotics, or a related engineering field
  • 10+ years of experience in software engineering, controls engineering, automated testing, manufacturing software, or firmware systems
  • 5+ years of experience leading technical teams or engineering organizations
  • Proven track record of shipping production-critical software or managing large-scale automated test systems
  • Strong systems-level thinking across software, hardware, networks, and manufacturing workflows
  • Deep expertise in one or more of the following areas: Manufacturing Execution Systems (MES); PLCs and industrial controls (Beckhoff, Siemens, B&R, Allen-Bradley); firmware flashing, bootloaders, and secure signing; vehicle or embedded diagnostics (CAN, LIN, Ethernet, UDS, custom protocols); test automation frameworks, HIL systems, or end-of-line validation
Job Responsibility:
  • Build, lead, and develop a cross-functional organization including manufacturing software engineers, controls engineers, firmware-tools engineers, diagnostic engineers, and data platform engineers
  • Own the end-to-end architecture for factory software, including MES-like systems, build tracking, serialization, and production workflow tools
  • Lead the design and implementation of vehicle flashing, commissioning, and diagnostics pipelines inside the factory
  • Define and deliver the vehicle–factory communication framework (CAN, Ethernet, custom protocols, telemetry ingestion, APIs)
  • Oversee all end-of-line (EOL) software, automated test stands, calibration systems, and data acquisition infrastructure
  • Partner with manufacturing engineering, build engineering, design engineering, flight software, and NPI teams to integrate software tools and processes across the vehicle lifecycle
  • Implement highly reliable production-grade software with redundancy, observability, and real-time data health monitoring
  • Drive rapid iteration and continuous improvement of test coverage, automation, and factory efficiency
  • Own uptime, performance, and correctness for all software critical to production and test operations
  • Establish coding standards, architecture strategies, and long-range roadmaps for factory software and diagnostics
What we offer:
  • Equity
  • Healthcare
  • Dental and vision plans
  • Retirement savings
  • Paid time off
  • Funds for continuing education, training, and career growth
Employment Type:
Full-time

Software Engineer, Fleet Management

The Fleet team at OpenAI supports the computing environment that powers our cutt...
Location:
United States, San Francisco
Salary:
230000.00 - 490000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • Strong software engineering skills with experience in large-scale infrastructure environments
  • Broad knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers)
  • Deep expertise in server-level systems (e.g., systems, containerization, Chef, Linux kernels, firmware management, host routing)
  • Passionate about optimizing the performance and reliability of large compute fleets
  • Thrive in dynamic environments and are eager to solve complex infrastructure challenges
  • Value automation, efficiency, and continuous improvement in everything you build
Job Responsibility:
  • Design and build systems to manage both cloud and bare-metal fleets at scale
  • Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms
  • Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows
  • Automate infrastructure processes, reducing repetitive toil and improving system reliability
  • Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack
  • Continuously improve tools, automation, processes, and documentation to enhance operational efficiency
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time

Software Engineer - Cloud FinOps & Reliability

This is a foundational engineering position for a technical, data-driven expert ...
Location:
United States, Palo Alto
Salary:
120000.00 - 255000.00 USD / Year
Luma AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficient in Python for the purpose of scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant; you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers
Job Responsibility:
  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders
Employment Type:
Full-time

Software Engineer - Fleet Management

We're looking for a backend software engineer with strong data analysis skills t...
Location:
United States, San Mateo
Salary:
130000.00 - 280000.00 USD / Year
Verkada
Expiration Date:
Until further notice
Requirements:
  • BS/MS in Computer Science (or similar degree)
  • 3+ years of industry experience in distributed software engineering
  • Strong Python skills: Proficiency in Python for data analysis, particularly with libraries like pandas
  • SQL expertise: Experience writing complex SQL queries and queries for time-series analysis
  • Backend engineering fundamentals: Solid software engineering skills
  • Data pipeline experience: Familiarity with pipeline technologies like Kafka, Firehose, or Spark
  • Log analysis at scale: Experience with high-volume log analysis technologies such as OpenSearch, text clustering, or AI-based log analysis techniques
  • Timeseries databases: Experience working with timeseries databases and temporal data
  • Metrics & observability: Hands-on experience with Grafana or similar monitoring tools
  • Anomaly detection: Understanding of anomaly detection techniques and their practical application
Job Responsibility:
  • Build data pipelines: Design and implement data workflows using technologies like Kafka, Firehose, or Spark to process release metrics and device telemetry at scale
  • Develop analytical tools: Create Python-based analysis tools using pandas and SQL to identify release issues, detect anomalies, and measure fleet health
  • High-volume log analysis: Build systems to ingest, process, and analyze logs from millions of devices using technologies like OpenSearch, text clustering, and AI-based techniques
  • Create monitoring infrastructure: Develop Grafana dashboards and alerts that surface critical metrics and anomalies in real-time
  • Support release operations: Provide data-driven insights during releases, helping the team make informed decisions about rollout speed and risk
  • Design test infrastructure: Build test bench setups and CI pipelines that validate releases before they reach production
  • Query and optimize: Write efficient SQL queries against timeseries databases to extract insights from large-scale device data
What we offer:
  • Healthcare programs that can be tailored to personal health and financial well-being needs; premiums are 100% covered for the employee under at least one plan, and family premiums are 80% covered under all plans
  • Nationwide medical, vision and dental coverage
  • Health Savings Account (HSA) with annual employer contributions and Flexible Spending Account (FSA) with tax saving options
  • Expanded mental health support
  • Paid parental leave policy & fertility benefits
  • Time off to relax and recharge through our paid holidays, firmwide extended holidays, flexible PTO and personal sick time
  • Professional development stipend
  • Fertility Stipend
  • Wellness/fitness benefits
  • Healthy lunches provided daily
Employment Type:
Full-time