Software Engineer, Fleet Hardware Health

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230000.00 - 490000.00 USD / Year

Job Description:

As a software engineer on the Fleet Hardware team, you will be responsible for the reliability and uptime of OpenAI’s entire compute fleet. Minimizing hardware failures is key to research training progress and stable services, as even a single hardware hiccup can cause significant disruptions. With increasingly large supercomputers, the stakes continue to rise. Being at the forefront of technology means that we are often the first to troubleshoot these state-of-the-art systems at scale. This is a unique opportunity to work with cutting-edge technologies and devise innovative solutions to maintain the health and efficiency of our supercomputing infrastructure. Our team empowers strong engineers with a high degree of autonomy and ownership, as well as the ability to effect change. This role requires a keen focus on comprehensive system-level investigations and the development of automated solutions. We want people who go deep on problems, investigate as thoroughly as possible, and build automation for detection and remediation at scale.

Job Responsibility:

  • Build and maintain automation systems for provisioning and managing server fleets
  • Develop tools to monitor server health, performance, and lifecycle events
  • Collaborate with clusters, networking, and infrastructure teams
  • Partner with external operators to ensure a high level of quality
  • Identify and fix performance bottlenecks and inefficiencies
  • Continuously improve automation to reduce manual work

Requirements:

  • Experience managing large-scale server environments
  • A balance of strengths in building and operationalizing
  • Proficiency in Python, Go, or similar languages
  • Strong Linux, networking, and server hardware knowledge
  • Comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool

Nice to have:

  • Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel perf tuning)
  • Knowledge of hardware management protocols (e.g., IPMI, Redfish)
  • High-performance computing (HPC) or distributed systems experience
  • Prior experience developing, managing, or designing hardware
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana)

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Offers Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Software Engineer, Fleet Hardware Health

Head of Factory Software & Vehicle Diagnostics

At Mach Industries, we are designing and building the world’s most advanced prod...
Location:
United States, Huntington Beach
Salary:
170000.00 - 250000.00 USD / Year
Mach Industries
Expiration Date
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Electrical Engineering, Mechanical Engineering, Robotics, or a related engineering field
  • 10+ years of experience in software engineering, controls engineering, automated testing, manufacturing software, or firmware systems
  • 5+ years of experience leading technical teams or engineering organizations
  • Proven track record of shipping production-critical software or managing large-scale automated test systems
  • Strong systems-level thinking across software, hardware, networks, and manufacturing workflows
  • Deep expertise in one or more of the following areas: Manufacturing Execution Systems (MES); PLCs and industrial controls (Beckhoff, Siemens, B&R, Allen-Bradley); firmware flashing, bootloaders, and secure signing; vehicle or embedded diagnostics (CAN, LIN, Ethernet, UDS, custom protocols); test automation frameworks, HIL systems, or end-of-line validation
Job Responsibility:
  • Build, lead, and develop a cross-functional organization including manufacturing software engineers, controls engineers, firmware-tools engineers, diagnostic engineers, and data platform engineers
  • Own the end-to-end architecture for factory software, including MES-like systems, build tracking, serialization, and production workflow tools
  • Lead the design and implementation of vehicle flashing, commissioning, and diagnostics pipelines inside the factory
  • Define and deliver the vehicle–factory communication framework (CAN, Ethernet, custom protocols, telemetry ingestion, APIs)
  • Oversee all end-of-line (EOL) software, automated test stands, calibration systems, and data acquisition infrastructure
  • Partner with manufacturing engineering, build engineering, design engineering, flight software, and NPI teams to integrate software tools and processes across the vehicle lifecycle
  • Implement highly reliable production-grade software with redundancy, observability, and real-time data health monitoring
  • Drive rapid iteration and continuous improvement of test coverage, automation, and factory efficiency
  • Own uptime, performance, and correctness for all software critical to production and test operations
  • Establish coding standards, architecture strategies, and long-range roadmaps for factory software and diagnostics
What we offer:
  • Offers Equity
  • healthcare
  • dental and vision plans
  • retirement savings
  • paid time off
  • funds for continuing education, training, and career growth
  • Full-time

Datacenter Hardware Operations Technician, AI Compute Infrastructure - Stargate

OpenAI, in close collaboration with our capital partners, is embarking on a jour...
Location:
United States, Abilene, Texas
Salary:
86400.00 - 228000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements:
  • 7+ years of experience in datacenter hardware operations, hardware engineering, or large-scale server maintenance
  • At least 2 years in a senior or lead technician capacity
  • Deep knowledge of high-density server hardware, including x86 platforms, GPUs, storage devices, and power/cooling systems
  • Excel at diagnosing hardware issues, coordinating complex repairs, and maintaining strong working relationships across organizations
  • Comfortable setting technical expectations and validating outcomes through collaboration, not direct management
  • Adapt quickly to changing operational conditions and enjoy solving problems at both the strategic and on-site levels
  • Communicate clearly and build trust across partner teams, vendors, and internal engineering stakeholders
  • Willing to be based full-time at a partner-operated campus
Job Responsibility:
  • Serve as OpenAI’s primary on-site hardware contact, collaborating with Oracle teams and vendors to plan and coordinate maintenance, repairs, and lifecycle activities
  • Share technical requirements and verify that work performed supports OpenAI’s compute needs and agreed quality targets
  • Coordinate schedules, spare-parts planning, and issue escalation with partner teams to minimize downtime and keep operations running smoothly
  • Work with OpenAI fleet-health engineers to translate software-detected issues into on-site hardware actions in partnership with Oracle
  • Track hardware trends and provide joint recommendations with partner teams for design or operational improvements
  • Prepare documentation and runbooks that capture joint best practices and can be applied at additional campuses
  • Offer technical guidance and context to partner personnel while respecting their operational ownership
  • Collaborate with supply-chain teams to plan spares and manage hardware lifecycle activities
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Full-time

Staff Fleet Operations Robot Captain, Atlas

The Fleet Operations Robot Captain is the mission lead for robot health, uptime,...
Location:
United States, Waltham
Salary:
116000.00 - 160000.00 USD / Year
Boston Dynamics
Expiration Date
Until further notice
Requirements:
  • Bachelor’s degree in Robotics, Computer Science, Electrical Engineering, or a related technical field preferred
  • 5 years of experience in robotics operations, systems engineering, or a highly technical site reliability role
  • Proven track record of troubleshooting complex electromechanical systems
  • Experience in a high-intensity R&D environment where hardware availability is a critical path to project success
  • Strong understanding of robotics systems, including software stacks, controls, networking, and hardware interfaces
  • Proficiency in reading and interpreting Real-Time (RT) code, Linux system logs, and networking telemetry to troubleshoot controls
  • Experience with issue tracking and work management systems such as JIRA, and with data-driven monitoring tools and dashboards such as Tableau
  • Ability to stay calm and organized while managing multiple high-priority streams of work
  • Strong understanding of safety best practices in high-energy or mobile robotics environments
Job Responsibility:
  • Act as the first responder for robot hardware and software issues, performing system-level triage to localize failures and determine appropriate escalation paths
  • Analyze logs, telemetry, and system behavior to narrow issues to specific subsystems or components
  • Create high-quality issue reports and tickets that enable subject-matter experts to root cause problems efficiently
  • Identify and escalate fleet-level blockers or safety risks that impact robot availability or operational continuity
  • Maintain the real-time health, configuration, and connectivity of all robots in the fleet
  • Distinguish between hardware availability and software-induced downtime to ensure accurate reporting and planning
  • Ensure fleet status, availability, and constraints are clearly communicated to stakeholders to avoid conflicts or idle time
  • Own and prioritize the daily experiment and test queue to maximize robot utilization during core operating hours
  • Balance risk by sequencing experiments appropriately, enabling progress while minimizing extended downtime
  • Provide clear visibility into what is running now, what is next, and expected completion timelines for all active users
What we offer:
  • medical
  • dental
  • vision
  • 401(k)
  • paid time off
  • annual bonus structure
  • Full-time

Tech Support Admin Assoc

Ensures HIT environment is functioning at an optimal level and end-users’ needs ...
Location:
United States
Salary:
26.55 - 39.85 USD / Hour
Advocate Health Care
Expiration Date
Until further notice
Requirements:
  • Bachelor's degree or equivalent experience in Computer Science, Information Systems, Engineering, or related field
  • 1 year of experience in a complex IT operating environment
  • Must have excellent interpersonal and technical skills
  • Must troubleshoot problems accurately and possess a positive attitude to deal with a variety of situations
  • Excellent written and oral communication skills
  • Strong customer service skills
  • Excellent problem-solving skills
  • Ability to lift up to 35 pounds without assistance
  • Must be able to travel to various Advocate Health locations
  • 24/7 on-call support required
Job Responsibility:
  • Ensures HIT environment is functioning at an optimal level and end-users’ needs are met
  • Provides end-user support including training on new device capability, basic device operations, accessing network resources, and device security best practices
  • Follows procedures for managing tickets including timely acknowledgment, appropriate communication with complete resolution documentation
  • Ensures that technology problems and service requests are resolved in accordance with service level objectives and information systems policies
  • Contribute to Endpoint Fleet Technology Management for any Advocate Health Device
  • Analysis, configuration, installation, maintenance, upgrades and retirement of hardware and software which requires 24/7 support in addition to business travel
  • Ensures compliance with Advocate Health HIT standards
  • Preemptively identifies variations from standards and potential technology issues
  • Participate in root cause analysis, engage other Advocate Health Teams and vendors, as needed, to resolve identified issues
  • Perform software installation using defined Advocate Health processes and tools
What we offer:
  • Paid Time Off programs
  • Health and welfare benefits such as medical, dental, vision, life, and Short- and Long-Term Disability
  • Flexible Spending Accounts for eligible health care and dependent care expenses
  • Family benefits such as adoption assistance and paid parental leave
  • Defined contribution retirement plans with employer match and other financial wellness programs
  • Educational Assistance Program
  • Full-time

IT Technical Support Administrator

Location:
United States
Salary:
26.55 - 39.85 USD / Hour
Advocate Health Care
Expiration Date
Until further notice
Requirements:
  • Bachelor's degree or equivalent experience in Computer Science, Information Systems, Engineering, or related field
  • 1 year of experience in a complex IT operating environment
  • Must have excellent interpersonal and technical skills
  • Must troubleshoot problems accurately and possess a positive attitude to deal with a variety of situations
  • Excellent written and oral communication skills
  • Strong customer service skills
  • Excellent problem-solving skills
  • Ability to lift up to 35 pounds without assistance
  • Must be able to travel to various Advocate Health locations
  • 24/7 on-call support required
Job Responsibility:
  • Ensures HIT environment is functioning at an optimal level and end-users’ needs are met
  • Provides end-user support including training on new device capability, basic device operations, accessing network resources, and device security best practices
  • Follows procedures for managing tickets including timely acknowledgment, appropriate communication with complete resolution documentation
  • Ensures that technology problems and service requests are resolved in accordance with service level objectives and information systems policies
  • Contribute to Endpoint Fleet Technology Management for any Advocate Health Device
  • Analysis, configuration, installation, maintenance, upgrades and retirement of hardware and software which requires 24/7 support in addition to business travel
  • Ensures compliance with Advocate Health HIT standards
  • Preemptively identifies variations from standards and potential technology issues
  • Participate in root cause analysis, engage other Advocate Health Teams and vendors, as needed, to resolve identified issues
  • Perform software installation using defined Advocate Health processes and tools
What we offer:
  • Paid Time Off programs
  • Health and welfare benefits such as medical, dental, vision, life, and Short- and Long-Term Disability
  • Flexible Spending Accounts for eligible health care and dependent care expenses
  • Family benefits such as adoption assistance and paid parental leave
  • Defined contribution retirement plans with employer match and other financial wellness programs
  • Educational Assistance Program
  • Full-time

IT Technical Support Administrator

This job description indicates the general nature and level of work expected of ...
Location:
United States
Salary:
26.55 - 39.85 USD / Hour
Advocate Health Care
Expiration Date
Until further notice
Requirements:
  • Bachelor's degree or equivalent experience in Computer Science, Information Systems, Engineering, or related field
  • 1 year of experience in a complex IT operating environment
  • Must have excellent interpersonal and technical skills
  • Must troubleshoot problems accurately and possess a positive attitude to deal with a variety of situations
  • Excellent written and oral communication skills
  • Strong customer service skills
  • Excellent problem-solving skills
  • Ability to lift up to 35 pounds without assistance
  • Must be able to travel to various Advocate Health locations
  • 24/7 on-call support required
Job Responsibility:
  • Ensures HIT environment is functioning at an optimal level and end-users’ needs are met
  • Provides end-user support including training on new device capability, basic device operations, accessing network resources, and device security best practices
  • Follows procedures for managing tickets including timely acknowledgment, appropriate communication with complete resolution documentation
  • Ensures that technology problems and service requests are resolved in accordance with service level objectives and information systems policies
  • Contribute to Endpoint Fleet Technology Management for any Advocate Health Device
  • Analysis, configuration, installation, maintenance, upgrades and retirement of hardware and software which requires 24/7 support in addition to business travel
  • Ensures compliance with Advocate Health HIT standards
  • Preemptively identifies variations from standards and potential technology issues
  • Participate in root cause analysis, engage other Advocate Health Teams and vendors, as needed, to resolve identified issues
  • Perform software installation using defined Advocate Health processes and tools
What we offer:
  • Paid Time Off programs
  • Health and welfare benefits such as medical, dental, vision, life, and Short- and Long-Term Disability
  • Flexible Spending Accounts for eligible health care and dependent care expenses
  • Family benefits such as adoption assistance and paid parental leave
  • Defined contribution retirement plans with employer match and other financial wellness programs
  • Educational Assistance Program
  • Full-time

Firmware Engineer II

Microsoft Silicon and Cloud Hardware Infrastructure Engineering (SCHIE) is the t...
Location:
United States, Redmond
Salary:
100600.00 - 199000.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, or Python OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Perform system-level debugging and troubleshooting to identify and resolve complex hardware/firmware-related issues
  • Collaborate with cross-functional teams including hardware architects and engineers, software developers, validation & integration and product managers to define firmware requirements and specifications
  • Utilize AI and machine learning data science techniques to uncover actionable insights and enhance overall fleet health
  • Stay up to date with industry trends and advancements in cloud firmware technologies and provide recommendations for improvement
  • Full-time