Training Performance Engineer

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

250000.00 - 445000.00 USD / Year

Job Description:

As a Training Performance Engineer, you’ll drive efficiency improvements across our distributed training stack. You’ll analyze large-scale training runs, identify utilization gaps, and design optimizations that push the boundaries of throughput and uptime. This role blends deep systems understanding with practical performance engineering: analyzing GPU kernel performance and collective communication throughput, investigating I/O bottlenecks, and sharding our models so we can train them at massive scale. You’ll help ensure that our clusters are running at peak performance, enabling OpenAI to train larger, more capable models with the same compute budget.

Job Responsibility:

  • Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage
  • Optimize GPU utilization and throughput for large-scale distributed model training
  • Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance
  • Implement model graph transforms to improve end-to-end throughput
  • Build tooling to monitor and visualize MFU, throughput, and uptime across clusters
  • Partner with researchers to ensure new model architectures scale efficiently during pre-training
  • Contribute to infrastructure decisions that improve reliability and efficiency of large training jobs

Requirements:

  • Love optimizing performance and digging into systems to understand how every layer interacts
  • Have strong programming skills in Python and C++ (Rust or CUDA a plus)
  • Have experience running distributed training jobs on multi-GPU systems or HPC clusters
  • Enjoy debugging complex distributed systems and measuring efficiency rigorously
  • Have exposure to frameworks like PyTorch, JAX, or TensorFlow and an understanding of how large-scale training loops are built
  • Are comfortable collaborating across teams and translating raw profiling data into practical engineering improvements

Nice to have:

  • Familiarity with NCCL, MPI, or UCX communication libraries
  • Experience with large-scale data loading and checkpointing systems
  • Prior work on training runtime, distributed scheduling, or ML compiler optimization

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Training Performance Engineer

Sr. Service Training Engineer

As we transition from research and development to full-scale manufacturing, we a...
Location:
United States, Palo Alto
Salary:
90000.00 - 150000.00 USD / Year
1X Technologies
Expiration Date:
Until further notice
Requirements:
  • 6+ years of experience in technical training, field service enablement, or related roles in robotics, industrial equipment, aerospace, or complex electromechanical systems
  • Experience creating and delivering training content, including hands-on instruction and digital materials
  • Strong technical troubleshooting skills and familiarity with common service tools and workflows
  • Clear, confident communication and presentation skills across in-person and remote formats
  • Strong organizational skills and attention to detail
  • Willingness to travel (up to 30%) to support onsite training and field operations
Job Responsibility:
  • Develop and maintain training materials and curricula for internal technicians, external partners, and customers, including classroom, digital, and hands-on content
  • Deliver training sessions both onsite and virtually, ensuring consistent messaging and high knowledge retention
  • Build and manage the certification process for field technicians, including assessments, recertification, and tracking
  • Collaborate with Engineering and Product teams to stay ahead of design changes and incorporate updates into training programs
  • Support the setup and maintenance of training environments and rigs, including demo units and fault injection setups
  • Manage and administer training content in the Learning Management System (LMS), ensuring accessibility and compliance
  • Analyze learner performance and field data (e.g., first-time fix rate, MTTR) to improve training outcomes and impact
  • Contribute to the development of field documentation, including job aids and quick reference guides
  • Participate in field visits and service calls as needed to stay close to real-world service conditions and collect training insights
What we offer:
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
Employment Type:
Full-time

Structural Engineer in Training

We're expanding our Florida team and looking for a Structural Engineer in Traini...
Location:
United States, Jacksonville
Salary:
Not provided
RimePro Inc
Expiration Date:
Until further notice
Requirements:
  • B.S. and M.S. in Civil Engineering (Structural emphasis preferred)
  • EI certification or ability to obtain
  • 2+ years of structural design experience
  • Strong analytical and problem-solving skills
  • Strong written and verbal communication skills
  • Detail-oriented with a knack for staying organized and on task
  • Prior experience with FDOT projects and MicroStation
Job Responsibility:
  • Perform basic analysis and design calculations for bridge and structural elements
  • Develop detailed structural drawings and design packages
  • Prepare well-organized and reviewable calculation packages
  • Support task delivery within schedule and budget
  • Collaborate with Project Managers and senior engineers for ongoing technical guidance
  • Contribute to the success of FDOT and municipal infrastructure projects
What we offer:
  • Insurance
  • Retirement plans
  • Wellness programs
  • Tuition reimbursement for job-related courses
  • Funding for training, committee work, professional organization memberships, and licenses/certifications
  • Flexible work schedules and hours, including work-from-home options
  • Generous Paid Time Benefits (PTB)
  • Ten days of paid parental leave for birth, adoption, or foster placement
  • Opportunities for community service, student scholarships, and matching gift opportunities
Employment Type:
Full-time

Release Train Engineer

Reinventing Geospatial (RGi) is a leading expert in geospatial solutions for Def...
Location:
United States, Chantilly; St. Louis; Gaithersburg; Denver
Salary:
Not provided
Reinventing Geospatial
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s in Computer Science, Software Engineering, or related field with 12–15 years of relevant experience, or Master’s with 10–13 years
  • SAFe RTE, SPC, or equivalent Agile certification
  • 10+ years in Agile program delivery, including 5+ years as a Release Train Engineer or equivalent
  • Proven experience facilitating Agile/SAFe ceremonies across time zones and distributed teams
  • Deep knowledge of SAFe, Lean-Agile, and DevSecOps practices
  • Proficient with Jira, Confluence, and related Agile tools
  • Strong leadership, facilitation, and communication skills across technical and business teams
  • Experience integrating security and DevOps pipelines in high-assurance environments
  • Skilled in risk identification and mitigation to ensure program success
  • Active Top Secret clearance with an ability to obtain SCI access and willingness to obtain CI Polygraph
Job Responsibility:
  • Serve as a servant leader and coach for the Agile Release Train (ART), ensuring alignment with SAFe and Lean-Agile principles
  • Facilitate key ceremonies including PI Planning, ART Syncs, and Inspect & Adapt workshops across distributed teams
  • Lead people management activities — performance reviews, professional development, mentorship, and career growth — while maintaining engagement and morale
  • Partner with Product Owners, System Architects, and Scrum Masters to refine features, prioritize work, remove impediments, and optimize delivery flow
  • Track and report ART metrics and progress to Leidos leadership and NGA stakeholders
  • Champion continuous improvement by identifying process gaps, implementing Lean-Agile practices, and coaching teams on SAFe principles
  • Manage program risks, resolve conflicts, and foster a culture of trust, transparency, and collaboration
  • Mentor Scrum Masters and Agile team members on facilitation, risk management, and DevSecOps practices
  • Collaborate with NGA stakeholders to align delivery goals with mission priorities and compliance requirements
What we offer:
  • 100% paid employee healthcare & dental insurance
  • Paid parental leave
  • 401k with matching
  • Escalating vacation time
  • Referral bonuses
  • Tuition reimbursement
  • Professional development training
  • Free beverages and snacks
  • Weekly catered lunches and breakfast on Fridays
Employment Type:
Full-time

High Performance Computing Hardware Engineer

Provide technology consulting to external customers and internal project teams. ...
Location:
United States, Aberdeen
Salary:
105500.00 - 243000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Top Secret Clearance Required
  • 4+ years of professional experience
  • Bachelor of Arts/Science or equivalent degree in computer science or related area of study
  • Without a degree, 7+ years of relevant professional experience
  • Security+ Certification required
  • Linux+ Certification required
  • Extensive Linux based hardware troubleshooting and diagnostics experience
  • Ability to work in a multi-technology environment
  • Ability to diagnose complex technical problems to their root cause
  • Self-starter who can work independently without supervision
Job Responsibility:
  • Break-fix experience required
  • Reports daily to and works physically at the Customer Site
  • Accountable for meeting and maintaining customer's SLA (Service Level Agreement)
  • Engages in technical problem solving across multiple technologies
  • Owns and drives service tickets including ordering parts for needed repairs
  • Gathers data, performs analysis, and escalates problems to higher-level product support groups
  • Performs daily hardware diagnostics and repairs
  • Responsible for verifying and implementing detailed technical solutions to problems
  • Participates as part of a team and maintains good relationships with team members and customers
  • Collects and determines data from appropriate sources to assist in determining customer needs and requirements
What we offer:
  • 10K Sign-On Bonus
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Career development programs
  • Unconditional inclusion environment
  • Flexible work management
Employment Type:
Full-time

High Performance Computing Hardware Engineer

High Performance Computing Hardware Engineer role requiring Top Secret clearance...
Location:
United States, Dayton
Salary:
78700.00 - 181200.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Top Secret security clearance
  • 4+ years of professional experience
  • Bachelor's degree in computer science or related field (or 7+ years total experience without degree)
  • Security+ Certification
  • Linux+ Certification (required before start date)
  • Extensive Linux-based hardware troubleshooting and diagnostics experience
  • Breakfix experience
  • Ability to work independently and within a team environment
  • Ability to diagnose complex technical problems to root cause
  • Professional communication skills with customers and internal teams
Job Responsibility:
  • Reports daily to and works physically at customer site
  • Accountable for meeting and maintaining customer SLA
  • Engages in technical problem solving across multiple technologies
  • Owns and drives service tickets including ordering parts for repairs
  • Gathers data, performs analysis, and escalates problems to higher-level support
  • Performs daily hardware diagnostics and repairs
  • Verifies and implements detailed technical solutions
  • Maintains good relationships with team members and customers
  • Collects data to determine customer needs and requirements
  • Responds to requests for technical information
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
Employment Type:
Full-time

High Performance Computing Hardware Engineer

The role involves providing technology consulting to external customers and inte...
Location:
United States, Aberdeen
Salary:
101900.00 - 234500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 8+ years of professional experience and a Bachelor of Arts/Science or equivalent degree in computer science or related area of study (or 11+ years of experience without a degree)
  • CompTIA Security+, CompTIA Linux+ certifications
  • Specialized compute, network, and storage operating systems training
  • Ability to troubleshoot hardware issues
  • Ability to maintain accurate onsite inventory levels
  • Detailed understanding of architectural dependencies of technologies in customer IT environments
  • Strong technical communication skills
  • Ability to adapt consulting style for different situations
  • Understanding of market dynamics and commercial issues
  • Ability to manage and mentor teams
Job Responsibility:
  • Troubleshooting and repairing hardware issues daily
  • Tracking and documenting hardware repairs
  • Opening, tracking, and closing part cases
  • Returning replaced and defective parts
  • Attending weekly internal and client calls
  • Creating, monitoring, and closing support cases
  • Maintaining availability reports to track SLA performance
  • Managing on-call schedules for 24/7 contracts
  • Hardware and system installation in new systems
  • Maintaining accurate inventory levels
What we offer:
  • Comprehensive suite of physical, financial, and emotional wellbeing benefits
  • Career development programs
  • Inclusive work environment
Employment Type:
Full-time

High Performance Compute Hardware Engineer

Responsible for providing technical support to our client by maintaining the cor...
Location:
United States, Vicksburg
Salary:
78700.00 - 181200.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Top Secret Clearance, TS/SCI preferred
  • Security+ and Linux+ certification
  • Must be a self-starter who is able to work independently, without supervision, and within a team environment
  • Have extensive Linux-based hardware troubleshooting and diagnostics experience
  • Able to communicate prognosis and impact with both the customer and HPE teams
  • Ability to work in a multi-technology environment with the ability to diagnose complex technical problems to their root cause
  • Able to communicate with internal and external senior management confidently and demonstrate the professionalism of the job family
  • 4+ years of professional experience and a Bachelor of Arts/Science or equivalent degree in computer science or related area of study; without a degree, three additional years of relevant professional experience (7+ years in total)
Job Responsibility:
  • Hardware break/fix experience required
  • Reports daily to, and works physically at, the Customer Site
  • Accountable for meeting and maintaining customer’s SLA (Service Level Agreement)
  • Engages in technical problem solving across multiple technologies
  • Owns and drives service tickets, including the ordering of parts for needed repairs
  • Gathers data, performs analysis, and escalates problems to higher-level product support groups and appropriate management to ensure timely resolution of system or customer issues
  • Performs daily hardware diagnostics and repairs
  • Responsible for verifying and implementing the detailed technical solution to the problem
  • Participates as part of a team and maintains good relationships with team members and customers
  • Collects and determines data from appropriate sources to assist in determining customer needs and requirements
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Full-time

Senior Manager, Performance AI/ML Network Deployment Engineering

The Senior Manager, DC GPU Advanced Forward Deployment and Systems Engineering i...
Location:
United States, Santa Clara
Salary:
210400.00 - 315600.00 USD / Year
AMD
Expiration Date:
Until further notice
Requirements:
  • Expertise in networking and performance optimization for large-scale AI/ML networks, including network, compute, and storage cluster design, modelling, analytics, performance tuning, convergence, and scalability improvements
  • Solid, hands-on expertise in at least one of three domains preferred: compute, network, or storage
  • Experience in working with large customers such as Cloud Service Providers and global enterprise customers
  • Proven leadership in engaging customers across diverse technical disciplines through avenues such as proofs of concept, competitive evaluations, and early field trials
  • Direct experience working with large customers, with the ability to operate with a sense of urgency, own problems, and resolve them
  • Demonstrated leadership in network architecture, with hands-on experience in RoCEv2 design, VXLAN-EVPN, BGP, and lossless fabrics
  • Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends
  • Extensive hands-on Network deployment expertise and proven track record of delivering large projects on time. Cisco, Juniper or Arista experience is preferred
  • Direct, co-development/deployment experience in working with strategic customers/partners in bringing solutions to market
  • Excellent communication with audiences ranging from engineers to mid-level management to C-level executives
Job Responsibility:
  • Collaborate with strategic customers on scalable designs involving compute, networking, and storage environments; work with industry partners and internal teams to accelerate the deployment and adoption of various AI/ML models
  • Engage in system-level triage and at-scale debug of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability
  • Drive the ramp of Instinct-based large scale AI datacenter infrastructure based on NPI base platform hardware with ROCm, scaling up to pod and cluster level, leveraging the best in network architecture for AI/ML workloads
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins
  • Provide domain-specific knowledge to other groups at AMD and share lessons learned to drive continuous improvement
  • Engage with AMD product groups to drive resolution of application and customer issues
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences