
Training Performance Engineer

United States, San Francisco 250000.00 - 445000.00 USD / Year · Job Posted February 21, 2026

Job Description

As a Training Performance Engineer, you’ll drive efficiency improvements across our distributed training stack. You’ll analyze large-scale training runs, identify utilization gaps, and design optimizations that push the boundaries of throughput and uptime. This role blends deep systems understanding with practical performance engineering: analyzing GPU kernel performance and collective communication throughput, investigating I/O bottlenecks, and sharding our models so we can train them at massive scale. You’ll help ensure that our clusters run at peak performance, enabling OpenAI to train larger, more capable models with the same compute budget.

Job Responsibility

  • Profile end-to-end training runs to identify performance bottlenecks across compute, communication, and storage
  • Optimize GPU utilization and throughput for large-scale distributed model training
  • Collaborate with runtime and systems engineers to improve kernel efficiency, scheduling, and collective communication performance
  • Implement model graph transforms to improve end-to-end throughput
  • Build tooling to monitor and visualize model FLOPs utilization (MFU), throughput, and uptime across clusters
  • Partner with researchers to ensure new model architectures scale efficiently during pre-training
  • Contribute to infrastructure decisions that improve the reliability and efficiency of large training jobs

Requirements

  • Love optimizing performance and digging into systems to understand how every layer interacts
  • Have strong programming skills in Python and C++ (Rust or CUDA a plus)
  • Have experience running distributed training jobs on multi-GPU systems or HPC clusters
  • Enjoy debugging complex distributed systems and measuring efficiency rigorously
  • Have exposure to frameworks like PyTorch, JAX, or TensorFlow and an understanding of how large-scale training loops are built
  • Are comfortable collaborating across teams and translating raw profiling data into practical engineering improvements

Nice to have

  • Familiarity with NCCL, MPI, or UCX communication libraries
  • Experience with large-scale data loading and checkpointing systems
  • Prior work on training runtime, distributed scheduling, or ML compiler optimization

What we offer

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Offers Equity


Similar Jobs for Training Performance Engineer (8 matching positions)

Sr. Service Training Engineer

As we transition from research and development to full-scale manufacturing, we a...
Location: United States, Palo Alto
Salary: 90000.00 - 150000.00 USD / Year
Company: 1X Technologies
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in technical training, field service enablement, or related roles in robotics, industrial equipment, aerospace, or complex electromechanical systems
  • Experience creating and delivering training content, including hands-on instruction and digital materials
  • Strong technical troubleshooting skills and familiarity with common service tools and workflows
  • Clear, confident communication and presentation skills across in-person and remote formats
  • Strong organizational skills and attention to detail
  • Willingness to travel (up to 30%) to support onsite training and field operations
Job Responsibility
  • Develop and maintain training materials and curricula for internal technicians, external partners, and customers, including classroom, digital, and hands-on content
  • Deliver training sessions both onsite and virtually, ensuring consistent messaging and high knowledge retention
  • Build and manage the certification process for field technicians, including assessments, recertification, and tracking
  • Collaborate with Engineering and Product teams to stay ahead of design changes and incorporate updates into training programs
  • Support the setup and maintenance of training environments and rigs, including demo units and fault injection setups
  • Manage and administer training content in the Learning Management System (LMS), ensuring accessibility and compliance
  • Analyze learner performance and field data (e.g., first-time fix rate, MTTR) to improve training outcomes and impact
  • Contribute to the development of field documentation, including job aids and quick reference guides
  • Participate in field visits and service calls as needed to stay close to real-world service conditions and collect training insights
What we offer
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Full-time

Structural Engineer in Training

We're expanding our Florida team and looking for a Structural Engineer in Traini...
Location: United States, Jacksonville
Salary: Not provided
Company: RimePro Inc
Expiration Date: Until further notice
Requirements
  • B.S. and M.S. in Civil Engineering (Structural emphasis preferred)
  • EI certification or ability to obtain
  • 2+ years of structural design experience
  • Strong analytical and problem-solving skills
  • Strong written and verbal communication skills
  • Detail-oriented with a knack for staying organized and on task
  • Prior experience with FDOT projects and MicroStation
Job Responsibility
  • Perform basic analysis and design calculations for bridge and structural elements
  • Develop detailed structural drawings and design packages
  • Prepare well-organized and reviewable calculation packages
  • Support task delivery within schedule and budget
  • Collaborate with Project Managers and senior engineers for ongoing technical guidance
  • Contribute to the success of FDOT and municipal infrastructure projects
What we offer
  • Insurance
  • Retirement plans
  • Wellness programs
  • Tuition reimbursement for job-related courses
  • Funding for training, committee work, professional organization memberships, and licenses/certifications
  • Flexible work schedules and hours, including work-from-home options
  • Generous Paid Time Benefits (PTB)
  • Ten days of paid parental leave for birth, adoption, or foster placement
  • Opportunities for community service, student scholarships, and matching gift opportunities
  • Full-time

High Performance Computing Hardware Engineer

Provide technology consulting to external customers and internal project teams. ...
Location: United States, Aberdeen
Salary: 105500.00 - 243000.00 USD / Year
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Top Secret Clearance Required
  • 4+ years of professional experience
  • Bachelor of Arts/Science or equivalent degree in computer science or related area of study
  • Without a degree, 7+ years of relevant professional experience
  • Security+ Certification required
  • Linux+ Certification required
  • Extensive Linux based hardware troubleshooting and diagnostics experience
  • Ability to work in a multi-technology environment
  • Ability to diagnose complex technical problems to their root cause
  • Self-starter who can work independently without supervision
Job Responsibility
  • Break/fix experience required
  • Reports daily to and works physically at the Customer Site
  • Accountable for meeting and maintaining customer's SLA (Service Level Agreement)
  • Engages in technical problem solving across multiple technologies
  • Owns and drives service tickets including ordering parts for needed repairs
  • Gathers data, performs analysis, and escalates problems to higher-level product support groups
  • Performs daily hardware diagnostics and repairs
  • Responsible for verifying and implementing detailed technical solutions to problems
  • Participates as part of a team and maintains good relationships with team members and customers
  • Collects data from appropriate sources to help determine customer needs and requirements
What we offer
  • 10K Sign-On Bonus
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Career development programs
  • Unconditional inclusion environment
  • Flexible work management
  • Full-time

High Performance Computing Hardware Engineer

High Performance Computing Hardware Engineer role requiring Top Secret clearance...
Location: United States, Dayton
Salary: 78700.00 - 181200.00 USD / Year
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Top Secret security clearance
  • 4+ years of professional experience
  • Bachelor's degree in computer science or related field (or 7+ years total experience without degree)
  • Security+ Certification
  • Linux+ Certification (required before start date)
  • Extensive Linux-based hardware troubleshooting and diagnostics experience
  • Break/fix experience
  • Ability to work independently and within a team environment
  • Ability to diagnose complex technical problems to root cause
  • Professional communication skills with customers and internal teams
Job Responsibility
  • Reports daily to and works physically at customer site
  • Accountable for meeting and maintaining customer SLA
  • Engages in technical problem solving across multiple technologies
  • Owns and drives service tickets including ordering parts for repairs
  • Gathers data, performs analysis, and escalates problems to higher-level support
  • Performs daily hardware diagnostics and repairs
  • Verifies and implements detailed technical solutions
  • Maintains good relationships with team members and customers
  • Collects data to determine customer needs and requirements
  • Responds to requests for technical information
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Full-time

High Performance Compute Hardware Engineer

Responsible for providing technical support to our client by maintaining the cor...
Location: United States, Vicksburg
Salary: 78700.00 - 181200.00 USD / Year
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Top Secret Clearance, TS/SCI preferred
  • Security+ and Linux+ certification
  • Must be a self-starter who is able to work independently, without supervision, and within a team environment
  • Have extensive Linux-based hardware troubleshooting and diagnostics experience
  • Able to communicate prognosis and impact with both the customer and HPE teams
  • Ability to work in a multi-technology environment with the ability to diagnose complex technical problems to their root cause
  • Able to communicate with internal and external senior management confidently and demonstrate the professionalism of the job family
  • 4+ years of professional experience and a Bachelor of Arts/Science or equivalent degree in computer science or related area of study
  • Without a degree, three additional years of relevant professional experience (7+ years in total)
Job Responsibility
  • Hardware break/fix experience required
  • Reports daily to, and works physically at, the Customer Site
  • Accountable for meeting and maintaining customer’s SLA (Service Level Agreement)
  • Engages in technical problem solving across multiple technologies
  • Owns and drives service tickets, including the ordering of parts for needed repairs
  • Gathers data, performs analysis, and escalates problems to higher-level product support groups and appropriate management to ensure timely resolution of system or customer issues
  • Performs daily hardware diagnostics and repairs
  • Responsible for verifying and implementing the detailed technical solution to the problem
  • Participates as part of a team and maintains good relationships with team members and customers
  • Collects data from appropriate sources to help determine customer needs and requirements
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Full-time

Senior Manager, Performance AI/ML Network Deployment Engineering

The Senior Manager, DC GPU Advanced Forward Deployment and Systems Engineering i...
Location: United States, Santa Clara
Salary: 210400.00 - 315600.00 USD / Year
Company: AMD
Expiration Date: Until further notice
Requirements
  • Expertise in networking and performance optimization for large-scale AI/ML networks, including network, compute, storage cluster design, modelling, analytics, performance tuning, convergence, scalability improvements
  • Solid, hands-on expertise in at least one of three domains (compute, network, or storage) preferred
  • Experience in working with large customers such as Cloud Service Providers and global enterprise customers
  • Proven leadership in engaging customers across diverse technical disciplines through avenues such as proofs of concept, competitive evaluations, and early field trials
  • Direct experience working with large customers, with the ability to operate with a sense of urgency, own problems, and resolve them
  • Demonstrated leadership in network architecture, with hands-on experience in RoCEv2 design, VXLAN-EVPN, BGP, and lossless fabrics
  • Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends
  • Extensive hands-on network deployment expertise and a proven track record of delivering large projects on time; Cisco, Juniper, or Arista experience preferred
  • Direct, co-development/deployment experience in working with strategic customers/partners in bringing solutions to market
  • Excellent communication skills for audiences ranging from engineers to mid-level management to C-level executives
Job Responsibility
  • Collaborate with strategic customers on scalable designs spanning compute, networking, and storage environments, and work with industry partners and internal teams to accelerate the deployment and adoption of various AI/ML models
  • Engage in system-level triage and at-scale debugging of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability
  • Drive the ramp of Instinct-based, large-scale AI datacenter infrastructure built on NPI base platform hardware with ROCm, scaling up to the pod and cluster level and leveraging the best network architectures for AI/ML workloads
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins
  • Provide domain-specific knowledge to other groups at AMD and share lessons learned to drive continuous improvement
  • Engage with AMD product groups to drive resolution of application and customer issues
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences

Structural Designer

This is an entry-level position involved in the application of engineering funda...
Location: United States, Honolulu
Salary: 68000.00 - 83000.00 USD / Year
Company: Baldridge & Associates Structural Engineering
Expiration Date: Until further notice
Requirements
  • Bachelor of Engineering degree in Civil/Structural Engineering from an accredited university
  • Master of Engineering degree in Civil/Structural Engineering preferred
  • Completion of the Engineering-in-Training exam is required
  • 0-2 years of previous design experience
Job Responsibility
  • Performs routine design tasks such as member design, load calculation, and some shop drawing review
  • Inputs data into pre-written computer programs
  • Reviews and understands the output
  • Researches code issues, structural systems, etc., under the direction of senior-level engineers
  • Expected to become proficient in the use of a computer for design and CAD/Revit applications
  • May assist in the preparation of drawings for designs he/she has originated or from sketches provided by others
  • Maintains a neat and organized work area that allows efficient access to project information required by other team members

Structural Designer

This is an entry-level position involved in the application of engineering funda...
Location: United States, Chicago
Salary: 68000.00 - 83000.00 USD / Year
Company: Baldridge & Associates Structural Engineering
Expiration Date: Until further notice
Requirements
  • Bachelor of Engineering degree in Civil/Structural Engineering from an accredited university
  • Master of Engineering degree in Civil/Structural Engineering preferred
  • Completion of the Engineering-in-Training exam is required
  • 0-2 years of previous design experience
Job Responsibility
  • Performs routine design tasks such as member design, load calculation, and some shop drawing review
  • Inputs data into pre-written computer programs
  • Reviews and understands the output
  • Researches code issues, structural systems, etc., under the direction of senior-level engineers
  • Expected to become proficient in the use of a computer for design and CAD/Revit applications
  • May assist in the preparation of drawings for designs he/she has originated or from sketches provided by others
  • Maintains a neat and organized work area that allows efficient access to project information required by other team members