
Supercomputing Engineer (Test)

Location: United States, San Jose
Salary: 150000.00 - 275000.00 USD / Year
Job Posted: February 18, 2026

Job Description

We are seeking a highly motivated and detail-oriented Supercomputing Engineer (Test) to join our team. This team plays a critical role in ensuring the reliability and stability of our highest-performance inference server hardware and software. As a Software Engineer on this team, you will design, develop, and execute comprehensive burn-in test suites, analyze test results, and collaborate with hardware and software engineering teams at Etched and our ODM partners to identify and resolve potential issues. You will be at the forefront of ensuring our server products meet the highest quality standards before they reach our customers.

Job Responsibility

  • Test Development: Design, develop, and implement automated burn-in test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of system operation, including boot sequences, root-of-trust, system management, workload deployment, and performance
  • Test Execution: Execute burn-in tests on server hardware, monitor system performance and health, and analyze test results
  • Failure Analysis: Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
  • Collaboration: Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
  • Test Infrastructure: Contribute to the development and maintenance of the burn-in testing infrastructure, including portable test environments and automation tools runnable in any environment
  • Documentation: Create and maintain comprehensive documentation for test plans, test cases, and test results
  • Performance Analysis: Analyze system performance metrics to identify potential bottlenecks and areas for optimization
  • Continuous Improvement: Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the burn-in testing process

Requirements

  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Experience with software testing methodologies and tools
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs

Nice to have

  • Experience with hardware burn-in testing or reliability testing
  • Knowledge of server virtualization and cloud computing concepts
  • Experience with performance testing and benchmarking tools
  • Familiarity with hardware diagnostic tools and techniques
  • Experience with containerization technologies (e.g., Docker, Kubernetes)
  • Experience with CI/CD pipelines
  • Knowledge of low-level hardware communication protocols (I2C, etc.)
  • Experience with data analysis tools and techniques

What we offer

  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office

Similar Jobs for Supercomputing Engineer (Test)

8 matching positions

Supercomputing Test Software Engineer

We are seeking highly motivated and detail-oriented Software Engineers to join o...
Location: Taiwan, Taipei
Salary: Not provided
Company: Etched
Expiration Date: Until further notice
Requirements
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Experience with software testing methodologies and tools
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility
  • Design, develop, and implement automated supercomputing test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of system operation, including boot sequences, root-of-trust, system management, workload deployment, and performance
  • Execute tests on server hardware, monitor system performance and health, and analyze test results
  • Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
  • Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
  • Contribute to the development and maintenance of the supercomputing testing infrastructure, including portable test environments and automation tools runnable in any environment
  • Create and maintain comprehensive documentation for test plans, test cases, and test results
  • Analyze system performance metrics to identify potential bottlenecks and areas for optimization
  • Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the testing process
What we offer
  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits
  • Fulltime

Supercomputing Engineer (Network)

We are seeking highly motivated and skilled Supercomputing Engineers (Network) t...
Location: United States, San Jose
Salary: 150000.00 - 275000.00 USD / Year
Company: Etched
Expiration Date: Until further notice
Requirements
  • Proficiency in C/C++
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE
  • Experience with zero-copy networking, RDMA verbs and memory registration
  • Familiarity with queue pairs, completion queues, and transport types
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
Job Responsibility
  • Design, develop, and implement RDMA-based network peering, supporting high-bandwidth, low-latency communication across PCIe nodes within and across racks
  • Develop tests that qualify host processors (x86), NICs, TORs and device network interfaces for high performance
  • Furnish burn-in teams with tests that represent both real-world use cases and workloads for device-to-device networking, and extreme-load stress testing
  • Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads
What we offer
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
  • Fulltime

Supercomputing Software Engineer

We are seeking a highly skilled and motivated Supercomputing Software Engineer t...
Location: Taiwan, Taipei
Salary: Not provided
Company: Etched
Expiration Date: Until further notice
Requirements
  • Proficiency in C/C++ or Python
  • Strong understanding of BIOS and BMC firmware architectures
  • Experience with server boot processes
  • Knowledge of root-of-trust and security principles
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Experience with advanced system logging and diagnostic tools
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility
  • Integrate and maintain BIOS and BMC firmware, ensuring robust and efficient server boot processes
  • Measure and Tune System Performance Configuration: Analyze DRAM timings, PCIe configurations, power state transitions, etc. to ensure high performance and maximal reliability
  • Root of Trust and Security: Validate security features, including root of trust mechanisms, to protect system integrity and data security
  • Advanced System Logging and Diagnostics: Design and implement advanced system logging and diagnostic capabilities to facilitate efficient troubleshooting and performance analysis
  • Data Center Orchestration Integration: Integrate and optimize node-level data center orchestration technologies, such as Kubernetes and Docker, into the system software stack
  • System Validation and Testing: Develop and execute comprehensive test plans to validate system software functionality, stability, and performance
  • Collaboration and Troubleshooting: Collaborate with hardware and software teams to diagnose and resolve complex system-level issues
What we offer
  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits
  • Fulltime

Software Engineer II

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...
Location: United States, Multiple Locations
Salary: 100600.00 - 199000.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • These requirements include, but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility
  • Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, Mean Time to Resolve on flagship supercomputers
  • Manages operations of supercomputers by responding quickly to mitigate issues
  • Implements systemic solutions and mitigations to more complex issues impacting performance or functionality of supercomputers
  • Reviews and writes incident postmortem and presents insights that drive changes to reduce or eliminate incidents
  • Independently improves troubleshooting guides (TSGs), wikis, tests, and telemetry, adding comprehensive observability and monitoring capabilities
  • Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of supercomputers while also driving consistency in monitoring and operations at scale
  • Fulltime

Mechanical Engineer, Hardware Systems

OpenAI’s Hardware organization develops silicon and system-level solutions desig...
Location: United States, San Francisco
Salary: 266000.00 - 455000.00 USD / Year
Company: OpenAI
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Mechanical Engineering, or a related field (Master’s preferred)
  • 7+ years of industry experience in mechanical design, simulation, and validation for IT hardware
  • Proficiency in CAD tools and FEA
  • Strong understanding of mechanical design, material properties, manufacturing methods, and test procedures
  • Ability to analyze experimental data, identify root causes of mechanical or manufacturing issues, and implement corrective actions
  • Strong problem-solving skills, with the ability to perform first-principles calculations to support design decisions
  • Self-driven, detail-oriented, and able to work independently or collaboratively in cross-functional engineering environments
Job Responsibility
  • Lead mechanical design for an AI supercomputer product for data center applications
  • Collaborate with the cross-functional team to design and optimize thermal solutions for data center hardware, including chips, power modules, and system-level cooling architectures
  • Collaborate with cross-functional teams to integrate thermal management strategies into hardware design, from concept to mass production
  • Design and validate mechanical systems, including chassis, enclosures, cooling systems, and high-power connections, ensuring alignment with performance and reliability standards
  • Perform 3D modeling, FEA, tolerance analysis, and prototyping, ensuring manufacturability and adherence to strict quality requirements
  • Conduct mechanical testing, including vibration, shock, and thermal cycling, to ensure long-term reliability under extreme operating conditions
  • Identify and evaluate new technologies and methodologies to improve mechanical and thermal performance in product design, and contribute to the development of new products and technology by providing expertise in mechanical design
  • Develop appropriate specifications and test procedures as necessary to ensure desired reliability and performance of electronic equipment, collaborate, supervise, mentor, and motivate suppliers to design and develop excellent products, and mentor engineers in all aspects of concept development, analysis, design, and specification
  • Lead the production and launch of new products and technologies and ensure compliance with relevant quality standards and regulations
  • Partner with suppliers and manufacturers to support the development of custom mechanical and thermal components and ensure quality in production
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

Component and Product Quality Engineer, Interconnects

OpenAI's Hardware organization builds supercompute platforms from silicon and bo...
Location: United States, San Francisco
Salary: 123000.00 - 285000.00 USD / Year
Company: OpenAI
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in quality engineering, manufacturing quality, supplier quality, or reliability for interconnects or high-speed hardware used in servers, networking, storage, or high-performance compute systems
  • Hands-on experience with high-speed copper interconnect products: connectors and/or cable assemblies
  • Strong command of problem-solving and quality tools: 8D, 5-Whys, Fishbone, PFMEA/control plans, SPC/MSA (gauge R&R), and change control
  • Ability to read and interpret mechanical drawings, GD&T basics, and electrical/interface specifications
  • Experience driving supplier/CM improvements (audits, scorecards, CAPA) and managing nonconformance/MRB workflows
  • Clear written and verbal communication skills; ability to drive alignment across internal teams and external partners
  • Experience with cable manufacturing and assembly processes (wire treatment, resistance welding/laser welding, crimping, overmolding/injection molding, braiding/shielding, plating, and automated test)
  • Ability to travel internationally and work effectively across time zones with ODM/JDM and supplier partners
  • To comply with U.S. export control laws and regulations, candidates for this role may need to meet certain legal status requirements as provided in those laws and regulations.
Job Responsibility
  • Own end-to-end quality for high-speed interconnect hardware across the product lifecycle: early design influence, supplier/contract manufacturer readiness, qualification, ramp, and fleet quality in lab and data center environments
  • Be the quality lead for advanced interconnect components and assemblies, including high-speed copper cables, cable cartridges, patch panels, backplane/cable-backplane solutions, high-speed connectors, and related electro-mechanical interfaces
  • Partner closely with electrical, mechanical, SI/PI, systems, reliability, operations, and external vendors to prevent escapes and drive rapid, data-driven containment and corrective action
  • Drive quality-by-design: participate in design reviews, DFM/DFx, tolerance stacks, material and plating selections, connector mating strategy, strain relief, and assembly methods to reduce variation and field failures
  • Define and track quality and reliability metrics (DPPM, yield, escapes, RMA/FRACAS trends, Cpk/Ppk where applicable) for interconnects across NPI and mass production
  • Build and execute qualification strategies for cables/connectors/patch panels (mechanical, environmental, electrical, and reliability), including test coverage, sample plans, clear pass/fail criteria, defining installation criteria and processes, optics termination quality management and setting fiber standards criteria
  • Partner with engineering and operations to drive smooth ramp: risk assessments, pilot build learnings, change control, and readiness reviews (EVT/DVT/PVT/MP or equivalent phases)
  • Own supplier and CM performance management: scorecards, audits (process and quality system), and follow-up to close findings with verified effectiveness
  • Work with suppliers to improve manufacturing throughput, stability, and yields for cable and connector assembly processes
  • Lead rapid containment and root-cause investigations for failures found during bring-up, system integration tests, reliability testing, and fleet deployments
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

Electrical Engineer - Systems

The Scaling team works on the design of our AI supercomputers, doing everything ...
Location: United States, San Francisco
Salary: 225000.00 - 445000.00 USD / Year
Company: OpenAI
Expiration Date: Until further notice
Requirements
  • At least 10 years of industry experience, including experience designing hardware systems for data center applications
  • Experience in EE circuit design, CPU/GPU/TPU hardware system design, board bring-up, system design, integration, and system bring-up
  • Master's degree in Electrical Engineering, Computer Engineering, Physics, a related field, or equivalent practical experience
  • Have a strong bias toward action, and won’t take no for an answer
  • Have experience with and good knowledge of system design in the mechanical and product design areas, from xPU and board level to rack and data center level
  • Have a strong intrinsic desire to learn and fill in missing skills, and an equally strong talent for sharing that information clearly and concisely with others
  • Are comfortable with ambiguity and rapidly changing conditions
Job Responsibility
  • Work on Machine Learning/AI hardware systems projects to craft the solutions for current and future data center deployments
  • Work with the hardware team on test vehicles and bring-up board design, evaluating end-to-end system design trade-offs
  • Lead EE circuit-level design, and work with power, thermal, and mechanical teams to drive AI hardware system design
  • Work with product teams to ensure that goals are met with systems, and work with ASIC/FPGA, Software, and Verification teams to ensure proper verification of features
  • Work with the manufacturing teams to ensure that designs are manufacturable and ready for volume production, and with the field teams to support systems that are deployed in the data center
  • Gather system requirements, define architecture, execute hardware design, and product validation
  • Lead the system bring-up, validation, NPI, deployment, and sustaining of hardware solutions
  • Work cross-functionally with Hardware, Software, Mechanical, Thermal, Validation, Manufacturing, and external vendors
  • Drive system development from concept through production
  • Lead debug and root cause analysis of deployed systems
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

Signal Integrity Engineer

OpenAI’s Hardware organization develops silicon and system-level solutions desig...
Location: United States, San Francisco
Salary: 225000.00 - 445000.00 USD / Year
Company: OpenAI
Expiration Date: Until further notice
Requirements
  • At least 10 years of industry experience
  • Experience designing hardware systems and SerDes testing for data center applications
  • Experience with and good knowledge of system design in SI areas, at the chip, SerDes, board, and rack levels
  • Experience with PCB, connector and cable design
Job Responsibility
  • Lead system signal integrity (SI) design for an AI supercomputer product for data center applications
  • Collaborate with chip, package, board, rack, and system engineers and design partners to drive system SI design and develop innovative interconnect and high-speed technologies
  • Identify and evaluate new technologies and methodologies to improve signal and power integrity in product design, and contribute to the development of new products and technology by providing expertise in signal integrity
  • Perform simulation and modeling to identify and troubleshoot signal integrity issues
  • Lead system interconnect design, bring-up, and qualification
  • As the scope of the role and team grows, understand and influence roadmaps for hardware partners for our datacenter networks, racks, and buildings
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime