Supercomputing Engineer Job at Etched (San Jose)

Supercomputing Engineer (Test)

We are seeking highly motivated and detail-oriented Supercomputing Engineer (Tes...

Location

United States, San Jose

Salary:

150000.00 - 275000.00 USD / Year

Etched

Expiration Date

Until further notice

Requirements

Proficiency in at least one scripting language (e.g., Python, Bash, Go)
Experience with software testing methodologies and tools
Strong understanding of operating systems (Linux preferred) and server hardware architectures
Ability to analyze complex technical problems and provide effective solutions
Excellent communication and collaboration skills
Ability to work independently and as part of a team
Experience with version control systems (e.g., Git)
Experience with reading and interpreting hardware logs

Job Responsibility

Test Development: Design, develop, and implement automated burn-in test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of System Operation including: boot sequences, root-of-trust, system management, workload deployment and performance
Test Execution: Execute burn-in tests on server hardware, monitor system performance and health, and analyze test results
Failure Analysis: Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
Collaboration: Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
Test Infrastructure: Contribute to the development and maintenance of the burn-in testing infrastructure, including portable test environments and automation tools runnable in any environment
Documentation: Create and maintain comprehensive documentation for test plans, test cases, and test results
Performance Analysis: Analyze system performance metrics to identify potential bottlenecks and areas for optimization
Continuous Improvement: Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the burn-in testing process

What we offer

Medical, dental, and vision packages with generous premium coverage
$500 per month credit for waiving medical benefits
Housing subsidy of $2k per month for those living within walking distance of the office
Relocation support for those moving to San Jose (Santana Row)
Various wellness benefits covering fitness, mental health, and more
Daily lunch + dinner in our office

Fulltime

Supercomputing Engineer (Network)

We are seeking highly motivated and skilled Supercomputing Engineers (Network) t...

Location

United States, San Jose

Salary:

150000.00 - 275000.00 USD / Year

Etched

Expiration Date

Until further notice

Requirements

Proficiency in C/C++
Proficiency in at least one scripting language (e.g., Python, Bash, Go)
Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE
Experience with zero-copy networking, RDMA verbs and memory registration
Familiarity with queue pairs, completion queues, and transport types
Strong understanding of operating systems (Linux preferred) and server hardware architectures
Ability to analyze complex technical problems and provide effective solutions
Excellent communication and collaboration skills
Ability to work independently and as part of a team
Experience with version control systems (e.g., Git)

Job Responsibility

Design, develop, and implement RDMA-based network peering, supporting high-bandwidth, low-latency communication across PCIe nodes within and across racks
Develop tests that qualify host processors (x86), NICs, TORs and device network interfaces for high performance
Furnish burn-in teams with tests that represent both real-world use cases and workloads for device-to-device networking and extreme-load stress testing
Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads

What we offer

Medical, dental, and vision packages with generous premium coverage
$500 per month credit for waiving medical benefits
Housing subsidy of $2k per month for those living within walking distance of the office
Relocation support for those moving to San Jose (Santana Row)
Various wellness benefits covering fitness, mental health, and more
Daily lunch + dinner in our office

Fulltime

Senior Supercomputing Operations Engineer

Microsoft Azure’s Artificial Intelligence and High‑Performance Computing (AI/HPC...

Location

United States, Multiple Locations

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
4+ years of experience operating high-performance computing (HPC), artificial intelligence (AI), or large-scale distributed systems in production environments
Hands-on experience operating interconnect fabrics for HPC, AI, or large-scale distributed systems in production
Strong Linux systems knowledge with demonstrated experience debugging low-level infrastructure issues
Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry

Job Responsibility

Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation

Fulltime

Supercomputing Software Engineer

We are seeking a highly skilled and motivated Supercomputing Software Engineer t...

Location

Taiwan, Taipei

Salary:

Not provided

Etched

Expiration Date

Until further notice

Requirements

Proficiency in C/C++ or Python
Strong understanding of BIOS and BMC firmware architectures
Experience with server boot processes
Knowledge of root-of-trust and security principles
Strong understanding of operating systems (Linux preferred) and server hardware architectures
Experience with advanced system logging and diagnostic tools
Ability to analyze complex technical problems and provide effective solutions
Excellent communication and collaboration skills
Experience with version control systems (e.g., Git)
Experience with reading and interpreting hardware logs

Job Responsibility

Integrate and maintain BIOS and BMC firmware, ensuring robust and efficient server boot processes
Measure and Tune System Performance Configuration: Analyze DRAM timings, PCIe configurations, power state transitions, etc. to ensure high performance and maximal reliability
Root of Trust and Security: Validate security features, including root-of-trust mechanisms, to protect system integrity and data security
Advanced System Logging and Diagnostics: Design and implement advanced system logging and diagnostic capabilities to facilitate efficient troubleshooting and performance analysis
Data Center Orchestration Integration: Integrate and optimize node-level data center orchestration technologies, such as Kubernetes and Docker, into the system software stack
System Validation and Testing: Develop and execute comprehensive test plans to validate system software functionality, stability, and performance
Collaboration and Troubleshooting: Collaborate with hardware and software teams to diagnose and resolve complex system-level issues

What we offer

Competitive compensation packages including generous equity packages
Comprehensive insurance coverage and other top-of-market benefits

Fulltime

Principal Supercomputing Operations Software Engineer

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues

Job Responsibility

Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
Lead and orchestrate complex, high-severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high-impact decisions under ambiguity
Perform deep, multi-layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform

Fulltime

Supercomputing Test Software Engineer

We are seeking highly motivated and detail-oriented Software Engineers to join o...

Location

Taiwan, Taipei

Salary:

Not provided

Etched

Expiration Date

Until further notice

Requirements

Proficiency in at least one scripting language (e.g., Python, Bash, Go)
Experience with software testing methodologies and tools
Strong understanding of operating systems (Linux preferred) and server hardware architectures
Ability to analyze complex technical problems and provide effective solutions
Excellent communication and collaboration skills
Ability to work independently and as part of a team
Experience with version control systems (e.g., Git)
Experience with reading and interpreting hardware logs

Job Responsibility

Design, develop, and implement automated supercomputing test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of System Operation including: boot sequences, root-of-trust, system management, workload deployment and performance
Execute tests on server hardware, monitor system performance and health, and analyze test results
Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
Contribute to the development and maintenance of the supercomputing testing infrastructure, including portable test environments and automation tools runnable in any environment
Create and maintain comprehensive documentation for test plans, test cases, and test results
Analyze system performance metrics to identify potential bottlenecks and areas for optimization
Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the testing process

What we offer

Competitive compensation packages including generous equity packages
Comprehensive insurance coverage and other top-of-market benefits

Fulltime

HPC & AI Senior Performance Engineer

HPE is seeking an experienced HPC performance engineer who is excited to help dr...

Location

United States, Bloomington; Spring

Salary:

119500.00 - 275000.00 USD / Year

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Master’s or PhD degree in computer science, engineering, mathematics, or equivalent
7+ years of demonstrated experience with high-end HPC systems and performance analysis/tuning of scalable HPC applications
Excellent analytical and problem-solving skills
Excellent written and verbal communication skills; mastery of English required
U.S. citizenship required

Job Responsibility

Provide pre- and post-sales technical and benchmarking support to enable HPC or AI customer opportunities
Collaborate with HPE product engineering, product management, and marketing to influence current and future product capabilities and direction
Evaluate emerging technologies and assess their impact on product differentiation (performance and features) for key customer applications
Develop and maintain current knowledge of HPE and competitor HPC/AI products and performance optimization techniques to ensure the team continues to deliver high-quality benchmark results
Lead application-focused performance studies and projects, and author technical reports and papers that communicate findings and recommendations

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Fulltime

Product Owner

We are looking for a talented Product Owner with strong project management skill...

Location

Italy, Milan

Salary:

30000.00 - 60000.00 EUR / Year

iGenius

Expiration Date

Until further notice

Requirements

At least 3 years of experience as a Project Manager and at least 2 years as a Product Owner — ideally in the same or overlapping roles
Experience in a PM/PO role deploying and operating supercomputers at scale, including data center infrastructure (power and cooling)
Experience in a PM/PO role designing, deploying, and operating AI cloud environments
Good knowledge of Artificial Intelligence, Analytics and Business Intelligence
Ability to understand and translate business needs into technical requirements
Expertise in Agile methodologies and related tools such as Jira and Confluence
Expertise in project management methodologies and related tools
Proficiency in managing relationships with clients
Solid organizational and multitasking skills
Excellent writing and communication skills

Job Responsibility

Drive the delivery of AI-computing solutions for our clients across the full engagement lifecycle, from pre-sales through target increment and hypercare
Define and prioritize the product backlog
Translate business needs into technical requirements
Own MVP definition and product increment deliverables
Work closely with engineering teams throughout the development lifecycle to ensure the success of the initiatives

What we offer

Learning Friday (training budget for books, online courses or other training materials)
Smart Working (work from home)
Opportunity to receive company equity
Opportunity to receive stock options based on seniority and performance
