Supercomputing Engineer

Etched

Location:
United States, San Jose

Contract Type:
Not provided

Salary:

200000.00 - 275000.00 USD / Year

Job Description:

Etched is building at-scale AI systems that will unlock faster, more efficient inference for billions of people, and the Supercomputing team is critical to this mission. We are seeking a highly skilled and motivated engineer to join our Supercomputing team and help build the foundational software that powers our cluster-scale AI compute deployments. This role on the core team involves the development, integration, and debugging of critical system components, including control-plane software, system bring-up, telemetry, orchestration primitives, and performance tuning at the hardware–software boundary.

Job Responsibility:

  • Architect and implement low-level control-plane software responsible for system bring-up, configuration, and management of cluster-scale AI compute deployments
  • Build system services that interact directly with hardware, firmware, and the operating system
  • Develop telemetry, logging, and tracing infrastructure for diagnosing failures and driving performance improvements
  • Implement orchestration primitives for managing devices, nodes, and racks
  • Profile and tune performance across PCIe, memory, networking, kernel, and runtime layers
  • Collaborate closely with hardware, firmware, kernel, and runtime teams to co-design system interfaces and behavior

Requirements:

  • Strong proficiency in C/C++ or Rust for low-level systems programming
  • Deep understanding of Linux internals, kernel/user-space boundaries, and system-level debugging
  • Experience working close to hardware: drivers, DMA, interrupts, memory management, or device control paths
  • Strong debugging skills using logs, tracing, and low-level observability tools
  • Strong communication skills and comfort collaborating across hardware and software teams

Nice to have:

  • Experience with data center orchestration technologies such as Kubernetes and Docker
  • Experience with kernel development, device drivers, or firmware-adjacent software
  • Familiarity with PCIe, NUMA, networking, or high-speed interconnects
  • Experience with tracing and profiling tools such as perf, eBPF, ftrace, or custom instrumentation
  • Experience taking complex systems from early bring-up through stable operation
  • Background in HPC, AI infrastructure, or large-scale compute systems
  • Experience designing system test harnesses and failure-injection frameworks
  • Familiarity with Kubernetes or cluster orchestration at the node or control-plane level

What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Supercomputing Engineer

Supercomputing Software Engineer

We are seeking a highly skilled and motivated Supercomputing Software Engineer t...
Location:
Taiwan, Taipei
Salary:
Not provided
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in C/C++ or Python
  • Strong understanding of BIOS and BMC firmware architectures
  • Experience with server boot processes
  • Knowledge of root-of-trust and security principles
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Experience with advanced system logging and diagnostic tools
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility:
  • Integrate and maintain BIOS and BMC firmware, ensuring robust and efficient server boot processes
  • Measure and Tune System Performance Configuration: Analyze DRAM timings, PCIe configurations, power state transitions, etc., to ensure high performance and maximal reliability
  • Root of Trust and Security: Validate security features, including root of trust mechanisms, to protect system integrity and data security
  • Advanced System Logging and Diagnostics: Design and implement advanced system logging and diagnostic capabilities to facilitate efficient troubleshooting and performance analysis
  • Data Center Orchestration Integration: Integrate and optimize node-level data center orchestration technologies, such as Kubernetes and Docker, into the system software stack
  • System Validation and Testing: Develop and execute comprehensive test plans to validate system software functionality, stability, and performance
  • Collaboration and Troubleshooting: Collaborate with hardware and software teams to diagnose and resolve complex system-level issues
What we offer:
  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits
Employment Type:
Full-time

Senior Supercomputing Operations Engineer

Microsoft Azure’s Artificial Intelligence and High‑Performance Computing (AI/HPC...
Location:
United States, Multiple Locations
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience operating high-performance computing (HPC), artificial intelligence (AI), or large-scale distributed systems in production environments
  • Hands-on experience operating interconnect fabrics for HPC, AI, or large-scale distributed systems in production
  • Strong Linux systems knowledge with demonstrated experience debugging low-level infrastructure issues
  • Demonstrated ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve production issues
  • Familiarity with InfiniBand Subnet Manager behavior, including routing, congestion control, and fabric telemetry
Job Responsibility:
  • Act as DRI for InfiniBand and GPU interconnect fabric operations, ensuring GPU availability and AI training stability
  • Lead incident triage, mitigation, recovery, and root cause analysis for fabric-related production issues
  • Perform deep multi-layer debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, and GPU interactions
  • Drive operational excellence and prevention by identifying systemic failure patterns and authoring TSGs, playbooks, and escalation guides
  • Build and leverage automation, telemetry, and tooling to improve detection, debuggability, and mean time to mitigation
Employment Type:
Full-time

Principal Supercomputing Operations Engineering Manager

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
  • Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
  • Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
  • Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
  • Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
  • Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet
Employment Type:
Full-time

Principal Supercomputing Operations Software Engineer

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • 6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
  • Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
  • Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues
Job Responsibility:
  • Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead and orchestrate complex, high-severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high-impact decisions under ambiguity
  • Perform deep, multi-layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
  • Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
  • Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform
Employment Type:
Full-time

Strategic Finance Compute Lead

Compute is a key lever for OpenAI and AI progress. We are seeking a Strategic Fi...
Location:
United States, San Francisco
Salary:
185000.00 - 260000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience across strategic finance, private / growth equity, investment banking, strategy & operations, and/or business development with 3+ years of finance operating experience at a high-growth technology company
  • Experience partnering with engineering and product teams to provide financial analysis and insights to critical strategic decisions
  • Good understanding of cloud technology and compute infrastructure
  • Exceptionally strong analytical, financial modeling, and written and oral communication skills
  • Demonstrated track record of thoughtful investment decisions
  • Experience driving operational outcomes under ambitious deadlines
  • Exceptionally strong relationship building, business judgment, and communication skills
  • Bachelor’s degree or equivalent practical experience
Job Responsibility:
  • Own and develop financial models across different elements of compute (GPUs, CPUs, storage and networking)
  • Lead strategic financial analysis for long-term capacity initiatives, working closely with scaling and supercomputing engineering teams
  • Maintain deep expertise on compute contract terms, pricing structures and optimization opportunities
  • Serve as a partner to FP&A and strategic finance teams, aligning compute and infrastructure with broader financial and business strategies
  • Create high-quality Exec and Board-facing presentations
  • Stay abreast of market trends and competitive dynamics to inform and improve our infrastructure strategy
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time

Supercomputing Engineer (Test)

We are seeking highly motivated and detail-oriented Supercomputing Engineer (Tes...
Location:
United States, San Jose
Salary:
150000.00 - 275000.00 USD / Year
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Experience with software testing methodologies and tools
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility:
  • Test Development: Design, develop, and implement automated burn-in test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of System Operation including: boot sequences, root-of-trust, system management, workload deployment and performance
  • Test Execution: Execute burn-in tests on server hardware, monitor system performance and health, and analyze test results
  • Failure Analysis: Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
  • Collaboration: Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
  • Test Infrastructure: Contribute to the development and maintenance of the burn-in testing infrastructure, including portable test environments and automation tools runnable in any environment
  • Documentation: Create and maintain comprehensive documentation for test plans, test cases, and test results
  • Performance Analysis: Analyze system performance metrics to identify potential bottlenecks and areas for optimization
  • Continuous Improvement: Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the burn-in testing process
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
Employment Type:
Full-time

Supercomputing Engineer (Network)

We are seeking highly motivated and skilled Supercomputing Engineers (Network) t...
Location:
United States, San Jose
Salary:
150000.00 - 275000.00 USD / Year
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in C/C++
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE
  • Experience with zero-copy networking, RDMA verbs and memory registration
  • Familiarity with queue pairs, completion queues, and transport types
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
Job Responsibility:
  • Design, develop, and implement RDMA-based network peering, supporting high-bandwidth, low-latency communication across PCIe nodes within and across racks
  • Develop tests that qualify host processors (x86), NICs, TORs and device network interfaces for high performance
  • Furnish burn-in teams with tests that represent both real-world use cases and workloads for device-to-device networking, and extreme-load stress testing
  • Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
Employment Type:
Full-time

Supercomputing Test Software Engineer

We are seeking highly motivated and detail-oriented Software Engineers to join o...
Location:
Taiwan, Taipei
Salary:
Not provided
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Experience with software testing methodologies and tools
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility:
  • Design, develop, and implement automated supercomputing test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of System Operation including: boot sequences, root-of-trust, system management, workload deployment and performance
  • Execute tests on server hardware, monitor system performance and health, and analyze test results
  • Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
  • Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
  • Contribute to the development and maintenance of the supercomputing testing infrastructure, including portable test environments and automation tools runnable in any environment
  • Create and maintain comprehensive documentation for test plans, test cases, and test results
  • Analyze system performance metrics to identify potential bottlenecks and areas for optimization
  • Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the testing process
What we offer:
  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits
Employment Type:
Full-time