Supercomputing Software Engineer Job at Etched (Taipei)

Principal Supercomputing Operations Software Engineer

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
Hands‑on experience operating and debugging interconnect fabrics supporting large‑scale compute workloads
Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues

Job Responsibility

Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
Lead and orchestrate complex, high-severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high-impact decisions under ambiguity
Perform deep, multi-layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform

Full-time

Supercomputing Test Software Engineer

We are seeking highly motivated and detail-oriented Software Engineers to join o...

Location

Taiwan, Taipei

Salary:

Not provided

Etched

Expiration Date

Until further notice

Requirements

Proficiency in at least one scripting language (e.g., Python, Bash, Go)
Experience with software testing methodologies and tools
Strong understanding of operating systems (Linux preferred) and server hardware architectures
Ability to analyze complex technical problems and provide effective solutions
Excellent communication and collaboration skills
Ability to work independently and as part of a team
Experience with version control systems (e.g., Git)
Experience with reading and interpreting hardware logs

Job Responsibility

Design, develop, and implement automated supercomputing test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of system operation, including boot sequences, root-of-trust, system management, workload deployment, and performance
Execute tests on server hardware, monitor system performance and health, and analyze test results
Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
Contribute to the development and maintenance of the supercomputing testing infrastructure, including portable test environments and automation tools runnable in any environment
Create and maintain comprehensive documentation for test plans, test cases, and test results
Analyze system performance metrics to identify potential bottlenecks and areas for optimization
Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the testing process

What we offer

Competitive compensation packages including generous equity packages
Comprehensive insurance coverage and other top-of-market benefits

Full-time

Software Engineer II

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...

Location

United States, Multiple Locations

Salary:

100600.00 - 199000.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter

Job Responsibility

Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers
Manage operations of supercomputers by responding quickly to mitigate issues
Implement systemic solutions and mitigations for more complex issues impacting the performance or functionality of supercomputers
Review and write incident postmortems and present insights that drive changes to reduce or eliminate incidents
Independently improve troubleshooting guides (TSGs), wikis, tests, and telemetry, adding comprehensive observability and monitoring capabilities
Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of supercomputers while also driving consistency in monitoring and operations at scale

Full-time

Software Engineer, Frontier Systems - Power Management

As a Software Engineer on the Frontier Systems team focused on power management,...

Location

United States, San Francisco

Salary:

295000.00 - 445000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

7+ years of software engineering experience with a focus on solving large-scale, system-level challenges
Strong proficiency in Python and familiarity with automation and scripting tools (e.g., shell scripting)
Experience with distributed systems to efficiently aggregate and analyze streaming data
Knowledge of electrical engineering concepts including digital signal processing, power systems, Fast Fourier Transforms, or related areas
Experience in system-level investigations and development of automated solutions to address power management, fault detection, and remediation
Strong analytical skills and the ability to dig into noisy data (experience with SQL, PromQL, Pandas, etc.)
Comfort working with both hardware and software teams to solve multidisciplinary problems

Job Responsibility

Develop and implement system-level and software-level solutions to optimize power usage in large-scale supercomputers, ensuring efficient and reliable operations
Build automation to monitor power consumption patterns during training workloads and design algorithms to stabilize these fluctuations, preventing issues with grid reliability
Work with researchers and engineers to design tools for real-time monitoring, detection, and remediation of power-related hardware and system faults
Collaborate cross-functionally to translate complex electrical system requirements into code, while driving continuous improvements in power management solutions
Drive the development of power throttling mechanisms at the IT system level to dynamically adjust power usage based on workload demands and infrastructure limitations
Collaborate with hardware design teams to integrate system-level power control requirements into IT hardware design, ensuring seamless coordination between software-driven power management and hardware capabilities

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Full-time

Software Engineer, Frontier Systems

The Frontier Systems team at OpenAI builds, launches, and supports the largest s...

Location

United States, San Francisco

Salary:

250000.00 - 445000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

7+ years of industry experience in software engineering
Proficiency with Python and shell scripting
A high degree of comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool necessary
Experience developing reproducible analyses
A balance of strengths in building and operationalizing

Job Responsibility

Own and improve the system health checks that keep our hyperscale supercomputers stable during model training
Lead deep dives into hardware failures and system-level bugs to understand how things break at scale
Build automation that monitors and fixes issues across thousands of machines, so researchers can keep moving without interruption

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Full-time

Software Engineer, Data Visualization

The Data Visualization team at OpenAI is responsible for building and maintainin...

Location

United States, San Francisco

Salary:

230000.00 - 385000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Strong experience in full-stack software development, with a focus on building scientific or infrastructure visualization tools
Proficiency in both front-end and back-end programming languages such as Python, JavaScript, SQL, or similar
Familiarity with front-end technologies like React, back-end technologies like Node.js, and databases like Snowflake
Experience with visualization libraries and frameworks (e.g., Plotly, Grafana)
Strong understanding of full-stack architecture, design principles, and best practices
Excellent problem-solving skills and attention to detail
Strong communication skills and the ability to work collaboratively in a team environment

Job Responsibility

Develop and maintain full-stack visualization tools for hardware and software analysis
Design intuitive front-end interfaces and robust back-end systems for monitoring the performance and health of supercomputer systems
Collaborate with researchers and engineers to understand their needs and deliver effective full-stack visualization solutions
Ensure high performance, reliability, and scalability of visualization tools across both front-end and back-end systems
Continuously improve existing tools and develop new features to meet evolving requirements

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Full-time

Software Engineer, Hardware

As a software engineer on the Scaling team, you’ll help build and optimize the l...

Location

United States, San Francisco

Salary:

266000.00 - 455000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Proficiency in systems programming (e.g., Rust, C++) and scripting languages like Python
Experience in one or more of the following areas: compiler development, kernel authoring, accelerator programming, runtime systems, distributed systems, or high-performance simulation
Deep curiosity about how large-scale systems work and enjoyment in making them faster, simpler, and more reliable
Excitement about working in a fast-paced, highly collaborative environment with evolving hardware and ML system demands
Appreciation for engineering excellence, technical leadership, and thoughtful system design

Job Responsibility

Design and build APIs and runtime components to orchestrate computation and data movement across heterogeneous ML workloads
Contribute to compiler infrastructure, including the development of optimizations and compiler passes to support evolving hardware
Engineer and optimize compute and data kernels, ensuring correctness, high performance, and portability across simulation and production environments
Profile and optimize system bottlenecks, especially around I/O, memory hierarchy, and interconnects, at both local and distributed scales
Develop simulation infrastructure to validate runtime behaviors, test training stack changes, and support early-stage hardware and system development
Rapidly deploy runtime and compiler updates to new supercomputing builds in close collaboration with hardware and research teams
Work across a diverse stack, primarily using Rust and Python, with opportunities to influence architecture decisions across the training framework

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Full-time

Software Engineer, Collective Communication

The Workload Networking team is responsible for the collective communication sta...

Location

United States, San Francisco

Salary:

380000.00 - 555000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Background in low-level, performance-critical software
Experience with collective communication is a bonus
Experience writing distributed algorithms using RDMA
Comfort writing low-level, performance-sensitive CPU and/or GPU code
Familiarity with network simulation techniques

Job Responsibility

Collaborate closely with ML researchers to design and implement efficient collective operations in C++ and CUDA
Ensure that our largest training jobs take full advantage of the different network transports used in our supercomputers
Work on simulations to inform our future supercomputer network designs

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Full-time
