Supercomputing Engineer (Network)

Etched

Location:
United States, San Jose

Contract Type:
Not provided

Salary:

150000.00 - 275000.00 USD / Year

Job Description:

We are seeking highly motivated and skilled Supercomputing Engineers (Network) to join our team. This team plays a critical role in developing, qualifying, and optimizing high-performance networking solutions for large-scale inference workloads. As a Pod Software Engineer, you will focus on developing and qualifying software that drives communication amongst Sohu inference nodes in multi-rack inference clusters. You will collaborate closely with kernel, platform, and telemetry teams to push the boundaries of peer-to-peer RDMA efficiency.

Job Responsibility:

  • Design, develop, and implement RDMA-based network peering, supporting high-bandwidth, low-latency communication across PCIe nodes within and across racks
  • Develop tests that qualify host processors (x86), NICs, TORs and device network interfaces for high performance
  • Furnish burn-in teams with tests that represent both real-world use cases and workloads for device-to-device networking and extreme-load stress testing
  • Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads

Requirements:

  • Proficiency in C/C++
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE
  • Experience with zero-copy networking, RDMA verbs and memory registration
  • Familiarity with queue pairs, completion queues, and transport types
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs

Nice to have:

  • Experience with networking technologies like NVLink, InfiniBand, and ML Pod interconnects
  • Experience with widely deployed Top of Rack Switches (Cisco, Juniper, Arista, etc.)
  • Knowledge of server virtualization
  • Experience with tracing tools like perf, eBPF, ftrace, etc.
  • Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.)
  • Familiarity with hardware diagnostic tools and techniques
  • Experience with containerization technologies (e.g., Docker, Kubernetes)
  • Experience with CI/CD pipelines
  • Experience with Rust

What we offer:

  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Supercomputing Engineer (Network)

AI/HPC System Performance Engineer

Meta is building some of the world's largest AI and high-performance computing i...
Location:
United States, Menlo Park
Salary:
154000.00 - 217000.00 USD / Year
Meta
Expiration Date
Until further notice
Requirements
  • Experience profiling and optimizing distributed AI or HPC workloads, including familiarity with GPU interconnects, RDMA networking, and collective communication frameworks such as NCCL or MPI
  • Experience debugging complex, non-reproducible performance issues across multi-layer systems including network fabric, operating system, and application layers
  • Experience designing and implementing performance monitoring systems, including instrumentation, telemetry pipelines, and alerting for large-scale infrastructure
  • Experience driving cross-functional technical projects from requirements definition through production deployment, including communicating performance findings and trade-offs to diverse stakeholders
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of experience in system performance engineering, network infrastructure engineering, or a related field within large-scale distributed computing or HPC environments
Job Responsibility
  • Profile and benchmark AI training and inference workloads across large-scale HPC clusters to identify network, compute, and memory bottlenecks
  • Develop and maintain performance analysis frameworks and dashboards to track system-level metrics including GPU utilization, network bandwidth, latency, and collective communication efficiency
  • Investigate and resolve performance regressions in distributed AI training environments, including issues related to RDMA fabrics, collective communication libraries, and job scheduling
  • Collaborate with network infrastructure, hardware, and AI research teams to define performance requirements and validate new HPC cluster configurations
  • Design and execute capacity and scalability experiments to inform network topology decisions for AI supercomputing infrastructure
  • Build tooling and automation to continuously monitor HPC system health, detect anomalies, and reduce mean time to mitigation during performance incidents
  • Establish service level objectives for AI cluster network performance and drive cross-functional alignment on reliability and efficiency targets
  • Lead technical design reviews for network and system architecture changes affecting AI workload performance, communicating trade-offs clearly to engineering and product stakeholders
  • Mentor other engineers on HPC performance methodologies, debugging techniques, and instrumentation best practices
  • Leverage AI-assisted workflows to accelerate root cause analysis, automate routine performance reporting, and expand coverage across the HPC stack
What we offer
  • bonus + equity + benefits
Employment Type:
Full-time

Software Engineer, Collective Communication

The Workload Networking team is responsible for the collective communication sta...
Location:
United States, San Francisco
Salary:
380000.00 - 555000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • Background in low-level, performance-critical software
  • Experience with collective communication is a bonus
  • Experience writing distributed algorithms using RDMA
  • Comfort writing low-level, performance-sensitive CPU and/or GPU code
  • Familiarity with network simulation techniques
Job Responsibility
  • Collaborate closely with ML researchers to design and implement efficient collective operations in C++ and CUDA
  • Ensure that our largest training jobs take full advantage of the different network transports used in our supercomputers
  • Work on simulations to inform our future supercomputer network designs
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time

Talent Sourcer

As we scale, we’re looking for a Talent Sourcer (Supercomputing/ML) to build and...
Location:
United States, San Jose
Salary:
100000.00 - 220000.00 USD / Year
Etched
Expiration Date
Until further notice
Requirements
  • 3+ years of experience sourcing technical talent in highly competitive markets
  • Deep experience sourcing software, systems, infrastructure, or hardware engineers
  • Highly resourceful, with a love of finding exceptional candidates beyond obvious platforms
  • Thrives in ambiguity and enjoys building sourcing engines from scratch
  • Detail-oriented, organized, and operationally strong
  • Cares deeply about candidate experience and employer brand
  • Enjoys working in high-velocity environments with extremely high hiring bars
Job Responsibility
  • Own top-of-funnel sourcing strategy across priority engineering roles in supercomputing, ML systems, firmware, networking, and distributed systems
  • Build and maintain high-quality talent pipelines through outbound sourcing, referrals, events, research, and creative outreach
  • Partner closely with recruiters and hiring managers to deeply understand role requirements, ideal profiles, and search strategy
  • Develop market maps for niche technical domains and continuously expand our talent network
  • Run high-volume, high-signal outbound campaigns with thoughtful personalization
  • Track sourcing performance, conversion rates, and funnel health
  • Continuously experiment with new sourcing channels, tools, and techniques
  • Deliver a best-in-class candidate experience from first touch onward
What we offer
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
Employment Type:
Full-time

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Employment Type:
Full-time

Strategic Finance Compute Lead

Compute is a key lever for OpenAI and AI progress. We are seeking a Strategic Fi...
Location:
United States, San Francisco
Salary:
185000.00 - 260000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • 5+ years of experience across strategic finance, private / growth equity, investment banking, strategy & operations, and/or business development with 3+ years of finance operating experience at a high-growth technology company
  • Experience partnering with engineering and product teams to provide financial analysis and insights to critical strategic decisions
  • Good understanding of cloud technology and compute infrastructure
  • Exceptionally strong analytical, financial modeling, and written and oral communication skills
  • Demonstrated track record of thoughtful investment decisions
  • Experience driving operational outcomes under ambitious deadlines
  • Exceptionally strong relationship building, business judgment, and communication skills
  • Bachelor’s degree or equivalent practical experience
Job Responsibility
  • Own and develop financial models across different elements of compute (GPUs, CPUs, storage and networking)
  • Lead strategic financial analysis for long-term capacity initiatives, working closely with scaling and supercomputing engineering teams
  • Maintain deep expertise on compute contract terms, pricing structures and optimization opportunities
  • Serve as a partner to FP&A and strategic finance teams, aligning compute and infrastructure with broader financial and business strategies
  • Create high-quality Exec and Board-facing presentations
  • Stay abreast of market trends and competitive dynamics to inform and improve our infrastructure strategy
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time

Supercomputing Engineer

Etched is building at-scale AI systems that will unlock faster, more efficient i...
Location:
United States, San Jose
Salary:
200000.00 - 275000.00 USD / Year
Etched
Expiration Date
Until further notice
Requirements
  • Strong proficiency in C/C++ or Rust for low-level systems programming
  • Deep understanding of Linux internals, kernel/user-space boundaries, and system-level debugging
  • Experience working close to hardware: drivers, DMA, interrupts, memory management, or device control paths
  • Strong debugging skills using logs, tracing, and low-level observability tools
  • Strong communication skills and comfort collaborating across hardware and software teams
Job Responsibility
  • Architect and implement low-level control-plane software responsible for system bring-up, configuration, and management of cluster-scale AI compute deployments
  • Build system services that interact directly with hardware, firmware, and the operating system
  • Develop telemetry, logging, and tracing infrastructure for diagnosing failures and driving performance improvements
  • Implement orchestration primitives for managing devices, nodes, and racks
  • Profile and tune performance across PCIe, memory, networking, kernel, and runtime layers
  • Collaborate closely with hardware, firmware, kernel, and runtime teams to co-design system interfaces and behavior
What we offer
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
Employment Type:
Full-time

Industrial Design Intern

The Industrial Design Intern will be responsible for supporting staff Industrial...
Location:
United States, Spring
Salary:
35.00 - 40.25 USD / Hour
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Currently pursuing a Bachelor's degree in Industrial Design, having completed junior year
  • Possess passion for Industrial Design and ability to understand project scope and articulate design details
  • Understanding and use of 2-D and 3-D CAD tools and software packages
  • Creo and Keyshot preferred but not required
  • Adobe Suite
  • Ability to apply analytical and problem-solving skills across complex technical programs
  • Understanding and experience in implementing and creating sketches, renderings, 3D models and prototypes
  • Strong written and verbal communication skills and able to communicate with other disciplines such as mechanical engineers and product marketing
Job Responsibility
  • Support staff Industrial Designers and Human Factors in researching, creating, and developing concepts and specifications
  • Design portions of the industrial design for physical hardware products and systems
  • Support more senior Industrial Designers on new design projects creating artifacts, such as artwork, sketches, renderings and 3D prototypes
  • Collaborate with local and international team members providing timely support across a range of programs
  • Support multiple programs across multiple business units including servers, storage, networking and supercomputing
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Employment Type:
Full-time

Supercomputing Intern

Our supercomputing role focuses on the design, development, and deployment of ML...
Location:
United States, San Jose
Salary:
Not provided
Etched
Expiration Date
Until further notice
Requirements
  • Progress towards a Bachelor’s, Master’s, or PhD degree in Computer Science, Engineering, or a related technical field
  • Proficiency in C/C++ or Rust
  • Proficiency in Python
  • Strong fundamentals in data structures and algorithms
  • Strong understanding of low-level software engineering
  • Strong understanding of hardware/software co-design
  • Excellent communication and collaboration skills
Job Responsibility
  • Design, development, and deployment of ML system software required for operating rack-scale systems
  • Work spanning network performance, telemetry creation and processing pipelines, and analysis of system-level health and performance
  • Deployment and provisioning of software frameworks and hardware validation
  • Maintaining secure and performant systems for data center scale ML workloads
What we offer
  • 12-week paid internship
  • Generous housing support for those relocating
  • Daily lunch and dinner in our office
  • Direct mentorship from industry leaders and world-class engineers
  • Opportunity to work on one of the most important problems of our time
Employment Type:
Full-time