Supercomputing Engineer (Network)

Etched

Location:
San Jose, United States

Contract Type:
Not provided

Salary:
150,000 - 275,000 USD / Year

Job Description:

We are seeking highly motivated and skilled Supercomputing Engineers (Network) to join our team. This team plays a critical role in developing, qualifying, and optimizing high-performance networking solutions for large-scale inference workloads. As a Pod Software Engineer, you will focus on developing and qualifying software that drives communication among Sohu inference nodes in multi-rack inference clusters. You will collaborate closely with kernel, platform, and telemetry teams to push the boundaries of peer-to-peer RDMA efficiency.

Job Responsibilities:

  • Design, develop, and implement RDMA-based network peering that supports high-bandwidth, low-latency communication across PCIe nodes within and across racks
  • Develop tests that qualify host processors (x86), NICs, top-of-rack (ToR) switches, and device network interfaces for high performance
  • Furnish burn-in teams with tests that represent both real-world device-to-device networking workloads and extreme-load stress conditions
  • Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads

Requirements:

  • Proficiency in C/C++
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE
  • Experience with zero-copy networking, RDMA verbs, and memory registration
  • Familiarity with queue pairs, completion queues, and transport types
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs

Nice to have:

  • Experience with networking technologies like NVLink, InfiniBand, and ML pod interconnects
  • Experience with widely deployed top-of-rack switches (Cisco, Juniper, Arista, etc.)
  • Knowledge of server virtualization
  • Experience with tracing tools like perf, eBPF, ftrace, etc.
  • Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.)
  • Familiarity with hardware diagnostic tools and techniques
  • Experience with containerization technologies (e.g., Docker, Kubernetes)
  • Experience with CI/CD pipelines
  • Experience with Rust

What we offer:

  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time

Work Type:
On-site
