Supercomputing Engineer (Network) Job at Etched (San Jose)

Supercomputing Engineer

Etched is building at-scale AI systems that will unlock faster, more efficient i...

Location

United States, San Jose

Salary:

200000.00 - 275000.00 USD / Year

Etched

Expiration Date

Until further notice

Requirements

Strong proficiency in C/C++ or Rust for low-level systems programming
Deep understanding of Linux internals, kernel/user-space boundaries, and system-level debugging
Experience working close to hardware: drivers, DMA, interrupts, memory management, or device control paths
Strong debugging skills using logs, tracing, and low-level observability tools
Strong communication skills and comfort collaborating across hardware and software teams

Job Responsibility

Architect and implement low-level control-plane software responsible for system bring-up, configuration, and management of cluster-scale AI compute deployments
Build system services that interact directly with hardware, firmware, and the operating system
Develop telemetry, logging, and tracing infrastructure for diagnosing failures and driving performance improvements
Implement orchestration primitives for managing devices, nodes, and racks
Profile and tune performance across PCIe, memory, networking, kernel, and runtime layers
Collaborate closely with hardware, firmware, kernel, and runtime teams to co-design system interfaces and behavior

What we offer

Medical, dental, and vision packages with generous premium coverage
$500 per month credit for waiving medical benefits
Housing subsidy of $2k per month for those living within walking distance of the office
Relocation support for those moving to San Jose (Santana Row)
Various wellness benefits covering fitness, mental health, and more
Daily lunch + dinner in our office

Fulltime

HPC&AI Sales Specialist

As an HPC & AI Sales Specialist, you are a product and solution expert responsib...

Location

South Korea, Seoul

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Education: University or Bachelor's degree (Degrees in Computer Science, Electrical/Electronic Engineering, Data Science, or related STEM fields are highly preferred, but equivalent advanced IT sales experience will be fully valued)
Overall Experience: Typically 10 to 15+ years of advanced IT infrastructure sales experience (with a demonstrated track record of achieving progressively higher quotas and managing diverse enterprise/public customers)
(Preferred) Specialized Experience: 2 to 3+ years of dedicated product sales experience in the HPC and AI infrastructure space (e.g., GPU servers, high-density cluster systems, large-scale data center infrastructure solutions)
Proven Track Record: Experience driving large-scale infrastructure bidding processes, dealing with complex public Procurement/RFP processes, or closing multi-million-dollar enterprise deals
Project Management Skills: Ability to coordinate cross-functional sales support activities, technical pre-sales, and vendor-side pricing approval desks
HPC & AI Domain Expertise: Regarded as a subject matter expert with deep knowledge of next-generation accelerator architectures (NVIDIA/AMD), high-density cluster setups, and advanced interconnect technologies
Market & Industry Acumen: Thorough understanding of industry trends, macroeconomic trends, and market segments in Korea's key manufacturing, telco, and research sectors
Deal Orchestration & Leadership: Demonstrates proactive leadership and initiative in successfully driving specialty sales in shared accounts—expertly managing prospecting, technical scoping, pricing negotiations, and closing
Complex Solution Selling: Ability to integrate hardware, high-value software stacks, and professional implementation services into a comprehensive strategic offer
Global Collaboration & Communication: Excellent professional relationship-building skills, with the English proficiency required to smoothly navigate regional/global approvals (APAC/HQ) and training programs

Job Responsibility

Drive HPC & AI Pipeline & Sales Pursuit: Create, expand, and drive a proactive sales pipeline for HPC & AI solution portfolios—including NVIDIA & AMD high-performance GPUs, NVIDIA/AMD/Intel CPU cluster servers, AI networks (InfiniBand/High-speed Ethernet), and high-performance storage
Consultative Selling & C-Level Engagement: Establish professional, working, and consultative relationships with clients up to the C-level for major accounts by thoroughly understanding their unique business needs, GenAI/LLM workloads, and advanced computing challenges
Target Segment Focus: Lead and manage key strategic deals across assigned target customer segments: Enterprise (major manufacturing companies and telecommunication providers) and Public (top-tier national research institutes and academic supercomputing centers)
Strategic Positioning against Competitors: Maintain an in-depth understanding of the competitive landscape (e.g., alternative GPU/server vendors, CSP options) to strategically position our high-density AI infrastructure and change the playing field
Account Team Collaboration: Provide specialized business development and solution expertise to Account Managers, ensuring seamless integration of specialist sales with broader account activities
Partner & Alliance Ecosystem Leverage: Invest time working with and leveraging key external partners (including NVIDIA, AMD, Intel, and domestic Top-tier SI/distributor partners) to maximize win rates and deliver complex turnkey solutions
Quota Alignment & Forecast Management: Develop and meet quarterly/annual quota objectives for the defined product category and accurately forecast business pipeline within internal CRM systems

What we offer

Health & Wellbeing: a comprehensive suite of benefits that supports your physical, financial, and emotional wellbeing
Personal & Professional Development: programs tailored to help you reach your career goals
Unconditional Inclusion

Fulltime

AI/HPC System Performance Engineer

Meta is building some of the world's largest AI and high-performance computing i...

Location

United States, Menlo Park

Salary:

154000.00 - 217000.00 USD / Year

Member of Technical Staff, Hardware Health

Microsoft AI operates one of the world’s most advanced AI training infrastructur...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
Experience with exascale-class systems or cloud-scale AI clusters.
Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.

Job Responsibility

Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
Drive automation in health management to reduce manual intervention to the top 5% of anomalies.
Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.

Fulltime

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...

Location

United States, Redmond

Salary:

163000.00 - 296400.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
5+ years of people management experience leading software engineering teams, including managing principal engineers
Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Job Responsibility

Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate

Fulltime

Senior AI Presales Consultant

We are seeking a high-impact, strategic AI Presales Consultant to join our elite...

Location

India, Mumbai

Salary:

Not provided

Eviden

Expiration Date

Until further notice

Requirements

7+ years in a customer-facing technical role (e.g., Presales, Solutions Architecture, AI Specialist, or Technical Consulting), with a proven track record of designing large-scale AI, ML, or HPC solutions
Deep, hands-on understanding of LLM architectures. Must be able to architect, explain, and build PoCs for RAG pipelines, including vector databases (e.g., Milvus, Pinecone, Chroma), embedding models, and data ingestion strategies
Direct experience in sizing AI infrastructure. Must be able to perform "napkin math" and detailed calculations for GPU, CPU, memory, and network requirements
Must be able to fluently discuss performance metrics (tokens/second, latency, throughput, TFLOPS) and their relationship to hardware choice (e.g., NVIDIA H100 vs. A100, memory bandwidth, interconnects like NVLink/InfiniBand)
Expertise in the AI software stack. Strong understanding of MLOps principles (Kubeflow, MLflow), Kubernetes (K8s) for AI workloads, and model serving platforms (NVIDIA Triton, KServe, or similar)
Strong, current knowledge of the AI model landscape (e.g., Llama family, Mistral, GPT-family, foundation models). Ability to discuss fine-tuning techniques, quantization, and pruning
Exceptional communication, whiteboarding, and presentation skills. Ability to translate executive-level business needs into detailed technical architecture and build a compelling C-level value proposition
Bachelor's or Master's degree in Computer Science, AI, Data Science, or a related engineering field

Job Responsibility

Strategic Client Advisory: Lead executive-level "Art of the Possible" workshops and technical discovery sessions to understand a client's business goals, data readiness, and AI maturity
Full-Stack Solution Architecture: Design holistic, end-to-end AI solutions that synergize our supercomputing hardware, AI software platform, and MLOps capabilities to meet specific client needs
Generative AI & LLM Expertise: Act as the subject matter expert on Generative AI. Architect and evangelize scalable data ingestion and preparation pipelines, specializing in Retrieval-Augmented Generation (RAG) frameworks
Infrastructure Sizing & Performance Modelling: Analyse customer workloads (data volume, model complexity, training frequency, inference throughput) to accurately size the required platform infrastructure, including Kubernetes clusters, data storage, and software licenses. This includes calculating compute, storage, and network requirements based on key performance metrics like model parameters, token performance (tokens/sec), desired latency, and concurrent user load
Model & Software Consultation: Advise clients on AI model selection, comparing the trade-offs of open-source vs. proprietary LLMs, fine-tuning vs. foundation models, and model quantization
Position and demonstrate our proprietary AI software platform, MLOps tools, and libraries, integrating them into the client's ecosystem
Inference Optimization: Design and architect robust, low-latency, and high-throughput inference solutions for complex AI models, including large-scale LLM serving
User Experience (UX) Advocacy: Collaborate with client teams to define the end-user experience, ensuring the solution delivers tangible business value and a seamless interface for data scientists, analysts, and application users
Sales Cycle Enablement: Own the technical narrative throughout the sales cycle. Build and deliver compelling presentations, custom demonstrations, and Proofs of Concept (PoCs). Lead the technical response to complex RFIs/RFPs

Fulltime

Member of Technical Staff, High Performance Computing Engineer

Microsoft AI is looking for experienced Member of Technical Staff, High Performa...

Location

United Kingdom, London

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor’s degree in computer science or a related technical field AND 4+ years of technical engineering experience deploying or operating on-premises or cloud high-performance clusters
4+ years of experience working with high-scale training clusters (e.g., frameworks and tools such as NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray, etc.)
4+ years of experience building scalable services on top of public cloud infrastructure such as Azure, AWS, or GCP
OR equivalent experience

Job Responsibility

Design, operate, and maintain large-scale HPC environments
Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes)
Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar)
Develop and maintain automation and tooling using Bash and/or Python
Partner closely with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs
Drive work forward independently by navigating ambiguity and technical roadblocks
Enjoy working in a fast-paced, design-driven product development environment
Embody our Culture and Values

Fulltime

Component and Product Quality Engineer, Interconnects

OpenAI's Hardware organization builds supercompute platforms from silicon and bo...

Location

United States, San Francisco

Salary:

123000.00 - 285000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

7+ years of experience in quality engineering, manufacturing quality, supplier quality, or reliability for interconnects or high-speed hardware used in servers, networking, storage, or high-performance compute systems
Hands-on experience with high-speed copper interconnect products: connectors and/or cable assemblies
Strong command of problem-solving and quality tools: 8D, 5-Whys, Fishbone, PFMEA/control plans, SPC/MSA (gauge R&R), and change control
Ability to read and interpret mechanical drawings, GD&T basics, and electrical/interface specifications
Experience driving supplier/CM improvements (audits, scorecards, CAPA) and managing nonconformance/MRB workflows
Clear written and verbal communication skills; ability to drive alignment across internal teams and external partners
Experience with cable manufacturing and assembly processes (wire treatment, resistance welding/laser welding, crimping, overmolding/injection molding, braiding/shielding, plating, and automated test)
Ability to travel internationally and work effectively across time zones with ODM/JDM and supplier partners
To comply with U.S. export control laws and regulations, candidates for this role may need to meet certain legal status requirements as provided in those laws and regulations.

Job Responsibility

Own end-to-end quality for high-speed interconnect hardware across the product lifecycle: early design influence, supplier/contract manufacturer readiness, qualification, ramp, and fleet quality in lab and data center environments
Be the quality lead for advanced interconnect components and assemblies, including high-speed copper cables, cable cartridges, patch panels, backplane/cable-backplane solutions, high-speed connectors, and related electro-mechanical interfaces
Partner closely with electrical, mechanical, SI/PI, systems, reliability, operations, and external vendors to prevent escapes and drive rapid, data-driven containment and corrective action
Drive quality-by-design: participate in design reviews, DFM/DFx, tolerance stacks, material and plating selections, connector mating strategy, strain relief, and assembly methods to reduce variation and field failures
Define and track quality and reliability metrics (DPPM, yield, escapes, RMA/FRACAS trends, Cpk/Ppk where applicable) for interconnects across NPI and mass production
Build and execute qualification strategies for cables/connectors/patch panels (mechanical, environmental, electrical, and reliability), including test coverage, sample plans, and clear pass/fail criteria; define installation criteria and processes, manage optics termination quality, and set fiber standards criteria
Partner with engineering and operations to drive smooth ramp: risk assessments, pilot build learnings, change control, and readiness reviews (EVT/DVT/PVT/MP or equivalent phases)
Own supplier and CM performance management: scorecards, audits (process and quality system), and follow-up to close findings with verified effectiveness
Work with suppliers to improve manufacturing throughput, stability, and yields for cable and connector assembly processes
Lead rapid containment and root-cause investigations for failures found during bring-up, system integration tests, reliability testing, and fleet deployments

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime
