Member of Technical Staff - GPU Infrastructure

Prime Intellect

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Prime Intellect is building the open superintelligence stack - from frontier agentic models to the infrastructure that enables anyone to create, train, and deploy them. We aggregate and orchestrate global compute into a single control plane and pair it with the full RL post-training stack: environments, secure sandboxes, verifiable evals, and our async RL trainer. We enable researchers, startups, and enterprises to run end-to-end reinforcement learning at frontier scale, adapting models to real tools, workflows, and deployment contexts. As our Solutions Architect for GPU Infrastructure, you'll be the technical expert who transforms customer requirements into production-ready systems capable of training the world's most advanced AI models.

Job Responsibility:

  • Partner with clients to understand workload requirements and design optimal GPU cluster architectures
  • Create technical proposals and capacity planning for clusters ranging from 100 to 10,000+ GPUs
  • Develop deployment strategies for LLM training, inference, and HPC workloads
  • Present architectural recommendations to technical and executive stakeholders
  • Deploy and configure orchestration systems including SLURM and Kubernetes for distributed workloads
  • Implement high-performance networking with InfiniBand, RoCE, and NVLink interconnects
  • Optimize GPU utilization, memory management, and inter-node communication
  • Configure parallel filesystems (Lustre, BeeGFS, GPFS) for optimal I/O performance
  • Tune system performance from kernel parameters to CUDA configurations
  • Serve as primary technical escalation point for customer infrastructure issues
  • Diagnose and resolve complex problems across the full stack - hardware, drivers, networking, and software
  • Implement monitoring, alerting, and automated remediation systems
  • Provide 24/7 on-call support for critical customer deployments
  • Create runbooks and documentation for customer operations teams
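To give a flavor of the capacity-planning work above, the sketch below estimates node and switch counts for a two-tier, non-blocking leaf/spine GPU fabric. The 8-GPUs-per-node, one-NIC-per-GPU, and 64-port-switch figures are illustrative assumptions, not details from this posting; real designs also account for rail optimization, oversubscription targets, and storage/management networks.

```python
import math

def plan_cluster(num_gpus: int, gpus_per_node: int = 8, switch_ports: int = 64):
    """Back-of-envelope sizing for a two-tier leaf/spine GPU fabric.

    Assumes one NIC per GPU and a non-blocking (1:1) fabric: half of
    each leaf switch's ports face nodes, half face spine switches.
    """
    nodes = math.ceil(num_gpus / gpus_per_node)
    nics = nodes * gpus_per_node               # one HCA per GPU
    downlinks_per_leaf = switch_ports // 2     # 1:1 oversubscription
    leaves = math.ceil(nics / downlinks_per_leaf)
    uplinks = leaves * (switch_ports - downlinks_per_leaf)
    spines = math.ceil(uplinks / switch_ports)
    return {"nodes": nodes, "leaf_switches": leaves, "spine_switches": spines}

print(plan_cluster(1024))
# → {'nodes': 128, 'leaf_switches': 32, 'spine_switches': 16}
```

The same function scales to the 10,000+ GPU range mentioned above; only the assumptions about node density and switch radix need to change per deployment.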

Requirements:

  • 3+ years hands-on experience with GPU clusters and HPC environments
  • Deep expertise with SLURM and Kubernetes in production GPU settings
  • Proven experience with InfiniBand configuration and troubleshooting
  • Strong understanding of NVIDIA GPU architecture, CUDA ecosystem, and driver stack
  • Experience with infrastructure automation tools (Ansible, Terraform)
  • Proficiency in Python, Bash, and systems programming
  • Track record of customer-facing technical leadership
  • NVIDIA driver installation and troubleshooting (CUDA, Fabric Manager, DCGM)
  • Container runtime configuration for GPUs (Docker, Containerd, Enroot)
  • Linux kernel tuning and performance optimization
  • Network topology design for AI workloads
  • Power and cooling requirements for high-density GPU deployments

Nice to have:

  • Experience with 1000+ GPU deployments
  • NVIDIA DGX, HGX, or SuperPOD certification
  • Distributed training frameworks (PyTorch FSDP, DeepSpeed, Megatron-LM)
  • ML framework optimization and profiling
  • Experience with AMD MI300 or Intel Gaudi accelerators
  • Contributions to open-source HPC/AI infrastructure projects

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid

Similar Jobs for Member of Technical Staff - GPU Infrastructure

Member of Technical Staff, Performance Optimization

We're looking for a Software Engineer focused on Performance Optimization to hel...
Location:
United States, San Mateo
Salary:
175000.00 - 220000.00 USD / Year
Fireworks AI
Expiration Date
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience
  • 5+ years of experience working on performance optimization or high-performance computing systems
  • Proficiency in CUDA or ROCm and experience with GPU profiling tools (e.g., Nsight, nvprof, CUPTI)
  • Familiarity with PyTorch and performance-critical model execution
  • Experience with distributed system debugging and optimization in multi-GPU environments
  • Deep understanding of GPU architecture, parallel programming models, and compute kernels
Job Responsibility:
  • Optimize system and GPU performance for high-throughput AI workloads across training and inference
  • Analyze and improve latency, throughput, memory usage, and compute efficiency
  • Profile system performance to detect and resolve GPU- and kernel-level bottlenecks
  • Implement low-level optimizations using CUDA, Triton, and other performance tooling
  • Drive improvements in execution speed and resource utilization for large-scale model workloads (LLMs, VLMs, and video models)
  • Collaborate with ML researchers to co-design and tune model architectures for hardware efficiency
  • Improve support for mixed precision, quantization, and model graph optimization
  • Build and maintain performance benchmarking and monitoring infrastructure
  • Scale inference and training systems across multi-GPU, multi-node environments
  • Evaluate and integrate optimizations for emerging hardware accelerators and specialized runtimes
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package

Member of Technical Staff, Capacity & Efficiency Infrastructure

Microsoft AI is looking for a Member of Technical Staff – Capacity & Efficiency ...
Location:
United States, Mountain View
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor’s Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Deep understanding of the fundamentals of GPU architectures and DL/LLM architectures
  • Deep experience in profiling and analyzing performance in large-scale distributed computing systems
  • Deep experience in profiling and analyzing performance in ML models especially GenAI models
  • Experience with low-level GPU programming (CUDA, Triton, NCCL) and frameworks such as PyTorch or JAX
  • Experience in leading technical projects and supporting architectural decisions with data
  • Experience building infrastructure for large-scale machine learning or generative AI workloads
  • Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • Track record of contributing to high-performance computing or large-scale AI infrastructure projects
Job Responsibility:
  • Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
  • Build and evolve telemetry systems to provide visibility into infrastructure & ML model performance, utilization, and cost related metrics
  • Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
  • Drive architectural improvements across various ML services which deliver measurable efficiency improvements
  • Build and evolve tools to automatically provide insights and recommendations to improve fleet-wide efficiency
  • Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
  • Partner with ML researchers and infrastructure engineers to understand their plans and future needs and develop plans to balance growth with efficiency
  • Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, MAIA, and beyond)
  • Embody our Culture and Values

Member of Technical Staff, Pre-Training Infrastructure

Microsoft AI is looking for a Member of Technical Staff, Pre-Training Infrastruc...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Experience in distributed computing and large-scale systems
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Proven ability to profile, benchmark, and optimize performance-critical systems
  • Experience in leading technical projects and supporting architectural decisions with data
  • Experience building infrastructure for large-scale machine learning or generative AI workloads
  • Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • Track record of contributing to high-performance computing or large-scale AI infrastructure projects
Job Responsibility:
  • Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
  • Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
  • Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
  • Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, and beyond)
  • Gather data and insights to develop the pretraining compute roadmap
  • Care deeply about conversational AI and its deployment
  • Actively contribute to the development of AI models powering our innovative products
  • Find solutions to overcome roadblocks and deliver your work to users quickly and iteratively
  • Enjoy working in a fast-paced, design-driven product development cycle
  • Embody our Culture and Values

Member of Technical Staff - Distributed Training Engineer

Our Training Infrastructure team is building the distributed systems that power ...
Location:
United States, San Francisco; Boston
Salary:
Not provided
Liquid AI
Expiration Date
Until further notice
Requirements:
  • Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
  • Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
  • Understanding of hardware accelerators and networking topologies
  • Experience optimizing data pipelines for ML workloads
Job Responsibility:
  • Design and build core systems that make large training runs fast and reliable
  • Build scalable distributed training infrastructure for GPU clusters
  • Implement and tune parallelism/sharding strategies for evolving architectures
  • Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
  • Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
  • Develop checkpointing mechanisms balancing memory constraints with recovery needs
  • Create monitoring, profiling, and debugging tools for training stability and performance
What we offer:
  • Competitive base salary with equity in a unicorn-stage company
  • We pay 100% of medical, dental, and vision premiums for employees and dependents
  • 401(k) matching up to 4% of base pay
  • Unlimited PTO plus company-wide Refill Days throughout the year

Member of Technical Staff, Inference

We're looking for an ML infrastructure engineer to bridge the gap between resear...
Location:
United States
Salary:
240000.00 - 290000.00 USD / Year
Runway
Expiration Date
Until further notice
Requirements:
  • 4+ years of experience running ML model inference at scale in production environments
  • Strong experience with PyTorch and multi-GPU inference for large models
  • Experience with Kubernetes for ML workloads—deploying, scaling, and debugging GPU-based services
  • Comfortable working across multiple cloud providers and managing GPU driver compatibility
  • Experience with monitoring and observability for ML systems (errors, throughput, GPU utilization)
  • Self-starter who can work embedded with research teams and move fast
  • Strong systems thinking and pragmatic approach to production reliability
  • Humility and open-mindedness
Job Responsibility:
  • Productionize model checkpoints end-to-end: from research completion to internal testing to production deployment to post-release support
  • Build and optimize inference systems for large-scale generative models running on multi-GPU environments
  • Design and implement model serving infrastructure specialized for diffusion models and real-time diffusion workflows
  • Add monitoring and observability for new model releases—track errors, throughput, GPU utilization, and latency
  • Embed with research teams to gather training data, run preprocessing scripts, and support the model development process
  • Explore and integrate with GPU inference providers (Modal, E2E, Baseten, etc.)

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning(ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility:
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer:
  • Competitive compensation, equity options, and comprehensive benefits

Member of Technical Staff, Software Engineer

Help build the infrastructure that powers training, evaluation, and data platfor...
Location:
Switzerland, Zürich
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Strong software engineering background building reliable, scalable production systems (Python preferred)
  • Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
  • Operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
  • Designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
  • Platform reliability and operability: observability, metrics, logging, tracing, alerting (Prometheus, Grafana, OpenTelemetry)
Job Responsibility:
  • Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
  • Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations, and advocate for best practices in security, reproducibility, and cost efficiency
  • Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry)
  • Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
  • Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
  • Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
  • Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals