Member of Technical Staff, AI Training Infrastructure Job at Fireworks AI (San Mateo)

Member of Technical Staff, Cloud Infrastructure

As a Software Engineer on our Cloud Infrastructure team, you'll be at the forefr...

Location

United States, New York, NY; San Mateo, CA; Redwood City, CA

Salary:

175000.00 - 220000.00 USD / Year

Fireworks AI

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure)
Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes, etc.)
Strong software development skills in languages like Python or C++
Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization

Job Responsibility

Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines
Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure
Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency
Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning
Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions
Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLflow) to enhance our platform’s capabilities and reliability
Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence

What we offer

Meaningful equity in a fast-growing startup
Competitive salary
Comprehensive benefits package

Fulltime

Member of Technical Staff, Performance Optimization

We're looking for a Software Engineer focused on Performance Optimization to hel...

Location

United States, San Mateo

Salary:

175000.00 - 220000.00 USD / Year

Fireworks AI

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience
5+ years of experience working on performance optimization or high-performance computing systems
Proficiency in CUDA or ROCm and experience with GPU profiling tools (e.g., Nsight, nvprof, CUPTI)
Familiarity with PyTorch and performance-critical model execution
Experience with distributed system debugging and optimization in multi-GPU environments
Deep understanding of GPU architecture, parallel programming models, and compute kernels

Job Responsibility

Optimize system and GPU performance for high-throughput AI workloads across training and inference
Analyze and improve latency, throughput, memory usage, and compute efficiency
Profile system performance to detect and resolve GPU- and kernel-level bottlenecks
Implement low-level optimizations using CUDA, Triton, and other performance tooling
Drive improvements in execution speed and resource utilization for large-scale model workloads (LLMs, VLMs, and video models)
Collaborate with ML researchers to co-design and tune model architectures for hardware efficiency
Improve support for mixed precision, quantization, and model graph optimization
Build and maintain performance benchmarking and monitoring infrastructure
Scale inference and training systems across multi-GPU, multi-node environments
Evaluate and integrate optimizations for emerging hardware accelerators and specialized runtimes

What we offer

Meaningful equity in a fast-growing startup
Competitive salary
Comprehensive benefits package

Fulltime

Member of Technical Staff, Pre-Training Infrastructure

Microsoft AI is looking for a Member of Technical Staff, Pre-Training Infrastruc...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience in distributed computing and large-scale systems
Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
Proven ability to profile, benchmark, and optimize performance-critical systems
Experience in leading technical projects and supporting architectural decisions with data
Experience building infrastructure for large-scale machine learning or generative AI workloads
Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
Track record of contributing to high-performance computing or large-scale AI infrastructure projects

Job Responsibility

Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, and beyond)
Gather data and insights to develop the pretraining compute roadmap
Care deeply about conversational AI and its deployment
Actively contribute to the development of AI models powering our innovative products
Find solutions to overcome roadblocks and deliver your work to users quickly and iteratively
Enjoy working in a fast-paced, design-driven product development cycle
Embody our Culture and Values

Fulltime

Member of Technical Staff - Distributed Training Engineer

Our Training Infrastructure team is building the distributed systems that power ...

Location

United States, San Francisco; Boston

Salary:

Not provided

Liquid AI

Expiration Date

Until further notice

Requirements

Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
Understanding of hardware accelerators and networking topologies
Experience optimizing data pipelines for ML workloads

Job Responsibility

Design and build core systems that make large training runs fast and reliable
Build scalable distributed training infrastructure for GPU clusters
Implement and tune parallelism/sharding strategies for evolving architectures
Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
Develop checkpointing mechanisms balancing memory constraints with recovery needs
Create monitoring, profiling, and debugging tools for training stability and performance

What we offer

Competitive base salary with equity in a unicorn-stage company
We pay 100% of medical, dental, and vision premiums for employees and dependents
401(k) matching up to 4% of base pay
Unlimited PTO plus company-wide Refill Days throughout the year

Fulltime

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Extremely strong software engineering skills
Proficiency in Python and related ML frameworks such as JAX, PyTorch, and XLA/MLIR
Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
Experience using large-scale distributed training strategies
Hands-on experience training large models at scale and contributing to the tooling and/or setup of the training infrastructure

Job Responsibility

Design and write high-performance, scalable software for training
Improve our training setup from an infrastructure and codebase performance standpoint
Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
Research, implement, and experiment with ideas on our supercompute and data infrastructure
Learn from and work with the best researchers in the field

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Member of Technical Staff, Capacity & Efficiency Infrastructure

Microsoft AI is looking for a Member of Technical Staff – Capacity & Efficiency ...

Location

United States, Mountain View

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor’s Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Deep understanding of the fundamentals of GPU architectures and DL/LLM architectures
Deep experience in profiling and analyzing performance in large-scale distributed computing systems
Deep experience in profiling and analyzing performance in ML models, especially GenAI models
Experience with low-level GPU programming (CUDA, Triton, NCCL) and frameworks such as PyTorch or JAX
Experience in leading technical projects and supporting architectural decisions with data
Experience building infrastructure for large-scale machine learning or generative AI workloads
Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
Track record of contributing to high-performance computing or large-scale AI infrastructure projects

Job Responsibility

Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
Build and evolve telemetry systems to provide visibility into infrastructure and ML model performance, utilization, and cost-related metrics
Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
Drive architectural improvements across ML services that deliver measurable efficiency gains
Build and evolve tools to automatically provide insights and recommendations to improve fleet-wide efficiency
Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
Partner with ML researchers and infrastructure engineers to understand their plans and future needs and develop plans to balance growth with efficiency
Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, MAIA, and beyond)
Embody our Culture and Values

Fulltime

Senior Member of technical staff (Infrastructure)

About the Team: The Infrastructure team aims to make it seamless for our researc...

Location

United Kingdom; France, London; Paris

Salary:

Not provided

H Company

Expiration Date

Until further notice

Requirements

Infrastructure as code (CDK, Terraform, ...)
Experience architecting and deploying distributed systems on public cloud (AWS, Azure, GCP)
Observability and monitoring (Datadog, Prometheus, Grafana, …)
Good knowledge of a modern programming language (ideally Python or JS/TypeScript)

Job Responsibility

Designing and managing the infrastructure to support Research efforts in Model and Agent development, including training infrastructure, data pipelines, and inference
Designing and managing the infrastructure to support Product Engineering efforts on H Company’s agent platform, including client-facing APIs and agent runtimes within various deployment scenarios (multi-tenant and on-prem)
Setting up and maintaining observability and monitoring strategies
Mentor and grow other engineers in infrastructure-related topics as well as general engineering practices

What we offer

Join the exciting journey of shaping the future of AI, and be part of the early days of one of the hottest AI startups
Collaborate with a fun, dynamic and multicultural team, working alongside world-class AI talent in a highly collaborative environment
Enjoy a competitive salary
Unlock opportunities for professional growth, continuous learning, and career development

Fulltime

Senior Staff Machine Learning Engineer

Help design our AI platform and develop our next generation of machine learning ...

Location

United States, San Francisco

Salary:

216500.00 - 324500.00 USD / Year

GoFundMe

Expiration Date

Until further notice

Requirements

9+ years of hands-on experience in machine learning engineering, AI development, software engineering, or related fields
Experience emphasizing secure, large-scale, distributed system design, AI/ML pipeline development, and implementation
Extensive experience designing, developing, and operating scalable backend systems
Experience applying software engineering best practices such as domain-driven design, event-driven architectures, and microservices
Deep expertise in agentic workflows, AI evaluation solutions, prompt management, and secure AI development and testing practices
Strong knowledge of relational and document-based databases, data storage paradigms, and efficient RESTful API design
Experience establishing robust CI/CD pipelines, automated testing (unit and integration), and deployment practices
Strong leadership skills, including effective planning and management of complex projects, mentoring of team members, and fostering a collaborative, high-performing engineering culture
Excellent communicator, able to articulate complex technical concepts clearly to both technical and non-technical stakeholders
Bachelor's degree in Computer Science, Software Engineering, or a related technical field (preferred)

Job Responsibility

Design and implement AI platforms to enable scalable and secure access to LLMs from multiple model providers for diverse use cases
Design and implement agentic workflows, agentic tool ecosystems, and LLM prompt management solutions
Design, build, and optimize scalable model training, fine-tuning, and inference pipelines, ensuring robust integration with production systems
Influence technical strategy and approach to developing embedding stores, vector databases, and other reusable assets
Lead initiatives to streamline ML and AI workflows, improve operational efficiency, and establish standardized procedures to achieve consistent, high-quality results across our AI systems
Design and develop backend services and RESTful APIs using Python and FastAPI, integrating seamlessly with ML pipelines and services
Take operational responsibility for team-owned services, including performance monitoring, optimization, troubleshooting, and participation in an on-call rotation
Collaborate with both technical and non-technical colleagues, including data and applied scientists, software engineers, product managers, and business stakeholders, to deliver reliable and scalable ML-driven products
Coach and mentor fellow ML engineers, promoting a culture of collaboration, continuous improvement, and engineering excellence within the team
Employ a diverse set of tools and platforms including Python, AWS, Databricks, Docker, Kubernetes, FastAPI, Terraform, Snowflake, Coralogix, and GitHub to build, deploy, and maintain scalable, highly available machine learning infrastructure

What we offer

Competitive pay
Comprehensive healthcare benefits
Financial assistance for things like hybrid work and family planning
Generous parental leave
Flexible time-off policies
Mental health and wellness resources
Learning, development, and recognition programs

Fulltime


Member of Technical Staff, AI Training Infrastructure

Fireworks AI

Location:
United States, San Mateo

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:
December 08, 2025


