Senior AI Infrastructure Engineer - Training Platform

Scale

Location:
United States, San Francisco

Contract Type:
Employment contract

Salary:

216000.00 - 270000.00 USD / Year

Job Description:

As a Software Engineer on the Machine Learning Infrastructure team, you will build the "Operating System" for our large-scale GPU clusters. You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads, ensuring every cycle is used efficiently. Your work directly determines the velocity at which our researchers can train and iterate on the world's most advanced models.

Job Responsibility:

  • Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs
  • Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
  • Work closely with Finance and Procurement teams to drive our capacity planning process
  • Participate in our team's on-call process to ensure the availability of our services
  • Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment

Requirements:

  • 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
  • Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
  • Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
  • Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling
  • Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
  • Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments

Nice to have:

  • Experience with distributed training techniques such as DeepSpeed, FSDP, etc.
  • Experience with the NVIDIA software and hardware stack (CUDA, NCCL)
  • Experience with PyTorch
  • Familiarity with post-training algorithms such as GRPO, and with Reinforcement Learning

What we offer:

  • Comprehensive health, dental, and vision coverage
  • Retirement benefits
  • Learning and development stipend
  • Generous PTO
  • Commuter stipend (role may be eligible)

Additional Information:

Job Posted:
May 04, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Senior AI Infrastructure Engineer - Training Platform

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date:
Until further notice
Requirements:
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility:
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer:
  • Medical, dental, vision, and 401(k)
Employment Type:
Full-time

Senior Platform Engineer

As a Senior Platform Engineer at Aignostics, you work hand in hand with our team...
Location:
Germany, Berlin
Salary:
Not provided
Aignostics
Expiration Date:
Until further notice
Requirements:
  • Proven experience: 5+ years in platform engineering, with a proven track record in managing complex, cloud-native infrastructure
  • Technical expert: work with Kubernetes, Docker, Terraform, and Cloud Environments, with a deep understanding of CI/CD processes and tools
  • Excellent coder: automate with strong programming and scripting abilities, preferably in Go/Python and bash
  • Network architect: get our services flying by leveraging technologies like Virtual Private Clouds, DNS, Reverse Proxies and Firewalls
  • Outstanding communicator: ability to explain complex technical concepts, drive decisions, and collaborate across teams (fluent in English, German is a plus)
  • Self-driven learner: stays current with technology trends, evaluates new solutions (including AI), and grows with our challenges
  • Live the DevOps culture: educate our developers on core principles, implement collaborative processes, and provide the tools and frameworks that enable them to follow those principles
Job Responsibility:
  • Introduce, implement, and own architectural solutions for our Kubernetes clusters, internal services, and cloud-based infrastructure to improve our developers' experience
  • Identify and automate workflows and processes using event-driven pipelines such as GitLab CI/CD or Argo Workflows
  • Introduce and drive the security of our applications and data by implementing state-of-the-art security concepts in Kubernetes and cloud environments
  • Work closely with the engineering and data science teams to help our products and AI model training excel
  • Propose and drive the adoption of best practices in infrastructure management, scalability, and security
What we offer:
  • Cutting-edge AI research and development, with involvement of Charité, TU Berlin and our other partners
  • Work with a welcoming, diverse and highly international team of colleagues
  • Opportunity to take responsibility and grow your role within the startup
  • Expand your skills by benefitting from our Learning & Development yearly budget of 1,000€ (plus 2 L&D days), language classes and internal development programs
  • Mentoring program: you’ll learn from great experts
  • Flexible working hours and teleworking policy
  • Enjoy your well-deserved time off with 30 paid vacation days per year
  • We are family & pet friendly and support flexible parental leave options
  • Pick a subsidized membership of your choice among public transport, sports and well-being
  • Enjoy our social gatherings, lunches, and off-site events for a fun and inclusive work environment

Senior Engineering Manager - AI

We are seeking a Senior Engineering Manager (Level 5) to lead a high-performing ...
Location:
India, Chennai
Salary:
Not provided
Arcadia
Expiration Date:
Until further notice
Requirements:
  • 15+ years of professional experience in software engineering
  • At least 4 years in engineering leadership roles
  • Strong technical background in AI/ML systems, large-scale data pipelines, and cloud-native platforms
  • Hands-on experience with Python (preferred), modern ML frameworks (PyTorch/TensorFlow), and cloud services (AWS)
  • Proven success in managing teams of 4–6 engineers, scaling processes, and building diverse, high-performance teams
  • Strong architectural design and system-thinking abilities
  • Excellent communication skills with ability to influence cross-functional stakeholders
  • Passion for sustainability, decarbonization, and using technology to create positive climate impact
  • Experience building agentic pipelines with the latest models from Anthropic, Google, OpenAI, and more
Job Responsibility:
  • Lead and grow a team of engineers focused on building AI-driven and data-intensive systems for the Arcadia platform
  • Design and train ML/AI models (forecasting, NLP, graph learning, generative AI) to improve data quality, cost effectiveness, and system scalability
  • Build true agentic workflows with multi-step processing incorporating RAG pipelines and MCPs
  • Balance management responsibilities (hiring, coaching, performance reviews, career growth) with technical leadership (architecture, system design, technical strategy)
  • Drive end-to-end delivery of complex projects in partnership with Product, Data, and Infrastructure teams
  • Guide the adoption of modern AI/ML technologies, ensuring practical, scalable use in production
  • Foster a culture of high performance, ownership, and technical excellence
  • Establish engineering best practices in testing, observability, reliability, and CI/CD
  • Partner with leadership to define roadmaps, set priorities, and align execution with Arcadia’s strategic goals
  • Represent AI across the company, articulating technical trade-offs and championing innovation
What we offer:
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
Employment Type:
Full-time

Senior ML Platform Engineer

At WHOOP, we're on a mission to unlock human performance and healthspan. WHOOP e...
Location:
United States, Boston
Salary:
150000.00 - 210000.00 USD / Year
Whoop
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 5+ years of experience in software engineering with a focus on ML infrastructure, cloud platforms, or MLOps
  • Strong programming skills in Python, with experience in building distributed systems and REST/gRPC APIs
  • Deep knowledge of cloud-native services and infrastructure-as-code (e.g., AWS CDK, Terraform, CloudFormation)
  • Hands-on experience with model deployment platforms such as AWS SageMaker, Vertex AI, or Kubernetes-based serving stacks
  • Proficiency in ML lifecycle tools (MLflow, Weights & Biases, BentoML) and containerization strategies (Docker, Kubernetes)
  • Understanding of data engineering and ingestion pipelines, with ability to interface with data lakes, feature stores, and streaming systems
  • Proven ability to work cross-functionally with Data Science, Data Platform, and Software Engineering teams, influencing decisions and driving alignment
  • Passion for AI and automation to solve real-world problems and improve operational workflows
Job Responsibility:
  • Architect, build, own, and operate scalable ML infrastructure in cloud environments (e.g., AWS), optimizing for speed, observability, cost, and reproducibility
  • Create, support, and maintain core MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry), ensuring reliability, scalability, and long-term sustainability
  • Develop, evolve, and operate MLOps platforms and frameworks that standardize model deployment, versioning, drift detection, and lifecycle management at scale
  • Implement and continuously maintain end-to-end CI/CD pipelines for ML models using orchestration tools (e.g., Prefect, Airflow, Argo Workflows), ensuring robust testing, reproducibility, and traceability
  • Partner closely with Data Science, Sensor Intelligence, and Data Platform teams to operationalize and support model development, deployment, and monitoring workflows
  • Build, manage, and maintain both real-time and batch inference infrastructure, supporting diverse use cases from physiological analytics to personalized feedback loops for WHOOP members
  • Design, implement, and own automated observability tooling (e.g., for model latency, data drift, accuracy degradation), integrating metrics, logging, and alerting with existing platforms
  • Leverage AI-powered tools and automation to reduce operational overhead, enhance developer productivity, and accelerate model release cycles
  • Contribute to and maintain internal platform documentation, SDKs, and training materials, enabling self-service capabilities for model deployment and experimentation
  • Continuously evaluate and integrate emerging technologies and deployment strategies, influencing WHOOP’s roadmap for AI-driven platform efficiency, reliability, and scale
What we offer:
  • equity
  • benefits
Employment Type:
Full-time

Senior Devops & AI Engineer

This role presents a unique opportunity to contribute to the future of impactful...
Location:
India, Hyderabad
Salary:
Not provided
Fission Labs
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field
  • 6+ years of experience in infrastructure management roles, with a focus on cloud platforms (Azure and AWS preferred)
  • Hands-on experience with operations (DevSecOps) principles and best practices
  • Proficiency in scripting languages such as Python, PowerShell, or Bash
  • Excellent communication and collaboration skills
  • In-depth knowledge of Linux operating systems, including CentOS, Ubuntu, and Red Hat, with expertise in shell scripting, package management, and system administration
  • Hands-on experience with a wide range of AWS and Azure services
  • Experience developing and maintaining Infrastructure as Code (IaC) templates using tools such as Terraform or AWS CloudFormation
  • Experience setting up cloud infrastructure stacks, databases, service endpoints, and GPU and CPU resource scaling and optimization
  • Prior experience with AIOps/MLOps
Job Responsibility:
  • Configure and optimize Linux-based servers for performance, security, and resource utilization, including kernel tuning, file system management, and network configuration
  • Architect cloud solutions leveraging best practices and services offered by AWS and Azure, optimizing for scalability, reliability, and cost-effectiveness
  • Implement and manage hybrid cloud environments, facilitating seamless integration and interoperability between AWS and Azure services
  • Establish version control practices for IaC templates, ensuring traceability, auditability, and reproducibility of infrastructure changes
What we offer:
  • Opportunity to work on impactful technical challenges with global reach
  • Vast opportunities for self-development, including online university access and knowledge sharing opportunities
  • Sponsored Tech Talks & Hackathons to foster innovation and learning
  • Generous benefits packages including health insurance, retirement benefits, flexible work hours, and more
  • Supportive work environment with forums to explore passions beyond work
Employment Type:
Full-time

Senior Engineering Manager - AI/ML

As the Senior Engineering Manager, you will lead by being a highly technical lea...
Location:
United States
Salary:
Not provided
Aledade, Inc.
Expiration Date:
Until further notice
Requirements:
  • BS/BTech (or higher) in Computer Science, Engineering or a related field required
  • 10+ years of production-level experience as an engineer and technical lead building highly scalable and reliable software
  • 5+ years of managerial experience building and leading technical engineering teams
  • 7+ years of experience in machine learning related technologies, with a strong preference for Python
  • Extensive experience in designing and implementing secure, scalable, and maintainable AI/ML platform architectures
  • Proficiency in distributed systems, microservices, containerization technologies (e.g., Docker, Kubernetes), model training infrastructure, orchestration tools, and MLOps principles
  • Sitting for prolonged periods of time
  • Extensive use of computers and keyboard
  • Occasional walking and lifting may be required
Job Responsibility:
  • Build a high performing team by hiring and nurturing engineering talent
  • Provide strong technical leadership: drive technical solutions and build roadmaps
  • Set aggressive and clear goals and remove all roadblocks for the team to achieve them
  • Work seamlessly and collaboratively with stakeholders across Aledade to achieve business outcomes
  • Work closely with engineering leaders to drive engineering excellence in our processes and systems
Employment Type:
Full-time

Senior Staff Machine Learning Engineer

Help design our AI platform and develop our next generation of machine learning ...
Location:
United States, San Francisco
Salary:
216500.00 - 324500.00 USD / Year
GoFundMe
Expiration Date:
Until further notice
Requirements:
  • 9+ years of hands-on experience in machine learning engineering, AI development, software engineering, or related fields
  • Experience emphasizing secure, large-scale, distributed system design, AI/ML pipeline development, and implementation
  • Extensive experience designing, developing, and operating scalable backend systems
  • Experience applying software engineering best practices such as domain-driven design, event-driven architectures, and microservices
  • Deep expertise in agentic workflows, AI evaluation solutions, prompt management, and secure AI development and testing practices
  • Strong knowledge of relational and document-based databases, data storage paradigms, and efficient RESTful API design
  • Experience establishing robust CI/CD pipelines, automated testing (unit and integration), and deployment practices
  • Strong leadership skills, including effective planning and management of complex projects, mentoring of team members, and fostering a collaborative, high-performing engineering culture
  • Excellent communicator, able to articulate complex technical concepts clearly to both technical and non-technical stakeholders
  • Bachelor's degree in Computer Science, Software Engineering, or a related technical field (preferred)
Job Responsibility:
  • Design and implement AI platforms to enable scalable and secure access to LLMs from multiple model providers for diverse use cases
  • Design and implement agentic workflows, agentic tool ecosystems, and LLM prompt management solutions
  • Design, build, and optimize scalable model training, fine tuning, and inference pipelines, ensuring robust integration with production systems
  • Influence technical strategy and approach to developing embedding stores, vector databases, and other reusable assets
  • Lead initiatives to streamline ML and AI workflows, improve operational efficiency, and establish standardized procedures to achieve consistent, high-quality results across our AI systems
  • Design and develop backend services and RESTful APIs using Python and FastAPI, integrating seamlessly with ML pipelines and services
  • Take operational responsibility for team-owned services, including performance monitoring, optimization, troubleshooting, and participation in an on-call rotation
  • Collaborate with both technical and non-technical colleagues, including data and applied scientists, software engineers, product managers, and business stakeholders, to deliver reliable and scalable ML-driven products
  • Coach and mentor fellow ML engineers, promoting a culture of collaboration, continuous improvement, and engineering excellence within the team
  • Employ a diverse set of tools and platforms including Python, AWS, Databricks, Docker, Kubernetes, FastAPI, Terraform, Snowflake, Coralogix, and GitHub to build, deploy, and maintain scalable, highly available machine learning infrastructure
What we offer:
  • Competitive pay
  • Comprehensive healthcare benefits
  • Financial assistance for things such as hybrid work and family planning
  • Generous parental leave
  • Flexible time-off policies
  • Mental health and wellness resources
  • Learning, development, and recognition programs
Employment Type:
Full-time

Senior Machine Learning Engineer

As an ML Engineer at Axon, you will contribute to developing AI solutions transf...
Location:
United States, Seattle
Salary:
150750.00 - 221000.00 USD / Year
Axon
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s Degree in Computer Science, Engineering, Electronics, Mathematics or an equivalent highly technical field
  • 6+ years of software engineering experience and a proven track record of successfully deploying AI models to the cloud
  • Experience with Infrastructure-as-code and cloud architecture
  • Proficiency in Python and C++
  • Familiarity with ML frameworks such as TensorFlow or PyTorch
  • Advanced knowledge and hands-on experience with Linux
  • Excellent problem solving skills and ability to dive deep into system architecture
  • Excellent software design skills
  • Comfort communicating and interacting with scientists, engineers and product managers
Job Responsibility:
  • Collaborate with scientists and product managers to build proof-of-concepts (POCs) that help shape the Axon of tomorrow
  • Architect and develop secure, privacy-preserving solutions to enable the continuous improvement of existing AI models
  • Architect platforms that accelerate research and AI product development
  • Collaborate with scientists in architecting and implementing state-of-the-art training techniques
  • Set high standards for ethical and responsible AI development
What we offer:
  • Competitive salary and 401k with employer match
  • Discretionary paid time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Mental Wellness support
  • Learning & Development programs
  • Snacks in our offices
Employment Type:
Full-time