Senior AI Infrastructure Engineer - Training Platform Job at Scale (San Francisco)

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...

Location

United States , San Francisco

Salary:

180000.00 - 270000.00 USD / Year

Plaid

Expiration Date

Until further notice

Requirements

5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
Proven experience delivering reliable and scalable infrastructure in production
Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
Strong communication skills and ability to collaborate across teams

Job Responsibility

Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
Contribute to technical strategy and architecture discussions within the team
Mentor and support other engineers through code reviews, design discussions, and technical guidance

What we offer

medical, dental, vision, and 401(k)

Fulltime

Senior Platform Engineer

As a Senior Platform Engineer at Aignostics, you work hand in hand with our team...

Location

Germany , Berlin

Salary:

Not provided

Aignostics

Expiration Date

Until further notice

Requirements

Proven experience: 5+ years in platform engineering, with a proven track record in managing complex, cloud-native infrastructure
Technical expert: work with Kubernetes, Docker, Terraform, and Cloud Environments, with a deep understanding of CI/CD processes and tools
Excellent coder: automate with strong programming and scripting abilities, preferably in Go/Python and bash
Network architect: get our services flying by leveraging technologies like Virtual Private Clouds, DNS, Reverse Proxies and Firewalls
Outstanding communicator: ability to explain complex technical concepts, drive decisions, and collaborate across teams (fluent in English, German is a plus)
Self-driven learner: stays current with technology trends, evaluates new solutions (including AI), and grows with our challenges
Live the Devops Culture: Educating our developers on core principles, implementing collaborative processes, and providing the necessary tools and frameworks that enable them to do so

Job Responsibility

Introduce, implement and own architectural solutions of our Kubernetes clusters, internal services, and cloud-based infrastructure to improve our developer’s experience
Identify & Automate workflows and processes leveraging from event driven pipelines like Gitlab CICD or Argo Workflows
Introduce and drive the security of our applications and data by implementing state of the art security concepts in Kubernetes and cloud environments
Work closely with the engineering and data science teams to bring our products and AI model training to its excellence
Propose and drive the adoption of best practices in infrastructure management, scalability, and security

What we offer

Cutting-edge AI research and development, with involvement of Charité, TU Berlin and our other partners
Work with a welcoming, diverse and highly international team of colleagues
Opportunity to take responsibility and grow your role within the startup
Expand your skills by benefitting from our Learning & Development yearly budget of 1,000€ (plus 2 L&D days), language classes and internal development programs
Mentoring program, you’ll learn from great experts
Flexible working hours and teleworking policy
Enjoy your well-deserved time off within our 30 paid vacations days per year
We are family & pet friendly and support flexible parental leave options
Pick a subsidized membership of your choice among public transport, sports and well-being
Enjoy our social gatherings, lunches, and off-site events for a fun and inclusive work environment

Senior Engineering Manager - AI

We are seeking a Senior Engineering Manager (Level 5) to lead a high-performing ...

Location

India , Chennai

Salary:

Not provided

Arcadia

Expiration Date

Until further notice

Requirements

15+ years of professional experience in software engineering
At least 4+ years in engineering leadership roles
Strong technical background in AI/ML systems, large-scale data pipelines, and cloud-native platforms
Hands-on experience with Python (preferred), modern ML frameworks (PyTorch/TensorFlow), and cloud services (AWS)
Proven success in managing teams of 4–6 engineers, scaling processes, and building diverse, high-performance teams
Strong architectural design and system-thinking abilities
Excellent communication skills with ability to influence cross-functional stakeholders
Passion for sustainability, decarbonization, and using technology to create positive climate impact
Experienced with building agentic pipelines with the latest models from Anthropic, Google, OpenAI, and more

Job Responsibility

Lead and grow a team of engineers focused on building AI-driven and data-intensive systems for the Arcadia platform
Design and train ML/AI models (forecasting, NLP, graph learning, generative AI) to improve data quality, cost effectiveness, and system scalability
Build true agentic workflows with multi-step processing incorporating RAG pipelines and MCPs
Balance management responsibilities (hiring, coaching, performance reviews, career growth) with technical leadership (architecture, system design, technical strategy)
Drive end-to-end delivery of complex projects in partnership with Product, Data, and Infrastructure teams
Guide the adoption of modern AI/ML technologies, ensuring practical, scalable use in production
Foster a culture of high performance, ownership, and technical excellence
Establish engineering best practices in testing, observability, reliability, and CI/CD
Partner with leadership to define roadmaps, set priorities, and align execution with Arcadia’s strategic goals
Represent AI across the company, articulating technical trade-offs and championing innovation

What we offer

Competitive compensation and employee stock options
Hybrid/remote-first working model (India-based role, with global collaboration)
Flexible leave policy
Comprehensive medical insurance (self + family members)
Annual performance cycle + quarterly recognition awards
A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation

Fulltime

Senior ML Platform Engineer

At WHOOP, we're on a mission to unlock human performance and healthspan. WHOOP e...

Location

United States , Boston

Salary:

150000.00 - 210000.00 USD / Year

Whoop

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s Degree in Computer Science, Engineering, or a related field
or equivalent practical experience
5+ years of experience in software engineering with a focus on ML infrastructure, cloud platforms, or MLOps
Strong programming skills in Python, with experience in building distributed systems and REST/gRPC APIs
Deep knowledge of cloud-native services and infrastructure-as-code (e.g., AWS CDK, Terraform, CloudFormation)
Hands-on experience with model deployment platforms such as AWS SageMaker, Vertex AI, or Kubernetes-based serving stacks
Proficiency in ML lifecycle tools (MLflow, Weights & Biases, BentoML) and containerization strategies (Docker, Kubernetes)
Understanding of data engineering and ingestion pipelines, with ability to interface with data lakes, feature stores, and streaming systems
Proven ability to work cross-functionally with Data Science, Data Platform, and Software Engineering teams, influencing decisions and driving alignment
Passion for AI and automation to solve real-world problems and improve operational workflows

Job Responsibility

Architect, build, own, and operate scalable ML infrastructure in cloud environments (e.g., AWS), optimizing for speed, observability, cost, and reproducibility
Create, support, and maintain core MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry), ensuring reliability, scalability, and long-term sustainability
Develop, evolve, and operate MLOps platforms and frameworks that standardize model deployment, versioning, drift detection, and lifecycle management at scale
Implement and continuously maintain end-to-end CI/CD pipelines for ML models using orchestration tools (e.g., Prefect, Airflow, Argo Workflows), ensuring robust testing, reproducibility, and traceability
Partner closely with Data Science, Sensor Intelligence, and Data Platform teams to operationalize and support model development, deployment, and monitoring workflows
Build, manage, and maintain both real-time and batch inference infrastructure, supporting diverse use cases from physiological analytics to personalized feedback loops for WHOOP members
Design, implement, and own automated observability tooling (e.g., for model latency, data drift, accuracy degradation), integrating metrics, logging, and alerting with existing platforms
Leverage AI-powered tools and automation to reduce operational overhead, enhance developer productivity, and accelerate model release cycles
Contribute to and maintain internal platform documentation, SDKs, and training materials, enabling self-service capabilities for model deployment and experimentation
Continuously evaluate and integrate emerging technologies and deployment strategies, influencing WHOOP’s roadmap for AI-driven platform efficiency, reliability, and scale

What we offer

equity
benefits

Fulltime

Senior Devops & AI Engineer

This role presents a unique opportunity to contribute to the future of impactful...

Location

India , Hyderabad

Salary:

Not provided

Fission Labs

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Engineering, or related field
6+ years of experience in Infrastructure Mgmt. roles, with a focus on cloud platforms (Azure and AWS Preferred)
Hands-on experience with operations (DevSecOps) principles and best practices
Proficiency in scripting languages such as Python, PowerShell, or Bash
Excellent communication and collaboration skills
In-depth knowledge of Linux operating systems, including CentOS, Ubuntu, and Red Hat, with expertise in shell scripting, package management, and system administration
Hands-on experience with a wide range of AWS and Azure services
Develop and maintain Infrastructure as Code (IAC) templates using tools such as Terraform or AWS CloudFormation
Experience setting up cloud infrastructure stack, databases, service endpoints, GPU as well as CPU resource scaling, optimization etc.
Should have worked AIOps/MLOP

Job Responsibility

Configure and optimize Linux-based servers for performance, security, and resource utilization, including kernel tuning, file system management, and network configuration
Architect cloud solutions leveraging best practices and services offered by AWS and Azure, optimizing for scalability, reliability, and cost-effectiveness
Implement and manage hybrid cloud environments, facilitating seamless integration and interoperability between AWS and Azure services
Establish version control practices for IAC templates, ensuring traceability, auditability, and reproducibility of infrastructure changes

What we offer

Opportunity to work on impactful technical challenges with global reach
Vast opportunities for self-development, including online university access and knowledge sharing opportunities
Sponsored Tech Talks & Hackathons to foster innovation and learning
Generous benefits packages including health insurance, retirement benefits, flexible work hours, and more
Supportive work environment with forums to explore passions beyond work

Fulltime

Senior Engineering Manager- AI/ML

As the Senior Engineering Manager, you will lead by being a highly technical lea...

Location

United States

Salary:

Not provided

Aledade, Inc.

Expiration Date

Until further notice

Requirements

BS/BTech (or higher) in Computer Science, Engineering or a related field required
10+ years of production-level experience as an engineer and technical lead building highly scalable and reliable software
5+ years of managerial experience building and leading technical engineering teams
7+ years of experience in machine learning related technologies, with a strong preference for Python
Extensive experience in designing and implementing secure, scalable, and maintainable AI/ML platform architectures
Proficiency in distributed systems, microservices, containerization technologies (e.g., Docker, Kubernetes), model training infrastructure, orchestration tools, and MLOps principles
Sitting for prolonged periods of time
Extensive use of computers and keyboard
Occasional walking and lifting may be required

Job Responsibility

Build a high performing team by hiring and nurturing engineering talent
Strong technical leadership - drive technical solutioning and building roadmaps
Set aggressive and clear goals and remove all roadblocks for the team to achieve them
Working seamlessly and collaboratively with stakeholders across Aledade to achieve business outcomes
Work closely with engineering leaders to drive engineering excellence in our processes and systems

Fulltime

Senior Staff Machine Learning Engineer

Help design our AI platform and develop our next generation of machine learning ...

Location

United States , San Francisco

Salary:

216500.00 - 324500.00 USD / Year

GoFundMe

Expiration Date

Until further notice

Requirements

9+ years of hands-on experience in machine learning engineering, AI development, software engineering, or related fields
Experience emphasizing secure, large-scale, distributed system design, AI/ML pipeline development, and implementation
Extensive experience designing, developing, and operating scalable backend systems
Experience applying software engineering best practices such as domain-driven design, event-driven architectures, and microservices
Deep expertise in agentic workflows, AI evaluation solutions, prompt management, and secure AI development and testing practices
Strong knowledge of relational and document-based databases, data storage paradigms, and efficient RESTful API design
Experience establishing robust CI/CD pipelines, automated testing (unit and integration), and deployment practices
Strong leadership skills, including effective planning and management of complex projects, mentoring of team members, and fostering a collaborative, high-performing engineering culture
Excellent communicator, able to articulate complex technical concepts clearly to both technical and non-technical stakeholders
Bachelor's degree in Computer Science, Software Engineering, or a related technical field (preferred)

Job Responsibility

Design and implement AI platforms to enable scalable and secure access to LLMs from multiple model providers for diverse use cases
Design and implement agentic workflows, agentic tool ecosystems, and LLM prompt management solutions
Design, build, and optimize scalable model training, fine tuning, and inference pipelines, ensuring robust integration with production systems
Influence technical strategy and approach to developing embedding stores, vector databases, and other reusable assets
Lead initiatives to streamline ML and AI workflows, improve operational efficiency, and establish standardized procedures to achieve consistent, high-quality results across our AI systems
Design and develop backend services and RESTful APIs using Python and FastAPI, integrating seamlessly with ML pipelines and services
Take operational responsibility for team-owned services, including performance monitoring, optimization, troubleshooting, and participation in an on-call rotation
Collaborate with both technical and non-technical colleagues, including data and applied scientists, software engineers, product managers, and business stakeholders, to deliver reliable and scalable ML-driven products
Coach and mentor fellow ML engineers, promoting a culture of collaboration, continuous improvement, and engineering excellence within the team
Employ a diverse set of tools and platforms including Python, AWS, Databricks, Docker, Kubernetes, FastAPI, Terraform, Snowflake, Coralogix, and GitHub to build, deploy, and maintain scalable, highly available machine learning infrastructure

What we offer

Competitive pay
Comprehensive healthcare benefits
Financial assistance for things like hybrid work, family planning
Generous parental leave
Flexible time-off policies
Mental health and wellness resources
Learning, development, and recognition programs

Fulltime

Senior Machine Learning Engineer

As an ML Engineer at Axon, you will contribute to developing AI solutions transf...

Location

United States , Seattle

Salary:

150750.00 - 221000.00 USD / Year

Axon

Expiration Date

Until further notice

Requirements

Bachelor’s Degree in Computer Science, Engineering, Electronics, Mathematics or an equivalent highly technical field
6+ years of software engineering experience and a proven track record of successfully deploying AI models to the cloud
Experience with Infrastructure-as-code and cloud architecture
Proficiency in Python and C++
familiarity with ML frameworks such as TensorFlow, or PyTorch
Advanced knowledge and hands-on experience with Linux
Excellent problem solving skills and ability to dive deep into system architecture
Excellent software design skills
Comfort communicating and interacting with scientists, engineers and product managers

Job Responsibility

Collaborate with scientists and product managers to build proof-of-concepts (POCs) contributing to shaping the Axon of tomorrow
Architect and develop secure, privacy-preserving, solutions to enable the continuous improvement of existing AI models
Architect platforms that accelerate research and AI product development
Collaborate with scientists in architecting and implementing state-of-the-art training techniques
Set high standards for ethical and responsible AI development

What we offer

Competitive salary and 401k with employer match
Discretionary paid time off
Paid parental leave for all
Medical, Dental, Vision plans
Fitness Programs
Emotional & Mental Wellness support
Learning & Development programs
Snacks in our offices

Fulltime

Senior AI Infrastructure Engineer - Training Platform

Scale

Location:
United States , San Francisco ▼
Seattle
New York

Category:
IT - Software Development

Contract Type:
Employment contract

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:
May 04, 2026

Looking for more opportunities? Search for other job offers that match your skills and interests.

Similar Jobs for Senior AI Infrastructure Engineer - Training Platform