Senior Machine Learning Infrastructure Engineer

PlusAI

Location:
United States, Santa Clara

Contract Type:
Not provided

Salary:

160000.00 - 200000.00 USD / Year

Job Description:

As a Senior ML Infrastructure Engineer at Plus, you will design scalable architectures capable of handling petabytes of data while ensuring optimal performance for both training and inference phases. You will build robust pipelines for managing model versioning systems and experiment tracking frameworks, which are essential for maintaining reproducibility across experiments. Additionally, you will be responsible for managing large-scale GPU clusters. This role offers unparalleled opportunities—both technically and professionally—for individuals passionate about solving challenging problems using modern cloud-native technologies. Ideal candidates thrive in environments that leverage tools such as Docker containers orchestrated via Kubernetes clusters, seamlessly integrated with state-of-the-art deep learning frameworks like PyTorch or TensorFlow. If you are eager to push the boundaries of what's possible in machine learning infrastructure and contribute to cutting-edge solutions, this position is an excellent fit!

Job Responsibility:

  • Design and develop scalable, high-performance systems for training, inference, deployment, and monitoring of ML models
  • Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks
  • Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability
  • Implement distributed systems and storage solutions optimized for machine learning workloads
  • Drive improvements in CI/CD workflows for ML models and infrastructure
  • Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems
  • Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform
  • Mentor junior engineers and contribute to a culture of technical excellence
  • Ensure that your work is performed in accordance with the company’s Quality Management System (QMS) requirements and contribute to continuous improvement efforts
  • Ensure team compliance with QMS, monitor quality, and drive process improvements

Requirements:

  • PhD or MS in Computer Science, Electrical Engineering, or related field
  • Good oral and written communication skills
  • PhD new grad, or Master's with 3+ years of software engineering experience focused on ML infrastructure or distributed systems
  • Proficiency in Python, C++, and SQL
  • Deep understanding of containerization, orchestration technologies, distributed ML workloads, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, and MLflow)
  • Experience deploying and managing resources across multiple cloud platforms (AWS, GCP, or on-prem environments)
  • Proficiency in at least one deep learning framework, such as PyTorch, and in data pipeline tools (e.g., Apache Airflow, Prefect)
  • Strong knowledge of distributed systems, databases, and storage solutions
  • Extensive software design and development skills
  • Ability to learn and adapt to new technologies and contribute in a productive environment

Nice to have:

  • Familiarity with fundamental deep learning architectures, such as Convolutional Neural Networks (CNNs) and Transformer models
  • Experience in building large-scale ML datasets, MLOps pipelines, and distributed computing frameworks like Ray
  • Experience working with autonomous vehicles or robotics

What we offer:

  • Work, learn and grow in a highly future-oriented, innovative and dynamic field
  • Wide range of opportunities for personal and professional development
  • Catered free lunch, unlimited snacks and beverages
  • Highly competitive salary and benefits package, including 401(k) plan

Additional Information:

Job Posted:
December 11, 2025

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Senior Machine Learning Infrastructure Engineer

Senior Machine Learning Engineer

Machine Learning is a cornerstone at Taskrabbit, and we're looking for a seasone...
Location:
United States, New York; San Francisco
Salary:
148000.00 - 200000.00 USD / Year
Taskrabbit
Expiration Date:
Until further notice
Requirements:
  • BS, MS, or PhD in Computer Science, Statistics, Operations Research, or a related quantitative field
  • 3+ years of industry experience building and deploying high-quality, production-grade machine learning models and systems
  • Strong theoretical knowledge and hands-on experience in machine learning, particularly in areas like search, ranking, recommender systems, or NLP
  • Solid software engineering skills with proficiency in one or more programming languages, including Python
  • Experience with popular ML libraries like scikit-learn, LightGBM, XGBoost, TensorFlow, PyTorch, etc.
  • Proficiency in SQL is also required for writing complex queries and transforming data
  • Experience building REST API-based services
  • Experience with modern data and ML technologies, such as Docker, Kubernetes, Kafka, Airflow, data warehouses (e.g., Snowflake, Redshift, or BigQuery), and data lakes
  • Excellent communication skills, with the ability to present complex findings and recommendations clearly to both technical and non-technical audiences
  • A passion for quickly learning new technologies and a drive to solve challenging problems
Job Responsibility:
  • Model Development & Research: Research, design, and implement machine learning models to solve key business problems in areas like search ranking, recommendations, and content discovery
  • End-to-End ML Lifecycle: Own the entire lifecycle of ML models, including feature engineering, training, evaluation, deployment, and monitoring
  • Infrastructure & Scalability: Build scalable and reliable ML infrastructure and data pipelines that support reproducible feature engineering and machine learning model deployment in real-time, near real-time, and batch processes
  • Performance & Quality: Build monitoring services to understand data quality and model performance of complex systems, and collaborate with engineering and science teams to optimize existing algorithms for training and evaluation
  • Software Engineering Excellence: Independently solve complex problems, write clean, efficient, and sustainable code, and actively participate in code reviews, documentation, and the full software engineering lifecycle
What we offer:
  • Taskrabbit is a Hybrid Company
  • The People
  • The Diverse Culture
  • Taskrabbit offers our employees employer-paid health insurance and a 401k match with immediate vesting for our US-based employees
  • We offer all of our global employees generous and flexible time off with 2 company-wide closure weeks, Taskrabbit product stipends, wellness + productivity + education stipends, IKEA discounts, reproductive health support, and more
Employment Type:
Fulltime

Senior Staff Machine Learning Engineer

Join the Affirm team as a Senior Staff Machine Learning Engineer and become a pi...
Location:
United States
Salary:
232000.00 - 310000.00 USD / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 10+ years of experience researching, designing, deploying, and operating large-scale, real-time machine learning systems
  • Experience leading end-to-end ML system design, from data architecture and feature pipelines to model training, evaluation, and production deployment
  • Proficient in Python and ML frameworks, including PyTorch and XGBoost
  • Strong understanding of representation learning and embedding-based modeling
  • Deep expertise in neural network-based sequence modeling, including architectures such as Transformers, recurrent, or attention-based models, and multi-task learning systems
  • Deep hands-on experience with large-scale distributed ML infrastructure, including streaming or batch data ingestion, feature stores, feature engineering, training pipelines, model serving and inference infrastructure, monitoring, and automated retraining
  • Strong technical leadership: defining long-term strategy, guiding research direction, and aligning work across teams
  • Exceptional judgment, collaboration, and communication skills
  • Strong verbal and written communication skills that support effective collaboration across our global engineering organization
  • Equivalent practical experience or a Bachelor’s degree in a related field
Job Responsibility:
  • Define and drive multi-year, multi-team technical strategy for machine learning across Affirm
  • Lead the design, implementation, and scaling of advanced ML systems
  • Partner deeply with ML Platform, product, engineering, and risk leadership to shape long-term modeling capabilities
  • Provide broad technical leadership across the ML organization, mentoring senior engineers
  • Drive clarity and alignment on ambiguous, high-stakes technical decisions
  • Champion operational and system excellence at the area level
What we offer:
  • Equity rewards
  • Monthly stipends for health, wellness and tech spending
  • 100% subsidized medical coverage, dental and vision for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Competitive vacation and holiday schedules
  • Employee stock purchase plan enabling you to buy shares of Affirm at a discount
Employment Type:
Fulltime

Senior Staff Machine Learning Engineer

Join the Affirm team as a Senior Staff Machine Learning Engineer and become a pi...
Location:
Canada
Salary:
206000.00 - 256000.00 CAD / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 10+ years of experience researching, designing, deploying, and operating large-scale, real-time machine learning systems
  • Experience leading end-to-end ML system design, from data architecture and feature pipelines to model training, evaluation, and production deployment
  • Proficiency in Python and ML frameworks, including PyTorch and XGBoost
  • Experience with ML tooling for training orchestration, experimentation, and model monitoring, such as Kubeflow, MLflow, or equivalent
  • Strong understanding of representation learning and embedding-based modeling
  • Deep expertise in neural network-based sequence modeling, including architectures such as Transformers, recurrent, or attention-based models, and multi-task learning systems
  • Deep hands-on experience with large-scale distributed ML infrastructure, including streaming or batch data ingestion, feature stores, feature engineering, training pipelines, model serving and inference infrastructure, monitoring, and automated retraining
  • Strong technical leadership: defining long-term strategy, guiding research direction, and aligning work across teams
  • Exceptional judgment, collaboration, and communication skills
  • Strong verbal and written communication skills
Job Responsibility:
  • Define and drive multi-year, multi-team technical strategy for machine learning across Affirm
  • Lead the design, implementation, and scaling of advanced ML systems
  • Partner deeply with ML Platform, product, engineering, and risk leadership to shape long-term modeling capabilities
  • Provide broad technical leadership across the ML organization
  • Drive clarity and alignment on ambiguous, high-stakes technical decisions
  • Champion operational and system excellence at the area level
What we offer:
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
Employment Type:
Fulltime

Senior Machine Learning Engineer

As a Senior Machine Learning Engineer at Aignostics, you work hand in hand with ...
Location:
Germany, Berlin
Salary:
Not provided
Aignostics
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Mathematics, or a related field. PhD is a plus
  • 4+ years of work experience in software development, machine learning or a related field
  • Advanced programming skills in Python, with experience with other languages (e.g. C/C++, CUDA, Java, Rust) being a plus
  • Good understanding of distributed systems and frameworks, parallel computing and scalability
  • Experience with cloud platforms (GCP, AWS or Azure), familiarity with MLOps / DevOps best practices (incl. CI/CD, Docker, Kubernetes and observability)
  • Dedicated to high coding standards and knowledgeable about best practices in development workflow
  • Experience with Linux, version control and container technologies
  • Data engineering skills, experience with working with large datasets
  • Excellent problem-solving skills and the ability to work independently and as part of a team
  • Strong communication skills, with the ability to convey complex technical concepts to non-technical stakeholders
Job Responsibility:
  • Design, develop, deploy and maintain robust ML pipelines to make them usable, efficient and scalable
  • Optimize and fine-tune data pipelines for production
  • Engage in code reviews, upholding high standards for clean, reliable code
  • Collaborate with cross-functional teams to understand business requirements and translate them into ML solutions
  • Embrace learning new technologies, fostering innovation, and tackling diverse challenges. Contribute to the development of our ML infrastructure, pipelines, services, monitoring systems and codebase in general
  • Work in an agile development environment and clearly communicate your results to the team
  • Mentor and guide junior engineers, providing technical leadership and insights
What we offer:
  • Join a purpose-driven startup: We are working collectively to fight cancer and improve patient outcomes
  • Cutting-edge AI research and development, with involvement of Charité, TU Berlin and our other partners
  • Work with a welcoming, diverse and highly international team of colleagues
  • Opportunity to take responsibility and grow your role within the startup
  • Expand your skills by benefitting from our Learning & Development yearly budget of 1,000€ (plus 2 L&D days), language classes and internal development programs
  • Mentoring program, you’ll learn from great experts
  • Flexible working hours and teleworking policy
  • Enjoy your well-deserved time off with our 30 paid vacation days per year
  • We are family & pet friendly and support flexible parental leave options
  • Pick a subsidized membership of your choice among public transport, sports, and well-being
Employment Type:
Fulltime

LLM - Senior Staff Engineer - Python + Machine Learning

AquSag is seeking a hands-on Machine Learning Senior Staff Engineer to lead cros...
Location:
Salary:
40.00 - 60.00 USD / Hour
AquSag Technologies
Expiration Date:
Until further notice
Requirements:
  • 9+ years of experience with a strong background in Machine Learning, NLP, and modern deep learning architectures (Transformers, LLMs)
  • Hands-on experience with frameworks such as PyTorch, TensorFlow, Hugging Face, or DeepSpeed
  • Hands-on experience in Docker for Production deployment
  • Proven experience managing teams delivering ML/LLM models in production environments
  • Knowledge of distributed training, GPU/TPU optimization, and cloud platforms (AWS, GCP, Azure)
  • Familiarity with MLOps tools like MLflow, Kubeflow, or Vertex AI for scalable ML pipelines
  • Excellent leadership, communication, and cross-functional collaboration skills
  • Bachelor’s or Master’s in Computer Science, Engineering, or related field (PhD preferred)
  • Overlap of 6 hours with PST time zone is mandatory
  • Commitments Required: 8 hours per day with overlap of 6 hours with PST
Job Responsibility:
  • Lead and mentor a cross-functional team of ML engineers, data scientists, and MLOps professionals
  • Oversee the full lifecycle of LLM and ML projects — from data collection to training, evaluation, and deployment
  • Collaborate with Research, Product, and Infrastructure teams to define goals, milestones, and success metrics
  • Provide technical direction on large-scale model training, fine-tuning, and distributed systems design
  • Implement best practices in MLOps, model governance, experiment tracking, and CI/CD for ML
  • Manage compute resources, budgets, and ensure compliance with data security and responsible AI standards
  • Communicate progress, risks, and results to stakeholders and executives effectively
Employment Type:
Fulltime

Senior Machine Learning Engineer

As an ML Engineer at Axon, you will contribute to developing AI solutions transf...
Location:
United States, Seattle
Salary:
150750.00 - 221000.00 USD / Year
Axon
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s Degree in Computer Science, Engineering, Electronics, Mathematics or an equivalent highly technical field
  • 6+ years of software engineering experience and a proven track record of successfully deploying AI models to the cloud
  • Experience with Infrastructure-as-code and cloud architecture
  • Proficiency in Python and C++
  • Familiarity with ML frameworks such as TensorFlow or PyTorch
  • Advanced knowledge and hands-on experience with Linux
  • Excellent problem solving skills and ability to dive deep into system architecture
  • Excellent software design skills
  • Comfort communicating and interacting with scientists, engineers and product managers
Job Responsibility:
  • Collaborate with scientists and product managers to build proof-of-concepts (POCs) contributing to shaping the Axon of tomorrow
  • Architect and develop secure, privacy-preserving solutions to enable the continuous improvement of existing AI models
  • Architect platforms that accelerate research and AI product development
  • Collaborate with scientists in architecting and implementing state-of-the-art training techniques
  • Set high standards for ethical and responsible AI development
What we offer:
  • Competitive salary and 401k with employer match
  • Discretionary paid time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Mental Wellness support
  • Learning & Development programs
  • Snacks in our offices
Employment Type:
Fulltime

Senior Machine Learning Engineer (Infrastructure)

We are looking for an experienced MLOps Engineer to join our team as a Senior Ma...
Location:
United States, Boston
Salary:
152800.00 - 224100.00 USD / Year
SimpliSafe
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in software engineering, data engineering, or a related field, with at least 3 years focused on MLOps or ML infrastructure
  • Deep hands-on experience with AWS or similar public clouds, including compute, networking, container orchestration, and observability stacks
  • Hands-on experience with CI/CD pipelines, Docker, Kubernetes, and infrastructure-as-code tools (e.g., Terraform, CloudFormation)
  • Proficiency in programming languages like Python, and familiarity with machine learning frameworks (e.g., TensorFlow, PyTorch)
  • Solid understanding of ML lifecycle management, including experiment tracking, versioning, and monitoring
  • LLM application development, including prompt engineering and evaluation
  • Strong communication skills for partnering with cross-functional technical and non-technical teams
Job Responsibility:
  • Lead the architecture, deployment, and optimization of scalable ML model serving systems for real-time and batch use cases
  • Collaborate with data scientists, engineers, and stakeholders to operationalize ML models
  • Develop CI/CD pipelines for ML models enabling rapid, safe, and consistent model releases
  • Design, implement, and own comprehensive production monitoring for ML models/systems
  • Manage cloud infrastructure, primarily in AWS or other major public clouds, to support ML workloads
  • Drive best practices in model versioning, observability, reproducibility, and deployment reliability
  • Serve in an on-call rotation as a first responder for software owned by your team
What we offer:
  • A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
  • A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
  • Free SimpliSafe system and professional monitoring for your home
  • Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change
  • Participation in our annual bonus program, equity, and other forms of compensation
  • A full range of medical, retirement, and lifestyle benefits
Employment Type:
Fulltime

Senior Machine Learning Engineer

Join Axon and be a Force for Good. At Axon, we’re on a mission to Protect Life. ...
Location:
United States, Seattle
Salary:
150750.00 - 221000.00 USD / Year
Axon
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s Degree in Computer Science, Engineering, Electronics, Mathematics or an equivalent highly technical field
  • 6+ years of software engineering experience and a proven track record of successfully deploying AI models to the cloud
  • Experience with Infrastructure-as-code and cloud architecture
  • Proficiency in Python and C++
  • Familiarity with ML frameworks such as TensorFlow or PyTorch
  • Advanced knowledge and hands-on experience with Linux
  • Excellent problem solving skills and ability to dive deep into system architecture
  • Excellent software design skills
  • Comfort communicating and interacting with scientists, engineers and product managers
Job Responsibility:
  • Collaborate with scientists and product managers to build proof-of-concepts (POCs) contributing to shaping the Axon of tomorrow
  • Architect and develop secure, privacy-preserving solutions to enable the continuous improvement of existing AI models
  • Architect platforms that accelerate research and AI product development
  • Collaborate with scientists in architecting and implementing state-of-the-art training techniques
  • Set high standards for ethical and responsible AI development
What we offer:
  • Competitive salary and 401k with employer match
  • Discretionary paid time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Mental Wellness support
  • Learning & Development programs
  • Snacks in our offices
Employment Type:
Fulltime