Senior Machine Learning Infrastructure Engineer

United States, Santa Clara 160000.00 - 200000.00 USD / Year · Job Posted December 11, 2025

Job Description

As a Senior ML Infrastructure Engineer at Plus, you will design scalable architectures capable of handling petabytes of data while ensuring optimal performance for both training and inference phases. You will build robust pipelines for managing model versioning systems and experiment tracking frameworks, which are essential for maintaining reproducibility across experiments. Additionally, you will be responsible for managing large-scale GPU clusters. This role offers unparalleled opportunities—both technically and professionally—for individuals passionate about solving challenging problems using modern cloud-native technologies. Ideal candidates thrive in environments that leverage tools such as Docker containers orchestrated via Kubernetes clusters, seamlessly integrated with state-of-the-art deep learning frameworks like PyTorch or TensorFlow. If you are eager to push the boundaries of what's possible in machine learning infrastructure and contribute to cutting-edge solutions, this position is an excellent fit!

Job Responsibility

  • Design and develop scalable, high-performance systems for training, serving, deploying, and monitoring ML models at scale
  • Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks
  • Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability
  • Implement distributed systems and storage solutions optimized for machine learning workloads
  • Drive improvements in CI/CD workflows for ML models and infrastructure
  • Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems
  • Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform
  • Mentor junior engineers and contribute to a culture of technical excellence
  • Ensure that your work is performed in accordance with the company’s Quality Management System (QMS) requirements and contribute to continuous improvement efforts
  • Ensure team compliance with QMS, monitor quality, and drive process improvements

Requirements

  • PhD or MS in Computer Science, Electrical Engineering, or a related field
  • Good oral and written communication skills
  • New PhD graduate, or Master's degree with 3+ years of software engineering experience focused on ML infrastructure or distributed systems
  • Proficiency in Python, C++, and SQL
  • Deep understanding of containerization, orchestration technologies, distributed ML workloads, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, and MLflow)
  • Experience deploying and managing resources across multiple cloud platforms (AWS, GCP, or on-prem environments)
  • Proficiency in at least one deep learning framework, such as PyTorch, and in data pipeline tools (e.g., Apache Airflow, Prefect)
  • Strong knowledge of distributed systems, databases, and storage solutions
  • Extensive software design and development skills
  • Ability to learn and adapt to new technologies and contribute in a productive environment

Nice to have

  • Familiarity with fundamental deep learning architectures, such as Convolutional Neural Networks (CNNs) and Transformer models
  • Experience in building large-scale ML datasets, MLOps pipelines, and distributed computing frameworks like Ray
  • Experience working with autonomous vehicles or robotics

What we offer

  • Work, learn and grow in a highly future-oriented, innovative and dynamic field
  • Wide range of opportunities for personal and professional development
  • Catered free lunch, unlimited snacks and beverages
  • Highly competitive salary and benefits package, including 401(k) plan

Similar Jobs for Senior Machine Learning Infrastructure Engineer

8 matching positions

Senior Machine Learning Engineer (Infrastructure)

We are looking for an experienced MLOps Engineer to join our team as a Senior Ma...
Location: United States, Boston
Salary: 152800.00 - 224100.00 USD / Year
SimpliSafe
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in software engineering, data engineering, or a related field, with at least 3 years focused on MLOps or ML infrastructure
  • Deep hands-on experience with AWS or similar public clouds, including compute, networking, container orchestration, and observability stacks
  • Hands-on experience with CI/CD pipelines, Docker, Kubernetes, and infrastructure-as-code tools (e.g., Terraform, CloudFormation)
  • Proficiency in programming languages like Python, and familiarity with machine learning frameworks (e.g., TensorFlow, PyTorch)
  • Solid understanding of ML lifecycle management, including experiment tracking, versioning, and monitoring
  • LLM application development, including prompt engineering and evaluation
  • Strong communication skills for partnering with cross-functional technical and non-technical teams
Job Responsibility
  • Lead the architecture, deployment, and optimization of scalable ML model serving systems for real-time and batch use cases
  • Collaborate with data scientists, engineers, and stakeholders to operationalize ML models
  • Develop CI/CD pipelines for ML models enabling rapid, safe, and consistent model releases
  • Design, implement, and own comprehensive production monitoring for ML models/systems
  • Manage cloud infrastructure, primarily in AWS or other major public clouds, to support ML workloads
  • Drive best practices in model versioning, observability, reproducibility, and deployment reliability
  • Serve in an on-call rotation as a first responder for software owned by your team
What we offer
  • A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
  • A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
  • Free SimpliSafe system and professional monitoring for your home
  • Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change
  • Participation in our annual bonus program, equity, and other forms of compensation
  • A full range of medical, retirement, and lifestyle benefits
  • Fulltime

Senior Machine Learning Engineer - ML Training Infrastructure

We are seeking an experienced, technical oriented, impact delivering-driven expe...
Location: United States, Mountain View
Salary: 170000.00 - 240000.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • Bachelor's degree or higher in Computer Science or an equivalent major, OR equivalent relevant experience
  • 3+ years professional software engineering experience
  • 2+ years specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
  • Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
  • Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
  • Willingness to travel to Sunnyvale, CA as needed
  • Comfortable working in highly ambiguous and dynamic environments
Job Responsibility
  • Design and develop a scalable, reliable, high-performance ML framework to support model training at scale
  • Analyze and optimize model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost
  • Raise the bar on system observability, debuggability, operational excellence, and user experience
  • Collaborate with cross-functional teams to integrate new features and technologies into the platform
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • Fulltime

Senior Machine Learning Engineer, AI Platform

The AI Platform team is responsible for building the foundational infrastructure...
Location: United States; Canada
Salary: 139000.00 - 218000.00 USD / Year
Mozilla
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree with 4–6 years of relevant industry experience, or Master’s degree with significant hands-on experience building and operating production ML systems, or equivalent work experience
  • Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
  • Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
  • Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
  • Hands-on experience working with GPU-based workloads and accelerated computing in production settings
  • Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
  • Ability to independently scope and drive technical initiatives while balancing product and operational priorities
  • Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
  • Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams
Job Responsibility
  • Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
  • Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
  • Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
  • Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
  • Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
  • Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
  • Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
  • Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
  • Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews
What we offer
  • Generous performance-based bonus plans
  • Rich medical, dental, and vision coverage
  • Generous retirement contributions with 100% immediate vesting
  • Quarterly all-company wellness days
  • Country specific holidays plus a day off for your birthday
  • One-time home office stipend
  • Annual professional development budget
  • Quarterly well-being stipend
  • Considerable paid parental leave
  • Employee referral bonus program
  • Fulltime

Senior Machine Learning Engineer

IT AND R&D REMOTE - Senior Machine Learning Engineer - RTB House is a global com...
Location: Poland
Salary: Not provided
RTB House
Expiration Date: Until further notice
Requirements
  • Expertise in designing and implementing complex IT systems
  • Ability to develop user-friendly, versatile tools
  • Proficiency in at least one programming language, such as Python, C++, Java, or Scala, along with expertise in Linux
  • Strong skills in evaluating and optimizing system performance, from initial design through to production troubleshooting
  • Deep understanding of algorithms and data structures
  • Initiative and creativity to improve existing solutions
  • Ability to work effectively both within and across teams
  • C1 level in Polish
Job Responsibility
  • Developing and maintaining the ML training platform and the bidding infrastructure that evaluates ML models in the production environment
  • Identifying performance bottlenecks and optimizing critical, low-level parts of the system
  • Ensuring the reliability and scalability of implementations, and creating performance and correctness tests for new system components
  • Testing and benchmarking open-source Big Data and ML technologies to assess their suitability for the production environment
What we offer
  • Access to the latest technologies, with the opportunity to apply them in a large-scale and fast-paced project
  • Opportunity to cooperate with a team of enthusiasts experienced in Machine Learning, Big Data, and distributed systems
  • Flexible working hours, with the possibility of remote work or work from our office in Warsaw
  • An opportunity to apply your expertise in optimizing algorithms that support hundreds of millions of internet users and billions of ad views per month within the RTB model
  • The ability to see the immediate impact of your work on the company's business outcomes
  • The possibility of publishing your results
  • Fulltime

Senior Machine Learning Engineer, Search Assistant

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform ...
Location: United States, San Jose
Salary: 361300.00 - 510000.00 USD / Year
Roku
Expiration Date: Until further notice
Requirements
  • 8+ years of industry experience (or PhD with 5+ years) applying ML at scale in search, recommendation, ads, personalization, or related domains
  • Strong expertise in ranking systems, recommendation systems, retrieval, personalization, and multi-objective optimization
  • Experience building large-scale ML systems leveraging deep learning, sequence models, LLMs, reinforcement learning, or bandit frameworks
  • Strong product intuition and experience optimizing user engagement, retention, and monetization simultaneously
  • Proficiency in Python, Java, or Scala
  • Experience with distributed systems and ML infrastructure such as Spark, Airflow, streaming systems, feature stores, and cloud platforms
  • Strong technical leadership, system design, communication, and problem-solving skills
  • MS or PhD in Computer Science, Statistics, or a related field
Job Responsibility
  • Lead the technical vision and roadmap for ranking, personalization, and recommendation systems powering Roku’s entertainment assistant
  • Develop and deploy state-of-the-art ML models using deep learning, transformers, LLMs, bandits, reinforcement learning, and causal inference techniques
  • Build multi-objective optimization systems balancing engagement, retention, relevance, and monetization goals
  • Drive innovation in conversational discovery, contextual recommendations, and personalized content experiences across the platform
  • Design, run, and analyze online A/B experiments tied to key product and business KPIs
  • Architect scalable ML systems, feature platforms, and data pipelines supporting rapid experimentation and long-term growth
  • Mentor engineers and provide technical leadership across cross-functional initiatives involving engineering, product, UX, and analytics teams
What we offer
  • Health insurance
  • Equity awards
  • Life insurance
  • Disability benefits
  • Parental leave
  • Wellness benefits
  • Paid time off
  • Global access to mental health and financial wellness support and resources
  • Healthcare (medical, dental, and vision)
  • Life, accident, disability, commuter, and retirement options (401(k)/pension)
  • Fulltime

Senior Machine Learning Engineer (LLMs, MLOps, Computer Vision & Cloud AI)

We are seeking a highly skilled Senior Machine Learning Engineer to design, deve...
Location: United States, Austin
Salary: Not provided
Dutech Systems
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in cloud platforms including AWS, Azure, GCP, or OCI
  • 8+ years of experience with DevOps technologies including Docker, Kubernetes, Ansible, and CI/CD automation
  • Strong experience with SQL databases (PostgreSQL, MySQL) and NoSQL/vector databases
  • Proficiency in Bash and PowerShell scripting for automation and infrastructure management
  • Experience with Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD platforms
  • 3+ years of hands-on Python development experience in production environments
  • 3+ years of experience with NLP, LLMs, transformers, prompt engineering, RAG, and AI application development
  • Experience building and deploying machine learning models serving real-world users
  • Experience with time-series forecasting, anomaly detection, and predictive analytics
  • Experience developing recommendation systems and personalization engines
Job Responsibility
  • Design, develop, deploy, and maintain production-grade machine learning and AI solutions
  • Build and optimize Large Language Model (LLM) applications using GPT, BERT, T5, Hugging Face, Ollama, and similar technologies
  • Develop Retrieval-Augmented Generation (RAG) systems, prompt engineering strategies, and fine-tuning workflows
  • Implement and maintain MLOps pipelines using MLflow, Kubeflow, Airflow, Weights & Biases, or similar tools
  • Deploy and manage AI workloads across AWS, Azure, GCP, and OCI cloud environments
  • Design and support scalable infrastructure using Docker, Kubernetes, Ansible, and CI/CD pipelines
  • Develop machine learning models for forecasting, anomaly detection, predictive analytics, and real-time monitoring
  • Build recommendation engines, personalization platforms, ranking systems, and collaborative filtering solutions
  • Develop and deploy computer vision solutions using PyTorch, TensorFlow, OpenCV, YOLO, object detection, and image segmentation techniques
  • Implement feature engineering strategies and feature stores such as Feast or Tecton
  • Fulltime

Senior Machine Learning Engineer, Shopping AI

As the engine behind Zillow Group's mission to build a seamless digital real est...
Location: United States
Salary: 163200.00 - 274300.00 USD / Year
Zillow
Expiration Date: Until further notice
Requirements
  • 3-5 years of experience in developing applications in search, personalized ranking, or recommender systems
  • Experience developing and deploying ML models that scale to high-traffic, latency-sensitive, customer-facing services (hundreds of millions of requests per day)
  • Strong programming skills in a high-level language such as Python or Java
  • Familiarity with common machine learning libraries such as PyTorch, TensorFlow, CatBoost, scikit-learn, and Hugging Face
  • Expertise with large scale distributed data processing systems such as Hive, Spark, Airflow, or Databricks
  • Experience owning the full lifecycle of customer-facing machine learning models, from offline experimentation and prototyping to online deployment, A/B testing, and performance monitoring
Job Responsibility
  • Design, build, and ship new production machine learning models that power core product features on the Zillow app, website, and email/push notifications
  • Re-architect our core home ranking and recommendation systems to support advanced neural networks and dramatically accelerate the pace of experimentation across surfaces
  • Own the full lifecycle of your models, from offline experimentation and prototyping with massive datasets to online deployment, A/B testing, and performance monitoring
  • Pioneer the application of cutting-edge deep learning and large language models (LLMs) to improve our home shopping experience
  • Develop new AI components that optimize how we display and when we recommend homes, ensuring we connect shoppers with the right content on the right properties at the right time
  • Collaborate in a cross-functional group of engineers, applied scientists, product managers, and designers to define, execute, and iterate on the team's strategic roadmap
  • Contribute to the team's engineering excellence by improving our machine learning infrastructure, development standards, and shared tooling
  • Act as a key technical voice, mentoring other engineers and helping to shape the long-term vision for artificial intelligence in the home shopping experience
What we offer
  • Equity awards
  • Fulltime

Senior Machine Learning Engineer

Location: United States, Raceland
Salary: Not provided
BOLLINGER MISSISSIPPI SHIPBUILDING LLC
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Information Systems, Engineering, Data Management, or related field
  • Minimum of 6–10 years of experience in ML or software engineering
  • Strong Python and ML deployment experience
  • Experience with cloud ML systems
Job Responsibility
  • Deploy, integrate, and maintain machine learning and AI solutions within enterprise workflows and operational systems
  • Design and develop scalable ML pipelines, feature stores, APIs, and model-serving infrastructure
  • Collaborate with Data Scientists to productionize models and improve deployment readiness
  • Monitor model performance, drift, availability, and reliability across production environments
  • Implement processes for model retraining, versioning, governance, and lifecycle management
  • Partner with Data Engineering teams to support feature engineering and data pipeline integration
  • Ensure ML solutions are secure, scalable, maintainable, and aligned with enterprise architecture standards
  • Support AI applications across forecasting, operational optimization, bidding, scheduling, maintenance, and automation use cases
  • Troubleshoot and resolve issues related to model deployment and operational performance
  • Contribute to ML engineering standards, best practices, and platform improvements
What we offer
  • Competitive Pay
  • Comprehensive Benefits Package
  • Hybrid Schedule Available
  • Career Development
  • Cutting-Edge Projects
  • Positive Work Environment & Company Values
  • Fulltime