Machine Learning Infrastructure Engineer Job at Suno (Boston, NYC)

Senior Machine Learning Infrastructure Engineer

As a Senior ML Infrastructure Engineer at Plus, you will design scalable archite...

Location

United States, Santa Clara

Salary:

160000.00 - 200000.00 USD / Year

PlusAI

Expiration Date

Until further notice

Requirements

PhD or MS in Computer Science, Electrical Engineering, or related field
Good oral and written communication skills
PhD new grad or Masters with 3+ years of software engineering experience with a focus on ML infrastructure or distributed systems
Proficiency in Python, C++, SQL
Deep understanding of containerization, orchestration technologies, distributed ML workloads, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, and MLflow)
Experience deploying and managing resources across multiple cloud platforms (AWS, GCP) or on-prem environments
Proficiency in at least one deep learning framework, such as PyTorch, and in data pipeline tools (e.g., Apache Airflow, Prefect)
Strong knowledge of distributed systems, databases, and storage solutions
Extensive software design and development skills
Ability to learn and adapt to new technologies and contribute in a productive environment

Job Responsibility

Design and develop scalable, high-performance systems for training, inference, deployment, and monitoring of ML models at scale
Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks
Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability
Implement distributed systems and storage solutions optimized for machine learning workloads
Drive improvements in CI/CD workflows for ML models and infrastructure
Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems
Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform
Mentor junior engineers and contribute to a culture of technical excellence
Ensure that your work is performed in accordance with the company’s Quality Management System (QMS) requirements and contribute to continuous improvement efforts
Ensure team compliance with QMS, monitor quality, and drive process improvements

What we offer

Work, learn and grow in a highly future-oriented, innovative and dynamic field
Wide range of opportunities for personal and professional development
Catered free lunch, unlimited snacks and beverages
Highly competitive salary and benefits package, including 401(k) plan

Fulltime

Senior Machine Learning Engineer (Infrastructure)

We are looking for an experienced MLOps Engineer to join our team as a Senior Ma...

Location

United States, Boston

Salary:

152800.00 - 224100.00 USD / Year

SimpliSafe

Expiration Date

Until further notice

Requirements

5+ years of experience in software engineering, data engineering, or a related field, with at least 3 years focused on MLOps or ML infrastructure
Deep hands-on experience with AWS or similar public clouds, including compute, networking, container orchestration, and observability stacks
Hands-on experience with CI/CD pipelines, Docker, Kubernetes, and infrastructure-as-code tools (e.g., Terraform, CloudFormation)
Proficiency in programming languages like Python, and familiarity with machine learning frameworks (e.g., TensorFlow, PyTorch)
Solid understanding of ML lifecycle management, including experiment tracking, versioning, and monitoring
Experience with LLM application development, including prompt engineering and evaluation
Strong communication skills for partnering with cross-functional technical and non-technical teams

Job Responsibility

Lead the architecture, deployment, and optimization of scalable ML model serving systems for real-time and batch use cases
Collaborate with data scientists, engineers, and stakeholders to operationalize ML models
Develop CI/CD pipelines for ML models enabling rapid, safe, and consistent model releases
Design, implement, and own comprehensive production monitoring for ML models/systems
Manage cloud infrastructure, primarily in AWS or other major public clouds, to support ML workloads
Drive best practices in model versioning, observability, reproducibility, and deployment reliability
Serve in an on-call rotation as a first responder for software owned by your team

What we offer

A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
Free SimpliSafe system and professional monitoring for your home
Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change
Participation in our annual bonus program, equity, and other forms of compensation
A full range of medical, retirement, and lifestyle benefits

Fulltime

Staff Machine Learning Engineer - ML Training Infrastructure

The Role: We are seeking an experienced, technically strong, impact-driven ex...

Location

United States, Austin; Mountain View

Salary:

185000.00 - 335300.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
8+ years of professional software engineering experience
5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
Willingness to travel to Sunnyvale, CA as needed
Comfortable operating in highly ambiguous and dynamic environments

Job Responsibility

Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Senior Machine Learning Engineer - ML Training Infrastructure

We are seeking an experienced, technical oriented, impact delivering-driven expe...

Location

United States, Mountain View

Salary:

170000.00 - 240000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree or higher in Computer Science or equivalent major, or equivalent relevant experience
3+ years professional software engineering experience
2+ years specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Willingness to travel to Sunnyvale, CA as needed
Comfortable working in highly ambiguous and dynamic environments

Job Responsibility

Design and develop a scalable, reliable, high-performance ML framework to support model training at scale
Analyze and optimize model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost
Raise the bar on system observability, debuggability, operational excellence, and user experience
Collaborate with cross-functional teams to integrate new features and technologies into the platform

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Senior Machine Learning Engineer, AI Platform

The AI Platform team is responsible for building the foundational infrastructure...

Location

United States; Canada

Salary:

139000.00 - 218000.00 USD / Year

Mozilla

Expiration Date

Until further notice

Requirements

Bachelor’s degree with 4–6 years of relevant industry experience, or Master’s degree with significant hands-on experience building and operating production ML systems, or equivalent work experience
Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
Hands-on experience working with GPU-based workloads and accelerated computing in production settings
Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
Ability to independently scope and drive technical initiatives while balancing product and operational priorities
Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams

Job Responsibility

Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews

What we offer

Generous performance-based bonus plans
Rich medical, dental, and vision coverage
Generous retirement contributions with 100% immediate vesting
Quarterly all-company wellness days
Country-specific holidays plus a day off for your birthday
One-time home office stipend
Annual professional development budget
Quarterly well-being stipend
Considerable paid parental leave
Employee referral bonus program

Fulltime

Sr. Lead Machine Learning Engineer

As a Capital One Machine Learning Engineer (MLE), you'll be part of an Agile tea...

Location

United States, New York; San Francisco; San Jose; Cambridge; McLean

Salary:

229900.00 - 286200.00 USD / Year

Capital One

Expiration Date

Until further notice

Requirements

Bachelor's Degree
At least 8 years of experience designing and building data-intensive solutions using distributed computing (Internship experience does not apply)
At least 4 years of experience programming with Python, Scala, or Java
At least 3 years of experience building, scaling, and optimizing ML systems
At least 2 years of experience leading teams developing ML solutions

Job Responsibility

Design, build, and/or deliver ML models and components that solve real-world business problems, while working in collaboration with the Product and Data Science teams
Inform your ML infrastructure decisions using your understanding of ML modeling techniques and issues, including choice of model, data, and feature selection, model training, hyperparameter tuning, dimensionality, bias/variance, and validation
Solve complex problems by writing and testing application code, developing and validating ML models, and automating tests and deployment
Collaborate as part of a cross-functional Agile team to create and enhance software that enables state-of-the-art big data and ML applications
Retrain, maintain, and monitor models in production
Leverage or build cloud-based architectures, technologies, and/or platforms to deliver optimized ML models at scale
Construct optimized data pipelines to feed ML models
Leverage continuous integration and continuous deployment best practices, including test automation and monitoring, to ensure successful deployment of ML models and application code
Ensure all code is well-managed to reduce vulnerabilities, models are well-governed from a risk perspective, and the ML follows best practices in Responsible and Explainable AI
Use programming languages like Python, Scala, or Java

What we offer

performance-based incentive compensation
cash bonus(es)
long term incentives (LTI)
health, financial and other benefits

Fulltime

Senior Machine Learning Engineer

IT AND R&D REMOTE - Senior Machine Learning Engineer - RTB House is a global com...

Location

Poland

Salary:

Not provided

RTB House

Expiration Date

Until further notice

Requirements

Expertise in designing and implementing complex IT systems
Ability to develop user-friendly, versatile tools
Proficiency in at least one programming language, such as Python, C++, Java, or Scala, along with expertise in Linux
Strong skills in evaluating and optimizing system performance, from initial design through to production troubleshooting
Deep understanding of algorithms and data structures
Initiative and creativity to improve existing solutions
Ability to work effectively both within and across teams
C1 level in Polish

Job Responsibility

Developing and maintaining the ML training platform and the bidding infrastructure that evaluates ML models in the production environment
Identifying performance bottlenecks and optimizing critical, low-level parts of the system
Ensuring the reliability and scalability of implementations, and creating performance and correctness tests for new system components
Testing and benchmarking open-source Big Data and ML technologies to assess their suitability for the production environment

What we offer

Access to the latest technologies, with the opportunity to apply them in a large-scale and fast-paced project
Opportunity to cooperate with a team of enthusiasts experienced in Machine Learning, Big Data, and distributed systems
Flexible working hours, with the possibility of remote work or work from our office in Warsaw
An opportunity to apply your expertise in optimizing algorithms that support hundreds of millions of internet users and billions of ad views per month within the RTB model
The ability to see the immediate impact of your work on the company's business outcomes
The possibility of publishing your results

Fulltime

Lead Machine Learning Engineer

Lead Machine Learning Engineer At Capital One, we are changing banking for good...

Location

United States, Cambridge, Massachusetts; Richmond, Virginia; McLean, Virginia

Salary:

197300.00 - 225100.00 USD / Year

Capital One

Expiration Date

Until further notice

Requirements

Bachelor’s Degree
At least 6 years of experience designing and building data-intensive solutions using distributed computing (Internship experience does not apply)
At least 4 years of experience programming with Python, Scala, or Java
At least 2 years of experience building, scaling, and optimizing ML systems

Job Responsibility

Partner with a cross-functional team of engineers, data scientists, product managers, and designers to deliver AI-powered products that change how our associates work and provide value to our customers
Design, develop, test, deploy, and support AI software components utilizing machine learning models, including model evaluation and experimentation, large language model inference, similarity search, guardrails, governance, observability and agentic AI
Fine-tune, develop and evaluate machine learning and foundation models
Collaborate as part of a cross-functional Agile team to create and enhance software that utilizes state-of-the-art AI and ML capabilities
Contribute thought leadership and technical vision to the long term roadmap of pioneering AI systems at Capital One
Leverage a broad stack of Open Source and SaaS AI technologies
Inform your ML infrastructure decisions using your understanding of ML modeling techniques and issues
Retrain, maintain, and monitor models in production
Construct optimized data pipelines to feed ML models
Ensure all code is well-managed to reduce vulnerabilities, models are well-governed from a risk perspective, and the ML follows best practices in Responsible and Explainable AI

What we offer

Performance-based incentive compensation
cash bonus(es) and/or long term incentives (LTI)
comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being

Fulltime
