Machine Learning Infrastructure Engineer

United States, Boston, NYC · 170000.00 - 240000.00 USD / Year · Job Posted January 13, 2026

Job Description

We’re looking for early members of our machine learning team. You’ll work closely with the founding team and own a wide variety of technical decisions about how we build and deploy our state-of-the-art ML models.

Job Responsibility

  • Design and build Suno’s machine learning models and infrastructure
  • Build and deploy systems comprising multiple low-latency machine learning models
  • Build and optimize distributed training systems
  • Optimize the performance, joy, beauty, and feel of our products

Requirements

  • 5+ years experience building production ML systems
  • Python, PyTorch, distributed systems
  • Experience building and optimizing latency and throughput of machine learning systems and GPU workloads
  • An obsession with great user experiences, getting the details right, iterating & learning rapidly, and working hard
  • Applicants must be eligible to work in the US

Nice to have

  • A love of music (listening, exploring, making) is a huge plus

What we offer

  • Company Equity Package
  • 401(k) with 3% Employer Match & Roth 401(k)
  • Medical, Dental, & Vision Insurance (PPO w/ HSA & FSA options)
  • 11 Paid Holidays + Unlimited PTO & Sick Time
  • 16 Weeks of Paid Parental Leave
  • Creative Education Stipend
  • Generous Commuter Allowance
  • In-Office Lunch (5 days per week)

Similar Jobs for

Machine Learning Infrastructure Engineer

8 matching positions

Senior Machine Learning Infrastructure Engineer

As a Senior ML Infrastructure Engineer at Plus, you will design scalable archite...
Location
United States, Santa Clara
Salary
160000.00 - 200000.00 USD / Year
PlusAI
Expiration Date
Until further notice
Requirements
  • PhD or MS in Computer Science, Electrical Engineering, or related field
  • Good oral and written communication skills
  • PhD new grad or Masters with 3+ years of software engineering experience with a focus on ML infrastructure or distributed systems
  • Proficiency in Python, C++, SQL
  • Deep understanding of containerization, orchestration technologies, distributed ML workloads, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, and MLflow)
  • Experience deploying and managing resources across multiple cloud platforms (AWS, GCP, or on-prem environments)
  • Proficiency in at least one deep learning framework, such as PyTorch, and in data pipeline tools (e.g., Apache Airflow, Prefect)
  • Strong knowledge of distributed systems, databases, and storage solutions
  • Extensive software design and development skills
  • Ability to learn and adapt to new technologies and contribute in a productive environment
Job Responsibility
  • Design and develop scalable, high-performance systems for training, inference, deploying, and monitoring ML models at scale
  • Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks
  • Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability
  • Implement distributed systems and storage solutions optimized for machine learning workloads
  • Drive improvements in CI/CD workflows for ML models and infrastructure
  • Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems
  • Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform
  • Mentor junior engineers and contribute to a culture of technical excellence
  • Ensure that your work is performed in accordance with the company’s Quality Management System (QMS) requirements and contribute to continuous improvement efforts
  • Ensure team compliance with QMS, monitor quality, and drive process improvements
What we offer
  • Work, learn and grow in a highly future-oriented, innovative and dynamic field
  • Wide range of opportunities for personal and professional development
  • Catered free lunch, unlimited snacks and beverages
  • Highly competitive salary and benefits package, including 401(k) plan
  • Fulltime

Senior Machine Learning Engineer (Infrastructure)

We are looking for an experienced MLOps Engineer to join our team as a Senior Ma...
Location
United States, Boston
Salary
152800.00 - 224100.00 USD / Year
SimpliSafe
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in software engineering, data engineering, or a related field, with at least 3 years focused on MLOps or ML infrastructure
  • Deep hands-on experience with AWS or similar public clouds, including compute, networking, container orchestration, and observability stacks
  • Hands-on experience with CI/CD pipelines, Docker, Kubernetes, and infrastructure-as-code tools (e.g., Terraform, CloudFormation)
  • Proficiency in programming languages like Python, and familiarity with machine learning frameworks (e.g., TensorFlow, PyTorch)
  • Solid understanding of ML lifecycle management, including experiment tracking, versioning, and monitoring
  • LLM application development, including prompt engineering and evaluation
  • Strong communication skills for partnering with cross-functional technical and non-technical teams
Job Responsibility
  • Lead the architecture, deployment, and optimization of scalable ML model serving systems for real-time and batch use cases
  • Collaborate with data scientists, engineers, and stakeholders to operationalize ML models
  • Develop CI/CD pipelines for ML models enabling rapid, safe, and consistent model releases
  • Design, implement, and own comprehensive production monitoring for ML models/systems
  • Manage cloud infrastructure, primarily in AWS or other major public clouds, to support ML workloads
  • Drive best practices in model versioning, observability, reproducibility, and deployment reliability
  • Serve in an on-call rotation as a first responder for software owned by your team
What we offer
  • A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
  • A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
  • Free SimpliSafe system and professional monitoring for your home
  • Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change
  • Participation in our annual bonus program, equity, and other forms of compensation
  • A full range of medical, retirement, and lifestyle benefits
  • Fulltime

Staff Machine Learning Engineer - ML Training Infrastructure

The Role: We are seeking an experienced, technically strong, impact-driven ex...
Location
United States, Austin; Mountain View
Salary
185000.00 - 335300.00 USD / Year
General Motors
Expiration Date
Until further notice
Requirements
  • Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
  • 8+ years of professional software engineering experience
  • 5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
  • Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
  • Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
  • Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
  • Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
  • Willingness to travel to Sunnyvale, CA as needed
  • Comfortable operating in highly ambiguous and dynamic environments
Job Responsibility
  • Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
  • Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
  • Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
  • Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
  • Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
  • Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
  • Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • Fulltime

Senior Machine Learning Engineer - ML Training Infrastructure

We are seeking an experienced, technical oriented, impact delivering-driven expe...
Location
United States, Mountain View
Salary
170000.00 - 240000.00 USD / Year
General Motors
Expiration Date
Until further notice
Requirements
  • Bachelor's degree or higher in Computer Science or an equivalent major, or equivalent relevant experience
  • 3+ years of professional software engineering experience
  • 2+ years of specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
  • Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
  • Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
  • Willingness to travel to Sunnyvale, CA as needed
  • Comfortable working in highly ambiguous and dynamic environments
Job Responsibility
  • Design and develop a scalable, reliable, high-performance ML framework to support model training at scale
  • Analyze and optimize model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost
  • Raise the bar on system observability, debuggability, operational excellence, and user experience
  • Collaborate with cross-functional teams to integrate new features and technologies into the platform
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • Fulltime

Senior Machine Learning Engineer, AI Platform

The AI Platform team is responsible for building the foundational infrastructure...
Location
United States; Canada
Salary
139000.00 - 218000.00 USD / Year
Mozilla
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree with 4–6 years of relevant industry experience, or Master’s degree with significant hands-on experience building and operating production ML systems, or equivalent work experience
  • Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
  • Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
  • Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
  • Hands-on experience working with GPU-based workloads and accelerated computing in production settings
  • Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
  • Ability to independently scope and drive technical initiatives while balancing product and operational priorities
  • Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
  • Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams
Job Responsibility
  • Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
  • Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
  • Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
  • Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
  • Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
  • Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
  • Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
  • Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
  • Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews
What we offer
  • Generous performance-based bonus plans
  • Rich medical, dental, and vision coverage
  • Generous retirement contributions with 100% immediate vesting
  • Quarterly all-company wellness days
  • Country-specific holidays plus a day off for your birthday
  • One-time home office stipend
  • Annual professional development budget
  • Quarterly well-being stipend
  • Considerable paid parental leave
  • Employee referral bonus program
  • Fulltime

Sr. Lead Machine Learning Engineer

As a Capital One Machine Learning Engineer (MLE), you'll be part of an Agile tea...
Location
United States, New York; San Francisco; San Jose; Cambridge; McLean
Salary
229900.00 - 286200.00 USD / Year
Capital One
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree
  • At least 8 years of experience designing and building data-intensive solutions using distributed computing (Internship experience does not apply)
  • At least 4 years of experience programming with Python, Scala, or Java
  • At least 3 years of experience building, scaling, and optimizing ML systems
  • At least 2 years of experience leading teams developing ML solutions
Job Responsibility
  • Design, build, and/or deliver ML models and components that solve real-world business problems, while working in collaboration with the Product and Data Science teams
  • Inform your ML infrastructure decisions using your understanding of ML modeling techniques and issues, including choice of model, data, and feature selection, model training, hyperparameter tuning, dimensionality, bias/variance, and validation
  • Solve complex problems by writing and testing application code, developing and validating ML models, and automating tests and deployment
  • Collaborate as part of a cross-functional Agile team to create and enhance software that enables state-of-the-art big data and ML applications
  • Retrain, maintain, and monitor models in production
  • Leverage or build cloud-based architectures, technologies, and/or platforms to deliver optimized ML models at scale
  • Construct optimized data pipelines to feed ML models
  • Leverage continuous integration and continuous deployment best practices, including test automation and monitoring, to ensure successful deployment of ML models and application code
  • Ensure all code is well-managed to reduce vulnerabilities, models are well-governed from a risk perspective, and the ML follows best practices in Responsible and Explainable AI
  • Use programming languages like Python, Scala, or Java
What we offer
  • performance-based incentive compensation
  • cash bonus(es)
  • long-term incentives (LTI)
  • health, financial and other benefits
  • Fulltime

Senior Machine Learning Engineer

IT AND R&D REMOTE - Senior Machine Learning Engineer - RTB House is a global com...
Location
Poland
Salary
Not provided
RTB House
Expiration Date
Until further notice
Requirements
  • Expertise in designing and implementing complex IT systems
  • Ability to develop user-friendly, versatile tools
  • Proficiency in at least one programming language, such as Python, C++, Java, or Scala, along with expertise in Linux
  • Strong skills in evaluating and optimizing system performance, from initial design through to production troubleshooting
  • Deep understanding of algorithms and data structures
  • Initiative and creativity to improve existing solutions
  • Ability to work effectively both within and across teams
  • C1 level in Polish
Job Responsibility
  • Developing and maintaining the ML training platform and the bidding infrastructure that evaluates ML models in the production environment
  • Identifying performance bottlenecks and optimizing critical, low-level parts of the system
  • Ensuring the reliability and scalability of implementations, and creating performance and correctness tests for new system components
  • Testing and benchmarking open-source Big Data and ML technologies to assess their suitability for the production environment
What we offer
  • Access to the latest technologies, with the opportunity to apply them in a large-scale and fast-paced project
  • Opportunity to cooperate with a team of enthusiasts experienced in Machine Learning, Big Data, and distributed systems
  • Flexible working hours, with the possibility of remote work or work from our office in Warsaw
  • An opportunity to apply your expertise in optimizing algorithms that support hundreds of millions of internet users and billions of ad views per month within the RTB model
  • The ability to see the immediate impact of your work on the company's business outcomes
  • The possibility of publishing your results
  • Fulltime

Lead Machine Learning Engineer

Lead Machine Learning Engineer At Capital One, we are changing banking for good...
Location
United States, Cambridge, Massachusetts; Richmond, Virginia; McLean, Virginia
Salary
197300.00 - 225100.00 USD / Year
Capital One
Expiration Date
Until further notice
Requirements
  • Bachelor’s Degree
  • At least 6 years of experience designing and building data-intensive solutions using distributed computing (Internship experience does not apply)
  • At least 4 years of experience programming with Python, Scala, or Java
  • At least 2 years of experience building, scaling, and optimizing ML systems
Job Responsibility
  • Partner with a cross-functional team of engineers, data scientists, product managers, and designers to deliver AI-powered products that change how our associates work and provide value to our customers
  • Design, develop, test, deploy, and support AI software components utilizing machine learning models, including model evaluation and experimentation, large language model inference, similarity search, guardrails, governance, observability, and agentic AI
  • Fine-tune, develop and evaluate machine learning and foundation models
  • Collaborate as part of a cross-functional Agile team to create and enhance software that utilizes state-of-the-art AI and ML capabilities
  • Contribute thought leadership and technical vision to the long-term roadmap of pioneering AI systems at Capital One
  • Leverage a broad stack of Open Source and SaaS AI technologies
  • Inform your ML infrastructure decisions using your understanding of ML modeling techniques and issues
  • Retrain, maintain, and monitor models in production
  • Construct optimized data pipelines to feed ML models
  • Ensure all code is well-managed to reduce vulnerabilities, models are well-governed from a risk perspective, and the ML follows best practices in Responsible and Explainable AI
What we offer
  • Performance-based incentive compensation
  • Cash bonus(es) and/or long-term incentives (LTI)
  • Comprehensive, competitive, and inclusive set of health, financial, and other benefits that support your total well-being
  • Fulltime