Machine Learning Systems Engineer

United States, Bala Cynwyd (Philadelphia Area), Pennsylvania · Job Posted February 03, 2026

Job Description

We’re looking for a Machine Learning Systems Engineer to strengthen the performance and scalability of our distributed training infrastructure. In this role, you'll work closely with researchers to streamline the development and execution of large-scale training runs, helping them make the most of our compute resources. You’ll contribute to building tools that make distributed training more efficient and accessible, while continuously refining system performance through careful analysis and optimization. This position is a great fit for someone who enjoys working at the intersection of distributed systems and machine learning, values high-performance code, and has an interest in supporting innovative machine learning efforts.

Job Responsibility

  • Collaborate with researchers to enable them to develop systems-efficient models and architectures
  • Apply the latest techniques to our internal training runs to achieve high hardware efficiency
  • Create tooling to help researchers distribute their training jobs more effectively
  • Profile and optimize our training runs

Requirements

  • Experience with large-scale ML training pipelines and distributed training frameworks
  • Strong software engineering skills in Python
  • Passion for diving deep into systems implementations and understanding fundamentals to improve their performance and maintainability
  • Experience improving resource efficiency across distributed computing environments by leveraging profiling, benchmarking, and implementing system-level optimizations


Similar Jobs for Machine Learning Systems Engineer

8 matching positions

Machine Learning Systems Engineer

As a Machine Learning Systems Engineer on the AI & ML Platform team, you will bu...
Location
United States
Salary:
145800.00 - 229125.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • Fluency in at least one modern object-oriented programming language (preferably Java/Kotlin)
  • Understanding and experience with Machine Learning project lifecycle and tools
  • Understanding of LLMs, best deployment practices and inference optimisation
  • Experience in building and implementing high-performance RESTful micro-services
  • Experience building and operating large scale distributed systems using Amazon Web Services (Sagemaker, S3, Cloud Formation, AWS Security and Networking)
  • Experience with Continuous Delivery and Continuous Integration
Job Responsibility
  • Build and scale the core infrastructure to allow software engineers, ML engineers & data scientists to develop, train, evaluate, deploy, and operate Machine Learning models and pipelines
  • Build systems for product teams like Jira & Confluence to provide access to curated LLMs
  • Use software development expertise to solve difficult problems, tackling infrastructure and architecture challenges
  • Lead engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input and communicate results
  • Regularly tackle complex problems in the team, from technical design to launch
  • Routinely tackle complex architecture challenges and define coding standards & patterns for the team
  • Lead the team through times of ambiguity, help them adapt and deliver positive impact
  • Mentor junior members on the team
What we offer
  • Health coverage
  • Paid volunteer days
  • Wellness resources
  • Bonuses
  • Commissions
  • Equity
  • Fulltime

Principal Machine Learning Systems Engineer

Search Platform powers the search functionality in Atlassian products. The team ...
Salary:
Not provided
Atlassian
Expiration Date
Until further notice
Requirements
  • 10+ years experience in multiple hands-on software/technology leadership roles, with end-to-end responsibility through the software development lifecycle
  • Worked on scaling ML use cases for 50+ TB of data
  • Good understanding of PySpark and Databricks jobs scaling challenges
  • Experience with ML workflows and observability at scale
  • Bachelor's degree with a preference for Computer Science degree
  • Expertise with one or more prominent languages such as Java, Python, Kotlin, Go, or TypeScript is required
  • Understanding of SaaS, PaaS, IaaS industry with hands-on experience with public cloud offerings (e.g., AWS, GCP, or Azure)
  • Java, Spring, REST, and NoSQL databases
  • Experience building event-driven systems based on SQS, SNS, Kafka or equivalent technologies
  • Knowledge to evaluate trade-offs between correctness, robustness, performance, space and time
Job Responsibility
  • Handle complex problems in the team from technical design to launch
  • Determine plans-of-attack on large projects
  • Solve complex architecture challenges and apply architectural standards and start using them on new projects
  • Lead code reviews & documentation and take on complex bug fixes, especially on high-risk problems
  • Set the standard for meaningful code reviews
  • Partner across engineering teams to take on company-wide programmes in multiple projects
  • Transfer your depth of knowledge from your current language to excel as a Software Engineer
  • Mentor junior members of the team
What we offer
  • Atlassians can choose where they work – whether in an office, from home, or a combination of the two
  • Health and wellbeing resources
  • Paid volunteer days

Senior Machine Learning Systems Engineer

Our organization drives AI innovation across Jira products. We deliver seamless ...
Salary:
Not provided
Atlassian
Expiration Date
Until further notice
Requirements
  • Extensive experience building Machine Learning and AI solutions (4+ years)
  • Proven experience developing, deploying, and maintaining end-to-end ML systems, including data engineering, model serving, and monitoring
  • Expert proficiency with GenAI frameworks and tools, including developing and fine-tuning large language models (LLMs) and building retrieval-augmented generation (RAG) systems
  • Expert proficiency in Python and ML frameworks like PyTorch, TensorFlow, or JAX
  • Experience implementing MLOps, CI/CD pipelines, and automation for continuous training, deployment, and monitoring of ML models
Job Responsibility
  • Collaborate with software engineers, data scientists, and product managers to solve complex problems
  • Lead projects from technical design through launch
  • Partner with teams to achieve impactful results
  • Deliver robust ML solutions to build AI features reaching millions, including curating ML datasets, fine-tuning open-source LLMs, or accessing proprietary LLMs
  • Mentor junior members of the team
What we offer
  • Health and wellbeing resources
  • Paid volunteer days

Senior Machine Learning Systems Engineer

As a Senior Machine Learning Systems Engineer at Abridge, you’ll play a pivotal ...
Location
United States, San Francisco
Salary:
221000.00 - 260000.00 USD / Year
Abridge
Expiration Date
Until further notice
Requirements
  • Strong experience in building and deploying machine learning models in production environments
  • Deep understanding of container orchestration and distributed systems architecture
  • Expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Excellent communication skills, with the ability to interface between research and product engineering
Job Responsibility
  • Design, deploy and maintain scalable Kubernetes clusters for AI model inference and training
  • Develop, optimize, and maintain ML model serving and training infrastructure, ensuring high performance and low latency
  • Collaborate with ML and product teams to scale backend infrastructure for AI-driven products, focusing on model deployment, throughput optimization, and compute efficiency
  • Optimize compute-heavy workflows and enhance GPU utilization for ML workloads
  • Build a robust model API orchestration system
  • Collaborate with leadership to define and implement strategies for scaling infrastructure as the company grows, ensuring long-term efficiency and performance
What we offer
  • Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
  • Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
  • Paid Parental Leave: Generous paid parental leave for all full-time employees
  • Family Forming Benefits: Resources and financial support to help you build your family
  • 401(k) Matching: Contribution matching to help invest in your future
  • Personal Device Allowance: Tax free funds for personal device usage
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
  • Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
  • Fulltime

Research Engineer - Machine Learning and Systems

We are hiring a principal level Research Engineer with deep strength in machine ...
Location
United States, New York City
Salary:
100000.00 - 250000.00 USD / Year
Helpcare AI
Expiration Date
Until further notice
Requirements
  • PhD in Computer Science, Machine Learning, Computer Graphics, Computer Vision, or related field, or equivalent research track record
  • Seven or more years of experience in applied ML or research engineering, including significant time in fast-paced or startup settings
  • Strong publication record in top venues such as NeurIPS, ICLR, ICML, CVPR, ECCV, ICCV, SIGGRAPH, or TOG, with multiple first-author papers or equivalent impactful artifacts
  • Proven experience training and serving large models at scale, including multi-GPU or multi-node training, distributed data loading, mixed precision, and memory optimization
  • Fluency in Python and C++ and experience writing efficient CUDA or Triton kernels
  • Expertise with PyTorch or JAX and modern tooling for experiment tracking, evaluation, and deployment
  • Demonstrated ability to take ideas from paper to production with measurable impact on users or business outcomes
  • Strong systems skills including profiling, performance tuning, reliability engineering, and cost awareness
  • Excellent communication with the ability to work across research and product teams
Job Responsibility
  • Research, design, and implement models and systems across vision, generative modeling, simulation, rendering, and 3D perception
  • Build data, training, evaluation, and deployment pipelines with strong observability and reproducibility
  • Translate research insights into reliable production services that meet product and latency requirements
  • Contribute hands-on across prototyping, optimization, integration, and scaling
  • Survey new methods and run grounded evaluations to identify what to adopt and when
  • Share expertise through design reviews, mentoring, and documentation
What we offer
  • Relocation support available
  • Fulltime

Machine Learning Systems Research Engineer, Agent Post-training - Enterprise GenAI

The Enterprise ML Research Lab works on the front lines of this AI revolution. W...
Location
United States, San Francisco; New York
Salary:
218400.00 - 273000.00 USD / Year
Scale
Expiration Date
Until further notice
Requirements
  • At least 1-3 years of experience training LLMs in a production environment
  • Passionate about system optimization
  • Experience with post-training methods like RLHF/RLVR and related algorithms like PPO/GRPO etc.
  • Demonstrated know-how in operating modern GPU cluster architectures
  • Experience with multi-node LLM training and inference
  • Strong software engineering skills, proficient in frameworks and tools such as CUDA, PyTorch, transformers, flash attention, etc.
  • Strong written and verbal communication skills to operate in a cross functional team environment
  • PhD or Masters in Computer Science or a related field
Job Responsibility
  • Build, profile and optimize our training and inference framework
  • Post-train state of the art models, developed both internally and from the community, to define stable post-training recipes for our enterprise engagements
  • Collaborate with ML teams to accelerate their research and development, and enable them to develop the next generation of models and data curation
  • Create a next-gen agent training algorithm for multi-agent/multi-tool rollouts
What we offer
  • Comprehensive health, dental and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
  • Additional benefits such as a commuter stipend
  • Equity-based compensation
  • Fulltime

Machine Learning Data Engineer - Systems & Retrieval

As a Machine Learning Data Engineer - Systems & Retrieval, you will build and op...
Location
United States, Palo Alto
Salary:
Not provided
Zyphra
Expiration Date
Until further notice
Requirements
  • Strong software engineering background with fluency in Python
  • Experience designing, building, and maintaining data pipelines in production environments
  • Deep understanding of data structures, storage formats, and distributed data systems
  • Familiarity with indexing and retrieval techniques for large-scale document corpora
  • Understanding of database systems (SQL and NoSQL), their internals, and performance characteristics
  • Strong attention to security, access controls, and compliance best practices (e.g., GDPR, SOC2)
  • Excellent debugging, observability, and logging practices to support reliability at scale
  • Strong communication skills and experience collaborating across ML, infra, and product teams
Job Responsibility
  • Design and implementation of distributed data ingestion and transformation pipelines
  • Building retrieval and indexing systems that support RAG and other LLM-based methods
  • Mining and organizing large unstructured datasets, both in research and production environments
  • Collaborating with ML engineers, systems engineers, and DevOps to scale pipelines and observability
  • Ensuring compliance and access control in data handling, with security and auditability in mind
What we offer
  • Comprehensive medical, dental, vision, and FSA plans
  • Competitive compensation and 401(k)
  • Relocation and immigration support on a case-by-case basis
  • On-site meals prepared by a dedicated culinary team
  • Thursday Happy Hours
  • Fulltime

Machine Learning Engineer, Distributed Data Systems

As a Research Engineer, Distributed Data Systems, you will design and scale the ...
Location
United States, San Francisco
Salary:
295000.00 - 445000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements
  • Strong experience with distributed systems and large-scale infrastructure
  • Detail-oriented, with rigor in building and maintaining reliable systems
  • Excellent software engineering fundamentals and organizational skills
  • Comfortable with ambiguity and rapid change
Job Responsibility
  • Design, build, and maintain data infrastructure systems such as distributed compute, data orchestration, distributed storage, streaming infrastructure, and machine learning infrastructure, while ensuring scalability, reliability, and security
  • Ensure our data platform can scale by orders of magnitude while remaining reliable and efficient
  • Partner with researchers to deeply understand requirements and translate them into production-ready systems
  • Harden, optimize, and maintain critical data infrastructure systems that power multimodal training and evaluation
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime