Machine Learning Operations Lead

Together AI

Location:
United States of America, San Francisco

Contract Type:
Not provided

Salary:
160000.00 - 280000.00 USD / Year

Job Description:

Together AI is building the AI Inference & Model Shaping Platform that brings the most advanced generative AI models to the world. Our platform powers multi-tenant serverless workloads and dedicated endpoints, enabling developers, enterprises, and researchers to harness the latest LLMs, multimodal models, image, audio, video, and reasoning models at scale. We are looking for an exceptional MLOps Engineering Lead to partner closely with our cross-functional engineering, infrastructure, research, and sales teams to ensure excellence of our ML API offerings. Your primary focus will be on delivering world-class inference and fine-tuning in our public APIs and customer deployments by building automation and operations processes.

Job Responsibility:

  • Own availability and performance SLAs for production inference and fine-tuning services across serverless and dedicated deployments
  • Own and improve testing, deployment, configuration management, and monitoring practices for multi-cluster ML infrastructure, partnering closely with Infra SREs
  • Build self-serve tooling and automation to reduce operational toil and to support internal users (MLOps, customer experience) and self-serve offerings
  • Define and enforce configuration best practices for inference engines (vLLM, tvLLM, Pulsar) to prevent runtime issues
  • Lead incident response, conduct postmortems, and drive reliability improvements
  • Hire, mentor, and grow an MLOps engineering team
  • Partner with infrastructure and ML engineering teams to improve system reliability and cost efficiency

Requirements:

  • 5+ years operating production ML inference or training systems at scale
  • 2+ years leading engineering teams, with experience building teams from scratch
  • Deep expertise with Kubernetes, multi-cluster orchestration, and ML serving frameworks
  • Strong track record owning production SLAs (e.g. availability, time to first token (TTFT), tokens per second (TPS))
  • Experience with LLM inference serving systems (vLLM, TRT-LLM, or similar)
  • Ability to influence cross-functional teams and make deployment/architecture decisions

Nice to have:

  • Experience building internal developer platforms or self-serve tooling
  • Background in cost optimization for GPU infrastructure
  • Contributions to open-source ML infrastructure projects

What we offer:

  • Startup equity
  • Health insurance
  • Competitive benefits

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time

Work Type:
On-site work

Similar Jobs for Machine Learning Operations Lead

Senior Staff Machine Learning Engineer

Join the Affirm team as a Senior Staff Machine Learning Engineer and become a pi...
Location:
United States
Salary:
232000.00 - 310000.00 USD / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 10+ years of experience researching, designing, deploying, and operating large-scale, real-time machine learning systems
  • Experience leading end-to-end ML system design, from data architecture and feature pipelines to model training, evaluation, and production deployment
  • Proficiency in Python and ML frameworks, including PyTorch and XGBoost
  • Strong understanding of representation learning and embedding-based modeling
  • Deep expertise in neural network-based sequence modeling, including architectures such as Transformers, recurrent, or attention-based models, and multi-task learning systems
  • Deep hands-on experience with large-scale distributed ML infrastructure, including streaming or batch data ingestion, feature stores, feature engineering, training pipelines, model serving and inference infrastructure, monitoring, and automated retraining
  • Strong technical leadership: defining long-term strategy, guiding research direction, and aligning work across teams
  • Exceptional judgment, collaboration, and communication skills
  • Strong verbal and written communication skills that support effective collaboration across our global engineering organization
  • Equivalent practical experience or a Bachelor’s degree in a related field
Job Responsibility:
  • Define and drive multi-year, multi-team technical strategy for machine learning across Affirm
  • Lead the design, implementation, and scaling of advanced ML systems
  • Partner deeply with ML Platform, product, engineering, and risk leadership to shape long-term modeling capabilities
  • Provide broad technical leadership across the ML organization, mentoring senior engineers
  • Drive clarity and alignment on ambiguous, high-stakes technical decisions
  • Champion operational and system excellence at the area level
What we offer:
  • Equity rewards
  • Monthly stipends for health, wellness and tech spending
  • 100% subsidized medical coverage, dental and vision for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Competitive vacation and holiday schedules
  • Employee stock purchase plan enabling you to buy shares of Affirm at a discount
Employment Type:
Full-time

Senior Staff Machine Learning Engineer

Join the Affirm team as a Senior Staff Machine Learning Engineer and become a pi...
Location:
Canada
Salary:
206000.00 - 256000.00 CAD / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 10+ years of experience researching, designing, deploying, and operating large-scale, real-time machine learning systems
  • Experience leading end-to-end ML system design, from data architecture and feature pipelines to model training, evaluation, and production deployment
  • Proficiency in Python and ML frameworks, including PyTorch and XGBoost
  • Experience with ML tooling for training orchestration, experimentation, and model monitoring, such as Kubeflow, MLflow, or equivalent
  • Strong understanding of representation learning and embedding-based modeling
  • Deep expertise in neural network-based sequence modeling, including architectures such as Transformers, recurrent, or attention-based models, and multi-task learning systems
  • Deep hands-on experience with large-scale distributed ML infrastructure, including streaming or batch data ingestion, feature stores, feature engineering, training pipelines, model serving and inference infrastructure, monitoring, and automated retraining
  • Strong technical leadership: defining long-term strategy, guiding research direction, and aligning work across teams
  • Exceptional judgment, collaboration, and communication skills
  • Strong verbal and written communication skills
Job Responsibility:
  • Define and drive multi-year, multi-team technical strategy for machine learning across Affirm
  • Lead the design, implementation, and scaling of advanced ML systems
  • Partner deeply with ML Platform, product, engineering, and risk leadership to shape long-term modeling capabilities
  • Provide broad technical leadership across the ML organization
  • Drive clarity and alignment on ambiguous, high-stakes technical decisions
  • Champion operational and system excellence at the area level
What we offer:
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
Employment Type:
Full-time

Principal Machine Learning Engineer

As a Principal Engineer on the ITSM team, you will get the opportunity to work o...
Location:
Australia, Sydney
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 10+ years of total experience
  • Fluency in at least one scripting or object-oriented programming language
  • Solid understanding of machine learning concepts and algorithms, including supervised and unsupervised learning, deep learning, and NLP
  • Familiarity with popular ML libraries like scikit-learn, Keras/TensorFlow/PyTorch, NumPy, pandas
  • Good understanding of the machine learning project lifecycle
  • Familiarity with MLOps and experience with scaling and deploying Machine Learning models
Job Responsibility:
  • Work on cutting-edge AI and ML algorithms that help modernize IT Operations by reducing MTTR (mean time to resolve) and MTTI (mean time to identify)
  • Use software development expertise to solve difficult problems, tackling complex infrastructure and architecture challenges
  • Lead engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input, and communicate results
  • Work with a distributed, world-class team shaping the future of AIOps
  • Master Generative AI
  • Become a machine learning maestro
  • Collaborate with diverse minds
  • Make a tangible impact
  • Routinely tackle complex architectural challenges
What we offer:
  • Health coverage
  • Paid volunteer days
  • Wellness resources
Employment Type:
Full-time

Principal Machine Learning System Engineer

As a Principal Machine Learning Systems Engineer, you will lead the design, deve...
Location:
United States, Seattle; San Francisco
Salary:
190300.00 - 305600.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Lead the design, development, and deployment of scalable machine learning (ML) systems and infrastructure
  • Collaborate closely with data scientists, software engineers, and product teams
  • Optimize model performance
  • Ensure system reliability
  • Implement efficient data pipelines
  • Drive architectural decisions for high-performance computing and cloud-based ML platforms
  • Mentor junior engineers
  • Promote best practices in ML operations (MLOps)
  • Stay updated on emerging technologies
Job Responsibility:
  • Translate complex ML models into production-ready solutions
  • Ensure scalability and security
  • Deliver robust, scalable, and efficient machine learning solutions that support business growth and innovation
What we offer:
  • Health coverage
  • Paid volunteer days
  • Wellness resources
Employment Type:
Full-time

Senior Machine Learning Engineer

As a Senior Machine Learning Engineer in the Central AI team, you will build and...
Location:
Australia, Sydney
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Master's or PhD in a quantitative subject (Statistics, Mathematics, Computer Science, Operations Research), or relevant work experience
  • 3+ years of related industry experience in the data science domain
  • Expertise in Python or Java and the ability to write performant, production-quality code; familiarity with SQL; knowledge of Spark and cloud data environments (e.g. AWS, Databricks)
  • Experience building and scaling machine learning models in business applications using large amounts of data
  • Ability to communicate and explain data science concepts to diverse audiences and to craft a compelling story
  • Focus on business practicality and the 80/20 rule
  • A very high bar for output quality, while recognizing the business benefit of "having something now" vs. "perfection sometime in the future"
  • Agile development mindset, appreciating the benefit of constant iteration and improvement
Job Responsibility:
  • Build and maintain the core infrastructure to allow machine learning engineers and data scientists to develop, train, evaluate, deploy, and operate Machine Learning models and pipelines
  • Use software development expertise to solve difficult problems, tackling complex infrastructure and architecture challenges
  • Design system and model architectures, conduct rigorous experimentation and model evaluations, and provide guidance to junior ML engineers
  • Lead other engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input and communicate results
What we offer:
  • Health and wellbeing resources
  • Paid volunteer days
Employment Type:
Full-time

Senior Machine Learning Systems Engineer

Our team is building the foundations to democratise Machine Learning for Atlassi...
Location:
India, Bengaluru
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Fluency in at least one modern object-oriented programming language (preferably Java/Kotlin)
  • Understanding and experience with Machine Learning project lifecycle and tools
  • Understanding of LLMs, best deployment practices and inference optimisation
  • Experience in building and implementing high-performance RESTful microservices
  • Experience building and operating large-scale distributed systems using Amazon Web Services (SageMaker, S3, CloudFormation, AWS Security and Networking)
  • Experience with Continuous Delivery and Continuous Integration
Job Responsibility:
  • Build and scale the core infrastructure to allow software engineers, ML engineers & data scientists to develop, train, evaluate, deploy, and operate Machine Learning models and pipelines
  • Build systems for product teams like Jira & Confluence to provide access to curated LLMs
  • Use software development expertise to solve difficult problems, tackling infrastructure and architecture challenges
  • Lead engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input and communicate results
What we offer:
  • Health coverage
  • Paid volunteer days
  • Wellness resources
Employment Type:
Full-time

Machine Learning Systems Engineer

As a Machine Learning Systems Engineer on the AI & ML Platform team, you will bu...
Location:
United States
Salary:
145800.00 - 229125.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Fluency in at least one modern object-oriented programming language (preferably Java/Kotlin)
  • Understanding and experience with Machine Learning project lifecycle and tools
  • Understanding of LLMs, best deployment practices and inference optimisation
  • Experience in building and implementing high-performance RESTful microservices
  • Experience building and operating large-scale distributed systems using Amazon Web Services (SageMaker, S3, CloudFormation, AWS Security and Networking)
  • Experience with Continuous Delivery and Continuous Integration
Job Responsibility:
  • Build and scale the core infrastructure to allow software engineers, ML engineers & data scientists to develop, train, evaluate, deploy, and operate Machine Learning models and pipelines
  • Build systems for product teams like Jira & Confluence to provide access to curated LLMs
  • Use software development expertise to solve difficult problems, tackling infrastructure and architecture challenges
  • Lead engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input and communicate results
  • Regularly tackle complex problems in the team, from technical design to launch
  • Routinely tackle complex architecture challenges and define coding standards and patterns for the team
  • Lead the team through times of ambiguity, help them adapt and deliver positive impact
  • Mentor junior members on the team
What we offer:
  • Health coverage
  • Paid volunteer days
  • Wellness resources
  • Bonuses
  • Commissions
  • Equity
Employment Type:
Full-time

Senior Staff Machine Learning Engineer

Help design our AI platform and develop our next generation of machine learning ...
Location:
United States, San Francisco
Salary:
216500.00 - 324500.00 USD / Year
GoFundMe
Expiration Date:
Until further notice
Requirements:
  • 9+ years of hands-on experience in machine learning engineering, AI development, software engineering, or related fields
  • Experience emphasizing secure, large-scale, distributed system design, AI/ML pipeline development, and implementation
  • Extensive experience designing, developing, and operating scalable backend systems
  • Experience applying software engineering best practices such as domain-driven design, event-driven architectures, and microservices
  • Deep expertise in agentic workflows, AI evaluation solutions, prompt management, and secure AI development and testing practices
  • Strong knowledge of relational and document-based databases, data storage paradigms, and efficient RESTful API design
  • Experience establishing robust CI/CD pipelines, automated testing (unit and integration), and deployment practices
  • Strong leadership skills, including effective planning and management of complex projects, mentoring of team members, and fostering a collaborative, high-performing engineering culture
  • Excellent communicator, able to articulate complex technical concepts clearly to both technical and non-technical stakeholders
  • Bachelor's degree in Computer Science, Software Engineering, or a related technical field (preferred)
Job Responsibility:
  • Design and implement AI platforms to enable scalable and secure access to LLMs from multiple model providers for diverse use cases
  • Design and implement agentic workflows, agentic tool ecosystems, and LLM prompt management solutions
  • Design, build, and optimize scalable model training, fine-tuning, and inference pipelines, ensuring robust integration with production systems
  • Influence technical strategy and approach to developing embedding stores, vector databases, and other reusable assets
  • Lead initiatives to streamline ML and AI workflows, improve operational efficiency, and establish standardized procedures to achieve consistent, high-quality results across our AI systems
  • Design and develop backend services and RESTful APIs using Python and FastAPI, integrating seamlessly with ML pipelines and services
  • Take operational responsibility for team-owned services, including performance monitoring, optimization, troubleshooting, and participation in an on-call rotation
  • Collaborate with both technical and non-technical colleagues, including data and applied scientists, software engineers, product managers, and business stakeholders, to deliver reliable and scalable ML-driven products
  • Coach and mentor fellow ML engineers, promoting a culture of collaboration, continuous improvement, and engineering excellence within the team
  • Employ a diverse set of tools and platforms including Python, AWS, Databricks, Docker, Kubernetes, FastAPI, Terraform, Snowflake, Coralogix, and GitHub to build, deploy, and maintain scalable, highly available machine learning infrastructure
What we offer:
  • Competitive pay
  • Comprehensive healthcare benefits
  • Financial assistance for things like hybrid work and family planning
  • Generous parental leave
  • Flexible time-off policies
  • Mental health and wellness resources
  • Learning, development, and recognition programs
Employment Type:
Full-time