Training: ML Framework Engineer

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

205000.00 - 445000.00 USD / Year

Job Description:

Training Runtime designs the core distributed machine-learning training runtime that powers everything from early research experiments to frontier-scale model runs. With a dual mandate to accelerate researchers and enable frontier scale, we’re building a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve. Our work focuses on three pillars: high-performance, asynchronous, zero-copy tensor and optimizer-state-aware data movement; performant, high-uptime, fault-tolerant training frameworks (training loop, state management, resilient checkpointing, deterministic orchestration, and observability); and distributed process management for long-lived, job-specific and user-provided processes. We integrate proven large-scale capabilities into a composable, developer-facing runtime so teams can iterate quickly and run reliably at any scale, partnering closely with model-stack, research, and platform teams. Success for us is measured by raising both training throughput (how fast models train) and researcher throughput (how fast ideas become experiments and products).

Job Responsibility:

  • Apply the latest techniques in our internal training framework to achieve impressive hardware efficiency for our training runs
  • Profile and optimize our training framework
  • Work with researchers to enable them to develop the next generation of models

Requirements:

  • Have run small-scale ML experiments
  • Love figuring out how systems work and continuously come up with ideas for how to make them faster while minimizing complexity and maintenance burden
  • Have strong software engineering skills and are proficient in Python

What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Training: ML Framework Engineer

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date
Until further notice
Requirements:
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility:
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer:
  • medical, dental, vision, and 401(k)
  • Fulltime

Senior ML Data Engineer

As a Senior Data Engineer, you will play a pivotal role in our AI/ML workstream,...
Salary:
Not provided
Awin Global
Expiration Date
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in data science, data engineering, or computer science with a focus on math and statistics (Master’s degree preferred)
  • At least 5 years of experience as an AI/ML data engineer undertaking the tasks and accountabilities of this role
  • Strong foundation in computer science principles and statistical methods
  • Strong experience with cloud technology (AWS or Azure)
  • Strong experience creating data ingestion pipelines and ETL processes
  • Strong knowledge of big data tools such as Spark, Databricks, and Python
  • Strong understanding of common machine learning techniques and frameworks (e.g., MLflow)
  • Strong knowledge of natural language processing (NLP) concepts
  • Strong knowledge of Scrum practices and an agile mindset
Job Responsibility:
  • Design and maintain scalable data pipelines and storage systems for both agentic and traditional ML workloads
  • Productionise LLM- and agent-based workflows, ensuring reliability, observability, and performance
  • Build and maintain feature stores, vector/embedding stores, and core data assets for ML
  • Develop and manage end-to-end traditional ML pipelines: data prep, training, validation, deployment, and monitoring
  • Implement data quality checks, drift detection, and automated retraining processes
  • Optimise cost, latency, and performance across all AI/ML infrastructure
  • Collaborate with data scientists and engineers to deliver production-ready ML and AI systems
  • Ensure AI/ML systems meet governance, security, and compliance requirements
  • Mentor teams and drive innovation across both agentic and classical ML engineering practices
  • Participate in team meetings and contribute to project planning and strategy discussions
What we offer:
  • Flexi-Week and Work-Life Balance: We prioritise your mental health and well-being, offering you a flexible four-day Flexi-Week at full pay and with no reduction to your annual holiday allowance. We also offer a variety of different paid special leaves as well as volunteer days
  • Remote Working Allowance: You will receive a monthly allowance to cover part of your running costs. In addition, we will support you in setting up your remote workspace appropriately
  • Pension: Awin offers access to an additional pension insurance to all employees in Germany
  • Flexi-Office: We offer an international culture and flexibility through our Flexi-Office and hybrid/remote work possibilities to work across Awin regions
  • Development: We’ve built our extensive training suite Awin Academy to cover a wide range of skills that nurture you professionally and personally, with trainings conveniently packaged together to support your overall development
  • Appreciation: Thank and reward colleagues by sending them a voucher through our peer-to-peer program

Senior Platform Engineer, ML Data Systems

We’re looking for an ML Data Engineer to evolve our eval dataset tools to meet t...
Location:
United States, Mountain View
Salary:
137871.00 - 172339.00 USD / Year
Khan Academy
Expiration Date
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field
  • 5 years of Software Engineering experience with 3+ of those years working with large ML datasets, especially those in open-source repositories such as Hugging Face
  • Strong programming skills in Go, Python, SQL, and at least one data pipeline framework (e.g., Airflow, Dagster, Prefect)
  • Experience with data versioning tools (e.g., DVC, LakeFS) and cloud storage systems
  • Familiarity with machine learning workflows — from training data preparation to evaluation
  • Familiarity with the architecture and operation of large language models, and a nuanced understanding of their capabilities and limitations
  • Attention to detail and an obsession with data quality and reproducibility
  • Motivated by the Khan Academy mission “to provide a free world-class education for anyone, anywhere.”
  • Proven cross-cultural competency skills demonstrating self-awareness, awareness of others, and the ability to adopt inclusive perspectives, attitudes, and behaviors to drive inclusion and belonging throughout the organization.
Job Responsibility:
  • Evolve and maintain pipelines for transforming raw trace data into ML-ready datasets
  • Clean, normalize, and enrich data while preserving semantic meaning and consistency
  • Prepare and format datasets for human labeling, and integrate results into ML datasets
  • Develop and maintain scalable ETL pipelines using Airflow, DBT, Go, and Python running on GCP
  • Implement automated tests and validation to detect data drift or labeling inconsistencies
  • Collaborate with AI engineers, platform developers, and product teams to define data strategies in support of continuously improving the quality of Khan’s AI-based tutoring
  • Contribute to shared tools and documentation for dataset management and AI evaluation
  • Inform our data governance strategies for proper data retention, PII controls/scrubbing, and isolation of particularly sensitive data such as offensive test imagery.
What we offer:
  • Competitive salaries
  • Ample paid time off as needed
  • 8 pre-scheduled Wellness Days in 2026 occurring on a Monday or a Friday for a 3-day weekend boost
  • Remote-first culture that caters to your time zone, with flexibility as needed
  • Generous parental leave
  • An exceptional team that trusts you and gives you the freedom to do your best
  • The chance to put your talents towards a deeply meaningful mission and the opportunity to work on high-impact products that are already defining the future of education
  • Opportunities to connect through affinity, ally, and social groups
  • 401(k) + 4% matching & comprehensive insurance, including medical, dental, vision, and life.
  • Fulltime

Senior ML Data Engineer

As a Senior Data Engineer, you will play a pivotal role in our AI/ML workstream,...
Location:
Poland, Warsaw
Salary:
Not provided
Awin Global
Expiration Date
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in data science, data engineering, or computer science with a focus on math and statistics (Master’s degree preferred)
  • At least 5 years of experience as an AI/ML data engineer undertaking the tasks and accountabilities of this role
  • Strong foundation in computer science principles and statistical methods
  • Strong experience with cloud technology (AWS or Azure)
  • Strong experience creating data ingestion pipelines and ETL processes
  • Strong knowledge of big data tools such as Spark, Databricks, and Python
  • Strong understanding of common machine learning techniques and frameworks (e.g., MLflow)
  • Strong knowledge of natural language processing (NLP) concepts
  • Strong knowledge of Scrum practices and an agile mindset
  • Strong analytical and problem-solving skills with attention to data quality and accuracy
Job Responsibility:
  • Design and maintain scalable data pipelines and storage systems for both agentic and traditional ML workloads
  • Productionise LLM- and agent-based workflows, ensuring reliability, observability, and performance
  • Build and maintain feature stores, vector/embedding stores, and core data assets for ML
  • Develop and manage end-to-end traditional ML pipelines: data prep, training, validation, deployment, and monitoring
  • Implement data quality checks, drift detection, and automated retraining processes
  • Optimise cost, latency, and performance across all AI/ML infrastructure
  • Collaborate with data scientists and engineers to deliver production-ready ML and AI systems
  • Ensure AI/ML systems meet governance, security, and compliance requirements
  • Mentor teams and drive innovation across both agentic and classical ML engineering practices
  • Participate in team meetings and contribute to project planning and strategy discussions
What we offer:
  • Flexi-Week and Work-Life Balance: We prioritise your mental health and well-being, offering you a flexible four-day Flexi-Week at full pay and with no reduction to your annual holiday allowance. We also offer a variety of different paid special leaves as well as volunteer days
  • Remote Working Allowance: You will receive a monthly allowance to cover part of your running costs. In addition, we will support you in setting up your remote workspace appropriately
  • Pension: Awin offers access to an additional pension insurance to all employees in Germany
  • Flexi-Office: We offer an international culture and flexibility through our Flexi-Office and hybrid/remote work possibilities to work across Awin regions
  • Development: We’ve built our extensive training suite Awin Academy to cover a wide range of skills that nurture you professionally and personally, with trainings conveniently packaged together to support your overall development
  • Appreciation: Thank and reward colleagues by sending them a voucher through our peer-to-peer program

ML Engineer

The international IT company Andersen invites an ML Engineer to join our dynamic ...
Salary:
Not provided
Andersen
Expiration Date
Until further notice
Requirements:
  • Experience as a Machine Learning Engineer or in a similar role for 3+ years
  • Proficiency in Python, including hands-on experience with libraries such as scikit-learn, pandas, NumPy, and matplotlib
  • Strong understanding of core ML concepts — regression, classification, clustering, model validation, and performance metrics
  • Practical experience with deep learning frameworks such as TensorFlow, PyTorch, or Keras
  • Proven experience building, training, and deploying ML models using AWS SageMaker
  • Familiarity with AWS Bedrock for working with foundation and generative models (e.g., fine-tuning and orchestration of LLMs)
  • Hands-on experience with data preprocessing, feature engineering, and model evaluation
  • Knowledge of SQL and experience working with structured and semi-structured datasets
  • Understanding of ML model deployment (e.g., REST APIs with FastAPI or Flask; model packaging and containerization with Docker)
Job Responsibility:
  • Designing, training, and evaluating machine learning models (supervised, unsupervised, NLP, etc.)
  • Building scalable data and ML pipelines using modern tools
  • Collaborating with subject matter experts and analysts to prepare training datasets
  • Deploying models for production (batch or real-time inference)
  • Monitoring and maintaining model performance and data quality
  • Optimizing models for performance, interpretability, and cost
  • Documenting ML workflows and ensuring reproducibility
What we offer:
  • Experience in teamwork with leaders in FinTech, Healthcare, Retail, Telecom, and others
  • The opportunity to change the project and/or develop expertise in an interesting business domain
  • Guarantee of professional, financial, and career growth
  • The opportunity to earn up to an additional 1,000 USD per month (paid as part of the annual bonus, depending on level of expertise) by participating in the company's activities
  • Access to the corporate training portal
  • Bright corporate life (parties / pizza days / PlayStation / fruits / coffee / snacks / movies)
  • Certification compensation (AWS, PMP, etc)
  • Referral program
  • English courses
  • Private health insurance and compensation for sports activities

AI ML Engineer

We are seeking a talented ML/AI Engineer to join our innovative team and drive the d...
Location:
India, Hyderabad
Salary:
Not provided
Genzeon
Expiration Date
Until further notice
Requirements:
  • Strong experience in building ML models with proven track record of successful deployments
  • Extensive experience in Generative AI including LLMs, diffusion models, and related technologies
  • Experience in Agentic AI and understanding of autonomous agent architectures
  • Proficiency with Model Context Protocol (MCP) for agent communication and control
  • Advanced Python programming with expertise in ML libraries (scikit-learn, TensorFlow, PyTorch, etc.)
  • Google Cloud Platform (GCP) experience with ML-focused services
  • Vertex AI hands-on experience for model lifecycle management
  • AutoML experience for automated machine learning workflows
  • Model Armour or similar model security and protection frameworks
  • 3+ years of experience in machine learning engineering or related field
Job Responsibility:
  • Design, develop, and deploy robust machine learning models for various business applications
  • Build and optimize generative AI solutions using latest frameworks and techniques
  • Implement agentic AI systems that can autonomously perform complex tasks
  • Develop and maintain ML pipelines from data ingestion to model deployment
  • Leverage Google Cloud Platform (GCP) services for scalable ML infrastructure
  • Utilize Vertex AI for model training, deployment, and management
  • Implement AutoML solutions for rapid prototyping and model development
  • Ensure model security and compliance using Model Armour and related tools
  • Write clean, efficient Python code for ML applications and data processing
  • Optimize model performance, accuracy, and computational efficiency

Sr. Staff ML Platform Engineer

Machine learning is the crucial enabler for every financial service that EarnIn ...
Location:
United States, Mountain View
Salary:
360000.00 - 440000.00 USD / Year
EarnIn
Expiration Date
Until further notice
Requirements:
  • Bachelor's or Master’s degree in Computer Science, Engineering, or a related field
  • 8+ years of industry machine learning experience and excellent software engineering skills
  • Strong programming skills in Python, with familiarity in ML frameworks such as TensorFlow or PyTorch
  • Experience with ML cloud platforms such as AWS Sagemaker, Databricks, or GCP Vertex AI
  • Familiarity with data pipelines and workflow management tools
  • Strong communication and collaboration skills
  • Passion for learning and staying updated with the latest industry trends in machine learning and platform engineering
Job Responsibility:
  • Design, build, and maintain a robust ML platform and tooling ecosystem that supports the entire machine learning lifecycle, from experimentation to production
  • Lead and mentor a team of ML engineers, deeply understanding their workflows to streamline model training, deployment, and monitoring, while ensuring reproducibility and consistency of results
  • Drive scalability, reliability, and cost efficiency of the ML platform, balancing performance with ease of use for scientists and engineers
  • Evaluate and adopt emerging technologies to continually advance the organization’s machine learning capabilities and maintain a competitive edge
  • Champion operational excellence, setting a high bar for engineering quality, reliability, and automation
  • Act as a catalyst for innovation, spearheading step-change improvements that unlock new opportunities for growth and efficiency
What we offer:
  • equity and benefits
  • Fulltime