Senior Machine Learning Engineer, ML Training Platform Job at Reddit

Senior Machine Learning Engineer - ML Training Infrastructure

We are seeking an experienced, technically oriented, impact-driven expe...

Location

United States, Mountain View

Salary:

170000.00 - 240000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Bachelor’s degree or higher in Computer Science or an equivalent major, or equivalent relevant experience
3+ years professional software engineering experience
2+ years specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Willingness to travel to Sunnyvale, CA as needed
Comfortable working in highly ambiguous and dynamic environments

Job Responsibility

Design and develop a scalable, reliable, high-performance ML framework to support model training at scale
Analyze and optimize model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost
Raise the bar on system observability, debuggability, operational excellence, and user experience
Collaborate with cross-functional teams to integrate new features and technologies into the platform

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Senior Machine Learning Engineer, Platform

We seek an outstanding, creative, and passionate Machine Learning Platform Engin...

Location

United States, San Jose

Salary:

229500.00 - 367100.00 USD / Year

Roku

Expiration Date

Until further notice

Requirements

5+ years of experience building software solutions to concrete problems
Strong CS fundamentals. Should be able to write an algorithm with ease
Fluent in at least one high-level programming language such as Java, Scala, Kotlin, or Python
Experience working with big data systems (Spark, Kafka, Flink, S3, Airflow)
Familiar with modern ML frameworks and tools: Ray, PyTorch, Hugging Face, AWS SageMaker
AI literacy and curiosity. You have either used Gen AI in or outside of your previous work, or you are curious about Gen AI and have explored it
MS in Computer Science or related field

Job Responsibility

Design, build, and maintain scalable platform services: feature store, real-time inference services, vector DBs etc., that serve millions of transactions per second
Run and monitor online A/B tests via robust platform services, analyzing platform metrics and business KPIs to optimize recommendation system performance
Collaborate closely with US-based engineering and cross-functional teams to translate business requirements into modular platform components and APIs
Enhance and evolve the ML platform ecosystem to support high developer velocity, system scalability, and adaptability to future business needs
Contribute to onboarding, training, and mentoring new team members on emerging platform engineering best practices and technologies

What we offer

health insurance
equity awards
life insurance
disability benefits
parental leave
wellness benefits
paid time off
global access to mental health and financial wellness support and resources
commuter benefits
retirement options (401(k)/pension)

Fulltime

Senior Machine Learning Engineer, AI Platform

Location

United States; Canada

Salary:

128000.00 - 171000.00 CAD / Year

Mozilla

Expiration Date

Until further notice

Requirements

Bachelor’s degree with 4–6 years of relevant industry experience, or Master’s degree with significant hands-on experience building and operating production ML systems, or equivalent work experience
Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
Hands-on experience working with GPU-based workloads and accelerated computing in production settings
Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
Ability to independently scope and drive technical initiatives while balancing product and operational priorities
Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams

Job Responsibility

Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews

What we offer

Generous performance-based bonus plans to all eligible employees
Rich medical, dental, and vision coverage
Generous retirement contributions with 100% immediate vesting (regardless of whether you contribute)
Quarterly all-company wellness days where everyone takes a pause together
Country-specific holidays plus a day off for your birthday
One-time home office stipend
Annual professional development budget
Quarterly well-being stipend
Considerable paid parental leave
Employee referral bonus program

Fulltime

Senior Machine Learning Engineer, AI Platform

The AI Platform team is responsible for building the foundational infrastructure...

Location

United States; Canada

Salary:

139000.00 - 218000.00 USD / Year

Mozilla

Expiration Date

Until further notice

Requirements

Bachelor’s degree with 4–6 years of relevant industry experience, or Master’s degree with significant hands-on experience building and operating production ML systems, or equivalent work experience
Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
Hands-on experience working with GPU-based workloads and accelerated computing in production settings
Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
Ability to independently scope and drive technical initiatives while balancing product and operational priorities
Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams

Job Responsibility

Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews

What we offer

Generous performance-based bonus plans
Rich medical, dental, and vision coverage
Generous retirement contributions with 100% immediate vesting
Quarterly all-company wellness days
Country-specific holidays plus a day off for your birthday
One-time home office stipend
Annual professional development budget
Quarterly well-being stipend
Considerable paid parental leave
Employee referral bonus program

Fulltime