AI Production Engineer Job at Meta (Menlo Park)

Job Description

Production Engineers (PEs) at Meta are specialized software engineers who develop the underlying infrastructure for all of Meta's products and services, forming the backbone of every major engineering effort that keeps our platforms running smoothly and scaling efficiently. As an AI Production Engineer on our AI Transformation team, you will apply this discipline to build and scale production-grade AI systems that enhance the productivity and experience of our executive leadership. This is primarily a software and systems engineering role: you will spend the majority of your time writing high-quality code, designing resilient systems, building automation, and creating tooling that enables AI to run reliably and efficiently.

Job Responsibilities

Design and implement production-grade AI/ML systems for executive productivity, including LLMs, RAG systems, agents, inference pipelines, and MLOps infrastructure
Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on complex AI systems serving executive leadership
Build automation, self-healing systems, and CI/CD pipelines to minimize manual intervention and operational toil
Own AI infrastructure—training, inference, data pipelines, and GPU fleet management—across cloud platforms (AWS, Azure, GCP) and Kubernetes
Set technical direction, lead design reviews, mentor engineers, and advise leadership on AI technology trends and trade-offs
Share an on-call rotation (~1 week per quarter) and serve as an escalation contact for critical AI system incidents
Champion reliability by design—building resilience into systems from the start with circuit breakers, fallbacks, and graceful degradation
Travel globally up to 20% of the year to engage with executive partners and scale business opportunities

Requirements

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
7+ years of experience in Linux/Unix and network fundamentals
7+ years of coding experience in an industry-standard language (e.g., Python, Go, C++, Java, Rust)
Experience with Internet service architecture, capacity planning, and handling needs for urgent capacity augmentation
Knowledge of common web technologies and Internet service architectures (CDN, load balancing, distributed systems)
Experience configuring and running infrastructure-level applications such as Kubernetes, Terraform, and cloud platforms (AWS, Azure, GCP)
Experience building and productionizing AI/ML systems, including LLMs, RAG architectures, inference optimization, and MLOps
Proven track record of leading complex technical initiatives and mentoring other engineers

Nice to have

BS or MS in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Background in Production Engineering, Platform Engineering, or Site Reliability Engineering (SRE)
Experience with GPU infrastructure, ML accelerators, and model serving at scale
Familiarity with observability tools (Prometheus, Grafana, Datadog) and database/caching technologies (MySQL, Redis, Memcached)

What we offer

bonus
equity
benefits

