AI Production Engineer

United States, Menlo Park · 184000.00 - 257000.00 USD / Year · Job Posted February 16, 2026

Job Description

Production Engineers (PEs) at Meta are specialized software engineers who develop the underlying infrastructure for all of Meta's products and services, forming the backbone of every major engineering effort that keeps our platforms running smoothly and scaling efficiently. As an AI Production Engineer on our AI Transformation team, you will apply this discipline to build and scale production-grade AI systems that enhance the productivity and experience of our executive leadership. This is primarily a software and systems engineering role: you will spend the majority of your time writing high-quality code, designing resilient systems, building automation, and creating tooling that enables AI to run reliably and efficiently.

Job Responsibility

  • Design and implement production-grade AI/ML systems for executive productivity, including LLMs, RAG systems, agents, inference pipelines, and MLOps infrastructure
  • Write and review code, develop documentation and capacity plans, and debug the hardest problems, live, on complex AI systems serving executive leadership
  • Build automation, self-healing systems, and CI/CD pipelines to minimize manual intervention and operational toil
  • Own AI infrastructure—training, inference, data pipelines, and GPU fleet management—across cloud platforms (AWS, Azure, GCP) and Kubernetes
  • Set technical direction, lead design reviews, mentor engineers, and advise leadership on AI technology trends and trade-offs
  • Share an on-call rotation (~1 week per quarter) and serve as an escalation contact for critical AI system incidents
  • Champion reliability by design—building resilience into systems from the start with circuit breakers, fallbacks, and graceful degradation
  • Travel globally up to 20% of the year to engage with executive partners and scale business opportunities

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 7+ years of experience in Linux/Unix and network fundamentals
  • 7+ years of coding experience in an industry-standard language (e.g., Python, Go, C++, Java, Rust)
  • Experience with Internet service architecture, capacity planning, and handling needs for urgent capacity augmentation
  • Knowledge of common web technologies and Internet service architectures (CDN, load balancing, distributed systems)
  • Experience configuring and running infrastructure-level applications such as Kubernetes, Terraform, and cloud platforms (AWS, Azure, GCP)
  • Experience building and productionizing AI/ML systems, including LLMs, RAG architectures, inference optimization, and MLOps
  • Proven track record of leading complex technical initiatives and mentoring other engineers

Nice to have

  • BS or MS in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Background in Production Engineering, Platform Engineering, or Site Reliability Engineering (SRE)
  • Experience with GPU infrastructure, ML accelerators, and model serving at scale
  • Familiarity with observability tools (Prometheus, Grafana, Datadog) and database/caching technologies (MySQL, Redis, Memcached)

What we offer

  • bonus
  • equity
  • benefits

Similar Jobs for AI Production Engineer

8 matching positions

Application Production Support Engineer - Generative AI Tools- Assistant Vice President

Location
India, Pune
Salary
Not provided
Citi (https://www.citi.com/)
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 8-13 years of experience in application production support or a related role.
  • Strong understanding of software development lifecycle and DevOps principles.
  • Experience supporting cloud-based applications, preferably on AWS, Azure, or GCP.
  • Change management, incident management, problem management, and stakeholder management
  • Proficiency in at least one scripting language (e.g., Python, Bash).
  • Familiarity with monitoring tools (e.g., Datadog, Prometheus, Grafana).
  • Excellent problem-solving and analytical skills.
  • Strong communication and collaboration skills.
  • A passion for AI and machine learning.
Job Responsibility
  • Provide front-line technical support: Diagnose and resolve production issues related to our generative AI applications, including performance bottlenecks, API errors, data inconsistencies, and infrastructure problems.
  • Monitor application health: Utilize monitoring tools and dashboards to track key metrics, identify anomalies, and proactively address potential issues before they impact users.
  • Incident Management: Follow established incident management procedures to document, escalate, and resolve production incidents, ensuring timely communication with stakeholders.
  • Collaborate with engineering teams: Work closely with development and infrastructure teams to identify the root cause of issues, implement fixes, and prevent future occurrences.
  • Develop and maintain documentation: Create and update technical documentation, including runbooks, knowledge base articles, and troubleshooting guides.
  • Automate support tasks: Identify opportunities to automate repetitive tasks and improve support efficiency through scripting and tooling.
  • Participate in on-call rotation: Provide on-call support on a rotational basis to ensure 24/7 coverage for critical applications.
  • Continuous Improvement: Contribute to the continuous improvement of our support processes and tools by identifying areas for optimization and implementing best practices.
  • Full-time

Application Production Support Engineer Generative AI

We are seeking a motivated team member to support our AI and DevOps Platform Sup...
Location
India, Chennai
Salary
Not provided
Citi (https://www.citi.com/)
Expiration Date
Until further notice
Requirements
  • 6-8 years of relevant experience in technical support, platform operations, or engineering
  • Exposure to architecture concepts with the ability to contribute to technical discussions and understand design decisions
  • Experience working with business partners, engineering teams, or technology stakeholders
  • Demonstrated experience supporting IT services, platform operations, or infrastructure components
  • Strong verbal and written communication skills, with the ability to document technical issues clearly
  • Experience supporting operational workstreams or participating in platform improvement initiatives
  • Participation in resilience‑related or stability‑focused activities preferred
  • Ability to collaborate effectively with cross‑functional teams
  • Strong organizational skills and ability to manage daily workload and task priorities
  • Working knowledge of Generative AI concepts preferred
Job Responsibility
  • Understand how application support functions within the broader technology organization and contributes to business objectives
  • Assist with vendor coordination and day‑to‑day interactions with offshore managed services
  • Support efforts to improve service levels, including participating in incident management, problem management, and knowledge‑sharing initiatives
  • Partner with development and engineering teams to support application stability and operational readiness
  • Assist in collecting capacity, performance, and latency data to support platform planning efforts
  • Support application onboarding activities using established guidelines and standards
  • Contribute to fostering a collaborative and supportive team environment that encourages skill development
  • Participate in cost‑efficiency initiatives such as Root Cause Analysis reviews, knowledge management, and performance tuning
  • Assist in preparing materials for business review meetings and help align technology activities with business needs
  • Follow established support processes and tool standards and provide input on improvement opportunities
  • Full-time

Application Production Support Engineer Generative AI

Engineer the future of global finance. At Citi, our Tech team doesn't just suppo...
Location
India, Pune
Salary
Not provided
Citi (https://www.citi.com/)
Expiration Date
Until further notice
Requirements
  • 6-8 years of relevant experience in technical support, platform operations, or engineering
  • Exposure to architecture concepts with the ability to contribute to technical discussions and understand design decisions
  • Experience working with business partners, engineering teams, or technology stakeholders
  • Demonstrated experience supporting IT services, platform operations, or infrastructure components
  • Strong verbal and written communication skills, with the ability to document technical issues clearly
  • Experience supporting operational workstreams or participating in platform improvement initiatives
  • Participation in resilience-related or stability-focused activities preferred
  • Ability to collaborate effectively with cross-functional teams
  • Strong organizational skills and ability to manage daily workload and task priorities
  • Working knowledge of Generative AI concepts preferred
Job Responsibility
  • Understand how application support functions within the broader technology organization and contributes to business objectives
  • Assist with vendor coordination and day-to-day interactions with offshore managed services
  • Support efforts to improve service levels, including participating in incident management, problem management, and knowledge-sharing initiatives
  • Partner with development and engineering teams to support application stability and operational readiness
  • Assist in collecting capacity, performance, and latency data to support platform planning efforts
  • Support application onboarding activities using established guidelines and standards
  • Contribute to fostering a collaborative and supportive team environment that encourages skill development
  • Participate in cost-efficiency initiatives such as Root Cause Analysis reviews, knowledge management, and performance tuning
  • Assist in preparing materials for business review meetings and help align technology activities with business needs
  • Follow established support processes and tool standards and provide input on improvement opportunities
  • Full-time

Application Production Support Engineer Generative AI

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...
Location
India, Pune
Salary
Not provided
Citi (https://www.citi.com/)
Expiration Date
Until further notice
Requirements
  • 6-8 years of relevant experience in technical support, platform operations, or engineering
  • Exposure to architecture concepts with the ability to contribute to technical discussions and understand design decisions
  • Experience working with business partners, engineering teams, or technology stakeholders
  • Demonstrated experience supporting IT services, platform operations, or infrastructure components
  • Strong verbal and written communication skills, with the ability to document technical issues clearly
  • Experience supporting operational workstreams or participating in platform improvement initiatives
  • Participation in resilience‑related or stability‑focused activities preferred
  • Ability to collaborate effectively with cross‑functional teams
  • Strong organizational skills and ability to manage daily workload and task priorities
  • Working knowledge of Generative AI concepts preferred
Job Responsibility
  • Understand how application support functions within the broader technology organization and contributes to business objectives
  • Assist with vendor coordination and day‑to‑day interactions with offshore managed services
  • Support efforts to improve service levels, including participating in incident management, problem management, and knowledge‑sharing initiatives
  • Partner with development and engineering teams to support application stability and operational readiness
  • Assist in collecting capacity, performance, and latency data to support platform planning efforts
  • Support application onboarding activities using established guidelines and standards
  • Contribute to fostering a collaborative and supportive team environment that encourages skill development
  • Participate in cost‑efficiency initiatives such as Root Cause Analysis reviews, knowledge management, and performance tuning
  • Assist in preparing materials for business review meetings and help align technology activities with business needs
  • Follow established support processes and tool standards and provide input on improvement opportunities
  • Full-time

Application Production Support Engineer Generative AI

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...
Location
India, Pune
Salary
Not provided
Citi (https://www.citi.com/)
Expiration Date
Until further notice
Requirements
  • 6-8 years of relevant experience in technical support, platform operations, or engineering
  • Exposure to architecture concepts with the ability to contribute to technical discussions and understand design decisions
  • Experience working with business partners, engineering teams, or technology stakeholders
  • Demonstrated experience supporting IT services, platform operations, or infrastructure components
  • Strong verbal and written communication skills, with the ability to document technical issues clearly
  • Experience supporting operational workstreams or participating in platform improvement initiatives
  • Participation in resilience‑related or stability‑focused activities preferred
  • Ability to collaborate effectively with cross‑functional teams
  • Strong organizational skills and ability to manage daily workload and task priorities
  • Working knowledge of Generative AI concepts preferred
Job Responsibility
  • Understand how application support functions within the broader technology organization and contributes to business objectives
  • Assist with vendor coordination and day‑to‑day interactions with offshore managed services
  • Support efforts to improve service levels, including participating in incident management, problem management, and knowledge‑sharing initiatives
  • Partner with development and engineering teams to support application stability and operational readiness
  • Assist in collecting capacity, performance, and latency data to support platform planning efforts
  • Support application onboarding activities using established guidelines and standards
  • Contribute to fostering a collaborative and supportive team environment that encourages skill development
  • Participate in cost‑efficiency initiatives such as Root Cause Analysis reviews, knowledge management, and performance tuning
  • Assist in preparing materials for business review meetings and help align technology activities with business needs
  • Follow established support processes and tool standards and provide input on improvement opportunities
  • Full-time

Lead AI Engineer (MLX, Agentic AI, Gen AI platform Services)

Location
United States, New York; San Francisco; San Jose; Cambridge; McLean
Salary
197300.00 - 245600.00 USD / Year
Capital One (capitalone.com)
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 4 years of experience developing AI and ML algorithms or technologies, or a Master's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 2 years of experience developing AI and ML algorithms or technologies
  • At least 4 years of experience programming with Python, Go, Scala, or Java
Job Responsibility
  • Partner with a cross-functional team of engineers, research scientists, technical program managers, and product managers to deliver AI-powered products that change how our associates work and how our customers interact with Capital One
  • Design, develop, test, deploy, and support AI software components including foundation model training, large language model inference, similarity search, guardrails, model evaluation, experimentation, governance, and observability, etc.
  • Leverage a broad stack of Open Source and SaaS AI technologies such as AWS Ultraclusters, Huggingface, VectorDBs, Nemo Guardrails, PyTorch, and more
  • Invent and introduce state-of-the-art LLM optimization techniques to improve the performance — scalability, cost, latency, throughput — of large scale production AI systems
  • Contribute to the technical vision and the long term roadmap of foundational AI systems at Capital One
What we offer
  • performance based incentive compensation, which may include cash bonus(es) and/or long term incentives (LTI)
  • comprehensive, competitive, and inclusive set of health, financial and other benefits
  • Full-time

Lead AI Engineer (AI Foundations, LLM Customization and Finetuning)

At Capital One, we are creating responsible and reliable AI systems, changing ba...
Location
United States, Cambridge; McLean; New York; San Jose
Salary
197300.00 - 245600.00 USD / Year
Capital One (capitalone.com)
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 4 years of experience developing AI and ML algorithms or technologies, or a Master's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 2 years of experience developing AI and ML algorithms or technologies
  • At least 4 years of experience programming with Python, Go, Scala, or Java
Job Responsibility
  • Partner with a cross-functional team of engineers, research scientists, technical program managers, and product managers to deliver AI-powered products that change how our associates work and how our customers interact with Capital One
  • Design, develop, test, deploy, and support AI software components including foundation model training, large language model inference, similarity search, guardrails, model evaluation, experimentation, governance, and observability, etc.
  • Leverage a broad stack of Open Source and SaaS AI technologies such as AWS Ultraclusters, Huggingface, VectorDBs, Nemo Guardrails, PyTorch, and more
  • Invent and introduce state-of-the-art LLM optimization techniques to improve the performance — scalability, cost, latency, throughput — of large scale production AI systems
  • Contribute to the technical vision and the long term roadmap of foundational AI systems at Capital One
What we offer
  • performance based incentive compensation, which may include cash bonus(es) and/or long term incentives (LTI)
  • comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being

Lead AI Engineer (AI Foundations, LLM Customization and Finetuning)

At Capital One, we are creating responsible and reliable AI systems, changing ba...
Location
United States, Cambridge; McLean; New York; San Jose
Salary
197300.00 - 245600.00 USD / Year
Capital One (capitalone.com)
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 4 years of experience developing AI and ML algorithms or technologies, or a Master's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 2 years of experience developing AI and ML algorithms or technologies
  • At least 4 years of experience programming with Python, Go, Scala, or Java
Job Responsibility
  • Partner with a cross-functional team of engineers, research scientists, technical program managers, and product managers to deliver AI-powered products that change how our associates work and how our customers interact with Capital One
  • Design, develop, test, deploy, and support AI software components including foundation model training, large language model inference, similarity search, guardrails, model evaluation, experimentation, governance, and observability, etc.
  • Leverage a broad stack of Open Source and SaaS AI technologies such as AWS Ultraclusters, Huggingface, VectorDBs, Nemo Guardrails, PyTorch, and more
  • Invent and introduce state-of-the-art LLM optimization techniques to improve the performance — scalability, cost, latency, throughput — of large scale production AI systems
  • Contribute to the technical vision and the long term roadmap of foundational AI systems at Capital One
What we offer
  • Eligible to earn performance based incentive compensation, which may include cash bonus(es) and/or long term incentives (LTI)
  • comprehensive, competitive, and inclusive set of health, financial and other benefits
  • Full-time