
Principal AI Operations Engineer

United States, Multiple Locations · 139900.00 - 274800.00 USD / Year · Job Posted March 04, 2026

Job Description

The Security AI Platform team builds and operates the production infrastructure that powers AI-native security capabilities at Microsoft scale. We are organized into two focused groups: Platform + Apps develops the core product, microservices, and architecture; AI Operations ensures reliability, deployments, and operational excellence. Together, we deliver mission-critical services that process millions of requests daily.

We are seeking a Principal AI Operations Engineer to define the technical direction for the AI Operations group. In this role, you will design and architect operational systems and establish standards for branch health, CI/CD pipelines, production deployments, and on-call processes. You will drive reliability initiatives, maintain production health and uptime, and ensure the platform meets its SLOs. You will be the escalation point for complex incidents and work closely with the Platform team to ensure services are operationally ready.

Job Responsibility

  • Define the operational vision, standards, and roadmap for the platform; establish SLOs, error budgets, and reliability targets
  • Drive technical direction for the AI Operations group: architecture for deployments, pipelines, branch health, and production reliability
  • Own CI/CD pipeline architecture: Azure DevOps/GitHub Actions pipelines, build optimization, artifact management, and deployment automation
  • Manage Kubernetes infrastructure: AKS cluster operations, Helm chart management, node pool configuration, GPU resource allocation, and autoscaling (KEDA)
  • Drive production deployments: canary/ring rollouts, safe deployment practices, rollback procedures, and release coordination with Platform team
  • Establish and operate first-level on-call: incident response procedures, escalation paths, runbooks, and post-incident reviews
  • Build and maintain observability infrastructure: Prometheus, Grafana, OpenTelemetry collectors, alerting rules, and dashboard curation
  • Manage infrastructure as code: Bicep templates for Azure resources, Helm charts for Kubernetes deployments, and environment parity
  • Ensure branch health and code quality gates: PR validation pipelines, automated testing, security scanning, and merge policies
  • Debug and diagnose production issues: analyze logs (Kusto/ADX), traces, and metrics to identify root causes and drive resolution
  • Collaborate with Platform team on operational readiness: review service designs for operability, define deployment requirements, and validate runbooks
  • Drive reliability improvements: capacity planning, performance optimization, chaos engineering, and disaster recovery testing
  • Guide and mentor operations engineers; establish effective operational practices and a culture of continuous improvement
  • Embody our culture and values

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • 6+ years technical engineering experience in DevOps, SRE, or platform operations
  • 6+ years driving complex operational initiatives across teams; demonstrated success leading without authority
  • 4+ years hands-on experience with Kubernetes in production environments
  • 3+ years building and maintaining CI/CD pipelines at scale
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice to have

  • Experience with Kubernetes: cluster operations, Helm, troubleshooting, autoscaling, and production management
  • Proficiency with CI/CD platforms: Azure DevOps, GitHub Actions, or similar pipeline tooling
  • Experience with cloud platforms (Azure preferred): AKS, networking, identity management, and resource provisioning
  • Infrastructure as Code: Bicep, Terraform, or Helm chart development
  • Observability tooling: Prometheus, Grafana, OpenTelemetry, and log analytics (Kusto/KQL)


Similar Jobs for Principal AI Operations Engineer

8 matching positions

Senior Software Engineer and Principal Software Engineer - Power Point AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 8+ years of experience in backend service engineering, including work on high-scale infrastructures
  • Proficiency in one or more systems programming languages such as C#, C++
  • 1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
  • 2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
  • 2+ years of experience building software for scale, performance, and reliability
  • Academic or industry experience with building, finetuning, deploying or building eval-driven systems utilizing the models (any category)
Job Responsibility
  • Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
  • Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
  • Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
  • Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
  • Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
  • Design and implement scalable backend services optimized for machine learning workflows and large language model integration
  • Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
  • Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
  • Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
  • Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience
Employment type: Full-time

Principal AI Engineer

We are seeking a highly accomplished Principal AI Engineer to define and drive t...
Location: Ireland, Dublin 18
Salary: Not provided
Company: Mastercard
Expiration Date: Until further notice
Requirements
  • Demonstrated experience designing and building AI/ML systems in production at scale, ideally across multiple problem domains
  • Expert-level proficiency in Python and deep experience with modern AI frameworks such as PyTorch and TensorFlow
  • Strong experience with cloud-native architectures and AI infrastructure on platforms such as AWS, Azure, or GCP
  • Deep understanding of machine learning, deep learning, NLP, generative AI, and transformer-based architectures (e.g., BERT, GPT-style models, ViTs)
  • Proven expertise in MLOps, including model versioning, deployment strategies, monitoring, evaluation, and lifecycle management
  • Strong systems-thinking mindset, with experience designing resilient, scalable, and cost-efficient AI services
  • Experience working with large-scale data architectures, streaming and batch processing, and model inference optimization
  • Excellent communication skills with the ability to explain complex technical concepts to both technical and non-technical stakeholders
  • Track record of technical mentorship and influence without relying on formal line management
  • Comfortable operating in high-ambiguity environments and making sound technical judgments with incomplete information
Job Responsibility
  • Define and drive the technical direction of our AI platforms and solutions
  • Architect, build, and scale production-grade AI systems that deliver durable business impact
  • Lead through deep hands-on expertise, influence technical strategy across teams, and raise the engineering bar for AI development across the organization
  • Design, implement, and operate advanced AI systems that support critical business and client needs in a scalable, secure, and reliable manner
  • Partner closely with product, engineering, and data leaders to translate business intent into robust AI architectures and platforms
Employment type: Full-time

Principal AI Engineer (Prisma Browser - Agents Platform)

The Prisma Browser group is building an agentic development lifecycle, an infras...
Location: Israel, Tel Aviv
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in software development, architecture, or owning operational systems in production
  • Computer Science B.Sc. or equivalent education or equivalent military experience required
  • A product builder's mindset: you can extract requirements, talk to stakeholders, and tell the difference between what's important and what's noise
  • Experience in building production grade agents. Deep understanding of the agent loop, its states and transitions. You know how to build it correctly, not just use it
  • Positive 'can-do' mindset, able to work independently and within a team
  • Hands-on experience with LLM APIs, including a practical, highly-skeptical understanding of token costs, caching, context windows, and model failure points
  • You know how to build the right context for a task, including memory systems, session storage, and vector databases
  • You understand where LLMs fail and how to design around those failure points
  • You've used traces or observability tooling to diagnose and improve agent behavior
  • A systems-level background that touches reliability, observability, or platform engineering, with a strong preference for writing narrow, deterministic code over building hypothetical abstractions
Job Responsibility
  • Design and implement automated evaluation loops, static analysis, and rigorous quality gates to ensure the ADLC process doesn't just write code, but consistently produces great, production-ready code
  • Help the team tackle complex, hard problems to elevate our autonomous development product from 'good' to 'excellent'
  • Lead complex initiatives in Context Engineering and Prompt Engineering
  • Manage and orchestrate the complex ecosystem of autonomous agents utilized for internal development
  • Serve as a leading individual in a very strong team professionally and personally
  • Find space for growth to push the entire team or group forward
  • View prompt engineering as a core engineering discipline—where rewriting agent behavior is a versioned, reviewed, and tested code change
  • Act with a debugging temperament; conduct deep-dive analyses of raw agent transcripts to diagnose non-deterministic failures and ascertain root causes instead of merely working around them
Employment type: Full-time

Principal AI Engineer

As a Principal AI Engineer on the AI Foundations team, you are an established su...
Location: Singapore, Singapore
Salary: Not provided
Company: Mastercard
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, Data Science, Applied Mathematics, or related technical field; advanced degree preferred
  • Strong foundation in software engineering, distributed systems, and applied machine learning relevant to production AI systems
  • Demonstrated understanding of responsible AI, model/system risk, privacy/security considerations, and governance requirements for enterprise deployments
  • Demonstrated, sustained ownership of production AI/ML systems, including design, build, deployment, and ongoing lifecycle operations
  • Real-world experience shipping complex agentic systems into production, including multi-agent coordination and multi-tool integration with safe action policies
  • Hands-on experience building production pipelines for evaluation, monitoring, versioning, and continuous improvement (including retraining or policy/guardrail updates)
  • Proven ability to define and operationalize observability and reliability practices for agentic systems (traceability, telemetry, SLOs, incident management)
  • Track record of influencing architecture and standards across multiple teams or programs, and mentoring engineers to raise overall engineering rigor
Job Responsibility
  • Serve as an established subject matter expert in AI Engineering, influencing stakeholders and shaping technical direction across multiple initiatives
  • Architect, design, develop, and maintain advanced AI/ML systems, with emphasis on complex agentic solutions (multi-agent orchestration, tool/function-calling, memory, reflection/self-correction, and autonomy policies)
  • Lead production implementation of agentic AI systems, including scalable training and evaluation pipelines, deployment frameworks, and runtime orchestration patterns
  • Define and implement safe tool-use patterns: structured outputs, robust error handling, permissioning and auditability, human-in-the-loop (HITL) approval steps for sensitive actions, and guardrail enforcement
  • Establish end-to-end AgentOps/LLMOps practices for agentic systems: release pipelines for prompts/tools/policies, canary strategies, safe rollback mechanisms, and continuous regression/safety evaluations as release gates
  • Build and optimize data ingestion, preprocessing, feature/embedding engineering, and retrieval/memory workflows to improve grounding quality and reduce failure modes
  • Own production observability for agentic systems: trace capture, cost/token telemetry, latency and reliability SLOs, and incident response practices for agent failures
  • Implement drift detection and performance decay monitoring (data drift, concept drift), and automate model/agent retraining, policy updates, and redeployment to maintain output quality over time
  • Drive measurable improvements in system effectiveness, safety, and efficiency by defining success metrics (task success, intervention rate, policy violations, cost and latency per task) and continuously improving evaluation coverage
  • Mentor and grow senior and junior engineers through design reviews, code reviews, hands-on coaching, and the creation of reusable patterns, playbooks, and standards for agentic delivery
Employment type: Full-time

Principal AI Engineer

The Principal AI Engineer will serve as a technical cornerstone of VideoAmp's AI...
Location: United States, Los Angeles; Boulder; New York; St. Petersburg
Salary: 175000.00 - 200000.00 USD / Year
Company: VideoAmp
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or a related field preferred; equivalent practical experience considered
  • 8+ years of software engineering experience
  • 3+ years in AI/ML infrastructure, LLM platform engineering, or agentic systems
  • Deep hands-on experience with LLM APIs (Anthropic, OpenAI, or equivalent)
  • Familiarity with prompt engineering, tool use / function calling, and multi-step agent orchestration
  • Strong background in resource-based API design
  • Experience building or consuming developer-facing platform APIs at scale
  • Experience with MCP or equivalent tool-layer abstractions for exposing platform capabilities to AI agents
  • Proficiency in Golang, Python, and SQL
Job Responsibility
  • Design, build, and operate VideoAmp's AI infrastructure and its universal tool layers
  • Lead the development of scenario evaluation frameworks
  • Architect and implement efficient tool discovery systems
  • Partner with internal engineering teams to negotiate and promote API-first designs
  • Own full SDLC of new Agent APIs from design through production, testing, releases, and enhancement
  • Facilitate weekly AI office hours
  • Contribute to multi-provider LLM abstraction layers
  • Author, review, and drive clear, technical requirements documentation for new solutions
What we offer
  • Equity participation included
  • Discretionary & flexible PTO + Spring, Summer & Winter company breaks
  • Inclusive and comprehensive medical, dental & vision
  • 401(k) with matching
  • HSA & FSA
  • Paid Maternity & Parental Leave for all family additions
  • Cell phone & wifi reimbursement
  • Commuter benefits
Employment type: Full-time

Principal Engineer, AI Inference Reliability

We’re looking for a hands-on Reliability Tech Lead (IC) to own the mission of ma...
Location: United States; Canada, Sunnyvale; Toronto
Salary: Not provided
Company: Cerebras Systems
Expiration Date: Until further notice
Requirements
  • Bachelor's or master's degree in computer science or related field
  • 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems
  • Strong programming skills in at least one popular backend programming language such as Python, C++, Go, or Rust
  • Deep and hard-earned experience of reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture
  • Excellent communication and cross-functional leadership skills
Job Responsibility
  • Define and drive reliability strategy: establish SLOs and ensure alignment across engineering
  • Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers
  • Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents
  • Architect for reliability and observability: influence system design for redundancy, durability, and debuggability
  • Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection
  • Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service
  • Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights
  • Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems
What we offer
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs
Employment type: Full-time

Principal AI Software Engineer, Senior Vice President

Are you looking for a career move that will put you at the heart of a global fin...
Location: United Kingdom, London
Salary: Not provided
Company: Citi
Expiration Date: Until further notice
Requirements
  • Exceptional Python Expertise: Demonstrated mastery of core Python, including advanced features, performance optimization, and a deep understanding of the FastAPI framework
  • Prior hands-on experience with Generative AI, Large Language Model (LLM) frameworks (e.g. LangChain, LlamaIndex), and their application in enterprise environments is a must. This must be underpinned by a profound understanding of core machine learning principles, algorithms, and data science methodologies
  • Full Lifecycle Ownership: Extensive hands-on experience and technical authority throughout the entire software development lifecycle, from conceptualization and design to implementation, deployment, and operational ownership of enterprise software solutions, involving significant cross-functional collaboration
  • Strategic System Design: Significant hands-on experience in architecting and designing (architecture, design patterns, reliability, scaling) highly complex new and current systems with broad technical impact
  • Hands-on expertise with containerized deployment technologies (e.g. Kubernetes, OpenShift, Docker) and orchestration strategies
  • Hands-on experience and in-depth understanding of C++ is a significant bonus, particularly for complex code analysis, parsing, and integration into knowledge graph structures
Job Responsibility
  • Architect and implement cutting-edge software systems, defining the technical design for our AI solutions to ensure scalability, performance, and reliability
  • Drive the hands-on design, implementation, and deployment of sophisticated systems that automate the analysis of data, code, and documentation
  • Apply deep expertise to structure extracted knowledge within a Credit Risk Domain-aware knowledge graph, including advanced strategies for effectively modelling complex codebases, particularly C++, within this graph
  • Act as a critical technical partner with data scientists, business analysts, and other engineering teams to translate challenging business requirements into robust technical solutions and ensure successful, high-quality project delivery
  • Tackle the most complex technical challenges within our AI initiatives, providing solutions that set the standard for engineering excellence
What we offer
  • Generous holiday allowance starting at 27 days plus bank holidays, increasing with tenure
  • A discretionary annual performance-related bonus
  • Private medical insurance packages to suit your personal circumstances
  • Employee Assistance Program
  • Pension Plan
  • Paid Parental Leave
  • Special discounts for employees, family, and friends
  • Access to an array of learning and development resources
Employment type: Full-time

Principal Engineer Conversational AI

We’re building a world of health around every individual — shaping a more connec...
Location: United States, Wellesley
Salary: 144200.00 - 288400.00 USD / Year
Company: CVS Health
Expiration Date: July 02, 2026
Requirements
  • 10+ years of experience with designing and building software engineering solutions in cloud environments
  • 10+ years of experience in one or more modern languages, such as Python, Java, C#, JavaScript, SQL, etc
  • 10+ years of experience with APIs, microservices, and modern software patterns
  • 5+ years of experience soliciting complex requirements and managing relationships with key stakeholders
  • Strong understanding of conversational AI technologies (e.g., chatbots, voice assistants, LLMs, STT/TTS, NLU/NLP engines)
  • Bachelor's degree from an accredited university or equivalent work experience (HS diploma + 4 years relevant experience)
Job Responsibility
  • Create a technical vision to meet short- and longer-term business needs
  • Ensure the long-term quality of the design and code of our software systems
  • Oversee the creation and own critical software components
  • Lead hands-on; perform design and code reviews
  • Help deploy and maintain large-scale software in production and ensure operational excellence
  • Possess expert knowledge in performance, scalability, distributed architecture, and engineering best practices
  • Help translate business requirements into software systems in an Agile environment
  • Serve as technical lead on the most demanding, cross-functional projects
  • Advise leadership on technical decisions; strong communication skills required
  • Help the career development of other team members, mentor individuals on advanced technical issues, help managers grow their teams
What we offer
  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
  • Bonus, commission, or short-term incentive program
  • Equity award program
Employment type: Full-time