Principal AI Operations Engineer Job at Microsoft Corporation (Multiple Locations)

Senior Software Engineer and Principal Software Engineer - PowerPoint AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...

Location

United States , Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
8+ years of experience in backend service engineering, including work on high-scale infrastructures
Proficiency in one or more systems programming languages such as C#, C++
1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
2+ years of experience building software for scale, performance, and reliability
Academic or industry experience building, fine-tuning, or deploying models (any category), or building eval-driven systems that use them

Job Responsibility

Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
Design and implement scalable backend services optimized for machine learning workflows and large language model integration
Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience

Fulltime

Principal AI Engineer

We are seeking a highly accomplished Principal AI Engineer to define and drive t...

Location

Ireland , Dublin 18

Salary:

Not provided

Mastercard

Expiration Date

Until further notice

Requirements

Demonstrated experience designing and building AI/ML systems in production at scale, ideally across multiple problem domains
Expert-level proficiency in Python and deep experience with modern AI frameworks such as PyTorch and TensorFlow
Strong experience with cloud-native architectures and AI infrastructure on platforms such as AWS, Azure, or GCP
Deep understanding of machine learning, deep learning, NLP, generative AI, and transformer-based architectures (e.g., BERT, GPT-style models, ViTs)
Proven expertise in MLOps, including model versioning, deployment strategies, monitoring, evaluation, and lifecycle management
Strong systems-thinking mindset, with experience designing resilient, scalable, and cost-efficient AI services
Experience working with large-scale data architectures, streaming and batch processing, and model inference optimization
Excellent communication skills with the ability to explain complex technical concepts to both technical and non-technical stakeholders
Track record of technical mentorship and influence without relying on formal line management
Comfortable operating in high-ambiguity environments and making sound technical judgments with incomplete information

Job Responsibility

Define and drive the technical direction of our AI platforms and solutions
Architect, build, and scale production-grade AI systems that deliver durable business impact
Lead through deep hands-on expertise, influence technical strategy across teams, and raise the engineering bar for AI development across the organization
Design, implement, and operate advanced AI systems that support critical business and client needs in a scalable, secure, and reliable manner
Partner closely with product, engineering, and data leaders to translate business intent into robust AI architectures and platforms

Fulltime

Principal AI Engineer (Prisma Browser - Agents Platform)

The Prisma Browser group is building an agentic development lifecycle, an infras...

Location

Israel , Tel Aviv

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

8+ years of experience in software development, architecture, or owning operational systems in production
Computer Science B.Sc. or equivalent education or equivalent military experience required
A product builder's mindset: you can extract requirements, talk to stakeholders, and tell the difference between what's important and what's noise
Experience in building production-grade agents. Deep understanding of the agent loop, its states and transitions. You know how to build it correctly, not just use it
Positive 'can-do' mindset, able to work independently and within a team
Hands-on experience with LLM APIs, including a practical, highly skeptical understanding of token costs, caching, context windows, and model failure points
You know how to build the right context for a task, including memory systems, session storage, and vector databases
You understand where LLMs fail and how to design around those failure points
You've used traces or observability tooling to diagnose and improve agent behavior
A systems-level background that touches reliability, observability, or platform engineering, with a strong preference for writing narrow, deterministic code over building hypothetical abstractions

Job Responsibility

Design and implement automated evaluation loops, static analysis, and rigorous quality gates to ensure the ADLC process doesn't just write code, but consistently produces great, production-ready code
Help the team tackle complex, hard problems to elevate our autonomous development product from 'good' to 'excellent'
Lead complex initiatives in Context Engineering and Prompt Engineering
Manage and orchestrate the complex ecosystem of autonomous agents utilized for internal development
Serve as a leading individual, professionally and personally, in a very strong team
Find space for growth to push the entire team or group forward
View prompt engineering as a core engineering discipline—where rewriting agent behavior is a versioned, reviewed, and tested code change
Act with a debugging temperament: conduct deep-dive analyses of raw agent transcripts to diagnose non-deterministic failures and ascertain root causes instead of merely working around them

Fulltime

Principal AI Engineer

As a Principal AI Engineer on the AI Foundations team, you are an established su...

Location

Singapore , Singapore

Salary:

Not provided

Mastercard

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, Data Science, Applied Mathematics, or related technical field; advanced degree preferred
Strong foundation in software engineering, distributed systems, and applied machine learning relevant to production AI systems
Demonstrated understanding of responsible AI, model/system risk, privacy/security considerations, and governance requirements for enterprise deployments
Demonstrated, sustained ownership of production AI/ML systems, including design, build, deployment, and ongoing lifecycle operations
Real-world experience shipping complex agentic systems into production, including multi-agent coordination and multi-tool integration with safe action policies
Hands-on experience building production pipelines for evaluation, monitoring, versioning, and continuous improvement (including retraining or policy/guardrail updates)
Proven ability to define and operationalize observability and reliability practices for agentic systems (traceability, telemetry, SLOs, incident management)
Track record of influencing architecture and standards across multiple teams or programs, and mentoring engineers to raise overall engineering rigor

Job Responsibility

Serve as an established subject matter expert in AI Engineering, influencing stakeholders and shaping technical direction across multiple initiatives
Architect, design, develop, and maintain advanced AI/ML systems, with emphasis on complex agentic solutions (multi-agent orchestration, tool/function-calling, memory, reflection/self-correction, and autonomy policies)
Lead production implementation of agentic AI systems, including scalable training and evaluation pipelines, deployment frameworks, and runtime orchestration patterns
Define and implement safe tool-use patterns: structured outputs, robust error handling, permissioning and auditability, human-in-the-loop (HITL) approval steps for sensitive actions, and guardrail enforcement
Establish end-to-end AgentOps/LLMOps practices for agentic systems: release pipelines for prompts/tools/policies, canary strategies, safe rollback mechanisms, and continuous regression/safety evaluations as release gates
Build and optimize data ingestion, preprocessing, feature/embedding engineering, and retrieval/memory workflows to improve grounding quality and reduce failure modes
Own production observability for agentic systems: trace capture, cost/token telemetry, latency and reliability SLOs, and incident response practices for agent failures
Implement drift detection and performance decay monitoring (data drift, concept drift), and automate model/agent retraining, policy updates, and redeployment to maintain output quality over time
Drive measurable improvements in system effectiveness, safety, and efficiency by defining success metrics (task success, intervention rate, policy violations, cost and latency per task) and continuously improving evaluation coverage
Mentor and grow senior and junior engineers through design reviews, code reviews, hands-on coaching, and the creation of reusable patterns, playbooks, and standards for agentic delivery

Fulltime

Principal AI Engineer

The Principal AI Engineer will serve as a technical cornerstone of VideoAmp's AI...

Location

United States , Los Angeles; Boulder; New York; St. Petersburg

Salary:

175000.00 - 200000.00 USD / Year

VideoAmp

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field preferred; equivalent practical experience considered
8+ years of software engineering experience
3+ years in AI/ML infrastructure, LLM platform engineering, or agentic systems
Deep hands-on experience with LLM APIs (Anthropic, OpenAI, or equivalent)
Familiarity with prompt engineering, tool use / function calling, and multi-step agent orchestration
Strong background in resource-based API design
Experience building or consuming developer-facing platform APIs at scale
Experience with MCP or equivalent tool-layer abstractions for exposing platform capabilities to AI agents
Proficiency in Golang, Python, and SQL

Job Responsibility

Design, build, and operate VideoAmp's AI infrastructure and its universal tool layers
Lead the development of scenario evaluation frameworks
Architect and implement efficient tool discovery systems
Partner with internal engineering teams to negotiate and promote API-first designs
Own full SDLC of new Agent APIs from design through production, testing, releases, and enhancement
Facilitate weekly AI office hours
Contribute to multi-provider LLM abstraction layers
Author, review, and drive clear, technical requirements documentation for new solutions

What we offer

Equity participation included
Discretionary & flexible PTO + Spring, Summer & Winter company breaks
Inclusive and comprehensive medical, dental & vision
401(k) with matching
HSA & FSA
Paid Maternity & Parental Leave for all family additions
Cell phone & wifi reimbursement
Commuter benefits

Fulltime

Principal Engineer, AI Inference Reliability

We’re looking for a hands-on Reliability Tech Lead (IC) to own the mission of ma...

Location

United States; Canada , Sunnyvale; Toronto

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

Bachelor's or master's degree in computer science or related field
7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems
Strong programming skills in at least one popular backend programming language such as Python, C++, Go, or Rust
Deep and hard-earned experience of reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture
Excellent communication and cross-functional leadership skills

Job Responsibility

Define and drive reliability strategy: establish SLOs and ensure alignment across engineering
Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers
Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents
Architect for reliability and observability: influence system design for redundancy, durability, and debuggability
Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection
Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service
Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights
Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open-source cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
A simple, non-corporate work culture that respects individual beliefs

Fulltime

Principal AI Software Engineer, Senior Vice President

Are you looking for a career move that will put you at the heart of a global fin...

Location

United Kingdom , London

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Exceptional Python Expertise: Demonstrated mastery of core Python, including advanced features, performance optimization, and a deep understanding of the FastAPI framework
Prior hands-on experience with Generative AI, Large Language Model (LLM) frameworks (e.g. LangChain, LlamaIndex), and their application in enterprise environments is a must. This must be underpinned by a profound understanding of core machine learning principles, algorithms, and data science methodologies
Full Lifecycle Ownership: Extensive hands-on experience and technical authority throughout the entire software development lifecycle, from conceptualization and design to implementation, deployment, and operational ownership of enterprise software solutions, involving significant cross-functional collaboration
Strategic System Design: Significant hands-on experience in architecting and designing (architecture, design patterns, reliability, scaling) highly complex new and current systems with broad technical impact
Hands-on expertise with containerized deployment technologies (e.g. Kubernetes, OpenShift, Docker) and orchestration strategies
Hands-on experience and in-depth understanding of C++ is a significant bonus, particularly for complex code analysis, parsing, and integration into knowledge graph structures

Job Responsibility

Architect and implement cutting-edge software systems, defining the technical design for our AI solutions to ensure scalability, performance, and reliability
Drive the hands-on design, implementation, and deployment of sophisticated systems that automate the analysis of data, code, and documentation
Apply deep expertise to structure extracted knowledge within a Credit Risk Domain-aware knowledge graph, including advanced strategies for effectively modelling complex codebases, particularly C++, within this graph
Act as a critical technical partner with data scientists, business analysts, and other engineering teams to translate challenging business requirements into robust technical solutions and ensure successful, high-quality project delivery
Tackle the most complex technical challenges within our AI initiatives, providing solutions that set the standard for engineering excellence

What we offer

Generous holiday allowance starting at 27 days plus bank holidays, increasing with tenure
A discretionary annual performance-related bonus
Private medical insurance packages to suit your personal circumstances
Employee Assistance Program
Pension Plan
Paid Parental Leave
Special discounts for employees, family, and friends
Access to an array of learning and development resources

Fulltime

Principal Engineer Conversational AI

We’re building a world of health around every individual — shaping a more connec...

Location

United States , Wellesley

Salary:

144200.00 - 288400.00 USD / Year

CVS Health

Expiration Date

July 02, 2026

Requirements

10+ years of experience with designing and building software engineering solutions in cloud environments
10+ years of experience in one or more modern languages, such as Python, Java, C#, JavaScript, SQL, etc
10+ years of experience with APIs, microservices, and modern software patterns
5+ years of experience soliciting complex requirements and managing relationships with key stakeholders
Strong understanding of conversational AI technologies (e.g., chatbots, voice assistants, LLMs, STT/TTS, NLU/NLP engines)
Bachelor's degree from an accredited university or equivalent work experience (HS diploma + 4 years relevant experience)

Job Responsibility

Create a technical vision to meet short- and longer-term business needs
Ensure the long-term quality of the design and code of our software systems
Oversee the creation and own critical software components
Lead hands-on; perform design and code reviews
Help deploy and maintain large-scale software in production and ensure operational excellence
Possess expert knowledge in performance, scalability, distributed architecture, and engineering best practices
Help translate business requirements into software systems in an Agile environment
Serve as technical lead on the most demanding, cross-functional projects
Advise leadership on technical decisions; strong communication skills required
Help the career development of other team members, mentor individuals on advanced technical issues, help managers grow their teams

What we offer

Medical, dental, and vision coverage
Paid time off
Retirement savings options
Wellness programs
Bonus, commission, or short-term incentive program
Equity award program

Fulltime
