Principal AI Operations Engineer

Microsoft Corporation

Location:
United States, Multiple Locations

Contract Type:
Not provided

Salary:
139900.00 - 274800.00 USD / Year

Job Description:

The Security AI Platform team builds and operates production infrastructure that powers AI-native security capabilities at Microsoft scale. We are organized into two focused groups: Platform + Apps develops the core product, microservices, and architecture; AI Operations ensures reliability, deployments, and operational excellence. Together, we deliver mission-critical services that process millions of requests daily. We are seeking a Principal AI Operations Engineer to define the technical direction for the AI Operations group. In this role, you will design and architect operational systems and establish standards for branch health, CI/CD pipelines, production deployments, and on-call processes. You will drive reliability initiatives, maintain production health and uptime, and ensure the platform meets its SLOs. You will be the escalation point for complex incidents and work closely with the Platform team to ensure services are operationally ready.

Job Responsibility:

  • Define the operational vision, standards, and roadmap for the platform; establish SLOs, error budgets, and reliability targets
  • Drive technical direction for the AI Operations group: architecture for deployments, pipelines, branch health, and production reliability
  • Own CI/CD pipeline architecture: Azure DevOps/GitHub Actions pipelines, build optimization, artifact management, and deployment automation
  • Manage Kubernetes infrastructure: AKS cluster operations, Helm chart management, node pool configuration, GPU resource allocation, and autoscaling (KEDA)
  • Drive production deployments: canary/ring rollouts, safe deployment practices, rollback procedures, and release coordination with Platform team
  • Establish and operate first-level on-call: incident response procedures, escalation paths, runbooks, and post-incident reviews
  • Build and maintain observability infrastructure: Prometheus, Grafana, OpenTelemetry collectors, alerting rules, and dashboard curation
  • Manage infrastructure as code: Bicep templates for Azure resources, Helm charts for Kubernetes deployments, and environment parity
  • Ensure branch health and code quality gates: PR validation pipelines, automated testing, security scanning, and merge policies
  • Debug and diagnose production issues: analyze logs (Kusto/ADX), traces, and metrics to identify root causes and drive resolution
  • Collaborate with Platform team on operational readiness: review service designs for operability, define deployment requirements, and validate runbooks
  • Drive reliability improvements: capacity planning, performance optimization, chaos engineering, and disaster recovery testing
  • Guide and mentor operations engineers; establish effective operational practices and a culture of continuous improvement
  • Embody our culture and values

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • 6+ years technical engineering experience in DevOps, SRE, or platform operations
  • 6+ years driving complex operational initiatives across teams; demonstrated success leading without authority
  • 4+ years hands-on experience with Kubernetes in production environments
  • 3+ years building and maintaining CI/CD pipelines at scale
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Nice to have:

  • Experience with Kubernetes: cluster operations, Helm, troubleshooting, autoscaling, and production management
  • Proficiency with CI/CD platforms: Azure DevOps, GitHub Actions, or similar pipeline tooling
  • Experience with cloud platforms (Azure preferred): AKS, networking, identity management, and resource provisioning
  • Infrastructure as Code: Bicep, Terraform, or Helm chart development
  • Observability tooling: Prometheus, Grafana, OpenTelemetry, and log analytics (Kusto/KQL)

Additional Information:

Job Posted:
March 04, 2026

Employment Type:
Fulltime
Work Type:
Remote work

Similar Jobs for Principal AI Operations Engineer

Principal Engineering Manager - Applied AI

We are looking for a Principal Engineering Manager to join our growing Applied A...
Location:
United States, Seattle
Salary:
240870.00 - 297652.00 USD / Year
Highspot
Expiration Date:
Until further notice
Requirements:
  • 2+ years of experience in Generative AI and Agentic AI systems, including LLMs, context engineering, and modern vector-based retrieval systems
  • 4+ years working as an engineering manager
  • 8+ years working as a professional software developer
  • A great understanding of Generative AI systems, best practices and experience in shipping Agentic AI into distributed, data-intensive production systems
  • Experience developing and operating Cloud services at enterprise scale
  • Strong programming skills in Java, Python, C#, TypeScript, or an equivalent programming language
  • Substantial depth and breadth of management experience to lead and grow an Applied AI team
  • Great collaboration with teams with different backgrounds/expertise/functions
  • Expertise in full product lifecycle: technical designs, project planning, iterative implementation, and successful product launches
Job Responsibility:
  • Lead a team of Applied AI engineers that works at the bleeding edge of Generative AI to solve high-impact business challenges
  • Apply Generative AI to solve hard unsolved challenges in the application of Agentic AI to real-world business challenges
  • Grow, coach, build and scale the Applied AI team
  • Drive operational excellence to achieve enterprise-grade scale, reliability, security, cost-efficiency and performance
  • Drive technical direction for building a safe, scalable and reliable Agentic AI platform for all of Highspot
  • Communicate complex concepts and the results of analyses in a clear and effective manner to technical and non-technical audiences
  • Collaborate with other team members and cross-functionally to share knowledge and discuss initiatives
What we offer:
  • Comprehensive medical, dental, vision, disability, and life benefits
  • Health Savings Account (HSA) with employer contribution
  • 401(k) Matching with immediate vesting on employer match
  • Flexible PTO
  • 8 paid holidays and 5 paid days for Annual Holiday Week
  • Quarterly Recharge Fridays (paid days off for mental health recharge)
  • 18 weeks paid parental leave
  • Access to Coaches and Therapists through Modern Health
  • 2 volunteer days per year
  • Commuting benefits
Employment Type:
Fulltime

Sr. Principal Software Engineer - Applied AI

We are looking for a Principal Software Engineer to join our growing Applied AI ...
Location:
United States, Seattle
Salary:
277391.00 - 342391.00 USD / Year
Highspot
Expiration Date:
Until further notice
Requirements:
  • 2+ years of experience in Generative AI and Agentic AI systems, including LLMs, context engineering, and modern vector-based retrieval systems
  • 8+ years working as a professional software developer
  • A great understanding of Generative AI systems, best practices and experience in shipping Agentic AI into distributed, data-intensive production systems
  • Experience developing and operating Cloud services at enterprise scale
  • Strong programming skills in Java, Python, C#, TypeScript, or equivalent programming languages
  • Great collaboration with teams with different backgrounds/expertise/functions
  • Expertise in full product lifecycle: technical designs, fast shipping, iterative implementation, and successful product launches
  • Experience and passion for mentoring and encouraging collaborative teams
  • Experience in cultivating a strong engineering culture in an agile environment
Job Responsibility:
  • Apply Generative AI to solve hard unsolved challenges in the application of Agentic AI to real-world business challenges
  • Work with a team of Applied AI engineers that works at the bleeding edge of Generative AI to solve high-impact business challenges
  • Grow, coach, build and scale talent on the Applied AI team
  • Drive operational excellence to achieve enterprise-grade scale, reliability, security, cost-efficiency and performance
  • Drive technical direction for building a safe, scalable and reliable Agentic AI platform for all of Highspot
  • Communicate complex concepts and the results of analyses in a clear and effective manner to technical and non-technical audiences
  • Collaborate with other team members and cross-functionally to share knowledge and discuss initiatives
What we offer:
  • Comprehensive medical, dental, vision, disability, and life benefits
  • Health Savings Account (HSA) with employer contribution
  • 401(k) Matching with immediate vesting on employer match
  • Flexible PTO
  • 8 paid holidays and 5 paid days for Annual Holiday Week
  • Quarterly Recharge Fridays (paid days off for mental health recharge)
  • 18 weeks paid parental leave
  • Access to Coaches and Therapists through Modern Health
  • 2 volunteer days per year
  • Commuting benefits
Employment Type:
Fulltime

Principal Forward Deployed Engineer, AI

As a Principal Forward Deployed Engineer (FDE) at Atlassian, you’ll be at the fo...
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Data Science, or a related field
  • 10+ years of experience in backend software development
  • 5+ years of experience delivering impactful AI/ML solutions in production environments, including experience as a founding or forward-deployed engineer in 0-to-1 product development
  • Deep expertise in applied AI/ML, including generative AI, LLMs, and agent-based frameworks
  • Proficiency in Python, JavaScript, or similar languages, and experience with APIs, microservices, and enterprise integration
  • Demonstrated ability to solve complex, ambiguous problems and thrive in rapidly changing environments
  • Strong communication and collaboration skills, with experience influencing cross-functional teams without formal authority
  • Experience with AI risk assessment, data privacy, and compliance (e.g., GDPR)
  • Strategic and innovative thinking, with a track record of aligning technical solutions to business goals
Job Responsibility:
  • Lead the design, development, and deployment of AI/ML-powered solutions tailored to customer needs, leveraging frameworks such as TensorFlow, PyTorch, and agent-based platforms (e.g., LangChain, LlamaIndex)
  • Architect and implement robust application integrations using Python, JavaScript, or similar languages, with a focus on APIs, microservices, and enterprise-scale systems
  • Drive data-driven solution development by analyzing complex datasets to extract insights and inform product direction
  • Champion the adoption of AI-augmented workflow automation within customer environments, maximizing business value and operational efficiency
  • Oversee the end-to-end deployment of AI solutions into production, ensuring continuous evaluation, monitoring, and improvement
  • Navigate risk and compliance requirements, including AI risk assessment, data privacy (GDPR), and regulatory frameworks
  • Mentor and guide junior engineers, fostering a culture of innovation, collaboration, and continuous learning
  • Communicate complex technical concepts clearly to both technical and non-technical stakeholders, creating concise artifacts for peers, partners, and leadership
  • Align technical solutions with business objectives, setting a clear vision for technological progress and measurable results
What we offer:
  • Health and wellbeing resources
  • Paid volunteer days

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Peru
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Engineer, SSD Firmware Engineering

We are seeking a talented Principal Engineer, Firmware Engineering to join our i...
Location:
India, Bengaluru
Salary:
Not provided
Sandisk
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Engineering, Electronics, Electrical Engineering, or related field
  • 10+ years of experience in firmware development for embedded systems
  • Strong proficiency in C/C++ programming languages
  • In-depth knowledge of microcontroller architectures and embedded systems
  • Experience with real-time operating systems (RTOS) and their implementation
  • Familiarity with hardware interfaces such as SPI, I2C, I3C, UART, and GPIO
  • Expertise in developing and debugging low-level device drivers
  • Proficiency in using version control systems, preferably Git
  • Strong analytical and problem-solving skills with attention to detail
  • Experience with firmware testing and validation methodologies
Job Responsibility:
  • Design, develop, and implement firmware for embedded systems and microcontrollers
  • Collaborate with hardware engineers to integrate firmware with electronic components
  • Optimize firmware for performance, power consumption, and memory usage
  • Develop and maintain device drivers for various hardware interfaces
  • Implement and integrate real-time operating systems (RTOS) in firmware projects
  • Conduct code reviews and ensure adherence to coding standards and best practices
  • Debug and resolve firmware issues using specialized tools and techniques
  • Participate in firmware testing and validation processes
  • Document firmware architecture, design decisions, and implementation details
  • Stay up-to-date with the latest trends and technologies in firmware engineering
Employment Type:
Fulltime

Senior Principal Engineer - Atlassian Ecosystem and Marketplace

The Atlassian Ecosystem and Marketplace organization enables our customers to do...
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 10+ years of experience building software
  • 4+ years in an architect/principal role working across teams
  • Broad experience architecting, designing, and building large-scale systems with multiple dependencies
  • Passion for building quality solutions and upholding quality standards
  • Success with building, expressing, and pitching a technical vision to stakeholders
  • Experience with collaboration with an ecosystem of teams
  • Success with leading the long-term strategy for software architecture
  • Experience with building and operating large scale, high availability, high reliability services
  • Experience in operational requirements and common challenges of software systems
  • Experience working on developer productivity initiatives
Job Responsibility:
  • Shape the forward-looking technical direction and long-term architecture for Ecosystem and Marketplace
  • Collaborate with product, engineering and design leaders to understand and influence the broader department level long term strategy
  • Ensure that the technical strategy you build is aligned with the technical strategy of Atlassian products and platforms
  • Partner with principal engineers and architects from other teams and drive exploration of large-scale projects spanning multiple teams in Enterprise
  • Provide pragmatic and balanced advice to the engineering leaders to invest in the long term architecture while also servicing the current systems with high quality
  • Improve, through example, the quality of software construction and meaningful code reviews in an agile environment
  • Be a role model for, and influence a large team of engineers at multiple seniority levels all the way from grads to principal engineers, and mentor engineers across the teams
  • Be influential within your team and work with peers and senior leaders to define and revise the standards for operational excellence across Atlassian
  • Mentor, hire and develop other engineers
What we offer:
  • Health and wellbeing resources
  • Paid volunteer days

Principal Engineer

As a Principal Engineer at Aignostics, you will play a crucial role in shaping t...
Location:
Germany, Berlin
Salary:
Not provided
Aignostics
Expiration Date:
Until further notice
Requirements:
  • Advanced degree in Computer Science, Software Engineering, or a related field
  • 10+ years of software development experience, with at least 5 years in senior technical leadership roles
  • Proven track record of driving technical excellence and innovation in organizations with 50+ engineers
  • Excellent communication skills, able to articulate complex technical concepts to both technical and non-technical stakeholders
  • Solid background in large scale systems and software architecture, design patterns, and clean coding
  • Extensive experience in designing and implementing large-scale, distributed and event-driven systems
  • Extensive experience with data processing at scale
  • Extensive expertise in multiple programming languages and frameworks
  • Deep understanding of cloud technologies (GCP, AWS), containerization and orchestration (Kubernetes)
  • Familiarity with DevSecOps and MLOps practices, complex CI/CD pipelines, and infrastructure as code
Job Responsibility:
  • Own the technical direction and architectural integrity of our platform
  • Advise our CTO and Sr. Vice President of Engineering on the technical vision of Aignostics
  • Align our technical strategy with business objectives to provide a competitive advantage
  • Resolve technical conflicts across teams and harmonize technologies to unlock synergies
  • Advise product management on technical feasibility, cost, and risks of complex product features
  • Drive technical design, planning, and integration of our platform across systems
  • Provide technical guidance in system design reviews for all teams
  • Educate senior and mid-level engineers to bring them up to the next level
  • Demonstrate long-term thinking and utmost technical excellence in your individual contributions
  • Lead the technical strategic planning and execution across the TechOrg's quarterly roadmap
What we offer:
  • Cutting-edge AI research and development, with involvement of Charité, TU Berlin and our other partners
  • Work with a welcoming, diverse and highly international team of colleagues
  • Opportunity to take responsibility and grow your role within the startup
  • Expand your skills by benefitting from our Learning & Development yearly budget of 1,000 € (plus 2 L&D days), language classes and internal development programs
  • Mentoring program: you’ll learn from great experts
  • Flexible working hours and teleworking policy
  • Enjoy your well-deserved time off with our 30 paid vacation days per year
  • We are family & pet friendly and support flexible parental leave options
  • Pick a subsidized membership of your choice among public transport, sports and well-being
  • Enjoy our social gatherings, lunches, and off-site events for a fun and inclusive work environment
Employment Type:
Fulltime

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location:
United States, Pleasanton, California
Salary:
251000.00 - 314500.00 USD / Year
BlackLine
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years in leading MLOps or AI Ops platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility:
  • Define enterprise-level standards and reference architectures for ML-Ops and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer:
  • Short-term and long-term incentive programs
  • Robust offering of benefit and wellness plans
Employment Type:
Fulltime