Principal Engineer, AI Inference Reliability

Cerebras Systems

Location:
United States; Canada, Sunnyvale

Contract Type:
Not provided

Salary:

Not provided

Job Description:

We’re looking for a hands-on Reliability Tech Lead (IC) to own the mission of making Cerebras Inference the most reliable AI service in the world. You will drive reliability strategy and execution across our inference stack, from client SDKs and public-cloud multi-region deployments to wafer-scale systems in specialized data centers. In this role, you will define SLOs and incident-response frameworks, design and implement reliability mechanisms at scale, and partner with hundreds of engineers to ensure our service meets world-class reliability standards.

Job Responsibility:

  • Define and drive reliability strategy: establish SLOs and ensure alignment across engineering
  • Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers
  • Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents
  • Architect for reliability and observability: influence system design for redundancy, durability, and debuggability
  • Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection
  • Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service
  • Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights
  • Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems

Requirements:

  • Bachelor's or master's degree in computer science or related field
  • 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems
  • Strong programming skills in at least one popular backend programming language such as Python, C++, Go, or Rust
  • Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture
  • Excellent communication and cross-functional leadership skills

Nice to have:

  • Prior experience building large-scale AI infrastructure systems

What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Join a simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Principal Engineer, AI Inference Reliability

Principal AI Engineer

We are looking for a Principal AI Engineer to lead the design and deployment of ...
Location:
United States
Salary:
200000.00 - 300000.00 USD / Year
Apollo.io
Expiration Date
Until further notice
Requirements:
  • 10+ years of software engineering experience
  • at least 3 years in applied LLM or agentic AI systems (2023–present)
  • proven success in deploying LLM-powered products used by real users at scale
  • deep backend & systems engineering expertise with Python, distributed systems, and scalable APIs
  • familiarity with LangChain, LlamaIndex, or similar orchestration frameworks
  • experience with RAG pipelines, vector DBs, embedding models, and semantic search tuning
  • experience managing performance across cloud providers (e.g., AWS Bedrock, OpenAI, Anthropic, etc.)
  • demonstrated experience building multi-step agents, planning workflows, chaining reasoning steps, and integrating APIs with agent memory/state
  • comfort with advanced prompting strategies, few-shot and chain-of-thought reasoning, and embedding retrieval setups
  • strong understanding of AI system evaluation, human ratings, A/B experimentation, and feedback loop pipelines
Job Responsibility:
  • Architect and lead the development of multi-agent systems capable of long-horizon planning, reasoning, and API orchestration
  • build reusable agentic components that integrate deeply into sales and marketing processes
  • own and evolve our in-house platform for scalable, low-latency, and cost-efficient LLM and agent deployments
  • lead design of interfaces powered by natural language understanding and retrieval-augmented generation (RAG)
  • build embedding-based, intent-aware search and personalization systems tuned to business user needs
  • drive innovation in personalized outreach generation using context-aware generation pipelines
  • tune inference pipelines, caching layers, and model selection logic for high-scale, cost-aware performance
  • define and drive robust offline and online testing methodologies (A/B, sandboxing, human evals) across agents and LLM flows
  • architect human-in-the-loop systems and telemetry to improve accuracy, UX, and explainability over time
What we offer:
  • equity
  • company bonus or sales commissions/bonuses
  • 401(k) plan
  • at least 10 paid holidays per year
  • flex PTO
  • parental leave
  • employee assistance program
  • wellbeing benefits
  • global travel coverage
  • life/AD&D/STD/LTD insurance
  • Fulltime

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location:
United States, Pleasanton, California
Salary:
251000.00 - 314500.00 USD / Year
BlackLine
Expiration Date
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years in leading MLOps or AI Ops platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility:
  • Define enterprise-level standards and reference architectures for ML-Ops and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer:
  • short-term and long-term incentive programs
  • robust offering of benefit and wellness plans
  • Fulltime

Principal AI Architect

We are seeking an experienced AI Architect to lead the design, implementation, a...
Location:
India, Bengaluru
Salary:
Not provided
EvoluteIQ
Expiration Date
Until further notice
Requirements:
  • 12+ years of experience in data science, ML engineering and AI system architecture
  • Hands-on experience with Python, TensorFlow, PyTorch, Scikit-learn, spaCy and related AI/ML frameworks
  • Expertise in MLOps tools such as MLflow, Kubeflow, Vertex AI, or SageMaker
  • Proficiency in data processing technologies (Spark, Kafka, Airflow) and data modeling
  • Strong background in deploying models as APIs or services using Docker, Kubernetes, and REST/gRPC
  • Experience designing data pipelines and integrating AI with production systems
  • Should have an understanding of prompt engineering, LLM fine-tuning, and vector stores (e.g. Pinecone, FAISS, Weaviate)
  • Knowledge of cloud AI services (AWS, GCP, Azure) and distributed computing architectures
  • Proven experience implementing observability for models (drift, accuracy, bias, and performance)
Job Responsibility:
  • Architect and oversee AI/ML pipelines covering data collection, preparation, training, validation, and inference
  • Define and implement scalable AI infrastructure for training, deployment, and continuous integration (MLOps)
  • Collaborate with data scientists, ML engineers, product managers, and product teams to translate business problems into AI-driven solutions
  • Establish frameworks for model governance, versioning, reproducibility, and explainability
  • Integrate models into production systems ensuring low latency, scalability, and reliability
  • Define data strategy, storage, and access patterns to support AI workloads
  • Build solutions to monitor model performance, drift, and data quality, implementing continuous retraining strategies
  • Ensure compliance with ethical AI, data privacy, and security best practices
  • Mentor AI/ML engineers and contribute to architectural decisions across the AI platform stack
What we offer:
  • Opportunity to shape the strategy of a next-gen hyper-automation platform
  • Work with a cross-disciplinary team in a fast-growing, innovation-driven environment
  • Competitive compensation and growth opportunities
  • A culture of innovation, ownership, and continuous learning
  • Fulltime

Senior Principal Technical Program Manager - ML Platform

Location:
Salary:
231300.00 - 301975.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements:
  • 8+ years of experience on software teams as Development Manager, Technical Product Manager or TPM leading technical platforms areas
  • Deep domain experience in AI and/or Search. Example: Model Inference, Model Evaluation, Model Training, LLM Ops, Semantic Search, Search Relevance, etc.
  • Partner with Engineering in defining direction, strategy and execution at Platform level
  • Strategic thinking and ability to understand business objectives to translate them into technical problems and programs.
  • Technical understanding of systems involved. Willingness to develop domain expertise in the area they operate - storage, networking, authentication, capacity management, service deployments, etc.
  • TPMs are not expected to write or read code, but are expected to understand system flows, block architectures, APIs and such.
  • Experience defining and running end-to-end complex technical programs
  • Strong leadership, organizational, and communication skills
Job Responsibility:
  • Understand and stay up-to-date on latest innovations in AI and Search. Partner closely with engineering teams to translate these into practical platform evolution for Atlassian bringing value to our customers.
  • Analyze business objectives, customer needs, product adoption inhibitors and opportunities, industry trends, and based on these, in close collaboration with your stakeholders, define a long-term strategy and roadmap for your platform and product components.
  • Understand business objectives and translate them into technical systems problems that need to be prioritized and solved in the current business environment.
  • Define specific systems programs and create a plan of action for realizing those programs. Such programs could be around capacity planning, migration efforts, high availability, network architecture, performance optimization, reliability improvements and more.
  • Use your technical understanding of Atlassian and related systems to partner with and influence engineers and architects in making progress on these problems.
  • Responsible for taking a systematic approach to engineering problems. This includes: prioritizing tasks, scoping out the project, defining objectives, and making consistent progress against each of these.
  • Be accountable for the success of these technical programs by managing the entire lifecycle from initiation to forecasting, budgeting, scheduling, etc.
  • Manage complex dependencies and projects with a broad scope across the company
What we offer:
  • health and wellbeing resources
  • paid volunteer days

Principal Machine Learning Engineer

As a Principal Machine Learning Engineer, you will lead the architecture and dev...
Location:
India, Hyderabad
Salary:
Not provided
Amgen
Expiration Date
Until further notice
Requirements:
  • Bachelor’s degree in computer science, Engineering, Data Science, or a related field with 12 to 17 years of total experience
  • 8+ years of experience in software engineering, machine learning engineering, or ML infrastructure
  • Strong experience building production ML systems or ML platforms
  • Hands-on experience with MLOps frameworks and tools such as MLflow or equivalent model lifecycle management frameworks
  • Strong programming experience in Python and modern software engineering practices such as API Driven Architecture and Event based systems
  • Experience designing scalable distributed systems or cloud-native architectures
  • Experience deploying and operating machine learning models in production environments
  • Solid understanding of modern ML workflows including training, evaluation, deployment, monitoring, and retraining
Job Responsibility:
  • Architect and build a scalable ML platform for training, deployment, and lifecycle management of ML, LLM, and Generative AI models
  • Lead development of infrastructure that supports production hosting of complex AI systems, including large-scale inference workloads
  • Design developer-friendly abstractions and automation that make it easy for researchers to build and deploy models within the Amgen ecosystem
  • Implement and evolve MLOps capabilities including experiment tracking, model versioning, CI/CD for ML, monitoring, and reproducibility using tools such as Databricks and MLflow
  • Build platform capabilities supporting Generative AI and emerging Agentic AI systems
  • Serve as the technical leader for a team of engineers, guiding architecture, design reviews, and engineering best practices
  • Partner with AI researchers, data scientists, and platform teams to translate cutting-edge AI research into reliable production systems
  • Evaluate and adopt emerging technologies across the modern AI stack including foundation models, vector databases, agent frameworks, and model serving systems
  • Champion AI-native engineering practices, leveraging tools like GitHub Copilot, Codex, and AI-assisted development workflows
  • Contribute to the broader strategy and evolution of the Enterprise AI Platforms ecosystem

Principal Product Manager

AgentX Support is Workato’s flagship AI-native engine designed to transform cust...
Location:
United States, Palo Alto
Salary:
Not provided
Workato
Expiration Date
Until further notice
Requirements:
  • 10+ years of product management experience in enterprise software
  • Demonstrated success building 0-to-1 products or defining new market categories
  • Track record shipping ML/AI products to production (not just prototyping)
  • Experience with customer support platforms or CRM systems
  • Experience with real-time systems or high-throughput environments
  • Strong systems thinking and comfort designing across APIs, data, models, and workflows
  • Sufficient technical depth to partner effectively with engineering and AI research teams
  • Ability to operate effectively in ambiguous, fast-moving environments
  • Exceptional written and verbal communication skills, with executive-level presence
Job Responsibility:
  • Define the vision and long-term strategy for AI-native, agentic customer support
  • Establish the product narrative and positioning that differentiates Workato in the customer service market
  • Represent Workato’s point of view with customers, partners, and industry stakeholders
  • Partner with design and marketing to translate product capabilities into clear customer value
  • Define how AI agents interpret intent, reason through complex customer issues, and determine when to act autonomously
  • Design orchestration frameworks that connect large language models with enterprise systems of record
  • Lead the development of workflows that enable AI agents to execute actions such as refunds, order modifications, and escalations
  • Ensure appropriate controls for safety, observability, compliance, and human-in-the-loop oversight at enterprise scale
  • Define the next-generation support agent workspace augmented by real-time AI guidance
  • Deliver systems that surface relevant context, recommendations, and next-best actions to human support agents during live customer interactions
What we offer:
  • Vibrant and dynamic work environment
  • A multitude of benefits you can enjoy inside and outside of your work life

Senior Software Engineer and Principal Software Engineer - PowerPoint AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...
Location:
United States, Redmond
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 8+ years of experience in backend service engineering, including work on high-scale infrastructures
  • Proficiency in one or more systems programming languages such as C#, C++
  • 1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
  • 2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
  • 2+ years of experience building software for scale, performance, and reliability
  • Academic or industry experience with building, finetuning, deploying or building eval-driven systems utilizing the models (any category)
Job Responsibility:
  • Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
  • Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
  • Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
  • Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
  • Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
  • Design and implement scalable backend services optimized for machine learning workflows and large language model integration
  • Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
  • Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
  • Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
  • Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience
  • Fulltime

Principal AI Engineer

Mastercard Foundry is a global innovation group focused on the evolution of tech...
Location:
Ireland, Dublin 18
Salary:
Not provided
Mastercard
Expiration Date
April 30, 2026
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, Data Science, or a related technical field. Master’s degree preferred
  • 10+ years in software development, with at least 5 years focused on building and deploying applications or microservices powered by AI/ML, GenAI, or agentic AI in production environments
  • Strong proficiency in Python, Java and relevant frameworks (e.g., Spring Boot, Spring AI, FastAPI, LangChain, LangGraph, Semantic Kernel)
  • Extensive experience designing, developing, and deploying RESTful APIs and microservices
  • Proficiency with containerization technologies
  • Proven experience delivering GenAI, Agentic AI workflows in production
  • Proven experience scaling machine learning models from prototype to production, including familiarity with feature stores, model registries, and inference patterns
  • Solid understanding of the AI/ML lifecycle, from data preparation and model training to deployment and monitoring
  • Experience with cloud platforms and their relevant compute, storage, and AI/ML services, cloud certification preferred
  • Solid understanding of ML, Deep Learning
Job Responsibility:
  • AI Microservice Design: Design and lead implementation of AI-powered microservices (e.g., ML, agentic AI) with a focus on modularity, scalability, and reusability to support diverse business units and solutions
  • Productionalization: Transition AI models and agentic workflows from notebook-based PoCs to production-grade components, ensuring performance, reliability, and maintainability
  • API Development: Design and lead implementation of robust APIs for AI microservices to facilitate seamless integration within the Mastercard ecosystem
  • Quality Assurance: Lead the development of testing strategies (unit, integration, performance) to ensure the accuracy, stability, and quality of deployed AI services
  • Performance Optimization: Identify and address bottlenecks in AI microservices and infrastructure, optimizing for latency, throughput, and cost efficiency
  • Pipeline Development: Work with data science and infrastructure teams to design reusable workflows (e.g., feature engineering, model & agent optimization, evaluation pipelines)
  • Cross-Functional Collaboration: Partner with data scientists, MLOps engineers, product owners, and external stakeholders to translate business requirements into technical specifications
  • Technology Innovation: Research and evaluate emerging technologies (e.g., LLMs, frameworks, methodologies) to enhance AI software development and large-scale data processing capabilities
  • Compliance & Ethics: Ensure all AI microservices adhere to Mastercard’s security standards, compliance policies, and ethical AI principles, while contributing to AI engineering best practices
  • Fulltime