CrawlJobs Logo

Principal Engineer, AI Inference Reliability

cerebras.net Logo

Cerebras Systems

Location Icon

Location:
United States; Canada , Sunnyvale

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

We’re looking for a hands-on Reliability Tech Lead (IC) to own the mission of making Cerebras Inference the most reliable AI service in the world. You will drive reliability strategy and execution across our inference stack, from client SDKs and public-cloud multi-region deployments to wafer-scale systems in specialized data centers. In this role, you will define SLOs and incident-response frameworks, design and implement reliability mechanisms at scale, and partner across hundreds of engineers to ensure our service meets world-class reliability standards.

Job Responsibility:

  • Define and drive reliability strategy: establish SLOs and ensure alignment across engineering
  • Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers
  • Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents
  • Architect for reliability and observability: influence system design for redundancy, durability, and debuggability
  • Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection
  • Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service
  • Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights
  • Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems

Requirements:

  • Bachelor's or master's degree in computer science or related field
  • 7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems
  • Strong programming skills in at least one popular backend programming language such as Python, C++, Go, or Rust
  • Deep and hard-earned experience of reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture
  • Excellent communication and cross-functional leadership skills

Nice to have:

prior experience building large-scale AI infrastructure systems

What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026

Employment Type:
Fulltime
Work Type:
Remote work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Principal Engineer, AI Inference Reliability

Principal AI Engineer

We are looking for a Principal AI Engineer to lead the design and deployment of ...
Location
Location
United States
Salary
Salary:
200000.00 - 300000.00 USD / Year
apollo.io Logo
Apollo.io
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years of software engineering experience
  • at least 3 years in applied LLM or agentic AI systems (2023–present)
  • proven success in deploying LLM-powered products used by real users at scale
  • deep backend & systems engineering expertise with Python, distributed systems, and scalable APIs
  • familiarity with LangChain, LlamaIndex, or similar orchestration frameworks
  • experience with RAG pipelines, vector DBs, embedding models, and semantic search tuning
  • experience managing performance across cloud providers (e.g., AWS Bedrock, OpenAI, Anthropic, etc.)
  • demonstrated experience building multi-step agents, planning workflows, chaining reasoning steps, and integrating APIs with agent memory/state
  • comfort with advanced prompting strategies, few-shot and chain-of-thought reasoning, and embedding retrieval setups
  • strong understanding of AI system evaluation, human ratings, A/B experimentation, and feedback loop pipelines
Job Responsibility
Job Responsibility
  • Architect and lead the development of multi-agent systems capable of long-horizon planning, reasoning, and API orchestration
  • build reusable agentic components that integrate deeply into sales and marketing processes
  • own and evolve our in-house platform for scalable, low-latency, and cost-efficient LLM and agent deployments
  • lead design of interfaces powered by natural language understanding and retrieval-augmented generation (RAG)
  • build embedding-based, intent-aware search and personalization systems tuned to business user needs
  • drive innovation in personalized outreach generation using context-aware generation pipelines
  • tune inference pipelines, caching layers, and model selection logic for high-scale, cost-aware performance
  • define and drive robust offline and online testing methodologies (A/B, sandboxing, human evals) across agents and LLM flows
  • architect human-in-the-loop systems and telemetry to improve accuracy, UX, and explainability over time
What we offer
What we offer
  • equity
  • company bonus or sales commissions/bonuses
  • 401(k) plan
  • at least 10 paid holidays per year
  • flex PTO
  • parental leave
  • employee assistance program
  • wellbeing benefits
  • global travel coverage
  • life/AD&D/STD/LTD insurance
  • Fulltime
Read More
Arrow Right

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location
Location
United States , Pleasanton, California
Salary
Salary:
251000.00 - 314500.00 USD / Year
blackline.com Logo
BlackLine
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years in leading MLOps or AI Ops platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on with observability stacks (Prometheus, Grafana, Newrelic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility
Job Responsibility
  • Define enterprise-level standards and reference architectures for ML-Ops and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer
What we offer
  • short-term and long-term incentive programs
  • robust offering of benefit and wellness plans
  • Fulltime
Read More
Arrow Right

Principal AI Architect

We are seeking an experienced AI Architect to lead the design, implementation, a...
Location
Location
India , Bengaluru
Salary
Salary:
Not provided
evoluteiq.com Logo
EvoluteIQ
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 12+ years of experience in data science, ML engineering and AI system architecture
  • Hands-on experience with Python, TensorFlow, PyTorch, Scikit-learn, spaCy and related AI/ML frameworks
  • Expertise in MLOps tools such as MLflow, Kubeflow, Vertex AI, or SageMaker
  • Proficiency in data processing technologies (Spark, Kafka, Airflow) and data modeling
  • Strong background in deploying models such as APIs or services using Docker, Kubernetes, and REST/gRPC
  • Experience designing data pipelines and integrating AI with production systems
  • Should have an understanding of prompt engineering, LLM fine-tuning, and vector stores (e.g. Pinecone, FAISS, Weaviate)
  • Knowledge of cloud AI services (AWS, GCP, Azure) and distributed computing architectures
  • Proven experience implementing observability for models (drift, accuracy, bias, and performance)
Job Responsibility
Job Responsibility
  • Architect and oversee AI/ML pipelines covering data collection, preparation, training, validation, and inference
  • Define and implement scalable AI infrastructure for training, deployment, and continuous integration (MLOps)
  • Collaborate with data scientists, ML engineers, product manager, and product teams to translate business problems into AI-driven solutions
  • Establish frameworks for model governance, versioning, reproducibility, and explainability
  • Integrate models into production systems ensuring low latency, scalability, and reliability
  • Define data strategy, storage, and access patterns to support AI workloads
  • Build solutions to monitor model performance, drift, and data quality, implementing continuous retraining strategies
  • Ensure compliance with ethical AI, data privacy, and security best practices
  • Mentor AI/ML engineers and contribute to architectural decisions across the AI platform stack
What we offer
What we offer
  • Opportunity to shape the strategy of a next-gen hyper-automation platform
  • Work with a cross-disciplinary team in a fast-growing, innovation-driven environment
  • Competitive compensation and growth opportunities
  • A culture of innovation, ownership, and continuous learning
  • Fulltime
Read More
Arrow Right

Senior Principal Technical Program Manager - ML Platform

Location
Location
Salary
Salary:
231300.00 - 301975.00 USD / Year
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of experience on software teams as Development Manager, Technical Product Manager or TPM leading technical platforms areas
  • Deep domain experience in AI and/or Search. Example: Model Inference, Model Evaluation, Model Training, LLM Ops, Semantic Search, Search Relevance, etc.
  • Partner with Engineering in defining direction, strategy and execution at Platform level
  • Strategic thinking and ability to understand business objectives to translate them into technical problems and programs.
  • Technical understanding of systems involved. Willingness to develop domain expertise in the area they operate - storage, networking, authentication, capacity management, service deployments, etc.
  • TPMs are not expected to write or read code, but are expected to understand system flows, block architectures, APIs and such.
  • Experience defining and running end-to-end complex technical programs
  • Strong leadership, organizational, and communication skills
Job Responsibility
Job Responsibility
  • Understand and stay up-to-date on latest innovations in AI and Search. Partner closely with engineering teams to translate these into practical platform evolution for Atlassian bringing value to our customers.
  • Analyze business objectives, customer needs, product adoption inhibitors and opportunities, industry trends, and based on these, in close collaboration with your stakeholders, define a long-term strategy and roadmap for your platform and product components.
  • Understand business objectives and translate them into technical systems problems that need to be prioritized solved in the current business environment.
  • Define specific systems programs and create a plan of action for realizing those programs. Such programs could be around capacity planning, migration efforts, high availability, network architecture, performance optimization, reliability improvements and more.
  • Use your technical understanding of Atlassian and related systems to partner with and influence engineers and architects in making progress on these problems.
  • Responsible for taking a systematic approach to engineering problems. This includes: prioritizing tasks, scoping out the project, defining objectives, and making consistent progress against each of these.
  • Be accountable for the success of these technical programs by managing the entire lifecycle from initiation to forecasting, budgeting, scheduling, etc.
  • Manage complex dependencies and projects with a broad scope across the company
What we offer
What we offer
  • health and wellbeing resources
  • paid volunteer days
Read More
Arrow Right
New

Principal AI Network Architect

Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the...
Location
Location
United States , Redmond
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 7+ years technical engineering experience
  • OR Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 8+ years technical engineering experience
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • 5+ years of experience in designing AI backend networks and integrating them into large-scale GPU systems
  • Proven expertise in system architecture across compute, networking, and accelerator domains
  • Deep understanding of RDMA protocols (RoCE, InfiniBand), congestion control (DCQCN), and Layer 2/3 routing
  • Experience with optical interconnects (e.g., PSM, WDM), link budget analysis, and transceiver integration
  • Familiarity with signal integrity modeling, link training, and physical layer optimization
Job Responsibility
Job Responsibility
  • Spearhead architectural definition and innovation for next-generation GPU and AI accelerator platforms, with a focus on ultra-high bandwidth, low-latency backend networks
  • Drive system-level integration across compute, storage, and interconnect domains to support scalable AI training workloads
  • Partner with silicon, firmware, and datacenter engineering teams to co-design infrastructure that meets performance, reliability, and deployment goals
  • Influence platform decisions across rack, chassis, and pod-level implementations
  • Cultivate deep technical relationships with silicon vendors, optics suppliers, and switch fabric providers to co-develop differentiated solutions
  • Represent Microsoft in joint architecture forums and technical workshops
  • Evaluate and articulate tradeoffs across electrical, mechanical, thermal, and signal integrity domains
  • Frame decisions in terms of TCO, performance, scalability, and deployment risk
  • Lead design reviews and contribute to PRDs and system specifications
  • Shape the direction of hyperscale AI infrastructure by engaging with standards bodies (e.g., IEEE 802.3), influencing component roadmaps, and driving adoption of novel interconnect protocols and topologies
  • Fulltime
Read More
Arrow Right
New

Principal Engineer - Generative AI Infra Capabilities

Wells Fargo is seeking a Principal Engineer - Generative Gen AI GPU Infrastructu...
Location
Location
India , BENGALURU
Salary
Salary:
Not provided
https://www.wellsfargo.com/ Logo
Wells Fargo
Expiration Date
February 20, 2026
Flip Icon
Requirements
Requirements
  • 7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Design GPU cluster topologies (H100/H200, NVLink/NVSwitch), networking, and storage paths for high‑throughput inferencing
  • document sizing and perf baselines.
  • Implement Run: AI constructs (Collections/Departments/Projects/workloads) for MDEV/MDEP/UCEP/MRM
  • codify quota, priority, and fair‑share policies.
  • POC & benchmark disaggregated inferencing (prefill/decode) with vLLM/TensorRT‑LLM
  • publish guidance for H100/H200 tuning (FP8/INT8/AWQ) and KV‑transfer behavior over NVLink.
  • Operationalize OpenShift AI parity for GPU scheduling, time slicing/MIG profiles, and preemption
  • validate upgrade paths and helm/kustomize packaging.
  • Integrate Triton Inference Server for multi‑model serving
Job Responsibility
Job Responsibility
  • Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
  • Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
  • Translate advanced technology experience, an in-depth knowledge of the organizations tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
  • Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
  • Maintain knowledge of industry best practices and new technologies and recommends innovations that enhance operations or provide a competitive advantage to the organization
  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
  • Fulltime
!
Read More
Arrow Right

Principal Software Engineering - AI Frameworks

Are you looking for opportunities to deliver innovations to hundreds of millions...
Location
Location
United States , Redmond
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Experience developing Inference software stack
  • Experience working on systems performance optimization
  • Working with Open-Source code
Job Responsibility
Job Responsibility
  • Partnering with appropriate stakeholders to determine user requirements for one or more complex scenarios
  • Providing technical leadership for the identification of dependencies and the development of design documents for a product, application, service, or platform
  • Leading by example and mentoring others to produce extensible and maintainable code used across the company
  • Leveraging deep subject-matter expertise of cross-product features with appropriate stakeholders (e.g., project managers) to lead multiple product's project plans, release plans, and work items
  • Holding accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions
  • Proactively seeking new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale and shares knowledge with other engineers
  • Embodying our Culture and Values
  • Fulltime
Read More
Arrow Right
New

Principal AI/ML Engineer

Rapid7 is seeking a Principal AI Engineer to lead the architectural evolution of...
Location
Location
India , Pune
Salary
Salary:
Not provided
rapid7.com Logo
Rapid7
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Exceptional ability to reason at the system and architecture level to make long-term technical decisions
  • Courageous and principled decision-making when navigating high-stakes, ambiguous problem spaces
  • Proven mentorship of Staff and Senior engineers, fostering growth in architectural thinking and technical rigor
  • Accountability for the long-term technical health and security of AI systems across multiple teams
  • 13+ years of experience in Data Science, ML Engineering, or Applied AI with a focus on large-scale systems
  • Hands-on mastery of LLM orchestration frameworks such as LangChain and LangGraph for agentic workflows
  • Deep expertise in designing RAG pipelines and managing vector database retrieval at scale
  • Advanced proficiency in AWS ecosystems, specifically Bedrock, SageMaker, EKS, and Lambda
  • Expertise in MLOps standards, including model registries, drift detection, and automated retraining frameworks
  • Strong background in deep learning for NLP and sequence-based problems like malware behaviour modelling
Job Responsibility
Job Responsibility
  • Own the end-to-end system architecture for AI, ML, and Agentic platforms, ensuring they are reliable, scalable, and secure
  • Design complex data ingestion pipelines, feature stores, and inference microservices that bridge the gap between research and production
  • Establish architectural standards and reference patterns for LLM orchestration, RAG systems, and multi-agent workflows
  • Lead architectural reviews as the final technical authority, making critical trade-offs across accuracy, latency, cost, and reliability
Read More
Arrow Right