Staff Software Engineer, AI Runtime

United States · 185000.00 - 215000.00 USD / Year · Job Posted December 06, 2025

Job Description

We’re seeking a Staff Software Engineer to help power the future of agentic AI workflows. You’ll take our MCP Server to the next level, turning it into an enterprise-grade service that lets diverse tools and systems be exposed effortlessly to AI agents. Looking ahead, you’ll also help architect the MCP Gateway—a new layer that will route requests across tools, enforce policies, and provide the runtime foundation for scalable multi-agent systems. Along the way, you’ll tackle challenges in scalability, performance, and developer experience to ensure our platform feels seamless, powerful, and enterprise-ready.

Job Responsibility

  • Architect and scale an enterprise AI/MCP Server and Gateway that powers multi-agent workflows across Apollo, including routing, orchestration, and integration boundaries
  • Design and implement robust server infrastructure to ensure reliability, performance, and security at scale
  • Build and maintain tools for agent discovery, communication, and coordination
  • Define deployment strategies and runtime optimizations to maximize efficiency and minimize operational overhead
  • Develop frameworks and patterns that enable seamless multi-agent collaboration and AI-driven orchestration
  • Integrate observability, logging, and monitoring for full visibility into server and agent behavior
  • Explore and implement AI-enhanced developer workflows to optimize orchestration and agent interactions
  • Collaborate with teams across Apollo to ensure the MCP Server meets evolving product and developer needs

Requirements

  • Expertise in agent-to-tool orchestration, routing, and coordination in scalable, fault-tolerant systems
  • Deep expertise in the Rust programming language
  • Strong background in distributed systems, server architecture, and high-performance backend development
  • Proven experience with protocol design, message routing, and server-side orchestration frameworks
  • Experience building and maintaining robust runtime infrastructure that supports AI-driven workflows and enables reliable agent-to-tool interactions
  • Proven experience with protocol design, message routing, and building server-side frameworks that enable scalable, reliable multi-tool agent workflows
  • Hands-on experience with observability, monitoring, and debugging frameworks for complex systems
  • Passion for clean, maintainable code, high system reliability, and scalable architecture
  • Experience in strategic system design, making architectural trade-offs, and planning for long-term scalability and maintainability
  • Strong technical leadership and mentorship, including guiding junior engineers and driving engineering best practices across teams
  • Ability to influence cross-team architecture decisions and align engineering efforts with product and business objectives
  • Production ownership experience: leading incident response, debugging, and performance optimization in high-impact backend systems

Nice to have

  • Exposure to AI/ML-enabled developer tooling or autonomous system orchestration
  • Familiarity with cloud-native architectures, containerization, or orchestration frameworks
  • Experience with performance optimization and cost-efficient scaling of high-throughput distributed systems

What we offer

  • Equity
  • Choice of 3 Anthem Blue Cross medical plans (California residents can also choose from an additional 2 Kaiser medical plans)
  • Dental and Vision benefits are provided by Sun Life Financial

Similar Jobs for Staff Software Engineer, AI Runtime

8 matching positions

Staff Software Engineer, Managed AI - AI Platform

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'...
Location: United States, San Francisco, CA; Sunnyvale, CA
Salary: 208725.00 - 253000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Advanced degree in Computer Science/Engineering
  • 8-10+ years of industry experience with demonstrated history of consistent success leading a varied portfolio of initiatives across your function
  • Experience with distributed systems, cloud services (compute, storage, networking, database), and delivering early-stage projects quickly
  • Experience with Generative AI (LLMs, Multimodal) and familiar with AI infrastructure (training, inference, ETL pipelines)
  • Proficient with container orchestration (e.g., Kubernetes), microservices, REST APIs, gRPC, and the full software development lifecycle including CI/CD
Job Responsibility
  • Lead the design and implementation of core AI services, including resilient fault-tolerant queues for efficient task distribution, model catalogs for managing and versioning AI models, and scheduling mechanisms optimized for cost and performance
  • Architect and scale infrastructure to handle millions of API requests per second
  • Implement robust monitoring and alerting to ensure system health and 24/7 availability
  • Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap
  • Influence the long-term vision and architectural decisions of the platform
  • Contribute to open-source AI frameworks and actively participate in the AI community
  • Prototype and rapidly iterate on emerging technologies and new features
What we offer
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime

Staff Infrastructure Software Engineer - AI Platform

We are currently seeking a Staff Software Engineer to join the AI Platform team ...
Location: United Kingdom, Edinburgh
Salary: Not provided
Company: Addepar
Expiration Date: Until further notice
Requirements
  • Extensive experience as a Software/Backend Engineer, with a track record of taking on increasing responsibility
  • Experience across the full product lifecycle: designing, implementing, shipping, scaling, operationalizing, and maintaining technology/SaaS products
  • Exceptional programming skills and fundamentals in Python/Go/Java, with a proven track record of building large-scale production systems
  • Proficiency with diverse compute environments, including microservices (K8s), Databricks, and serverless architectures (e.g., AWS Lambda)
  • Demonstrable experience leading initiatives with infrastructure-as-code tools such as Terraform in complex, multi-account environments
  • Proficiency with comprehensive monitoring and alerting stacks (e.g., Prometheus/Grafana/Sentry/cloud-native tools), with a focus on observability strategy
  • Excellent interpersonal and communication skills to effectively collaborate with multi-functional teams, articulate complex technical concepts, and influence outcomes
Job Responsibility
  • Design and build the production runtime for LLM-based agents and products, creating the services and infrastructure that serve autonomous agents
  • Develop deep application-level knowledge to proactively inform and influence requirements, constraints and best practices for implementing composable, complex AI systems
  • Lead the design, implementation, and automation of production infrastructure on a variety of cloud environments (Kubernetes/Databricks), to enable us to ship and scale AI features instantly
  • Evangelize disciplined engineering best practices to enforce strong production hygiene and culture
  • Initiate and lead collaborations with cross-functional teams to identify and resolve complex application or infrastructure issues, serving as a technical subject matter expert
  • Architect, build, and maintain advanced, automated CI/CD pipelines (e.g., using Jenkins, ArgoCD, AWS CodeBuild/Pipeline, GitHub Actions, or similar), establishing best practices for deployment strategies (e.g., blue/green, canary)
  • Develop systems and best practices for monitoring, alerting, and troubleshooting our probabilistic and AI-driven systems and broader software stack

Senior Staff AI Software System Design Engineer

As an AICE Software System Design Engineer, you will be responsible for the cust...
Location: China, Shanghai
Salary: Not provided
Company: AMD
Expiration Date: Until further notice
Requirements
  • Expert knowledge in machine learning areas such as frameworks (e.g., vLLM, SGLang, Megatron-LM, DeepSpeed, TensorRT), distribution, kernel operators, compiler, runtime, driver, and performance optimization for inference or training
  • Strong programming skills in C++ and Python
  • Hands-on experience with industry AI use scenarios, solutions, end-to-end pipelines, frameworks, or SDKs
  • Strong debugging and development skillsets
Job Responsibility
  • Position technical proposals and support to top customers
  • Provide significant contributions to customer PoC success
  • Drive custom requirements for AI SW
  • Collaborate and interact with different teams to analyze and optimize training and inference workloads and solutions
  • Analyze competitive solutions to identify strengths and weaknesses and articulate value propositions
  • Apply your knowledge of software engineering best practices
  • Fulltime

Member of Technical Staff, Software Engineer

Help build the infrastructure that powers training, evaluation, and data platfor...
Location: Switzerland, Zürich
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Strong software engineering background building reliable, scalable production systems (Python preferred)
  • Hands‑on experience supporting large‑scale ML / LLM training, evaluation, or experimentation infrastructure
  • Operating GPU‑heavy workloads in cloud environments using Docker and Kubernetes (scheduling, utilization, isolation)
  • Designing and running data / compute pipelines and orchestration (e.g., Airflow, Argo) with object storage (Azure Blob / S3)
  • Platform reliability and operability: observability, metrics, logging, tracing, alerting (Prometheus, Grafana, OpenTelemetry)
Job Responsibility
  • Design and build core platform services for scalable training and evaluation, including cluster orchestration, job scheduling, data and compute pipelines, and artifact management
  • Standardize containerized workflows by maintaining Docker images, CI/CD, and runtime configurations
  • Advocate for best practices in security, reproducibility, and cost efficiency
  • Implement end-to-end observability and operations through metrics, tracing, logging, dashboard development, monitoring, and automated alerts for model training and platform health (using Prometheus, Grafana, OpenTelemetry)
  • Architect and operate services on Azure cloud platforms, managing infrastructure-as-code (Terraform/Helm), secrets, networking, and storage
  • Enhance developer experience by creating tools, CLIs, and portals that simplify job submission, metrics analysis, and experiment management for generalist software engineering and research teams
  • Enforce security and compliance policies for data access, container hardening, and supply-chain integrity, and partner with security and privacy teams to maintain robust practices in multi-tenant environments and secret management
  • Collaborate cross-functionally with data, model, and product teams to align infrastructure roadmaps with training needs, evaluation protocols, and Copilot product goals
  • Fulltime

Sr Staff Engineer Software (Full Stack Prisma AIRS)

With Prisma AIRS, Palo Alto Networks is building the world's most comprehensive ...
Location: United States, Santa Clara
Salary: 147000.00 - 237500.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science or a related field with 5+ years of experience, or a Master's degree with 3+ years of experience
  • Expertise in modern React (Functional components) and JavaScript/TypeScript
  • Expertise in building scalable distributed systems with excellent Python or Golang programming skills
  • Expertise writing comprehensive unit, integration, and end-to-end tests
  • Proven experience with modern backend frameworks, databases (SQL or NoSQL), and cloud platforms, specifically GCP (Google Cloud Platform)
  • Excellent written and verbal communication, able to collaborate and convey ideas effectively
  • Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
Job Responsibility
  • Design and build innovative, scalable software products to ensure our customers can use AI securely
  • Own new features/functionality from start to finish. Participate in all phases of the product development cycle, from definition, design, through implementation and test. Ensure that applications are production-ready, scalable, and reliable
  • Collaborate with product managers, backend software engineers, product designers, and infrastructure engineers to shape the future of Prisma AIRS
  • Deliver high quality UX and provide the Prisma AIRS customer with a seamless, intuitive customer experience
  • Proactively identify problems and opportunities, proposing and developing simple, attainable solutions to enhance the team's development process and product quality
  • Serve as a role model in establishing and implementing engineering best practices, including test-driven development for AI runtime services
  • Fulltime

Staff Software Engineer, Design Systems

Design Systems is on a mission to build tooling that empowers internal teams to ...
Location: Canada
Salary: Not provided
Company: Vanta
Expiration Date: Until further notice
Requirements
  • Over 9 years of industry experience with deep expertise in one or more technical areas (e.g., frontend, design systems, accessibility, app performance, runtime)
  • Proven ability to lead complex technical initiatives, driving strategic projects and improving organizational processes in fast-paced, dynamic environments
  • Mastery in system design and software architecture, with an emphasis on user experience and accessibility standards, and a strong sense of design craft
  • Extensive experience building tooling and abstractions for developers that result in user-facing experiences
  • Expertise in debugging and supporting platform tooling for the frontend
  • Strong leadership and mentorship experience, consistently up-leveling teams and leading by example
  • Excellent communication skills with the ability to influence and advocate for technical decisions at all levels of the organization
  • Open to using AI to amplify their skills and strengthen their work - demonstrating curiosity, a willingness to learn, and sound judgment in applying AI responsibly to improve efficiency and impact
Job Responsibility
  • Lead and advise on UI/UX best practices and standards at Vanta, bringing a high bar for design craft and ensuring our components are both functional and refined
  • Build, maintain, and update existing shared UI components to ensure they are consistent across our system and product, bug free, well tested, and well documented
  • Educate all engineers in UX and industry standards and best practices, our Design System guidance, and how to implement in code
  • Support product team use cases through building new shared patterns when it makes sense to extend the system, or updating guidance
  • Identify, scope, and lead large technical projects that lay the groundwork for building highly performant and reliable systems
  • Lead key technical decisions that will form the system’s stance and recommendations for product teams
  • Rally cross-functional teams to drive initiatives to completion, even without direct management of team members
  • Address product, technical, and operational challenges with clear, impactful solutions
  • Serve as a cultural leader, modeling collaborative behaviors and mentoring engineers to elevate organizational performance
What we offer
  • Industry-competitive salary and equity
  • 100% covered medical, dental, and vision benefits with dependents coverage
  • Pension contribution
  • 16 weeks fully paid Parental Leave for all new parents
  • Health & wellness stipend
  • Remote workspace, internet, and cellphone stipend
  • Flexible work hours and location
  • 21 days of Vacation Time and 80 hours of Sick Leave
  • 11 company-paid holidays
  • Virtual team building activities, lunch and learns, and other company-wide events
  • Fulltime

Staff Software Engineer, Design Systems

Design Systems is on a mission to build tooling that empowers internal teams to ...
Location: United States; Canada
Salary: Not provided
Company: Vanta
Expiration Date: Until further notice
Requirements
  • Over 9 years of industry experience with deep expertise in one or more technical areas (e.g., frontend, design systems, accessibility, app performance, runtime)
  • Proven ability to lead complex technical initiatives, driving strategic projects and improving organizational processes in fast-paced, dynamic environments
  • Mastery in system design and software architecture, with an emphasis on user experience and accessibility standards, and a strong sense of design craft
  • Extensive experience building tooling and abstractions for developers that result in user-facing experiences
  • Expertise in debugging and supporting platform tooling for the frontend
  • Strong leadership and mentorship experience, consistently up-leveling teams and leading by example
  • Excellent communication skills with the ability to influence and advocate for technical decisions at all levels of the organization
  • Open to using AI to amplify their skills and strengthen their work - demonstrating curiosity, a willingness to learn, and sound judgment in applying AI responsibly to improve efficiency and impact
Job Responsibility
  • Lead and advise on UI/UX best practices and standards at Vanta, bringing a high bar for design craft and ensuring our components are both functional and refined
  • Build, maintain, and update existing shared UI components to ensure they are consistent across our system and product, bug free, well tested, and well documented
  • Educate all engineers in UX and industry standards and best practices, our Design System guidance, and how to implement in code
  • Support product team use cases through building new shared patterns when it makes sense to extend the system, or updating guidance
  • Identify, scope, and lead large technical projects that lay the groundwork for building highly performant and reliable systems
  • Lead key technical decisions that will form the system’s stance and recommendations for product teams
  • Rally cross-functional teams to drive initiatives to completion, even without direct management of team members
  • Address product, technical, and operational challenges with clear, impactful solutions
  • Serve as a cultural leader, modeling collaborative behaviors and mentoring engineers to elevate organizational performance
What we offer
  • Industry-competitive salary and equity
  • Comprehensive medical, dental, and vision coverage, with 100% of employee-only benefit premiums covered for most medical plans
  • 16 weeks fully-paid Parental Leave for all new parents
  • Health & wellness stipend
  • Remote workspace, internet, and cellphone stipend
  • Commuter benefits for team members who report to the SF and NYC office
  • Family planning benefits
  • Matching 401(k) contribution with immediate vesting
  • Flexible PTO policy, plus 80 hours of Sick Time
  • 11 company-paid holidays
  • Fulltime

Staff Software Engineer, Model LifeCycle

The Staff Software Engineer for the Model LifeCycle team will play a key role in...
Location: United States, San Francisco
Salary: 208725.00 - 253000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field
  • 8-10+ years of industry experience with demonstrated history of consistent success leading a varied portfolio of initiatives across your function
  • Proven track record of delivering production features on time
  • Experience using cloud-based services such as elastic compute, object storage, virtual private networks, and managed databases
  • Experience with Generative AI (Large Language Models, Multimodal)
  • Experience with AI infrastructure, including training and inference
Job Responsibility
  • Contribute to fine-tuning systems for large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling
  • Implement and maintain end-to-end training pipelines for Large Language Models
  • Contribute to distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling)
  • Develop and maintain agent execution infrastructure
  • Implement features for dataset, model, and experiment management: versioning, lineage, evaluation, and reproducible fine-tuning at scale
  • Work closely with Principal Engineers, product, business, and platform teams to implement the core abstractions and APIs of the system
  • Contribute to architectural decisions around training runtimes, scheduling, storage, and model lifecycle management
  • Engage with the open-source LLM ecosystem
What we offer
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime