Senior Software Engineer, AI Runtime

Apollo GraphQL

Location:
United States

Contract Type:
Not provided

Salary:

157000.00 - 198900.00 USD / Year

Job Description:

We’re seeking a Senior Software Engineer to help power the future of agentic AI workflows. You’ll take our MCP Server to the next level, turning it into an enterprise-grade service that lets diverse tools and systems be exposed effortlessly to AI agents. Looking ahead, you’ll also help architect the MCP Gateway—a new layer that will route requests across tools, enforce policies, and provide the runtime foundation for scalable multi-agent systems. Along the way, you’ll tackle challenges in scalability, performance, and developer experience to ensure our platform feels seamless, powerful, and enterprise-ready.

Job Responsibility:

  • Scale an enterprise AI/MCP Server and Gateway that powers multi-agent workflows across Apollo, including routing, orchestration, and integration boundaries
  • Implement robust server infrastructure to ensure reliability, performance, and security at scale
  • Build and maintain tools for agent discovery, communication, and coordination
  • Define deployment strategies and runtime optimizations to maximize efficiency and minimize operational overhead
  • Develop frameworks and patterns that enable seamless multi-agent collaboration and AI-driven orchestration
  • Integrate observability, logging, and monitoring for full visibility into server and agent behavior
  • Explore and implement AI-enhanced developer workflows to optimize orchestration and agent interactions
  • Collaborate with teams within our org to ensure the MCP Server meets evolving product and developer needs
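The routing responsibility above — exposing tools to agents and directing each request to the right handler, failing gracefully on unknown tools — can be sketched in Rust, the posting's required language. This is a minimal illustrative sketch only; `ToolRouter`, `register`, and `route` are hypothetical names, not Apollo's actual MCP Server or Gateway API.

```rust
use std::collections::HashMap;

// Hypothetical sketch of agent-to-tool routing; not Apollo's real MCP API.
struct ToolRouter {
    handlers: HashMap<String, fn(&str) -> String>,
}

impl ToolRouter {
    fn new() -> Self {
        ToolRouter { handlers: HashMap::new() }
    }

    // Expose a tool to agents under a name.
    fn register(&mut self, name: &str, handler: fn(&str) -> String) {
        self.handlers.insert(name.to_string(), handler);
    }

    // Route an agent's request to the named tool; unknown tools return an
    // error rather than panicking, since a gateway must fail gracefully.
    fn route(&self, tool: &str, input: &str) -> Result<String, String> {
        match self.handlers.get(tool) {
            Some(handler) => Ok(handler(input)),
            None => Err(format!("unknown tool: {tool}")),
        }
    }
}

fn main() {
    let mut router = ToolRouter::new();
    router.register("echo", |s| s.to_string());
    router.register("upper", |s| s.to_uppercase());

    assert_eq!(router.route("echo", "hi"), Ok("hi".to_string()));
    assert_eq!(router.route("upper", "hi"), Ok("HI".to_string()));
    assert!(router.route("missing", "hi").is_err());
}
```

A production gateway would layer policy enforcement, observability, and cross-service dispatch on top of this kind of lookup, but the register/route split is the essential shape.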

Requirements:

  • Expertise in agent-to-tool orchestration, routing, and coordination in scalable, fault-tolerant systems
  • Deep expertise in Rust programming language
  • Strong background in distributed systems, server architecture, and high-performance backend development
  • Proven experience with protocol design, message routing, and server-side orchestration frameworks that enable scalable, reliable multi-tool agent workflows
  • Experience building and maintaining robust runtime infrastructure that supports AI-driven workflows and enables reliable agent-to-tool interactions
  • Hands-on experience with observability, monitoring, and debugging frameworks for complex systems
  • Passion for clean, maintainable code, high system reliability, and scalable architecture
  • Experience in strategic system design, making architectural trade-offs, and planning for long-term scalability and maintainability
  • Strong technical leadership and mentorship, including guiding junior engineers and driving engineering best practices across teams
  • Ability to influence cross-team architecture decisions and align engineering efforts with product and business objectives
  • Production ownership experience: leading incident response, debugging, and performance optimization in high-impact backend systems

Nice to have:

  • Exposure to AI/ML-enabled developer tooling or autonomous system orchestration
  • Familiarity with cloud-native architectures, containerization, or orchestration frameworks
  • Experience with performance optimization and cost-efficient scaling of high-throughput distributed systems

Additional Information:

Job Posted:
December 06, 2025

Work Type:
Remote work

Similar Jobs for Senior Software Engineer, AI Runtime

Senior Solution Engineer

JFrog is expanding in APAC, and we are looking for a strong Senior Solution Engi...
Location:
Singapore

Salary:
Not provided

JFrog

Expiration Date:
Until further notice

Requirements:
  • Experience in pre-sales, solutions engineering, DevOps consulting, or platform engineering
  • Hands-on knowledge of Docker, Kubernetes, Git, Jenkins/GitHub/GitLab, cloud-native architectures
  • Strong communication and customer-facing skills
  • Based in Singapore and open to travel across SEA and Korea
Job Responsibility:
  • Lead technical discovery, demos, and POCs for customers in SEA + Korea
  • Architect CI/CD, DevSecOps, and software supply chain solutions using the JFrog Platform
  • Work closely with sales, product, R&D, and channel partners
  • Represent JFrog at regional events, workshops, and customer sessions
  • Support enterprise adoption of Artifactory, Xray, Curation, Advanced Security, AI Catalog, Runtime, and more

Senior AI Engineer - MSC AI Innovation

MSC AI Innovation is an AI-first team that incubates, builds, and accelerates so...
Location:
Israel, Tel Aviv, Herzliya

Salary:
Not provided

Microsoft Corporation

Expiration Date:
Until further notice

Requirements:
  • 8+ years professional software development
  • 4+ years of software engineering experience in the AI space (e.g., building and shipping AI/ML or GenAI features in production)
  • Proven experience with building AI agents
  • Hands-on experience with evaluation methodologies and integrating quality standards/guardrails into delivery
  • Proficiency in Python and/or C#, with experience using REST APIs and SDKs
  • Deep understanding of AI system design, including ML fundamentals, Generative AI concepts, and cloud-native architectures
Job Responsibility:
  • Design and build AI agents that plan, use tools/APIs, manage state/memory, and reliably complete multi-step workflows
  • Own AI features from design through production, including deployment, monitoring, and live‑site reliability, with an eval-first development lifecycle: define success criteria, build evaluation datasets and automated harnesses, and run human-in-the-loop reviews where needed
  • Develop and maintain prompt, retrieval, and memory strategies (system prompts, few-shot examples, tool schemas, retrieval context) with proper versioning and evaluation coverage
  • Debug AI behavior using prompt analysis, data inspection, and model/tool-call traces, and translate failure patterns into targeted improvements
  • Establish and track AI quality metrics (e.g., accuracy, groundedness, relevance, hallucination rate) and integrate them into CI/CD release gates
  • Optimize runtime performance and economics (token usage, inference cost, latency, caching, model selection/routing, batching) and implement monitoring and continuous improvement loops (online signals, drift detection, structured user feedback)
  • Partner with product, design, and domain stakeholders to define use cases, acceptance criteria, and rollout plans for AI features
  • Live site responsibility

Work Type:
Fulltime

Director, Digital Ecosystem Applications

This position is responsible for the Software Platforms group at the Innovation ...
Location:
United States, Belmont

Salary:
240000.00 - 285000.00 USD / Year

Volkswagen AG

Expiration Date:
Until further notice

Requirements:
  • 10+ years with 2+ years in a technical leadership role
  • CS, EE, M.S. Engineering (or equivalent) REQUIRED
  • M.S. Engineering (or equivalent) or PhD PREFERRED
  • Analytical and conceptual thinking – using logic and reason, creative and strategic
  • Communication skills – interpersonal, presentation and written
  • Managing interdisciplinary teams on individual projects
  • Integration – joining people, processes or systems
  • Influencing and negotiation skills
  • Problem solving
  • Resource management
Job Responsibility:
  • Define the technical mission, architecture strategy, and long‑term platform vision for the In‑Vehicle Computing & Digital Ecosystem Applications team, spanning Android Automotive OS (AAOS), in‑vehicle compute platforms, Software‑Defined Vehicle (SDV) architecture, and AI‑driven cockpit intelligence
  • Provide technical leadership across the full software stack, including Android Framework, System Services, HAL layers, middleware, connectivity stacks, media/audio frameworks, HMI toolchains, and cloud‑connected AI runtimes within an SDV‑aligned architecture
  • Lead and mentor engineering teams in platform bring‑up, system integration, performance optimization, and development of AI‑agentic features, multimodal interaction models, and next‑generation speech technologies
  • Manage multi‑year budgets for platform development, AI integration, SDV‑aligned compute evolution, SoC evaluations, cloud services, and prototype programs
  • Deliver executive‑level technical reporting on architecture decisions, platform readiness, SDV integration milestones, AI progress, risks, and strategic recommendations
  • Drive strategic planning for ICC’s infotainment and cockpit portfolio, including AAOS evolution, hybrid cloud/edge AI pipelines, intelligent mobile agent technologies, and SDV‑centric software and compute roadmaps
  • Align technical roadmaps with global VW Group Innovation teams across infotainment, connectivity, AI/ML, vehicle architecture, cloud services, and SDV platform strategy, ensuring cross‑platform consistency and shared component reuse
  • Build strategic relationships with SoC vendors, Tier‑1 suppliers, cloud providers, and AI technology partners to influence cockpit compute and SDV platform evolution
  • Maintain partnerships with Silicon Valley companies specializing in AI runtimes, LLMs, speech, multimodal interaction, and automotive‑grade SDV‑compatible software frameworks
  • Collaborate with academic and research institutions on AI‑agentic systems, embedded ML, HMI, and in‑vehicle compute architectures aligned with SDV principles
What we offer:
  • Eligibility for annual performance bonus
  • Healthcare benefits
  • 401(k), with company match
  • Defined contribution retirement program
  • Tuition reimbursement
  • Company lease car program
  • Paid time off

Work Type:
Fulltime

Senior Software Engineer, Managed AI - AI model LifeCycle

The Senior Software Engineer for the Model LifeCycle team will contribute to bui...
Location:
United States, San Francisco

Salary:
172425.00 - 209000.00 USD / Year

Crusoe

Expiration Date:
Until further notice

Requirements:
  • Bachelor's degree in Computer Science, Engineering, or a related field
  • Experience delivering production-ready features
  • Familiarity with essential cloud-based services (e.g., compute, storage, networking)
  • Familiarity with Generative AI (Large Language Models, Multimodal)
  • Experience with AI infrastructure components (training, inference)
  • 4-5+ years of industry experience with demonstrated history of consistent success leading a varied portfolio of initiatives across your function
Job Responsibility:
  • Implement and maintain systems for fine-tuning large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling
  • Implement and maintain end-to-end training pipelines for Large Language Models
  • Implement components for distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling)
  • Develop and maintain core agent execution infrastructure
  • Implement features for dataset, model, and experiment management, focusing on versioning, lineage, evaluation, and reproducible fine-tuning
  • Work closely with Senior Engineers and Principal Engineers, as well as product and platform teams, to implement system abstractions and APIs
  • Contribute to technical discussions on training runtimes, scheduling, storage, and model lifecycle management
  • Engage with the open-source LLM ecosystem
What we offer:
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement

Work Type:
Fulltime

Senior Software Engineer, Managed AI - AI Platform

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'...
Location:
United States, San Francisco, CA; Sunnyvale, CA

Salary:
172425.00 - 209000.00 USD / Year

Crusoe

Expiration Date:
Until further notice

Requirements:
  • Advanced degree in Computer Science/Engineering
  • 4-5+ years of industry experience with demonstrated history of consistent success leading a varied portfolio of initiatives across your function
  • Experience with distributed systems, cloud services (compute, storage, networking, database), and delivering early-stage projects quickly
  • Experience with Generative AI (LLMs, Multimodal) and familiar with AI infrastructure (training, inference, ETL pipelines)
  • Proficient with container runtimes (e.g., Kubernetes), microservices, REST APIs, gRPC, and the full software development lifecycle including CI/CD
Job Responsibility:
  • Lead the design and implementation of core AI services, including: resilient fault-tolerant queues for efficient task distribution, model catalogs for managing and versioning AI models, and scheduling mechanisms optimized for cost and performance
  • Architect and scale infrastructure to handle millions of API requests per second
  • Implement robust monitoring and alerting to ensure system health and 24/7 availability
  • Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap
  • Influence the long-term vision and architectural decisions of the platform
  • Contribute to open-source AI frameworks and actively participate in the AI community
  • Prototype and rapidly iterate on emerging technologies and new features
What we offer:
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement

Work Type:
Fulltime

Senior Software Engineer - Performance Tooling

The Artificial Intelligence (AI) Frameworks team at Microsoft develops AI softwa...
Location:
United States, Redmond

Salary:
119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date:
Until further notice

Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role. This includes passing the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Master's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, or Python OR Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C++, or Python OR equivalent experience
  • 4+ years’ practical experience working on high performance applications and performance debugging and optimization on CPUs/GPUs
  • Experience in DNN/LLM inference and experience in one or more DL frameworks such as PyTorch, Tensorflow, or ONNX Runtime and familiarity with CUDA, ROCm, Triton
  • Technical background and solid foundation in software engineering principles, computer architecture, GPU architecture, hardware neural net acceleration
  • Experience in end-to-end performance analysis and optimization of state of the art LLMs and HPC applications, including proficiency using GPU profiling tools
  • Cross-team collaboration skills and the desire to collaborate in a team of researchers and developers
  • Ability to independently lead projects
Job Responsibility:
  • Work across multiple layers of the AI software stack (abstractions, programming models, compilers, runtimes, libraries, and APIs) to enable large-scale model training and inference
  • Benchmark OpenAI and other LLMs for performance on GPUs and Microsoft hardware
  • Debug, profile, and optimize performance for training/inference workloads on Central Processing Units (CPUs)/Graphics Processing Units (GPUs)
  • Monitor performance regressions and drive continuous improvements to reduce time-to-deploy and hardware footprint
  • Collaborate across teams of researchers and engineers to deliver scalable, production-ready AI performance improvements

Work Type:
Fulltime

Senior Software Engineer, CoreAI Workload Engines

The CoreAI Workloads team builds the foundational inference engines and APIs tha...
Location:
United States, Redmond

Salary:
119800.00 - 304200.00 USD / Year

Microsoft Corporation

Expiration Date:
Until further notice

Requirements:
  • Bachelor's Degree in Computer Science or related technical field and 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, or equivalent experience.
  • Proven ability to design and operate large-scale, production inference services with high reliability and performance requirements, and to ship performance improvements safely via disciplined experimentation.
  • Strong skills in performance analysis: benchmarking, profiling, diagnosing regressions, and turning results into concrete engine/runtime changes.
  • Strong problem-solving skills and the ability to debug complex, cross layer systems issues.
  • Demonstrated technical leadership, including mentoring engineers, driving cross-team architectural alignment, and leveraging AI tools and AI-assisted workflows to accelerate engineering velocity and quality.
  • Hands-on experience with Kubernetes (building and operating services on k8s), including debugging production issues and designing platform abstractions (e.g., custom resources/controllers) and scheduling-aware deployments (e.g., node affinity, taints/tolerations, resource requests/limits).
  • Strong collaboration and communication skills, with the ability to work across organizational boundaries.
Job Responsibility:
  • Optimize inference engines for OpenAI and open-source models by implementing and shipping performance/efficiency improvements across runtime, scheduling, and serving paths (latency, throughput, utilization, availability, and cost).
  • Run experiments end-to-end: formulate hypotheses, implement engine changes (including Python/PyTorch integration points where relevant), analyze results, and ship improvements behind guardrails.
  • Build and use experimentation capabilities for large-scale AI inference (experiment lifecycle, tracking, metric modeling, comparability standards, automated analysis) so the team can iterate quickly and safely.
  • Own serving availability and efficiency for Azure OpenAI Service workloads through tiered experimentation, lean segmentation, and multi-modal utilization across heterogeneous fleets—turning findings into shipped engine improvements.
  • Design and evolve inference serving architectures to improve utilization and latency using techniques such as disaggregated serving, multi-token prediction, KV offload/retrieval, and quantization—validated via staged rollouts and production guardrails.
  • Extend AI infrastructure abstractions to support elastic, heterogeneous inference engines reliably at scale (e.g., dynamic scaling across model families, modalities, and workload classes while maintaining isolation and SLOs).
  • Tune and scale inference engines across NVIDIA GPU generations (A100, H100, H200) for state-of-the-art OpenAI models, focusing on serving efficiency, utilization, and reliability (not hardware bring-up).
  • Partner with networking and storage teams to leverage high-performance interconnects (e.g., RDMA/InfiniBand-class fabrics such as RoCE over IB) for distributed inference, without owning low-level kernel/driver enablement.
  • Drive end-to-end features from design through production: observability, diagnostics, performance regression detection, and operational excellence for inference serving.
  • Influence platform architecture and technical direction across teams through design reviews, clear metrics, and technical leadership focused on experimentation velocity and production reliability.
What we offer:
  • Benefits and other compensation

Work Type:
Fulltime

Senior System Development Engineer – AI Technologies

Our customers’ system requirements are usually highly complex. Bringing together...
Location:
United States, Austin

Salary:
123000.00 - 170000.00 USD / Year

Dell

Expiration Date:
Until further notice

Requirements:
  • Bachelor’s or Master’s degree in Computer Engineering, Computer Science, Electrical Engineering, or related field
  • 5+ years of experience in system engineering, platform development, or hardware–software validation
  • Strong understanding of x86 system architecture, CPU/GPU/accelerator internals, memory systems, and I/O subsystems
Job Responsibility:
  • Lead bring‑up, configuration, and validation of system platforms supporting AI workloads (servers, GPU racks, accelerators, networking fabrics)
  • Work with BIOS/UEFI, BMC, firmware, drivers, and kernel subsystems to ensure system readiness for large‑scale AI deployments
  • Perform hardware–software co-validation of CPUs, GPUs, DPUs, NICs, accelerators, and memory subsystems under AI‑heavy workloads
  • Validate PCIe fabric behavior, NUMA topology, and data‑path efficiency for model training and inference
  • Diagnose complex issues across BIOS, firmware, OS, driver stack, container runtime, orchestration layer, and AI frameworks
  • Analyze system logs, kernel traces, hardware event telemetry, GPU health signals, and fabric diagnostics
  • Conduct root‑cause analysis of performance bottlenecks, training failures, model divergence, and hardware stability issues
  • Collaborate with silicon, firmware, OS, and AI software teams to resolve issues rapidly
  • Deploy and manage AI clusters: GPU servers, accelerators, high‑speed networking (InfiniBand, RoCE), and storage systems
  • Validate cluster readiness for distributed training, including bandwidth, latency, topology checks, and gradient‑sync performance
What we offer:
  • Comprehensive Healthcare Programs
  • Award Winning Financial Wellness Tools and Resources
  • Generous Leave of Absence for New Parents and Caregivers
  • Industry Leading Wellness Platform
  • Employee Assistance Program

Work Type:
Fulltime