Post-Training Platform Infrastructure Engineer

United States, San Jose · 204000.00 - 306000.00 USD / Year · Job Posted June 16, 2026

Job Description

We are looking for a systems-minded engineer who works at the intersection of large-scale model inference, distributed systems, and performance optimization. This role focuses on post-training and inference infrastructure, with particular emphasis on prefill/decode (P/D) disaggregation, KV cache lifecycle management, and efficient offloading mechanisms across both inference and reinforcement learning (RL) systems.

Job Responsibility

  • Research and deeply understand modern LLM inference frameworks
  • Analyze and compare inference execution paths to identify performance bottlenecks and inefficiencies
  • Develop and implement infrastructure-level features to improve inference latency, throughput, and memory efficiency
  • Optimize KV cache management and offloading strategies
  • Enhance scalability across multi-GPU and multi-node deployments
  • Apply the same research-driven approach to RL frameworks
  • Study post-training and RL systems
  • Debug performance and correctness issues in distributed RL pipelines
  • Optimize inference, rollout efficiency, and memory usage during training
  • Collaborate with research and applied ML teams
  • Translate model-level requirements into infrastructure capabilities
  • Validate performance gains with benchmarks and real workloads
  • Document findings, architectural insights, and best practices to guide future system design

Requirements

  • Strong background in systems engineering, distributed systems, or ML infrastructure
  • Hands-on experience with GPU-accelerated workloads and memory-constrained systems
  • Solid understanding of LLM inference workflows (prefill vs decode), attention mechanisms and KV cache behavior, and multi-process / multi-GPU execution models
  • Proficiency in Python and C++ (or similar systems languages)
  • Experience debugging performance issues using profiling tools (GPU, CPU, memory)
  • Ability to read, understand, and modify complex open-source codebases
  • Strong analytical skills and comfort working in research-heavy, ambiguous problem spaces
  • Bachelor's or master's degree in computer science, computer engineering, electrical engineering, or equivalent

Nice to have

  • Direct experience with LLM inference frameworks or serving stacks
  • Familiarity with GPU memory hierarchies (HBM, pinned memory, NUMA considerations); KV cache compression, paging, or eviction strategies; and storage-backed offloading (NVMe, object stores, distributed file systems)
  • Experience with distributed RL or post-training pipelines
  • Knowledge of scheduling systems, async execution, or actor-based runtimes
  • Contributions to open-source ML or systems projects
  • Experience designing benchmarking suites or performance evaluation frameworks

Similar Jobs for Post-Training Platform Infrastructure Engineer

8 matching positions

Software Engineer, AI Platform

Perplexity is seeking an experienced Software Engineer focusing on building the ...
Location: United States, San Francisco, Palo Alto
Salary: 210000.00 - 385000.00 USD / Year
Company: Perplexity
Expiration Date: Until further notice
Requirements
  • Strong programming and data engineering skills, with proficiency in open-source and distributed frameworks (AWS, Spark, Flink, Iceberg, DynamoDB)
  • Familiarity with cloud-based data services (e.g., AWS, RDS, DynamoDB), containerized infrastructure (e.g., EKS, Docker), and data streaming (Flink, Spark streaming, CDC)
  • Strong quantitative and engineering skills with experience in estimating performance at high scale
  • Experience supporting various ML/AI engineering teams to build scalable frameworks to accelerate R&D for frontier models and AI products
  • Experience iterating on LLM responses and setting up proper evaluation frameworks or judges to analyze performance holistically
  • Self-motivated with a strong sense of ownership of systems and designs
  • 5+ years of industry experience in distributed systems or AI infrastructure
Job Responsibility
  • Collaborate closely with AI Product, Applied ML, Post-Training, and Data Science teams to design, build, and maintain scalable data pipelines and data lakes
  • Develop high-performance infrastructure that powers personalization features including memory, discover, and agentic products
  • Create a scalable, multi-modal evaluation platform for all Perplexity AI products, including personalization, pro search, labs, deep research, and Comet
  • Design tools and abstractions on foundational infrastructure to enhance personalization, analytics, recommendations, AI products, and post-training capabilities
  • Holistically improve the engineering foundation to support the rapid growth of Perplexity products and the international user base
What we offer
  • equity
  • health
  • dental
  • vision
  • retirement
  • fitness
  • commuter and dependent care accounts
Employment type: Full-time

Applied Research - Forward-Deployed

Prime Intellect builds the infrastructure that frontier AI labs build internally...
Location: United States, San Francisco
Salary: 150000.00 - 300000.00 USD / Year
Company: Prime Intellect
Expiration Date: Until further notice
Requirements
  • Deep hands-on experience building, evaluating, or deploying LLM-based agents in the past 1–2 years
  • Strong intuition for evaluation design
  • Working understanding of RL and post-training concepts (GRPO, RLHF, reward modeling, SFT)
  • Strong Python skills and comfort with the modern AI stack (Hugging Face, inference engines, agent frameworks)
  • Experience in a customer-facing or consulting-adjacent technical role, or as a technical founder
  • Excellent written and verbal communication
  • High agency and comfort with ambiguity
Job Responsibility
  • Embed directly with strategic customers to understand their agent architectures, failure modes, and product goals
  • Design and build custom RL environments, evaluation harnesses, and verifiers that capture what 'good' looks like for each customer's domain
  • Architect agent scaffolding — tool use, multi-step reasoning, memory, sandbox execution — tailored to customer workflows
  • Configure and launch training runs on Lab, iterating on reward functions, rollout strategies, and evaluation criteria
  • Serve as the technical lead for engagements end-to-end: from discovery through deployed, improved models
  • Identify repeatable patterns from customer engagements and codify them into reference implementations, templates, and documentation
  • Serve as the voice of the customer internally, shaping the roadmap for Lab, verifiers, the Environments Hub, and training infrastructure
  • Build high-quality examples and 'recipes' that make it easy for new customers and open-source contributors to extend the stack
  • Contribute to technical content (blog posts, tutorials, case studies) that demonstrates real-world platform usage
  • Develop novel evaluation methodologies for agentic behavior — multi-step reasoning, tool use correctness, recovery from failure, long-horizon task completion
What we offer
  • Cash Compensation Range of $150-300k + equity incentives
  • Flexible Work (San Francisco or hybrid-remote)
  • Visa Sponsorship & relocation support
  • Professional Development budget
  • Team Off-sites & conference attendance
Employment type: Full-time

Technical Program Manager - Infrastructure

At Microsoft AI, we are on a mission to train the world’s most capable AI fronti...
Location: United States, Mountain View
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree AND 6+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
  • 3+ years of experience managing cross-functional and/or cross-team projects
  • Deeply understand the design, deployment, and optimization of large-scale infrastructure for AI/ML workloads
  • Have experience collaborating with AI researchers, engineers, and infrastructure teams to deliver robust, scalable solutions
  • Thrive in a scrappy, 0->1, innovative environment, managing high-stakes, time-sensitive, large-scale programs
  • Take initiative and enjoy navigating complexity, driving progress across offices, teams, and time zones
  • Demonstrate a proactive attitude and enthusiasm for exploring new methods and technologies in infrastructure and platform engineering
Job Responsibility
  • Coordinate projects and programs related to AI/ML infrastructure (e.g. pre-training, post-training pipelines, inference & model serving stacks), including end-to-end planning, timelines, milestones, performance metrics, and resource needs
  • Collaborate with product teams, engineers, researchers, and external partners to identify gaps and drive timelines toward resolution and mitigation
  • Leverage data and analytics to identify opportunities for improvement, track progress, and measure the impact of quality and efficiency programs
  • Foster a culture of collaboration, continuous improvement, and growth
  • Own the status of key infrastructure projects, proactively identifying risks and proposing solutions to ensure timely delivery
  • Communicate program strategies, progress, and results to executive leadership and key stakeholders, advocating for quality and efficiency within the team
  • Advance the AI frontier responsibly
  • Embody Microsoft’s culture and values
Employment type: Full-time

Engineering Director, AI Solutions and Automation (ASA)-AI Product Acceleration

We are seeking a highly accomplished Engineering Director with extensive technic...
Location: United States, Bellevue, WA
Salary: 271000.00 - 347000.00 USD / Year
Company: Meta
Expiration Date: Until further notice
Requirements
  • 15+ years of experience growing and leading successful Engineering teams, with a proven ability to recruit, land, and grow both engineering technical managers and individual contributors
  • Extensive expertise (15+ years) in Machine Learning (ML) and Artificial Intelligence (AI), with a history of functioning as a technical leader or lead architect on production systems
  • Extensive experience building and deploying complex, large-scale, distributed AI/ML software systems from the ground up
  • Experience as a great collaborator, building models and processes for aligning work across large, multi-disciplinary teams (Engineering, Data Science, Product Management)
  • Hands-on technical experience in relevant ML/AI languages (e.g., Python, C++) and applying data-driven methodologies to define and manage large software projects
  • Demonstrated ability to drive technical strategy and execution in cutting-edge AI domains like multi-modal processing, model evaluation, or RL-based post-training
Job Responsibility
  • Lead and manage teams of AI applied researchers and engineers, providing extensive technical guidance, mentorship, and support to ensure the successful end-to-end delivery of high-quality, scalable AI/ML systems
  • Serve as the technical authority, driving the design, development, and deployment of complex AI solutions, including LLM post-training techniques (like Reinforcement Learning and Fine-Tuning), Multi-modal Content Understanding, and Agentic AI platforms
  • Define and lead the long-term technical strategy and roadmap for large, enterprise-wide AI efforts, ensuring alignment with the ASA mission to deliver cost-efficient and performant AI models
  • Foster an environment of innovation, rapid prototyping, and technical excellence, encouraging experimentation and continuous improvement in the pursuit of SoTA performance
  • Identify new, high-leverage opportunities for LLM-based automation across Meta's product portfolio and influence cross-functional partners for appropriate staffing and prioritization
  • Supervise the development of AI-centric platforms, such as the AI Evaluation and scalable inference and serving infrastructure for 1P, 2P, and 3P models
What we offer
  • bonus
  • equity
  • benefits
Employment type: Full-time

AI Researcher

Perplexity is seeking top-tier AI Research Scientists and Engineers to advance o...
Location: United States, San Francisco; Palo Alto
Salary: 210000.00 - 470000.00 USD / Year
Company: Perplexity
Expiration Date: Until further notice
Requirements
  • Proven experience with large-scale LLMs and Deep Learning systems
  • Strong programming skills in Python/PyTorch
  • Experience with post-training techniques and reinforcement learning
  • Self-starter with a willingness to take ownership of tasks
  • Passion for tackling challenging problems
  • Minimum 2-6 years of experience on relevant projects (depending on seniority level)
Job Responsibility
  • Post-train SOTA LLMs using the latest supervised and reinforcement learning techniques (SFT/DPO/GRPO)
  • Leverage our rich query/answer dataset to scale model performance across Sonar, Deep Research, Comet, and Search products
  • Stay current with the latest LLM research, especially in model training, optimization, and personalization techniques
  • Implement preference optimization and personalization capabilities to enhance user experience
  • Invent in-house improvements and optimizations to enhance SOTA models
  • Turn research ideas into algorithms and run experiments to launch new models
  • Own full-stack data, training, and evaluation pipelines required for model development
  • Build robust and effective training frameworks (on top of Megatron/PyTorch) for post-training LLMs
  • Implement necessary infrastructure and components to support cutting-edge model training at scale
  • Integrate models seamlessly into our product ecosystem
What we offer
  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts
Employment type: Full-time

Head of Enterprise Sales

Prime Intellect is building the open superintelligence stack - from frontier age...
Location: United States, San Francisco
Salary: Not provided
Company: Prime Intellect
Expiration Date: Until further notice
Requirements
  • 7+ years in enterprise sales, with at least 2 years owning or leading a sales team
  • Proven track record closing $250k-$2M+ infrastructure or platform deals
  • Deep familiarity with selling to technical buyers - ML teams, AI labs, DevInfra, cloud buyers
  • Strong outbound and pipeline creation instincts - you know how to open doors at top accounts
  • Experience building sales processes, playbooks, and forecasting systems from scratch
  • Exceptional communicator who moves fluidly between CTO, Head of Research, and procurement
  • Comfortable navigating multi-stakeholder, high-complexity enterprise cycles
  • Bias for action, ownership, and clarity in fast-moving, high-ambiguity environments
Job Responsibility
  • Build and lead the enterprise sales function - process, playbooks, pipeline, quota design
  • Own full-funnel revenue generation: outbound, discovery, technical qualification, pilots, negotiation, close
  • Develop repeatable GTM motions for selling compute + RL post-training as a unified offering
  • Shape strategic targeting - AI labs, research teams, foundation model builders, enterprise ML orgs
  • Partner with Solutions, TAM, and Engineering to design pilots, ensure value realization, and drive expansion
  • Refine messaging and positioning for technical and executive audiences
  • Own forecasting, pipeline health, and CRM accuracy at leadership level
  • Hire and mentor early AEs as the team scales
  • Drive expansion strategies - multi-year deals, committed spend, broader footprint within accounts
  • Collaborate closely with leadership on pricing strategy, deal structuring, and strategic accounts
What we offer
  • Competitive Compensation + equity incentives
  • Flexible Work (remote or San Francisco)
  • Visa Sponsorship & relocation support
  • Professional Development budget
  • Team Off-sites & conference attendance
  • Opportunity to Shape Decentralized AI at Prime Intellect
Employment type: Full-time

Applied Research - RL & Agents

Prime Intellect builds the infrastructure that frontier AI labs build internally...
Location: United States, San Francisco
Salary: Not provided
Company: Prime Intellect
Expiration Date: Until further notice
Requirements
  • Strong background in machine learning engineering, with experience in post-training, RL, or large-scale model alignment
  • Experience with agent frameworks and tooling (e.g. DSPy, LangGraph, MCP, Stagehand)
  • Familiarity with distributed training/inference frameworks (e.g., vLLM, sglang, Accelerate, Ray, Torch)
  • Track record of research contributions (publications, open-source contributions, benchmarks) in ML/RL
  • Passion for advancing the state-of-the-art in reasoning and building practical, agentic AI systems
  • Strong technical writing abilities (documentation, blogs, papers) and research taste
  • Eagerness to drive collaborations with external partners and engage with the broader open-source community
Job Responsibility
  • Advancing Agent Capabilities: Designing and iterating on next-generation AI agents that tackle real workloads—workflow automation, reasoning-intensive tasks, and decision-making at scale
  • Building Robust Infrastructure: Developing the systems and frameworks that enable these agents to operate reliably, efficiently, and at massive scale
  • Bridge Between Applications & Research: Translate ambiguous objectives into clear technical requirements that guide product and research priorities
  • Prototype in the Field: Rapidly design and deploy agents, evals, and harnesses for real-world tasks to validate solutions
  • Application-Driven Research & Infrastructure: Shape the direction and feature set for verifiers, the Environments Hub, training services, and other research platform offerings
  • Build high‑quality examples, reference implementations, and “recipes” that make it easy for others to extend the stack
  • Prototype agents and eval harnesses tailored to real-world use cases and external systems
  • Pair with technical end‑users (research teams, infra‑heavy customers, open‑source contributors) to design environments, evals, and verifiers that reflect real workloads
  • Post-training & Reinforcement Learning: Design and implement novel RL and post-training methods (RLHF, RLVR, GRPO, etc.) to align large models with domain-specific tasks
  • Build evaluations and harnesses to measure reasoning, robustness, and agentic behavior in real-world workflows
What we offer
  • Competitive Compensation + equity incentives
  • Flexible Work (San Francisco or hybrid-remote)
  • Visa Sponsorship & relocation support
  • Professional Development budget
  • Team Off-sites & conference attendance
Employment type: Full-time

Member of Technical Staff, MLE

At Cohere, our Members of Technical Staff are at the forefront of defining and s...
Salary: Not provided
Company: Cohere
Expiration Date: Until further notice
Requirements
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as Tensorflow, TF-Serving, JAX, and XLA/MLIR
  • Deep experience in building and leading a product-centric organisation
  • Direct experience working as part of a team building Large Language Models
  • Released multiple features with several iterations
  • Strong track record of creating and curating large-scale datasets
  • Experience using large-scale distributed training strategies
  • Familiarity with autoregressive sequence models, such as Transformers
  • Ability to collaborate effectively with human annotators and cross-functional teams
  • Papers at top-tier venues (such as NeurIPS, ICML, ICLR, AIStats, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)
Job Responsibility
  • Join a small, diverse team of engineers in designing, building, and scaling AI systems that underpin our suite of dev-centric enterprise products
  • Work directly on North, Cohere’s all-in-one secure AI workspace platform. Here you will drive agent development in RAG, tool use, and language agents embedded in North
  • Quickly research and experiment with novel ideas on our supercomputer and data infrastructure, ensuring our products remain at the forefront of the industry
  • Collaborate with top researchers, engineers, and annotators to create and evaluate data for post-training LLMs, ensuring our products are of the highest quality and performance
  • Engage with the latest AI and deep learning research, staying up to date with leading conferences such as NeurIPS, ICLR, and AAAI
  • Leverage product data to understand usage patterns and identify areas for improvement, ensuring our products remain relevant and competitive
  • Work closely with leadership to shape company strategy and goals, ensuring our product vision is aligned with our overall business objectives
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment type: Full-time