Software Engineer, Agent Infrastructure

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:
230000.00 - 385000.00 USD / Year

Job Description:

The Agent Infrastructure team at OpenAI is responsible for building systems that enable training and deployment of highly useful AI agents, both internally and for the world. We work hand-in-hand with researchers to design and scale the environment in which agentic models are trained – providing a workspace for AI models to execute code, debug issues, and develop software just as human SWEs do. Our training environment for agentic models operates at an extremely high scale and has the flexibility to emulate any environment in which an agent might work. At the same time, our team builds and maintains OpenAI’s core platform for the deployment and execution of agents in production. Our systems power products such as Codex, Operator, tool use in ChatGPT, and future agentic products.

Job Responsibility:

  • Push massive compute clusters to their limits as a core contributor to a novel container orchestration platform
  • Develop and maintain FastAPI and gRPC APIs that serve as the interface for agentic infrastructure used in training and production
  • Use Terraform to stand up and evolve complex infrastructure for both research and production
  • Collaborate with research teams to stand up and optimize systems for novel AI training runs and experimental applications
  • Build and scale systems to train highly capable agentic models
  • Build the platform and integrations to launch new agents to hundreds of millions of users worldwide

Requirements:

  • Deep experience working on large-scale machine learning infrastructure
  • Ability to reason about training at scale, identifying bottlenecks and engineering solutions to optimize system performance
  • Experience building new things from 0 to 1 quickly and scaling them 1,000,000x
  • Keen eye for performance and optimization of complex, globally-distributed systems
  • Knowledge of cloud platforms and infrastructure-as-code tech like Terraform
  • Driven by solving complex, ambiguous problems at the intersection of infrastructure scalability, virtualization efficiency, and agentic capabilities
  • Deep technical expertise in virtualization and containerization technologies (e.g. Kata, Firecracker, gVisor, Sysbox)
  • Passion for optimizing runtime performance

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Charitable donation matching and wellness stipends
  • Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Software Engineer, Agent Infrastructure

Senior Software Engineering Manager - Digital Engineering

Senior Engineering Manager of Digital Engineering responsible for leading back-e...
Location
United States
Salary:
106605.00 - 260590.00 USD / Year
CVS Health
Expiration Date
February 28, 2026
Requirements
  • 7+ years of software development experience with focus on enterprise-level solutions and cloud technologies
  • 2+ years of experience leading large cross-functional project management initiatives including microservices, event-driven architectures, and streaming platforms like Apache Kafka
  • 2+ years of experience with Agile/Scrum practices
  • Bachelor's Degree in Computer Science or related field, or equivalent experience
Job Responsibility
  • Leading back-end engineering teams in creating exceptional member experiences
  • Identifying, prioritizing, and shaping complex enterprise initiatives with business stakeholders
  • Guiding teams of engineers in delivering digital services that enhance healthcare cost transparency
  • Overseeing migration of data and services from legacy infrastructure to cloud
  • Integrating emerging AI and agentic technologies to improve offerings
  • Providing leadership, coaching, and strategic guidance to application development teams
  • Leading digital engineering teams responsible for back-end service development, data migration, micro-services, and emerging AI technologies
  • Influencing strategic roadmaps for future initiatives
  • Collaborating with business partners to ensure alignment with business initiatives and objectives
What we offer
  • Affordable medical plan options
  • 401(k) plan with matching company contributions
  • Employee stock purchase plan
  • No-cost wellness screenings
  • Tobacco cessation and weight management programs
  • Confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Fulltime

Senior Machine Learning Engineer, Agentic

Join us in building the future of finance. Our mission is to democratize finance...
Location
United States, Bellevue; Menlo Park; New York; Washington; Denver; Westlake; Chicago; Lake Mary; Clearwater; Gainesville
Salary:
146000.00 - 220000.00 USD / Year
Robinhood
Expiration Date
Until further notice
Requirements
  • Strong technical expertise in software development, with understanding of agentic workflows—including reasoning loops, tool invocation, memory, and orchestration of autonomous AI agents
  • Hands-on experience using Large Language Models, including prompt engineering, fine-tuning, model distillation, and deploying optimized models (e.g. via DPO, PPO) into production environments
  • Proven ability to build and scale ML/AI systems, from experimentation to deployment—owning dataset generation, evaluation pipelines, A/B testing, and performance monitoring
  • Leadership and mentorship capabilities, with a track record of guiding complex technical projects and supporting the growth of teammates through code/design reviews and technical direction
  • Excellent communication and collaboration skills, with the ability to translate technical ideas into actionable plans and work effectively with cross-functional partners, including product and infrastructure teams
  • Innovation mindset, commitment to continuous learning, and a bias toward action, staying at the forefront of ML/AI trends, agentic systems research, and best practices in tooling, safety, and evaluation
Job Responsibility
  • Design and create tools and workflows for agent development that support rapid prototyping—define agents, compose toolchains, and construct reasoning loops with minimal overhead
  • Build platform solutions to support scalable experimentation, synthetic dataset generation, and multi-agent evaluation across diverse tasks and domains
  • Develop feedback and optimization pipelines that incorporate both automated metrics and human-in-the-loop evaluation signals to fine-tune agent behavior
  • Implement and scale optimization techniques such as Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and reward modeling to improve agent performance
  • Launch and support fine-tuned models in production environments with robust evaluation, rollback strategies, and performance monitoring
  • Collaborate closely with applied AI/ML teams to translate state-of-the-art research in agentic reasoning, planning, and tool use into reliable, production-ready systems
What we offer
  • Market competitive and pay equity-focused compensation structure
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Annual lifestyle wallet for personal wellness, learning and development, and more
  • Lifetime maximum benefit for family forming and fertility benefits
  • Dedicated mental health support for employees and eligible dependents
  • Generous time away including company holidays, paid time off, sick time, parental leave, and more
  • Lively office environment with catered meals, fully stocked kitchens, and geo-specific commuter benefits
  • Bonus opportunities
  • Equity
  • Fulltime

Sr. Staff Software Engineer - Advanced Analytics Platform

At DISQO, we’re redefining how companies turn data into decisions. Our mission i...
Location
United States, Los Angeles, Glendale
Salary:
200000.00 - 240000.00 USD / Year
DISQO
Expiration Date
Until further notice
Requirements
  • 12+ years of professional software engineering experience
  • 5+ years architecting or building high-performance data systems or analytics platforms
  • 3+ years of production Rust experience
  • Deep expertise in Rust and strong experience in Java
  • Proven track record building large-scale data analytics or OLAP systems from the ground up
  • Deep understanding of columnar data engines, vectorized execution, and query/dataframe optimization
  • Hands-on experience with performance engineering, profiling, and hardware-aware optimization
  • Strong expertise with AWS - designing, deploying, and optimizing large-scale data and compute systems in the cloud
  • A systems-thinking mindset
  • Thrives in a fast-moving, startup environment
Job Responsibility
  • Architect and deliver a high-performance Advanced Analytics Engine
  • Design and build an Agentic AI system that leverages this Advanced Analytics Engine
  • Partner with product, engineering and data teams to power agentic AI analytics systems
  • Profile, benchmark, and optimize Rust components
  • Leverage AWS cloud services to architect scalable, reliable, and cost-efficient analytics infrastructure
  • Shape the evolution of DISQO’s broader data platform and its integration across our product ecosystem
  • Mentor and guide engineers
  • Contribute to open-source or internal frameworks that advance analytical systems and distributed computation
What we offer
  • 100% covered Medical/Dental/Vision for employee
  • Equity
  • 401K
  • Generous PTO policy
  • Flexible workplace policy
  • Team offsites, social events & happy hours
  • Life Insurance
  • Health FSA
  • Commuter FSA (for hybrid employees)
  • Catered lunch and fully stocked kitchen
  • Fulltime

Intermediate Software Engineer SRE – AI

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location
Canada, Mississauga
Salary:
115000.00 - 128000.00 CAD / Year
PointClickCare
Expiration Date
Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more
  • Fulltime

Senior Backend Software Engineer

The Coaching team builds Highspot’s personalized, AI-enhanced coaching capabilit...
Location
Canada, Vancouver
Salary:
146000.00 - 178000.00 CAD / Year
Highspot
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science or equivalent practical experience
  • 5+ years of experience in back-end software development building and maintaining large-scale web applications
  • At least 3 years of experience working with object-oriented programming languages (Ruby and Python preferred)
  • Experience architecting, building, and deploying mid-to-large scale web applications in a distributed environment
  • Strong understanding of API design, data modeling, and backend scalability
  • Experience integrating or working with AI/LLM platforms such as OpenAI, Anthropic (Claude), or Azure OpenAI
  • Familiarity with AI-powered development tools (e.g., Cursor, GitHub Copilot, Cody, etc.) and a demonstrated ability to incorporate them effectively into day-to-day workflows
  • Deep expertise in web performance, security, and reliability best practices
  • Proven ability to deconstruct complex technical problems and deliver elegant, maintainable solutions
Job Responsibility
  • Design, develop, and maintain high-quality, scalable, and user-centric backend systems using modern technologies
  • Architect and optimize backend infrastructure to power intelligent, AI-driven workflows and Agentic AI integrations
  • Build and maintain integrations with multiple large language models (LLMs) including ChatGPT, Claude, and other OpenAI and Microsoft models
  • Collaborate closely with AI/ML engineers to productionize agentic workflows and autonomous reasoning systems
  • Partner effectively with Product Management and UX Design to translate ideas and research into production-ready, AI-enhanced features
  • Leverage AI-assisted development tools such as Cursor, GitHub Copilot, and other code generation frameworks to accelerate development and improve code quality
  • Lead and mentor engineers through complex projects, emphasizing clean architecture, testing, and software craftsmanship
  • Drive backend infrastructure improvements that enhance reliability, observability, and performance
  • Collaborate cross-functionally to deliver differentiated customer value through AI and data-driven solutions
  • Troubleshoot and resolve critical production issues while contributing to internal documentation and best practices
What we offer
  • Comprehensive medical, dental, vision, disability, and life benefits
  • Group Retirement Savings Plan (RRSP) and matching employer contributions (DPSP) with immediate vesting
  • Flexible PTO
  • Generous Holiday Schedule + 5 Days for Annual Holiday Week
  • Quarterly Recharge Fridays (paid days off for mental health recharge)
  • Flexible work schedules
  • Access to Coaches and Therapists through Modern Health
  • 2 Volunteer days per year
  • Monthly transportation allowance for employees that work in our Vancouver Hub location
  • Employees are eligible to receive stock options
  • Fulltime

Senior Software Engineer, AI Runtime

We’re seeking a Senior Software Engineer to help power the future of agentic AI ...
Location
United States
Salary:
157000.00 - 198900.00 USD / Year
Apollo GraphQL
Expiration Date
Until further notice
Requirements
  • Expertise in agent-to-tool orchestration, routing, and coordination in scalable, fault-tolerant systems
  • Deep expertise in Rust programming language
  • Strong background in distributed systems, server architecture, and high-performance backend development
  • Proven experience with protocol design, message routing, and server-side orchestration frameworks
  • Experience building and maintaining robust runtime infrastructure that supports AI-driven workflows and enables reliable agent-to-tool interactions
  • Proven experience with protocol design, message routing, and building server-side frameworks that enable scalable, reliable multi-tool agent workflows
  • Hands-on experience with observability, monitoring, and debugging frameworks for complex systems
  • Passion for clean, maintainable code, high system reliability, and scalable architecture
  • Experience in strategic system design, making architectural trade-offs, and planning for long-term scalability and maintainability
  • Strong technical leadership and mentorship, including guiding junior engineers and driving engineering best practices across teams
Job Responsibility
  • Scale an enterprise AI/MCP Server and Gateway that powers multi-agent workflows across Apollo, including routing, orchestration, and integration boundaries
  • Implement robust server infrastructure to ensure reliability, performance, and security at scale
  • Build and maintain tools for agent discovery, communication, and coordination
  • Define deployment strategies and runtime optimizations to maximize efficiency and minimize operational overhead
  • Develop frameworks and patterns that enable seamless multi-agent collaboration and AI-driven orchestration
  • Integrate observability, logging, and monitoring for full visibility into server and agent behavior
  • Explore and implement AI-enhanced developer workflows to optimize orchestration and agent interactions
  • Collaborate with teams within our org to ensure the MCP Server meets evolving product and developer needs

Staff Software Engineer, AI Runtime

We’re seeking a Staff Software Engineer to help power the future of agentic AI w...
Location
United States
Salary:
185000.00 - 215000.00 USD / Year
Apollo GraphQL
Expiration Date
Until further notice
Requirements
  • Expertise in agent-to-tool orchestration, routing, and coordination in scalable, fault-tolerant systems
  • Deep expertise in Rust programming language
  • Strong background in distributed systems, server architecture, and high-performance backend development
  • Proven experience with protocol design, message routing, and server-side orchestration frameworks
  • Experience building and maintaining robust runtime infrastructure that supports AI-driven workflows and enables reliable agent-to-tool interactions
  • Proven experience with protocol design, message routing, and building server-side frameworks that enable scalable, reliable multi-tool agent workflows
  • Hands-on experience with observability, monitoring, and debugging frameworks for complex systems
  • Passion for clean, maintainable code, high system reliability, and scalable architecture
  • Experience in strategic system design, making architectural trade-offs, and planning for long-term scalability and maintainability
  • Strong technical leadership and mentorship, including guiding junior engineers and driving engineering best practices across teams
Job Responsibility
  • Architect and scale an enterprise AI/MCP Server and Gateway that powers multi-agent workflows across Apollo, including routing, orchestration, and integration boundaries
  • Design and implement robust server infrastructure to ensure reliability, performance, and security at scale
  • Build and maintain tools for agent discovery, communication, and coordination
  • Define deployment strategies and runtime optimizations to maximize efficiency and minimize operational overhead
  • Develop frameworks and patterns that enable seamless multi-agent collaboration and AI-driven orchestration
  • Integrate observability, logging, and monitoring for full visibility into server and agent behavior
  • Explore and implement AI-enhanced developer workflows to optimize orchestration and agent interactions
  • Collaborate with teams across Apollo to ensure the MCP Server meets evolving product and developer needs
What we offer
  • Equity
  • Choice of 3 Anthem Blue Cross medical plans (California residents can also choose from an additional 2 Kaiser medical plans)
  • Dental and Vision benefits are provided by Sun Life Financial
  • Fulltime

Senior Software Engineer, Cloud Transition

Atlassian is hiring a Senior Software Engineer for its Cloud Transition team in ...
Location
United States, San Francisco
Salary:
146300.00 - 235000.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • At least 6 years of experience in building cloud SaaS platforms in a dynamic environment
  • Strong experience in Java, microservices, and relational databases
  • Passionate about collaborating with customers and cross-functional teams
  • Experience in AWS and streaming technologies such as Kafka
  • Experience in test-driven development
  • Passion for engineering and operational excellence
  • Understanding of SaaS, PaaS, and IaaS industries with hands-on experience with public cloud offerings (e.g., AWS, GCP, Azure)
  • Fluency in any one database technology (e.g., RDBMS like Oracle or Postgres and/or NoSQL like DynamoDB or Cassandra)
  • Experience crafting and implementing well-tested, highly scalable, and performant microservices and/or other distributed systems
  • Practical knowledge of agile software development methodologies (e.g., XP, scrum)
Job Responsibility
  • Drive large, complex projects autonomously, from technical design to launch
  • Tackle complex architectural challenges, apply architectural standards, and start using them on new projects
  • Lead code reviews & documentation as well as take on complex bug fixes, especially on high-risk problems
  • Be an example for thorough, meaningful code reviews
  • Partner across engineering teams to tackle company-wide initiatives spanning multiple projects
  • Mentor junior members of the team
  • Develop platform capabilities to power customer-facing solutions/experiences such as migration assistants, App Migrations, and Routine Admin tasks (sandbox data clone, cloud-to-cloud data transformation, backup-restore)
  • Implement compliance initiatives across platform and product stacks ranging from cloud infrastructure to product experiences
  • Collaborate with Core Engineering, product, and platform teams on a large-scale, high-reliability transformative architecture, including Kafka and Kafka Streams adoption
  • Ensure the adoption of world-class engineering and operational practices across teams
What we offer
  • Health coverage
  • Paid volunteer days
  • Wellness resources
  • Bonuses
  • Commissions
  • Equity
  • Fulltime