Software Engineer, Load Balancing - Inference

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

293,000 - 490,000 USD / Year

Job Description:

We’re looking for a senior engineer to design and build the load balancer that will sit at the very front of our research inference stack - routing the world’s largest AI models with millisecond precision and bulletproof reliability. This system will serve research jobs where requests must stay “sticky” to the same model instance for hours or days and where even subtle errors can directly degrade model performance.

Job Responsibility:

  • Architect and build the gateway / network load balancer that fronts all research jobs, ensuring long-lived connections remain consistent and performant
  • Design traffic stickiness and routing strategies that optimize for both reliability and throughput
  • Instrument and debug complex distributed systems — with a focus on building world-class observability and debuggability tools (distributed tracing, logging, metrics)
  • Collaborate closely with researchers and ML engineers to understand how infrastructure decisions impact model performance and training dynamics
  • Own the end-to-end system lifecycle: from design and code to deploy, operate, and scale
  • Work in an outcome-oriented environment where everyone contributes across layers of the stack, from infra plumbing to performance tuning

Requirements:

  • Deep experience designing and operating large-scale distributed systems, particularly load balancers, service gateways, or traffic routing layers
  • 5+ years of experience with the algorithmic and systems challenges of consistent hashing, sticky routing, and low-latency connection management, both designing for them in theory and debugging them in practice
  • 5+ years of experience as a software engineer and systems architect working on high-scale, high-reliability infrastructure
  • Have a strong debugging mindset and enjoy spending time in traces, logs, and metrics to untangle distributed failures
  • Are comfortable writing and reviewing production code in Rust or similar systems languages (C/C++, Java, Go, Zig, etc.)
  • Have operated in big tech or high-growth environments and are excited to apply that experience in a faster-moving setting
  • Take ownership of problems end-to-end and are excited to build something foundational to how our models interact with the world

Nice to have:

  • Experience with gateway or load balancing systems (e.g., Envoy, gRPC, custom LB implementations)
  • Familiarity with inference workloads (e.g., reinforcement learning, streaming inference, KV cache management)
  • Exposure to debugging and operational excellence practices in large production environments

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity
  • Performance-related bonus(es) for eligible employees

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work