Software Engineer, Networking - Inference

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

325000.00 - 490000.00 USD / Year

Job Description:

We’re looking for a senior engineer to design and build the load balancer that will sit at the very front of our research inference stack - routing the world’s largest AI models with millisecond precision and bulletproof reliability. This system will serve research jobs where requests must stay “sticky” to the same model instance for hours or days and where even subtle errors can directly degrade model performance.

Job Responsibility:

  • Architect and build the gateway / network load balancer that fronts all research jobs, ensuring long-lived connections remain consistent and performant
  • Design traffic stickiness and routing strategies that optimize for both reliability and throughput
  • Instrument and debug complex distributed systems, with a focus on building world-class observability and debuggability tools (distributed tracing, logging, metrics)
  • Collaborate closely with researchers and ML engineers to understand how infrastructure decisions impact model performance and training dynamics
  • Own the end-to-end system lifecycle: from design and code to deploy, operate, and scale
  • Work in an outcome-oriented environment where everyone contributes across layers of the stack, from infra plumbing to performance tuning

Requirements:

  • Deep experience designing and operating large-scale distributed systems, particularly load balancers, service gateways, or traffic routing layers
  • 5+ years of experience with the algorithmic and systems challenges of consistent hashing, sticky routing, and low-latency connection management, both designing for them in theory and debugging them in practice
  • 5+ years of experience as a software engineer and systems architect working on high-scale, high-reliability infrastructure
  • A strong debugging mindset and enjoyment of spending time in traces, logs, and metrics to untangle distributed failures
  • Comfortable writing and reviewing production code in Rust or similar systems languages (C/C++, Java, Go, Zig, etc.)
  • Operated in big tech or high-growth environments and are excited to apply that experience in a faster-moving setting
  • Take ownership of problems end-to-end and are excited to build something foundational to how our models interact with the world

Nice to have:

  • Experience with gateway or load balancing systems (e.g., Envoy, gRPC, custom LB implementations)
  • Familiarity with inference workloads (e.g., reinforcement learning, streaming inference, KV cache management)
  • Exposure to debugging and operational excellence practices in large production environments

What we offer:

  • Equity
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Software Engineer, Networking - Inference

Senior Software Engineer - Network Enablement (Applied ML)

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date
Until further notice
Requirements
  • Strong software engineering skills including systems design, APIs, and building reliable backend services (Go or Python preferred)
  • Production experience with batch and streaming data pipelines and orchestration tools such as Airflow or Spark
  • Experience building or operating real-time scoring and online feature-serving systems, including feature stores and low-latency model inference
  • Experience integrating model outputs into product flows (APIs, feature flags) and measuring impact through experiments and product metrics
  • Experience with model lifecycle and operations: model registries, CI/CD for models, reproducible training, offline & online parity, monitoring and incident response
Job Responsibility
  • Embed model inference into Network Enablement product flows and decision logic (APIs, feature flags, backend flows)
  • Define and instrument product + ML success metrics (fraud reduction, retention lift, false positives, downstream impact)
  • Design and run experiments and rollout plans (backtesting, shadow scoring, A/B tests, feature-flagged releases) to validate product hypotheses
  • Build and operate offline training pipelines and production batch scoring for bank intelligence products
  • Ship and maintain online feature serving and low-latency model inference endpoints for real-time partner/bank scoring
  • Implement model CI/CD, model/version registry, and safe rollout/rollback strategies
  • Monitor model/data health: drift/regression detection, model-quality dashboards, alerts, and SLOs targeted to partner product needs
  • Ensure offline and online parity, data lineage, and automated validation / data contracts to reduce regressions
  • Optimize inference performance and cost for real-time scoring (batching, caching, runtime selection)
  • Ensure fairness, explainability and PII-aware handling for partner-facing ML features
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
Employment Type:
Full-time

Software Engineer Staff

This Software Engineer Staff will be engaged in data science-related research an...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Use analytical and programming skills and open-source systems (such as Apache Storm, Apache Spark, Elasticsearch, Cassandra, and graph databases) to develop data processing pipelines that meet the required efficacy and latency
  • Good knowledge of and experience with big data tool sets and the techniques of distributed storage and computation engines
  • Experience developing reusable and highly scalable data processing components
  • Good knowledge of and experience working with cloud-based CI/CD tools and cloud DevOps teams to collect stats and create monitors for our data processing pipelines
  • Develop good-quality Python APIs to support microservices
  • Knowledge of the APIs of various NoSQL storage systems, such as Elasticsearch, Cassandra, and Redis
  • Good understanding of Python Flask web services and the ability to develop good-quality code
  • Troubleshoot production environments and customer-reported issues
  • Knowledge of multi-cloud production environments
  • Ability to troubleshoot open-source data processing engines, such as Apache Spark, Apache Storm, and Apache Flink
Job Responsibility
  • Designs, develops, troubleshoots and debugs software programs for software enhancements and new products
  • Develops software including operating systems, compilers, routers, networks, utilities, databases and Internet-related tools
  • Determines hardware compatibility and/or influences hardware design
  • Engage in data science-related research and software application development and engineering duties related to our enterprise-grade Wi-Fi technology and autonomous platform to provide unprecedented visibility into the user experience
  • Collaborate with other engineers and product managers to build the next generation of autonomous Wi-Fi networks leveraging big data and predictive models
  • Use knowledge of wireless communication networks, machine learning and software engineering to develop and implement scalable algorithms to process a large amount of streaming data to detect anomalies, predict problems, and classify them in real-time
  • Leverage the data collected from the Wi-Fi network to empower the inference engine of our Mist platform and systems, including the Mist virtual assistant chat bot
  • Determine the likelihood of failures across the Wi-Fi network and perform failure scope analysis
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Full-time

Principal Software Engineer

Principal Software Engineer role at Hewlett Packard Enterprise to design, develo...
Location:
United States, San Jose
Salary:
148000.00 - 340500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
  • 10+ years of experience in software engineering with a focus on Python, Go, or Java
  • Strong understanding of RESTful API design and development
  • 2+ years of experience working with large-scale distributed systems based on either cloud technologies or Kubernetes
  • 2+ years of experience with event-driven technologies like Kafka and Apache Storm/Flink
  • 2+ years of experience with big data technologies like Apache Spark/Databricks
  • Proficient in working with Redis and databases like Cassandra/DataStax
  • Must hold U.S. citizenship
Job Responsibility
  • Design, develop, and test software related to the cloud-based network configuration and reporting system
  • Solve complex problems and design subsystems for the Mist platform
  • Develop software for highly scalable and fault-tolerant cloud-scale distributed applications
  • Develop microservices using Python, and/or Go (golang)
  • Develop event-driven systems using Python and Java
  • Develop software for AIDE's real-time data pipeline and batch processing
  • Develop ETL pipelines aiding in training and inference of various ML models using big-data frameworks like Apache Spark
  • Build metrics, monitoring and structured logging into the product
  • Write unit, integration and functional tests
  • Participate in collaborative, DevOps style, lean practices
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
Employment Type:
Full-time

Principal Software Engineer

Principal Software Engineer role at Hewlett Packard Enterprise to design, develo...
Location:
United States, San Jose
Salary:
148000.00 - 340500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
  • 10+ years of experience in software engineering with a focus on Python, Go, or Java
  • Strong understanding of RESTful API design and development
  • 2+ years of experience working with large-scale distributed systems based on either cloud technologies or Kubernetes
  • 2+ years of experience with event-driven technologies like Kafka and Apache Storm/Flink
  • 2+ years of experience with big data technologies like Apache Spark/Databricks
  • Proficient in working with Redis and databases like Cassandra/DataStax
  • Excellent problem-solving and analytical skills
  • Strong communication and collaboration skills
Job Responsibility
  • Design, develop, and test software related to the cloud-based network configuration and reporting system
  • Solve complex problems and design subsystems for the Mist platform
  • Develop software for highly scalable and fault-tolerant cloud-scale distributed applications
  • Develop microservices using Python, and/or Go (golang)
  • Develop event-driven systems using Python and Java
  • Develop software for AIDE's real-time data pipeline and batch processing
  • Develop ETL pipelines aiding in training and inference of various ML models using big-data frameworks like Apache Spark
  • Build metrics, monitoring and structured logging into the product
  • Write unit, integration and functional tests
  • Participate in collaborative, DevOps style, lean practices
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Employment Type:
Full-time

Software Engineer, AI Infrastructure

As a Software Engineer on our AI Infrastructure team, you will help design the c...
Location:
United States, New York, NY; San Mateo, CA
Salary:
Not provided
Fireworks AI
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 3 years of experience in software engineering, with a focus on infrastructure or machine learning systems
  • Strong programming skills in Python, Go, or a similar language
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, MLflow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Basic understanding of LLM concepts (e.g., context length, disaggregated prefill, KV cache memory estimation)
Job Responsibility
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as LLM CI/CD pipeline, control plane, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Build frameworks and safeguards to ensure Fireworks AI has the best model quality in the industry
  • Collaborate with performance, training, and product teams to translate research and product needs into infrastructure solutions
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer
  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure
  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI, with no bureaucracy, just results
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation
Employment Type:
Full-time

AI Software Engineer - NLP/LLM

At Moody's, we unite the brightest minds to turn today’s risks into tomorrow’s o...
Location:
United States, New York
Salary:
159300.00 - 230850.00 USD / Year
Moody's
Expiration Date
Until further notice
Requirements
  • 5+ years of demonstrated experience building production-grade machine learning systems with measurable impacts; expertise in NLP and search and recommendation systems is preferred
  • Hands-on experience with large language model (LLM) applications and AI agents, including retrieval-augmented generation, prompt optimization, fine-tuning, agent design, and evaluation methodologies; familiarity with prompt optimization frameworks like DSPy is preferred
  • Deep expertise in machine learning models and systems design, including classic models (e.g., XGBoost), modern deep learning and graph machine learning architectures (e.g., transformers-based models, graph neural networks (GNN)), and reinforcement learning systems
  • Proven ability to take models and agents from research to production, including optimization for latency and cost, implementation of monitoring and tracing, and development of reusable platforms or frameworks
  • Strong technical leadership and mentorship skills, with a track record of growing engineers, improving team velocity through automation, documentation, and tooling, and influencing architectural decisions without direct authority
  • Excellent communication and strategic thinking abilities, capable of aligning technical decisions with business outcomes, navigating ambiguity, and driving cross-functional collaboration
  • Bachelor’s degree or higher in Computer Science, Engineering, or a related field
Job Responsibility
  • Design and deploy end-to-end AI and machine learning solutions, including machine learning and graph-based models, natural language processing (NLP) models, and large language model (LLM) based AI agents
  • Build robust pipelines for data ingestion, feature engineering, model training, validation, and real-time or batch inference
  • Develop and integrate large language model (LLM) applications using techniques such as fine-tuning, retrieval-augmented generation, and reinforcement learning
  • Build autonomous agents capable of multi-step reasoning and tool use in production environments
  • Lead the full model and agent development lifecycle, from problem definition and data exploration through experimentation, implementation, deployment, and monitoring
  • Ensure solutions are scalable, reliable, and aligned with business goals
  • Advocate and implement machine learning operations (MLOps) best practices including data monitoring and tracing, error analysis, automated retraining, model and prompt versioning, business metrics monitoring, and incident response
  • Collaborate across disciplines and provide technical leadership, working with product managers, engineers, and researchers to deliver impactful solutions
  • Mentor team members, lead design reviews, and promote best practices in AI and machine learning systems development
What we offer
  • medical
  • dental
  • vision
  • parental leave
  • paid time off
  • a 401(k) plan with employee and company contribution opportunities
  • life, disability, and accident insurance
  • a discounted employee stock purchase plan
  • tuition reimbursement
Employment Type:
Full-time

Software Engineer, Infrastructure

As a Software Engineer on our Infrastructure team, you will help design and buil...
Location:
United States, New York; San Mateo; Redwood City
Salary:
140000.00 - 150000.00 USD / Year
Fireworks AI
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • Strong programming skills in Python, C++, or a similar language
  • Solid understanding of computer systems concepts such as networking, storage, and distributed computing
  • Familiarity with cloud platforms like AWS, GCP, or Azure, and containerization tools like Docker or Kubernetes
  • Knowledge and interest in cloud infrastructure, distributed systems, and machine learning
Job Responsibility
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as job schedulers, autoscalers, resource managers, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Collaborate with ML, DevOps, and product teams to translate research and product needs into infrastructure solutions
  • Learn and apply modern cloud technologies including Kubernetes, Ray, Kubeflow, and MLflow
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer
  • Meaningful equity in a fast-growing startup
  • Competitive salary and comprehensive benefits package
Employment Type:
Full-time

Staff AI Embedded Software Engineer - Connected Devices

As a Staff Embedded Software Engineer, you will lead critical software engineeri...
Location:
United States, Seattle; Boston; Scottsdale
Salary:
168750.00 - 270000.00 USD / Year
Axon
Expiration Date
Until further notice
Requirements
  • 12+ years of professional software development experience, with extensive expertise in C/C++, Go, Python, or comparable systems programming languages, including significant experience building AI- and data-intensive systems
  • Deep, demonstrated expertise in embedded systems architecture, firmware integration, and device-level software engineering, combined with hands-on experience deploying and optimizing AI inference workloads on constrained edge platforms (MCUs, SoCs, NPUs)
  • Proven experience designing, training, and operating machine learning models at scale, including ownership of data pipelines, model evaluation, and iterative improvement in production environments
  • Practical experience with large-scale AI systems, including foundation models and LLMs, such as fine-tuning, adaptation, or integration into real-world products
  • Proven track record of addressing and resolving system-wide challenges in performance, scalability, reliability, security, and safety across AI-enabled and mission-critical systems
  • 7+ years mentoring senior engineers and leading complex, strategic engineering initiatives across multiple teams, including setting technical direction for AI-enabled products
  • Advanced understanding of computer science fundamentals, data structures, algorithms, and high-standard software design practices, applied to both embedded and large-scale AI systems
  • Experience with networking and distributed system concepts relevant to connected and AI-enabled devices
Job Responsibility
  • Define and significantly advance embedded software architectures for Axon’s current and future connected device products, including AI-enabled systems spanning on-device inference and cloud-assisted workflows
  • Lead the technical direction for AI-enabled capabilities across connected devices, including collaboration on large-scale model training, data strategy, deployment, and iterative improvement in production, across multiple product lines
  • Partner with research, product, and platform teams to explore and integrate emerging AI approaches, including foundation models and multimodal systems, shaping Axon’s medium and long-term AI strategy for connected devices
  • Establish and enforce Axon-wide standards for embedded software and AI system design, including reliability, scalability, safety, observability, and lifecycle management
  • Identify and mitigate risks associated with AI systems, including model failure modes, data drift, and operational edge cases, and drive architectural decisions that ensure safe and reliable behavior in real-world conditions
  • Provide executive-level guidance and mentorship, significantly enhancing the capabilities and technical decision-making of the embedded software engineering teams
  • Continuously improve software engineering practices and drive excellence through strategic retrospectives, planning sessions, and innovation cycles
What we offer
  • Competitive salary and 401k with employer match
  • Discretionary paid time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Mental Wellness support
  • Learning & Development programs
  • Snacks in our offices
Employment Type:
Full-time