Inference Software Engineer Job at Etched (San Jose)

Senior Software Engineer and Principal Software Engineer - PowerPoint AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
8+ years of experience in backend service engineering, including work on high-scale infrastructures
Proficiency in one or more systems programming languages such as C#, C++
1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
2+ years of experience building software for scale, performance, and reliability
Academic or industry experience building, fine-tuning, or deploying models (any category), or building eval-driven systems that use them

Job Responsibility

Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
Design and implement scalable backend services optimized for machine learning workflows and large language model integration
Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience

Full-time

Software Engineer II and Senior Software Engineer - Performance

The Artificial Intelligence Performance team at Microsoft develops AI software t...

Location

United States, Mountain View

Salary:

100600.00 - 199000.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Identify and drive improvements to end-to-end inference performance of OpenAI and other state-of-the-art LLMs
Measure and benchmark performance on NVIDIA/AMD GPUs and first-party Microsoft silicon
Optimize and monitor LLM performance, and build software tooling that surfaces performance opportunities from the model level down to the systems and silicon levels, to improve customer experience and reduce the footprint of the computing fleet
Enable fast time to market for LLMs and other models, and their deployment at scale, by building software tools that speed up porting models to new NVIDIA and AMD GPUs
Design, implement, and test functions or components for our AI/DNN/LLM frameworks and tools
Speed up or simplify key components and pipelines to improve the performance and/or efficiency of our systems
Communicate and collaborate with our internal and external partners
Embody Microsoft's Culture and Values

Full-time

Software Engineer, Inference – AMD GPU Enablement

We’re hiring engineers to scale and optimize OpenAI’s inference infrastructure a...

Location

United States, San Francisco

Salary:

295000.00 - 555000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Experience writing or porting GPU kernels using HIP, CUDA, or Triton
Familiarity with communication libraries like NCCL/RCCL
Experience working on distributed inference systems
Ability to solve end-to-end performance challenges across hardware, system libraries, and orchestration layers
Ability to thrive in a small, fast-moving team building new infrastructure from first principles

Job Responsibility

Own bring-up, correctness and performance of the OpenAI inference stack on AMD hardware
Integrate internal model-serving infrastructure (e.g., vLLM, Triton) into a variety of GPU-backed systems
Debug and optimize distributed inference workloads across memory, network, and compute layers
Validate correctness, performance, and scalability of model execution on large GPU clusters
Collaborate with partner teams to design and optimize high-performance GPU kernels for accelerators using HIP, Triton, or other performance-focused frameworks
Collaborate with partner teams to build, integrate and tune collective communication libraries (e.g., RCCL) used to parallelize model execution across many GPUs

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Full-time

Software Engineer, Inference - Multi Modal

OpenAI’s Inference team powers the deployment of our most advanced models - incl...

Location

United States, San Francisco

Salary:

295000.00 - 555000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Experience building and scaling inference systems for LLMs or multimodal models
Worked with GPU-based ML workloads and understand the performance dynamics of large models, especially with complex data like images or audio
Enjoy experimental, fast-evolving work and collaborating closely with research
Comfortable dealing with systems that span networking, distributed compute, and high-throughput data handling
Familiarity with inference tooling like vLLM, TensorRT-LLM, or custom model parallel systems
Own problems end-to-end and are excited to operate in ambiguous, fast-moving spaces

Job Responsibility

Design and implement inference infrastructure for large-scale multimodal models
Optimize systems for high-throughput, low-latency delivery of image and audio inputs and outputs
Enable experimental research workflows to transition into reliable production services
Collaborate closely with researchers, infra teams, and product engineers to deploy state-of-the-art capabilities
Contribute to system-level improvements including GPU utilization, tensor parallelism, and hardware abstraction layers

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Full-time

Staff Software Engineer, Inference Infrastructure

Our mission is to scale intelligence to serve humanity. We’re training and deplo...

Location

San Francisco, Toronto, London, New York, Montreal

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

5+ years of engineering experience running production infrastructure at a large scale
Experience designing large, highly available distributed systems with Kubernetes, and GPU workloads on those clusters
Experience with Kubernetes dev and production coding and support
Experience with GCP, Azure, AWS, OCI, and multi-cloud, on-prem, or hybrid serving
Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments
Experience in compute/storage/network resource and cost management
Excellent collaboration and troubleshooting skills to build mission-critical systems, and ensure smooth operations and efficient teamwork
The grit and adaptability to solve complex technical challenges that evolve day to day
Familiarity with computational characteristics of accelerators (GPUs, TPUs, and/or custom accelerators), especially how they influence latency and throughput of inference
Strong understanding or working experience with distributed systems

Job Responsibility

Developing, deploying, and operating the AI platform delivering Cohere's large language models through easy-to-use API endpoints
Working closely with many teams to deploy optimized NLP models to production in low-latency, high-throughput, and high-availability environments
Interfacing with customers and creating customized deployments to meet their specific needs

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Full-time

Full-Stack Software Engineer, Inference

Our mission is to scale intelligence to serve humanity. We’re training and deplo...

Location

Toronto; San Francisco; London; New York; Montreal

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

5+ years of experience writing clean backend code
Experience with Golang and React
Built payment systems and have experience with subscription or usage-based SaaS, and/or products with a freemium model
Strong coding abilities and comfortable working across the stack
Worked in both large enterprises and startups
Excel in fast-paced environments and can execute while priorities and objectives are a moving target

Job Responsibility

Improve the platform’s auth, billing, and payment systems
Add new features to the interactive Playground where customers can try our models
Implement new platform features for managing deployments
Write and ship minimal code that runs in low-resource environments and is subject to highly stringent deployment mechanisms
As security and privacy are paramount, you will sometimes need to reinvent the wheel, and won’t be able to use the most popular libraries or tooling

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Full-time

Software Engineer, Networking - Inference

We’re looking for a senior engineer to design and build the load balancer that w...

Location

United States, San Francisco

Salary:

325000.00 - 490000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Deep experience designing and operating large-scale distributed systems, particularly load balancers, service gateways, or traffic routing layers
5+ years of experience with the algorithmic and systems challenges of consistent hashing, sticky routing, and low-latency connection management, both designing for them in theory and debugging them in practice
5+ years of experience as a software engineer and systems architect working on high-scale, high-reliability infrastructure
Strong debugging mindset and enjoy spending time in tracing, logs, and metrics to untangle distributed failures
Comfortable writing and reviewing production code in Rust or similar systems languages (C/C++, Java, Go, Zig, etc.)
Operated in big tech or high-growth environments and are excited to apply that experience in a faster-moving setting
Take ownership of problems end-to-end and are excited to build something foundational to how our models interact with the world

Job Responsibility

Architect and build the gateway / network load balancer that fronts all research jobs, ensuring long-lived connections remain consistent and performant
Design traffic stickiness and routing strategies that optimize for both reliability and throughput
Instrument and debug complex distributed systems — with a focus on building world-class observability and debuggability tools (distributed tracing, logging, metrics)
Collaborate closely with researchers and ML engineers to understand how infrastructure decisions impact model performance and training dynamics
Own the end-to-end system lifecycle: from design and code to deploy, operate, and scale
Work in an outcome-oriented environment where everyone contributes across layers of the stack, from infra plumbing to performance tuning

What we offer

Offers Equity
Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth

Full-time

Software Engineer, Model Inference

Our Inference team brings OpenAI’s most capable research and technology to the w...

Location

United States, San Francisco

Salary:

295000.00 - 555000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Understanding of modern ML architectures and an intuition for how to optimize their performance, particularly for inference
Own problems end-to-end, and are willing to pick up whatever knowledge you're missing to get the job done
At least 5 years of professional software engineering experience
Familiarity with PyTorch, NVIDIA GPUs and the software stacks that optimize them (e.g. NCCL, CUDA), as well as HPC technologies such as InfiniBand, MPI, NVLink, etc.
Experience architecting, building, observing, and debugging production distributed systems
Have needed to rebuild or substantially refactor production systems several times over due to rapidly increasing scale
Are self-directed and enjoy figuring out the most important problem to work on
Have a humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed

Job Responsibility

Work alongside machine learning researchers, engineers, and product managers to bring our latest technologies into production
Work alongside researchers to enable advanced research through awesome engineering
Introduce new techniques, tools, and architecture that improve the performance, latency, throughput, and efficiency of our model inference stack
Build tools to give us visibility into our bottlenecks and sources of instability and then design and implement solutions to address the highest priority issues
Optimize our code and fleet of Azure VMs to utilize every FLOP and every GB of GPU RAM of our hardware

What we offer

Offers Equity
Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth

Full-time
