Systems Research Engineer, GPU Programming Job at Together AI (San Francisco)

Software Engineer, Systems ML - Compilers / Backend

We are seeking a software engineer to support the development of the compiler to...

Location

United States, Sunnyvale

Salary:

181000.00 USD / Year

Staff Software Engineer, GPU Infrastructure (HPC)

The internal infrastructure team is responsible for building world-class infrast...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment

Job Responsibility

Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Software Engineer, Systems ML - Compilers / Backend

We are seeking a software engineer to support the development of the compiler to...

Location

United States, Sunnyvale

Salary:

217000.00 USD / Year

Principal Research Engineer - Agent 365

Copilot usage is growing rapidly across Microsoft 365 and custom agent experienc...

Location

United States, Redmond

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Architect and deliver AI systems across model development, data, infra, evaluation, and deployment spanning multiple product lines
Set technical direction for large programs; drive alignment across Research, Engineering, and Product
Integrate LLMs, multimodal models, multi-agent architectures, and RAG into Microsoft’s ecosystem
Establish standards for MLOps, governance, and Responsible AI, compliant with Microsoft principles and industry standards
Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities
Research Translation: Continuously review emerging work; identify high-potential methods and adapt them to Microsoft problem spaces
Production Integration: Turn research prototypes into production-quality code optimized for scale, latency, and maintainability

Fulltime

Senior Research Engineer

We are seeking a highly skilled Senior Research Engineer to collaborate closely ...

Location

United States

Salary:

210000.00 - 309000.00 USD / Year

Assembly

Expiration Date

Until further notice

Requirements

Strong expertise in the Python ecosystem and major ML frameworks (PyTorch, JAX)
Experience with lower-level programming (C++ or Rust preferred)
Deep understanding of GPU acceleration (CUDA, profiling, kernel-level optimization)
TPU experience is a strong plus
Proven ability to accelerate deep learning workloads using compiler frameworks, graph optimizations, and parallelization strategies
Solid understanding of the deep learning lifecycle: model design, large-scale training, data processing pipelines, and inference deployment
Strong debugging, profiling, and optimization skills in large-scale distributed environments
Excellent communication and collaboration skills, with the ability to clearly prioritize and articulate impact-driven technical solutions

Job Responsibility

Investigate and mitigate performance bottlenecks in large-scale distributed training and inference systems
Develop and implement both low-level (operator/kernel) and high-level (system/architecture) optimization strategies
Translate research models and prototypes into highly optimized, production-ready inference systems
Explore and integrate inference compilers such as TensorRT, ONNX Runtime, AWS Neuron and Inferentia, or similar technologies
Design, test, and deploy scalable solutions for parallel and distributed workloads on heterogeneous hardware
Facilitate knowledge transfer and bidirectional support between Research and Engineering teams, ensuring alignment of priorities and solutions

What we offer

competitive equity grants
100% employer-paid benefits
flexibility of being fully remote

Fulltime

Geoint Systems Engineer

Reinventing Geospatial (RGi) is a leading expert in geospatial solutions for Def...

Location

United States, Aberdeen Proving Grounds; Alexandria

Salary:

Not provided

Reinventing Geospatial

Expiration Date

Until further notice

Requirements

Active Top Secret clearance with an ability to obtain SCI access and willingness to obtain CI Polygraph
US Citizenship Required
Experience with installation, configuration, security hardening, operation, maintenance, and troubleshooting of: Windows operating systems (Server and Desktop environments), Linux operating systems (RHEL, CentOS, Ubuntu, or similar distributions)
Proficiency in managing and troubleshooting enterprise software including: Web servers (Apache, Nginx, IIS), Database systems (PostgreSQL, SQL Server, MySQL, Oracle), Web applications and services, Middleware and application servers
Strong scripting and automation capabilities with knowledge of: General programming paradigms including data types, control flow structures, and logic constructs; PowerShell, Python, and Bash/Shell scripting
Experience with REST API technologies including: Understanding of HTTP methods (GET, POST, PUT, DELETE, PATCH) and the ability to automate API interactions for system integration and operations, JSON/XML data handling
Comprehensive understanding of networking fundamentals: Network protocols (TCP, UDP, multicast, unicast), File sharing protocols (SMB, NFS), IP addressing schemes (IPv4/IPv6) and subnet calculations, Routing concepts and implementation, OSI model and troubleshooting methodology
Experience with network troubleshooting tools and techniques
Knowledge of system hardware architecture for selection, suitability analysis, operation, and troubleshooting: RAID configurations (0, 1, 5, 6, 10), HDD vs. SSD performance characteristics, SAN architecture and management, CPU architectures and performance considerations, RAM capacity and speed requirements, GPU capabilities for geospatial processing workloads
Ability to perform hardware capacity planning and performance optimization

Job Responsibility

Support the installation, configuration, operation, and maintenance of geospatial software systems
Utilize technical expertise across operating systems, enterprise applications, automation technologies, and hardware infrastructure to ensure mission-critical geospatial capabilities remain operational and secure
Analyze system capabilities with AGE and COE compliance requirements and identify gaps
Maintain functional specifications that define essential technical requirements of Legacy DCGS-A, IS&A, Mission Command, and COE CPCE
Maintain system engineering documentation including the System Engineering Plan and Software Requirements Traceability Matrix
Cross-reference mapping of GEOINT functional specifications to Intelligence or Mission Command Systems specifications and program-level documents, such as the Capabilities Production Document (CPD), Information Systems Interface Control Document (IS-ICD), and Requirements Definition Package (RDP)
Interact with systems users to translate their requirements into systems, hardware, and software requirements and design
Plan and perform engineering research, design development, and other assignments in conformance with design, engineering and customer specifications
Lead team of engineers through project completion; responsible for major technical/engineering projects of higher complexity

What we offer

100% paid employee healthcare & dental insurance
Paid parental leave
401k with matching
Escalating vacation time
Referral bonuses
Tuition reimbursement
Professional development training
Free beverages and snacks
Weekly catered lunches and breakfast on Fridays

Fulltime

Research Intern - Systems For Efficient AI

Research Internships at Microsoft provide a dynamic environment for research car...

Location

United States, Redmond

Salary:

6710.00 - 13270.00 USD / Month

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Accepted or currently enrolled in a PhD program in Computer Science, Software Engineering, Electrical Engineering, or a related STEM field
Experience with LLM architectures, systems for LLM inference, and/or AI hardware
Experience with GPUs and understanding of CUDA/ROCm frameworks
Experience with computer systems and/or networks
Experience in conducting research and writing peer-reviewed publications
Proficient written and verbal communication skills
Be able to work in a cross-functional and multi-disciplinary setting across research and product
Proficient software development skills, preferably in C++ and Python

Job Responsibility

Research Interns put inquiry and theory into practice
Learn, collaborate, and network for life
Advance their own careers and contribute to exciting research and development strides
Paired with mentors and expected to collaborate with other Research Interns and researchers
Present findings
Contribute to the vibrant life of the community

Fulltime

Founding GPU Compiler Engineer

We're hiring a Founding GPU Compiler Engineer to build the core compilation infr...

Location

United States, San Francisco

Salary:

285000.00 - 315000.00 USD / Year

YC Work at a Startup

Expiration Date

Until further notice

Requirements

Deep experience with compiler infrastructure (LLVM, MLIR, or similar)
Strong background in GPU architecture and low-level optimization (CUDA, ROCm, or equivalent)
Hands-on experience with at least one of: PTX/SASS, GCN/RDNA assembly, or other GPU ISAs
Familiarity with ML compiler stacks (XLA, TVM, Triton, torch.compile, or similar)
Solid systems programming skills in C++ and/or Rust
Proven track record of building production-grade compiler infrastructure

Job Responsibility

Design and implement the main compilation pipeline, from StableHLO to executable GPU and host binaries
Build and extend MLIR dialects and passes to optimize AI workloads
Develop backend code generation for multiple targets (NVIDIA PTX/SASS, AMD GCN/RDNA, Trainium, TPU)
Implement classic compiler optimizations customized for large-scale training (fusion, tiling, memory planning, scheduling)
Build search-based compiler infrastructure to explore different optimization options
Create hybrid codegen paths for cases where direct MLIR lowering isn't practical
Set up testing, benchmarking, and performance regression systems
Work closely with ML researchers to understand workload characteristics and find optimization opportunities

What we offer

bonus
equity
benefits
relocation assistance

Fulltime
