CrawlJobs Logo

Staff Software Engineer, GPU Infrastructure (HPC)

cohere.com Logo

Cohere

Location Icon

Location:

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

The internal infrastructure team is responsible for building world-class infrastructure and tools used to train, evaluate and serve Cohere's foundational models. By joining our team, you will work in close collaboration with AI researchers to support their AI workload needs on the cutting edge, with a strong focus on stability, scalability, and observability. You will be responsible for building and operating superclusters across multiple clouds. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform North.

Job Responsibility:

  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence

Requirements:

  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)

Additional Information:

Job Posted:
February 20, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Staff Software Engineer, GPU Infrastructure (HPC)

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location
Location
United States
Salary
Salary:
115500.00 - 266000.00 USD / Year
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of professional experience, with at least 3+ in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be United States Citizen due to the responsibilities and requirements of the role as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with Network boot technologies (PXE or gPXE/Etherboot etc)
  • Storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)
Job Responsibility
Job Responsibility
  • Lead the technical implementation design and delivery of world class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)
What we offer
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime
Read More
Arrow Right
New

Staff Software Engineer, Slurm

We are actively seeking an exceptional Staff Software Engineer to join our cloud...
Location
Location
United States , San Francisco
Salary
Salary:
185000.00 - 224000.00 USD / Year
crusoe.ai Logo
Crusoe
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years of experience working in software engineering, with strong experience in Systems Engineering
  • Experience in distributed systems, cloud, or HPC environments is a must
  • 2+ years of programming experience in GoLang
  • Strong proficiency in other systems languages (Rust, C++, Python for HPC tooling) is also beneficial
  • Extensive experience with Kubernetes and Linux Engineering and debugging
  • Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments
  • Skilled in infrastructure as code and familiar with systems-level challenges, ideally with experience utilizing Terraform
  • Understand Argo, CI/CD, and Automated Testing pipelines
  • Can design system architecture, taking ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards
  • Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals
Job Responsibility
Job Responsibility
  • Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling
  • Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap
  • Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters
  • Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation
  • Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm
  • Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services
  • Develop scalable systems to compete with leading managed services
  • Support the development of your peers by sharing knowledge and providing guidance in technical discussions
What we offer
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime
Read More
Arrow Right

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location
Location
United States , Mountain View
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures
  • Distributed systems and large-scale AI training/inference
  • High-performance computing (HPC) and collective communications
  • ML systems, runtimes, or compilers
  • Performance modeling, benchmarking, and systems analysis
  • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibility
Job Responsibility
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
  • Fulltime
Read More
Arrow Right
New

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...
Location
Location
United States , San Francisco; Sunnyvale
Salary
Salary:
180000.00 - 220000.00 USD / Year
crusoe.ai Logo
Crusoe
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
  • Advanced Linux systems expertise
  • Deep Kubernetes operational experience (CKA-level or higher)
  • Strong networking knowledge: Infiniband, RDMA, RoCE, SDN
  • Experience supporting AI/ML workloads at scale (GPU clusters)
  • Proven track record of resolving multi-layer, distributed system failures
  • Strong customer communication and executive-facing presence
Job Responsibility
Job Responsibility
  • Serve as highest-level escalation point for complex P1/P0 incidents
  • Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
  • Partner with SRE, Software teams (Storage, Networking, Compute, K8) to design systemic fixes rather than recurring workarounds
  • Design and improve node validation, burn-in processes, performance baselining, and release readiness
  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
  • Reduce MTTR and incident recurrence through structural improvements
  • Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
  • Support complex AI workloads (training + inference) with performance tuning and observability improvements
  • Act as senior technical advisor during high-risk customer incidents
  • Deliver executive-ready RCAs with clarity and confidence
What we offer
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime
Read More
Arrow Right
New

Senior Software Engineer- Backend

Microsoft Teams is the hub for modern collaboration—bringing together everything...
Location
Location
United States , Redmond
Salary
Salary:
119800.00 - 234700.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility
Job Responsibility
  • Design, develop, and operate high-scale services that power the core messaging infrastructure of Microsoft Teams
  • Apply advanced in‑house AI tools to streamline development workflows, accelerate delivery, and improve system scalability.
  • Dive deep into Azure technologies and distributed database systems. Collaborate with internal and external partners to design features that drive user growth and engagement.
  • Develop features that delight customers while upholding the highest standards of availability, reliability, performance, and scalability—never compromising engineering fundamentals.
  • Influence and define new designs, architectures, standards, and reusable service libraries that empower teams across Microsoft to build at scale.
  • Fulltime
Read More
Arrow Right
New

IT Desktop Support Engineer

The Desktop Support Engineer is responsible for high-quality end-user IT support...
Location
Location
Thailand , Bang Saotong, Samut Prakan
Salary
Salary:
Not provided
nttdata.com Logo
NTT DATA
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree or diploma in Information Technology, Computer Science, or equivalent experience
  • 3+ years of experience in Desktop Support or IT Support roles
  • Hands‑on experience with Windows operating systems, desktop hardware, and enterprise applications
  • Experience in hardware troubleshooting, OS installation, and software upgrades
  • Basic to intermediate knowledge of networking concepts (LAN, WAN, Wi‑Fi, VPN, TCP/IP)
  • Demonstrated leadership and team coordination skills
  • Strong communication, customer service, and client‑handling abilities
  • Proven analytical and problem‑solving skills
Job Responsibility
Job Responsibility
  • Act as a primary point of contact for client IT support requests
  • Diagnose, troubleshoot, and resolve hardware, software, operating system, and connectivity issues
  • Provide on‑site and remote desktop support to end users
  • Install, configure, maintain, and upgrade desktops, laptops, peripherals, operating systems, and applications
  • Provide preliminary network support for LAN, voice, and video‑related incidents, changes, and projects
  • Coordinate with internal IT teams and stakeholders to establish, modify, and support LAN/Voice/Video services
  • Advise users on appropriate hardware and software upgrades
  • Deliver basic user training on computer operations and applications
  • Complete job reports, maintain documentation, and manage ordering of IT supplies
  • Support and drive remediation activities related to information security incidents, audits, and assessments
  • Fulltime
Read More
Arrow Right
New

Robot Operator

Data collection is the fuel that drives our mission. Every robot movement, every...
Location
Location
United States , San Francisco
Salary
Salary:
25.00 USD / Hour
Physical Intelligence
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience with hands-on technical work, lab environments, or precision tasks
  • Interest in AI, robotics, and cutting-edge technology
  • Strong attention to detail and quality focus
  • Meticulous attention to detail—data quality is crucial for AI training
  • Good manual dexterity and hand-eye coordination
  • Enjoys repetitive, precision-focused work
  • Thrives in fast-paced, metrics-driven environments
  • Excited about contributing to breakthrough AI research
  • Collaborative mindset and strong work ethic
  • Ability to stand at a workstation for 8-hour shifts
Job Responsibility
Job Responsibility
  • Teleoperate robotic arms through a variety of tasks using our intuitive control systems
  • Either lead robot movements with your arms (the robot mirrors your actions) or guide robots using specialized controllers
  • Complete diverse tasks ranging from household activities like folding laundry to complex assembly work
  • Maintain high standards for data quality and consistency across all demonstrations
  • Meet established metrics for data collection volume and quality during your shift
  • Review and annotate videos of robot task performances using computer interfaces
  • Provide detailed feedback on robot performance and data quality
  • Assist with equipment setup and basic office tasks as needed
  • Participate in process improvements to enhance data collection efficiency
What we offer
What we offer
  • benefits package
  • Fulltime
Read More
Arrow Right
New

Software Engineer, Inference - Multi Modal

OpenAI’s Inference team powers the deployment of our most advanced models - incl...
Location
Location
United States , San Francisco
Salary
Salary:
295000.00 - 555000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience building and scaling inference systems for LLMs or multimodal models
  • Worked with GPU-based ML workloads and understand the performance dynamics of large models, especially with complex data like images or audio
  • Enjoy experimental, fast-evolving work and collaborating closely with research
  • Comfortable dealing with systems that span networking, distributed compute, and high-throughput data handling
  • Familiarity with inference tooling like vLLM, TensorRT-LLM, or custom model parallel systems
  • Own problems end-to-end and are excited to operate in ambiguous, fast-moving spaces
Job Responsibility
Job Responsibility
  • Design and implement inference infrastructure for large-scale multimodal models
  • Optimize systems for high-throughput, low-latency delivery of image and audio inputs and outputs
  • Enable experimental research workflows to transition into reliable production services
  • Collaborate closely with researchers, infra teams, and product engineers to deploy state-of-the-art capabilities
  • Contribute to system-level improvements including GPU utilization, tensor parallelism, and hardware abstraction layers
What we offer
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime
Read More
Arrow Right