Staff Software Engineer, GPU Infrastructure (HPC)

Cohere

Contract Type: Not provided

Salary: Not provided

Job Description:

The internal infrastructure team is responsible for building world-class infrastructure and tools used to train, evaluate and serve Cohere's foundational models. By joining our team, you will work in close collaboration with AI researchers to support their AI workload needs on the cutting edge, with a strong focus on stability, scalability, and observability. You will be responsible for building and operating superclusters across multiple clouds. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform North.

Job Responsibility:

  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence

Requirements:

  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment

What we offer:

  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)

Additional Information:

Job Posted: February 20, 2026
Employment Type: Full-time
Work Type: Hybrid work

Similar Jobs for Staff Software Engineer, GPU Infrastructure (HPC)

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location: United States, Redmond
Salary: 163000.00 - 296400.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice

Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Job Responsibility:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate

Employment Type: Full-time

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location: United States
Salary: 115500.00 - 266000.00 USD / Year
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice

Requirements:
  • 8+ years of professional experience, with at least 3 in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be a United States citizen, as the role supports a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with network boot technologies (PXE, gPXE/Etherboot, etc.)
  • Storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)

Job Responsibility:
  • Lead the technical implementation design and delivery of world class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • Actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)

What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing

Employment Type: Full-time

Staff Software Engineer, Slurm

We are actively seeking an exceptional Staff Software Engineer to join our cloud...
Location: United States, San Francisco
Salary: 185000.00 - 224000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice

Requirements:
  • 7+ years of experience working in software engineering, with strong experience in Systems Engineering
  • Experience in distributed systems, cloud, or HPC environments is a must
  • 2+ years of programming experience in GoLang
  • Strong proficiency in other systems languages (Rust, C++, Python for HPC tooling) is also beneficial
  • Extensive experience with Kubernetes and Linux Engineering and debugging
  • Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments
  • Skilled in infrastructure as code and familiar with systems-level challenges, ideally with experience utilizing Terraform
  • Understand Argo, CI/CD, and Automated Testing pipelines
  • Can design and take ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards
  • Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals

Job Responsibility:
  • Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling
  • Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap
  • Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters
  • Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation
  • Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm
  • Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services
  • Develop scalable systems to compete with leading managed services
  • Support the development of your peers by sharing knowledge and providing guidance in technical discussions

What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement

Employment Type: Full-time

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location: United States, Mountain View
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice

Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas:
      • AI accelerator or GPU architectures
      • Distributed systems and large-scale AI training/inference
      • High-performance computing (HPC) and collective communications
      • ML systems, runtimes, or compilers
      • Performance modeling, benchmarking, and systems analysis
      • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.

Job Responsibility:
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.

Employment Type: Full-time

Member of Technical Staff, Hardware Health

Microsoft AI operates one of the world’s most advanced AI training infrastructur...
Location: United States, Mountain View
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice

Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
  • Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
  • Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
  • Experience with exascale-class systems or cloud-scale AI clusters.
  • Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
  • Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.

Job Responsibility:
  • Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
  • Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
  • Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
  • Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
  • Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
  • Drive automation in health management to reduce manual intervention to the top 5% of anomalies.
  • Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.

Employment Type: Full-time

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...
Location: United States, San Francisco; Sunnyvale
Salary: 180000.00 - 220000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice

Requirements:
  • 8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
  • Advanced Linux systems expertise
  • Deep Kubernetes operational experience (CKA-level or higher)
  • Strong networking knowledge: InfiniBand, RDMA, RoCE, SDN
  • Experience supporting AI/ML workloads at scale (GPU clusters)
  • Proven track record of resolving multi-layer, distributed system failures
  • Strong customer communication and executive-facing presence

Job Responsibility:
  • Serve as highest-level escalation point for complex P1/P0 incidents
  • Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
  • Partner with SRE and Software teams (Storage, Networking, Compute, K8s) to design systemic fixes rather than recurring workarounds
  • Design and improve node validation, burn-in processes, performance baselining, and release readiness
  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
  • Reduce MTTR and incident recurrence through structural improvements
  • Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
  • Support complex AI workloads (training + inference) with performance tuning and observability improvements
  • Act as senior technical advisor during high-risk customer incidents
  • Deliver executive-ready RCAs with clarity and confidence

What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement

Employment Type: Full-time

Senior Lecturer/Associate Professor in Literacy

As a Senior Lecturer / Associate Professor in Literacy, you will play a key role...
Location: Australia, Albury-Wodonga, Bathurst, Port Macquarie, Wagga Wagga
Salary: Not provided
Company: Charles Sturt University
Expiration Date: June 08, 2026

Requirements:
  • A doctoral qualification relevant to literacy or education, with a recognised teaching qualification
  • A strong record of high-quality teaching and student-centred learning
  • An established or emerging research profile aligned to literacy, curriculum or pedagogy
  • The ability to build productive partnerships and contribute to academic leadership

Job Responsibility:
  • Lead impactful literacy teaching and research
  • Teach across online and on-campus environments
  • Shape future teachers and education practice
  • Contribute to curriculum innovation
  • Build strong relationships with students and partners
  • Provide academic leadership in literacy education
  • Contribute to the School's research profile
  • Supervise higher degree research students
  • Actively engage with professional, community and government stakeholders
  • At Associate Professor level: significant academic leadership, research impact, and contribution to the broader discipline at national/international level

What we offer:
  • 17% superannuation

Employment Type: Full-time

Program Manager - Controls and Avionics Solutions

This position is based in Endicott, New York. New York and on-site work will be ...
Location: United States, Endicott
Salary: 120874.00 - 205486.00 USD / Year
Company: BAE Systems
Expiration Date: Until further notice

Requirements:
  • Bachelor’s degree in engineering, engineering or manufacturing management, or other discipline
  • Demonstrated ability to build strong customer/stakeholder relationships
  • Strong communication, negotiation, and presentation skills
  • Ability to interpret data and make data-driven decisions
  • Highly adaptable with strong initiative
  • Demonstrated ability to lead and motivate cross-functional teams
  • Knowledge of the global aviation market and regulatory requirements and/or the military aviation market

Job Responsibility:
  • Maintaining strong customer relationships and leading a multidisciplinary team to execute complex development programs within schedule and budget
  • Leadership and management oversight of a project team, ensuring that the project's financial, schedule, and technical objectives are met and that the highest level of customer satisfaction is achieved while meeting all contractual commitments
  • Work effectively and collaboratively with Engineering, Operations, and all Program Office functional leadership to assure deliveries continue to exceed customer commitments and achievement of financial commitments to the company
  • Manages, coordinates, plans, organizes, controls, integrates, and executes projects within the Military Aircraft Systems portfolio
  • Participates in the support of new business and in the development of proposals

What we offer:
  • Health insurance
  • Dental insurance
  • Vision insurance
  • Health savings accounts
  • 401(k) savings plan
  • Disability coverage
  • Life and accident insurance
  • Employee assistance program
  • Legal plan
  • Discounts on home, auto, and pet insurance

Employment Type: Full-time