Staff Software Engineer, GPU Infrastructure (HPC)

Cohere

Location:

Contract Type:
Not provided

Salary:

Not provided

Job Description:

The internal infrastructure team is responsible for building world-class infrastructure and tools used to train, evaluate, and serve Cohere's foundational models. By joining our team, you will work in close collaboration with AI researchers to support their cutting-edge AI workload needs, with a strong focus on stability, scalability, and observability. You will be responsible for building and operating superclusters across multiple clouds. Your work will directly accelerate the development of industry-leading AI models that power Cohere's platform, North.

Job Responsibility:

  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence

Requirements:

  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment

What we offer:

  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)

Additional Information:

Job Posted:
February 20, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Staff Software Engineer, GPU Infrastructure (HPC)

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location:
United States
Salary:
115500.00 - 266000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 8+ years of professional experience, with at least 3 in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be a United States citizen due to the responsibilities and requirements of the role, which supports a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with network boot technologies (PXE, gPXE/Etherboot, etc.)
  • Storage-specific knowledge: LVM, RAID, iSCSI, disk partitioning (GPT, MBR)
Job Responsibility:
  • Lead the technical implementation design and delivery of world-class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • Actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
Employment Type:
Fulltime

Staff Software Engineer, Slurm

We are actively seeking an exceptional Staff Software Engineer to join our cloud...
Location:
United States, San Francisco
Salary:
185000.00 - 224000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience working in software engineering, with strong experience in Systems Engineering
  • Experience in distributed systems, cloud, or HPC environments is a must
  • 2+ years of programming experience in GoLang
  • Strong proficiency in other systems languages (Rust, C++, Python for HPC tooling) is also beneficial
  • Extensive experience with Kubernetes and Linux Engineering and debugging
  • Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments
  • Skilled in infrastructure as code and familiar with systems-level challenges, ideally with experience utilizing Terraform
  • Understanding of Argo, CI/CD, and automated testing pipelines
  • Can design and take ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards
  • Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals
Job Responsibility:
  • Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling
  • Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap
  • Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters
  • Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation
  • Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm
  • Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services
  • Develop scalable systems to compete with leading managed services
  • Support the development of your peers by sharing knowledge and providing guidance in technical discussions
What we offer:
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment Type:
Fulltime

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures; distributed systems and large-scale AI training/inference; high-performance computing (HPC) and collective communications; ML systems, runtimes, or compilers; performance modeling, benchmarking, and systems analysis; hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibility:
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
Employment Type:
Fulltime

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...
Location:
United States, San Francisco; Sunnyvale
Salary:
180000.00 - 220000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
  • Advanced Linux systems expertise
  • Deep Kubernetes operational experience (CKA-level or higher)
  • Strong networking knowledge: InfiniBand, RDMA, RoCE, SDN
  • Experience supporting AI/ML workloads at scale (GPU clusters)
  • Proven track record of resolving multi-layer, distributed system failures
  • Strong customer communication and executive-facing presence
Job Responsibility:
  • Serve as highest-level escalation point for complex P1/P0 incidents
  • Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
  • Partner with SRE and Software teams (Storage, Networking, Compute, K8s) to design systemic fixes rather than recurring workarounds
  • Design and improve node validation, burn-in processes, performance baselining, and release readiness
  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
  • Reduce MTTR and incident recurrence through structural improvements
  • Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
  • Support complex AI workloads (training + inference) with performance tuning and observability improvements
  • Act as senior technical advisor during high-risk customer incidents
  • Deliver executive-ready RCAs with clarity and confidence
What we offer:
  • Restricted Stock Units in a fast-growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment Type:
Fulltime

Project Controls Coordinator III

Under the direction of the Supervisor Project Controls, the Analyst will perform...
Location:
Canada, North York
Salary:
55.00 - 58.00 CAD / Hour
Randstad
Expiration Date:
June 03, 2026
Requirements:
  • Four Year Degree or combination of education and related experience
  • Minimum of 5 years of Project Controls or Project Management experience
  • Project Management professional designation is preferred
  • Strong analytical skills, including experience with Earned Value Management
  • An independent worker within a team setting
  • Demonstrated professional engagement at a high level with work group, stakeholders, and contractors in a team setting
  • Proficient in the use of SAP, Oracle, and the MS Office suite, with intermediate or better Excel skills
  • Excellent communication, interpersonal, and organizational skills
  • Ability to effectively manage and prioritize workload, bring issues forward and develop working relationships at all levels of the organization
  • Detail-oriented, with an understanding of the importance of data reconciliation
Job Responsibility:
  • Analyze and maintain the project costs at the WBS level including control budget, incurred costs, commitments, and forecast
  • Provide the project team with accurate and timely cost information and reporting
  • Perform earned value measurements to anticipate forecast impacts
  • Perform monthly project close processes and prepare monthly project reports and comparative capital cost estimates for the project in Excel and EcoSys
  • Prepare and document project change orders in a timely manner, in accordance with Project Management Office standards
  • Engage the Project Managers in meetings and discussions to review and reforecast project costs
  • Review cost transactions to ensure accurate project costs
  • Communicate with the larger Controls team for the project
  • Liaise with Project Managers and Field Cost Analysts to ensure engagement with the project progress, changes, highlights and issues
  • Maintain the project Work Breakdown Structure such that it facilitates project execution and cost control during project execution and meets accounting requirements for asset creation and project closeout
Employment Type:
Fulltime

Mechanical Engineer - Energy Solutions

Join a Team of engineers dedicated to working hand-in-hand with large manufactur...
Location:
Canada, North York
Salary:
65.00 - 68.00 CAD / Hour
Randstad
Expiration Date:
May 09, 2026
Requirements:
  • Engineering degree preferred, ideally in Chemical or Mechanical Engineering
  • Membership in Professional Engineers of Ontario or similar professional organization is preferred
  • Proven skills in leading and influencing without explicit authority, and in time management
  • Ability to work independently and within a team of like-minded professionals
  • Valid driver’s license with a responsible driving record is needed
Job Responsibility:
  • Identify new contacts at large manufacturing facilities for the purpose of arranging site visits
  • Attend joint-site visits with team members to support in the identification and quantification of potential energy savings projects
  • Balance multiple priorities: Able to effectively manage time and priorities, consistently delivering on firm annual savings targets
  • Quantify impact and secure buy-in: Build technical savings calculations, sometimes from scratch, to support project justification and persuade key stakeholders on execution of work
  • Provide solutions to complex problems: expertly analyze complex operations across various industries and synthesize available information to create solutions equally appealing to business and technical people
  • Forge long-term customer relationships: build and nurture professional relationships founded on unwavering trust and mutual respect, being a first-choice energy efficiency partner for your customers
  • Continuous growth and curious mindset: Proactively identify new savings opportunities to drive both short and long-term work and build a sales funnel for sustained growth
  • Drive results autonomously while thriving in a collaborative environment: Play a supporting role in managing a small customer base, and integrate into and support the broader team to achieve personal and collective objectives
What we offer:
  • Hybrid Work Model: in-office (Monday, Tuesday & Thursday), remote (Wednesday & Friday)

Senior Hardware Compliance Designer

Are you an experienced Electrical Engineer with a deep understanding of Electrom...
Location:
Canada, North York, Ontario
Salary:
50.00 - 58.00 CAD / Hour
Randstad
Expiration Date:
May 09, 2026
Requirements:
  • Bachelor’s degree in Electrical Engineering or an equivalent combination of diplomas and work experience
  • 6 to 10+ years of professional experience in hardware qualification or testing
  • Excellent knowledge of core EMC concepts, including Cabling, Grounding techniques, and Surge Protection
  • Demonstrated familiarity with railway (or similar industry) EMC and environmental standards
  • Proficiency in operating hardware test equipment and simulators used in the qualification process
  • Professional verbal and effective writing skills to ensure clear, concise, and unambiguous technical documentation and presentations
  • High degree of professionalism, strong ability to follow directions, use sound judgment, and keep meticulous track of all assigned tasks
Job Responsibility:
  • Leading the hardware qualification, certification, and verification activities for new and revised products from planning through to customer acceptance
  • Conducting in-depth Electromagnetic Compatibility (EMC) tests on electronic devices according to industry standards (e.g., EN 61000 series, EN 50121 series, FCC, AREMA, MIL standards)
  • Performing environmental qualification tests (e.g., temperature, humidity, shock & vibration) on signaling electronic devices as per standards like EN 50125 and EN 50155
  • Participating in troubleshooting EMC-related issues and providing technical data and measurements to the design team
  • Preparing and driving comprehensive project documentation to customer acceptance, including the EMC Control Plan, EMC Compatibility Study, EMC Hazard Analysis, and EMC Test Report
  • Preparing hardware environmental qualification plans and reports, understanding the impact of requirements, and providing feedback on different technical approaches
  • Leading and managing subcontractors for specific on-site EMC measurements
  • Participating in ISA committee audits related to EMC and environmental requirements
  • Mentoring and developing junior team members within the department
What we offer:
  • Impactful Work: Lead the qualification process for cutting-edge electronic devices used in safety-critical railway signaling systems
  • Technical Leadership: Be the in-house expert on EMC and environmental standards, influencing design and compliance strategies
  • Professional Development: Opportunities to develop junior team members and participate in high-level ISA committee audits
  • Global Exposure: Work with international standards and lead engagements with external subcontractors and customers for acceptance
Employment Type:
Fulltime

Customs Compliance Specialist

Are you a trade compliance expert with a specialized background in the aviation ...
Location:
Canada, Woodbridge, Ontario
Salary:
30.00 - 45.00 CAD / Hour
Randstad
Expiration Date:
May 15, 2026
Requirements:
  • Bachelor’s degree in Supply Chain, International Business, Aviation Management, or a related field (equivalent experience considered)
  • 5+ years of dedicated customs/trade compliance experience, specifically within aerospace, aviation repair, or a high-tech regulated industry
  • Deep proficiency in EAR, HTS coding, and aviation-specific documentation (8130-3, EASA Form 1, teardown documentation)
  • Experience with NAV (Microsoft Dynamics) or similar ERP systems for tariff code management is preferred
  • A U.S. Customs Broker License, CCS, or specialized aviation compliance certification is a significant asset
  • Strong analytical ability, meticulous attention to detail, and the communication skills necessary to influence cross-functional teams in multiple countries
Job Responsibility:
  • Ensure all aviation parts comply with U.S. Customs (CBP), EAR, FAA, and DOT regulations, as well as Canadian and UK (HMRC) requirements
  • Maintain expert-level accuracy for HTS codes, ECCN, country-of-origin, and valuation, specifically regarding airworthiness tags (8130-3/EASA Form 1)
  • Manage inbound consolidations and broker relationships (Expeditors/Freight Boy); resolve PARs issues; and audit FedEx/broker invoices to minimize duty spend
  • Manage Item Card tariff codes in NAV; file and track duty recovery disputes; and perform gap analyses to identify process improvements
  • Serve as the primary point of contact for USCBP, BIS, and Census Bureau; lead internal and external compliance audits and maintain rigorous record-keeping
What we offer:
  • 6-month contract with the opportunity to become permanent
  • Hybrid work environment
  • Play a pivotal role in a highly regulated international supply chain, acting as a bridge between North American operations and UK-based parent company compliance
  • Opportunity to spearhead high-value projects, including the exploration of bonded warehouse setups and process automation within your first 120 days
  • Work within the Aerospace industry that values technical expertise and regulatory mastery
Employment Type:
Fulltime