CrawlJobs Logo

Staff Software Engineer, Slurm

crusoe.ai Logo

Crusoe

Location Icon

Location:
United States , San Francisco

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

185000.00 - 224000.00 USD / Year

Job Description:

We are actively seeking an exceptional Staff Software Engineer to join our cloud software team, focusing specifically on building and operating Slurm as a fully managed cloud service within Crusoe Cloud. This role is crucial for delivering next-generation orchestration capabilities to power GPU-accelerated and high-performance computing (HPC) at scale. Your expertise will be instrumental in designing and scaling our carbon-reducing operating model, and advancing our AI training clusters to lead the industry in reliability and performance. You will shape the technical direction of systems that allow customers to run advanced workloads across CPUs, NVIDIA and AMD GPUs, and high-performance networking environments. You will be involved in writing and reviewing code, contributing to proposals, and drafting architecture documents. You will evaluate tools and frameworks, considering their impact on reliability, scalability, operational costs, and ease of adoption.

Job Responsibility:

  • Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling
  • Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap
  • Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters
  • Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation
  • Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm
  • Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services
  • Develop scalable systems to compete with leading managed services
  • Support the development of your peers by sharing knowledge and providing guidance in technical discussions

Requirements:

  • 7+ years of experience working in software engineering, with strong experience in Systems Engineering
  • Experience in distributed systems, cloud, or HPC environments is a must
  • 2+ years of programming experience in GoLang
  • Strong proficiency in other systems languages (Rust, C++, Python for HPC tooling) is also beneficial
  • Extensive experience with Kubernetes and Linux Engineering and debugging
  • Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments
  • Skilled in infrastructure as code and familiar with systems-level challenges, ideally with experience utilizing Terraform
  • Understand Argo, CI/CD, and Automated Testing pipelines
  • Can design system architecture, taking ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards
  • Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals
  • Familiarity with GPU integration in Kubernetes, including device plugins and GPU operators
  • Excellent communication skills, both verbal and written
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company paid commuter benefit
  • $300 per month

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Staff Software Engineer, Slurm

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...
Location
Location
Canada , Vancouver
Salary
Salary:
190000.00 - 240000.00 CAD / Year
inworld.ai Logo
Inworld AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years of software engineering experience
  • 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting languages such as Golang, Python, and Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
Job Responsibility
Job Responsibility
  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles
What we offer
What we offer
  • equity
  • benefits
  • Fulltime
Read More
Arrow Right

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...
Location
Location
United States , Mountain View
Salary
Salary:
180000.00 - 280000.00 USD / Year
inworld.ai Logo
Inworld AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years of software engineering experience, with 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting languages such as Golang, Python, and Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
  • In-office location: Mountain View, CA, United States. You must be available for hybrid work
Job Responsibility
Job Responsibility
  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles
What we offer
What we offer
  • equity and benefits
  • Fulltime
Read More
Arrow Right
New

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...
Location
Location
Salary
Salary:
Not provided
cohere.com Logo
Cohere
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as JAX, Pytorch and XLA/MLIR
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands on experience on training large model at scale and having contributed to the tooling and/or setup of the training infrastructure
Job Responsibility
Job Responsibility
  • Design and write high-performant and scalable software for training
  • Improve our training setup from an infrastructure and codebase performance standpoint
  • Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
  • Research, implement, and experiment with ideas on our supercompute and data infrastructure
  • Learn from and work with the best researchers in the field
What we offer
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
  • Fulltime
Read More
Arrow Right
New

Member of Technical Staff, Post-Training

Advance the state of the art for model post training, ship state of the art mode...
Location
Location
Salary
Salary:
Not provided
cohere.com Logo
Cohere
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as JAX, Pytorch and XLA/MLIR
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands on experience on training large model at scale
  • Hands on experience with the post training phase of model training, with a strong emphasis on performance optimisation
Job Responsibility
Job Responsibility
  • Design and write high-performant and scalable software for training models
  • Consistently post-train the models to reach SOTA level performance
  • Coordinate with other specialist teams (Agentic, Code…) to produce models that have strong all encompassing performance
  • Craft and implement techniques to improve the performance and results of our training cycles both on the SFT and the RL regime
  • Research, implement, and experiment with ideas on our supercompute and data infrastructure
  • Learn from and work with the best researchers in the field
What we offer
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
  • Fulltime
Read More
Arrow Right
New

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...
Location
Location
United States , San Francisco; Sunnyvale
Salary
Salary:
180000.00 - 220000.00 USD / Year
crusoe.ai Logo
Crusoe
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
  • Advanced Linux systems expertise
  • Deep Kubernetes operational experience (CKA-level or higher)
  • Strong networking knowledge: Infiniband, RDMA, RoCE, SDN
  • Experience supporting AI/ML workloads at scale (GPU clusters)
  • Proven track record of resolving multi-layer, distributed system failures
  • Strong customer communication and executive-facing presence
Job Responsibility
Job Responsibility
  • Serve as highest-level escalation point for complex P1/P0 incidents
  • Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
  • Partner with SRE, Software teams (Storage, Networking, Compute, K8) to design systemic fixes rather than recurring workarounds
  • Design and improve node validation, burn-in processes, performance baselining, and release readiness
  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
  • Reduce MTTR and incident recurrence through structural improvements
  • Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
  • Support complex AI workloads (training + inference) with performance tuning and observability improvements
  • Act as senior technical advisor during high-risk customer incidents
  • Deliver executive-ready RCAs with clarity and confidence
What we offer
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime
Read More
Arrow Right

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location
Location
United States
Salary
Salary:
115500.00 - 266000.00 USD / Year
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of professional experience, with at least 3+ in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be United States Citizen due to the responsibilities and requirements of the role as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with Network boot technologies (PXE or gPXE/Etherboot etc)
  • Storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)
Job Responsibility
Job Responsibility
  • Lead the technical implementation design and delivery of world class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)
What we offer
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime
Read More
Arrow Right
New

Store Associate

Retail Store Associates play a meaningful role within the CVS Health family. At ...
Location
Location
United States , Terrell
Salary
Salary:
15.00 - 19.00 USD / Hour
https://www.cvshealth.com/ Logo
CVS Health
Expiration Date
March 05, 2026
Flip Icon
Requirements
Requirements
  • At least 16 years of age
  • Physical Requirements: Remaining upright on the feet, particularly for sustained periods of time
  • Lifting and exerting up to 35 lbs of force occasionally, up to 10 lbs of force frequently, and a negligible amount of force regularly to move objects to and from, including overhead lifting
  • Visual Acuity - Having close visual acuity to perform activities such as: viewing a computer terminal, reading, visual inspection involving small parts/details
Job Responsibility
Job Responsibility
  • Providing differentiated customer service by anticipating customer needs, demonstrating compassion and care in all interactions, and actively identifying and resolving potential service issues
  • Focusing on the customer by giving a warm and friendly greeting, maintaining eye contact and offering help locating additional items, when needed
  • Accurately perform cashier duties - handling cash, checks and credit card transactions with precision while following company policies and procedures
  • Maintaining the sales floor by restocking shelves, checking in vendors, updating pricing information and completing inventory management tasks as directed by store manager
  • Supporting opening and closing store activities, when needed
  • Providing customer support to all departments, including photo and beauty, ensuring departments are fully stocked and operational while remaining current with all updated services and tools
  • Assisting pharmacy personnel when needed, including working regular shifts in the pharmacy as part of opportunities for growth and career development
  • Embracing and advocating for new CVS services and loyalty programs that support our purpose of helping people on their path to better health
What we offer
What we offer
  • Affordable medical plan options
  • 401(k) plan (including matching company contributions)
  • Employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
  • Tuition assistance
  • Parttime
Read More
Arrow Right
New

Certified Nurse Aide

Tennessee Quality Care provides personalized medical care to help people live th...
Location
Location
United States , Kenton
Salary
Salary:
Not provided
arcadiahomecare.com Logo
Arcadia Home Care and Staffing - an Addus family company
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Certified by the state as a nurse aid
  • Meets the training requirements and successfully completes them
  • At least 18 years of age
  • Possess and maintains current CPR certification, or another certification process as deemed appropriate by agency, state, and federal regulations
  • Must have reliable transportation, current driver's license, and appropriate automobile insurance
Job Responsibility
Job Responsibility
  • Providing hands-on personal care including Baths, Oral hygiene, Shampoos, changing bed linen, assisting with dressing and undressing, skin care and prevention of skin breakdown, toileting assistance, keeping living area clean and orderly as needed, and other duties within their scope of practice assigned by supervising clinician
  • Taking and recording oral, temporal, and axillary temperatures, pulse, respiration, pulse oximetry, and blood pressure when ordered (with appropriate completed/demonstrated skills competency)
  • Performing simple procedures as an extension of nursing services as ordered (with appropriate completed/demonstrated skills competency)
  • Assisting patients in the self-administration of medication
  • Reporting on patient’s condition and significant changes to the assigned nurse or case manager
What we offer
What we offer
  • Great culture and team atmosphere
  • Comprehensive benefits, including medical, dental, and vision, effective on the first of the month
  • 401(k) retirement plan with a generous company match
  • Generous time off accruals
  • Paid holidays
  • Mileage reimbursement
  • Tuition Reimbursement
  • Employee Referral Program
  • Merit Increases
  • Employee Discount Programs
Read More
Arrow Right