Staff Software Engineer, Slurm Job at Crusoe (San Francisco)

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...

Location

United States, San Francisco; Sunnyvale

Salary:

180000.00 - 220000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
Advanced Linux systems expertise
Deep Kubernetes operational experience (CKA-level or higher)
Strong networking knowledge: InfiniBand, RDMA, RoCE, SDN
Experience supporting AI/ML workloads at scale (GPU clusters)
Proven track record of resolving multi-layer, distributed system failures
Strong customer communication and executive-facing presence

Job Responsibility

Serve as highest-level escalation point for complex P1/P0 incidents
Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
Partner with SRE and Software teams (Storage, Networking, Compute, K8s) to design systemic fixes rather than recurring workarounds
Design and improve node validation, burn-in processes, performance baselining, and release readiness
Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
Reduce MTTR and incident recurrence through structural improvements
Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
Support complex AI workloads (training + inference) with performance tuning and observability improvements
Act as senior technical advisor during high-risk customer incidents
Deliver executive-ready RCAs with clarity and confidence

What we offer

Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement

Fulltime

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Extremely strong software engineering skills
Proficiency in Python and related ML frameworks such as JAX, PyTorch and XLA/MLIR
Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
Experience using large-scale distributed training strategies
Hands-on experience training large models at scale and contributing to the tooling and/or setup of the training infrastructure

Job Responsibility

Design and write high-performance, scalable software for training
Improve our training setup from an infrastructure and codebase performance standpoint
Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
Research, implement, and experiment with ideas on our supercompute and data infrastructure
Learn from and work with the best researchers in the field

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

HPC Linux System Administrator

HPC Linux System Administrator. High Performance Computing, AI and Labs is a cri...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor's or Master's engineering degree in Computer Science, Information Systems
Typically 4-8 years experience
Strong proficiency in Linux/Unix administration (installation, configuration, tuning, troubleshooting)
Experience managing HPC clusters (e.g., HPE Cray, Slurm, PBS, LSF)
Solid understanding of networking fundamentals (TCP/IP, DNS, DHCP, VLANs)
Experience with storage management systems such as NFS, Lustre, or GPFS
Hands-on experience in hardware diagnostics and maintenance
Familiarity with system monitoring tools such as Prometheus, Grafana, or Nagios
Working knowledge of containerization (Docker, Singularity) and virtualization technologies is a plus
Proficiency in shell scripting (Bash)

Job Responsibility

Must be hands-on: able to develop a solid understanding of the Linux system and to test it
Manage and maintain HPC clusters, including installation, configuration, and optimization of compute and management nodes
Administer Linux/Unix-based systems, ensuring high availability, performance, and security
Perform system imaging, software provisioning, and configuration management using tools such as Ansible
Conduct hardware troubleshooting and coordinate with vendors or internal teams for hardware repairs and replacements
Oversee lab systems used for development, testing, and release validation in HPC environments
Manage storage systems (NFS, Lustre, GPFS, RAID) and ensure efficient data flow across the HPC environment
Monitor system performance, perform regular health checks, and implement preventive maintenance measures
Apply OS, firmware, and security updates to maintain system stability and compliance
Develop and maintain automation scripts (using Bash, Python, or Ansible) to improve operational efficiency

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Member of Technical Staff, Post-Training

Advance the state of the art for model post training, ship state of the art mode...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Extremely strong software engineering skills
Proficiency in Python and related ML frameworks such as JAX, PyTorch and XLA/MLIR
Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
Experience using large-scale distributed training strategies
Hands-on experience training large models at scale
Hands-on experience with the post-training phase of model training, with a strong emphasis on performance optimisation

Job Responsibility

Design and write high-performance, scalable software for training models
Consistently post-train the models to reach SOTA-level performance
Coordinate with other specialist teams (Agentic, Code…) to produce models that have strong all-encompassing performance
Craft and implement techniques to improve the performance and results of our training cycles in both the SFT and RL regimes
Research, implement, and experiment with ideas on our supercompute and data infrastructure
Learn from and work with the best researchers in the field

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Accountant

We are seeking a detail-oriented Accountant to support daily accounting operatio...

Location

United States, Miami

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Accounting, Finance, or related field
2+ years of accounting experience
Knowledge of general accounting principles and financial reporting
Experience with general ledger, reconciliations, and close processes
Proficiency in Microsoft Excel and accounting software or ERP systems
Strong attention to detail, organization, and problem-solving skills
Ability to work independently and collaboratively in a team environment
Strong written and verbal communication skills

Job Responsibility

Prepare and record journal entries, accruals, and adjustments
Maintain and reconcile general ledger accounts
Perform monthly bank and account reconciliations
Assist with month-end, quarter-end, and year-end close
Prepare financial statements, reports, and supporting schedules
Analyze account activity and investigate discrepancies
Support accounts payable, accounts receivable, and payroll processes as needed
Help ensure compliance with company policies, internal controls, and accounting standards
Assist with audits by preparing documentation and responding to requests
Contribute to budgeting, forecasting, and variance analysis

What we offer

medical insurance
vision insurance
dental insurance
life insurance
disability insurance
401(k) plan

Fulltime

Critical Environment Technician Manager

In alignment with our Microsoft values, we are committed to cultivating an inclu...

Location

United States, Mount Pleasant

Salary:

75400.00 - 167900.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

High School Diploma, GED, or equivalent AND 3+ years mission critical services work/applied learning experience (e.g., high availability assembly/manufacturing/critical infrastructure environments such as data centers, oil and gas refineries, hospitals, pharmaceutical, manufacturing, or related fields) OR equivalent experience
Ability to work shifts, including shift assignments during non-standard business hours that may include evening, nighttime, weekends, and/or holidays
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
High School Diploma, GED, or equivalent AND 6+ years mission critical services experience OR Associate's Degree or technical trade certification (e.g., military, trade school), or higher-equivalent education AND 5+ years mission-critical services experience OR equivalent experience
1+ year(s) people management experience
1+ year(s) experience in a specialized area (e.g., mechanical field, electrical field, controls field) or related field

Job Responsibility

People Management: Managers deliver success through empowerment and accountability by modeling, coaching, and caring
Model: Live our culture; embody our values; practice our leadership principles
Coach: Define team objectives and outcomes; enable success across boundaries; help the team adapt and learn
Care: Attract and retain great people; know each individual’s capabilities and aspirations; invest in the growth of others
Equipment and Systems Operations: Serve as an operations specialist in one or more major areas of operations (e.g., electrical, mechanical, controls, generators) and work on advanced tasks independently. Oversee and coach the team in the regular inspection of critical environment-related facility equipment (e.g., controls, heating, ventilation, and air conditioning [HVAC], mechanical systems), building, and grounds for unsafe or abnormal conditions to develop and analyze trends. Monitor performance of maintenance and operations utilizing telemetry, control systems, and other platforms, and identify all alarms. Utilize the internal computerized maintenance management system (CMMS) to track all equipment assets, complete work order requests for maintenance work, and generate reporting to identify outstanding and ongoing work orders. Safely and quickly respond to and lead an onsite incident response team for all abnormal conditions that impact operations and coordinate with other critical facilities professionals to perform corrective repairs. Enhance, develop, or follow existing emergency operating procedures (EOPs), methods of procedure (MOPs), and standard operating procedures (SOPs) in relation to incidents. Gather necessary information and create incident timelines/data, root-cause analyses, and/or action items following an abnormal condition
Equipment and Systems Maintenance: Guide, oversee, and perform various types of maintenance (e.g., planned, predictive, corrective) and repairs following methods of procedure (MOPs) and standard operating procedures (SOPs) for one or more disciplines and one or more types of equipment (e.g., electrical, mechanical, cooling systems) and escalate when appropriate. Serve as a subject matter expert for one type of equipment and oversee everyday tasks and troubleshooting within that area of expertise. Have a hands-on understanding of how equipment works within the disciplines in which they have been trained and how to troubleshoot equipment, systems, subsystems, and components independently within those disciplines. Provide and/or assign team members to provide necessary escort to third-party contractors, subcontractors, vendors, and service providers on site for procedures of all severity levels. Coordinate and schedule supplier/vendor on-site activities and recognize when to stop supplier work to address potential and/or identified concerns. Take part in getting third-party work underway (e.g., making sure systems are properly energized/deenergized), ensuring the work is started and completed in a safe manner in accordance with standard practices, procedures, federal/local legislation, and municipal codes. Advise junior colleagues on inspection and supervision issues. Provide consultation to lower-level colleagues in troubleshooting systems and problems
Critical Environment Culture: Understand and follow safety and security requirements (e.g., job hazard assessments [JHAs], toolbox talks) and business processes and procedures, and coach the team on them, to properly perform work in a safe, quality, and reliable manner in accordance with applicable federal, state, local, and Microsoft requirements. Proactively ensure safety and security requirements are followed and met in their own work and the work of others. Maintain safe working conditions and escalate immediately when unsafe working conditions are observed. Assess and identify appropriate resources and equipment necessary to fully support environmental health and safety (EH&S) objectives. Participate in required meetings, trainings, and necessary handoffs

What we offer

Benefits and other compensation (details at https://careers.microsoft.com/us/en/us-corporate-pay)

Fulltime

Psychologist/Psychological Assessment and Evaluation

Join a clinician-centered multidisciplinary team of counselors, social workers, ...

Location

United States, Georgetown

Salary:

Not provided

Ellie Mental Health

Expiration Date

Until further notice

Requirements

Clinical licensure is required (PhD or PsyD)
Experience with completing diagnostic evaluations, scoring and interpreting results, writing reports that clients can understand, and providing feedback to clients
Comfort and familiarity working with a diverse client base from an affirming perspective
Candidates must be able to work in Maryland with a Maryland license, or be license-eligible
Opportunity to provide supervision to psychology associates and trainees for qualified candidates with a demonstrated history of success with Ellie

Job Responsibility

Completing diagnostic evaluations
Scoring and interpreting results
Writing reports that clients can understand
Providing feedback to clients
Providing supervision to psychology associates and trainees for qualified candidates

What we offer

Competitive salary
Flexible schedule
Opportunity for advancement
Health insurance
PTO
Weekly case consultation groups
Free CEUs
Monthly team activities focused on clinician well-being
We will cover all testing costs and expenses as well as administrative tasks like marketing, scheduling, and billing
Cash pay service: clients pay upfront, and you’ll earn 50% of what we collect

Fulltime

Temporary Lettings Administrator

As a Lettings Administrator, you will play a vital role in managing a portfolio ...

Location

United Kingdom, Manchester

Salary:

30000.00 GBP / Year

Office Angels

Expiration Date

Until further notice

Requirements

Strong administration and customer service skills
Previous experience in property management is a bonus
Excellent communication, organisational and problem-solving abilities
Confident in working independently and taking the initiative
A proactive mindset with a genuine care for delivering outstanding service
Full UK driving licence and access to a vehicle (mileage allowance provided)

Job Responsibility

Manage a diverse portfolio of student rental properties across Manchester
Conduct regular property inspections to ensure everything is in tip-top shape
Coordinate tenancy check-ins and check-outs like a pro
Handle rent collections, deposit returns and tenancy agreements efficiently
Address tenant queries and resolve issues promptly and professionally
Conduct viewings and manage enquiries during the bustling student letting cycle
Oversee health and safety, fire safety and compliance across all properties

What we offer

Opportunity to transition into a permanent role for the right candidate
Work in a fun and dynamic environment that values your growth
Make a real difference in the lives of students and enhance their rental experience
Mileage allowance provided

Fulltime
