HPC Systems Engineer

United States, Bala Cynwyd (Philadelphia Area), Pennsylvania · Job Posted February 03, 2026

Job Description

As a member of our Platform Development team, you will be instrumental in building and optimizing high-performance trading systems, research compute clusters, databases, support systems, and more. You will work extensively with Linux and Windows internals on servers in our HPC environment.

Job Responsibility

  • Contribute to our library of home-grown tools, written primarily in Python and Bash, to automate monitoring and maintenance
  • Work closely with Strategy Developers, Quantitative Researchers, and trade-supporting application teams to translate complex problems into scalable solutions
  • Coordinate with IT infrastructure teams, including storage and networking, to identify and implement the best solutions
  • Tune operating systems and batch workflows for performance
  • Dive deep on root-cause analysis of systems issues
  • Integrate all of these solutions into our systems effectively and efficiently
  • Oversee all aspects of our HPC environment, including the scheduler, parallel filesystems, GPUs, and interconnects
  • Implement and optimize high-performance storage solutions, including Lustre, VAST, and GPFS
  • Develop strategies to ensure optimal resource allocation and scalability
  • Utilize monitoring and diagnostic tools to quickly pinpoint failures, streamline troubleshooting processes, and ensure the timely recovery of disrupted workflows

Requirements

  • A Bachelor’s degree in Engineering, Computer Science, Information Systems, or a related discipline
  • 5-7 years of progressive experience building Linux- and/or Windows-based HPC platforms
  • Familiarity with kernel-level and I/O subsystem tweaks and tools such as sysctl, strace, tcpdump, and netstat
  • Recent hands-on experience with automation in Python or other tools
  • Experience administering Lustre, GPFS, VAST, or other parallel filesystems
  • Understanding of resource schedulers like HTCondor, SLURM, or similar

Nice to have

Bonus points for equivalent Windows knowledge (registry, Procmon, Wireshark, tshark)

Similar Jobs for

HPC Systems Engineer

8 matching positions

HPC Systems Engineer

The Consumer Products Infrastructure team builds and operates the high-performan...
Location
United States, San Francisco
Salary
Not provided
OpenAI
Expiration Date
Until further notice
Requirements
  • 7+ years of experience designing and operating large-scale HPC clusters (1,000+ nodes)
  • Deep expertise with NC, IBM/Platform LSF, and Slurm workload managers
  • Strong Linux system administration experience (RHEL-family preferred)
  • Hands-on experience with MPI, parallel scaling, and performance tuning for simulation workloads
  • Experience using Azure CycleCloud to provision and manage HPC clusters in hybrid cloud environments
  • Proven experience operating InfiniBand or other high-speed interconnects
  • Strong Python and Bash skills for automation, tooling, and workflow optimization
  • Experience with distributed filesystems (NFS, DFS, Lustre, GPFS, BeeGFS)
  • Deep familiarity with HPC licensing systems (FlexLM, DSLS, RLM, LUM)
  • Experience supporting product-oriented engineering or simulation teams
Job Responsibility
  • Architect, deploy, and operate large-scale HPC clusters (1,000+ nodes) supporting simulation workloads critical to consumer product development
  • Optimize workload management using NC, IBM/Platform LSF, and Slurm, with a focus on throughput, fairness, and minimizing queue wait times for product teams
  • Design and implement strategies for workload balancing, cluster federation, and multi-scheduler environments that support diverse product workflows
  • Partner closely with product design, mechanical, electrical, and simulation engineers to debug jobs, improve parallel scaling, and accelerate design-to-validation cycles
  • Administer and harden Linux-based HPC systems (RHEL, Rocky Linux, AlmaLinux), including patching, kernel tuning, and performance optimization
  • Operate and optimize software licensing infrastructure (FlexLM, DSLS, LUM, RLM) to maximize utilization and prevent license-related development bottlenecks
  • Deploy and manage Azure CycleCloud and/or TotalCAE to enable elastic capacity, cloud bursting, and hybrid HPC workflows during peak product development cycles
  • Configure and tune high-speed interconnects, including InfiniBand (HDR/EDR/FDR), to support low-latency, tightly coupled simulation workloads
  • Design and maintain high-performance storage systems (NFS, DFS, Lustre, GPFS / Spectrum Scale, BeeGFS, Azure NetApp) optimized for simulation I/O patterns
  • Build automation and internal tooling using Python and Bash to streamline provisioning, monitoring, diagnostics, and job submission workflows
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

HPC & AI Systems Engineer for Integrated Systems Test

HPC & AI Systems Engineer for Integrated Systems Test role at Hewlett Packard En...
Location
Puerto Rico, Aguadilla
Salary
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor's or master's degree in Computer Engineering, Computer Science, Electrical Engineering, Information Systems, or equivalent
  • Minimum 4 years of experience
  • Experience with certification and submission to OS vendors for Linux (Red Hat, SLES, Ubuntu, etc.), Windows Server operating systems, Windows Client operating systems, and VMware (ESXi)
  • Experience installing and working with Linux, Windows, and VMware operating systems
  • Experience with programming or scripting languages and databases, such as Python, PowerShell, Perl, Linux shell, Java, MySQL, and MS SQL Server
  • Understanding of Redfish commands, RESTful API, and JSON format
  • Knowledge of creating and using Docker containers and VMs
  • Experience in configuring storage (internal/external storage, file systems, and RAID/non-RAID settings) and networking devices (iSCSI, FCoE, IPs, VLANs, bonding, jumbo frames, LAGs)
  • Knowledge of networking concepts such as NIC teaming, VLANs, IPv4, IPv6
  • Excellent written and verbal communication skills in English
Job Responsibility
  • Work with Program & Product Management, technical leads, and product development teams to obtain product feature requirements
  • Design and implement new test features in existing and new test cases
  • Analyze, debug and provide feedback/resolution on issues uncovered by test team prior to submission of results to OS vendors for approval
  • Implement software solutions for multiple test programs/projects with internal and outsourced development partners
  • Review and evaluate the implementation and use of test automation and test tools
  • Plan, develop, and implement software tools for the testing and evaluation of current and next-generation HPE HPC products
  • Debug and analyze issues to a successful resolution
  • Perform testing in local and remote labs
  • Drive appropriate automated test execution to test engineers at various global locations
  • Provide training and guidance to test teams both onshore and offshore
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Fulltime

Senior Distributed Systems Engineer (HPC Platform)

We are looking for a Senior Distributed Systems Engineer to design and build cor...
Location
European Union
Salary
Not provided
Itransition
Expiration Date
Until further notice
Requirements
  • Strong experience in backend development with Rust
  • Solid understanding of distributed systems architecture
  • Hands-on experience with message queues (e.g., Apache Pulsar, RabbitMQ)
  • Experience designing and building gRPC-based APIs / service-oriented architectures
  • Experience with AWS or similar cloud platforms
  • Strong problem-solving skills and ability to work with complex systems
Job Responsibility
  • Design and build core backend services for a high-performance distributed computing platform
  • Develop resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes
What we offer
  • Projects for such clients as PayPal, Wargaming, Xerox, Philips, Adidas and Toyota
  • Competitive compensation that depends on your qualification and skills
  • Career development system with clear skill qualifications
  • Flexible working hours aligned to your schedule
  • Options to work remotely
  • Corporate medical insurance covering services of private and public medical centers
  • English courses online
  • Corporate parties and events for employees and their children
  • Internal conferences, workshops and meetups for learning and experience sharing
  • Gym membership compensation

HPC Engineer

Location
India, Chennai
Salary
Not provided
WhiteBlue
Expiration Date
Until further notice
Requirements
  • Experience in designing, implementing, and supporting high-performance computing (HPC) clusters with strong knowledge of CPU/GPU architecture, scalable storage, interconnects, and cloud-based systems
  • Proven, in-depth, distribution-agnostic knowledge of Linux systems (SUSE, Red Hat, Rocky, Ubuntu)
  • Experience designing and maintaining robust storage
  • Strong HPC hardware knowledge, especially in the server, GPU, networking, storage, BIOS, and BMC areas
Job Responsibility
  • Design, implementation & support of high-performance compute clusters
  • Solid knowledge of HPC systems, including CPU/GPU architecture, scalable and robust storage, high-bandwidth interconnects, and cloud-based computing architectures
  • Apply attention to detail to generate hardware BOMs for the HPC clusters, provide vendor management, and oversee hardware release activities
  • Use strong Linux skills to configure appropriate operating systems for the HPC system
  • Understand and assemble the project specifications and performance requirements at the subsystem and system levels
  • Adhere to and drive project timelines to ensure program achievements are completed on time
  • Support the design and release of new products to manufacturing and ultimately the customer, providing quality golden images, procedures, scripts, and documentation to the manufacturing and customer support teams
  • Fulltime

Staff Flight Sciences Software and HPC Engineer

Archer is an aerospace company based in San Jose, California building an all-ele...
Location
United States, San Jose
Salary
162800.00 - 217600.00 USD / Year
Archer Aviation
Expiration Date
Until further notice
Requirements
  • Master's or Ph.D. in Aerospace Engineering, Mechanical Engineering, Computational Engineering, or a related field
  • 5+ years of experience as a user and developer of scientific/engineering software for flight sciences or similar disciplines (such as aerodynamics, acoustics, control, loads, thermal analysis, mass properties, vehicle simulation, etc.) in a fast-moving environment
  • Demonstrated experience in developing computing software and infrastructure, with proficiency in the scientific Python ecosystem (NumPy, SciPy, Pandas, Scikit-learn, TensorFlow/PyTorch, VTK)
  • Demonstrated experience in standard best practices in software development, including version control, CI/CD, software testing, environment management
  • Demonstrated experience with the design and administration of HPC systems, either on-premises or cloud (AWS preferred). Knowledge of Linux administration, high-speed network interconnects, parallel file systems, and MPI is required
  • Experience with HPC management software (Slurm/PBS/Torque, OpenHPC/Bright, Warewulf/xCAT, Spack/EasyBuild, Lmod)
  • Good understanding of enterprise IT and common network security practices
  • Excellent problem-solving skills and ability to work collaboratively in a team environment
Job Responsibility
  • Design, implement, and maintain internal software libraries and applications as well as computing infrastructure to enable engineers to solve problems faster and more efficiently. Promote the use of shared computational infrastructure, tools, and practices across engineering teams within the Flight Sciences department
  • Develop processes and software tools to improve the reproducibility and traceability of computations. Drive the implementation of such tools
  • Promote a culture of software excellence across the engineering organization
  • Understand the needs of various engineering teams to efficiently utilize High-Performance Computing (HPC) resources, and make informed decisions on infrastructure solutions to ensure optimal resource utilization and cost savings
  • Maintain and administer on-premises HPC resources
  • Advocate for engineering and computing needs with the company-wide IT department
  • Fulltime

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...
Location
United States, Mountain View
Salary
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • OR equivalent experience
  • Strong proficiency in Kubernetes, Docker, and container orchestration
  • Knowledge of CI/CD pipelines for Inference and ML model deployment
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Strong programming/scripting skills in Python, Go, or Bash
  • Solid knowledge of distributed systems, networking, and storage
  • Experience running large-scale GPU clusters for ML/AI workloads (preferred)
Job Responsibility
  • Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
  • Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems, including GPUs, clusters, storage, and networking
  • Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows
What we offer
  • Competitive compensation, equity options, and comprehensive benefits
  • Fulltime

Senior ML Systems Engineer, Frameworks & Tooling

We’re looking for a senior engineer to help build, maintain and evolve the train...
Salary
Not provided
Cohere
Expiration Date
Until further notice
Requirements
  • Strong engineering experience in large-scale distributed training or HPC systems
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
  • Experience working with containerized environments (Docker, Singularity/Apptainer)
  • A track record of building tools that increase developer velocity for ML teams
  • Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
  • Strong collaboration skills — you’ll work closely with infra, research, and deployment teams
Job Responsibility
  • Build and own the training framework responsible for large-scale LLM training
  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
  • Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100)
  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
  • Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training
  • Investigate and resolve performance bottlenecks across the ML systems stack
  • Build robust systems that ensure reproducible, debuggable, large-scale runs
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
  • Fulltime

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location
United States, Mountain View
Salary
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures; distributed systems and large-scale AI training/inference; high-performance computing (HPC) and collective communications; ML systems, runtimes, or compilers; performance modeling, benchmarking, and systems analysis; hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibility
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
  • Fulltime