HPC Systems Engineer Job at Susquehanna International Group (Bala Cynwyd (Philadelphia Area), Pennsylvania)

HPC Systems Engineer

The Consumer Products Infrastructure team builds and operates the high-performan...

Location

United States, San Francisco

Salary:

Not provided

OpenAI

Expiration Date

Until further notice

Requirements

7+ years of experience designing and operating large-scale HPC clusters (1,000+ nodes)
Deep expertise with NC, IBM/Platform LSF, and Slurm workload managers
Strong Linux system administration experience (RHEL-family preferred)
Hands-on experience with MPI, parallel scaling, and performance tuning for simulation workloads
Experience using Azure CycleCloud to provision and manage HPC clusters in hybrid cloud environments
Proven experience operating InfiniBand or other high-speed interconnects
Strong Python and Bash skills for automation, tooling, and workflow optimization
Experience with distributed filesystems (NFS, DFS, Lustre, GPFS, BeeGFS)
Deep familiarity with HPC licensing systems (FlexLM, DSLS, RLM, LUM)
Experience supporting product-oriented engineering or simulation teams

Job Responsibility

Architect, deploy, and operate large-scale HPC clusters (1,000+ nodes) supporting simulation workloads critical to consumer product development
Optimize workload management using NC, IBM/Platform LSF, and Slurm, with a focus on throughput, fairness, and minimizing queue wait times for product teams
Design and implement strategies for workload balancing, cluster federation, and multi-scheduler environments that support diverse product workflows
Partner closely with product design, mechanical, electrical, and simulation engineers to debug jobs, improve parallel scaling, and accelerate design-to-validation cycles
Administer and harden Linux-based HPC systems (RHEL, Rocky Linux, AlmaLinux), including patching, kernel tuning, and performance optimization
Operate and optimize software licensing infrastructure (FlexLM, DSLS, LUM, RLM) to maximize utilization and prevent license-related development bottlenecks
Deploy and manage Azure CycleCloud and/or TotalCAE to enable elastic capacity, cloud bursting, and hybrid HPC workflows during peak product development cycles
Configure and tune high-speed interconnects, including InfiniBand (HDR/EDR/FDR), to support low-latency, tightly coupled simulation workloads
Design and maintain high-performance storage systems (NFS, DFS, Lustre, GPFS / Spectrum Scale, BeeGFS, Azure NetApp) optimized for simulation I/O patterns
Build automation and internal tooling using Python and Bash to streamline provisioning, monitoring, diagnostics, and job submission workflows

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime

HPC & AI Systems Engineer for Integrated Systems Test

HPC & AI Systems Engineer for Integrated Systems Test role at Hewlett Packard En...

Location

Puerto Rico, Aguadilla

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor's or master's degree in Computer Engineering, Computer Science, Electrical Engineering, Information Systems, or equivalent
Minimum 4 years of experience
Experience with certification and submission to OS vendors for Linux (Red Hat, SLES, Ubuntu, etc.), Windows Server operating systems, Windows client operating systems, and VMware (ESXi)
Experience installing and working with Linux, Windows, and VMware operating systems
Experience with programming or scripting languages and databases (Python, PowerShell, Perl, Linux shell, Java, MySQL, MS SQL Server)
Understanding of Redfish commands, RESTful API, and JSON format
Knowledge of creating and using Docker containers and VMs
Experience configuring storage (internal/external storage, file systems, and RAID/non-RAID settings) and networking devices (iSCSI, FCoE, IPs, VLANs, bonding, jumbo frames, LAGs)
Knowledge of networking concepts such as NIC teaming, VLANs, IPv4, IPv6
Excellent written and verbal communication skills in English

Job Responsibility

Work with Program & Product Management, technical leads, and product development teams to obtain product feature requirements
Design and implement new test features in existing and new test cases
Analyze, debug, and provide feedback on or resolution of issues uncovered by the test team before results are submitted to OS vendors for approval
Implement software solutions for multiple test programs/projects with internal and outsourced development partners
Review and evaluate the implementation and use of test automation and test tools
Plan, develop, and implement software tools for testing and evaluating current and next-generation HPE HPC products
Debug and analyze issues to a successful resolution
Perform testing in local and remote labs
Drive appropriate automated test execution to test engineers at various global locations
Provide training and guidance to test teams both onshore and offshore

What we offer

Health & Wellbeing benefits
Personal & Professional Development programs
Unconditional Inclusion environment
Comprehensive suite of benefits that supports physical, financial and emotional wellbeing

Fulltime

Senior Distributed Systems Engineer (HPC Platform)

We are looking for a Senior Distributed Systems Engineer to design and build cor...

Location

European Union

Salary:

Not provided

Itransition

Expiration Date

Until further notice

Requirements

Strong experience in backend development with Rust
Solid understanding of distributed systems architecture
Hands-on experience with message queues (e.g., Apache Pulsar, RabbitMQ)
Experience designing and building gRPC-based APIs / service-oriented architectures
Experience with AWS or similar cloud platforms
Strong problem-solving skills and ability to work with complex systems

Job Responsibility

Design and build core backend services for a high-performance distributed computing platform
Develop resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes

What we offer

Projects for such clients as PayPal, Wargaming, Xerox, Philips, Adidas and Toyota
Competitive compensation that depends on your qualifications and skills
Career development system with clear skill qualifications
Flexible working hours aligned to your schedule
Options to work remotely
Corporate medical insurance covering services of private and public medical centers
English courses online
Corporate parties and events for employees and their children
Internal conferences, workshops and meetups for learning and experience sharing
Gym membership compensation

HPC Engineer

Location

India, Chennai

Salary:

Not provided

WhiteBlue

Expiration Date

Until further notice

Requirements

Experience in designing, implementing, and supporting high-performance computing (HPC) clusters with strong knowledge of CPU/GPU architecture, scalable storage, interconnects, and cloud-based systems
Validated, in-depth, distribution-agnostic knowledge of Linux systems (SUSE, Red Hat, Rocky, Ubuntu)
Experience designing and maintaining robust storage
Strong HPC hardware knowledge, especially in servers, GPUs, networking, storage, BIOS, and BMC

Job Responsibility

Design, implementation & support of high-performance compute clusters
Apply solid knowledge of HPC systems, including CPU/GPU architecture, scalable and robust storage, high-bandwidth interconnects, and cloud-based computing architectures
Apply attention to detail to generate hardware BOMs for the HPC clusters, manage vendors, and oversee hardware release activities
Use strong Linux skills to configure appropriate operating systems for the HPC system
Understand and assemble the project specifications and performance requirements at the subsystem and system levels
Adhere to and drive project timelines to ensure program milestones are completed on time
Support the design and release of new products to manufacturing and ultimately the customer, providing quality golden images, procedures, scripts, and documentation to the manufacturing and customer support teams

Fulltime

Staff Flight Sciences Software and HPC Engineer

Archer is an aerospace company based in San Jose, California building an all-ele...

Location

United States, San Jose

Salary:

162800.00 - 217600.00 USD / Year

Archer Aviation

Expiration Date

Until further notice

Requirements

Master's or Ph.D. in Aerospace Engineering, Mechanical Engineering, Computational Engineering, or a related field
5+ years of experience as a user and developer of scientific/engineering software for flight sciences or similar disciplines (such as aerodynamics, acoustics, control, loads, thermal analysis, mass properties, vehicle simulation, etc.) in a fast-moving environment
Demonstrated experience in developing computing software and infrastructure, with proficiency in the scientific Python ecosystem (NumPy, SciPy, Pandas, Scikit-learn, TensorFlow/PyTorch, VTK)
Demonstrated experience in standard best practices in software development, including version control, CI/CD, software testing, environment management
Demonstrated experience with the design and administration of HPC systems, either on-premises or cloud (AWS preferred). Knowledge of Linux administration, high-speed network interconnects, parallel file systems, and MPI required
Experience with HPC management software (Slurm/PBS/Torque, OpenHPC/Bright, Warewulf/xCAT, Spack/EasyBuild, Lmod)
Good understanding of enterprise IT and common network security practices
Excellent problem-solving skills and ability to work collaboratively in a team environment

Job Responsibility

Design, implement, and maintain internal software libraries and applications as well as computing infrastructure to enable engineers to solve problems faster and more efficiently. Promote the use of shared computational infrastructure, tools, and practices across engineering teams within the Flight Sciences department
Develop processes and software tools to improve the reproducibility and traceability of computations. Drive the implementation of such tools
Promote a culture of software excellence across the engineering organization
Understand the needs of various engineering teams to efficiently utilize High-Performance Computing (HPC) resources, and make informed decisions on infrastructure solutions to ensure optimal resource utilization and cost savings
Maintain and administer on-premises HPC resources
Advocate for engineering and computing needs with the company-wide IT department

Fulltime

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
OR equivalent experience
Strong proficiency in Kubernetes, Docker, and container orchestration
Knowledge of CI/CD pipelines for Inference and ML model deployment
Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
Strong programming/scripting skills in Python, Go, or Bash
Solid knowledge of distributed systems, networking, and storage
Experience running large-scale GPU clusters for ML/AI workloads (preferred)

Job Responsibility

Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows

What we offer

Competitive compensation, equity options, and comprehensive benefits

Fulltime

Senior ML Systems Engineer, Frameworks & Tooling

We’re looking for a senior engineer to help build, maintain and evolve the train...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Strong engineering experience in large-scale distributed training or HPC systems
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
Experience working with containerized environments (Docker, Singularity/Apptainer)
A track record of building tools that increase developer velocity for ML teams
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
Strong collaboration skills — you’ll work closely with infra, research, and deployment teams

Job Responsibility

Build and own the training framework responsible for large-scale LLM training
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
Improve training throughput and stability on multi-node clusters (e.g., GB200/300, AMD, H200/100)
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training
Investigate and resolve performance bottlenecks across the ML systems stack
Build robust systems that ensure reproducible, debuggable, large-scale runs

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Strong background in one or more of the following areas: AI accelerator or GPU architectures; distributed systems and large-scale AI training/inference; high-performance computing (HPC) and collective communications; ML systems, runtimes, or compilers; performance modeling, benchmarking, and systems analysis; hardware–software co-design for AI workloads
Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.

Job Responsibility

Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.

Fulltime
