
Senior Distributed Systems Engineer (HPC Platform)


Itransition

Location:
European Union

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We are looking for a Senior Distributed Systems Engineer to design and build core backend services for a high-performance distributed computing platform. In this role, you will focus on developing resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes. You’ll work at the intersection of distributed systems, high-performance computing, and modern backend engineering. This is a hands-on engineering role for someone who enjoys building scalable systems from the ground up and working with cutting-edge technologies.

Job Responsibilities:

  • Design and build core backend services for a high-performance distributed computing platform
  • Develop resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes

Requirements:

  • Strong experience in backend development with Rust
  • Solid understanding of distributed systems architecture
  • Hands-on experience with message queues (e.g., Apache Pulsar, RabbitMQ)
  • Experience designing and building gRPC-based APIs / service-oriented architectures
  • Experience with AWS or similar cloud platforms
  • Strong problem-solving skills and ability to work with complex systems
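
To give the requirements above a concrete flavor, here is a toy, single-process sketch of the kind of workload orchestration the role describes: a dispatcher round-robins tasks to worker threads over channels and aggregates their results. This is purely illustrative and not part of the role description; the `run_pipeline` helper, worker count, and task shape are invented for the example, and a real platform would use a message queue (e.g., Apache Pulsar) and gRPC services in place of in-process channels.

```rust
use std::sync::mpsc;
use std::thread;

// Toy analogue of workload orchestration: a dispatcher round-robins tasks
// to worker threads over channels and aggregates their results. In a real
// platform the channels would be a message queue and the workers CPU/GPU
// nodes reached via gRPC; everything here is invented for illustration.
fn run_pipeline(num_workers: usize, num_tasks: u64) -> u64 {
    let (result_tx, result_rx) = mpsc::channel();
    let mut task_txs = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..num_workers {
        let (task_tx, task_rx) = mpsc::channel::<u64>();
        let result_tx = result_tx.clone();
        handles.push(thread::spawn(move || {
            // Each worker squares its tasks until its channel closes.
            for task in task_rx {
                result_tx.send(task * task).unwrap();
            }
        }));
        task_txs.push(task_tx);
    }
    drop(result_tx); // keep only the workers' clones alive

    // Round-robin dispatch across workers.
    for task in 0..num_tasks {
        task_txs[(task as usize) % num_workers].send(task).unwrap();
    }
    drop(task_txs); // closing the task channels lets workers exit

    // The result channel drains once every worker has finished.
    let total: u64 = result_rx.iter().sum();
    for handle in handles {
        handle.join().unwrap();
    }
    total
}

fn main() {
    // Sum of squares of 0..16 is 1240.
    println!("total = {}", run_pipeline(4, 16));
}
```

The same fan-out/fan-in shape scales to distributed settings by swapping the channels for durable queues and adding retry and acknowledgment logic for fault tolerance.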

Nice to have:

  • Experience with high-performance networking (e.g., RDMA, libfabric)
  • Familiarity with high-performance storage systems (e.g., Lustre)
  • Understanding of GPU architecture and memory management
  • Experience with CUDA ecosystem (Runtime APIs, Thrust, CUB, PTX)
  • Knowledge of LLVM / compiler toolchains

What we offer:

  • Projects for such clients as PayPal, Wargaming, Xerox, Philips, Adidas and Toyota
  • Competitive compensation that depends on your qualifications and skills
  • Career development system with clear skill qualifications
  • Flexible working hours aligned to your schedule
  • Options to work remotely
  • Corporate medical insurance covering services of private and public medical centers
  • English courses online
  • Corporate parties and events for employees and their children
  • Internal conferences, workshops and meetups for learning and experience sharing
  • Gym membership compensation
  • 5 days of paid sick leave per year with no obligation to submit a sick-leave certificate

Additional Information:

Job Posted:
May 05, 2026

Work Type:
Remote work

Similar Jobs for Senior Distributed Systems Engineer (HPC Platform)

Senior HPC Systems and Storage Engineer

The Senior HPC Systems and Storage Engineer will apply advanced systems and soft...
Location:
United States, San Diego
Salary:
108100.00 - 160000.00 USD / Year
UC San Diego
Expiration Date:
May 07, 2026
Requirements:
  • Bachelor’s degree in related area and / or equivalent experience / training
  • Proven experience administering and supporting large-scale HPC clusters or other distributed POSIX (Linux) systems, including advanced knowledge of Linux system administration, primarily Red Hat and its derivatives (e.g., Rocky Linux)
  • Proven experience designing, deploying, and operating large-scale (petabyte-class) high-performance parallel and distributed file systems (e.g., Lustre, Ceph, BeeGFS, GPFS), as well as enterprise and local file systems (e.g., NFS, ZFS, ext4, XFS) in Linux-based environments, including troubleshooting and performance tuning
  • Demonstrated experience with scripting and automation using languages such as Bash and Python, use of configuration management tools (e.g., Ansible, CFEngine), and version control systems (e.g., Git) to manage and maintain system configurations and infrastructure
  • Advanced knowledge of the HPC middleware stack, including cluster management tools, job schedulers, and resource managers (e.g., Slurm, PBS, HPCM, and Bright Cluster Manager)
  • Demonstrated knowledge of TCP/IP networking, including sockets, VLANs, and firewalls
Job Responsibilities:
  • Designing, deploying, and operating SDSC HPC compute clusters and their associated storage systems
  • Maintaining their performance, reliability, and availability at the national, state, and campus level
  • Contributing to the design, deployment, and operation of high-performance HPC systems and storage environments, including parallel file systems operating at scale across high-speed networks
  • Planning and executing system lifecycles, including deployment, upgrades, and decommissioning of HPC systems and storage services
  • Contributing to technical planning and effort estimation for new deployments, proposals, and recharge-based services
  • Evaluating and recommending improvements to tools and workflows
  • Participating in the selection and integration of new technologies
  • Working with vendors and SDSC staff to benchmark and evaluate storage systems and cluster platforms
  • Maintaining current knowledge of emerging technologies
  • Developing advanced processes and scripts for system analysis, testing, and automation
Work Type: Full-time

Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

We are looking for a highly skilled engineer with deep expertise in building and...
Location:
United States, San Francisco
Salary:
166000.00 - 201000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/Opensearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices
Job Responsibilities:
  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/Opensearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Work Type: Full-time

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures
  • Distributed systems and large-scale AI training/inference
  • High-performance computing (HPC) and collective communications
  • ML systems, runtimes, or compilers
  • Performance modeling, benchmarking, and systems analysis
  • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibilities:
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
Work Type: Full-time

Senior Aerodynamics Systems Engineer-Motorsports

The Team: GM’s Motorsports Platform & Systems team analyzes, defines, and delive...
Location:
United States, Concord
Salary:
125200.00 - 192700.00 USD / Year
General Motors
Expiration Date:
Until further notice
Requirements:
  • 5+ years hands-on full-stack or backend-focused development experience, with strong emphasis on server-side architectures supporting batch processing, multi-stage pipelines, and compute-intensive workloads (e.g., CFD workflows, large numerical simulation pipelines, HPC job orchestration)
  • 2+ years designing and developing backend web services (REST/GRPC) including server-side batch execution engines, distributed compute orchestration, asynchronous task processing, and workflow automation for CFD or similar computational pipelines
  • Demonstrated experience building high-performance server-side processing frameworks, including parallelized job execution, distributed scheduling, queue-based workloads, and fault-tolerant pipeline management
  • Strong experience with pipeline-oriented architectures, such as CFD post-processing chains, multi-stage data conditioning workflows, large model computation pipelines, or batch-driven scientific/engineering processing systems
  • Proficiency in two or more backend-focused languages or ecosystems: Java, Python, Scala, C#/.NET, or equivalent, used to build distributed compute services and processing automation
  • Experience with containerized compute environments (Kubernetes, Docker), especially for scaling simulation services, HPC workflow endpoints, or compute-heavy microservices
  • Solid understanding of software development best practices, DevOps, CI/CD, observability (metrics/logging/tracing), and reliability engineering for long-running, high-load backend systems
  • Experience working in an agile/scrum environment, especially on teams delivering simulation pipelines, compute orchestration services, or backend system components
  • Demonstrated ability to articulate sound technical decisions and deep understanding of distributed, event-driven, or batch-processing architectures, especially those powering HPC, CFD simulations, or multidisciplinary compute workloads
  • Highly collaborative mindset with strong communication skills, especially when working with simulation engineers, aerodynamicists, data engineers, and HPC platform teams
Job Responsibilities:
  • Implement and maintain GM Motorsports aero-thermal applications including CFD model construction, visualization, and analysis using microservices architectures to creatively integrate loosely coupled systems
  • Define a templated approach to integrate dependent systems in a functional programming model
  • Scrum story delivery
  • Playbooks, implementation architectures, interfaces, build frameworks, code, testing, deployment for your story
  • Participation in solution architectures
  • Collaborating with and supporting other team members
What we offer:
  • An incentive pay program offers payouts based on company performance, job level, and individual performance
  • Medical, dental, and vision coverage
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
Work Type: Full-time

Senior Software System Design Engineer

We are seeking a Senior Member of Technical Staff to design, build, and evolve l...
Location:
United States, Florida
Salary:
134400.00 - 201600.00 USD / Year
AMD
Expiration Date:
Until further notice
Requirements:
  • Strong experience designing distributed or large-scale systems with an emphasis on automation and reliability
  • Proven expertise in CI/CD platforms and workflow orchestration (e.g., Jenkins or similar systems)
  • Hands-on experience with containerization technologies (Docker, container-based build/test workflows)
  • Solid programming and scripting skills in Python, C, or C++, with an emphasis on tooling and automation
  • Experience working with complex build systems, dependency management, and multi-component software stacks
  • Familiarity with performance analysis, benchmarking, or resource-intensive workloads (e.g., GPU, HPC, or systems software is a plus)
  • Demonstrated ability to work across teams and influence technical direction without formal authority
  • Strong written and verbal communication skills, especially for technical design and documentation
  • Bachelor’s or Master’s in Electrical Engineering, Computer Engineering, Computer Science, or a closely related field
Job Responsibilities:
  • Architect and own scalable platform solutions for CI/CD, build, test, and release workflows used by multiple engineering teams
  • Design automation frameworks and reusable pipelines that emphasize reproducibility, reliability, and efficiency
  • Lead technical design discussions and drive alignment across teams with differing requirements and constraints
  • Develop and maintain containerized build and test environments, ensuring consistency across systems and releases
  • Implement and evolve manifest-driven, configuration-based systems that enable flexibility without sacrificing control
  • Integrate quality gates such as testing, performance validation, static analysis, and artifact management into automated workflows
  • Analyze system performance and pipeline efficiency, identifying bottlenecks and driving continuous improvement
  • Serve as a technical mentor, reviewing designs, guiding best practices, and raising overall engineering maturity
  • Partner closely with product, infrastructure, and software teams to ensure the platform evolves with business and technical needs
  • Document architecture, workflows, and operational best practices to enable long-term sustainability
Work Type: Full-time

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibilities:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Work Type: Full-time

Senior Manager, AI Infrastructure and Operations

The Sr. Manager/Staff Engineer, AI Infrastructure & MLOps Engineering is a senio...
Location:
Japan, Tokyo
Salary:
Not provided
Pfizer
Expiration Date:
Until further notice
Requirements:
  • 8+ years of hands-on software engineering experience in cloud infrastructure, DevOps, and MLOps
  • Deep expertise in Python, Kubernetes, Terraform, Helm, and CI/CD pipeline development
  • Proven experience architecting and operating containerized solutions on AWS, GCP, and Azure
  • Strong knowledge of Infrastructure-as-Code, distributed systems, and production system reliability
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field
Job Responsibilities:
  • Design, implement, and own large-scale cloud-based HPC and MLOps platforms supporting AI model training, genomic sequencing, and precision medicine
  • Architect multi-environment clusters (AWS, GCP, Azure), enabling GPU/FPGA workloads and advanced observability
  • Lead the development of developer and cloud platforms, including internal engineering accelerators and reusable toolsets
  • Design, implement, and manage unified platform catalogs using Backstage, enhancing developer experience and application metadata management
  • Develop custom plugins and APIs for Backstage to support internal engineering workflows and documentation
  • Build and maintain Python-based automation frameworks, CI/CD pipelines, and Infrastructure-as-Code (Terraform, Helm, Pulumi, AWS CDK)
  • Operationalize containerized solutions using Docker and Kubernetes, integrating MLflow, Kubeflow, and other orchestration platforms
  • Implement robust automation for provisioning, configuring, and managing cloud resources across multiple environments
  • Lead the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and advanced observability (Prometheus, Grafana, PagerDuty)
  • Develop and maintain APIs and services for model management, feature stores, and inference pipelines
Work Type: Full-time

Principal Technical Program Manager

The CO+I AI Delivery team is focused on delivering various platform services to ...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree AND 6+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
  • 3+ years of experience managing cross-functional and/or cross-team projects
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • Proven experience leading complex, cross‑team technical programs with significant infrastructure or platform components
  • Strong technical foundation in one or more of the following: Cloud infrastructure and distributed systems, Large‑scale datacentre delivery projects, Hardware‑software integrations (compute, networking, storage, power, cooling)
  • Demonstrated ability to manage execution in ambiguous, fast‑moving environments
  • Excellent written and verbal communication skills, with experience presenting to senior leadership
  • Experience delivering or scaling AI, HPC, or GPU‑based platforms in production environments
  • Familiarity with data center operations, hardware lifecycle management, or global deployment programs
Job Responsibilities:
  • Program Ownership & Execution: Own end‑to‑end technical programs focused on accelerating AI deployment timelines
  • Drive execution across multiple parallel workstreams
  • Establish clear success metrics and mechanisms
  • Document all artifacts appropriately
  • Cross‑Functional Leadership: Partner deeply with hardware engineering, software engineering, infrastructure, networking, data center operations, and supply chain teams
  • Act as the central point of coordination
  • Influence decision‑making with data, technical insight, and strong executive communication
  • Technical Rigor: Develop deep working knowledge of AI deployment architectures
  • Identify technical risks early and drive mitigation strategies
  • Translate complex technical concepts into clear, actionable plans
Work Type: Full-time