
Senior Distributed Systems Engineer (HPC Platform)


Itransition

Location:
European Union

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We are looking for a Senior Distributed Systems Engineer to design and build core backend services for a high-performance distributed computing platform. In this role, you will focus on developing resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes. You’ll work at the intersection of distributed systems, high-performance computing, and modern backend engineering. This is a hands-on engineering role for someone who enjoys building scalable systems from the ground up and working with cutting-edge technologies.

Job Responsibilities:

  • Design and build core backend services for a high-performance distributed computing platform
  • Develop resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes

Requirements:

  • Strong experience in backend development with Rust
  • Solid understanding of distributed systems architecture
  • Hands-on experience with message queues (e.g., Apache Pulsar, RabbitMQ)
  • Experience designing and building gRPC-based APIs / service-oriented architectures
  • Experience with AWS or similar cloud platforms
  • Strong problem-solving skills and ability to work with complex systems
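
To give the requirements above a concrete flavor, here is a toy, single-process sketch of the kind of workload orchestration the role describes: a dispatcher round-robins tasks to worker threads over channels and aggregates their results. This is purely illustrative and not part of the role description; the `run_pipeline` helper, worker count, and task shape are invented for the example, and a real platform would use a message queue (e.g., Apache Pulsar) and gRPC services in place of in-process channels.

```rust
use std::sync::mpsc;
use std::thread;

// Toy analogue of workload orchestration: a dispatcher round-robins tasks
// to worker threads over channels and aggregates their results. In a real
// platform the channels would be a message queue and the workers CPU/GPU
// nodes reached via gRPC; everything here is invented for illustration.
fn run_pipeline(num_workers: usize, num_tasks: u64) -> u64 {
    let (result_tx, result_rx) = mpsc::channel();
    let mut task_txs = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..num_workers {
        let (task_tx, task_rx) = mpsc::channel::<u64>();
        let result_tx = result_tx.clone();
        handles.push(thread::spawn(move || {
            // Each worker squares its tasks until its channel closes.
            for task in task_rx {
                result_tx.send(task * task).unwrap();
            }
        }));
        task_txs.push(task_tx);
    }
    drop(result_tx); // keep only the workers' clones alive

    // Round-robin dispatch across workers.
    for task in 0..num_tasks {
        task_txs[(task as usize) % num_workers].send(task).unwrap();
    }
    drop(task_txs); // closing the task channels lets workers exit

    // The result channel drains once every worker has finished.
    let total: u64 = result_rx.iter().sum();
    for handle in handles {
        handle.join().unwrap();
    }
    total
}

fn main() {
    // Sum of squares of 0..16 is 1240.
    println!("total = {}", run_pipeline(4, 16));
}
```

The same fan-out/fan-in shape scales to distributed settings by swapping the channels for durable queues and adding retry and acknowledgment logic for fault tolerance.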

Nice to have:

  • Experience with high-performance networking (e.g., RDMA, libfabric)
  • Familiarity with high-performance storage systems (e.g., Lustre)
  • Understanding of GPU architecture and memory management
  • Experience with CUDA ecosystem (Runtime APIs, Thrust, CUB, PTX)
  • Knowledge of LLVM / compiler toolchains

What we offer:

  • Projects for such clients as PayPal, Wargaming, Xerox, Philips, Adidas and Toyota
  • Competitive compensation that depends on your qualifications and skills
  • Career development system with clear skill qualifications
  • Flexible working hours aligned to your schedule
  • Options to work remotely
  • Corporate medical insurance covering services of private and public medical centers
  • English courses online
  • Corporate parties and events for employees and their children
  • Internal conferences, workshops and meetups for learning and experience sharing
  • Gym membership compensation
  • 5 days of paid sick leave per year with no obligation to submit a sick-leave certificate

Additional Information:

Job Posted:
May 05, 2026

Work Type:
Remote work

Similar Jobs for Senior Distributed Systems Engineer (HPC Platform)

Senior HPC Systems and Storage Engineer

The Senior HPC Systems and Storage Engineer will apply advanced systems and soft...
Location:
United States, San Diego
Salary:
108100.00 - 160000.00 USD / Year
UC San Diego
Expiration Date:
May 07, 2026
Requirements:
  • Bachelor’s degree in related area and / or equivalent experience / training
  • Proven experience administering and supporting large-scale HPC clusters or other distributed POSIX (Linux) systems, including advanced knowledge of Linux system administration, primarily Red Hat and its derivatives (e.g., Rocky Linux)
  • Proven experience designing, deploying, and operating large-scale (petabyte-class) high-performance parallel and distributed file systems (e.g., Lustre, Ceph, BeeGFS, GPFS), as well as enterprise and local file systems (e.g., NFS, ZFS, ext4, XFS) in Linux-based environments, including troubleshooting and performance tuning
  • Demonstrated experience with scripting and automation using languages such as Bash and Python, use of configuration management tools (e.g., Ansible, CFEngine), and version control systems (e.g., Git) to manage and maintain system configurations and infrastructure
  • Advanced knowledge of the HPC middleware stack, including cluster management tools, job schedulers, and resource managers (e.g., Slurm, PBS, HPCM, and Bright Cluster Manager)
  • Demonstrated knowledge of TCP/IP networking, including sockets, VLANs, and firewalls
Job Responsibilities:
  • Designing, deploying, and operating SDSC HPC compute clusters and their associated storage systems
  • Maintaining their performance, reliability, and availability at the national, state, and campus level
  • Contributing to the design, deployment, and operation of high-performance HPC systems and storage environments, including parallel file systems operating at scale across high-speed networks
  • Planning and executing system lifecycles, including deployment, upgrades, and decommissioning of HPC systems and storage services
  • Contributing to technical planning and effort estimation for new deployments, proposals, and recharge-based services
  • Evaluating and recommending improvements to tools and workflows
  • Participating in the selection and integration of new technologies
  • Working with vendors and SDSC staff to benchmark and evaluate storage systems and cluster platforms
  • Maintaining current knowledge of emerging technologies
  • Developing advanced processes and scripts for system analysis, testing, and automation
Work Type: Full-time

Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

We are looking for a highly skilled engineer with deep expertise in building and...
Location:
United States, San Francisco
Salary:
166000.00 - 201000.00 USD / Year
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/Opensearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices
Job Responsibilities:
  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/Opensearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Work Type: Full-time

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas: AI accelerator or GPU architectures
  • Distributed systems and large-scale AI training/inference
  • High-performance computing (HPC) and collective communications
  • ML systems, runtimes, or compilers
  • Performance modeling, benchmarking, and systems analysis
  • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibilities:
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
Work Type: Full-time

Senior Aerodynamics Systems Engineer-Motorsports

The Team: GM’s Motorsports Platform & Systems team analyzes, defines, and delive...
Location:
United States, Concord
Salary:
125200.00 - 192700.00 USD / Year
General Motors
Expiration Date:
Until further notice
Requirements:
  • 5+ years hands-on full-stack or backend-focused development experience, with strong emphasis on server-side architectures supporting batch processing, multi-stage pipelines, and compute-intensive workloads (e.g., CFD workflows, large numerical simulation pipelines, HPC job orchestration)
  • 2+ years designing and developing backend web services (REST/GRPC) including server-side batch execution engines, distributed compute orchestration, asynchronous task processing, and workflow automation for CFD or similar computational pipelines
  • Demonstrated experience building high-performance server-side processing frameworks, including parallelized job execution, distributed scheduling, queue-based workloads, and fault-tolerant pipeline management
  • Strong experience with pipeline-oriented architectures, such as CFD post-processing chains, multi-stage data conditioning workflows, large model computation pipelines, or batch-driven scientific/engineering processing systems
  • Proficiency in two or more backend-focused languages or ecosystems: Java, Python, Scala, C#/.NET, or equivalent, used to build distributed compute services and processing automation
  • Experience with containerized compute environments (Kubernetes, Docker), especially for scaling simulation services, HPC workflow endpoints, or compute-heavy microservices
  • Solid understanding of software development best practices, DevOps, CI/CD, observability (metrics/logging/tracing), and reliability engineering for long-running, high-load backend systems
  • Experience working in an agile/scrum environment, especially on teams delivering simulation pipelines, compute orchestration services, or backend system components
  • Demonstrated ability to articulate sound technical decisions and deep understanding of distributed, event-driven, or batch-processing architectures, especially those powering HPC, CFD simulations, or multidisciplinary compute workloads
  • Highly collaborative mindset with strong communication skills, especially when working with simulation engineers, aerodynamicists, data engineers, and HPC platform teams
Job Responsibilities:
  • Implement and maintain GM Motorsports aero-thermal applications including CFD model construction, visualization, and analysis using microservices architectures to creatively integrate loosely coupled systems
  • Define a templated approach to integrate dependent systems in a functional programming model
  • Scrum story delivery
  • Playbooks, implementation architectures, interfaces, build frameworks, code, testing, deployment for your story
  • Participation in solution architectures
  • Collaborating with and supporting other team members
What we offer:
  • An incentive pay program offers payouts based on company performance, job level, and individual performance
  • Medical, dental, and vision coverage
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
Work Type: Full-time

Senior Software System Design Engineer

We are seeking a Senior Member of Technical Staff to design, build, and evolve l...
Location:
United States, Florida
Salary:
134400.00 - 201600.00 USD / Year
AMD
Expiration Date:
Until further notice
Requirements:
  • Strong experience designing distributed or large-scale systems with an emphasis on automation and reliability
  • Proven expertise in CI/CD platforms and workflow orchestration (e.g., Jenkins or similar systems)
  • Hands-on experience with containerization technologies (Docker, container-based build/test workflows)
  • Solid programming and scripting skills in Python, C, or C++, with an emphasis on tooling and automation
  • Experience working with complex build systems, dependency management, and multi-component software stacks
  • Familiarity with performance analysis, benchmarking, or resource-intensive workloads (e.g., GPU, HPC, or systems software is a plus)
  • Demonstrated ability to work across teams and influence technical direction without formal authority
  • Strong written and verbal communication skills, especially for technical design and documentation
  • Bachelor’s or Master’s in Electrical Engineering, Computer Engineering, Computer Science, or a closely related field
Job Responsibilities:
  • Architect and own scalable platform solutions for CI/CD, build, test, and release workflows used by multiple engineering teams
  • Design automation frameworks and reusable pipelines that emphasize reproducibility, reliability, and efficiency
  • Lead technical design discussions and drive alignment across teams with differing requirements and constraints
  • Develop and maintain containerized build and test environments, ensuring consistency across systems and releases
  • Implement and evolve manifest-driven, configuration-based systems that enable flexibility without sacrificing control
  • Integrate quality gates such as testing, performance validation, static analysis, and artifact management into automated workflows
  • Analyze system performance and pipeline efficiency, identifying bottlenecks and driving continuous improvement
  • Serve as a technical mentor, reviewing designs, guiding best practices, and raising overall engineering maturity
  • Partner closely with product, infrastructure, and software teams to ensure the platform evolves with business and technical needs
  • Document architecture, workflows, and operational best practices to enable long-term sustainability
Work Type: Full-time

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location:
United States, Redmond
Salary:
163000.00 - 296400.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibilities:
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Work Type: Full-time

Senior Manager, AI Infrastructure and Operations

The Sr. Manager/Staff Engineer, AI Infrastructure & MLOps Engineering is a senio...
Location:
Japan, Tokyo
Salary:
Not provided
Pfizer
Expiration Date:
Until further notice
Requirements:
  • 8+ years of hands-on software engineering experience in cloud infrastructure, DevOps, and MLOps
  • Deep expertise in Python, Kubernetes, Terraform, Helm, and CI/CD pipeline development
  • Proven experience architecting and operating containerized solutions on AWS, GCP, and Azure
  • Strong knowledge of Infrastructure-as-Code, distributed systems, and production system reliability
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field
Job Responsibilities:
  • Design, implement, and own large-scale cloud-based HPC and MLOps platforms supporting AI model training, genomic sequencing, and precision medicine
  • Architect multi-environment clusters (AWS, GCP, Azure), enabling GPU/FPGA workloads and advanced observability
  • Lead the development of developer and cloud platforms, including internal engineering accelerators and reusable toolsets
  • Design, implement, and manage unified platform catalogs using Backstage, enhancing developer experience and application metadata management
  • Develop custom plugins and APIs for Backstage to support internal engineering workflows and documentation
  • Build and maintain Python-based automation frameworks, CI/CD pipelines, and Infrastructure-as-Code (Terraform, Helm, Pulumi, AWS CDK)
  • Operationalize containerized solutions using Docker and Kubernetes, integrating MLflow, Kubeflow, and other orchestration platforms
  • Implement robust automation for provisioning, configuring, and managing cloud resources across multiple environments
  • Lead the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and advanced observability (Prometheus, Grafana, PagerDuty)
  • Develop and maintain APIs and services for model management, feature stores, and inference pipelines
Work Type: Full-time

Principal Technical Program Manager

The CO+I AI Delivery team is focused on delivering various platform services to ...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree AND 6+ years experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
  • 3+ years of experience managing cross-functional and/or cross-team projects
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • Proven experience leading complex, cross‑team technical programs with significant infrastructure or platform components
  • Strong technical foundation in one or more of the following: Cloud infrastructure and distributed systems, Large‑scale datacentre delivery projects, Hardware‑software integrations (compute, networking, storage, power, cooling)
  • Demonstrated ability to manage execution in ambiguous, fast‑moving environments
  • Excellent written and verbal communication skills, with experience presenting to senior leadership
  • Experience delivering or scaling AI, HPC, or GPU‑based platforms in production environments
  • Familiarity with data center operations, hardware lifecycle management, or global deployment programs
Job Responsibilities:
  • Program Ownership & Execution: Own end‑to‑end technical programs focused on accelerating AI deployment timelines
  • Drive execution across multiple parallel workstreams
  • Establish clear success metrics and mechanisms
  • Document all artifacts appropriately
  • Cross‑Functional Leadership: Partner deeply with hardware engineering, software engineering, infrastructure, networking, data center operations, and supply chain teams
  • Act as the central point of coordination
  • Influence decision‑making with data, technical insight, and strong executive communication
  • Technical Rigor: Develop deep working knowledge of AI deployment architectures
  • Identify technical risks early and drive mitigation strategies
  • Translate complex technical concepts into clear, actionable plans
Work Type: Full-time