Senior Distributed Systems Engineer (HPC Platform)

European Union · Job Posted May 05, 2026

Job Description

We are looking for a Senior Distributed Systems Engineer to design and build core backend services for a high-performance distributed computing platform. In this role, you will focus on developing resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes. You’ll work at the intersection of distributed systems, high-performance computing, and modern backend engineering. This is a hands-on engineering role for someone who enjoys building scalable systems from the ground up and working with cutting-edge technologies.

Job Responsibility

  • Design and build core backend services for a high-performance distributed computing platform
  • Develop resilient, high-throughput infrastructure that orchestrates workloads across CPU and GPU nodes

Requirements

  • Strong experience in backend development with Rust
  • Solid understanding of distributed systems architecture
  • Hands-on experience with message queues (e.g., Apache Pulsar, RabbitMQ)
  • Experience designing and building gRPC-based APIs / service-oriented architectures
  • Experience with AWS or similar cloud platforms
  • Strong problem-solving skills and ability to work with complex systems

Nice to have

  • Experience with high-performance networking (e.g., RDMA, libfabric)
  • Familiarity with high-performance storage systems (e.g., Lustre)
  • Understanding of GPU architecture and memory management
  • Experience with CUDA ecosystem (Runtime APIs, Thrust, CUB, PTX)
  • Knowledge of LLVM / compiler toolchains

What we offer

  • Projects for clients such as PayPal, Wargaming, Xerox, Philips, Adidas and Toyota
  • Competitive compensation that depends on your qualifications and skills
  • Career development system with clear skill qualifications
  • Flexible working hours aligned to your schedule
  • Options to work remotely
  • Corporate medical insurance covering services at private and public medical centers
  • English courses online
  • Corporate parties and events for employees and their children
  • Internal conferences, workshops and meetups for learning and experience sharing
  • Gym membership compensation
  • 5 days of paid sick leave per year with no obligation to submit a sick-leave certificate

Similar Jobs for Senior Distributed Systems Engineer (HPC Platform)

8 matching positions

Senior Aerodynamics Systems Engineer-Motorsports

The Team: GM’s Motorsports Platform & Systems team analyzes, defines, and delive...
Location: United States, Concord
Salary: 125200.00 - 192700.00 USD / Year
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 5+ years hands-on full-stack or backend-focused development experience, with strong emphasis on server-side architectures supporting batch processing, multi-stage pipelines, and compute-intensive workloads (e.g., CFD workflows, large numerical simulation pipelines, HPC job orchestration)
  • 2+ years designing and developing backend web services (REST/gRPC) including server-side batch execution engines, distributed compute orchestration, asynchronous task processing, and workflow automation for CFD or similar computational pipelines
  • Demonstrated experience building high-performance server-side processing frameworks, including parallelized job execution, distributed scheduling, queue-based workloads, and fault-tolerant pipeline management
  • Strong experience with pipeline-oriented architectures, such as CFD post-processing chains, multi-stage data conditioning workflows, large model computation pipelines, or batch-driven scientific/engineering processing systems
  • Proficiency in two or more backend-focused languages or ecosystems: Java, Python, Scala, C#/.NET, or equivalent, used to build distributed compute services and processing automation
  • Experience with containerized compute environments (Kubernetes, Docker), especially for scaling simulation services, HPC workflow endpoints, or compute-heavy microservices
  • Solid understanding of software development best practices, DevOps, CI/CD, observability (metrics/logging/tracing), and reliability engineering for long-running, high-load backend systems
  • Experience working in an agile/scrum environment, especially on teams delivering simulation pipelines, compute orchestration services, or backend system components
  • Demonstrated ability to articulate sound technical decisions and deep understanding of distributed, event-driven, or batch-processing architectures, especially those powering HPC, CFD simulations, or multidisciplinary compute workloads
  • Highly collaborative mindset with strong communication skills, especially when working with simulation engineers, aerodynamicists, data engineers, and HPC platform teams
Job Responsibility
  • Implement and maintain GM Motorsports aero-thermal applications including CFD model construction, visualization, and analysis using microservices architectures to creatively integrate loosely coupled systems
  • Define a templated approach to integrate dependent systems in a functional programming model
  • Scrum story delivery
  • Playbooks, implementation architectures, interfaces, build frameworks, code, testing, deployment for your story
  • Participation in solution architectures
  • Working with other members to collaborate, support, and otherwise work together
What we offer
  • An incentive pay program offers payouts based on company performance, job level, and individual performance
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
Employment type: Full-time

Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

We are looking for a highly skilled engineer with deep expertise in building and...
Location: United States, San Francisco
Salary: 166000.00 - 201000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices
Job Responsibility
  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment type: Full-time

Member of Technical Staff, Software Co-Design AI HPC Systems

Our team’s mission is to architect, co-design, and productionize next-generation...
Location: United States, Mountain View
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Strong background in one or more of the following areas:
      • AI accelerator or GPU architectures
      • Distributed systems and large-scale AI training/inference
      • High-performance computing (HPC) and collective communications
      • ML systems, runtimes, or compilers
      • Performance modeling, benchmarking, and systems analysis
      • Hardware–software co-design for AI workloads
  • Proficiency in systems-level programming (e.g., C/C++, CUDA, Python) and performance-critical software development.
  • Proven ability to work across organizational boundaries and influence technical decisions involving multiple stakeholders.
Job Responsibility
  • Lead the co-design of AI systems across hardware and software boundaries, spanning accelerators, interconnects, memory systems, storage, runtimes, and distributed training/inference frameworks.
  • Drive architectural decisions by analyzing real workloads, identifying bottlenecks across compute, communication, and data movement, and translating findings into actionable system and hardware requirements.
  • Co-design and optimize parallelism strategies, execution models, and distributed algorithms to improve scalability, utilization, reliability, and cost efficiency of large-scale AI systems.
  • Develop and evaluate what-if performance models to project system behavior under future workloads, model architectures, and hardware generations, providing early guidance to hardware and platform roadmaps.
  • Partner with compiler, kernel, and runtime teams to unlock the full performance of current and next-generation accelerators, including custom kernels, scheduling strategies, and memory optimizations.
  • Influence and guide AI hardware design at system and silicon levels, including accelerator microarchitecture, interconnect topology, memory hierarchy, and system integration trade-offs.
  • Lead cross-functional efforts to prototype, validate, and productionize high-impact co-design ideas, working across infrastructure, hardware, and product teams.
  • Mentor senior engineers and researchers, set technical direction, and raise the overall bar for systems rigor, performance engineering, and co-design thinking across the organization.
Employment type: Full-time

Senior Software Engineer (Data)

Optiver is a global market maker founded in Amsterdam, with offices in London, C...
Location: Australia, Sydney
Salary: Not provided
Company: Optiver
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in software engineering, with a data focus
  • Demonstrable proficiency in Python
  • Familiarity with distributed data processing and building robust data pipelines
  • Familiarity with the use of or concepts underpinning various Data Engineering technologies and approaches
  • Ability to lead projects autonomously, prioritise tasks, and deliver high-quality results
  • Proven success working with research and platform engineering teams to support data-driven projects
  • Strong problem-solving skills, focusing on efficiency, scalability, and cost-effectiveness
  • Excellent written and verbal communication skills, with the ability to convey technical concepts to both technical and non-technical stakeholders
  • A Bachelor’s or Master’s degree in Computer Science, Engineering, Data Science, or a related field
Job Responsibility
  • Data Pipeline Design & Development: Architect, build, and manage scalable data pipelines using Spark, Databricks, and proprietary HPC tooling, ensuring high availability, scalability, and performance
  • Platform and Tooling Development: Advance the state of the art of Optiver’s Research and Data platforms and tools
  • Cost Optimisation: Monitor and optimise resource usage in partnership with the Data Platform team, balancing performance and cost-effectiveness
  • Monitoring & Tuning: Implement monitoring tools and fine-tune systems for optimal throughput, latency, and reliability
  • Cross-Functional Collaboration: Work closely with global Data Platform teams to maintain alignment on data strategy, tools, and best practices
  • Documentation & Standards: Develop and maintain clear documentation of data pipeline architectures, processes, and best practices
  • Mentorship & Guidance: Provide technical mentorship to junior engineers and support the growth of the data engineering function in Sydney
What we offer
  • The chance to work alongside best-in-class professionals
  • Competitive remuneration, including an attractive bonus structure and additional leave entitlements
  • Training, mentorship and personal development opportunities
  • Gym membership plus weekly in-house chair massages
  • Daily breakfast, lunch and an in-house barista
  • Regular social events including a company trip every two years
  • Guided relocation, a competitive relocation package and visa sponsorship where necessary
Employment type: Full-time

Senior Software Engineer - Copilot Security

Copilot Security is at the core of Microsoft’s mission to deliver trusted, human...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • 3+ years in technical engineering roles building large-scale services.
  • Hands-on experience designing and operating security-critical or AI-powered systems at scale, including agentic AI, secure orchestration, or advanced threat defenses.
  • Proven ability to design, build, and ship agentic AI features or frameworks.
  • Ability to clearly explain complex systems and security concepts to technical and non-technical stakeholders and influence cross-org roadmaps.
  • Agentic AI Development & Orchestration: Experience building production agent systems using frameworks such as LangGraph, Amazon Strands SDK, or similar platforms; familiarity with agentic design patterns including tool calling, multi-agent coordination, and secure delegation patterns.
  • Hands-on experience with distributed training frameworks (Ray, Slurm, HPC), containerization and orchestration technologies (Docker, Kubernetes) for ML model deployment, and ML lifecycle management in production environments.
  • Experience designing evaluation frameworks for LLM-based applications and implementing observability for agent systems using tools such as Phoenix, MLflow, Langfuse, or custom eval harnesses; understanding of AI safety evaluation methodologies including adversarial testing and red-teaming.
Job Responsibility
  • Develop and ship agentic AI-powered security features that protect users from threats such as prompt injection, adversarial manipulation, and abuse of agentic workflows.
  • Implement secure orchestration frameworks that enable Copilot to safely delegate, coordinate, and execute actions across devices, services, and platforms.
  • Invent and apply new intelligent agents that leverage information flow analysis and apply common sense and judgement guardrails for security and privacy.
  • Collaborate with product, engineering, security, privacy, and AI teams to adopt agentic security patterns and best practices across Copilot and MAI.
  • Monitor key metrics for agentic AI security and innovation, using data-driven insights to improve defenses and enablement.
  • Document secure agentic AI patterns, ensuring they address novel risks, support safe delegation, and enable responsible orchestration of actions.
Employment type: Full-time

Senior Principal Engineering Manager

Microsoft Research (MSR) is working to transform the future of artificial intell...
Location: United States, Redmond
Salary: 163000.00 - 296400.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 5+ years of people management experience leading software engineering teams, including managing principal engineers
  • Experience building or operating infrastructure for large-scale distributed systems, cloud platforms, or artificial intelligence (AI)/machine learning (ML) workloads
  • Track record of driving execution on complex, multi-workstream infrastructure projects with clear milestones and accountability
  • Technical fluency in one or more of: large-scale compute clusters, GPU infrastructure, scheduling and orchestration (Kubernetes, Volcano), or High-Performance Compute (HPC) environments
  • Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
  • Expertise in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
  • A track record of strong cross-functional partnerships, including the ability to align on strategic direction, deliver joint accountabilities, and develop relationships with staff members with widely varied expertise
  • Experience scaling engineering teams through significant growth phases (hiring, onboarding, and integrating new engineers into a high-performing team)
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 15+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Job Responsibility
  • Lead, mentor, and grow the engineering team that builds MSR’s AI research infrastructure
  • Recruit and develop exceptional engineering talent, building a diverse team - including hiring, onboarding, career development, and performance management
  • Drive execution across the team by setting clear goals, tracking milestones, managing dependencies, and ensuring accountability for delivering complex infrastructure projects on time and at high quality
  • Lead team culture and process changes, cultivating an AI-first mentality that accelerates our progress through agentic coding, automation, and skills development
  • Provide technical vision and judgment on the team's architecture, strategy, and roadmap — spanning supercomputer GPU clusters, high performance networking, workload optimization, researcher tools, and agentic workflows — while empowering engineers to own deep technical details
  • Collaborate closely cross-discipline with engineers, program managers, and research and science teams to align priorities, resolve dependencies, and build better solutions together
  • Foster a team culture of operational excellence, continuous improvement, and high psychological safety where engineers are empowered to take ownership and innovate
Employment type: Full-time

Senior Compute Cluster Administrator

We are looking for a Senior Compute Cluster Administrator responsible for operat...
Location: United States, Austin; Santa Clara; Seattle
Salary: 109760.00 - 164640.00 USD / Year
Company: AMD
Expiration Date: Until further notice
Requirements
  • Hands‑on experience administering or supporting HPC clusters in production, research, or academic environments
  • Practical experience working as an HPC user combined with Linux system administration in enterprise or lab environments
  • Background in software development combined with deep Linux systems exposure in server or infrastructure contexts
  • Demonstrated intermediate to advanced Linux expertise
  • Strong understanding of networking fundamentals, including the OSI model, multi‑homed systems, firewall troubleshooting, and high‑speed interconnects
  • Willingness to experiment with open‑source and emerging technologies
  • Experience supporting infrastructure services such as DNS, DHCP, BOOTP, PXE, TFTP, NTP, and PAM
  • Understanding of interprocess communication and familiarity with MPI implementations such as OpenMPI or MPICH
  • Proficiency with Linux troubleshooting tools such as nmap, gdb, lsof, sar, and server management interfaces including IPMI, iDRAC, and iLO
  • Working knowledge of virtualization, VLANs, and directory services
Job Responsibility
  • Work directly with tenants and stakeholders to maximize service quality, utilization, and availability of managed compute clusters
  • Collaborate with highly technical users working deep within AMD’s Instinct platform (e.g., ROCm) to troubleshoot misconfigurations impacting HPC performance
  • Lead the resolution of complex issues during new deployments and ongoing operations
  • Partner with hardware vendors on technical escalations involving third‑party OEM platforms and coordinate maintenance cycles aligned with upstream releases
  • Support multiple Linux distributions across Red Hat and Ubuntu/Debian families
  • Act as a subject matter expert in one or more cluster scheduling technologies such as Slurm, LSF, Sun Grid Engine, OpenLava, or Kubernetes
  • Compare configurations and behaviors across heterogeneous clusters within AMD’s compute estate
  • Engage with emerging technologies where formal documentation may be limited, including white‑box platforms and pre‑beta hardware
  • Maintain and evolve compute images using automated CI/CD pipelines, or deploy software manually where automation is not available
  • Monitor cluster health, performance, and availability using standard tooling such as Grafana, Prometheus, and Zabbix
Employment type: Full-time

Senior Manager, AI Infrastructure and Operations

The Sr. Manager/Staff Engineer, AI Infrastructure & MLOps Engineering is a senio...
Location: Japan, Tokyo
Salary: Not provided
Company: Pfizer
Expiration Date: Until further notice
Requirements
  • 8+ years of hands-on software engineering experience in cloud infrastructure, DevOps, and MLOps
  • Deep expertise in Python, Kubernetes, Terraform, Helm, and CI/CD pipeline development
  • Proven experience architecting and operating containerized solutions on AWS, GCP, and Azure
  • Strong knowledge of Infrastructure-as-Code, distributed systems, and production system reliability
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related field
Job Responsibility
  • Design, implement, and own large-scale cloud-based HPC and MLOps platforms supporting AI model training, genomic sequencing, and precision medicine
  • Architect multi-environment clusters (AWS, GCP, Azure), enabling GPU/FPGA workloads and advanced observability
  • Lead the development of developer and cloud platforms, including internal engineering accelerators and reusable toolsets
  • Design, implement, and manage unified platform catalogs using Backstage, enhancing developer experience and application metadata management
  • Develop custom plugins and APIs for Backstage to support internal engineering workflows and documentation
  • Build and maintain Python-based automation frameworks, CI/CD pipelines, and Infrastructure-as-Code (Terraform, Helm, Pulumi, AWS CDK)
  • Operationalize containerized solutions using Docker and Kubernetes, integrating MLflow, Kubeflow, and other orchestration platforms
  • Implement robust automation for provisioning, configuring, and managing cloud resources across multiple environments
  • Lead the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and advanced observability (Prometheus, Grafana, PagerDuty)
  • Develop and maintain APIs and services for model management, feature stores, and inference pipelines
Employment type: Full-time