
Systems Design Engineer - AI Cluster Software

United States, Austin · Employment contract · 163200.00 - 244800.00 USD / Year · Job Posted June 03, 2026

Job Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences, from AI and data centers to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

Job Responsibility

  • Apply your expertise to shape AI infrastructure by creating reference architectures, configuration guides, and deployment blueprints that help internal teams and customers make informed hardware and software decisions
  • Perform deep technical evaluations of AI stacks across compute, storage, networking, and observability layers, documenting how they work, where they fit, and the tradeoffs involved
  • Design and execute reproducible experiments and benchmarking harnesses to compare technologies such as schedulers, distributed training libraries, and observability stacks
  • Develop small reference implementations and tools to validate performance hypotheses and analyze system behavior
  • Build a library of technical artifacts, including presentations, design documents, and “how it works” guides, to support pre-sales engineers and enable others to skill up from an HPC perspective
  • Present findings through demos, documentation, and internal talks, and create templates and checklists to support repeatable evaluations and cluster designs

Requirements

  • Bachelor's or Master's degree in electrical or computer engineering
  • Evidence of end-to-end systems thinking, debugging, and tradeoff decisions
  • Hands-on familiarity with at least two schedulers and/or orchestration systems (e.g., Slurm, Kubernetes), MPI/OpenMP, distributed storage patterns, or performance analysis
  • Experience writing evaluation docs/RFCs with clear criteria, benchmarks, risks, and recommendations
  • Strong Linux fundamentals: Linux operating systems, networking, filesystems, containers, performance tooling (perf, flamegraphs, nvprof/rocprof, basic eBPF)
  • Ability to turn complex systems into accessible, structured documentation with diagrams and reproducible steps
  • ROCm, RCCL, Instinct GPUs, EPYC platforms, compiler/toolchain impacts, and performance tuning
  • DDP, collective comms, sharded/stateful optimizers
  • NCCL/RCCL behavior and transport considerations (PCIe, NVLink, Infinity Fabric)
  • Slurm configuration patterns, Kubernetes for HPC/AI (GPU operators, device plugins), Apptainer/Singularity
  • Parallel filesystems (Lustre, BeeGFS), object stores, RDMA, data pipeline throughput and caching strategies
  • Terraform/Ansible for reproducible blueprints—focused on design and sample configs, not running prod clusters
  • Reproducible docs/workbooks, literate programming notebooks, CI for benchmarks

Similar Jobs for Systems Design Engineer - AI Cluster Software

8 matching positions

Principal Engineer for Storage Software Development

In the HPE Hybrid Cloud, we lead the innovation agenda and technology roadmap fo...
Location: India, Bangalore
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Proven track record of delivering V1 products and anchoring multiple releases in storage product development
  • Demonstrated ability to guide customers and act as a trusted advisor for their technology decisions
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 15-20 years of experience
  • Expertise in multiple software systems design tools and languages
  • Strong analytical and problem-solving skills
  • Designing software systems running on multiple platform types
  • Software systems testing methodology, including writing and execution of test plans, debugging, and testing scripts and tools
  • Excellent written and verbal communication skills
  • Mastery of English and the local language
Job Responsibility
  • Set technology direction for the broader engineering team on next-generation storage involving multiple technologies such as object, file, and AI-ready workloads
  • Detail multi-release delivery content from the high-level product vision
  • Help leadership and Product Management understand the finer details of contemporary technology trends
  • Inspire the engineering team to question the status quo and make bold moves on technology roadmap and deliverables
  • Designs enhancements, updates, and programming changes for portions and subsystems of systems software, including operating systems, compilers, networking, utilities, databases, and Internet-related tools
  • Analyzes design and determines coding, programming, and integration activities required based on general objectives and knowledge of overall architecture of product or solution
  • Writes and executes complete testing plans, protocols, and documentation for assigned portion of application
  • Identifies, debugs, and creates solutions for issues with code and integration into application architecture
  • Leads a project team of other software systems engineers and internal and outsourced development partners to develop reliable, cost-effective, and high-quality solutions for assigned systems portion or subsystem
  • Collaborates and communicates with management, internal, and outsourced development partners regarding software systems design status, project progress, and issue resolution
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Machine Learning Engineer - LLMs & Generative AI

Truveta is the world’s first health provider led data platform with a vision of ...
Location: United States, Seattle
Salary: 155000.00 - 175000.00 USD / Year
Truveta
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in software engineering or machine learning (3+ years with a PhD)
  • Experience designing and training LLMs or large-scale generative models (e.g., GPT, PaLM, LLaMA, Claude, Gemma)
  • Deep expertise in NLP, sequence modeling, and transformer architectures
  • Proficient in Python and ML libraries such as PyTorch or TensorFlow
  • Strong engineering skills in building scalable ML pipelines
  • Experience with RL-based fine-tuning (e.g., Reinforcement Learning from Human Feedback) and evaluation of generative systems
  • Proven ability to lead technical projects and collaborate across teams
  • Bachelor's degree in Computer Science, Engineering, or a related technical field
Job Responsibility
  • Lead the development, training, and deployment of large language and multimodal foundation models tailored to clinical and biomedical domains
  • Apply and refine state-of-the-art techniques such as supervised fine-tuning (SFT), reinforcement learning-based methods (e.g., RLHF, RLVR), parameter-efficient fine-tuning (PEFT), prompt tuning, and retrieval-augmented generation (RAG)
  • Collaborate cross-functionally with researchers, clinicians, and engineers to design ML-driven solutions that improve healthcare delivery and outcomes
  • Build scalable infrastructure for distributed training of large models (TPU/GPU clusters, multi-node orchestration)
  • Design and evaluate models for robustness, bias mitigation, factual consistency, and explainability in healthcare contexts
  • Stay current with the latest research in generative AI and contribute back to the community through publications and open-source initiatives
What we offer
  • Interesting and meaningful work for every career stage
  • Great benefits package
  • Comprehensive benefits with strong medical, dental and vision insurance plans
  • 401K plan
  • Professional development & training opportunities for continuous learning
  • Work/life autonomy via flexible work hours and flexible paid time off
  • Generous parental leave
  • Regular team activities (virtual and in-person)
  • Additional compensation such as incentive pay and stock options
  • Fulltime

Staff Engineer - AI Platform

At Teradata, we're not just managing data; we're unleashing its full potential. ...
Location: India, Telangana
Salary: Not provided
Teradata
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, Data Science, or a related field
  • Experience in UI/UX design and frontend engineering
  • Proficiency in Angular, TypeScript, Figma, and design systems
  • Strong engineering background (Python/Java/Golang, API integration, backend frameworks)
  • Strong system design skills and understanding of distributed systems
  • Experience developing native notebook interfaces or extensions (e.g., Jupyter, VS Code Notebooks, or custom notebook UIs)
  • Strong understanding of human-computer interaction (HCI)
  • Experience designing interfaces for complex workflows or ML-powered products
  • Experience with LLM-based tools or agent orchestration (e.g., LangChain, AutoGen)
  • Familiarity with containerized environments (Docker, Kubernetes) and CI/CD pipelines
Job Responsibility
  • Design and prototype interfaces for interacting with autonomous agents
  • Implement responsive, accessible, and explainable UI components that visualize AI decisions, uncertainty, and reasoning paths
  • Partner with AI researchers and software engineers to ensure interfaces support emerging agent capabilities
  • Conduct usability testing with human-in-the-loop scenarios
  • Drive UX best practices around safety, trust calibration, and explainability for intelligent systems
  • Design and prototype intuitive, high-impact interfaces for interacting with autonomous and intelligent agents
  • Experiment with LLM APIs, agentic workflows, and cutting-edge open-source frameworks
  • Explore and implement planning systems, vector databases, or memory architectures such as graph-based storage on Teradata
  • Champion UX best practices around safety, trust calibration, and explainability for intelligent systems
  • Collaborate with AI researchers and software engineers to ensure interfaces evolve alongside emerging agent capabilities
What we offer
  • We prioritize a people-first culture
  • We embrace a flexible work model
  • We focus on well-being
  • We are committed to actively working to foster an inclusive environment that celebrates people for all of who they are
  • Fulltime

Principal Engineer - Data path - HPE Alletra Storage MP X10000 (Object Storage product development)

Develops organization-wide architectures and methodologies for software systems ...
Location: India, Bengaluru
Salary: Not provided
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • 15+ years of experience in a product development environment on storage/system engineering
  • Track record of delivering V1 products (or early-stage product development) in modern storage technologies (Object/File storage for modern AI use-cases, Object storage, cloud storage)
  • A track record of establishing and assuring adherence to performance requirements, work plans, and schedules for significant engineering initiatives
  • Experience designing and developing software systems design tools and languages
  • Experience in storage product development for file, block, or object storage
  • Excellent analytical and problem-solving skills
  • Experience in overall architecture of software systems for products and solutions
  • Designing and integrating software systems running on multiple platform types into overall architecture
  • Evaluating and selecting forms and processes for software systems testing and methodology, including writing and execution of test plans, debugging, and testing scripts and tools
Job Responsibility
  • Develops organization-wide architectures and methodologies for software systems design and development across multiple platforms and organizations within the Global Business Unit
  • End-to-End Ownership and Technical Leadership
  • Identifies and evaluates new technologies, innovations, and outsourced development partner relationships for alignment with technology roadmap and business value
  • Creates plans for integration and update into architecture
  • Anticipate bottlenecks and architect innovative solutions
  • Reviews and evaluates designs and project activities for compliance with development guidelines and standards
  • Provides tangible feedback to improve product quality and mitigate failure risk
  • Drive best practices and operational excellence both at the team and organizational level
  • Coach and mentor junior and mid-level developers to help them grow technically and understand best practices
  • Leverages recognized domain expertise, business acumen, and experience to influence decisions of executive business leadership, outsourced development partners, and industry standards groups
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Head of Performance Profiling

We are hiring a Head of Performance Profiling to define how performance is under...
Location: United States, San Jose
Salary: Not provided
Etched
Expiration Date: Until further notice
Requirements
  • Deep experience building complex systems at the intersection of hardware and software
  • Personally envisioned and built significant portions of profiling, tracing, or observability systems — not solely defined requirements or product strategy
  • Demonstrated ability to translate raw hardware signals into scalable, production-grade telemetry and analysis infrastructure
  • Experience correlating time-series events across distributed systems
  • Deep systems programming expertise (C++ or Rust), with a track record of shipping low-level infrastructure operating close to hardware or runtime systems
  • Experience designing distributed correlation mechanisms, timestamp-alignment strategies, or performance modeling frameworks across multiple devices or hosts
  • A history of introducing new technical abstractions or counter models that materially improved how engineers debug and optimize systems
  • Experience designing distributed tracing or observability platforms at scale
  • Experience with high-performance computing systems and large AI training clusters
  • Experience with timestamp synchronization strategies and event alignment in distributed environments
Job Responsibility
  • System-Level Performance Design: Define the architectural approach for collecting and structuring telemetry across CPUs, drivers, interconnects, and multiple accelerators
  • Design scalable models for correlating performance events across device and host boundaries
  • Cross-Layer Event Correlation: Develop mechanisms to align hardware counters, runtime activity, communication phases, and workload semantics across model-layer execution into coherent, actionable insight
  • Implement time synchronization and trace-alignment strategies across multi-device systems
  • Telemetry & Counter Modeling: Define structured counter taxonomies separating base signals from derived metrics
  • Design derived performance models bridging low-level hardware signals and workload-level behavior
  • Influence instrumentation strategy for future hardware generations
  • Distributed Performance Reasoning: Build tools that identify bottlenecks among multi-accelerator workloads across chips within hosts
  • Build cluster-scale performance analysis for distributed inference across data center networks
  • Tooling & Insight Delivery: Contribute to analysis engines and developer-facing tooling that transform raw telemetry into intuitive insight
What we offer
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch and dinner in our office
  • Fulltime

AI Cluster & Data Center Design Engineer

We are seeking a highly skilled systems engineer to architect and design scalabl...
Location: United States, Austin
Salary: 139440.00 - 209160.00 USD / Year
AMD
Expiration Date: Until further notice
Requirements
  • Experience in HPC, AI infrastructure, or data center systems engineering
  • Strong understanding of rack and data center power delivery
  • Knowledge of GPU/CPU architectures, PCIe, UALink, InfiniBand, and Ethernet networking
  • Familiarity with AI/ML frameworks and workload characteristics
  • Excellent problem-solving, communication, and documentation skills
  • Bachelor's or Master's degree in Electrical Engineering, Computer Engineering, Computer Science or related field
Job Responsibility
  • Design scalable AI/HPC clusters including compute, storage, and networking with specific focus on power delivery
  • Evaluate and select CPUs, GPUs, accelerators, interconnects, and memory configurations for optimal cluster performance
  • Design leading-edge power delivery solutions for high-density AI/GPU deployments
  • Define power budgets, redundancy schemes, and fault tolerance mechanisms
  • Design network topologies to maximize overall cluster performance
  • Understand the network performance needs of different types of workloads
  • Understand advantages and performance trade-offs of network topologies for AI/HPC clusters
  • Design and optimize storage solutions to maximize AI/HPC cluster performance
  • Understand advantages and performance trade-offs of cluster storage solutions, e.g. Lustre and Ceph
  • Work across multiple organizations with subject matter experts from hardware, software, network, data center, and operations teams to deliver scalable, efficient, and reliable compute infrastructure

Infrastructure Software Engineer

Building cutting-edge model-specific ASICs requires crafting custom infrastructu...
Location: Taiwan, Taipei
Salary: Not provided
Etched
Expiration Date: Until further notice
Requirements
  • Are a systems-minded software engineer who loves building foundational platforms, working close to the metal and cloud, solving high-leverage problems at scale
  • Are a deeply technical engineer who treats infrastructure as a software problem - prioritizing clean abstractions, version control, small change lists, easy roll backs, testing, and long-term maintainability over ad hoc configuration
  • Have strong programming skills in languages such as Python, Go, Rust, and C++, and are comfortable building production-grade tooling
  • Have experience manufacturing hardware with major firms in Taiwan
  • Possess expert-level knowledge of Linux, virtualization, containerization, and CI/CD pipelines, with a deep understanding of how to debug, optimize, and scale complex systems
  • Are familiar with Infrastructure as Code tools like OpenTofu, Ansible, or Puppet, and enjoy designing declarative, reproducible infrastructure systems
  • Understand and use PromQL and other telemetry/query languages, have used LLMs to extract insight from real-time metrics, and know how to architect and tune observability stacks
  • Have a track record of debugging and resolving difficult hardware-software integration problems across bare-metal systems, networks, and distributed workloads
  • Can lead and mentor technical teams, guiding design decisions and helping others develop sound engineering instincts
  • Have 4+ years of experience in infrastructure engineering, systems programming, or backend software development - ideally in environments where performance, scale, or hardware interaction mattered
Job Responsibility
  • Architect and Scale Distributed Compute Systems: Design and build the orchestration layers that drive our hybrid high-performance clusters—enabling simulation, synthesis, and continuous integration of AI ASICs at unprecedented scale
  • Build Infrastructure-as-Code Systems: Develop and maintain a fully programmable infrastructure control plane to ensure reproducibility, auditability, and rapid iteration across the entire stack
  • Optimize End-to-End Developer Experience: Create tools and abstractions that empower engineers to harness massive parallelism without worrying about the underlying complexity
  • Workload Elasticity, Reliability, and Efficiency: Prototype and execute workload orchestration and migration strategies between on-premise and cloud environments, balancing performance, storage availability and replication, uptime, and cost across heterogeneous hardware and compute backends
  • Implement real-time telemetry and tracing systems that surface insights from millions of metrics, enabling proactive debugging and system optimization
  • Push the Limits of Observability: Build a full observability stack that includes dashboards, alerting, automated responses, and a synthetic testing framework to proactively test infrastructure performance and reliability for various application and data flows, ensuring we remain ahead of issues impacting development and productivity workflows
  • Build an integrated, world-class manufacturing infrastructure: Work in close collaboration with partners to design, test, and ship the highest-quality AI acceleration hardware
What we offer
  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits
  • Fulltime

Distinguished Technologist, Presales Engineering

Distinguished Technologist, Presales Engineering. This role has been designated ...
Location: United States, All, New Jersey
Salary: 203500.00 - 492500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • AI/ML Networking Expertise: Proven experience designing networks for AI clusters, with a deep understanding of lossless Ethernet, congestion control algorithms (PFC, ECN), and load-balancing techniques specific to AI traffic patterns
  • Edge & Distributed Compute Knowledge: Technical proficiency in MEC (Multi-access Edge Computing) and how it intersects with AI inferencing models and 5G/6G transport
  • Advanced Network Design: 15+ years of experience in Network Infrastructure Design, with at least 3-5 years focused on Hyperscale Datacenter or Cloud-Scale SP architectures
  • Modern Protocol Fluency: Working knowledge of and hands-on experience with JunOS, EOS, or IOS (certifications preferred), with specialized knowledge in EVPN-VXLAN, Segment Routing (SRv6), and telemetry-driven automation
  • Systems Thinking: Ability to bridge the gap between hardware (GPU/NIC/Switch) and software (AI Frameworks, Kubernetes, Virtualization) to provide a holistic "AI Fabric" vision
  • Thought Leadership: Significant experience providing both solution and commercial leadership, specifically in translating the "Cost of AI" into a value-based networking ROI for C-level stakeholders
  • Executive Presence: Professional with strong business acumen and the ability to build relationships with technical decision makers and C-level executives in client organizations
  • Sales Mastery: Experience preparing RFP/Tender response documents, including compliance, bill of materials, and solution documents to drive a successful response
  • Resource Management: Strong resource management skills, including how and when to effectively engage SMEs, specialists, and Engineering resources
  • Minimum Qualifications: 15+ years relevant industry experience in Cloud/Networking Infrastructure, routing, and switching domains
Job Responsibility
  • Lead AI Grid Strategy: Serve as the primary architect for "AI Grid" initiatives, helping Service Providers build interconnected, high-performance compute and networking fabrics designed specifically for distributed AI workloads
  • Architect Edge AI Solutions: Design and implement low-latency networking architectures to support Inferencing at the Edge, ensuring SPs can deliver AI services closer to the end-user with minimal jitter and maximum throughput
  • Optimize AI Infrastructure: Articulate the value of specialized AI/ML networking, including the orchestration of RDMA over Converged Ethernet (RoCE v2) and InfiniBand-to-Ethernet transitions within the Modern Datacenter
  • End-to-End AI Design: Responsible for the architecture of end-to-end Networking Infrastructure that supports both the "Front-end" (management/client) and "Back-end" (GPU-to-GPU) AI clusters
  • Strategic Advisory: Advise Sales Engineers and partners on the transition from traditional SP routing to AI-Optimized WAN and Fabric solutions, ensuring customers' business requirements for massive scale and predictive analytics are met
  • Ecosystem Integration: Develop deep partnerships with Silicon providers, AI software innovators, and Integrators to solve emerging "AI-scale" problems across multi-tenant environments
  • Hyperscale AI Networking: Architect and design Hyperscale solutions that specifically address "Job Completion Time" (JCT) metrics critical to AI training customers
  • Complex Deal Orchestration: In complex Networking Infrastructure deals, the Distinguished Technologist will be responsible for the end-to-end technical solution & customizations, orchestrating other Specialists and Technical resources including Systems Engineers as well as Product Management
  • Technical Validation: Excel in delivering Demos and Proof of Concept on solutions as well as clearly articulating the value proposition aligned to customer use cases
  • Stakeholder Engagement: Build relationships with senior resources and technical decision makers with key customers to ensure smooth knowledge transfer and hand-over to the delivery team
What we offer
  • Health & Wellbeing: We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing
  • Personal & Professional Development: We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have — whether you want to become a knowledge expert in your field or apply your skills to another division
  • Unconditional Inclusion: We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good
  • Fulltime