AI Cluster & Data Center Design Engineer

AMD

Location:
United States, Austin

Contract Type:
Not provided

Salary:

139440.00 - 209160.00 USD / Year

Job Description:

We are seeking a highly skilled systems engineer to architect and design scalable AI/HPC clusters, with a specific focus on rack and data center power delivery. The role involves evaluating and selecting compute, storage, networking, and power delivery components and solutions to optimize performance and reliability across global deployments. You will collaborate with cross-functional teams to deliver cutting-edge infrastructure for AI and high-performance computing workloads.

Job Responsibility:

  • Design scalable AI/HPC clusters, including compute, storage, and networking, with a specific focus on power delivery
  • Evaluate and select CPUs, GPUs, accelerators, interconnects, and memory configurations for optimal cluster performance
  • Design leading-edge power delivery solutions for high-density AI/GPU deployments
  • Define power budgets, redundancy schemes, and fault tolerance mechanisms
  • Design network topologies to maximize overall cluster performance
  • Understand the network performance needs of different types of workloads
  • Understand advantages and performance trade-offs of network topologies for AI/HPC clusters
  • Design and optimize storage solutions to maximize AI/HPC cluster performance
  • Understand advantages and performance trade-offs of cluster storage solutions, e.g. Lustre, Ceph, etc.
  • Work across multiple organizations with subject matter experts from hardware, software, network, data center, and operations teams to deliver scalable, efficient, and reliable compute infrastructure

Requirements:

  • Experience in HPC, AI infrastructure, or data center systems engineering
  • Strong understanding of rack and data center power delivery
  • Knowledge of GPU/CPU architectures, PCIe, UALink, InfiniBand, and Ethernet networking
  • Familiarity with AI/ML frameworks and workload characteristics
  • Excellent problem-solving, communication, and documentation skills
  • Bachelor's or Master's degree in Electrical Engineering, Computer Engineering, Computer Science or related field

Nice to have:

  • Experience designing power delivery solutions for racks and data centers
  • Contributions to open-source HPC or AI infrastructure projects

Additional Information:

Job Posted:
March 21, 2026

Similar Jobs for AI Cluster & Data Center Design Engineer

Business Development Manager – HPE POD (Modular Data Center Solutions)

Develop and grow the HPE Modular Data Center (POD) and AI infrastructure busines...
Location:
Japan, Tokyo
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in engineering, computer science, or related technical field, or equivalent industry experience
  • Typically 8+ years of professional experience in data center infrastructure, HPC environments, modular infrastructure, or enterprise technology solutions
  • Experience in business development, solution sales, or infrastructure consulting within enterprise IT, cloud, or data center industries
  • Experience engaging with senior technical and executive stakeholders on infrastructure strategy and large-scale technology investments
  • Experience supporting complex infrastructure deals involving multiple stakeholders, partners, and delivery organizations
  • Strong understanding of modern data center architecture, including modular data centers and containerized infrastructure; high-density GPU and HPC environments; liquid cooling and advanced thermal management; AI infrastructure and accelerated computing platforms; and enterprise and hyperscale data center operations
  • Ability to translate complex technical infrastructure concepts into business value and strategic outcomes for customers
  • Strong commercial acumen with the ability to structure large infrastructure deals and navigate enterprise procurement processes
  • Experience working across multi-technology environments including compute, networking, storage, cooling systems, and facility infrastructure
  • Ability to develop scalable infrastructure solutions that enhance performance, efficiency, and time-to-deployment for AI and HPC workloads
Job Responsibility:
  • Drive business development activities for HPE POD and modular data center solutions across targeted industries including AI, education, research, and enterprise environments
  • Identify, qualify, and develop new opportunities for modular data center deployments including AI factory infrastructure, HPC clusters, GPU environments, and edge data center solutions
  • Lead engagement with customers to understand technical, operational, and business requirements for large-scale data center deployments and translate these into POD-based solutions
  • Work closely with HPE account teams, solution architects, and partners to develop end-to-end proposals including infrastructure architecture, modular facility design, and lifecycle services
  • Act as a trusted advisor to customer executives, infrastructure teams, and decision makers on modern data center architecture, capacity scaling strategies, and AI-ready infrastructure
  • Coordinate cross-functional resources including engineering, supply chain, manufacturing partners, and delivery teams to ensure solutions are feasible, scalable, and aligned with customer timelines
  • Lead or support major proposal efforts and RFP responses for modular data center solutions, including technical positioning, commercial structuring, and value articulation
  • Support the creation of detailed solution architectures, including modular data center configurations and cooling strategies (air and liquid)
  • Develop and maintain relationships with strategic ecosystem partners including cooling technology providers, modular construction manufacturers, and infrastructure integrators
  • Provide market intelligence and customer feedback to influence the evolution of the HPE POD portfolio
What we offer:
  • Health & Wellbeing (comprehensive suite of benefits supporting physical, financial and emotional wellbeing)
  • Personal & Professional Development (programs to help reach career goals)
  • Unconditional Inclusion
  • Fulltime

Lead Advanced Analytics, Digital & AI Products

The Analytics Centre of Excellence (ACOE) at Airbnb, based in India, is a hub of...
Location:
India, Bangalore
Salary:
2940000.00 - 4170000.00 INR / Year
Airbnb
Expiration Date:
Until further notice
Requirements:
  • 8 or more years of industry experience and a degree (Master's or PhD is a plus) in a quantitative field (e.g., Statistics, Econometrics, Computer Science, Engineering, Mathematics, Data Science, Operations Research)
  • Expert communication and collaboration skills with the ability to work effectively with internal teams in a cross-cultural and cross-functional environment. Ability to conduct rigorous analysis and communicate conclusions to both technical and non-technical audiences
  • Strong expertise in Python, SQL, A/B testing platforms and best practices
  • Expertise in EDA, hypothesis testing, significance testing, regression, clustering techniques, concepts of NLP/text mining, machine learning and deep learning techniques, language model fine-tuning, etc.
  • Understanding of LLM architectures (e.g., prompt chains, retrieval augmentation, orchestration frameworks like LangChain or DSPy)
  • Data engineering foundations, including ability to work with data engineering teams on pipeline design, metric computation jobs, data model changes, and maintaining reliable end-to-end metric systems
  • Familiarity with Agentic AI systems, human-in-the-loop design, and AI observability best practices
  • Experience partnering with internal teams to drive action and providing expertise and direction on analytics, data science, experimental design, and measurement
  • Experience designing and building metrics, from conception to building prototypes with data pipelines
Job Responsibility:
  • Data thought partner to product and business leaders across marketplace teams through providing insights, recommendations, and enabling data informed decisions
  • Drive day to day product analytics and build scalable analytical solutions
  • Develop a deep understanding of how guests and hosts interact with our CS products including but not limited to AI Assistant, IVR and other agent assisting tools to improve the customer experience
  • Owning the product analytics roadmap, prioritisation & delivery of solutions in the Contact Center Product space
  • Own projects from start to end: building out timelines, key milestones, providing regular updates to product managers, analytics management and delivery against agreed timelines
  • Lead end-to-end measurement for AI systems, aligning metrics with business outcomes, user experience, and trustworthiness
  • Collaborate with engineering to define scalable logging and ensure observability across agentic and LLM workflows
  • Lead the design and evaluation of A/B and causal tests to quantify the impact of AI features and optimisations
  • Translate insights into strategic recommendations for senior leaders
  • Shape product priorities through data
What we offer:
  • Bonus or incentives
  • One or more equity programs
  • Benefits
  • Employee Travel Credits

Senior Software Engineer- AI and Data Governance

At GEICO, we offer a rewarding career where your ambitions are met with endless ...
Location:
United States, Palo Alto
Salary:
100000.00 - 215000.00 USD / Year
GEICO
Expiration Date:
Until further notice
Requirements:
  • Advanced knowledge of at least one modern OOP language such as Go, Python, Java, etc.
  • Advanced knowledge of web technologies such as HTML, CSS, and JavaScript is preferred
  • Understand open-source databases like MySQL, PostgreSQL, etc., familiar with No-SQL databases like Cassandra, MongoDB, Elasticsearch, etc.
  • Experience in architecting, designing, building automation, workflows, custom objects/apps, declarative functionality, triggers, migration tools in BMC Helix platform and transition such platform to Open Source is a big plus
  • Experience building and configuring flows, and process builders
  • Strong understanding of web service integration (gRPC / REST) and enterprise middleware integration tiers
  • Ability to articulate channel dataflow and process flow, including email, messaging, chat, mobile push, and SDKs
  • Excellent communication skills – needs to be able to lead projects from the front and interact with clients and sponsors on a regular basis
  • Experience partnering with engineering teams and transferring research to production
  • Experience with continuous delivery (CI/CD) and Infrastructure as Code
Job Responsibility:
  • Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
  • Develop and execute technical software development strategy for the Platform Engineering domain including Service Management, Business Continuity, Recovery, Incident Response and Paging platforms
  • Accountable for the quality, usability, and performance of the solutions
  • Deep hands-on experience in complex system design and data pipeline and architectures, scale and performance, tuning, with good knowledge on Docker and Kubernetes
  • Consistently share best practices and improve processes within and across teams
  • Willing to take on-call and operational support
  • Experience designing recommendation systems, ranking, personalization, similarity search and embeddings
  • Experience with NLP, LLMs and RAG, as well as translating natural language into graph or data queries
  • Experience designing scalable AI systems and Data pipelines
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Lead Advanced Analytics, Digital & AI Products

The Analytics Centre of Excellence (ACOE) at Airbnb, based in India, is a hub of...
Location:
India, Bangalore
Salary:
2940000.00 - 4170000.00 INR / Year
Airbnb
Expiration Date:
Until further notice
Requirements:
  • 8 or more years of industry experience
  • A degree (Master's or PhD is a plus) in a quantitative field (e.g., Statistics, Econometrics, Computer Science, Engineering, Mathematics, Data Science, Operations Research)
  • Expert communication and collaboration skills with the ability to work effectively with internal teams in a cross-cultural and cross-functional environment
  • Ability to conduct rigorous analysis and communicate conclusions to both technical and non-technical audiences
  • Strong expertise in Python, SQL, A/B testing platforms and best practices
  • Expertise in EDA, hypothesis testing, significance testing, regression, clustering techniques, concepts of NLP/text mining, machine learning and deep learning techniques, language model fine-tuning, etc.
  • Understanding of LLM architectures (e.g., prompt chains, retrieval augmentation, orchestration frameworks like LangChain or DSPy)
  • Data engineering foundations, including ability to work with data engineering teams on pipeline design, metric computation jobs, data model changes, and maintaining reliable end-to-end metric systems
  • Familiarity with Agentic AI systems, human-in-the-loop design, and AI observability best practices
  • Experience partnering with internal teams to drive action and providing expertise and direction on analytics, data science, experimental design, and measurement
Job Responsibility:
  • Data thought partner to product and business leaders across marketplace teams through providing insights, recommendations, and enabling data informed decisions
  • Drive day to day product analytics and build scalable analytical solutions
  • Develop a deep understanding of how guests and hosts interact with our CS products including but not limited to AI Assistant, IVR and other agent assisting tools to improve the customer experience
  • Owning the product analytics roadmap, prioritisation & delivery of solutions in the Contact Center Product space
  • Own projects from start to end: building out timelines, key milestones, providing regular updates to product managers, analytics management and delivery against agreed timelines
  • Lead end-to-end measurement for AI systems, aligning metrics with business outcomes, user experience, and trustworthiness
  • Collaborate with engineering to define scalable logging and ensure observability across agentic and LLM workflows
  • Lead the design and evaluation of A/B and causal tests to quantify the impact of AI features and optimisations
  • Translate insights into strategic recommendations for senior leaders
  • Shape product priorities through data
What we offer:
  • Bonus or incentives
  • One or more equity programs
  • Employee Travel Credits

Member of Technical Staff, Hardware Health

Microsoft AI operates one of the world’s most advanced AI training infrastructur...
Location:
United States, Mountain View
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
  • Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
  • Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
  • Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
  • Experience with exascale-class systems or cloud-scale AI clusters.
  • Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
  • Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.
Job Responsibility:
  • Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
  • Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
  • Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
  • Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
  • Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
  • Drive automation in health management to reduce manual intervention to the top 5% of anomalies.
  • Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.
  • Fulltime

Head of Performance Profiling

We are hiring a Head of Performance Profiling to define how performance is under...
Location:
United States, San Jose
Salary:
Not provided
Etched
Expiration Date:
Until further notice
Requirements:
  • Deep experience building complex systems at the intersection of hardware and software
  • Personally envisioned and built significant portions of profiling, tracing, or observability systems — not solely defined requirements or product strategy
  • Demonstrated ability to translate raw hardware signals into scalable, production-grade telemetry and analysis infrastructure
  • Experience correlating time-series events across distributed systems
  • Deep systems programming expertise (C++ or Rust), with a track record of shipping low-level infrastructure operating close to hardware or runtime systems
  • Experience designing distributed correlation mechanisms, timestamp-alignment strategies, or performance modeling frameworks across multiple devices or hosts
  • A history of introducing new technical abstractions or counter models that materially improved how engineers debug and optimize systems
  • Experience designing distributed tracing or observability platforms at scale
  • Experience with high-performance computing systems and large AI training clusters
  • Experience with timestamp synchronization strategies and event alignment in distributed environments
Job Responsibility:
  • System-Level Performance Design: Define the architectural approach for collecting and structuring telemetry across CPUs, drivers, interconnects, and multiple accelerators
  • Design scalable models for correlating performance events across device and host boundaries
  • Cross-Layer Event Correlation: Develop mechanisms to align hardware counters, runtime activity, communication phases, and workload semantics across model-layer execution into coherent, actionable insight
  • Implement time synchronization and trace-alignment strategies across multi-device systems
  • Telemetry & Counter Modeling: Define structured counter taxonomies separating base signals from derived metrics
  • Design derived performance models bridging low-level hardware signals and workload-level behavior
  • Influence instrumentation strategy for future hardware generations
  • Distributed Performance Reasoning: Build tools that identify bottlenecks among multi-accelerator workloads across chips within hosts
  • Build cluster-scale performance analysis for distributed inference across data center networks
  • Tooling & Insight Delivery: Contribute to analysis engines and developer-facing tooling that transform raw telemetry into intuitive insight
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch and dinner in our office
  • Fulltime

Senior Infrastructure Engineer

Build the backbone of next-gen defense technology! We are seeking a Senior Infra...
Location:
Greece, Athens
Salary:
Not provided
Randstad
Expiration Date:
May 28, 2026
Requirements:
  • 6+ years of experience designing, implementing, and maintaining distributed infrastructures
  • At least 6 years of experience in complex, high-performance distributed environments
  • Deep expertise in networking, including routing, switching, VLAN/VXLAN, firewalls, load balancing and software-defined networking
  • Strong experience with VMware or similar hypervisors
  • Extensive experience designing and operating Kubernetes clusters for large-scale distributed workloads
  • Experience with storage systems such as distributed storage, SAN/NAS, and software-defined storage
  • Deep knowledge of server architecture and GPU configurations
  • Strong knowledge of Linux operating systems and system internals
Job Responsibility
Job Responsibility:
  • Deploy and manage K8s clusters for massive, distributed AI workloads
  • Automate everything using Terraform, Ansible, and CI/CD pipelines
  • Optimize hardware (GPUs, high-speed networking) for demanding ML/AI requirements
  • Manage databases, monitor capacity, and ensure 'mission-ready' security and resilience
  • Fulltime

Compute Partnerships Lead

We are looking for a Compute Partnerships Lead to architect and operate our glob...
Location:
United States, San Francisco
Salary:
Not provided
Prime Intellect
Expiration Date:
Until further notice
Requirements:
  • 3–7+ years in infrastructure partnerships, business development, commercial sourcing, or AI infrastructure strategy
  • Direct experience negotiating GPU/cloud/data center agreements strongly preferred
  • Strong understanding of AI workloads (training vs inference, memory constraints, networking, utilization economics)
  • Experience working cross-functionally with engineering and finance
  • High commercial discipline — comfortable modeling margin, utilization, and contract tradeoffs
  • Comfortable operating in constrained supply environments
  • Strong ownership mentality — you build systems, not just deals
  • Ability to travel and manage global partnerships across time zones
Job Responsibility:
  • Develop and execute Prime Intellect’s global GPU sourcing strategy across H100/H200/B200-class infrastructure and beyond
  • Structure commercial agreements that balance cost, flexibility, term length, and growth optionality
  • Identify and evaluate infrastructure partners across hyperscalers, specialized AI clouds, data centers, colocation providers, and hardware vendors
  • Lead negotiations on pricing, SLAs, capacity reservations, expansion rights, and risk allocation
  • Continuously optimize blended gross margins through disciplined sourcing and contract structuring
  • Secure capacity for internal frontier RL research and model training
  • Coordinate closely with research and engineering teams to understand workload requirements (training vs inference vs long-context deployments)
  • Align capacity planning with enterprise deployment roadmaps
  • Ensure compute supply keeps pace with customer expansion and new model launches
  • Work with infrastructure, platform, and DevOps teams to ensure partner capacity is onboarded efficiently and runs reliably in production
What we offer:
  • Competitive Compensation + equity incentives
  • Flexible Work (remote or San Francisco)
  • Visa Sponsorship and relocation support
  • Professional Development budget
  • Team off-sites and conference attendance
  • Opportunity to shape decentralized AI at Prime Intellect
  • Fulltime