AI Cluster & Data Center Design Engineer

AMD

Location:
United States, Austin

Contract Type:
Not provided

Salary:
139440.00 - 209160.00 USD / Year

Job Description:

We are seeking a highly skilled systems engineer to architect and design scalable AI/HPC clusters, with a specific focus on rack and data center power delivery. This role involves evaluating and selecting compute, storage, networking, and power delivery components and solutions to optimize performance and reliability across global deployments. You will collaborate with cross-functional teams to deliver cutting-edge infrastructure for AI and high-performance computing workloads.

Job Responsibility:

  • Design scalable AI/HPC clusters, including compute, storage, and networking, with a specific focus on power delivery
  • Evaluate and select CPUs, GPUs, accelerators, interconnects, and memory configurations for optimal cluster performance
  • Design leading-edge power delivery solutions for high-density AI/GPU deployments
  • Define power budgets, redundancy schemes, and fault tolerance mechanisms
  • Design network topologies to maximize overall cluster performance
  • Understand the network performance needs of different types of workloads
  • Understand advantages and performance trade-offs of network topologies for AI/HPC clusters
  • Design and optimize storage solutions to maximize AI/HPC cluster performance
  • Understand advantages and performance trade-offs of cluster storage solutions such as Lustre and Ceph
  • Work across multiple organizations with subject matter experts from hardware, software, network, data center, and operations teams to deliver scalable, efficient, and reliable compute infrastructure

Requirements:

  • Experience in HPC, AI infrastructure, or data center systems engineering
  • Strong understanding of rack and data center power delivery
  • Knowledge of GPU/CPU architectures, PCIe, UALink, InfiniBand, and Ethernet networking
  • Familiarity with AI/ML frameworks and workload characteristics
  • Excellent problem-solving, communication, and documentation skills
  • Bachelor's or Master's degree in Electrical Engineering, Computer Engineering, Computer Science or related field

Nice to have:

  • Experience designing power delivery solutions for racks and data centers
  • Contributions to open-source HPC or AI infrastructure projects

Additional Information:

Job Posted:
March 21, 2026

Similar Jobs for AI Cluster & Data Center Design Engineer

Applied Data Center Design Engineer

As an Applied Data Center Design Engineer, you’ll own the “last mile” of cluster...
Location:
Canada, Toronto
Salary:
Not provided
Cerebras Systems
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Engineering, Electrical Engineering, Computer Science, or a related field — or equivalent practical experience
  • 1–3 years of experience in infrastructure engineering, data center design, or systems deployment, creating rack elevations, bills of materials (BOMs), and port/cable maps
  • Familiarity with servers, networking, and storage hardware
  • Basic proficiency in scripting or automation (e.g., Python, PowerShell, or Bash)
  • Strong analytical and problem-solving skills with attention to detail
  • Excellent communication and teamwork skills across multiple engineering disciplines
Job Responsibility:
  • Translate cluster and rack-level design specifications into deployable blueprints for servers, storage, networking, and cabling
  • Customize rack-level designs to meet unique cluster requirements, ensuring power, thermal, and network connectivity are optimized for each deployment
  • Collaborate with operations team to validate and adapt designs based on site-specific constraints (e.g., power, cooling, space, logistics)
  • Identify and implement automation and tooling to streamline BOM generation and design validation
  • Participate in data center deployment reviews, ensuring alignment between design intent and implementation
  • Support issue triage and root cause analysis for deployment-related or physical integration problems
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source their cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Our simple, non-corporate work culture that respects individual beliefs

Lead Advanced Analytics, Digital & AI Products

The Analytics Centre of Excellence (ACOE) at Airbnb, based in India, is a hub of...
Location:
India, Bangalore
Salary:
2940000.00 - 4170000.00 INR / Year
Airbnb
Expiration Date:
Until further notice
Requirements:
  • 8 or more years of industry experience and a degree (Masters or PhD is a plus) in a quantitative field (e.g., Statistics, Econometrics, Computer Science, Engineering, Mathematics, Data Science, Operations Research)
  • Expert communication and collaboration skills with the ability to work effectively with internal teams in a cross-cultural and cross-functional environment. Ability to conduct rigorous analysis and communicate conclusions to both technical and non-technical audiences
  • Strong expertise in Python, SQL, A/B testing platforms and best practices
  • Expertise in EDA, hypothesis testing, significance testing, regression, clustering techniques, concepts of NLP/text mining, machine learning and deep learning techniques, and language model fine-tuning
  • Understanding of LLM architectures (e.g., prompt chains, retrieval augmentation, orchestration frameworks like LangChain or DSPy)
  • Data engineering foundations, including ability to work with data engineering teams on pipeline design, metric computation jobs, data model changes, and maintaining reliable end-to-end metric systems
  • Familiarity with Agentic AI systems, human-in-the-loop design, and AI observability best practices
  • Experience partnering with internal teams to drive action and providing expertise and direction on analytics, data science, experimental design, and measurement
  • Experience designing and building metrics, from conception to building prototypes with data pipelines
Job Responsibility:
  • Data thought partner to product and business leaders across marketplace teams through providing insights, recommendations, and enabling data informed decisions
  • Drive day to day product analytics and build scalable analytical solutions
  • Develop a deep understanding of how guests and hosts interact with our CS products including but not limited to AI Assistant, IVR and other agent assisting tools to improve the customer experience
  • Owning the product analytics roadmap, prioritisation & delivery of solutions in the Contact Center Product space
  • Own projects from start to end: building out timelines, key milestones, providing regular updates to product managers, analytics management and delivery against agreed timelines
  • Lead end-to-end measurement for AI systems, aligning metrics with business outcomes, user experience, and trustworthiness
  • Collaborate with engineering to define scalable logging and ensure observability across agentic and LLM workflows
  • Lead the design and evaluation of A/B and causal tests to quantify the impact of AI features and optimisations
  • Translate insights into strategic recommendations for senior leaders
  • Shape product priorities through data
What we offer:
  • Bonus or incentives
  • One or more equity programs
  • Benefits
  • Employee Travel Credits

Senior Software Engineer - AI and Data Governance

At GEICO, we offer a rewarding career where your ambitions are met with endless ...
Location:
United States, Palo Alto
Salary:
100000.00 - 215000.00 USD / Year
GEICO
Expiration Date:
Until further notice
Requirements:
  • Advanced knowledge of at least one modern OOP language such as Go, Python, or Java
  • Advanced knowledge of web technologies such as HTML, CSS, and JavaScript is preferred
  • Understanding of open-source databases such as MySQL and PostgreSQL, and familiarity with NoSQL databases such as Cassandra, MongoDB, and Elasticsearch
  • Experience architecting, designing, and building automation, workflows, custom objects/apps, declarative functionality, triggers, and migration tools on the BMC Helix platform; experience transitioning such a platform to open source is a big plus
  • Experience building and configuring flows and process builders
  • Strong understanding of web service integration (gRPC / REST) and enterprise middleware integration tiers
  • Ability to articulate channel dataflow and process flow, including email, messaging, chat, mobile push, and SDKs
  • Excellent communication skills – needs to be able to lead projects from the front and interact with clients and sponsors on a regular basis
  • Experience partnering with engineering teams and transferring research to production
  • Experience with continuous delivery (CI/CD) and Infrastructure as Code
Job Responsibility:
  • Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
  • Develop and execute technical software development strategy for the Platform Engineering domain including Service Management, Business Continuity, Recovery, Incident Response and Paging platforms
  • Accountable for the quality, usability, and performance of the solutions
  • Deep hands-on experience in complex system design, data pipelines and architectures, scaling, and performance tuning, with good knowledge of Docker and Kubernetes
  • Consistently share best practices and improve processes within and across teams
  • Willing to take on-call and operational support
  • Experience designing recommendation systems, ranking, personalization, similarity search and embeddings
  • Experience with NLP, LLMs and RAG, as well as translating natural language into graph or data queries
  • Experience designing scalable AI systems and Data pipelines
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Full-time

Lead Advanced Analytics, Digital & AI Products

The Analytics Centre of Excellence (ACOE) at Airbnb, based in India, is a hub of...
Location:
India, Bangalore
Salary:
2940000.00 - 4170000.00 INR / Year
Airbnb
Expiration Date:
Until further notice
Requirements:
  • 8 or more years of industry experience
  • A degree (Masters or PhD is a plus) in a quantitative field (e.g., Statistics, Econometrics, Computer Science, Engineering, Mathematics, Data Science, Operations Research)
  • Expert communication and collaboration skills with the ability to work effectively with internal teams in a cross-cultural and cross-functional environment
  • Ability to conduct rigorous analysis and communicate conclusions to both technical and non-technical audiences
  • Strong expertise in Python, SQL, A/B testing platforms and best practices
  • Expertise in EDA, hypothesis testing, significance testing, regression, clustering techniques, concepts of NLP/text mining, machine learning and deep learning techniques, and language model fine-tuning
  • Understanding of LLM architectures (e.g., prompt chains, retrieval augmentation, orchestration frameworks like LangChain or DSPy)
  • Data engineering foundations, including ability to work with data engineering teams on pipeline design, metric computation jobs, data model changes, and maintaining reliable end-to-end metric systems
  • Familiarity with Agentic AI systems, human-in-the-loop design, and AI observability best practices
  • Experience partnering with internal teams to drive action and providing expertise and direction on analytics, data science, experimental design, and measurement
Job Responsibility:
  • Data thought partner to product and business leaders across marketplace teams through providing insights, recommendations, and enabling data informed decisions
  • Drive day to day product analytics and build scalable analytical solutions
  • Develop a deep understanding of how guests and hosts interact with our CS products including but not limited to AI Assistant, IVR and other agent assisting tools to improve the customer experience
  • Owning the product analytics roadmap, prioritisation & delivery of solutions in the Contact Center Product space
  • Own projects from start to end: building out timelines, key milestones, providing regular updates to product managers, analytics management and delivery against agreed timelines
  • Lead end-to-end measurement for AI systems, aligning metrics with business outcomes, user experience, and trustworthiness
  • Collaborate with engineering to define scalable logging and ensure observability across agentic and LLM workflows
  • Lead the design and evaluation of A/B and causal tests to quantify the impact of AI features and optimisations
  • Translate insights into strategic recommendations for senior leaders
  • Shape product priorities through data
What we offer:
  • Bonus or incentives
  • One or more equity programs
  • Employee Travel Credits

Head of Performance Profiling

We are hiring a Head of Performance Profiling to define how performance is under...
Location:
United States, San Jose
Salary:
Not provided
Etched
Expiration Date:
Until further notice
Requirements:
  • Deep experience building complex systems at the intersection of hardware and software
  • Personally envisioned and built significant portions of profiling, tracing, or observability systems — not solely defined requirements or product strategy
  • Demonstrated ability to translate raw hardware signals into scalable, production-grade telemetry and analysis infrastructure
  • Experience correlating time-series events across distributed systems
  • Deep systems programming expertise (C++ or Rust), with a track record of shipping low-level infrastructure operating close to hardware or runtime systems
  • Experience designing distributed correlation mechanisms, timestamp-alignment strategies, or performance modeling frameworks across multiple devices or hosts
  • A history of introducing new technical abstractions or counter models that materially improved how engineers debug and optimize systems
  • Experience designing distributed tracing or observability platforms at scale
  • Experience with high-performance computing systems and large AI training clusters
  • Experience with timestamp synchronization strategies and event alignment in distributed environments
Job Responsibility:
  • System-Level Performance Design: Define the architectural approach for collecting and structuring telemetry across CPUs, drivers, interconnects, and multiple accelerators
  • Design scalable models for correlating performance events across device and host boundaries
  • Cross-Layer Event Correlation: Develop mechanisms to align hardware counters, runtime activity, communication phases, and workload semantics across model-layer execution into coherent, actionable insight
  • Implement time synchronization and trace-alignment strategies across multi-device systems
  • Telemetry & Counter Modeling: Define structured counter taxonomies separating base signals from derived metrics
  • Design derived performance models bridging low-level hardware signals and workload-level behavior
  • Influence instrumentation strategy for future hardware generations
  • Distributed Performance Reasoning: Build tools that identify bottlenecks among multi-accelerator workloads across chips within hosts
  • Build cluster-scale performance analysis for distributed inference across data center networks
  • Tooling & Insight Delivery: Contribute to analysis engines and developer-facing tooling that transform raw telemetry into intuitive insight
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch and dinner in our office
  • Full-time

Compute Partnerships Lead

We are looking for a Compute Partnerships Lead to architect and operate our glob...
Location:
United States, San Francisco
Salary:
Not provided
Prime Intellect
Expiration Date:
Until further notice
Requirements:
  • 3–7+ years in infrastructure partnerships, business development, commercial sourcing, or AI infrastructure strategy
  • Direct experience negotiating GPU/cloud/data center agreements strongly preferred
  • Strong understanding of AI workloads (training vs inference, memory constraints, networking, utilization economics)
  • Experience working cross-functionally with engineering and finance
  • High commercial discipline — comfortable modeling margin, utilization, and contract tradeoffs
  • Comfortable operating in constrained supply environments
  • Strong ownership mentality — you build systems, not just deals
  • Ability to travel and manage global partnerships across time zones
Job Responsibility:
  • Develop and execute Prime Intellect’s global GPU sourcing strategy across H100/H200/B200-class infrastructure and beyond
  • Structure commercial agreements that balance cost, flexibility, term length, and growth optionality
  • Identify and evaluate infrastructure partners across hyperscalers, specialized AI clouds, data centers, colocation providers, and hardware vendors
  • Lead negotiations on pricing, SLAs, capacity reservations, expansion rights, and risk allocation
  • Continuously optimize blended gross margins through disciplined sourcing and contract structuring
  • Secure capacity for internal frontier RL research and model training
  • Coordinate closely with research and engineering teams to understand workload requirements (training vs inference vs long-context deployments)
  • Align capacity planning with enterprise deployment roadmaps
  • Ensure compute supply keeps pace with customer expansion and new model launches
  • Work with infrastructure, platform, and DevOps teams to ensure partner capacity is onboarded efficiently and runs reliably in production
What we offer:
  • Competitive Compensation + equity incentives
  • Flexible Work (remote or San Francisco)
  • Visa Sponsorship and relocation support
  • Professional Development budget
  • Team off-sites and conference attendance
  • Opportunity to shape decentralized AI at Prime Intellect
  • Full-time

Senior Software Engineer

GEICO is seeking an experienced Senior Engineer with a passion for building high...
Location:
United States, Chevy Chase
Salary:
105000.00 - 215000.00 USD / Year
GEICO
Expiration Date:
Until further notice
Requirements:
  • 4+ years of professional experience in software development, platform architecture, administration, governance, infrastructure management, installation, and maintenance of the hardware, software, and network systems
  • 4+ years of experience in open-source frameworks
  • 3+ years of experience with design
  • 3+ years of experience with AWS, GCP, Azure, or hybrid data center
  • Bachelor's degree in Computer Science, Information Systems, or equivalent education or work experience
  • Good hands-on experience building complex distributed systems that process large-scale telemetry, and architectures that support that scale and performance, with strong knowledge of Docker and Kubernetes
  • Advanced knowledge of at least one OOP language such as Java, Go, Python, etc.
  • Strong understanding of open-source databases like MySQL and PostgreSQL, and a strong foundation in NoSQL databases like ClickHouse, Cassandra, and Apache Trino; knowledge of big data formats such as Parquet or Avro
  • Experience architecting, designing, and building observability platform solutions and advanced data analytics using open-source technologies is a big plus
  • Experience building distributed systems
Job Responsibility:
  • Focus on single or multiple areas and provide technical and thought leadership to the enterprise
  • Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
  • Develop and execute technical software development strategy for the Observability Engineering domain
  • Accountable for the quality, usability, and performance of the solutions
  • Be an executor as well as an active learner, helping to coach TDPs and strengthen the technical expertise and know-how of our engineering and product community. Influence and educate executives
  • Consistently share best practices and improve processes within and across teams
  • Analyze cost and forecast, incorporating them into business plans
  • Determine and support resource requirements, evaluate operational processes, measure outcomes to ensure desired results, demonstrate adaptability and sponsor continuous learning
  • Willing to take on-call and operational support
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Full-time

Senior Software Engineer - Together Cloud Infrastructure

Together AI is building the AI Acceleration Cloud, an end-to-end platform for th...
Location:
United States, San Francisco
Salary:
160000.00 - 230000.00 USD / Year
Together AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years experience writing high-performance, well-tested, production quality code
  • Demonstrated experience with building and operating high-performance and/or globally distributed micro-service architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to these or to Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
  • Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or InfiniBand a big plus
Job Responsibility:
  • Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other benefits
  • Flexibility in terms of remote work
  • Full-time