AI Cluster & Data Center Design Engineer Job at AMD (Austin)

Senior Software Engineer - AI and Data Governance

At GEICO, we offer a rewarding career where your ambitions are met with endless ...

Location

United States, Palo Alto

Salary:

100000.00 - 215000.00 USD / Year

GEICO

Expiration Date

Until further notice

Requirements

Advanced knowledge of at least one modern OOP language such as Go, Python, Java, etc.
Advanced knowledge of web technologies such as HTML, CSS, and JavaScript is preferred
Understanding of open-source databases like MySQL and PostgreSQL; familiarity with NoSQL databases like Cassandra, MongoDB, and Elasticsearch
Experience architecting, designing, and building automation, workflows, custom objects/apps, declarative functionality, triggers, and migration tools on the BMC Helix platform; experience transitioning such a platform to open source is a big plus
Experience building and configuring flows and process builders
Strong understanding of web service integration (gRPC / REST) and enterprise middleware integration tiers
Ability to articulate channel dataflow and process flow, including email, messaging, chat, mobile push, and SDKs
Excellent communication skills – needs to be able to lead projects from the front and interact with clients and sponsors on a regular basis
Experience partnering with engineering teams and transferring research to production
Experience with continuous delivery (CI/CD) and Infrastructure as Code

Job Responsibility

Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
Develop and execute technical software development strategy for the Platform Engineering domain including Service Management, Business Continuity, Recovery, Incident Response and Paging platforms
Accountable for the quality, usability, and performance of the solutions
Deep hands-on experience in complex system design, data pipelines and architectures, scale, and performance tuning, with good knowledge of Docker and Kubernetes
Consistently share best practices and improve processes within and across teams
Willingness to take on on-call and operational support duties
Experience designing recommendation systems, ranking, personalization, similarity search and embeddings
Experience with NLP, LLMs and RAG, as well as translating natural language into graph or data queries
Experience designing scalable AI systems and data pipelines

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation, a 401(k) savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Full-time

Business Development Manager – HPE POD (Modular Data Center Solutions)

Develop and grow the HPE Modular Data Center (POD) and AI infrastructure busines...

Location

Japan, Tokyo

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering, computer science, or related technical field, or equivalent industry experience
Typically 8+ years of professional experience in data center infrastructure, HPC environments, modular infrastructure, or enterprise technology solutions
Experience in business development, solution sales, or infrastructure consulting within enterprise IT, cloud, or data center industries
Experience engaging with senior technical and executive stakeholders on infrastructure strategy and large-scale technology investments
Experience supporting complex infrastructure deals involving multiple stakeholders, partners, and delivery organizations
Strong understanding of modern data center architecture, including modular data centers and containerized infrastructure; high-density GPU and HPC environments; liquid cooling and advanced thermal management; AI infrastructure and accelerated computing platforms; and enterprise and hyperscale data center operations
Ability to translate complex technical infrastructure concepts into business value and strategic outcomes for customers
Strong commercial acumen with the ability to structure large infrastructure deals and navigate enterprise procurement processes
Experience working across multi-technology environments including compute, networking, storage, cooling systems, and facility infrastructure
Ability to develop scalable infrastructure solutions that enhance performance, efficiency, and time-to-deployment for AI and HPC workloads

Job Responsibility

Drive business development activities for HPE POD and modular data center solutions across targeted industries including AI, education, research, and enterprise environments
Identify, qualify, and develop new opportunities for modular data center deployments including AI factory infrastructure, HPC clusters, GPU environments, and edge data center solutions
Lead engagement with customers to understand technical, operational, and business requirements for large-scale data center deployments and translate these into POD-based solutions
Work closely with HPE account teams, solution architects, and partners to develop end-to-end proposals including infrastructure architecture, modular facility design, and lifecycle services
Act as a trusted advisor to customer executives, infrastructure teams, and decision makers on modern data center architecture, capacity scaling strategies, and AI-ready infrastructure
Coordinate cross-functional resources including engineering, supply chain, manufacturing partners, and delivery teams to ensure solutions are feasible, scalable, and aligned with customer timelines
Lead or support major proposal efforts and RFP responses for modular data center solutions, including technical positioning, commercial structuring, and value articulation
Support the creation of detailed solution architectures, including modular data center configurations and cooling strategies (air and liquid)
Develop and maintain relationships with strategic ecosystem partners including cooling technology providers, modular construction manufacturers, and infrastructure integrators
Provide market intelligence and customer feedback to influence the evolution of the HPE POD portfolio

What we offer

Health & Wellbeing (comprehensive suite of benefits supporting physical, financial and emotional wellbeing)
Personal & Professional Development (programs to help reach career goals)
Unconditional Inclusion

Full-time

AI Network Engineer

We are seeking an experienced AI Network Engineer to support and optimize high-p...

Location

United States, Houston

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

5+ years of experience in network engineering or infrastructure engineering
Hands-on experience with high-performance networking (InfiniBand, RDMA, RoCE)
Experience supporting GPU-based or HPC environments
Strong knowledge of data center networking (L2/L3, BGP, EVPN, VXLAN)
Familiarity with Linux systems and performance tuning
Experience with NVIDIA ecosystems (DGX, CUDA, NCCL, or similar)
Ability to diagnose low-latency and high-throughput network issues

Job Responsibility

Design, implement, and support high-performance networks for AI/ML workloads, including GPU clusters and distributed training environments
Deploy and optimize NVIDIA-based infrastructure (DGX systems, HGX platforms, or GPU clusters)
Configure and manage high-speed networking technologies such as InfiniBand, RoCE, and 100/200/400Gb Ethernet
Optimize network performance for east-west traffic, low latency, and large data throughput required for AI model training
Integrate NVIDIA software stack (CUDA, NCCL, GPU Cloud, AI Enterprise) with networking and compute environments
Troubleshoot performance bottlenecks across network, storage, and GPU interconnects
Collaborate with AI/ML engineers to ensure infrastructure meets training and inference demands
Support automation and infrastructure-as-code initiatives for scalable AI environments

What we offer

medical, vision, dental, and life and disability insurance
company 401(k) plan

Senior Principal AI Interconnect Architect

An AI Interconnect Architect defines and engineers high-speed networking and com...

Location

United States, Milpitas

Salary:

194425.00 - 322092.00 USD / Year

Sandisk

Expiration Date

Until further notice

Requirements

Master's or Ph.D. in Electrical Engineering, Computer Engineering, or Computer Science
10-15 years of experience developing interconnect technologies, including transport and link-level protocols, switching fabrics, QoS and reliable communication methods, and Software Defined Networking
Familiarity with fabric topologies such as fat tree, leaf-spine (Clos), torus, and mesh, and their applicability to various workloads and system configurations
Familiarity with GPU/accelerator clusters and data center infrastructure
Deep working knowledge of interconnect technologies and protocols such as PCIe, CXL, NVLink, UALink, Ethernet, Ultra Ethernet, and serial links
Ability to develop performance models

Job Responsibility

Develop architectures for chip-to-chip interconnects and switched fabrics tailored for AI/ML scale-out
Analyze trade-offs in bandwidth, latency, power, area, and reliability
Participate in industry standards bodies and help shape the direction of industry specifications
Work with SoC, package design, and software teams to ensure seamless integration

What we offer

paid vacation time
paid sick leave
medical/dental/vision insurance
life, accident and disability insurance
tax-advantaged flexible spending and health savings accounts
employee assistance program
other voluntary benefit programs such as supplemental life and AD&D, legal plan, pet insurance, critical illness, accident and hospital indemnity
tuition reimbursement
transit
the Applause Program

Full-time

Senior Advanced Analyst - Digital & AI Products

Airbnb was born in 2007 when two hosts welcomed three guests to their San Franci...

Location

India, Bangalore

Salary:

1960000.00 - 2800000.00 INR / Year

Airbnb

Expiration Date

Until further notice

Requirements

6+ years of industry experience and a degree (Master's or PhD is a plus) in a quantitative field (e.g., Statistics, Econometrics, Computer Science, Engineering, Mathematics, Data Science, Operations Research)
Expert communication and collaboration skills with the ability to work effectively with internal teams in a cross-cultural and cross-functional environment. Ability to conduct rigorous analysis and communicate conclusions to both technical and non-technical audiences
Strong expertise in Python, SQL, A/B testing platforms and best practices
Expertise in EDA, hypothesis testing, significance testing, regression, clustering techniques, concepts of NLP/text mining, machine learning and deep learning techniques, language model fine-tuning, etc.
Understanding of LLM architectures (e.g., prompt chains, retrieval augmentation, orchestration frameworks like LangChain or DSPy)
Data engineering foundations, including ability to work with data engineering teams on pipeline design, metric computation jobs, data model changes, and maintaining reliable end-to-end metric systems
Familiarity with Agentic AI systems, human-in-the-loop design, and AI observability best practices
Experience partnering with internal teams to drive action and providing expertise and direction on analytics, data science, experimental design, and measurement
Experience designing and building metrics, from conception to building prototypes with data pipelines

Job Responsibility

Act as a data thought partner to product and business leaders across marketplace teams by providing insights and recommendations and enabling data-informed decisions
Drive day-to-day product analytics and build scalable analytical solutions
Develop a deep understanding of how guests and hosts interact with our CS products, including but not limited to AI Assistant, IVR, and other agent-assisting tools, to improve the customer experience
Own the product analytics roadmap, prioritization, and delivery of solutions in the Contact Center Product space
Own projects from start to end: build out timelines and key milestones, provide regular updates to product managers and analytics management, and deliver against agreed timelines
Lead end-to-end measurement for AI systems, aligning metrics with business outcomes, user experience, and trustworthiness
Collaborate with engineering to define scalable logging and ensure observability across agentic and LLM workflows
Lead the design and evaluation of A/B and causal tests to quantify the impact of AI features and optimizations
Translate insights into strategic recommendations for senior leaders and shape product priorities through data

Full-time

Lead Advanced Analytics, Digital & AI Products

The Analytics Centre of Excellence (ACOE) at Airbnb, based in India, is a hub of...

Location

India, Bangalore

Salary:

2940000.00 - 4170000.00 INR / Year

Airbnb

Expiration Date

Until further notice

Requirements

8 or more years of industry experience
A degree (Master's or PhD is a plus) in a quantitative field (e.g., Statistics, Econometrics, Computer Science, Engineering, Mathematics, Data Science, Operations Research)
Expert communication and collaboration skills with the ability to work effectively with internal teams in a cross-cultural and cross-functional environment
Ability to conduct rigorous analysis and communicate conclusions to both technical and non-technical audiences
Strong expertise in Python, SQL, A/B testing platforms and best practices
Expertise in EDA, hypothesis testing, significance testing, regression, clustering techniques, concepts of NLP/text mining, machine learning and deep learning techniques, language model fine-tuning, etc.
Understanding of LLM architectures (e.g., prompt chains, retrieval augmentation, orchestration frameworks like LangChain or DSPy)
Data engineering foundations, including ability to work with data engineering teams on pipeline design, metric computation jobs, data model changes, and maintaining reliable end-to-end metric systems
Familiarity with Agentic AI systems, human-in-the-loop design, and AI observability best practices
Experience partnering with internal teams to drive action and providing expertise and direction on analytics, data science, experimental design, and measurement

Job Responsibility

Act as a data thought partner to product and business leaders across marketplace teams by providing insights and recommendations and enabling data-informed decisions
Drive day-to-day product analytics and build scalable analytical solutions
Develop a deep understanding of how guests and hosts interact with our CS products, including but not limited to AI Assistant, IVR, and other agent-assisting tools, to improve the customer experience
Own the product analytics roadmap, prioritisation, and delivery of solutions in the Contact Center Product space
Own projects from start to end: build out timelines and key milestones, provide regular updates to product managers and analytics management, and deliver against agreed timelines
Lead end-to-end measurement for AI systems, aligning metrics with business outcomes, user experience, and trustworthiness
Collaborate with engineering to define scalable logging and ensure observability across agentic and LLM workflows
Lead the design and evaluation of A/B and causal tests to quantify the impact of AI features and optimisations
Translate insights into strategic recommendations for senior leaders and shape product priorities through data

What we offer

Bonus or incentives
One or more equity programs
Employee Travel Credits

Member of Technical Staff, Hardware Health

Microsoft AI operates one of the world’s most advanced AI training infrastructur...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
Experience with exascale-class systems or cloud-scale AI clusters.
Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.

Job Responsibility

Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
Drive automation in health management to reduce manual intervention to the top 5% of anomalies.
Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.

Full-time

Head of Performance Profiling

We are hiring a Head of Performance Profiling to define how performance is under...

Location

United States, San Jose

Salary:

Not provided

Etched

Expiration Date

Until further notice

Requirements

Deep experience building complex systems at the intersection of hardware and software
Personally envisioned and built significant portions of profiling, tracing, or observability systems — not solely defined requirements or product strategy
Demonstrated ability to translate raw hardware signals into scalable, production-grade telemetry and analysis infrastructure
Experience correlating time-series events across distributed systems
Deep systems programming expertise (C++ or Rust), with a track record of shipping low-level infrastructure operating close to hardware or runtime systems
Experience designing distributed correlation mechanisms, timestamp-alignment strategies, or performance modeling frameworks across multiple devices or hosts
A history of introducing new technical abstractions or counter models that materially improved how engineers debug and optimize systems
Experience designing distributed tracing or observability platforms at scale
Experience with high-performance computing systems and large AI training clusters
Experience with timestamp synchronization strategies and event alignment in distributed environments

Job Responsibility

System-Level Performance Design: Define the architectural approach for collecting and structuring telemetry across CPUs, drivers, interconnects, and multiple accelerators
Design scalable models for correlating performance events across device and host boundaries
Cross-Layer Event Correlation: Develop mechanisms to align hardware counters, runtime activity, communication phases, and workload semantics across model-layer execution into coherent, actionable insight
Implement time synchronization and trace-alignment strategies across multi-device systems
Telemetry & Counter Modeling: Define structured counter taxonomies separating base signals from derived metrics
Design derived performance models bridging low-level hardware signals and workload-level behavior
Influence instrumentation strategy for future hardware generations
Distributed Performance Reasoning: Build tools that identify bottlenecks among multi-accelerator workloads across chips within hosts
Build cluster-scale performance analysis for distributed inference across data center networks
Tooling & Insight Delivery: Contribute to analysis engines and developer-facing tooling that transform raw telemetry into intuitive insight

What we offer

Medical, dental, and vision packages with generous premium coverage
$500 per month credit for waiving medical benefits
Housing subsidy of $2k per month for those living within walking distance of the office
Relocation support for those moving to San Jose (Santana Row)
Various wellness benefits covering fitness, mental health, and more
Daily lunch and dinner in our office

Full-time
