Senior AI Hardware Architect Job at Microsoft Corporation (Mountain View)

Senior Product Architect – AI Data Center & SONiC Networking

Senior Product Architect – AI Data Center & SONiC Networking. This role has been...

Location

United States, San Jose

Salary:

172000.00 - 349000.00 USD / Year

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

10+ years of experience in data center networking, AI infrastructure, or high-performance systems
Deep expertise in SONiC architecture and internals, large-scale Ethernet fabrics, and high-speed SerDes (112G/224G PAM4) and their impact on system performance
Strong understanding of ASIC pipelines, buffering, ECMP behavior, and congestion mechanisms
Proven ability to diagnose cross-layer performance and reliability issues involving software, hardware, and physical-layer interactions
Hands-on experience with RDMA/RoCE, congestion control, and lossless Ethernet at scale
Experience with automation and tooling (Python, Ansible, Terraform) in large-scale environments
Industry certifications (e.g., CCIE, JNCIE, NVIDIA) or equivalent practical experience preferred

Job Responsibility

Architect ultra-low-latency, lossless Ethernet fabrics supporting tens of thousands of GPUs for AI training and inference
Own the end-to-end SONiC platform architecture and fabric strategy, spanning control plane, management plane, data-plane integration, and operations at scale
Define multi-generation fabric and platform strategy across switch ASICs, NICs, SerDes capabilities, cabling, and system constraints, aligned to power, performance, and deployment realities
Own link-level and physical-layer requirements as they impact SONiC performance, including high-speed PAM4 signaling (112G/224G), error handling, and hardware/software interaction
Align SONiC architectures with next-generation GPU, NIC, and switch platforms, ensuring optimal performance across hardware and software boundaries
Define SONiC capabilities for AI and HPC workloads, including lossless Ethernet and RoCE; congestion management, QoS, and ECN; and dynamic and flow-based load balancing
Drive scale, performance, and resiliency targets for SONiC-based fabrics, including fast convergence, hitless upgrades, and failure recovery
Define and enforce system-level validation criteria, including scale testing, fault injection, performance benchmarking, and upgrade scenarios

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Fulltime

Senior AI Network Architect

Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 3+ years technical engineering experience
OR Bachelor's Degree in Electrical Engineering, Computer Engineering, Mechanical Engineering, or related field AND 5+ years technical engineering experience
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
3+ years of experience in designing AI backend networks and integrating them into large-scale GPU systems
Proven expertise in system architecture across compute, networking, and accelerator domains
Deep understanding of RDMA protocols (RoCE, InfiniBand), congestion control (DCQCN), and Layer 2/3 routing
Experience with optical interconnects (e.g., PSM, WDM), link budget analysis, and transceiver integration
Familiarity with signal integrity modeling, link training, and physical layer optimization

Job Responsibility

Spearhead architectural definition and innovation for next-generation GPU and AI accelerator platforms, with a focus on ultra-high bandwidth, low-latency backend networks
Drive system-level integration across compute, storage, and interconnect domains to support scalable AI training workloads
Partner with silicon, firmware, and datacenter engineering teams to co-design infrastructure that meets performance, reliability, and deployment goals
Influence platform decisions across rack, chassis, and pod-level implementations
Cultivate deep technical relationships with silicon vendors, optics suppliers, and switch fabric providers to co-develop differentiated solutions
Represent Microsoft in joint architecture forums and technical workshops
Evaluate and articulate tradeoffs across electrical, mechanical, thermal, and signal integrity domains
Frame decisions in terms of TCO, performance, scalability, and deployment risk
Lead design reviews and contribute to PRDs and system specifications
Shape the direction of hyperscale AI infrastructure by engaging with standards bodies (e.g., IEEE 802.3), influencing component roadmaps, and driving adoption of novel interconnect protocols and topologies

Fulltime

Senior AI Software Architect

Do you want to be at the forefront of innovating the latest hardware designs to ...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter

Job Responsibility

Port and optimize large-scale AI models (e.g., foundation models, diffusion models, YOLO) to run efficiently on Maia hardware
Integrate models using frameworks such as PyTorch, ONNX, vLLM, and SGLang
Apply techniques like KV cache quantization (e.g., BF16 → FP8), checkpointing, and re-sharding for efficient inference and training
Experiment with parallelism strategies (TP, PP) and analyze performance impacts across interconnects (NVLink vs PCIe)
Collaborate on improving inference pipelines, including KV caching in SGLang/vLLM and performance tuning at the PyTorch level
Work with Triton kernels for basic operations (e.g., FP8 dequantization) and assist in kernel performance analysis
Partner with hardware architects and kernel developers for co-design discussions
Communicate effectively with multiple stakeholders to align on performance goals and deliverables

Fulltime

Senior Principal AI Infrastructure Architect

The Senior Principal AI Infrastructure Architect is a highly skilled and advance...

Location

Italy, Milano

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Significant experience in a consulting, presales or architecture role within a large-scale (preferably multi-national) technology services environment, with a track record of leading AI infrastructure pursuits
Demonstrable experience designing and delivering production AI platforms — from single multi-GPU servers through to multi-rack training clusters and inference factories
Strong working knowledge of the AI hardware vendor landscape (NVIDIA, AMD, Intel, Dell, HPE, Lenovo, Supermicro, Cisco, Pure, VAST, WEKA, DDN, NetApp) and how to position partner ecosystems competitively
Proven ability to translate AI workload requirements (model size, parameter count, sequence length, throughput SLOs, latency targets) into accurate hardware bills of materials and sizing justifications
Significant client engagement and consulting experience, including client needs assessment, change management and the ability to identify whitespace for follow-on AI infrastructure and managed-services work
Significant business development and presales experience on infrastructure-led deals, ideally including sovereign AI, AI Factory or regulated-industry GenAI programmes
Strong understanding of how AI infrastructure integrates with business processes, applications, data platforms and existing enterprise architecture
Bachelor's degree or equivalent in Information Technology, Engineering, Computer Science or a related field
Deep, hands-on knowledge of AI hardware: GPU and accelerator portfolios (NVIDIA Hopper / Blackwell, AMD MI300/MI325, Intel Gaudi 3, emerging custom silicon), host CPU platforms (Intel Xeon, AMD EPYC, NVIDIA Grace), system topologies (HGX, DGX, MGX, OAM) and how each choice maps to specific AI workloads
Strong understanding of AI-class storage: parallel filesystems, all-flash NVMe platforms, S3-class object stores, checkpoint and dataset pipelines and the I/O patterns of large-scale training and inference (VAST, WEKA, DDN EXAScaler, Pure FlashBlade, NetApp ONTAP AI, Dell PowerScale)

Job Responsibility

Lead the end-to-end design of large, complex AI infrastructure solutions — covering accelerated compute (NVIDIA H100/H200/B200 and GB200 NVL72, AMD Instinct MI300X/MI325X, Intel Gaudi 3), CPU host platforms (Intel Xeon, AMD EPYC, NVIDIA Grace), high-throughput storage tiers and lossless AI fabric — for enterprise, sovereign AI and AI Factory clients
Architect reference designs built on NVIDIA DGX/HGX SuperPOD, Dell AI Factory with NVIDIA, Cisco Nexus HyperFabric AI, HPE / Lenovo / Supermicro accelerated compute and equivalent platforms, balancing single-node performance with cluster-scale efficiency
Size and validate GPU clusters against real workloads — foundation-model pre-training, distributed fine-tuning, RAG, real-time and batch inference — using the right combination of NVLink/NVSwitch domains, InfiniBand NDR/XDR or Ultra Ethernet / NVIDIA Spectrum-X fabrics and tiered NVMe and parallel storage (VAST, WEKA, DDN, Pure FlashBlade, NetApp ONTAP AI, Dell PowerScale)
Define the supporting datacenter design: high-density power (50–140 kW/rack), direct-to-chip and rear-door liquid cooling, structured cabling for AI fabrics and modular deployment models across on-prem, colo and sovereign-cloud footprints
Work closely with the sales team to drive the presales process for AI infrastructure pursuits — client discovery, technical workshops, proposal writing, executive presentations and bid defence
Translate clients' AI ambitions and business outcomes into a hardware and platform roadmap, positioning NTT DATA's end-to-end portfolio — silicon, systems, storage, fabric, MLOps stack and managed services — to land service-led AI solutions
Lead integration of compute, storage, networking, the AI software stack (CUDA, ROCm, Triton, NIM, NVIDIA AI Enterprise, Run:ai, Slurm, Kubernetes / Kubeflow) and managed-service operating models across multiple domains, delivery units and geographies
Build business cases, TCO and unit-economics models (cost per token, cost per training run, GPU-hour economics) and end-to-end transition roadmaps for cloud-to-private AI migrations and sovereign AI deployments
Define architectural principles for AI infrastructure — accelerator utilisation, data gravity, multi-tenancy, model lifecycle, energy efficiency — and apply them to influence architectural outcomes and governance
Develop As-Is, Vision, FMO and To-Be AI platform architectures, identify gaps and develop transition roadmaps

Fulltime

Senior AI Infrastructure Engineer - Training Platform

As a Software Engineer on the Machine Learning Infrastructure team, you will bui...

Location

United States, San Francisco; Seattle; New York

Salary:

216000.00 - 270000.00 USD / Year

Scale

Expiration Date

Until further notice

Requirements

5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling
Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
Proven ability to solve complex problems and work independently in fast-moving environments

Job Responsibility

Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
Design and implement scheduling primitives to optimize the lifecycle of training jobs
Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
Work closely with Finance and Procurement teams to drive our capacity planning process
Participate in our team's on call process to ensure the availability of our services
Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment

What we offer

Comprehensive health, dental and vision coverage
Retirement benefits
A learning and development stipend
Generous PTO
Commuter stipend (may be eligible)

Fulltime

Senior AI Presales Consultant

We are seeking a high-impact, strategic AI Presales Consultant to join our elite...

Location

India, Mumbai

Salary:

Not provided

Eviden

Expiration Date

Until further notice

Requirements

7+ years in a customer-facing technical role (e.g., Presales, Solutions Architecture, AI Specialist, or Technical Consulting), with a proven track record of designing large-scale AI, ML, or HPC solutions
Deep, hands-on understanding of LLM architectures. Must be able to architect, explain, and build PoCs for RAG pipelines, including vector databases (e.g., Milvus, Pinecone, Chroma), embedding models, and data ingestion strategies
Direct experience in sizing AI infrastructure. Must be able to perform "napkin math" and detailed calculations for GPU, CPU, memory, and network requirements
Must be able to fluently discuss performance metrics (tokens/second, latency, throughput, TFLOPS) and their relationship to hardware choice (e.g., NVIDIA H100 vs. A100, memory bandwidth, interconnects like NVLink/InfiniBand)
Expertise in the AI software stack. Strong understanding of MLOps principles (Kubeflow, MLflow), Kubernetes (K8s) for AI workloads, and model serving platforms (NVIDIA Triton, KServe, or similar)
Strong, current knowledge of the AI model landscape (e.g., Llama family, Mistral, GPT-family, foundation models). Ability to discuss fine-tuning techniques, quantization, and pruning
Exceptional communication, whiteboarding, and presentation skills. Ability to translate executive-level business needs into detailed technical architecture and build a compelling C-level value proposition
Bachelor's or Master's degree in Computer Science, AI, Data Science, or a related engineering field

Job Responsibility

Strategic Client Advisory: Lead executive-level "Art of the Possible" workshops and technical discovery sessions to understand a client's business goals, data readiness, and AI maturity
Full-Stack Solution Architecture: Design holistic, end-to-end AI solutions that synergize our supercomputing hardware, AI software platform, and MLOps capabilities to meet specific client needs
Generative AI & LLM Expertise: Act as the subject matter expert on Generative AI. Architect and evangelize scalable data ingestion and preparation pipelines, specializing in Retrieval-Augmented Generation (RAG) frameworks
Infrastructure Sizing & Performance Modelling: Analyse customer workloads (data volume, model complexity, training frequency, inference throughput) to accurately size the required platform infrastructure, including Kubernetes clusters, data storage, and software licenses. This includes calculating compute, storage, and network requirements based on key performance metrics like model parameters, token performance (tokens/sec), desired latency, and concurrent user load
Model & Software Consultation: Advise clients on AI model selection, comparing the trade-offs of open-source vs. proprietary LLMs, fine-tuning vs. foundation models, and model quantization
Position and demonstrate our proprietary AI software platform, MLOps tools, and libraries, integrating them into the client's ecosystem
Inference Optimization: Design and architect robust, low-latency, and high-throughput inference solutions for complex AI models, including large-scale LLM serving
User Experience (UX) Advocacy: Collaborate with client teams to define the end-user experience, ensuring the solution delivers tangible business value and a seamless interface for data scientists, analysts, and application users
Sales Cycle Enablement: Own the technical narrative throughout the sales cycle. Build and deliver compelling presentations, custom demonstrations, and Proofs of Concept (PoCs). Lead the technical response to complex RFIs/RFPs

Fulltime

Senior Systems Architect (IIOT & Cloud)

Location

United Kingdom

Salary:

75000.00 GBP / Year

Carbon13

Expiration Date

Until further notice

Requirements

7+ years of experience in systems architecture, on-site controllers, or IIoT platforms
In-depth experience in designing, testing, or installing control interfaces for at least one of these commercial or industrial asset categories: battery (BESS) inverters, HVAC systems, heat pumps, and thermal storage utilising IIoT controllers, PLCs, RTUs, or Building Management Systems (BMS/BEMS)
Strong domain background in energy systems, industrial automation, HVAC automation or similar industries
Deep knowledge of any of the industrial communication protocols (Modbus, BACnet, OPCC, MQTT, OPC-UA)
Strong understanding of networking and secure communications
Experience with IEC 61131-3 (industrial automation programming)
Proficiency in at least one of the programming languages: C, C++, Go, or Python
Strong experience with embedded Linux, real-time systems, and controller design
Experience designing cloud-native architectures (Azure)
Deep expertise in time-series data systems (e.g. Postgres, InfluxDB)

Job Responsibility

Lead the end-to-end design and evolution of our next-generation energy technology platform
Define and own the system architecture across edge devices and cloud, ensuring it is secure, scalable, standards-compliant, and future-proof
Translate complex energy system requirements into robust, production-ready solutions
Design architectures that reliably support tens of thousands of distributed industrial devices
Integrate real-time systems in built-environment and industrial settings with advanced connectivity (MQTT, OPC-UA / IEC 62541) and cloud-native data platforms
Develop Edge-AI control models and translate them into production systems across both edge and cloud environments
Set architectural direction while actively guiding firmware, hardware, and cloud implementation across teams
Prioritise secure, standards-compliant, and future-proof systems that can scale, adapt, and operate globally over time

What we offer

Remote-first with occasional travel to London/Cambridge for team meetings
Travel to customer or manufacturer locations within the UK, EU, or internationally
Co-working support – Eagle Labs membership for Cambridge-based hires (other locations TBD)
Competitive salary + share options – negotiable depending on cash vs equity preference
As a founding member, you will have influence in shaping future benefits + leave

Fulltime

Senior Distinguished AI Engineer

At Capital One, we are creating responsible and reliable AI systems, changing ba...

Location

United States, San Francisco; Richmond; San Jose; Cambridge; McLean; New York

Salary:

286200.00 - 392000.00 USD / Year

Capital One

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 10 years of experience developing AI and ML algorithms or technologies, or a Master's degree in Computer Science, AI, Electrical Engineering, Computer Engineering, or related fields plus at least 8 years of experience developing AI and ML algorithms or technologies
At least 10 years of experience programming with Python, Go, Scala, or Java
9 years of experience deploying scalable and responsible AI solutions on cloud platforms (e.g. AWS, Google Cloud, Azure, or equivalent private cloud)
Experience architecting, designing, developing, integrating, delivering, and supporting complex enterprise AI systems
Demonstrated ability to lead and mentor an engineering organization and influence cross-functional stakeholders up to the SVP level
Experience developing AI and ML algorithms or technologies (e.g. LLM Inference, Similarity Search and VectorDBs, Guardrails, Memory) using Python, C++, C#, Java, or Golang
Experience developing and applying state-of-the-art techniques for optimizing training and inference software to improve hardware utilization, latency, throughput, and cost
Passion for staying abreast of the latest AI research and AI systems, and for judiciously applying novel techniques in production
Excellent communication and presentation skills, with the ability to articulate complex AI concepts to peers

Job Responsibility

Partner with a cross-functional team of engineers, research scientists, technical program managers, and product managers to deliver AI-powered products
Design, develop, test, deploy, and support AI software components including foundation model training, large language model inference, similarity search, guardrails, model evaluation, experimentation, governance, and observability
Leverage a broad stack of Open Source and SaaS AI technologies such as AWS UltraClusters, Hugging Face, VectorDBs, NeMo Guardrails, PyTorch, and more
Invent and introduce state-of-the-art LLM optimization techniques to improve the performance — scalability, cost, latency, throughput — of large scale production AI systems
Contribute to the technical vision and the long term roadmap of foundational AI systems at Capital One

What we offer

Performance-based incentive compensation, which may include cash bonus(es) and/or long-term incentives (LTI)
Comprehensive, competitive, and inclusive set of health, financial and other benefits that support your total well-being

Fulltime

Senior AI Hardware Architect

Job Description

Job Responsibility

Requirements

Nice to have