
Technical Sourcing Manager - AI GPU & Cloud

United States, Menlo Park · Employment contract · 208000.00 - 289000.00 USD / Year · Job Posted April 24, 2026

Job Description

The Global Technology Sourcing Team moves fast and leverages key partnerships to drive technical innovation, high quality standards, and cost savings via a customized supply chain. The Technical Sourcing Manager is the primary technical contact for supplier engagement and management with merchant suppliers across a variety of technologies and, in conjunction with the Commercial Sourcing Manager, will drive strategic sourcing decisions on a global basis. Working closely with the hardware/software development engineering and research teams, the Technical Sourcing Manager will help develop, communicate, and implement sourcing strategies, and will participate in supplier engagements with an overall focus on influencing and driving technology roadmaps, maintaining long-term supplier technical relationships, and developing solutions that deliver the optimal total cost of ownership.

Job Responsibility

  • Maintain current knowledge of the technology and industry trends and perform competitive analysis and due diligence on relevant products
  • Develop, manage, and refresh individual and customized technology and commodity sourcing strategies and roadmaps, and maintain an in-depth understanding of actively managed adjacent technologies
  • Develop and maintain in-depth technical relationships with executive management at relevant suppliers. Influence supplier technology roadmaps to ensure Meta's system architecture and technical requirements are met
  • In conjunction with Hardware/Software Engineering, Technology Strategy, and Technical Program Management, lead the creation and definition of future technology directions in the assigned commodity
  • Provide technical commodity and supplier expertise and consulting to research & design engineering for technology migration and lower TCO
  • Provide technical review and analysis for, and responses to, supplier proposals and RFQs. Resolve all technical queries arising from the quote process
  • Drive pre-POR evaluation of the usability and suitability of key technologies within the Meta infrastructure
  • Own technical expertise and due diligence in supplier negotiations and provide review and input for product development and master supply agreements
  • Work with hardware engineers to ensure part specifications and requirements are within broad commodity and supplier capabilities, avoiding special or unique SKUs
  • Develop supplier technical process improvement plans to improve cost and quality and to reduce qualification time and time to deployment
  • Support product transfer from NPI to mass production and establish SLAs with cloud providers on compute availability
  • Partner with hardware engineering teams to identify critical disruptive and new technologies with the potential to deliver step-function improvements in performance and TCO
  • Develop evaluation criteria and milestones to vet, incubate, and graduate technologies in a methodical and organized manner, and lead supplier development and influence
  • Evaluate architectural tradeoffs with hardware engineering, including but not limited to cross-commodity horizontals, integration of multiple technologies, and exploration and evaluation of new business models with the supply base

Requirements

  • Bachelor's or Master's degree in Electrical Engineering, Mechanical Engineering or relevant technical field, and/or equivalent practical experience
  • 12+ years of technical experience in engineering, product management, or supply chain roles within data center, infrastructure, cloud, AI silicon semiconductor, or associated high-tech hardware industries
  • Demonstrated understanding of relevant supplier technology roadmaps, performance drivers and a proven pragmatic approach to influencing supplier product roadmaps
  • Experience moving seamlessly from strategy to execution and delivering tangible results in complex, cross-functional, and fast-paced environments
  • Interpersonal and communication skills, with experience influencing, negotiating, building consensus and making key strategic decisions
  • Experience interfacing with internal and external partners in a fast-paced, often ambiguous, entrepreneurial and cross-functional environment, requiring a wide latitude for independent judgment while coordinating people and technical resources

Nice to have

  • Master's degree in Electrical Engineering or MBA
  • Experience working with AI GPU & Cloud technology suppliers
  • Knowledge of industry regulations, standards and value chain considerations relevant to GPU & cloud and related technologies such as networking and storage
  • Leadership and management experience
  • Experience driving and engaging in industry ecosystems and standards for AI GPU and cloud technologies

What we offer

  • bonus
  • equity
  • benefits


Similar Jobs for Technical Sourcing Manager - AI GPU & Cloud

8 matching positions

Sr. Manager, Cloud Sourcing

Together AI is looking for a leader to own the commercial relationships with our...
Company: Together AI
Location: United States, San Francisco
Salary: 230000.00 - 260000.00 USD / Year
Expiration Date: Until further notice
Requirements
  • 7+ years in cloud sourcing, strategic sourcing, or infrastructure commercial roles, with exposure to GPU or high-performance compute environments
  • Track record negotiating large-scale cloud or infrastructure contracts (multi-million dollar deal sizes)
  • Working knowledge of cloud infrastructure, including GPU configurations, networking, and cluster architecture, with the ability to translate technical requirements into commercial terms
  • Strong analytical skills; experience building pricing models or financial analyses that inform sourcing decisions
  • Excellent communication skills; comfortable working across engineering, finance, and vendor executives to achieve commercial and operational goals
  • Ability to apply FinOps principles (e.g., cost allocation, rightsizing, commitment strategy) to cloud sourcing and contract negotiations to drive demonstrable cost efficiency across the GPU fleet
Job Responsibility
  • Own the full lifecycle of CSP leasing agreements, from sourcing and RFPs through contract execution and renewals, across Together's GPU fleet
  • Partner with infrastructure engineering to translate technical requirements into commercial specifications, and coordinate technical evaluations during vendor selection
  • Develop and maintain relationships with cloud compute suppliers to expand Together's vendor ecosystem
  • Build and maintain a market intelligence capability to stay ahead of pricing trends, supply availability, and the evolving cloud compute vendor landscape
  • Develop negotiation strategies that improve our cost position and terms as we scale
  • Drive continuous improvement on Total Cost of Ownership (TCO) and actively mitigate supply chain and commercial risks across the GPU fleet
  • Establish and lead regular, structured vendor performance reviews, including Quarterly Business Reviews (QBRs), with key cloud and infrastructure compute suppliers
What we offer
  • competitive compensation
  • startup equity
  • health insurance
  • flexibility in terms of remote work
Employment type: Full-time

Engineering Manager, Managed AI

As an Engineering Manager on the Managed AI team at Crusoe, you will play a crit...
Company: Crusoe
Location: United States, San Francisco, CA; Sunnyvale, CA
Salary: 237600.00 - 288000.00 USD / Year
Expiration Date: Until further notice
Requirements
  • 5+ years managing/leading high-performing engineering teams
  • Ability to lead teams through ambiguity and align on complex technical goals
  • Proven success hiring, developing, and retaining talent
  • Hands-on experience with distributed and concurrent systems or AI infrastructure
  • Deep knowledge of cloud-native environments, container orchestration, and SOAs
  • Some familiarity with CPU & GPU performance, inference frameworks, or LLM systems is a strong plus
  • Comfortable owning deliverables from design through production
  • Strong collaboration skills, prioritizing clarity, context, and customer impact
  • Experience in fast-paced startup or growth-stage environments
  • Background in Computer Science, Engineering, or a related technical field
Job Responsibility
  • Lead, mentor, and grow a team of high-caliber software engineers
  • Partner with leadership to define and execute the AI roadmap, setting clear goals and driving accountability
  • Cultivate a high-performance, collaborative engineering culture grounded in technical excellence
  • Oversee the architecture and development of core AI services: fault-tolerant task queues, model management systems, cost-aware scheduling, etc.
  • Ensure delivery of scalable systems capable of handling millions of API requests per second
  • Deliver an AI platform that can handle a wide variety of loads, from training to agentic execution infrastructure
  • Work cross-functionally with Product, Infrastructure, and GTM stakeholders
  • Represent Engineering in strategic discussions to influence AI platform growth and customer adoption
  • Promote knowledge sharing, technical mentorship, and the evolution of engineering processes
What we offer
  • Bonus
  • Restricted Stock Units
Employment type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Company: Geico
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including the GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment type: Full-time

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Company: Geico
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including the GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Company: Geico
Location: United States, Palo Alto
Salary: 90000.00 USD / Year
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Workplace flexibility, including the GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment type: Full-time

Senior Engineer - Artificial Intelligence

We’re looking for a seasoned Senior AI Engineer to join our growing AI team. In ...
Company: Tucows
Location: Canada, Toronto
Salary: 126090.00 - 140100.00 CAD / Year
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or related field
  • 5+ years of software engineering experience, with recent focus on AI/LLM systems
  • Advanced proficiency in Python and Golang
  • Strong knowledge of software design patterns (SOLID, DRY, CQRS, Saga, event-driven)
  • Deep understanding of the Software Development Life Cycle (SDLC)
  • Proven experience building distributed, highly available systems at scale
  • Strong system design expertise: APIs, async processing, backpressure, fault tolerance
  • Experience with event-driven systems (Kafka, RabbitMQ)
  • Strong engineering practices: TDD, CI/CD, code reviews, and technical debt management
  • Experience writing and communicating Architecture Decision Records (ADRs)
Job Responsibility
  • Lead the architecture and development of AI-driven features using Python and Golang
  • Own end-to-end delivery of LLM-based systems — from prototype to production — with a focus on scalability, reliability, and cost efficiency
  • Integrate and fine-tune open-source models (e.g., LLaMA, Mistral, Mixtral) and drive model selection and serving strategies
  • Research and champion emerging AI technologies aligned with product vision
  • Define and uphold architectural best practices through design and code reviews
  • Mentor junior and intermediate engineers, providing technical leadership on complex problems
  • Translate AI capabilities and constraints into clear business context for non-technical stakeholders
  • Shape responsible AI practices, including safety, privacy, and governance
  • Stay current with the open-source AI ecosystem and bring forward relevant innovations
What we offer
  • Generous benefits
  • Fair compensation
  • Remote-first work for majority of roles
  • Reasonable accommodation for individuals with disabilities
Employment type: Full-time

Principal Solution Architect

We are seeking a driven and skilled Solution Architect to lead the delivery and ...
Company: AMD
Location: Finland, Helsinki
Salary: Not provided
Expiration Date: Until further notice
Requirements
  • Proven experience in product delivery and support for developer tools, cloud infrastructure, or AI/ML platforms
  • Strong understanding of Kubernetes, GPU workloads, and AI
  • Ability to work cross-functionally with engineering, design, and product management teams
  • Excellent communication and stakeholder management skills
  • Experience with agile methodologies and product development lifecycle
  • Demonstrated success in technical SW product deliveries and support
  • Strong analytical and problem-solving skills
  • Passion for developer experience and infrastructure innovation
  • Initiative, ownership, and collaborative mindset
  • Bachelor’s or Master's degree in Computer Science, Engineering, or related field
Job Responsibility
  • Help to define product delivery and support vision, strategy and activities
  • Collaborate with engineering to prioritize activities and ensure timely, high-quality support for users
  • Engage with users and stakeholders to gather feedback, create requirements, and drive new features
  • Translate technical usage problems into clear user guidelines and tutorials
  • Champion usability and developer experience across CLI tools, APIs, and interfaces
  • Contribute to open-source community engagement and documentation efforts

AI Systems Engineer – AI Model (Training & Inference)

The AMD AI Group is looking for a Senior Software Development Engineer to own th...
Company: AMD
Location: Canada, Markham
Salary: 106400.00 - 159600.00 CAD / Year
Expiration Date: Until further notice
Requirements
  • Industry experience shipping production AI/ML infrastructure, with hands-on work spanning both training and inference.
  • Bachelor’s, Master’s, or Ph.D. degree in Computer/Software Engineering, Computer Science, or related technical discipline
Job Responsibility
  • Enable and optimize large-scale model training (LLMs, VLMs, MoE architectures) on AMD Instinct GPU clusters, ensuring correctness, reproducibility, and competitive throughput.
  • Build and maintain training infrastructure: job orchestration, distributed checkpointing, data loading pipelines, and storage optimization for multi-thousand GPU clusters on Kubernetes.
  • Debug and resolve training-specific issues including gradient norm explosions, non-deterministic behavior across GPU generations, and compute-communication overlap in distributed training (FSDP, DeepSpeed, Megatron-LM).
  • Optimize RCCL collective communication patterns for training workloads, including all-reduce, all-gather, and reduce-scatter across multi-node topologies.
  • Develop monitoring, alerting, and compliance infrastructure to ensure training cluster health, data security, and SLA adherence at scale.
  • Design and build end-to-end validation and testing infrastructure using proxy workloads, synthetic benchmarks, and configurable workload generators to systematically validate platform readiness across AMD Instinct GPU generations.
  • Write and optimize high-performance GPU kernels (GEMM, attention, quantized matmul, GPTQ/AWQ) in HIP, Triton, and MLIR targeting AMD Instinct architectures, with demonstrated ability to outperform open-source baselines.
  • Drive end-to-end inference enablement on new AMD GPU silicon - be among the first to get frontier models running on each new Instinct generation, creating reproducible guides and reference implementations.
  • Optimize inference serving frameworks (vLLM, SGLang, TorchServe) for AMD GPUs: batching strategies, KV-cache management, speculative decoding, and continuous batching for production throughput/latency targets.
  • Develop novel approaches to inference acceleration, including bio-inspired algorithms, SLM-assisted batching, and custom scheduling strategies that exploit AMD hardware characteristics.
Employment type: Full-time