Senior Manager, Performance AI/ML Network Deployment Engineering Job at AMD (Santa Clara)

AMD

Location:
United States, Santa Clara

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

210400.00 - 315600.00 USD / Year

Job Description:

The Senior Manager, DC GPU Advanced Forward Deployment and Systems Engineering is a leadership position designed to optimize the design, rollout, and post-rollout management of AI/ML fabrics. The candidate will be the technical interface between customers and various internal engineering groups and field application engineers. Leveraging extensive experience in large network architecture, storage, AI/ML network deployments, and performance tuning, this role requires a disciplined approach to system triage, at-scale debug, and infrastructure optimization to ensure robust performance and efficient transitions from GPU production qualification to at-scale datacenter deployment.

Job Responsibility:

Collaborate with strategic customers on scalable designs involving compute, networking, and storage environments; work with industry partners and internal teams to accelerate the deployment and adoption of various AI/ML models
Engage in system-level triage and at-scale debug of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability
Drive the ramp of Instinct-based large-scale AI datacenter infrastructure based on NPI base platform hardware with ROCm, scaling up to pod and cluster level, leveraging the best in network architecture for AI/ML workloads
Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations
Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins
Provide domain-specific knowledge to other groups at AMD and share lessons learned to drive continuous improvement
Engage with AMD product groups to drive resolution of application and customer issues
Develop and present training materials to internal audiences, at customer venues, and at industry conferences

Requirements:

Expertise in networking and performance optimization for large-scale AI/ML networks, including network, compute, and storage cluster design, modelling, analytics, performance tuning, convergence, and scalability improvements
Preference for candidates with solid, hands-on expertise in at least one of three domains: compute, network, or storage
Experience in working with large customers such as Cloud Service Providers and global enterprise customers
Proven leadership in engaging customers across diverse technical disciplines in settings such as proofs of concept, competitive evaluations, and early field trials
Direct experience working with large customers, with the ability to operate with a sense of urgency, own problems, and resolve them
Demonstrated leadership in network architecture, with hands-on experience in RoCEv2 design, VXLAN-EVPN, BGP, and lossless fabrics
Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends
Extensive hands-on network deployment expertise and a proven track record of delivering large projects on time. Cisco, Juniper, or Arista experience is preferred
Direct co-development/deployment experience working with strategic customers/partners to bring solutions to market
Excellent communication skills across audiences, from engineers to mid-management to C-level
Bachelor's or master's degree in computer science, engineering, or related subjects
This is a senior-level role; no recent college graduates will be considered
Ability to work well in a geographically dispersed team
Certifications in Networking, AI/ML, or Cloud Technologies