Senior Manager, Performance AI/ML Network Deployment Engineering

AMD

Location:
United States, Santa Clara

Contract Type:
Not provided

Salary:

210400.00 - 315600.00 USD / Year

Job Description:

The Senior Manager, DC GPU Advanced Forward Deployment and Systems Engineering is a leadership position designed to optimize the design, rollout, and post-rollout management of AI/ML fabrics. The candidate will be the technical interface between customers and various internal engineering groups and field application engineers. Leveraging extensive experience in large network architecture, storage, AI/ML network deployments, and performance tuning, this role requires a disciplined approach to system triage, at-scale debug, and infrastructure optimization to ensure robust performance and efficient transitions from GPU production qualification to at-scale datacenter deployment.

Job Responsibility:

  • Collaborate with strategic customers on scalable designs spanning compute, networking, and storage environments; work with industry partners and internal teams to accelerate the deployment and adoption of various AI/ML models
  • Engage in system-level triage and at-scale debug of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability
  • Drive the ramp of Instinct-based, large-scale AI datacenter infrastructure built on NPI base platform hardware with ROCm, scaling up to pod and cluster level and leveraging best-in-class network architecture for AI/ML workloads
  • Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations
  • Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins
  • Provide domain-specific knowledge to other groups at AMD and share lessons learned to drive continuous improvement
  • Engage with AMD product groups to drive resolution of application and customer issues
  • Develop and present training materials to internal audiences, at customer venues, and at industry conferences

Requirements:

  • Expertise in networking and performance optimization for large-scale AI/ML networks, including network, compute, and storage cluster design, modeling, analytics, performance tuning, convergence, and scalability improvements
  • Preference for candidates with solid, hands-on expertise in at least one of three domains: compute, network, or storage
  • Experience in working with large customers such as Cloud Service Providers and global enterprise customers
  • Proven leadership in engaging customers across diverse technical disciplines in settings such as proofs of concept, competitive evaluations, and early field trials
  • Direct experience working with large customers, with the ability to operate with a sense of urgency, own problems, and resolve them
  • Demonstrated leadership in network architecture, with hands-on experience in RoCEv2 design, VXLAN-EVPN, BGP, and lossless fabrics
  • Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends
  • Extensive hands-on network deployment expertise and a proven track record of delivering large projects on time. Cisco, Juniper, or Arista experience is preferred
  • Direct co-development/deployment experience working with strategic customers and partners to bring solutions to market
  • Excellent communication skills with audiences ranging from engineers to mid-level management to the C-level
  • Bachelor's or master's degree in Computer Science, Engineering, or a related field
  • This is a Senior level role
  • No recent college graduates will be considered
  • Ability to work well in a geographically dispersed team
  • Certifications in Networking, AI/ML, or Cloud Technologies

Additional Information:

Job Posted:
December 17, 2025

Similar Jobs for Senior Manager, Performance AI/ML Network Deployment Engineering

Senior DevOps Engineer (GCP)

Our client is a global UK-based financial services and investment banking organi...
Location
Salary:
Not provided
N-iX
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in DevOps, Cloud Engineering, or SRE roles
  • Strong hands-on experience with Google Cloud Platform, including: GKE / Kubernetes, Cloud Run, Cloud Functions, Pub/Sub, Cloud Storage, VPC, IAM, networking, security
  • Expertise in Terraform, Helm, or other IaC tools
  • Experience building CI/CD pipelines (GitHub Actions, GitLab CI, CircleCI, Jenkins, etc.)
  • Strong understanding of containerization and orchestration: Docker, Kubernetes
  • Solid experience with monitoring, observability, and logging stacks
  • Familiarity with networking, load balancing, security hardening, and zero-trust principles
  • Experience supporting production systems in high-availability, distributed environments
  • Strong scripting skills (Python, Bash, or similar)
  • Experience working with agile engineering teams
Job Responsibility
  • Design, implement, and maintain cloud infrastructure on Google Cloud (GKE, Cloud Run, Cloud Functions, Pub/Sub, Cloud Storage)
  • Build and optimize CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or similar)
  • Develop infrastructure-as-code using Terraform or similar tools
  • Set up and maintain container orchestration (Kubernetes, GKE) and automated deployment workflows
  • Implement monitoring, alerting, and observability using tools such as Prometheus, Grafana, ELK/Elastic, Stackdriver, or OpenTelemetry
  • Ensure compliance with security and governance standards across all environments
  • Collaborate closely with engineering teams to ensure scalable, high-performance deployment architectures
  • Support AI/ML and GenAI workloads (Vertex AI pipelines, model hosting, GPU workloads, inference optimization)
  • Manage environment strategies, release pipelines, configuration management, and secrets management
  • Optimize cloud costs and recommend improvements for performance and reliability
What we offer
  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Senior Devops & AI Engineer

This role presents a unique opportunity to contribute to the future of impactful...
Location
India, Hyderabad
Salary:
Not provided
Fission Labs
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or related field
  • 6+ years of experience in Infrastructure Mgmt. roles, with a focus on cloud platforms (Azure and AWS Preferred)
  • Hands-on experience with operations (DevSecOps) principles and best practices
  • Proficiency in scripting languages such as Python, PowerShell, or Bash
  • Excellent communication and collaboration skills
  • In-depth knowledge of Linux operating systems, including CentOS, Ubuntu, and Red Hat, with expertise in shell scripting, package management, and system administration
  • Hands-on experience with a wide range of AWS and Azure services
  • Experience developing and maintaining Infrastructure as Code (IAC) templates using tools such as Terraform or AWS CloudFormation
  • Experience setting up cloud infrastructure stacks, databases, and service endpoints, as well as GPU and CPU resource scaling and optimization
  • Should have worked on AIOps/MLOps
Job Responsibility
  • Configure and optimize Linux-based servers for performance, security, and resource utilization, including kernel tuning, file system management, and network configuration
  • Architect cloud solutions leveraging best practices and services offered by AWS and Azure, optimizing for scalability, reliability, and cost-effectiveness
  • Implement and manage hybrid cloud environments, facilitating seamless integration and interoperability between AWS and Azure services
  • Establish version control practices for IAC templates, ensuring traceability, auditability, and reproducibility of infrastructure changes
What we offer
  • Opportunity to work on impactful technical challenges with global reach
  • Vast opportunities for self-development, including online university access and knowledge sharing opportunities
  • Sponsored Tech Talks & Hackathons to foster innovation and learning
  • Generous benefits packages including health insurance, retirement benefits, flexible work hours, and more
  • Supportive work environment with forums to explore passions beyond work
  • Full-time

Engineering Director

We are seeking a seasoned Engineering Director who thrives in challenging and fa...
Location
Puerto Rico, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Significant work experience as a director or in a similar position working across multiple stakeholder organizations, with 10+ years of people leadership experience specific to SW and Cloud engineering
  • Solid experience leading SW development across storage, networking, on-prem, and SaaS is a must
  • Experience in setting up geographically distributed sites
  • Must have a strong background in software development lifecycle including cloud infrastructure
  • Familiarity with agile methodologies and tools like JIRA
  • Prior experience in cloud product development and deployments
  • End-to-end ownership and accountability
  • Solid understanding of fundamental AI and machine learning concepts, including supervised and unsupervised learning, deep learning, reinforcement learning, natural language processing, computer vision, and statistical modeling
  • Extensive business acumen, technical knowledge, and industry experience encompassing one or more engineering, technology, and product domains
  • Demonstrated abilities to drive transformation across a business with exceptional skills in the management of change
Job Responsibility
  • Oversee the Puerto Rico Site daily operations, strategic planning and cross-functional team leadership for Hybrid Cloud
  • Recruit, mentor, and manage teams of AI/ML engineers, QA Engineers, Design Engineers and innovation specialists to deliver cutting-edge solutions
  • Continuously evaluate new tools, platforms, and frameworks in AI/ML to drive competitive advantage and operational efficiency
  • Ensure alignment with corporate goals while fostering a high-performance culture, operational efficiency, and employee engagement
  • Lead the development and execution of AI/ML strategies that align with business goals and drive innovation across products, services, or operations
  • Create strategic and tactical operations and resource plans, goals, and priorities for assigned organization based on business and technology roadmap and functional objectives
  • Engage with various senior leaders across the organization, program managers, R&D, support, Quality, product managers, technical leaders and executives to communicate program status, escalate issues, and guide and influence strategic decision-making
  • Manage senior relationships and escalated issues with outsourced partners and suppliers, including setting expectations regarding deliverables, product quality, schedules, and costs
  • Ensure that the organization is effectively leveraging outsourced resources
  • Identify opportunities for and drive organizational initiatives and programs to support business process improvements and cost reductions
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Full-time

Senior Software Engineer - Go (Next-Gen Firewall)

We are seeking a Senior Backend Engineer (Go) to join our core engineering team ...
Location
Vietnam, Ho Chi Minh City
Salary:
Not provided
Qualgo
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master’s degree in Computer Science, Cybersecurity, Network Engineering, or related field
  • Deep understanding of goroutines, channels, memory management, and profiling (pprof)
  • Strong grasp of the OSI model, TCP/IP, DNS, TLS/SSL, VPNs (WireGuard/IPsec), and Routing
  • Experience with Docker, Kubernetes, and deploying network appliances on AWS/GCP/Azure
  • Production experience with Kafka, RabbitMQ, or NATS
  • Good English skills (speaking and listening) to communicate with the global teams
  • Hands-on experience with Suricata, Snort, Zeek, or Squid Proxy
  • Familiarity with OPNsense or pfSense architecture is a huge plus
Job Responsibility
  • Design and implement high-performance Go services that interact with network subsystems (netfilter/nftables) and open-source security engines (Suricata, Squid, Zeek)
  • Design and implement routing functionalities on low resource gateway system
  • Develop custom plugins or sidecars to ingest, parse, and normalize IDS/IPS alerts (Suricata EVE logs) and Proxy logs for the AI engine
  • Build the "Action Engine" that translates AI threat verdicts into real-time blocking rules (firewall policies, BGP blackholing, or DNS sinkholing)
  • Deeply integrate with OPNsense APIs/plugins to orchestrate policy updates across distributed firewall nodes
  • Architect scalable gRPC and REST APIs to serve as the control plane for thousands of firewall agents
  • Write highly optimized, concurrent Go code to handle high-throughput log ingestion with minimal latency/GC overhead
  • Design distributed locking and consistency mechanisms to ensure firewall policies are synchronized globally across multi-tenant environments
  • Build low-latency pipelines using Kafka or NATS JetStream to stream network telemetry to our AI/ML inference engine
  • Implement WebSocket or HTTP/2 streaming for real-time threat visualization and alerting dashboards
What we offer
  • Meaningful work & impact
  • Competitive rewards
  • Growth & well-being
  • People & workspace
  • Young & dynamic environment

Senior Technical Program Manager, Infrastructure

Glean is seeking a Senior Infrastructure Technical Program Manager (TPM) to lead...
Location
United States, Palo Alto
Salary:
198000.00 - 235500.00 USD / Year
Glean
Expiration Date
Until further notice
Requirements
  • BS/MS in Computer Science, Engineering, or a related technical field
  • 8-10+ years of experience in technical program management, infrastructure, or SRE, with at least 3-5 years managing infra or platform-scale programs
  • Proven success delivering cross-functional infrastructure programs in B2B or enterprise environments where scalability, uptime, and performance are critical
  • Experience working closely with Infra, SRE, and ML/AI teams on distributed systems or data infrastructure
  • Strong understanding of cloud infrastructure (AWS, GCP, or Azure) including compute, networking, storage, and orchestration systems
  • Ability to structure complex multi-quarter infrastructure programs with clear milestones and measurable impact
  • Strong written and verbal communication and ability to manage through ambiguity, anticipate scaling challenges, and align teams across priorities
  • Builder mindset with focus on automation, reliability, and efficiency
Job Responsibility
  • Lead end-to-end infra programs spanning compute, networking, storage, orchestration, and AI workloads
  • Partner with Engineering to define standards for environment provisioning, deployment automation, and configuration governance
  • Develop and operationalize frameworks for runtime health, scaling, and disaster recovery
  • Drive consistency and automation across deployment orchestration systems
  • Establish clear metrics for reliability, performance, and cost efficiency
  • Coordinate cross-team delivery of high-impact programs such as data pipeline scalability, LLM infrastructure expansion, or infra observability improvements
  • Communicate program status and technical risks effectively to leadership and stakeholders
  • Continuously identify process or system bottlenecks, and drive automation to improve speed and reliability of infra operations
What we offer
  • Medical, Vision, and Dental coverage
  • Generous time-off policy
  • Opportunity to contribute to your 401(k) plan
  • Home office improvement stipend
  • Annual education and wellness stipends
  • Vibrant company culture through regular events
  • Healthy lunches daily
  • Full-time

Director - AI

The Director - AI in Delivery leads multiple teams and managers, ensuring each gr...
Location
United States
Salary:
168400.00 - 252600.00 USD / Year
3Cloud
Expiration Date
Until further notice
Requirements
  • 12+ years of experience delivering solutions in a primary domain such as application development, cloud platform engineering, DevOps, or data and analytics, including significant work at enterprise scale
  • 7+ years of experience in solution architecture and leading development or engineering teams, including multi-team or multi-workstream efforts at program scale
  • Proven ability to oversee delivery of complex, enterprise-scale AI/ML programs
  • Deep expertise across most of the following domains:
  • Agents & Orchestration – leveraging Semantic Kernel, LangChain, or similar frameworks to design intelligent workflows; applying tool/function calling, planner and agent-based patterns, and workflow automation engines
  • Search & Retrieval – implementing solutions with vector databases such as Azure AI Search, Pinecone, FAISS, or Milvus; applying hybrid search methods and re-ranking strategies, and selecting embedding models for accuracy and performance
  • Evaluation, Quality & LLM Operations – establishing offline and online evaluation approaches using golden datasets, hallucination and groundedness validation, and toxicity and safety testing; building telemetry and feedback loops
Job Responsibility
  • Maintain a clear view of each team member's delivery commitments, growth plans, and internal contributions
  • Develop and grow team members through regular coaching, clear expectations, and constructive feedback. Build a culture of trust, inclusion, and collaboration
  • Monitor team health, morale, and workload balance, acting early to address engagement or performance concerns in partnership with practice leadership
  • Maintain regular one-to-one connections with each team member, provide clear feedback, and help them remove blockers related to client work, pursuits, or internal initiatives
  • Oversee staffing and resource alignment across teams, balancing utilization, development goals, client needs, and sustainable workloads
  • Communicate practice priorities and organizational objectives clearly, ensuring teams understand how their work contributes to broader organizational goals
  • Define interview standards and mentorship strategies, training others to apply them consistently and improving hiring outcomes and talent growth at scale
  • Create and maintain development plans for team members, including stretch assignments, shadowing, certifications, and opportunities for external visibility
  • Lead or support performance reviews, promotion recommendations, and compensation input for your teams, using consistent standards that reflect both impact and behavior
  • Represent your teams in leadership forums, communicate expectations and decisions clearly back to the group, and bring forward patterns, risks, and successes that should shape practice strategy
What we offer
  • Flexible work location with a virtual first approach to work!
  • 401(K) with match up to 50% of your 6% contributions of eligible pay
  • Generous PTO providing a minimum of 15 days in addition to 9 paid company holidays and 2 floating personal days
  • Two medical plan options to allow you the choice to elect what works best for you!
  • Option for vision and dental coverage
  • 100% employer paid coverage for life and disability insurance
  • Paid leave for birth parents and non-birth parents
  • Option for Healthcare FSA, HSA, and Dependent Care FSA
  • $67.00 monthly tech and home office allowance
  • Utilization and/or discretionary bonus eligibility based on role
  • Full-time

Senior Software Engineer (Python)

We're looking for a Senior Backend Engineer (Python) to join our Data Storage te...
Location
United Kingdom
Salary:
Not provided
Supermetrics
Expiration Date
Until further notice
Requirements
  • 6+ years of back-end experience (Python) in a production environment, preferably building a SaaS product
  • Experience with building data pipelines or handling large volumes of data
  • Experience working with API integrations
  • Ability to conduct unit testing, integration testing, and end-to-end testing
  • Proficient understanding of architecture & software design
  • Proficient grasp of the software testing discipline
  • Understanding of security best practices
  • Experience collaborating directly with product teams and designers
  • Detail-oriented with advanced analytical and problem-solving abilities
  • Effective communication skills and fluent in English
Job Responsibility
  • Development of new features and functionalities for our customers
  • Planning new initiatives and features
  • Collaborate with product managers, designers, and other stakeholders to define technical roadmaps, prioritize features, and estimate development efforts
  • Implement and uphold high code quality standards by conducting thorough code reviews, promoting best practices in software development, and ensuring maintainability and scalability
  • Mentor and guide team members, creating a culture of learning, collaboration, and continuous improvement. Provide technical guidance, conduct code reviews, and share knowledge to enhance the team's overall performance and proficiency
  • Take initiative to spot and mitigate potential issues in the system, improve monitoring mechanisms, and guarantee consistent performance and stability
  • Utilizing existing monitoring tools to ensure system stability
What we offer
  • Competitive compensation package, including equity
  • Great work equipment, and home office allowance for those working in our fully remote locations
  • Health care benefit and leisure time insurance
  • Annual personal learning budget of 1,000 euros
  • Sports and well-being allowance
  • Full-time

Business Intelligence Analyst II

At Aristocrat, we believe in pushing boundaries and redefining the gaming experi...
Location
United States, Las Vegas
Salary:
64214.00 - 119255.00 USD / Year
Aristocrat Gaming
Expiration Date
Until further notice
Requirements
  • BA / BS degree or equivalent experience
  • At least 2 years of experience using data visualization tools (Tableau, Power BI, Qlik), specializing in building dashboards
  • At least 2 years of experience handling data warehouse tools (SQL, Snowflake), with responsibilities that include querying and building tables/views
  • Demonstrated practical experience in data presentation and analysis
  • Experience developing, testing, implementing, and operating data solutions in an enterprise environment
  • Strong interpersonal skills, with a focus on establishing relationships with mutual trust and respect
  • Ability to manage individual workload, meet deadlines, and communicate progress to leaders and collaborators
  • Strong communication skills, both oral and written, with the ability to present to peers and leaders
  • Ability to learn new concepts quickly and apply them to work products
Job Responsibility
  • Collaborate with internal partners to translate business requirements into technical solutions and remediate data quality issues
  • Iteratively develop and test data assets, ensuring scalability, maintainability, usability, and data quality
  • Build tables and views in Microsoft SQL Server, dashboards in Tableau, and PowerPoint presentations
  • Develop intuitive data visualizations to make large and sophisticated data more accessible and understandable
  • Establish relationships with business partners, including commercial strategy, enterprise data, sales, and finance
  • Estimate and communicate the complexity and duration of work on projects
  • Lead multiple projects and proactively communicate status to collaborators
  • Use the project management tool, Wrike, to track progress on work
What we offer
  • Health, dental, and vision insurance
  • Paid time off
  • 401(k) plan with employer matching
  • Robust benefits package
  • Global career opportunities
  • Full-time