The AI/ML Frameworks team is hiring a Software Development Engineer to build and maintain scalable DevOps infrastructure that accelerates AMD's AI software development. You will design and own CI/CD pipelines, manage Kubernetes‑based GPU environments, and automate systems using Python, Go, and Ansible. The role involves creating and maintaining production‑grade automation and tooling that enables fast, reliable software delivery across teams.
Job Responsibilities
Develop deep expertise in build tools and flows (CMake, Bazel, Make, compiler toolchains)
Triage complex build failures by understanding the full build pipeline
Identify root causes across infrastructure, toolchain, and code-level issues
Train and mentor team members on build systems, CI/CD workflows, and debugging techniques
Create documentation, runbooks, and training sessions
Understand the architecture and codebase of ML frameworks (PyTorch, TensorFlow, ROCm stack)
Review, debug, and contribute code changes as needed
Design and develop internal tools, automation scripts, and services primarily in Python and Go
Design, implement, and manage efficient continuous integration and delivery pipelines using Buildkite, GitHub Actions, and Jenkins
Deploy and maintain robust Kubernetes-based environments across both on-premises and cloud platforms
Automate provisioning, configuration, and management of infrastructure using Ansible, Python, and Bash
Administer application and service deployment in Kubernetes using Helm charts
Configure, manage, and maintain GPU-based compute environments
Interact with MySQL databases to support dynamic data updates and integrate data sources into Grafana dashboards
Work closely with ML framework developers, SREs, and project stakeholders
Integrate automated testing frameworks into CI pipelines
Requirements
Strong understanding of CMake, Bazel, Make, and compiler toolchains (GCC, Clang, LLVM)
Ability to debug complex build failures, understand dependency resolution, and optimize build performance
Strong proficiency in Python and Go for building tools, services, and automation
The ability to read and modify C++ code is a plus
Understanding of ML framework architecture (PyTorch, TensorFlow, JAX, or similar)
Ability to navigate large codebases, understand their build systems, and contribute fixes or improvements
Experience documenting complex systems and training team members
Ability to break down technical concepts and create effective learning materials
Proficient with Buildkite, GitHub Actions, Jenkins, Ansible, and scripting for streamlining DevOps workflows
Strong experience with Docker, Kubernetes, and Helm for deploying and managing scalable, containerized applications
Hands-on experience automating infrastructure provisioning and configuration to ensure reproducibility and scalability across environments
Familiarity with GPU server lifecycle management, ROCm/CUDA toolchains, and integration of GPU resources into CI test workflows
Experience using tools like Checkmk, Prometheus, and Grafana to monitor infrastructure health and application performance
Advanced knowledge of Git-based version control, including branching strategies and CI/CD integration
Solid background in Linux environments, including shell scripting and system-level troubleshooting across distributed systems
Comfort working in Agile teams and partnering with software, infrastructure, and product teams
Bachelor's or Master's degree in Computer Science, Software Engineering, or related technical discipline
Nice to have
Familiarity with C++
What we offer
Benefits offered are described in "AMD benefits at a glance".