AI Training Reliability Engineer

China, Beijing · Job Posted January 31, 2026

Job Description

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

Job Responsibility

  • Own reliability governance (standards, runbooks, SLIs/SLOs) and deliver KPI improvements (goodput/badput)
  • Productionize fast recovery paths: fault detection, isolation, membership change, and continuation without stop-the-world restarts
  • Establish fault-injection/chaos and regression gates to prevent reliability regressions (GPU/NIC/node, comms, storage, maintenance)
  • Drive day-to-day incident response and root-cause analysis, converting learnings into preventative fixes

Requirements

  • Strong software and systems engineering skills; can debug complex distributed failures end-to-end (Linux, networking, concurrency)
  • Hands-on large-scale distributed training experience (PyTorch Distributed/torchrun, common parallelism patterns)
  • Solid accelerator fundamentals and operational debugging (GPU/NPU, drivers/runtime, profiling tooling)
  • RDMA networking and collective communication fundamentals (all-reduce/all-gather/all-to-all) and related failure modes
  • Bachelor’s or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Nice to have

  • TorchFT (or similar) per-step fault tolerance / checkpointless recovery experience
  • Experience with large cluster operations and automated remediation (health checks, drain/replace, topology-aware placement)
  • Training stability hardening experience (hang watchdogs, NaN/Inf containment, OOM/memory fragmentation mitigation)

What we offer

AMD benefits at a glance

Similar Jobs for AI Training Reliability Engineer

8 matching positions

Senior C++ Engineer - AI Training

As a Senior C++ Engineer, you will work remotely on an hourly paid basis to revi...
Salary: 20.00 USD / Hour
Company: Braintrust
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree or higher in Computer Science, Electrical/Computer Engineering, or a closely related technical field
  • 7+ years of professional experience developing production software in C++ for systems, performance-critical, or large-scale applications
  • Expert-level proficiency in modern C++ (C++11 and later), including templates, move semantics, smart pointers, and the STL
  • Strong background in systems programming concepts such as concurrency, operating systems, low-level performance optimization, and memory management
  • Hands-on experience with C++ build systems, compilers, linkers, debuggers, and profiling or analysis tools
  • Minimum C1 English proficiency (written and spoken), with the ability to write clear technical explanations and follow detailed English-language guidelines
  • Proven experience conducting detailed C++ code reviews and enforcing coding standards for safety, performance, and maintainability
  • Comfort working with version control, CI/CD workflows, and automated testing frameworks in modern C++ projects
Job Responsibility
  • Review AI-generated C++ code, systems designs, and technical explanations
  • Generate high-quality reference implementations and step-by-step reasoning for complex engineering problems
  • Assess solutions for accuracy, clarity, safety, and adherence to the prompt
  • Identify errors in logic, memory handling, concurrency, or performance
  • Fact-check technical information
  • Write high-quality explanations and model solutions that demonstrate correct methods
  • Rate and compare multiple AI responses based on correctness and reasoning quality
  • Develop AI Training Content: Create detailed prompts in various topics and responses to guide AI learning, ensuring the models reflect a comprehensive understanding of diverse subjects
  • Optimize AI Performance: Evaluate and rank AI responses to enhance the model's accuracy, fluency, and contextual relevance
  • Ensure Model Integrity: Test AI models for potential inaccuracies or biases, validating their reliability across use cases
What we offer
  • Get rewarded with BTRST, Braintrust's cryptotoken, for inviting talent, taking a class—even signing up! Use token rewards to help shape the future of the platform
  • Part-time

Bash/PowerShell Engineer - AI Training

As a remote, hourly paid Bash/PowerShell Engineer, you will review AI-generated ...
Salary: 60.00 - 65.00 USD / Hour
Company: Braintrust
Expiration Date: Until further notice
Requirements
  • Expert-level proficiency in Bash and/or PowerShell scripting
  • 2–3+ years of hands-on experience in at least one of Bash or PowerShell
  • Strong experience writing, debugging, and maintaining shell or object-based automation scripts
  • Ability to evaluate scripts for correctness, readability, safety, performance, and portability across environments
  • Professional experience in software engineering, DevOps, IT automation, or infrastructure-focused roles
  • Minimum Bachelor’s degree in Computer Science or a closely related technical field
  • Significant experience using large language models (LLMs) while coding, troubleshooting, and reviewing scripts
  • Excellent English writing skills with the ability to document and explain complex automation clearly
Job Responsibility
  • Develop AI Training Content: Create detailed prompts in various topics and responses to guide AI learning, ensuring the models reflect a comprehensive understanding of diverse subjects
  • Optimize AI Performance: Evaluate and rank AI responses to enhance the model's accuracy, fluency, and contextual relevance
  • Ensure Model Integrity: Test AI models for potential inaccuracies or biases, validating their reliability across use cases
What we offer
  • Get rewarded with BTRST, Braintrust's cryptotoken, for inviting talent, taking a class—even signing up! Use token rewards to help shape the future of the platform
  • Part-time

Go Engineer - AI Training

As an hourly paid, fully remote Go Engineer for AI Data Training, you will revie...
Salary: 50.00 - 80.00 USD / Hour
Company: Braintrust
Expiration Date: Until further notice
Requirements
  • 1–2+ years of professional Go programming experience, ideally focused on backend services, APIs, or systems work
  • Strong understanding of Go idioms, including clear error handling, composition over inheritance, and simple, readable interfaces
  • Hands-on software engineering experience building and maintaining backend-style Go applications in production environments
  • Ability to evaluate correctness and readability in backend-oriented Go code, including concurrency patterns, error propagation, and API design
  • Significant experience using LLMs or AI coding assistants while programming, combined with a disciplined approach to validating their output
  • Excellent English writing skills, capable of producing precise, structured, and pedagogical technical explanations
  • Minimum Bachelor’s degree in Computer Science or a closely related technical field
  • Minimum C1 English proficiency and an extremely detail-oriented working style are required
Job Responsibility
  • Develop AI Training Content: Create detailed prompts in various topics and responses to guide AI learning, ensuring the models reflect a comprehensive understanding of diverse subjects
  • Optimize AI Performance: Evaluate and rank AI responses to enhance the model's accuracy, fluency, and contextual relevance
  • Ensure Model Integrity: Test AI models for potential inaccuracies or biases, validating their reliability across use cases
What we offer
  • Get rewarded with BTRST, Braintrust's cryptotoken, for inviting talent, taking a class—even signing up! Use token rewards to help shape the future of the platform
  • Part-time

Ruby Engineer - AI Training

As an hourly paid, fully remote Ruby Engineer for AI Data Training, you will rev...
Salary: 50.00 - 90.00 USD / Hour
Company: Braintrust
Expiration Date: Until further notice
Requirements
  • 1–2+ years of professional Ruby programming experience in production environments
  • Strong understanding of Ruby on Rails, MVC architecture, and idiomatic Ruby patterns for structuring application code
  • Hands-on software engineering experience building and maintaining Ruby or Rails applications, with exposure to real-world production constraints
  • Ability to evaluate readability, maintainability, and logical correctness in Ruby and Rails code, including model, controller, and service-layer design
  • Significant experience using LLMs or AI coding assistants while programming, combined with a disciplined approach to validating and reviewing their output
  • Excellent English writing skills, capable of producing clear, structured, and pedagogical technical explanations and code reviews
  • Minimum Bachelor’s degree in Computer Science or a closely related technical field
  • Minimum C1 English proficiency and an extremely detail-oriented working style are required
Job Responsibility
  • Develop AI Training Content: Create detailed prompts in various topics and responses to guide AI learning, ensuring the models reflect a comprehensive understanding of diverse subjects
  • Optimize AI Performance: Evaluate and rank AI responses to enhance the model's accuracy, fluency, and contextual relevance
  • Ensure Model Integrity: Test AI models for potential inaccuracies or biases, validating their reliability across use cases
What we offer
  • Get rewarded with BTRST, Braintrust's cryptotoken, for inviting talent, taking a class—even signing up! Use token rewards to help shape the future of the platform
  • Part-time

Rust Engineer - AI Training

As an hourly paid, fully remote Rust Engineer for AI Data Training, you will rev...
Salary: 50.00 - 80.00 USD / Hour
Company: Braintrust
Expiration Date: Until further notice
Requirements
  • 1–2+ years of professional Rust development experience in backend, CLI, or systems-focused projects
  • Strong understanding of Rust’s ownership, borrowing, and lifetime model, with the ability to reason clearly about aliasing and data races
  • Solid software engineering experience in at least one of backend services, command-line tools, or systems programming using Rust
  • Ability to evaluate safe, idiomatic Rust code, including appropriate use of traits, generics, pattern matching, and error handling
  • Significant experience using LLMs or AI coding assistants while programming, combined with a disciplined approach to validating their output
  • Excellent English writing skills, capable of producing precise, structured, and pedagogical technical explanations
  • Minimum Bachelor’s degree in Computer Science or a closely related technical field
  • Minimum C1 English proficiency and an extremely detail-oriented working style are required
Job Responsibility
  • Review AI-generated Rust code and explanations or generate your own
  • Evaluate the reasoning quality and step-by-step problem-solving
  • Provide expert feedback that helps models produce answers that are accurate, logical, and clearly explained
  • Assess solutions for correctness, safety, and adherence to the prompt
  • Identify errors in ownership, borrowing, lifetimes, or algorithmic reasoning
  • Fact-check information
  • Write high-quality explanations and model solutions that demonstrate idiomatic Rust patterns
  • Rate and compare multiple AI responses based on correctness and reasoning quality
  • Develop AI Training Content: Create detailed prompts in various topics and responses to guide AI learning
  • Optimize AI Performance: Evaluate and rank AI responses to enhance the model's accuracy, fluency, and contextual relevance
What we offer
  • Get rewarded with BTRST, Braintrust's cryptotoken, for inviting talent, taking a class—even signing up! Use token rewards to help shape the future of the platform
  • Part-time

Senior Site Reliability Engineer, Managed AI

At Crusoe, our Site Reliability Engineering team ensures the reliability and sca...
Location: United States, San Francisco, Sunnyvale
Salary: 172000.00 - 209000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title)
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment
Job Responsibility
  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Full-time

Staff Site Reliability Engineer, Managed AI

At Crusoe, our Site Reliability Engineering team ensures the reliability and sca...
Location: United States, San Francisco; Sunnyvale
Salary: 204000.00 - 247000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title), including defining and measuring SLIs/SLOs, building monitoring and observability systems, driving performance and reliability improvements, and designing fault-tolerant systems and automated testing strategies
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
Job Responsibility
  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Full-time

Principal AI Engineer - Enterprise AI Solutions

Our Mission: At Palo Alto Networks®, we’re united by a shared mission—to protect...
Location: United States, Santa Clara
Salary: 185200.00 - 299475.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in a related field or equivalent military experience required
  • 12+ years of related experience, or a Master's degree with 8 years of experience, or a PhD with 5 years of experience
  • Proven expertise in designing and building complex, enterprise-grade AI/ML platforms and applications
  • Direct hands-on experience with Generative AI technologies, including Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and agentic AI systems
  • Strong programming skills in languages such as Python, Java, or Go and proficiency in modern AI/ML frameworks (e.g., TensorFlow, PyTorch)
  • Extensive hands-on experience with distributed systems architecture, streaming data platforms, and cloud AI/ML platforms (e.g., GCP Vertex AI, AWS SageMaker)
Job Responsibility
  • Lead the applied AI solution design and architecture, breaking down ambiguous business problems into concrete, actionable AI solution designs
  • Drive the hands-on development and implementation of key AI components, supporting both traditional and Generative AI model development and deployment
  • Contribute significantly to the detailed design of large-scale, distributed AI/ML systems, ensuring performance, reliability, and security
  • Proactively collaborate with executive leadership, data science, engineering, and product stakeholders to translate business use-cases into scalable AI solutions
  • Provide technical leadership and mentorship to other AI/ML engineers, fostering a culture of engineering excellence and hands-on experimentation
  • Lead the implementation and continuous improvement of MLOps pipelines, including automated model training, versioning, deployment, and monitoring
  • Champion and enforce design standards, patterns, and best practices for scalable and secure development of AI applications across various teams
  • Actively research and evaluate cutting-edge AI/ML techniques, algorithms, and models to identify opportunities for platform enhancement and new solution development
What we offer
  • Restricted stock units
  • Bonus
  • Full-time