AI Training Reliability Engineer

AMD

Location:
China, Beijing

Contract Type:
Not provided

Salary:
Not provided

Job Description:

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

Job Responsibility:

  • Own reliability governance (standards, runbooks, SLIs/SLOs) and deliver KPI improvements (goodput/badput)
  • Productionize fast recovery paths: fault detection, isolation, membership change, and continuation without stop-the-world restarts
  • Establish fault-injection/chaos and regression gates to prevent reliability regressions (GPU/NIC/node, comms, storage, maintenance)
  • Drive day-to-day incident response and root-cause analysis, converting learnings into preventative fixes

Requirements:

  • Strong software + systems engineering; can debug complex distributed failures end-to-end (Linux, networking, concurrency)
  • Hands-on large-scale distributed training experience (PyTorch Distributed/torchrun; common parallelism patterns)
  • Solid accelerator fundamentals and operational debugging (GPU/NPU, drivers/runtime, profiling tooling)
  • RDMA networking and collective communication fundamentals (all-reduce/all-gather/all-to-all) and related failure modes
  • Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent

Nice to have:

  • TorchFT (or similar) per-step fault tolerance / checkpointless recovery experience
  • Experience with large cluster operations and automated remediation (health checks, drain/replace, topology-aware placement)
  • Training stability hardening experience (hang watchdogs, NaN/Inf containment, OOM/memory fragmentation mitigation)

What we offer:

AMD benefits at a glance

Additional Information:

Job Posted:
January 31, 2026

Similar Jobs for AI Training Reliability Engineer

AI Trainer - Mechanical Engineers

We’re looking for Mechanical Engineers to help train and evaluate cutting-edge A...
Salary:
Not provided
Prolific
Expiration Date
Until further notice
Requirements
  • Verified status as a mechanical engineer (e.g., by providing a professional profile that shows both educational and professional experience)
  • Recent clinical experience and comfort evaluating clinical reasoning and decision-making
  • Willingness to complete a short skills/eligibility screener to join our Domain Expert pool
  • Strong attention to detail and the ability to focus on complex tasks for up to one hour
  • A reliable, fast internet connection and access to a computer
  • Willingness to self-declare earnings (participants are self-employed)
  • A PayPal account to receive payments from our clients
Job Responsibility
  • Reviewing AI-generated responses to engineering scenarios and rating them for accuracy, appropriateness, safety, and reasoning quality
  • Comparing multiple model answers and selecting/justifying the best response
  • Writing improved exemplars, rationales, or structured feedback to help models learn where they fall short
What we offer
  • Competitive pay rates
  • Flexible hours
  • Ability to work from home

AI Trainer - Industrial Engineers

We’re looking for Industrial Engineers to help train and evaluate cutting-edge A...
Salary:
40.00 - 75.00 USD / Hour
Prolific
Expiration Date
Until further notice
Requirements
  • Verified status as an industrial engineer (e.g., by providing a professional profile that shows both educational and professional experience)
  • Recent clinical experience and comfort evaluating clinical reasoning and decision-making
  • Willingness to complete a short skills/eligibility screener to join our Domain Expert pool
  • Strong attention to detail and the ability to focus on complex tasks for up to one hour
  • A reliable, fast internet connection and access to a computer
  • Willingness to self-declare earnings (participants are self-employed)
  • A PayPal account to receive payments from our clients
Job Responsibility
  • Reviewing AI-generated responses to engineering scenarios and rating them for accuracy, appropriateness, safety, and reasoning quality
  • Comparing multiple model answers and selecting/justifying the best response
  • Writing improved exemplars, rationales, or structured feedback to help models learn where they fall short
What we offer
  • Competitive pay rates
  • Flexible hours
  • Ability to work from home

Data Engineer – AI Insights

We are looking for an experienced Data Engineer with AI Insights to design and d...
Location:
United States
Salary:
Not provided
Thirdeye Data
Expiration Date
Until further notice
Requirements
  • 5+ years of Data Engineering experience with exposure to AI/ML workflows
  • Advanced expertise in Python programming and SQL
  • Hands-on experience with Snowflake (data warehousing, schema design, performance tuning)
  • Experience building scalable ETL/ELT pipelines and integrating structured/unstructured data
  • Familiarity with LLM and RAG workflows, and how data supports these AI applications
  • Experience with reporting/visualization tools (Tableau)
  • Strong problem-solving, communication, and cross-functional collaboration skills
Job Responsibility
  • Develop and optimize ETL/ELT pipelines using Python, SQL, and Snowflake to ensure high-quality data for analytics, AI, and LLM workflows
  • Build and manage Snowflake data models and warehouses, focusing on performance, scalability, and security
  • Collaborate with AI/ML teams to prepare datasets for model training, inference, and LLM/RAG-based solutions
  • Automate data workflows, validation, and monitoring for reliable AI/ML execution
  • Support RAG pipelines and LLM data integration, enabling AI-driven insights and knowledge retrieval
  • Partner with business and analytics teams to transform raw data into actionable AI-powered insights
  • Contribute to dashboarding and reporting using Tableau, Power BI, or equivalent tools
  • Fulltime

AI Engineer

As an AI Engineer at Eitan Medical, you will be part of a team committed to brin...
Location:
Israel, Netanya
Salary:
Not provided
Eitan Medical
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, Data Science, or a related STEM field (Master’s degree preferred)
  • Strong background in machine learning and data engineering
  • Proven experience deploying LLM-based or GenAI-powered applications (via APIs, frameworks, or pre-trained models)
  • Proficiency in Python and experience with AI/ML libraries (e.g., LangChain, Hugging Face, PyTorch, TensorFlow)
  • Experience with containerization and orchestration (Docker, Kubernetes, EKS/AKS)
  • Team player with excellent communication and collaboration skills, working effectively with multidisciplinary teams
  • Independent, proactive, and self-motivated, with a strong sense of ownership and the ability to drive initiatives from concept to delivery
  • Passion for continuous learning, staying at the forefront of AI and data innovation, and translating it into tangible impact
Job Responsibility
  • Integrate Generative AI (GenAI) capabilities into Eitan’s SaaS platforms to enhance clinical decision support, treatment optimization, and actionable medical insights
  • Identify and lead AI-driven initiatives across departments to streamline processes, boost productivity, and accelerate innovation
  • Design and implement AI-powered systems, including RAG architectures and agentic workflows, using frameworks such as LangChain, LlamaIndex, or similar
  • Develop effective prompt strategies and reasoning pipelines for adaptive, context-aware, and explainable AI behavior
  • Monitor and optimize AI system performance, maintaining accuracy, reliability, and safety in healthcare contexts
  • Stay ahead of emerging AI research and tools, evaluating new technologies for their potential to deliver measurable clinical and business impact

Senior Engineering Manager - AI

We are seeking a Senior Engineering Manager (Level 5) to lead a high-performing ...
Location:
India, Chennai
Salary:
Not provided
Arcadia
Expiration Date
Until further notice
Requirements
  • 15+ years of professional experience in software engineering
  • At least 4 years in engineering leadership roles
  • Strong technical background in AI/ML systems, large-scale data pipelines, and cloud-native platforms
  • Hands-on experience with Python (preferred), modern ML frameworks (PyTorch/TensorFlow), and cloud services (AWS)
  • Proven success in managing teams of 4–6 engineers, scaling processes, and building diverse, high-performance teams
  • Strong architectural design and system-thinking abilities
  • Excellent communication skills with ability to influence cross-functional stakeholders
  • Passion for sustainability, decarbonization, and using technology to create positive climate impact
  • Experienced with building agentic pipelines with the latest models from Anthropic, Google, OpenAI, and more
Job Responsibility
  • Lead and grow a team of engineers focused on building AI-driven and data-intensive systems for the Arcadia platform
  • Design and train ML/AI models (forecasting, NLP, graph learning, generative AI) to improve data quality, cost effectiveness, and system scalability
  • Build true agentic workflows with multi-step processing incorporating RAG pipelines and MCPs
  • Balance management responsibilities (hiring, coaching, performance reviews, career growth) with technical leadership (architecture, system design, technical strategy)
  • Drive end-to-end delivery of complex projects in partnership with Product, Data, and Infrastructure teams
  • Guide the adoption of modern AI/ML technologies, ensuring practical, scalable use in production
  • Foster a culture of high performance, ownership, and technical excellence
  • Establish engineering best practices in testing, observability, reliability, and CI/CD
  • Partner with leadership to define roadmaps, set priorities, and align execution with Arcadia’s strategic goals
  • Represent AI across the company, articulating technical trade-offs and championing innovation
What we offer
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
  • Fulltime

AI Software Engineer III

Planet DDS is a leading provider of a platform of cloud-based solutions that emp...
Location:
United Kingdom, Glasgow
Salary:
Not provided
Planet DDS
Expiration Date
Until further notice
Requirements
  • 5-7 years of professional software engineering experience
  • At least 4 years in AI/ML-focused roles
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or related field
  • Experience working in a SaaS or enterprise software environment
  • Publications or contributions to open-source AI/ML projects
  • Exposure to reinforcement learning, generative AI (LLMs, diffusion models), or real-time inference systems
Job Responsibility
  • Design, develop, and deploy AI and machine learning models in production environments
  • Architect scalable solutions that integrate AI capabilities into our products and services
  • Collaborate with data scientists, product managers, and backend/front-end engineers to translate prototypes into reliable, maintainable code
  • Own end-to-end development of AI systems, including data ingestion, model training, evaluation, and deployment
  • Implement best practices in model versioning, monitoring, and continuous improvement
  • Contribute to the evolution of our AI/ML infrastructure, including CI/CD pipelines and MLOps tools
  • Stay current on advancements in AI, ML, and deep learning and assess their applicability to business needs
  • Ensure AI solutions are ethical, interpretable, and aligned with regulatory requirements
  • Fulltime

Senior Devops & AI Engineer

This role presents a unique opportunity to contribute to the future of impactful...
Location:
India, Hyderabad
Salary:
Not provided
Fission Labs
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or related field
  • 6+ years of experience in infrastructure management roles, with a focus on cloud platforms (Azure and AWS preferred)
  • Hands-on experience with operations (DevSecOps) principles and best practices
  • Proficiency in scripting languages such as Python, PowerShell, or Bash
  • Excellent communication and collaboration skills
  • In-depth knowledge of Linux operating systems, including CentOS, Ubuntu, and Red Hat, with expertise in shell scripting, package management, and system administration
  • Hands-on experience with a wide range of AWS and Azure services
  • Develop and maintain Infrastructure as Code (IAC) templates using tools such as Terraform or AWS CloudFormation
  • Experience setting up cloud infrastructure stack, databases, service endpoints, GPU as well as CPU resource scaling, optimization etc.
  • Should have worked with AIOps/MLOps
Job Responsibility
  • Configure and optimize Linux-based servers for performance, security, and resource utilization, including kernel tuning, file system management, and network configuration
  • Architect cloud solutions leveraging best practices and services offered by AWS and Azure, optimizing for scalability, reliability, and cost-effectiveness
  • Implement and manage hybrid cloud environments, facilitating seamless integration and interoperability between AWS and Azure services
  • Establish version control practices for IAC templates, ensuring traceability, auditability, and reproducibility of infrastructure changes
What we offer
  • Opportunity to work on impactful technical challenges with global reach
  • Vast opportunities for self-development, including online university access and knowledge sharing opportunities
  • Sponsored Tech Talks & Hackathons to foster innovation and learning
  • Generous benefits packages including health insurance, retirement benefits, flexible work hours, and more
  • Supportive work environment with forums to explore passions beyond work
  • Fulltime

HPC & AI Systems Engineer for Integrated Systems Test

HPC & AI Systems Engineer for Integrated Systems Test role at Hewlett Packard En...
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor's or master's degree in Computer Engineering, Computer Science, Electrical Engineering, Information Systems, or equivalent
  • Minimum 4 years of experience
  • Experience with certification & submission to OS vendors of Linux (Red Hat, SLES, Ubuntu, etc.), Windows Server operating systems, Windows Client operating systems, and VMware (ESXi)
  • Experience installing and working with Linux, Windows, and VMware OSes
  • Experience in programming or scripting languages, Python, PowerShell, Perl, Linux Shell, Java, MySQL, MS SQL Server
  • Understanding of Redfish commands, RESTful API, and JSON format
  • Knowledge of creating and using Docker containers and VMs
  • Experience in configuring Storage (internal/external storage, file systems, and raid/non-raid settings) and Networking devices (iSCSI, FCoE, IPs, VLANs, Bonding, Jumbo Frames, LAGs)
  • Knowledge of networking concepts such as NIC teaming, VLANs, IPv4, IPv6
  • Excellent written and verbal communication skills in English
Job Responsibility
  • Work with Program & Product Management, technical leads, and product development teams to obtain product feature requirements
  • Design and implement new test features in existing and new test cases
  • Analyze, debug and provide feedback/resolution on issues uncovered by test team prior to submission of results to OS vendors for approval
  • Implement software solutions for multiple test programs/projects with internal and outsourced development partners
  • Review and evaluate the implementation and use of test automation and test tools
  • Plan, develop, and implement software tools for the testing and evaluation of current and next-generation HPE HPC products
  • Debug and analyze issues to a successful resolution
  • Perform testing in local and remote labs
  • Drive appropriate automated test execution to test engineers at various global locations
  • Provide training and guidance to test teams both onshore and offshore
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Fulltime