CrawlJobs

AI Systems Engineer – AI Model (Training & Inference)

AMD (amd.com)

Location:
Canada, Markham

Contract Type:
Not provided

Salary:
106400.00 - 159600.00 CAD / Year

Job Description:

The AMD AI Group is looking for a Senior Software Development Engineer to own the end-to-end model execution stack on AMD Instinct GPUs, spanning training infrastructure at scale and high-performance inference serving. This role demands someone who has shipped LLMs on real hardware, written GPU kernels that moved production metrics, and built the systems infrastructure (orchestration, storage, monitoring) that keeps thousands of GPUs productive. You will be instrumental in ensuring AMD GPUs are first-class citizens for frontier model training and inference across current and next-generation Instinct accelerators.

Job Responsibility:

  • Enable and optimize large-scale model training (LLMs, VLMs, MoE architectures) on AMD Instinct GPU clusters, ensuring correctness, reproducibility, and competitive throughput.
  • Build and maintain training infrastructure: job orchestration, distributed checkpointing, data loading pipelines, and storage optimization for multi-thousand GPU clusters on Kubernetes.
  • Debug and resolve training-specific issues including gradient norm explosions, non-deterministic behavior across GPU generations, and compute-communication overlap in distributed training (FSDP, DeepSpeed, Megatron-LM).
  • Optimize RCCL collective communication patterns for training workloads, including all-reduce, all-gather, and reduce-scatter across multi-node topologies.
  • Develop monitoring, alerting, and compliance infrastructure to ensure training cluster health, data security, and SLA adherence at scale.
  • Design and build end-to-end validation and testing infrastructure using proxy workloads, synthetic benchmarks, and configurable workload generators to systematically validate platform readiness across AMD Instinct GPU generations.
  • Write and optimize high-performance GPU kernels (GEMM, attention, quantized matmul, GPTQ/AWQ) in HIP, Triton, and MLIR targeting AMD Instinct architectures, with demonstrated ability to outperform open-source baselines.
  • Drive end-to-end inference enablement on new AMD GPU silicon: be among the first to get frontier models running on each new Instinct generation, creating reproducible guides and reference implementations.
  • Optimize inference serving frameworks (vLLM, SGLang, TorchServe) for AMD GPUs: batching strategies, KV-cache management, speculative decoding, and continuous batching for production throughput/latency targets.
  • Develop novel approaches to inference acceleration, including bio-inspired algorithms, SLM-assisted batching, and custom scheduling strategies that exploit AMD hardware characteristics.
  • Build quantization pipelines (FP8, FP6, FP4, GPTQ, AWQ) for production model deployment, ensuring quality-performance tradeoffs are well-characterized across AMD GPU generations.
  • Collaborate with AMD silicon architecture and pre-silicon teams to provide software feedback and validate software stack integration on next-generation Instinct GPU designs for both training and inference workloads.
  • Build observability and automated analysis tooling: log analysis pipelines, anomaly detection, performance baselining, regression detection, and diagnostic workflows for large-scale GPU clusters.
  • Contribute to the open ROCm ecosystem and AMD's developer experience — SDKs, CI dashboards, documentation, and developer cloud enablement.
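The collective patterns named in the responsibilities above (all-reduce decomposed into reduce-scatter plus all-gather) can be sketched in miniature. Below is an illustrative single-process simulation of a ring all-reduce; it is not RCCL's actual implementation, which pipelines chunk transfers across GPU links, but it shows the same data movement schedule.

```python
# Pure-Python simulation of ring all-reduce: a reduce-scatter phase followed
# by an all-gather phase, the bandwidth-optimal pattern ring-based collectives
# (RCCL/NCCL) use for large tensors. Illustrative only: a real collective
# overlaps per-chunk sends/receives across devices rather than looping here.

def ring_all_reduce(buffers):
    """Sum equal-length per-rank buffers; every rank ends with the full sum."""
    n = len(buffers)
    chunks = [list(b) for b in buffers]          # simulate each rank's memory
    size = len(chunks[0])
    assert size % n == 0, "for simplicity, buffer length must divide evenly"
    c = size // n                                # elements per chunk

    # Reduce-scatter: after n-1 steps, rank r holds the full sum of one chunk.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            idx = (r - step) % n                 # chunk rank r forwards now
            sends.append((r, idx, chunks[r][idx * c:(idx + 1) * c]))
        for r, idx, data in sends:               # "simultaneous" exchange
            nxt = (r + 1) % n
            for j in range(c):
                chunks[nxt][idx * c + j] += data[j]

    # All-gather: circulate the reduced chunks until every rank has them all.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            idx = (r + 1 - step) % n             # reduced chunk rank r owns
            sends.append((r, idx, chunks[r][idx * c:(idx + 1) * c]))
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx * c:(idx + 1) * c] = data

    return chunks

# Four "ranks", eight elements each: every rank ends with the elementwise sum.
out = ring_all_reduce([[1] * 8, [2] * 8, [3] * 8, [4] * 8])
assert all(rank_buf == [10] * 8 for rank_buf in out)
```

Each of the 2(n-1) steps moves only 1/n of the buffer per rank, which is why this schedule approaches the bandwidth optimum for large messages.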

Requirements:

  • Industry experience shipping production AI/ML infrastructure, with hands-on work spanning both training and inference.
  • Bachelor’s or Master’s degree, or Ph.D., in Computer/Software Engineering, Computer Science, or a related technical discipline.

Nice to have:

  • Direct experience enabling frontier models (GPT-4 class) on AMD Instinct hardware end-to-end.
  • Background in building anomaly detection, log analysis, or observability systems for large-scale distributed GPU infrastructure.
  • Familiarity with AMD Instinct MI-series architectures (MI300X, MI350X, MI355X) and RCCL communication library.
  • Contributions to open-source AI frameworks (PyTorch, vLLM, SGLang, DeepSpeed, Megatron-LM).
  • Experience designing validation frameworks, proxy benchmarks, or synthetic workload suites for GPU infrastructure at scale.
  • Experience with pre-silicon software validation or hardware-software co-verification workflows.
  • Publications or patents in HPC, ML systems, or GPU kernel optimization.

Additional Information:

Job Posted:
April 16, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for AI Systems Engineer – AI Model (Training & Inference)

AI Software Engineer III

Planet DDS is a leading provider of a platform of cloud-based solutions that emp...
Location: United Kingdom, Glasgow
Salary: Not provided
Planet DDS (planetdds.com)
Expiration Date: Until further notice
Requirements:
  • 5-7 years of professional software engineering experience
  • At least 4 years in AI/ML-focused roles
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or related field
  • Experience working in a SaaS or enterprise software environment
  • Publications or contributions to open-source AI/ML projects
  • Exposure to reinforcement learning, generative AI (LLMs, diffusion models), or real-time inference systems
Job Responsibility:
  • Design, develop, and deploy AI and machine learning models in production environments
  • Architect scalable solutions that integrate AI capabilities into our products and services
  • Collaborate with data scientists, product managers, and backend/front-end engineers to translate prototypes into reliable, maintainable code
  • Own end-to-end development of AI systems, including data ingestion, model training, evaluation, and deployment
  • Implement best practices in model versioning, monitoring, and continuous improvement
  • Contribute to the evolution of our AI/ML infrastructure, including CI/CD pipelines and MLOps tools
  • Stay current on advancements in AI, ML, and deep learning and assess their applicability to business needs
  • Ensure AI solutions are ethical, interpretable, and aligned with regulatory requirements
Employment Type: Fulltime

AI Research Engineer, Scaling

As a Research Engineer focused on Scaling, you will design and build robust infr...
Location: United States, Palo Alto
Salary: 180000.00 - 300000.00 USD / Year
1X Technologies (1x.tech)
Expiration Date: Until further notice
Requirements:
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of training and inference speed bottlenecks and scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi-node debugging, and experiment management
  • Proven skills in optimizing inference performance using graph compilers, batching/scheduling, and serving systems like TensorRT or equivalents
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools such as TensorRT and bitsandbytes
  • Experience developing or tuning CUDA or Triton kernels with understanding of hardware-level optimization (vectorization, tensor cores, memory hierarchies)
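Of the quantization strategies named in the requirements above, PTQ is the simplest to illustrate. The following is a hypothetical sketch of a symmetric per-tensor int8 round-trip in plain Python; production tooling (TensorRT, bitsandbytes) uses per-channel scales, calibration data, and hardware-specific formats.

```python
# Minimal symmetric per-tensor int8 post-training quantization (PTQ) sketch.
# Illustrative only, not tied to any real pipeline's API.

def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale."""
    amax = max(abs(w) for w in weights)          # calibration: abs-max
    scale = amax / 127.0 if amax > 0 else 1.0    # symmetric int8 range
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

def max_abs_error(weights, recovered):
    """One simple way to characterize the quality side of the tradeoff."""
    return max(abs(a - b) for a, b in zip(weights, recovered))

w = [0.5, -1.27, 0.003, 1.0]
q, s = quantize_int8(w)
err = max_abs_error(w, dequantize(q, s))
# Round-trip error is bounded by half a quantization step (scale / 2).
assert err <= s / 2 + 1e-9
```

QAT differs in that the rounding is simulated during training so the model learns around it; the arithmetic of the round-trip itself is the same.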
Job Responsibility:
  • Own and lead scaling of distributed training and inference systems
  • Ensure compute resources are optimized to make data the primary constraint
  • Enable massive training runs (1000+ GPUs) using robot data, with robust fault tolerance, experiment tracking, and distributed operations
  • Optimize inference throughput for datacenter use cases such as world models and diffusion engines
  • Reduce latency and enhance performance for on-device robot policies using techniques such as quantization, scheduling, and distillation
What we offer:
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
Employment Type: Fulltime

AI Software Engineer - NLP/LLM

At Moody's, we unite the brightest minds to turn today’s risks into tomorrow’s o...
Location: United States, New York
Salary: 159300.00 - 230850.00 USD / Year
Moody's (moodys.com)
Expiration Date: Until further notice
Requirements:
  • 5+ years of demonstrated experience building production-grade machine learning systems with measurable impacts
  • Expertise in NLP and search and recommendation systems is preferred
  • Hands-on experience with large language model (LLM) applications and AI agents, including retrieval-augmented generation, prompt optimization, fine-tuning, agent design, and evaluation methodologies
  • Familiarity with prompt optimization frameworks like DSPy is preferred
  • Deep expertise in machine learning models and systems design, including classic models (e.g., XGBoost), modern deep learning and graph machine learning architectures (e.g., transformers-based models, graph neural networks (GNN)), and reinforcement learning systems
  • Proven ability to take models and agents from research to production, including optimization for latency and cost, implementation of monitoring and tracing, and development of reusable platforms or frameworks
  • Strong technical leadership and mentorship skills, with a track record of growing engineers, improving team velocity through automation, documentation, and tooling, and influencing architectural decisions without direct authority
  • Excellent communication and strategic thinking abilities, capable of aligning technical decisions with business outcomes, navigating ambiguity, and driving cross-functional collaboration
  • Bachelor’s degree or higher in Computer Science, Engineering, or a related field
Job Responsibility:
  • Design and deploy end to end AI and machine learning solutions including machine learning and graph-based models, natural language processing (NLP) models, and large language model (LLM) based AI agents
  • Build robust pipelines for data ingestion, feature engineering, model training, validation, and real-time or batch inference
  • Develop and integrate large language model (LLM) applications using techniques such as fine-tuning, retrieval-augmented generation, and reinforcement learning
  • Build autonomous agents capable of multi-step reasoning and tool use in production environments
  • Lead the full model and agent development lifecycle, from problem definition and data exploration through experimentation, implementation, deployment, and monitoring
  • Ensure solutions are scalable, reliable, and aligned with business goals
  • Advocate and implement machine learning operations (MLOps) best practices including data monitoring and tracing, error analysis, automated retraining, model and prompt versioning, business metrics monitoring, and incident response
  • Collaborate across disciplines and provide technical leadership, working with product managers, engineers, and researchers to deliver impactful solutions
  • Mentor team members, lead design reviews, and promote best practices in AI and machine learning systems development
What we offer:
  • medical
  • dental
  • vision
  • parental leave
  • paid time off
  • a 401(k) plan with employee and company contribution opportunities
  • life, disability, and accident insurance
  • a discounted employee stock purchase plan
  • tuition reimbursement
Employment Type: Fulltime

Machine Learning Systems Engineer

As a Machine Learning Systems Engineer on the AI & ML Platform team, you will bu...
Location: United States
Salary: 145800.00 - 229125.00 USD / Year
Atlassian (atlassian.com)
Expiration Date: Until further notice
Requirements:
  • Fluency in at least one modern object-oriented programming language (preferably Java/Kotlin)
  • Understanding and experience with Machine Learning project lifecycle and tools
  • Understanding of LLMs, best deployment practices and inference optimisation
  • Experience in building and implementing high-performance RESTful micro-services
  • Experience building and operating large scale distributed systems using Amazon Web Services (Sagemaker, S3, Cloud Formation, AWS Security and Networking)
  • Experience with Continuous Delivery and Continuous Integration
Job Responsibility:
  • Build and scale the core infrastructure to allow software engineers, ML engineers & data scientists to develop, train, evaluate, deploy, and operate Machine Learning models and pipelines
  • Build systems for product teams like Jira & Confluence to provide access to curated LLMs
  • Use software development expertise to solve difficult problems, tackling infrastructure and architecture challenges
  • Lead engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input and communicate results
  • Regularly tackle complex problems in the team, from technical design to launch
  • Routinely tackle complex architecture challenges and define coding standards & patterns for the team
  • Lead the team through times of ambiguity, help them adapt and deliver positive impact
  • Mentor junior members on the team
What we offer:
  • Health coverage
  • Paid volunteer days
  • Wellness resources
  • Bonuses
  • Commissions
  • Equity
Employment Type: Fulltime

Artificial Intelligence (AI) Engineer

VELOX is hiring an AI Developer to help design and implement intelligent systems...
Location: United States, Boise
Salary: Not provided
VELOX Media (veloxmedia.com)
Expiration Date: Until further notice
Requirements:
  • Strong proficiency in Python (Pandas, NumPy, scikit-learn, etc.)
  • Experience with deep learning frameworks such as TensorFlow or PyTorch
  • Hands-on experience with natural language processing, retrieval-augmented generation (RAG), or LLMs (e.g., OpenAI, Claude, Mistral)
  • Understanding of data pipelines, model deployment, and performance monitoring
  • Experience working with APIs and integrating ML models into production systems
  • Familiarity with vector databases (e.g., Pinecone, Weaviate, FAISS) and embedding generation
  • Comfort working in cloud environments (GCP, AWS, or Azure)
  • Bachelor’s or Master’s degree in Computer Science, Data Science, or a related field
  • 3+ years of experience in applied AI/ML roles
  • Track record of launching AI tools or systems into production
Job Responsibility:
  • Research, design, and deploy AI/ML models that drive value across client-facing and internal applications
  • Build tools that support predictive analytics, natural language querying, and campaign automation
  • Collaborate with product and engineering teams to integrate AI functionality into web platforms
  • Integrate AI solutions with our PHP/Laravel backend and MySQL databases via REST APIs or microservices
  • Write clean, scalable code for inference pipelines, model training, and testing environments
  • Monitor model performance and retrain or refine when necessary
  • Stay ahead of LLMs, vector DBs, and open-source innovations to enhance our AI roadmap
  • Contribute to a long-term AI strategy that makes VELOX more automated, intelligent, and insightful
What we offer:
  • Competitive compensation and performance bonuses
  • Health insurance & 401k options
  • Paid vacation and holidays
  • Casual dress and regular team events
  • On-site gym and personal trainer access
  • Kombucha on tap
Employment Type: Fulltime

AI Solution Engineer

As an AI Solution Engineer your role will be to architect, build, and deliver AI...
Location: United States
Salary: 130500.00 - 300000.00 USD / Year
Hewlett Packard Enterprise (hpe.com)
Expiration Date: Until further notice
Requirements:
  • Bachelor's, Master's or other Advanced degree in Engineering, Computer Science, or similar quantitative focus
  • 4+ years of experience working with Machine Learning or Deep Learning
  • Experience working with Kubernetes
  • Competency working with the latest LLM frameworks, both Open Source (e.g. LangChain, LlamaIndex) and proprietary (e.g. NVIDIA NeMo/NIM)
  • Competency writing ML code (for example, using PyTorch)
  • Experience with Python, Unix-like systems
  • Ability to quickly prototype functionality into scripts for demos, integrations, troubleshooting, etc.
  • Understanding of hardware requirements associated with deep learning model training or inference, and how model attributes and performance factors affect it
  • Knowledge of current AI landscape, including popular models, frameworks, applications, and capabilities
  • Experience working with on-premise hardware / GPU clusters
Job Responsibility:
  • Lead technical discussions with prospects and partners to propose HPE and partner Integrated solutions that address business challenges and opportunities using AI
  • Demo AI solutions (either existing or built by you) to prospects that address their use cases or desired AI outcomes
  • Lead Proof-of-Concepts / Proof-of-Value engagements for HPE prospects that demonstrate clear value from HPE's AI offerings, likely in combination with 3rd Party and Open Source components
  • Assist in any product or technical issue towards an initial sale or renewal of a customer
  • Help enable prospects, partners, and internal HPE teams on HPE's value in the AI landscape and how HPE and partner solutions can help solve real world business problems
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type: Fulltime

Software Engineer, AI Infrastructure

As a Software Engineer on our AI Infrastructure team, you will help design the c...
Location: United States, New York, NY; San Mateo, CA
Salary: Not provided
Fireworks AI (fireworks.ai)
Expiration Date: Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 3 years of experience in software engineering, with a focus on infrastructure or machine learning systems
  • Strong programming skills in Python, Go, or a similar language
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, MLflow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Basic working knowledge of LLMs (e.g., context length, disaggregated prefill, KV cache memory estimation, etc.)
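The "KV cache memory estimation" mentioned in the requirements above is a back-of-envelope calculation. Here is a hedged sketch; the model shape is illustrative (Llama-2-7B-like), real servers add paging overhead (e.g. vLLM's paged KV cache), grouped-query attention shrinks `num_kv_heads`, and quantized caches shrink `dtype_bytes`.

```python
# Back-of-envelope KV-cache memory estimate for a decoder-only transformer.
# Illustrative only; see the caveats in the lead-in above.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, batch_size, dtype_bytes=2):
    """Bytes needed to hold keys *and* values (hence the factor of 2)."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len * batch_size

# Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
gib = kv_cache_bytes(32, 32, 128, context_len=4096, batch_size=1) / 2**30
# → 2.0 GiB for a single full-length 4096-token sequence
```

Because the estimate is linear in batch size and context length, it immediately shows why long-context, high-batch serving is memory-bound rather than compute-bound.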
Job Responsibility:
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as LLM CI/CD pipeline, control plane, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Build frameworks and safeguards to ensure Fireworks AI has the best model quality in the industry
  • Collaborate with performance, training, and product teams to translate research and product needs into infrastructure solutions
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer:
  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure
  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation
Employment Type: Fulltime

Senior Machine Learning Engineering Manager, Gen AI

We're seeking a Senior Machine Learning Manager (M60) to lead a cross-functional...
Location: United States
Salary: 193500.00 - 303150.00 USD / Year
Atlassian (atlassian.com)
Expiration Date: Until further notice
Requirements:
  • 8+ years in ML, search, or backend engineering roles, with 3+ years leading teams
  • Strong track record of shipping ML-powered or LLM-integrated user-facing products
  • Experience with RAG systems (vector search, hybrid retrieval, LLM orchestration)
  • Deep experience in either modeling (e.g., LLMs, search, NLP) or engineering (e.g., backend infra, full-stack), with the ability to lead end-to-end
  • Deep understanding of LLM ecosystems (OpenAI, Claude, Mistral, OSS), orchestration frameworks (LangChain, LlamaIndex), and vector databases (Weaviate, Pinecone, FAISS, etc.)
  • Strong product intuition and ability to translate complex tech into valuable user features
  • Familiarity with GenAI evaluation methods: hallucination detection, groundedness scoring, and human-in-the-loop feedback loops
  • Master’s or PhD in Computer Science, Machine Learning, or related field preferred—or equivalent practical experience
Job Responsibility:
  • Lead the vision, design, and execution of LLM-powered AI products, leveraging advanced AI modeling (e.g. SLM post-training/fine-tuning), RAG architectures, and hybrid ranking systems
  • Define system architecture across retrievers, rankers, orchestration layers, prompt templates, and feedback mechanisms
  • Work closely with product and design teams to ensure delightful, fast, and grounded user experiences
  • Build and manage a cross-disciplinary team including ML engineers, backend/frontend engineers, and applied scientists
  • Foster a culture of E2E ownership — empowering the team to move from prototype to production quickly and iteratively
  • Mentor individuals to grow in both technical depth and product acumen
  • Shape the technical roadmap and long-term strategy for GenAI search across Atlassian’s product suite
  • Partner with platform and infra teams to scale inference, evaluate performance, and integrate usage signals for continuous improvement
  • Champion data quality, grounding, and responsible AI practices in all deployed features
What we offer:
  • health and wellbeing resources
  • paid volunteer days
Employment Type: Fulltime