Lead AI Infrastructure Engineer

Thoughtworks

Location:
Singapore, Singapore

Contract Type:
Not provided

Salary:
Not provided

Job Description:

At Thoughtworks, Lead AI Infrastructure Engineers design and maintain high-performance, scalable, and resilient infrastructure for modern AI workloads. You’ll focus on enabling advanced inference systems, including LLMs, VLMs, and SLMs, across on-premises GPU clusters and cloud environments. This role is critical to ensuring our clients’ AI systems achieve demanding requirements for throughput, latency, availability, and compliance. As a senior technical leader, you will partner with ML engineers, platform engineers, AI researchers, and client stakeholders to deliver optimized infrastructure that is both robust and future-proof. You will combine deep expertise in GPU-based inference infrastructure with a broader understanding of DevOps, agile delivery, and platform engineering to drive impactful AI solutions at enterprise scale.

Job Responsibility:

  • Design and operate GPU-based infrastructure (e.g., NVIDIA GB200, H100) across cloud and self-hosted environments
  • Architect scalable inference platforms that support real-time and batch serving with high availability, load balancing, and fault tolerance
  • Integrate inference workloads with orchestration frameworks such as Kubernetes, Slurm, and Ray, as well as observability stacks like Prometheus, Grafana, and OpenTelemetry
  • Automate infrastructure provisioning and deployment using Terraform, Helm, and CI/CD pipelines
  • Collaborate with ML engineers to co-design systems optimized for low-latency serving, continuous batching, and advanced inference optimization techniques (quantization, distillation, pruning, KV caching)
  • Lead client engagements by shaping technical roadmaps that align AI infrastructure with business objectives, ensuring compliance, scalability, and performance
  • Champion DevOps and agile practices to accelerate delivery while maintaining reliability, quality, and resilience
  • Mentor and guide teams in best practices for AI infrastructure engineering, fostering a culture of technical excellence and innovation

Requirements:

  • Expertise in GPU-based infrastructure for AI (H100, GB200, or similar), including scaling across clusters
  • Strong knowledge of orchestration frameworks: Kubernetes, Ray, Slurm
  • Experience with inference-serving frameworks (vLLM, NVIDIA Triton, DeepSpeed)
  • Proficiency in infrastructure automation (Terraform, Helm, CI/CD pipelines)
  • Experience building resilient, high-throughput, low-latency systems for AI inference
  • Strong background in observability and monitoring: Prometheus, Grafana, OpenTelemetry
  • Familiarity with security, compliance, and governance concerns in AI infrastructure (data sovereignty, air-gapped deployments, audit logging)
  • Solid understanding of DevOps, cloud-native architectures, and Infrastructure as Code
  • Exposure to multi-cloud and hybrid deployments (AWS, GCP, Azure, sovereign/private cloud)
  • Experience with benchmarking and cost/performance tuning for AI systems
  • Background in MLOps or collaboration with ML teams on large-scale AI production systems
  • Proven ability to partner with senior client stakeholders (CTO, CIO, COO) and translate technical strategy into business outcomes
  • Skilled at leading multi-disciplinary teams and building trust across diverse technical and business functions
  • Strong communication skills, with the ability to explain complex AI infrastructure concepts to both technical and non-technical audiences
  • Comfortable navigating uncertainty, making pragmatic decisions, and adapting quickly to evolving technologies
  • Passionate about creating scalable, sustainable, and high-impact solutions that help transform industries with AI

Additional Information:

Job Posted:
January 12, 2026

Employment Type:
Full-time

Similar Jobs for Lead AI Infrastructure Engineer

Senior Engineering Manager - AI Core Platform

We’re hiring a Senior Engineering Manager (or high-potential EM2) for the Core P...
Intercom
Location:
Ireland, Dublin
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Experience leading engineering teams, ideally across infrastructure or platform domains
  • Recent hands-on coding experience — you’ve shipped production code in the last couple of years
  • Strong technical judgment and the ability to coach senior engineers through complex architectural trade-offs
  • Adaptable leadership style suited to a group that will grow quickly, and change shape over time
  • Curiosity and enthusiasm for AI, with a desire to learn how ML systems are developed and operated in production
Job Responsibility:
  • Lead a high-performing team building the platform and infrastructure that power Intercom’s AI capabilities
  • Contribute directly to production code, staying close to the work and building knowledge & context through first-hand experience
  • Support teams of ML Scientists and Engineers building AI-powered capabilities
  • Plan, prioritize, and deliver high-impact roadmaps in partnership with the team’s most senior engineers, balancing delivery, quality, and innovation
  • Improve developer experience across the AI infrastructure stack, ensuring that systems are observable, scalable, and easy to build upon
  • Empower the engineers on the team to act with agency and maximize their impact
  • Expand your scope over time, potentially taking ownership of additional platform domains as the team and AI initiatives grow
What we offer:
  • Competitive salary and equity in a fast-growing start-up
  • We serve lunch every weekday, plus a variety of snack foods and a fully stocked kitchen
  • Regular compensation reviews - we reward great work
  • Pension scheme & match up to 4%
  • Peace of mind with life assurance, as well as comprehensive health and dental insurance for you and your dependents
  • Flexible paid time off policy
  • Paid maternity leave, as well as 6 weeks paternity leave for fathers, to let you spend valuable time with your loved ones
  • If you cycle, we’ve got you covered with the Cycle-to-Work Scheme and secure bike storage
  • MacBooks are our standard, but we also offer Windows for certain roles when needed
Employment Type:
Full-time

Senior Platform Engineer - CI/CD & AI Automation (AI-first)

Groupon is undergoing a critical platform transformation, modernizing its core d...
Groupon
Location:
Czechia, Prague
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 5+ years of dedicated experience in Platform Engineering, DevOps, or Infrastructure roles
  • Deep expertise building, scaling, and migrating CI/CD systems, with strong practical experience in Jenkins and/or GitHub Actions
  • Expertise in scripting and automation (Python, Go, or Bash)
  • Solid understanding of container technologies, Kubernetes, and cloud build systems
  • Proven experience leveraging AI tooling (e.g., Claude Code, code analysis) to meaningfully increase developer output and optimize platform work
  • Excellent communication and ability to drive technical decisions across multiple platform and product teams
Job Responsibility:
  • Platform Transformation: Lead the design, planning, and execution of the Jenkins-to-GitHub Actions migration across a large portfolio of microservices
  • Pipeline Engineering: Design and optimize high-performance, secure, and observable CI/CD workflows across GitHub Actions, Jenkins, and Kubernetes environments
  • AI-First Automation: Drive an AI-First workflow by leveraging tools (e.g., Copilot, code generation) to eliminate infrastructure toil, accelerate development, and analyze pipeline failures
  • Core Automation: Develop robust platform automation (e.g., Python, Go, Bash) to improve build efficiency, artifact caching, reliability, and repository hygiene
  • Security & Compliance: Harden CI/CD infrastructure with robust controls for secrets management, RBAC, audit logging, and secure runner design
  • Observability: Implement and enhance CI/CD observability using tools like Prometheus, Grafana, and OpenTelemetry to provide deep insights into performance and reliability
  • Technical Leadership: Mentor engineers and partner across Cloud, Security, and Developer Experience teams to define and evolve our end-to-end delivery platform architecture

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Apollo.io
Location:
Canada; United States
Salary:
195000.00 - 285000.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, NewRelic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility:
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
Employment Type:
Full-time

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Plaid
Location:
United States, San Francisco
Salary:
241200.00 - 400000.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer (IC or TL)
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility:
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer:
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
Employment Type:
Full-time

Director of AI Engineering

We are entering a hyper-growth phase of AI innovation and are hiring a Director ...
Apollo.io
Location:
Canada; United States
Salary:
300000.00 - 450000.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 10–15+ years in software engineering, with significant leadership experience owning AI/ML or applied LLM systems at scale
  • Proven history shipping LLM-powered features, agentic workflows, or AI assistants used by real customers in production
  • Deep understanding of LLM orchestration frameworks (LangChain, LlamaIndex), RAG pipelines, vector search, embeddings, and prompt engineering
  • Expert in backend & distributed systems (Python strongly preferred) and cloud infrastructure (AWS/GCP)
  • Strong experience with telemetry, observability, and cost-aware real-time inference optimizations
  • Demonstrated ability to lead senior engineers, define technical roadmaps, and deliver outcomes aligned to business metrics
  • Experience building or scaling teams working on experimentation, optimization, personalization, or ML-powered growth systems
  • Exceptional ability to simplify complex problems, set clear standards, and drive alignment across Product, Data, Design, and Engineering
  • Strong product sense: able to weigh novelty against impact, focus on user value, and prioritize speed with guardrails
  • Fluent in integrating AI tools into engineering workflows for code generation, debugging, delivery velocity, and operational efficiency
Job Responsibility:
  • Define the multi-year technical vision for Apollo’s AI stack, spanning agents, orchestration, inference, retrieval, and platformization
  • Prioritize high-impact AI investments by partnering with Product, Design, Research, and Data leaders to align engineering outcomes with business goals
  • Establish technical standards, evaluation criteria, and success metrics for every AI-powered feature shipped
  • Lead the architecture and deployment of long-horizon autonomous agents, multi-agent workflows, and API-driven orchestration frameworks
  • Build reusable, scalable agentic components that power GTM workflows like research, enrichment, sequencing, lead scoring, routing, and personalization
  • Own the evolution of Apollo’s internal LLM platform for high-scale, low-latency, cost-optimized inference
  • Oversee model-driven experiences for natural-language interfaces, RAG pipelines, semantic search, personalized recommendations, and email intelligence
  • Partner with Product & Design to build intuitive conversational UX that hides underlying complexity while elevating user productivity
  • Implement rigorous evaluation frameworks, including offline benchmarking, human-in-the-loop review, and online A/B experimentation
  • Ensure robust observability, monitoring, and safety guardrails for all AI systems in production
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA
Employment Type:
Full-time

AI Platform Lead

A rapidly growing tech company is seeking an AI Platform Lead to architect and s...
Formula Recruitment
Location:
United Kingdom, London
Salary:
100000.00 GBP / Year
Expiration Date:
Until further notice
Requirements:
  • 5+ years’ experience building AI/ML platforms, LLM services, or ML infrastructure
  • Hands-on expertise with modern LLM frameworks—RAG, agents, prompt engineering, fine-tuning
  • Strong systems thinking and experience integrating AI into complex platforms
  • Product awareness and the ability to work across teams in a fast-moving environment
Job Responsibility:
  • Architect and scale the core infrastructure powering their AI-driven modelling platform
  • Blend deep ML infrastructure expertise with strong systems thinking and product intuition
  • Work closely with engineering, product, and domain specialists to define how AI is integrated into the platform
  • Build LLM-backed agents, features, and developer-facing tools that deliver real value to end users
What we offer:
  • Bonus
  • Flexible hybrid working from a London office
  • Generous time off and modern workspace perks
  • Pension contributions
  • Equity options and the chance to grow with the company
Employment Type:
Full-time

Software Engineer - AI & Marketplace

Full-time Software Engineer – AI & Marketplace Innovation position available wit...
11 Recruitment
Location:
Australia, Sydney
Salary:
140000.00 - 150000.00 AUD / Year
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Software Engineering, or related field
  • Minimum of 3 years’ professional experience in software engineering
  • Strong proficiency in Python, TypeScript, Flutter, or related modern stacks
  • Experience managing cloud infrastructure (AWS, Firebase, etc.)
  • Ability to take ownership of a codebase and lead technical decisions
  • Excellent collaboration and communication skills
Job Responsibility:
  • Develop and maintain core systems across app, web, and backend platforms
  • Own and manage CI/CD pipelines, cloud infrastructure, and production reliability (AWS/Firebase)
  • Work on Verity AI’s backend (Python/FastAPI) and integrate with Shopify and custom APIs
  • Collaborate closely with the CEO and founding team in an agile workflow
  • Lead product releases, feature upgrades, and technical strategy
  • Ensure security, data compliance, and minimal platform downtime (SLA-based)
What we offer:
  • Superannuation paid on top of salary
Employment Type:
Full-time

Gen AI Tech Lead

We are seeking a dynamic and innovative Gen AI Lead to spearhead the development...
Citi
Location:
United States, Atlanta, Georgia; Tampa, Florida
Salary:
158400.00 - 237600.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • Large Language Models (LLMs) & Fine-Tuning: Deep knowledge of LLMs and advanced fine-tuning techniques
  • Model Optimization: Expertise in model compression and quantization methods
  • Prompt Engineering: Adept at prompt engineering
  • Retrieval-Augmented Generation (RAG): Advanced knowledge of RAG techniques
  • Machine Learning Frameworks and Cloud Computing: Proficiency in TensorFlow, PyTorch, and high-level APIs like Keras
  • Natural Language Processing (NLP) and AI Deployment: Advanced NLP skills
  • Data Science, Engineering, and API Development: Strong proficiency in data preprocessing
  • Generative AI Tools & Platforms: Experienced with cutting-edge generative AI tools
  • AI Compliance & Guardrails: Knowledge of AI compliance frameworks
  • Team Leadership: Proven ability to build, lead, and develop high-performing AI teams
Job Responsibility:
  • Lead Generative AI Strategy: Define and implement a comprehensive AI strategy
  • Build & Lead a High-Performing Team: Hire, mentor, and manage a team of AI specialists
  • Drive AI Innovation: Collaborate with internal stakeholders to identify business challenges
  • AI System Development: Oversee the design, development, and deployment of AI models
  • Cross-functional Collaboration: Work closely with the Data Mesh, Cloud Architecture, and broader tech teams
  • Stay Current on AI Trends: Continuously monitor the latest AI trends
  • Ensure Ethical AI Use: Ensure all AI initiatives comply with data privacy
What we offer:
  • medical, dental & vision coverage
  • 401(k)
  • life, accident, and disability insurance
  • wellness programs
  • paid time off packages
  • planned time off (vacation)
  • unplanned time off (sick leave)
  • paid holidays
Employment Type:
Full-time