As an ML Team Lead, you will be responsible for leading the technical direction of AI/ML initiatives while remaining deeply hands-on in building scalable, production-grade ML and LLM systems. This is a highly technical leadership role: approximately 70% of your time will be spent designing systems, writing production code, conducting design and code reviews, debugging production issues, and shipping ML/AI solutions, and the remaining 30% will focus on technical leadership, mentoring engineers, and collaborating with product and engineering leadership on roadmap planning. You will play a critical role in architecting and scaling AI/ML systems, including classical machine learning models, Retrieval-Augmented Generation (RAG) pipelines, multi-step agentic systems, and LLM-powered applications. You will also establish engineering standards around scalability, security, resilience, observability, cost optimization, and production readiness for all AI systems built by the team.
Job Responsibilities:
Lead the technical direction for the team’s ML and LLM systems, including architecture patterns, platform choices, evaluation frameworks, and engineering standards
Stay hands-on by designing and implementing complex ML and agentic AI systems, writing production-grade code, and leading through technical execution
Design, develop, and deploy scalable ML and LLM-powered applications and services in production environments
Build and optimize AI-powered solutions such as RAG systems, multi-step agents, AI assistants, chatbots, forecasting systems, ranking models, classification models, and optimization systems
Drive architecture and design reviews to ensure scalability, reliability, security, and maintainability of AI/ML systems
Own the technical roadmap for ML/LLM initiatives and translate business objectives into execution plans and scalable solutions
Collaborate closely with Product Managers, Engineers, Data Engineers, MLOps Engineers, QA Engineers, and cross-functional stakeholders to deliver business-aligned AI solutions
Establish engineering best practices for prompt engineering, model evaluation, regression testing, observability, and production readiness
Define and implement quality standards, evaluation suites, acceptance metrics, and regression plans for all AI/ML features
Ensure high availability, scalability, and resilience of tier-1 ML services through SLOs, monitoring, incident response, failover strategies, circuit breakers, and multi-zone deployments
Drive security and safety best practices for AI systems, including handling of untrusted inputs, prompt injection prevention, secure tool usage, PII protection, and internal security reviews
Optimize infrastructure, token, GPU, and operational costs by making cost efficiency a first-class design consideration
Mentor and grow ML, GenAI, Data, MLOps, and QA engineers through technical guidance, design discussions, code reviews, pairing sessions, and growth planning
Partner with the MLOps team on platform strategy and collaborate with Data Engineering teams to build scalable data foundations for ML systems
Participate in architecture councils, product reviews, engineering planning sessions, and security reviews as the representative of the engineering team
Support hiring efforts by interviewing, calibrating, onboarding, and mentoring new engineers
Stay updated with emerging trends and advancements in AI/ML, LLMs, cloud platforms, MLOps, and agentic AI frameworks
Requirements:
Bachelor's or Master's degree in Computer Science, Artificial Intelligence, Data Science, Software Engineering, or a related field
7+ years of professional software engineering experience with at least 5 years of hands-on experience building and deploying ML systems into production
Prior experience as a Tech Lead, Staff Engineer, or hands-on lead for AI/ML engineering teams
Strong expertise in classical machine learning domains such as forecasting, ranking, classification, and optimization
Hands-on experience building modern LLM and agentic AI systems including RAG pipelines, tool-using agents, multi-step workflows, and evaluation systems
Strong proficiency in Python and backend system development
Experience with ML frameworks such as PyTorch or TensorFlow
Strong understanding of scalable distributed systems, APIs, system integration, architecture design, and production engineering practices
Experience operating ML services at scale, including SLO management, monitoring, on-call practices, and incident response
Experience working with Kubernetes-based deployments, CI/CD pipelines, and modern cloud-native engineering practices
Production experience with cloud platforms such as Azure, AWS, or GCP, preferably Azure and AKS
Hands-on experience with vector databases, semantic search systems, and LLM evaluation methodologies
Experience with prompt engineering, model evaluation, regression testing, and AI system optimization
Strong understanding of system reliability, resiliency patterns, observability, scalability, and performance optimization
Experience working with modern data warehouses and distributed data systems
Strong leadership, mentoring, stakeholder management, analytical, and problem-solving skills
Excellent written and verbal communication skills in English
Ability to collaborate effectively with technical and non-technical stakeholders across distributed teams
Nice to have:
Experience in supply chain, logistics, warehousing, or e-commerce domains
Experience scaling AI/ML engineering teams and mentoring engineers across multiple disciplines
Experience with Azure AI Foundry, Amazon Bedrock / AgentCore, Amazon SageMaker, or Vertex AI
Experience driving security and compliance reviews for AI/ML systems
Open-source contributions, conference speaking engagements, or published technical writing