We are seeking a highly skilled and proactive AI Solutions SRE Lead to oversee the maintenance, optimization, and ongoing performance of deployed AI/ML systems and solutions. In this role, you'll act as the bridge between innovation and operations, ensuring our AI solutions consistently deliver value and operate seamlessly in real-world environments. You will lead efforts to monitor deployments, troubleshoot issues, and define best practices for sustaining AI systems throughout their lifecycle.
Job Responsibilities:
Lead the post-deployment lifecycle of AI solutions, ensuring continued functionality, reliability, and scalability
Establish monitoring frameworks to oversee system performance, usage, and metrics for AI/ML models and APIs
Detect anomalies in AI systems, troubleshoot operational issues, and initiate timely corrective actions
Continuously assess and optimize the performance of AI models to maintain efficiency and accuracy in production environments
Collaborate with data scientists and engineers to refine algorithms, retrain models, and update solutions as needed
Implement automation where possible to streamline maintenance processes
Work with cross-functional teams (engineering, product, operations, etc.) to ensure alignment of AI sustainment activities with business goals
Communicate effectively with stakeholders to provide updates on system health, risks, and improvements
Define and implement best practices for sustaining AI solutions, including documentation, testing protocols, and version control
Ensure compliance with ethical AI standards, regulatory guidelines, and established governance frameworks
Manage and mitigate risks associated with model drift, data shifts, and system vulnerabilities
Lead responses to critical incidents involving AI systems by performing root cause analysis and deploying solutions for quick resolution
Advocate for proactive risk prevention and early detection strategies
Mentor and develop junior team members, fostering their skills in AI observability and domain-specific knowledge in ML, Computer Vision, and Generative AI
Requirements:
Bachelor's degree in Computer Science, Engineering, Data Science, or a related field; advanced degree preferred
9+ years of experience in machine learning, data science, or software engineering roles, with significant exposure to Computer Vision and Generative AI projects
4+ years of experience specifically focused on developing and sustaining AI/ML applications and solutions
Strong programming skills in languages such as Python, Java, or Go
Extensive experience with AI/ML frameworks (e.g., TensorFlow, PyTorch, scikit-learn) and cloud platforms (e.g., AWS, Azure, GCP)
Proficiency in data visualization tools and techniques (e.g., Grafana, Tableau, D3.js)
Deep understanding of AI/ML concepts, including model training, evaluation, and deployment, with specific knowledge of Computer Vision and Generative AI techniques
Experience with monitoring and observability tools such as Prometheus, ELK stack, or similar systems
Excellent problem-solving skills and ability to troubleshoot complex AI systems across various domains
Proven track record of mentoring and developing junior team members in AI-related roles
Nice to have:
Experience with MLOps practices and tools, particularly for large-scale AI systems
Familiarity with AI ethics and responsible AI principles, especially as they relate to Generative AI
Knowledge of relevant AI regulations and compliance requirements, including those specific to Computer Vision applications
Experience with distributed systems and large-scale data processing for AI applications
Contributions to open-source projects or research publications on AI solutions at production scale
Previous experience with large-scale AI/ML solutions in production environments
Knowledge of DevOps principles and CI/CD pipelines specific to AI/ML systems