We are seeking a highly motivated AI Integration Engineer to join our team and help design, deploy, and maintain the infrastructure that supports artificial intelligence (AI) systems, including Large Language Models (LLMs) and distributed AI workloads. This role is critical to bridging the gap between advanced AI models, compute infrastructure, and operational workflows. You will be responsible for managing AI readiness by architecting scalable infrastructure solutions, integrating complex systems, and maintaining operational excellence to ensure stable deployments of AI and machine learning applications. The ideal candidate has a strong background in high-performance computing, cloud infrastructure, MLOps or DevOps, and AI ecosystem integration. This is an exciting opportunity to be at the forefront of AI operational infrastructure and contribute to cutting-edge projects.
Job Responsibilities:
Serve as the technical point of contact for integrating LLMs and other AI workloads across infrastructure systems, operational tools, and application pipelines
Architect, deploy, and maintain scalable GPU computing environments and the infrastructure required for autonomous agentic workflows, including persistent state management, long-term memory systems such as vector databases, and multi-step reasoning traces
Develop, manage, and optimize CI/CD pipelines for AI deployments, ensuring smooth transitions from model development to production environments
Oversee network and infrastructure connectivity, ensuring seamless communication between distributed systems, GPUs, virtual machines (VMs), APIs, and Command and Control (C2) tools
Design and secure tool-calling environments where agents interact with external APIs, ensuring strict governance and sandboxing for autonomous actions
Provide diagnostic and troubleshooting expertise for AI systems, monitoring infrastructure to maintain availability, security, and performance benchmarks
Collaborate across engineering, data, and AI teams to align infrastructure solutions with business and operational goals
Requirements:
5+ years of experience in infrastructure engineering or system integration roles
2+ years of experience supporting large-scale AI/ML systems or GPU-centric environments
Experience with cloud platforms such as AWS, Azure, or Google Cloud, and their AI-focused services, including SageMaker, GCP AI Platform, and Azure Machine Learning
Experience with networking concepts and technologies, including TCP/IP, DNS, NGINX, load balancing, and firewalls, as applied to AI model and infrastructure deployments
Experience integrating MLOps pipelines using tools such as MLflow, Kubeflow, TensorFlow Serving, or Vertex AI, including integration of AgentOps frameworks such as LangSmith and Arize Phoenix, to monitor autonomous decision-making paths and agent reasoning traces
Experience with orchestration frameworks for multi-agent systems such as LangGraph, CrewAI, or AutoGen, and managing the stateful databases required to support them, including Redis and Postgres
Experience working with NVIDIA GPU technologies, including CUDA, NCCL, TensorRT, and DGX systems, and container or orchestration tools such as Kubernetes, Docker, Terraform, or Pulumi
Ability to manage and optimize distributed, high-performance computing environments, including clusters of GPUs and cloud-based GPU instances
TS/SCI clearance with a polygraph
Bachelor's degree in Computer Science, Computer Engineering, or Systems Engineering
Nice to have:
Experience with AI/ML frameworks for model training and deployment such as PyTorch, TensorFlow, or Hugging Face Transformers
Experience implementing observability and monitoring systems such as Grafana, Prometheus, and ELK, for AI infrastructure to track performance and operational health
Experience with security practices for AI systems, including encryption, role-based access controls, secure APIs, and compliance frameworks such as SOC 2 and GDPR
Experience with Agentic Safety, including the implementation of Human-in-the-Loop (HITL) approval gateways and automated kill switches for autonomous processes
Experience with Vector Database infrastructure such as Pinecone, Weaviate, or Milvus, and Retrieval-Augmented Generation (RAG) pipelines used to provide agents with contextual memory
Knowledge of distributed computing frameworks such as Ray, Horovod, or Dask, for AI training jobs
Knowledge of AI ethics and operational risk assessments, ensuring deployed systems align with organizational policies and standards
Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) certification
AWS Certified Solutions Architect or similar Cloud Certifications
NVIDIA Certifications such as the NVIDIA Certified Advanced GPU Infrastructure Specialist Certification
What we offer:
Health, life, disability, financial, and retirement benefits