The Platform Infrastructure team at iCapital plays a critical role in ensuring that both production and development environments operate smoothly, securely, and reliably. This role leverages advanced cloud capabilities to support the Platform Infrastructure strategy of market agility and lean operating principles, with a strong emphasis on quality to meet the ever‑growing demands of our clients. As a Platform Engineer, you will wear multiple hats in a highly visible role, partnering closely with engineering, security, data, and business teams to deliver secure, reliable, and highly automated platforms that support both application and machine‑learning workloads.
Job Responsibilities:
Design, build, and operate MLOps pipelines supporting the full ML lifecycle (training, validation, deployment, monitoring)
Enable production workloads for AI/ML and Generative AI systems, including LLM‑based services
Develop and maintain CI/CD pipelines for AI/ML services and supporting infrastructure
Build and manage cloud‑native infrastructure on AWS, with heavy use of Kubernetes and containerized workloads
Automate infrastructure provisioning and configuration using Infrastructure as Code (Terraform)
Implement model versioning, experiment tracking, and artifact management across environments
Ensure reliability, scalability, observability, and cost efficiency of AI platforms
Partner with AI/ML engineers to operationalize models and standardize deployment patterns
Implement monitoring and alerting for system health, model performance, and drift
Enforce security, compliance, and governance requirements for AI workloads
Participate in incident response, root cause analysis, and continuous improvement initiatives
Document standards, best practices, and reference architectures for MLOps and AI infrastructure
Requirements:
15+ years of experience in DevOps, SRE, or Platform Engineering, with AWS as the primary cloud platform
Experience supporting machine learning systems in production, including deployment and monitoring concerns
Hands‑on experience with machine learning platforms, particularly AWS SageMaker (required)
Strong hands-on experience with Kubernetes, containerized workloads, and cloud networking
Proven experience building and operating CI/CD pipelines (e.g., GitLab CI, ArgoCD)
Strong proficiency with Terraform and scripting/programming in Python or similar languages
Solid Linux, systems, and troubleshooting fundamentals
Excellent communication skills and ability to work across teams
Direct experience with MLOps platforms and tooling (model registries, experiment tracking, feature stores)
Exposure to Generative AI / LLM workloads in production environments
Familiarity with data stores commonly used in ML systems (e.g., Postgres, DynamoDB, object storage)
Experience operating in regulated or fintech environments
Background in cost optimization for compute‑intensive workloads
Strong written and verbal communication skills
Nice to have:
AWS certifications
What we offer:
Equity for all full-time employees
Annual performance bonus
Employer-matched retirement plan
Generously subsidized healthcare with 100% employer-paid dental, vision, telemedicine, and virtual mental health counseling