
Sr. Cloud Infrastructure Engineer (AI & LLM Platforms)

Job Posted May 14, 2026

Job Description

We are seeking a specialized Infrastructure Engineer to bridge the gap between our large data repositories, Cloud Platform and the rapidly evolving world of Large Language Models (LLMs). You will be responsible for building the 'plumbing' that allows our internal teams and external users to leverage AI effectively. This includes deploying Model Context Protocol (MCP) servers, building agentic execution environments, and scaling our internal Retrieval-Augmented Generation (RAG) architecture.

Job Responsibility

  • Guide the architecture that will allow us to leverage AI tools with our large existing data stores and incoming streams of real-time intelligence
  • Work closely with other infrastructure engineers and software development teams to integrate AI tools into existing systems
  • Design, deploy, and maintain Model Context Protocol (MCP) servers to allow LLMs to securely interact with our internal databases, APIs, and external tooling
  • Build and orchestrate sandboxed, scalable environments (e.g., using Docker or specialized runtimes) where users can safely build and execute AI agents
  • Develop and manage the infrastructure for our internal RAG (Retrieval-Augmented Generation) pipeline, including vector database management (e.g., Pinecone, Weaviate, or pgvector) and automated embedding pipelines
  • Utilize Kubernetes (K8s) and Infrastructure as Code (Terraform/Pulumi) to deploy LLM-related tools, ensuring high availability and low latency for model inference and data retrieval
  • Implement strict guardrails for data privacy within LLM workflows, ensuring internal datasets remain secure while being accessible to authorized AI tools

Requirements

  • 5+ years of experience in DevOps, Platform Engineering, or SRE, with at least 1-2 years specifically focused on AI/ML infrastructure
  • Proven track record of building production-grade RAG pipelines or LLM-integrated applications
  • Thrives in 'day zero' environments where the tools and protocols (like MCP) are evolving weekly
  • Deep understanding of the security implications of LLMs (prompt injection, data leakage, and secure tool execution)
  • Experience working with substantial datasets (over 1 billion objects, dozens or hundreds of TBs) and the challenges of leveraging AI tools with these datasets
  • Bachelor's degree or equivalent in computer science or related field
  • Cloud & Orchestration: AWS/GCP/Azure, Kubernetes, Terraform, Helm
  • AI Frameworks: LangChain, LlamaIndex, LangGraph
  • Data & Vectors: Pinecone, Milvus, Qdrant, or pgvector
  • Apache Kafka/Pulsar
  • Elasticsearch/OpenSearch
  • Traditional SQL RDBMS
  • Languages: Python (Expert), TypeScript/Node.js (for MCP development), Go
  • AI Protocols: Model Context Protocol (MCP), REST/gRPC

What we offer

We offer a competitive compensation package and comprehensive benefits.


Similar Jobs for Sr. Cloud Infrastructure Engineer (AI & LLM Platforms)

8 matching positions

Sr. Forward Deployed AI Engineer

Location: United Kingdom, London
Salary: Not provided
Smartsheet
Expiration Date: Until further notice
Requirements
  • 6–10+ years production software engineering, including 3+ years deploying AI/ML to production. Deep Python; strong TypeScript/JavaScript.
  • Proven track record leading complex enterprise technical engagements. You’ve managed customer relationships, navigated security reviews, and delivered production systems organisations depend on.
  • Deep production LLM experience: multi-agent orchestration, MCP/tool-use, RAG optimisation, evaluation frameworks, agent deployment at scale.
  • System architecture expertise spanning enterprise systems, cloud infrastructure, and AI services explainable to a CIO and a junior engineer.
  • Demonstrated ability to create reusable technical assets that others use in production. You think in platforms, not projects.
  • Strong written and verbal communication. Architecture documents, runbooks, strategic memos, training materials. Your Deployment Kits are your portfolio.
  • You leave teams better than you found them. You make sure the people you work with understand what was built, why it works, and how to own it without you.
  • Ability to travel. This role requires domestic and international travel approximately 25–50% of the time.
  • Eligible to work in the UK on an ongoing basis.
Job Responsibility
  • Lead complex, multi-system AI deployments end-to-end: scope, architect, build, validate, and manage the customer relationship throughout.
  • Own the AI workshop program for your pod: customize modules per customer, lead technical sessions, translate outputs into production requirements, and evolve content from field learning.
  • Architect multi-agent solutions selecting the right coordination pattern for each customer’s workflow characteristics and compliance requirements.
  • Design client-specific and industry-specific MCP resource packs that serve personalized intelligence from the server so every connected AI surface gets smarter for that customer automatically.
  • Own Deployment Kit quality for your pod. If a kit is not documented well enough for a solutions consultant with no engineering background to follow, it isn’t done.
  • Lead Solutions Enablement Sprints: transfer AI deployment patterns to solutions consultants and partners with training materials and certification criteria.
  • Drive the intelligence loop: author strategic memos, present field findings weekly, contribute Agent Bricks and Skills, file RFCs for platform improvements.
  • Mentor junior FDEs on customer engagements, code reviews, and Deployment Kit quality.
  • Full-time

Sr Cloud Solution Architect

As a Cloud Solution Architect aligned to the Azure AI platform for Microsoft's C...
Location: United States, Multiple Locations
Salary: 106400.00 - 203600.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Information Technology, Engineering, Business or related field AND 4+ years’ experience in cloud/infrastructure technologies, information technology (IT) consulting/support, systems administration, network operations, software development/support, technology solutions, practice development, architecture, and/or Business Applications consulting OR equivalent experience
  • Bachelor's Degree in Computer Science, Information Technology, Engineering, Business, Liberal Arts, or related field AND 8+ years experience in cloud/infrastructure technologies, information technology (IT) consulting/support, systems administration, network operations, software development/support, technology solutions, practice development, architecture, and/or consulting OR Master's Degree in Computer Science, Information Technology, Engineering, Business, Liberal Arts, or related field AND 6+ years experience in cloud/infrastructure technologies, technology solutions, practice development, architecture, and/or consulting OR equivalent experience
  • 4+ years experience working in a customer-facing role (e.g., internal and/or external)
  • 4+ years experience working on technical projects
  • Technical Certification in Cloud (e.g., Azure, Amazon Web Services, Google, security certifications)
  • Breadth of technical experience and knowledge in foundational security, foundational AI, and architecture design, with depth or subject matter expertise in one or more of the following: Deep Domain Expertise in Azure AI Areas: deep domain expertise in one of the Azure AI-specific areas, such as Cognitive Services, Azure OpenAI, and Copilot, OR hands-on experience working with the respective products at the expert level
  • Expertise with Azure AI Search and/or Vector Indexes, Azure Document Processing and/or equivalent OCR technology
  • Programming Languages and Integration: Proficient with Python, C#, R, JavaScript, or similar programming languages in the context of application development, and ability to integrate Azure AI with other services (e.g., Azure Functions, Azure Container Apps, Docker, API Management)
  • Architecting Enterprise-Grade Solutions: The ability to create and explain 3-tier architecture diagrams, system context diagrams, system interaction diagrams, etc
  • Proven experience building enterprise-grade, AI-focused solutions on the cloud (Azure, AWS, GCP) for customers, from Minimum Viable Products (MVPs) leading to production deployments
Job Responsibility
  • Play a pivotal role in the AI Factory, providing technical enablement, operational support, and strategic engagement across customer projects
  • Understand customers' overall data estate, business priorities, and IT success measures
  • Innovate with AI solutions that drive business value
  • Facilitate scalable delivery through strong technical program management utilizing a factory model/approach, driving program awareness and demand across the regional operating units
  • Attend in-flight project status meetings to monitor progress and identify support needs
  • Engage directly with complex or non-standard customer use cases beyond existing accelerators
  • Participate in intake reviews for milestone sizing, objection handling, and technical scoping
  • Deliver solutions with high performance, security, scalability, maintainability, repeatability, reusability, and reliability upon deployment
  • Gather insights from customers and partners
  • Develop opportunities to enhance Customer Success and help customers extract value from their Microsoft investments
  • Full-time

Sr. Software Engineer

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform ...
Location: United States, San Jose
Salary: 244900.00 - 321100.00 USD / Year
Roku
Expiration Date: Until further notice
Requirements
  • Bachelor's or master's degree in Computer Science, Computer Engineering, Electrical Engineering, Data Science, or a related technical field
  • 2+ years of experience in software engineering, AI/ML engineering, backend development, or adjacent domains, with strong software engineering fundamentals and the ability to build production-grade systems
  • Strong proficiency in Python, plus experience with C/C++ or another systems language
  • Hands-on experience with LLM-based systems, including prompt design, retrieval, tool use, memory handling, and agent orchestration patterns
  • Experience building and maintaining RAG pipelines, agent frameworks, MCP servers or equivalent function-calling architectures, and conversational interfaces
  • Familiarity with cloud platforms, REST APIs, containerization, and modern deployment environments
  • Experience with observability, evaluation, experimentation, and feedback loops for AI systems in production
  • Ability to work independently, manage ambiguity, move quickly, and deliver incrementally in a fast-paced environment
  • Excellent communication skills, sound engineering judgment, and a collaborative working style
Job Responsibility
  • Architect, develop, and deploy AI agents and copilots for Roku TV use cases, integrating them with internal systems, tools, and services
  • Own end-to-end agentic systems from concept to production, including model selection, prompt and context design, retrieval strategies, backend services, and conversational interfaces
  • Design and implement single-agent and multi-agent orchestration patterns, including handoffs, delegation, and cooperative task execution
  • Build scalable RAG and context pipelines that provide high-quality grounding for AI systems and keep them aligned with evolving data sources and business logic
  • Implement tool-calling, function-calling, and MCP-style integrations so agents can safely take actions and interact with the systems around them
  • Create reusable agent templates, modular components, and paved-path patterns that accelerate adoption across teams and use cases
  • Establish strong evaluation, observability, and monitoring for conversation quality, task success rate, latency, cost, and overall system performance
  • Build safeguards that improve production readiness and reliability, including testing pipelines, controlled rollouts, drift detection, and mechanisms that prevent error amplification in multi-step workflows
  • Prototype quickly, run experiments, and translate successful ideas into durable, scalable software solutions
  • Partner closely with engineering, product, QA, infrastructure, and cross-functional teams to deliver meaningful business and customer outcomes
What we offer
  • Health insurance
  • equity awards
  • life insurance
  • disability benefits
  • parental leave
  • wellness benefits
  • paid time off
  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life
  • Full-time

Sr Staff Engineer Software, Fullstack (Prisma AIRS) - NetSec

Join our team building a cutting-edge multi-tenanted GenAI Security Platform tha...
Location: India, Bengaluru
Salary: Not provided
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • Proven experience building and scaling multi-tenant SaaS platforms with strict data isolation
  • Strong knowledge of API design, RESTful principles, and OpenAPI specifications
  • Proficiency in modern JavaScript frameworks (React, Vue, or Svelte) with TypeScript
  • Experience building data-intensive dashboards with complex visualisations and real-time data
  • Strong CSS/styling skills and responsive design principles
  • Demonstrated experience working with production AI/ML systems at scale
  • Practical experience integrating LLM APIs and managing inference at scale
  • Understanding of LLM operational challenges: rate limiting, cost optimisation, latency management, fallback strategies
  • Familiarity with AI agent frameworks (LangChain, AutoGen, MCP, or similar)
  • Knowledge of prompt engineering, semantic search, and vector databases
Job Responsibility
  • Design and implement high-performance REST APIs with enterprise-grade multi-tenant isolation and strict security boundaries
  • Work on distributed systems architecture handling high-throughput workloads with mission-critical uptime requirements
  • Build responsive dashboards and administrative interfaces for platform management, data visualisation, and system configuration
  • Integrate multiple LLM providers, implement semantic search capabilities, and build intelligent agent workflows
  • Architect complex, multi-step AI evaluation pipelines for asynchronous job execution and large-scale data processing
  • Design and implement database schemas with proper indexing, query optimisation, and data isolation strategies
  • Build and maintain scalable microservices with async/await patterns and type-safe code
  • Develop data-intensive UIs with real-time updates, complex state management, and intuitive user experiences
  • Deploy and manage containerised applications on Kubernetes with comprehensive observability
  • Write thorough tests (frontend and backend) and maintain high code quality standards with automated tooling
  • Full-time

Sr. Software Engineer (Agentic Runtime)

Dialpad’s AI Engineering organization is responsible for building and maintainin...
Location: Argentina, Buenos Aires
Salary: Not provided
Dialpad
Expiration Date: Until further notice
Requirements
  • 3–6 years of experience in distributed systems, platform engineering, or ML infrastructure, with exposure to LLM-based or agentic systems strongly preferred
  • Strong understanding of agent architectures, including ReAct, plan-and-execute, and multi-agent coordination patterns
  • Deep knowledge of context management, prompt lifecycle, tool-call protocols (e.g., function calling, MCP), and agent memory strategies (short-term, episodic, and long-term)
  • Experience integrating and managing external tool ecosystems, including web search, code interpreters, databases, and third-party APIs
  • Familiarity with retrieval-augmented generation (RAG) and how retrieval fits into broader agentic pipelines
  • Understanding of LLM output reliability challenges — hallucination, non-determinism, and retry/fallback strategies at runtime
  • Proficiency in Go and Python 3 (experience with Rust or TypeScript is a plus)
  • Strong understanding of distributed systems, microservices, and event-driven architectures suited to long-running agent tasks
  • Passion for real-time performance optimization, including streaming responses, async execution, and parallel tool invocation
  • Experience with API design using OpenAPI, Swagger, or equivalent, with an eye toward agentic interaction patterns
Job Responsibility
  • Contribute to the design, development, and maintenance of agentic runtime systems, including agent orchestration, tool execution pipelines, and multi-step reasoning loops
  • Build and optimize core runtime components, including task planners, action dispatchers, memory managers, and context window management systems
  • Work on agent coordination techniques, including dynamic tool selection, parallel agent execution, state management, and result aggregation across multi-agent workflows
  • Maintain and enhance highly scalable agentic platforms with a focus on low-latency execution, cost efficiency, and deterministic behavior
  • Ensure high availability, reliability, and fault tolerance in agent runtime services, including graceful degradation when LLM or tool calls fail
  • Collaborate with cross-functional teams — including ML researchers, product, and platform engineers — to translate agentic product requirements into robust runtime infrastructure
  • Develop and optimize real-time distributed systems, microservices, and event-driven architectures powering agentic task execution
  • Design and implement sandboxed execution environments for safe agent use of tools, code execution, and external API calls
  • Implement and maintain monitoring, alerting, and performance metrics covering agent run success rates, token consumption, latency, and cost attribution
  • Evaluate and integrate emerging agentic frameworks, LLM APIs, and tooling ecosystems to continuously improve platform capabilities
What we offer
  • Competitive benefits and perks
  • Robust training program
  • Inclusive office environment
  • Recognized Great Place to Work culture

Sr Staff ML Engineer - Production & MLOps Focus - GenAI Security Platform

Join our team building a cutting-edge multi-tenanted GenAI Security Platform tha...
Location: India, Bengaluru
Salary: Not provided
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 4+ years of ML engineering experience with hands-on LLM/NLP work
  • Practical experience building LLM-based applications (agents, multi-turn systems, evaluators)
  • Understanding of model fine-tuning, embedding optimization, and prompt engineering
  • Experience with LLM APIs (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI)
  • Knowledge of LLM orchestration frameworks (LangChain, LlamaIndex, Pydantic AI, custom solutions)
  • Familiarity with model architectures and when to fine-tune vs prompt engineer
  • Strong experience deploying ML models to production at scale
  • Experience with model serving frameworks (vLLM preferred; TensorRT-LLM, Ray Serve, or similar a plus)
  • Kubernetes and Docker proficiency for ML workload orchestration
Job Responsibility
  • Build and deploy LLM-based agents and multi-step evaluation workflows
  • Fine-tune models, optimize embeddings, and manage model weights and artifacts
  • Deploy and scale ML services on Kubernetes with proper monitoring and resource management
  • Implement experiment tracking, model versioning, and deployment automation
  • Develop observability dashboards for ML metrics, costs, latency, and quality
  • Optimize LLM API usage through caching, batching, and intelligent routing strategies
  • Manage vector database infrastructure and semantic search systems
  • Create CI/CD pipelines for ML artifacts and automated testing frameworks
  • Collaborate with ML researchers to productionize prototypes and scale experiments
  • Full-time

Applications Development Sr Programmer Analyst

Location: Canada, Mississauga
Salary: 94300.00 - 141500.00 USD / Year
Citi
Expiration Date: Until further notice
Requirements
  • 8+ years of overall experience in large-scale application development, with recent (mandatory) experience on a platform for the secure and scalable deployment of AI agents into application contexts
  • Minimum of 5+ years of proven experience in a Python and PySpark Engineering lead role focused on building enterprise-grade, high-volume ELT/ETL processes using the PySpark and Databricks ecosystem
  • Hands-on experience with agentic AI development using YAML, JSON, FastAPI or Spring Boot, Google ADK, LLM integrations, including Devin.AI or GitHub Copilot, and integrating models via platforms like MCP using advanced prompt engineering
  • Proven experience developing and automating microservice integrations to support data-intensive applications
  • Proficiency in at least one programming language commonly used for data analytics and engineering, such as Python or Scala
  • Strong SQL skills and experience with various relational databases
  • Deep understanding of data modeling, data warehousing concepts, Data Mesh architecture, and data federation
  • Excellent communication, collaboration, and problem-solving skills
  • Bachelor's degree in Computer Science, Engineering, or a related field
Job Responsibility
  • Design, develop, and maintain scalable, enterprise-grade AI agents, supporting ELT/ETL processes to handle large data volumes using the Python, FastAPI, Microservices, PySpark, Kafka and Databricks ecosystem
  • Build and deploy GenAI agents using Google's ADK and Google Flash 2.5+ LLMs to support application automation, deep insights, and workflow support with a human-in-the-loop (HIL) architecture
  • Build and maintain data federation layers for lambda and Data Mesh architectures using tools like Starburst, with a strategy for adopting AI-based use cases (e.g., machine learning, deep learning, NLP) to drive efficiency
  • Develop, deploy, and automate microservice integrations to support data-intensive applications, ensuring scalability, resilience, and maintainability using cloud-native infrastructure and OpenShift or Kubernetes architecture, including CI/CD pipelines
  • Integrate and leverage agentic AI tools (e.g., Devin.AI, GitHub Copilot) and platforms (e.g., MCP) through advanced prompt engineering to enhance development and operational efficiency
  • Ensure data quality, integrity, and security throughout the entire data lifecycle
  • Contribute to the continuous improvement of data engineering processes, standards, and best practices within the team
  • Appropriately assess risk when business decisions are made, demonstrating consideration for the firm's reputation and safeguarding Citi, its clients, and assets by driving compliance with applicable laws, rules, and regulations. Adhere to Policy, apply sound ethical judgment, and escalate, manage, and report control issues with transparency
  • Full-time

Apps Dev Tech Sr Lead Analyst Java SVP

Apps Dev Tech Sr Lead Analyst Java SVP at Citi. Own and drive end-to-end migrati...
Location: India, Chennai, Pune
Salary: Not provided
Citi
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Information Technology, Engineering, or a related technical discipline
  • 12–16 years of progressive software engineering experience, with at least 5 years in a senior technical leadership role (Tech Lead, Staff Engineer, Principal)
  • Demonstrated experience leading or significantly contributing to at least one large-scale Mainframe modernisation or legacy platform migration programme
  • Prior experience in financial services, banking, or a similarly regulated industry strongly preferred
  • Java 17/21 (Expert), Python, COBOL / JCL (reading & assessment level)
  • Spring Boot, Spring Cloud, Spring Security, Spring Data JPA, Project Reactor / WebFlux
  • OpenShift, Kubernetes, Docker, Helm
  • Tekton, Harness, Jenkins, Git (Bitbucket / GitHub), Artifactory, SonarQube
  • Oracle, MongoDB, PostgreSQL, MS SQL Server, Redis, DB2/z
  • Apache Kafka, IBM MQ
Job Responsibility
  • Own and drive end-to-end migration of legacy Mainframe workloads (COBOL, JCL, CICS, IMS, DB2/z) to modern Java-based microservices deployed on enterprise container platforms (OpenShift / Kubernetes)
  • Conduct application assessments to identify migration candidates, define target-state architectures, and produce sequenced migration roadmaps with risk registers and rollback plans
  • Establish reusable migration patterns, tooling, and runbooks to accelerate successive migration waves
  • Leverage AI-assisted code translation tools (e.g., autonomous AI coding agents such as Devin) to automate COBOL-to-Java conversion at scale, with human-in-the-loop review gates
  • Validate functional parity post-migration through automated testing strategies (unit, integration, regression, performance)
  • Identify and quantify cost-reduction opportunities across MIPS consumption, software licensing, infrastructure footprint, and operational overhead
  • Build and maintain a technology cost model; track savings realisation against committed targets on a monthly cadence
  • Drive rationalisation of redundant systems, decommission end-of-life platforms, and consolidate tooling to reduce Total Cost of Ownership (TCO)
  • Partner with Finance and Vendor Management to renegotiate contracts and optimise spend through right-sizing, reserved capacity, and FinOps practices
  • Full-time