Platform Engineer (AI/LLM Infrastructure)

United States, Santa Clara · Employment contract · 130000.00 - 170000.00 USD / Year · Job Posted May 31, 2026

Job Description

We are currently seeking a Platform Engineer (AI/LLM Infrastructure) to join our team in Santa Clara, California (US-CA), United States (US).

Job Responsibility

  • Lead the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients
  • Act as a hands-on technical lead (player-coach), contributing to development while guiding a team of engineers
  • Own end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security
  • Partner directly with clients and stakeholders to design, present, and deliver robust AI infrastructure solutions
  • Architect and manage production-grade Kubernetes environments (AKS/EKS), including cluster operations and RBAC
  • Design and operationalize RAG pipelines, including ingestion, chunking, embedding workflows, and vector database management
  • Lead GPU infrastructure provisioning and optimization (NVIDIA A100/H100 or similar)
  • Drive Infrastructure-as-Code adoption using Terraform and GitOps practices (ArgoCD/Flux)
  • Build and maintain CI/CD pipelines using GitHub Actions and Azure DevOps
  • Establish observability standards using Datadog, OpenTelemetry, and ELK/OpenSearch
  • Lead incident response, on-call processes, and post-mortem analysis
  • Ensure strong security posture and lead InfoSec review processes
  • Coordinate delivery across multiple teams and client engagements

Requirements

  • 5+ years of experience in Platform Engineering, SRE, or Infrastructure Engineering
  • 3+ years of experience delivering and leading infrastructure for AI/LLM-based production systems
  • 3+ years of experience with Terraform and GitOps (ArgoCD/Flux)
  • 3+ years of experience with Azure (Key Vault, Monitor, DevOps Pipelines)
  • 3+ years of experience with CI/CD and container registry management

Similar Jobs for Platform Engineer (AI/LLM Infrastructure)

8 matching positions

Principal Software Engineer, AI Developer Tools

At Docker, we make app development easier so developers can focus on what matter...
Location: United States, Seattle
Salary: 232000.00 - 319000.00 USD / Year
Company: Docker (docker.com)
Expiration Date: Until further notice
Requirements
  • 10+ years software engineering experience with 3+ years in Staff or Principal Engineer roles
  • Deep expertise in AI/ML technologies with hands-on production experience building LLM-powered applications, AI agents, or AI-assisted developer tools
  • Strong understanding of LLM APIs (OpenAI, Anthropic, etc.), prompt engineering, agent orchestration frameworks, and practical applications of AI in software development workflows
  • Proven track record of architecting and building highly scalable distributed systems and developer-facing platforms
  • Production experience with modern cloud-native infrastructure including Kubernetes, GitOps deployment patterns, observability systems, and CI/CD pipelines
  • Proficiency in Go (preferred), Rust, Java, or Python with strong software engineering fundamentals
  • Experience designing developer tools, platform engineering systems, or internal tools that enable other teams
  • Exceptional product and platform mindset considering business outcomes, developer experience, and technical trade-offs
  • Strong communication skills with ability to influence technical and non-technical stakeholders across the organization
  • Track record of technical mentorship and elevating engineering teams' capabilities
Job Responsibility
  • Define the long-term technical vision and architecture for AI-powered developer tools and the self-service platform that enables teams to build their own AI agents
  • Establish architectural patterns, technical standards, and best practices for LLM integration, AI agent development, and production AI systems serving developers
  • Lead technical strategy for platform capabilities including deployment frameworks (ArgoCD/GitOps), observability integration (Grafana), security controls, and operational tooling for AI developer tools
  • Design highly available, scalable infrastructure for hosting AI agents and developer tools with predictable performance and intelligent resource management
  • Drive technical decisions on AI technology choices, LLM provider strategies, prompt engineering approaches, and agent orchestration frameworks
  • Partner with Senior Manager and product leadership to align technical architecture with business objectives and productization opportunities
  • Architect and build production-ready AI agents for developer productivity including code review assistants, test generators, deployment diagnostics, and incident response automation
  • Design and implement the self-service platform infrastructure that reduces time-to-production for new AI tools from weeks to days
  • Build systems that accelerate adoption of AI-native development tools (Claude Code, Cursor, Warp) across Docker's engineering organization
  • Establish reliability, security, and performance standards for AI systems including SLOs, monitoring, incident response, and cost management
What we offer
  • Freedom & flexibility; fit your work around your life
  • Designated quarterly Whaleness Days plus end of year Whaleness break
  • Home office setup; we want you comfortable while you work
  • 16 weeks of paid Parental leave
  • Technology stipend equivalent to $100 net/month
  • PTO plan that encourages you to take time to do the things you enjoy
  • Training stipend for conferences, courses and classes
  • Equity
  • Full-time

Software Engineer II, AI Developer Tools

At Docker, we make app development easier so developers can focus on what matter...
Location: United States, Seattle
Salary: 128000.00 - 181500.00 USD / Year
Company: Docker (docker.com)
Expiration Date: Until further notice
Requirements
  • 2+ years building backend systems, APIs, or developer-facing tools with strong software engineering fundamentals
  • Proficiency in Go (preferred), Rust, Java, or Python with understanding of data structures, algorithms, and design patterns
  • Basic understanding of AI/ML concepts with eagerness to learn about LLM APIs, prompt engineering, and AI agent development through hands-on work
  • Experience with cloud platforms (AWS, GCP, or Azure) and understanding of distributed systems or microservices
  • Familiarity with CI/CD pipelines, automated testing, version control (Git), and modern development workflows
  • Strong problem-solving skills with ability to work through technical challenges with guidance from senior engineers
  • Good communication skills in remote, asynchronous environments with ability to document technical decisions
  • Collaborative mindset with eagerness to learn from code reviews and feedback
  • Self-motivated with ability to work autonomously while knowing when to ask for help
  • Passion for developer tools and user experience
Job Responsibility
  • Build AI Developer Tool Features: Implement features for AI-powered developer tools such as code review assistants, test generators, deployment diagnostics, and on-call assistance tools
  • Implement LLM Integrations: Build integrations with LLM APIs (OpenAI, Anthropic, etc.), including prompt engineering, response handling, error management, and performance optimization
  • Contribute to Platform Infrastructure: Help build self-service platform capabilities such as deployment pipelines, observability integration, security controls, and operational tooling that enable teams to rapidly deploy AI developer tools
  • Support AI-Native Development Adoption: Contribute to tools and programs that help teams adopt AI developer tools such as Claude Code, Cursor, and Warp across Docker's engineering organization
  • Write Quality Code: Develop well-tested code with unit and integration tests; follow team coding standards and participate actively in code reviews to learn best practices
  • Maintain Production Systems: Assist with monitoring, alerting, and troubleshooting production AI systems; participate in incident response and learn operational best practices
  • Collaborate and Learn: Work closely with Senior Engineers and Principal Engineer on technical designs; ask questions, seek feedback, and continuously improve your skills in AI/LLM technologies and platform engineering
What we offer
  • Freedom & flexibility; fit your work around your life
  • Designated quarterly Whaleness Days plus end of year Whaleness break
  • Home office setup; we want you comfortable while you work
  • 16 weeks of paid Parental leave
  • Technology stipend equivalent to $100 net/month
  • PTO plan that encourages you to take time to do the things you enjoy
  • Training stipend for conferences, courses and classes
  • Equity
  • Full-time

Senior Software Engineer, Frontend Platform

At Vanta, our mission is to help businesses earn and prove trust. We believe tha...
Location: United States
Salary: 179000.00 - 211000.00 USD / Year
Company: Vanta (vanta.com)
Expiration Date: Until further notice
Requirements
  • Have experience building and maintaining software services and infrastructure platforms
  • Have a deep understanding of TypeScript and React
  • Have a deep understanding of writing and testing performant web client code
  • Have experience scaling platform systems
  • Have experience writing codemods and migration automations
  • Demonstrate empathy for the developer experience (DX) and have a strong product sense for developer tooling
  • Open to using AI to amplify their skills and strengthen their work - demonstrating curiosity, a willingness to learn, and sound judgment in applying AI responsibly to improve efficiency and impact
Job Responsibility
  • Lead complex projects with multiple stakeholders and engineers to enable our business and team to scale
  • Evolve our frontend build system: bundlers, static analysis, caching, package management, and monorepo tooling
  • Develop and scale our frontend testing strategy (unit, integration, visual regression, and E2E)
  • Maintain and evolve GraphQL tooling: code patterns, developer tooling, scalability, and reliability
  • Improve web performance and observability: set SLOs, implement monitoring and alerting, and drive remediation
  • Plan and run modernization and migration initiatives leveraging codemods and AI/LLM tooling
  • Work with talented and kind engineers to make a significant impact on our customer base, enabling them to improve their security and prove it
  • Contribute to building Vanta’s engineering culture as we grow
What we offer
  • Offers Equity
  • medical benefits
  • 401(k) plan
  • other company perk programs
  • Comprehensive medical, dental, and vision coverage, with 100% of employee-only benefit premiums covered for most medical plans
  • 16 weeks fully-paid Parental Leave for all new parents
  • Health & wellness stipend
  • Remote workspace, internet, and cellphone stipend
  • Commuter benefits for team members who report to the SF and NYC office
  • Family planning benefits
  • Full-time

LLM & AI DevOps Engineer

Join our team as a DevOps Engineer specializing in Artificial Intelligence (AI) ...
Location: United States, Remote
Salary: Not provided
Company: Robert Half (https://www.roberthalf.com)
Expiration Date: Until further notice
Requirements
  • Proven experience as a DevOps Engineer, preferably supporting AI or machine learning platforms
  • Hands-on expertise with Kubernetes (EKS, AKS, GKE, or on-prem), Docker, Terraform, and Ansible
  • Experience with monitoring/observability tools such as Grafana and Prometheus
  • Familiarity with NVIDIA GPU drivers, CUDA, and hardware provisioning for machine learning tasks
  • Proficiency in at least one scripting language (Python, Bash, etc.)
  • Cloud platform experience (AWS, GCP, Azure); hybrid/on-premise a plus
  • Previous work with MLOps tools and data pipeline automation is highly desirable
  • Bachelor’s degree in Computer Science or related field, or equivalent professional experience
Job Responsibility
  • Build, automate, and manage CI/CD pipelines for deploying and maintaining AI/LLM workloads
  • Collaborate with AI engineers and data scientists to streamline model deployment, versioning, and monitoring
  • Design and maintain cloud infrastructure using Infrastructure as Code (IaC) platforms such as Terraform and Ansible
  • Orchestrate and manage containerized AI environments using Kubernetes
  • Implement robust monitoring and logging solutions utilizing Grafana and Prometheus
  • Optimize AI model inference and training workloads—especially for NVIDIA GPU-powered environments
  • Apply strict security and compliance standards for all infrastructure components
  • Diagnose and resolve production issues, continuously improving reliability and scalability of AI services
What we offer
  • medical
  • vision
  • dental
  • life and disability insurance
  • 401(k) plan

Platform Engineer

We are seeking a highly progressive Platform Engineer specializing in AI infrast...
Location: Canada, Vancouver
Salary: 43.79 - 58.39 USD / Hour
Company: Randstad (https://www.randstad.com)
Expiration Date: July 25, 2026
Requirements
  • 3-5 years of dedicated cloud platform engineering or SRE experience working with high-volume distributed systems natively in AWS and Azure
  • Elite proficiency with Terraform, with an emphasis on creating modular, reusable code structures and multi-environment pipelines
  • Coding proficiency in Python or Go, with a solid history of integrating with complex REST/JSON APIs
  • Strong operational working knowledge of GitLab CI/CD, Docker containerization, and cloud orchestration layers
  • Proven, hands-on exposure to AI/LLM development concepts (advanced prompting, tool/skill integration, and Retrieval-Augmented Generation [RAG])
  • Extensive experience leveraging AI and Agentic Coding tools to accelerate software delivery and maintain platform scripts
Job Responsibility
  • Build integration patterns, API mediation layers, and approval workflows supporting autonomous AI agent tool execution and runtime function calling
  • Integrate advanced distributed telemetry for agent runs (execution traces, evaluation metrics, latency logs, and token cost analytics)
  • Establish runtime safety controls for AI applications, embedding automated rollback scripts, cost control ceilings, and master kill-switches
  • Build and scale highly secure, automated multi-cloud landing zones (AWS and Azure) utilizing reusable Terraform modules
  • Construct and maintain robust GitLab CI/CD pipelines, package registries, and automated infrastructure release strategies
  • Implement strict automated infrastructure guardrails using Open Policy Agent (OPA), Conftest, or Azure Policies to guarantee security without breaking developer velocity
  • Embed least-privileged access, zero-trust network segmentation, private endpoints, KMS encryption keys, and advanced secrets management
  • Champion Site Reliability Engineering standards by managing Service Level Objectives (SLOs), calculating error budgets, configuring autoscaling matrices, and leading chaos engineering simulations
  • Apply cloud financial management protocols (structured resource tagging, budget alarms, anomaly detection, and cluster right-sizing)
  • Author clear, accessible developer guides and self-service templates that streamline the adoption of core AI platform features
What we offer
  • Pioneering Technical Landscape
  • Elite Multi-Cloud Exposure
  • High Extensibility Indicators
  • Premier Workspace
  • Full-time

Principal Software Engineer - Office Suite Shared Experiences (OSSE)

Within the Office Suite Shared Experiences (OSSE) organization, we build largesc...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years experience in experimentation infrastructure, including system design, metrics, analysis, and operational considerations
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Provide end-to-end architectural leadership for large-scale experimentation and experimentation infrastructure used across Office and Copilot
  • Define and evolve long-term technical strategy for experimentation platforms, data pipelines, and shared services, balancing innovation, reliability, cost, and developer productivity
  • Act as a technical authority and advisor across multiple teams, guiding system design decisions and resolving complex, ambiguous technical challenges
  • Lead the design and scaling of experimentation systems, including assignment, metrics, analysis, and insight generation
  • Drive best practices for trustworthy experimentation, including data quality, metric definitions, statistical rigor, and observability
  • Design and operate large-scale data systems leveraging ECS, Kusto, Cosmos DB, and SCOPE to support both real-time and batch analytics
  • Build and evolve high-reliability, multitenant services that are foundational to experimentation, insights, and decision making
  • Ensure systems meet Microsoft-level expectations for availability, performance, security, and operational excellence
  • Serve as a Designated Responsible Individual (DRI) when needed, setting standards for incident response, post-incident learning, and operational maturity
  • Partner across Office, Copilot, and adjacent organizations to align experimentation strategy and infrastructure, reducing fragmentation and duplicative investments
  • Full-time

Senior CyberSecurity Researcher

We are seeking a highly skilled and motivated senior security researcher to join...
Location: France, Paris
Salary: Not provided
Company: GitGuardian (gitguardian.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience working in a security engineer role, with 2+ years dedicated to research-related work, or equivalent
  • Strong offensive security background (pentesting, vulnerability research, or red team experience) with the ability to think like an attacker and translate that into defensive insights
  • Experience with reverse engineering (binary analysis, malware inspection, malicious packages) and API/web security (OAuth, JWT, token validation, secret exposure patterns)
  • Comfortable working with modern infrastructure, such as cloud platforms (AWS, GCP, or Azure) or AI/LLM ecosystems, and able to assess their specific security implications
  • Leverage AI tools actively in your day-to-day research workflow, whether for automation, analysis, or accelerating prototyping
  • Proficient in at least one system or scripting language (Python, Go, or Rust), fluent with a terminal, and able to independently retrieve, transform, and analyze datasets to support research conclusions
  • Track down complex security problems in software and infrastructure and define their solutions
  • Enjoy hacking things and rapidly prototyping ideas
  • Drive research autonomously, identify topics, conduct investigations, and publish findings, while partnering with engineering and product teams to translate insights into platform improvements
  • Public research track record: CVEs, conference presentations, open-source tooling, or technical publications
Job Responsibility
  • Investigate novel and existing tactics to find and abuse exposed credentials
  • Publish findings as authoritative research
  • Analyze ongoing threats and attacks
  • Explore new exploitation techniques
  • Document emerging tactics
  • Collaborate with engineering teams to identify ways to improve products in terms of secret validation and coverage
  • Track offensive trends and techniques
  • Work closely with marketing team to produce 2–3 technical deep-dive articles or talks per quarter
What we offer
  • Package that includes BSPCE
  • Lunch voucher (Swile, 12€ at 50%)
  • Sponsored Wellpass (gymlib)
  • Non-charged health insurance for children (Sidecare / Generali)
  • Up to €300 to improve your home office set-up
  • Yearly holiday allowance
  • Referral bonus of 4000€ for any new Guardian we might hire thanks to you
  • Team building: monthly budget dedicated to each employee that you can spend as you wish, with colleagues (latest examples to date: Michelin star restaurant, karaoke, stand-up show, kitesurfing week-end)
  • Remote policy: hybrid (3 days/week at the office in Paris)
  • Opportunities for career development in the long term
  • Full-time

Senior Backend Software Engineer

The Coaching team builds Highspot’s personalized, AI-enhanced coaching capabilit...
Location: Canada, Vancouver
Salary: 146000.00 - 178000.00 CAD / Year
Company: Highspot (highspot.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science or equivalent practical experience
  • 5+ years of experience in back-end software development building and maintaining large-scale web applications
  • At least 3 years of experience working with object-oriented programming languages; Ruby and Python preferred
  • Experience architecting, building, and deploying mid-to-large scale web applications in a distributed environment
  • Strong understanding of API design, data modeling, and backend scalability
  • Experience integrating or working with AI/LLM platforms such as OpenAI, Anthropic (Claude), or Azure OpenAI
  • Familiarity with AI-powered development tools (e.g., Cursor, GitHub Copilot, Cody, etc.) and a demonstrated ability to incorporate them effectively into day-to-day workflows
  • Deep expertise in web performance, security, and reliability best practices
  • Proven ability to deconstruct complex technical problems and deliver elegant, maintainable solutions
Job Responsibility
  • Design, develop, and maintain high-quality, scalable, and user-centric backend systems using modern technologies
  • Architect and optimize backend infrastructure to power intelligent, AI-driven workflows and Agentic AI integrations
  • Build and maintain integrations with multiple large language models (LLMs) including ChatGPT, Claude, and other OpenAI and Microsoft models
  • Collaborate closely with AI/ML engineers to productionize agentic workflows and autonomous reasoning systems
  • Partner effectively with Product Management and UX Design to translate ideas and research into production-ready, AI-enhanced features
  • Leverage AI-assisted development tools such as Cursor, GitHub Copilot, and other code generation frameworks to accelerate development and improve code quality
  • Lead and mentor engineers through complex projects, emphasizing clean architecture, testing, and software craftsmanship
  • Drive backend infrastructure improvements that enhance reliability, observability, and performance
  • Collaborate cross-functionally to deliver differentiated customer value through AI and data-driven solutions
  • Troubleshoot and resolve critical production issues while contributing to internal documentation and best practices
What we offer
  • Comprehensive medical, dental, vision, disability, and life benefits
  • Group Retirement Savings Plan (RRSP) and matching employer contributions (DPSP) with immediate vesting
  • Flexible PTO
  • Generous Holiday Schedule + 5 Days for Annual Holiday Week
  • Quarterly Recharge Fridays (paid days off for mental health recharge)
  • Flexible work schedules
  • Access to Coaches and Therapists through Modern Health
  • 2 Volunteer days per year
  • Monthly transportation allowance for employees that work in our Vancouver Hub location
  • Employees are eligible to receive stock options
  • Full-time