
AI Platform Engineer, Infrastructure

United States, San Francisco Bay Area · Job Posted February 18, 2026

Job Description

As an Infrastructure Engineer at Brain Co., you will build and scale the core platform that powers AI systems deployed inside some of the world’s most critical institutions — from government to energy and healthcare. Our platform runs in highly variable, high-stakes environments across cloud and on-prem settings, and your work will directly determine our reliability, performance, and ability to deliver for customers.

Job Responsibility

  • Design and scale Kubernetes- and Terraform-based infrastructure across customer environments
  • Define standards for networking, security, CI/CD, and multi-region deployments
  • Build and maintain metrics, logging, tracing, dashboards, and SLOs
  • Diagnose and improve performance across distributed systems and AI workloads
  • Support high-performance inference, data pipelines, and large-scale backend services
  • Ensure systems scale reliably under fast-growing and unpredictable workloads
  • Partner with technical leaders on architecture and mentor engineers

Requirements

  • 5+ years building large-scale backend or infra systems
  • Knowledge of Kubernetes, Terraform, cloud networking, orchestration, and observability tools
  • Strong distributed systems, performance, and incident-response experience
  • Ability to work well with customers and teams to deliver reliable solutions
  • Thrive in high-agency environments and set engineering standards
  • Enjoy ambiguity, autonomy, and solving tough infra problems

What we offer

  • Competitive salary plus equity
  • Daily lunches
  • Commuter benefits
  • 401(k)
  • Medical, Dental and Vision
  • Unlimited PTO

Similar Jobs for AI Platform Engineer, Infrastructure

8 matching positions

Senior AI Infrastructure Engineer - Training Platform

As a Software Engineer on the Machine Learning Infrastructure team, you will bui...
Location: United States, San Francisco; Seattle; New York
Salary: 216000.00 - 270000.00 USD / Year
Company: Scale
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
  • Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
  • Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
  • Experience with distributed training infrastructure, such as EFA, Infiniband, and topology-aware scheduling
  • Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
  • Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments
Job Responsibility
  • Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs
  • Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
  • Work closely with Finance and Procurement teams to drive our capacity planning process
  • Participate in our team's on call process to ensure the availability of our services
  • Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment
What we offer
  • Comprehensive health, dental and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
  • Commuter stipend (may be eligible)
Employment type: Full-time

Senior Software Engineer - Data Platform, AI Infrastructure

We are building a large-scale, productized data platform that powers critical in...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
  • Strong programming experience in Python
  • Experience building and operating large-scale distributed systems
  • Hands-on experience with:
      • Backend services or APIs (e.g., FastAPI, Flask, or similar)
      • Cloud-based infrastructure (Azure, AWS, or GCP)
      • Monitoring and observability systems (metrics, logging, alerting)
  • Experience designing systems with reliability, scalability, and operational clarity in mind
  • Proven ability to own and deliver production systems end-to-end
  • Ability to break down ambiguous problems, ask the right questions, and execute effectively
Job Responsibility
  • Design, build, and operate core components of a distributed data platform, including:
      • Orchestration systems (e.g., Airflow or equivalent)
      • Backend services and APIs (Python/FastAPI or similar)
      • Monitoring, alerting, and reliability systems
  • Own the end-to-end lifecycle of platform components - from design through deployment, scaling, and maintenance
  • Ensure systems meet requirements for availability, performance, and data reliability at large scale
  • Define and enforce standardized patterns for infrastructure, deployment, and observability across the platform
  • Partner with data engineering teams to enable efficient, reliable data processing workflows
  • Diagnose and resolve complex issues in distributed systems, including performance bottlenecks and failure modes
  • Contribute to infrastructure-as-code and deployment systems to support reproducibility and operational excellence
  • Drive continuous improvements in system robustness, cost efficiency, and operational clarity
What we offer
  • Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Employment type: Full-time

Staff Infrastructure Software Engineer - AI Platform

We are currently seeking a Staff Software Engineer to join the AI Platform team ...
Location: United Kingdom, Edinburgh
Salary: Not provided
Company: Addepar
Expiration Date: Until further notice
Requirements
  • Extensive experience as a Software/Backend Engineer, with a track record of taking on increasing responsibility
  • Experience across the full product lifecycle: designing, implementing, shipping, scaling, operationalizing, and maintaining technology/SaaS products
  • Exceptional programming skills and fundamentals in Python/Go/Java, with a proven track record of building large-scale production systems
  • Proficiency with diverse compute environments, including microservices (K8s), Databricks, and serverless architectures (e.g. AWS Lambda)
  • Demonstrable experience leading initiatives with infrastructure-as-code tools such as Terraform in complex, multi-account environments
  • Proficiency with comprehensive monitoring and alerting stacks (e.g. Prometheus/Grafana/Sentry/cloud-native tools), with a focus on observability strategy
  • Excellent interpersonal and communication skills to effectively collaborate with multi-functional teams, articulate complex technical concepts, and influence outcomes
Job Responsibility
  • Design and build the production runtime for LLM-based agents and products, creating the services and infrastructure that serve autonomous agents
  • Develop deep application-level knowledge to proactively inform and influence requirements, constraints and best practices for implementing composable, complex AI systems
  • Lead the design, implementation, and automation of production infrastructure on a variety of cloud environments (Kubernetes/Databricks), to enable us to ship and scale AI features instantly
  • Evangelize and promote disciplined engineering best practices to enforce strong production hygiene and culture
  • Initiate and lead collaborations with cross-functional teams to identify and resolve complex application or infrastructure issues, serving as a technical subject matter expert
  • Architect, build, and maintain advanced, automated CI/CD pipelines e.g. using Jenkins, ArgoCD, AWS CodeBuild/Pipeline, GitHub Actions, or similar, establishing best practices for deployment strategies (e.g., blue/green, canary)
  • Develop systems and best practices for monitoring, alerting, and troubleshooting our probabilistic and AI-driven systems and broader software stack

AI Infrastructure Engineer, Model Serving Platform

As a Software Engineer on the ML Infrastructure team, you will design and build ...
Location: United States, San Francisco; New York
Salary: 179400.00 - 224250.00 USD / Year
Company: Scale
Expiration Date: Until further notice
Requirements
  • 4+ years of experience building large-scale, high-performance backend systems
  • Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++)
  • Experience with LLM serving and routing fundamentals (e.g. rate limiting, token streaming, load balancing, budgets, etc.)
  • Experience with LLM capabilities and concepts such as reasoning, tool calling, prompt templates, etc.
  • Experience with containers and orchestration tools (e.g., Docker, Kubernetes)
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments
Job Responsibility
  • Build and maintain fault-tolerant, high-performance systems for serving LLM workloads at scale
  • Build an internal platform to empower LLM capability discovery
  • Collaborate with researchers and engineers to integrate and optimize models for production and research use cases
  • Conduct architecture and design reviews to uphold best practices in system design and scalability
  • Develop monitoring and observability solutions to ensure system health and performance
  • Lead projects end-to-end, from requirements gathering to implementation, in a cross-functional environment
What we offer
  • Comprehensive health, dental and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
Employment type: Full-time

Senior ML Platform Engineer, AI Platform

We are seeking a skilled and passionate ML Platform Engineer to join our team an...
Location: Singapore, Singapore
Salary: Not provided
Company: Airwallex
Expiration Date: Until further notice
Requirements
  • 5+ years in backend software development
  • At least 2 years focused on AI/ML platform or MLOps infrastructure
  • Deep expertise in MLOps practices, including automated deployment pipelines, model optimization, and production lifecycle management
  • Proven experience designing and implementing low-latency model serving solutions
  • Proficiency in Python
  • Skill in writing high-quality, maintainable code
  • Experience designing and developing large-scale distributed systems with high concurrency, low-latency inference, and high availability
  • Excellent communication and mentoring abilities
  • A relevant degree in Computer Science, Mathematics, or a related field
Job Responsibility
  • Platform Development: Design, build, and maintain the end-to-end MLOps platform using Kubernetes and Cloud Services
  • Infrastructure as Code (IaC): Use Terraform or similar tools to manage, provision, and scale all ML-related infrastructure securely and efficiently
  • Pipeline Automation: Implement and optimize CI/CD/CT (Continuous Integration, Delivery, Training) pipelines to automate model training, testing, packaging, and deployment using tools like Argo and Kubeflow Pipelines
  • Serving Infrastructure: Build highly available, low-latency, and high-throughput model serving infrastructure
  • Observability: Implement robust monitoring, alerting, and logging solutions to track infrastructure health, model performance, and data/model drift
  • Tooling & Support: Evaluate, integrate, and support ML tools such as Feature Stores and distributed model training pipelines
  • Security & Compliance: Ensure platform security, implement RBAC (Role-Based Access Control), and manage secrets for sensitive data and production environments
  • Collaboration: Work closely with Data Scientists and ML Engineers to understand their needs and provide technical guidance on best practices for scaling their models
Employment type: Full-time

Staff Software Engineer, Managed AI - AI Platform

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'...
Location: United States, San Francisco, CA; Sunnyvale, CA
Salary: 208725.00 - 253000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Advanced degree in Computer Science/Engineering
  • 8-10+ years of industry experience with a demonstrated history of consistent success leading a varied portfolio of initiatives across your function
  • Experience with distributed systems, cloud services (compute, storage, networking, database), and delivering early-stage projects quickly
  • Experience with generative AI (LLMs, multimodal) and familiarity with AI infrastructure (training, inference, ETL pipelines)
  • Proficiency with container orchestration (e.g., Kubernetes), microservices, REST APIs, gRPC, and the full software development lifecycle, including CI/CD
Job Responsibility
  • Lead the design and implementation of core AI services, including:
      • Resilient fault-tolerant queues for efficient task distribution
      • Model catalogs for managing and versioning AI models
      • Scheduling mechanisms optimized for cost and performance
  • Architect and scale infrastructure to handle millions of API requests per second
  • Implement robust monitoring and alerting to ensure system health and 24/7 availability
  • Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap
  • Influence the long-term vision and architectural decisions of the platform
  • Contribute to open-source AI frameworks and actively participate in the AI community
  • Prototype and rapidly iterate on emerging technologies and new features
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment type: Full-time

Senior Software Engineer, Managed AI - AI Platform

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'...
Location: United States, San Francisco, CA; Sunnyvale, CA
Salary: 172425.00 - 209000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Advanced degree in Computer Science/Engineering
  • 4-5+ years of industry experience with a demonstrated history of consistent success leading a varied portfolio of initiatives across your function
  • Experience with distributed systems, cloud services (compute, storage, networking, database), and delivering early-stage projects quickly
  • Experience with generative AI (LLMs, multimodal) and familiarity with AI infrastructure (training, inference, ETL pipelines)
  • Proficiency with container orchestration (e.g., Kubernetes), microservices, REST APIs, gRPC, and the full software development lifecycle, including CI/CD
Job Responsibility
  • Lead the design and implementation of core AI services, including:
      • Resilient fault-tolerant queues for efficient task distribution
      • Model catalogs for managing and versioning AI models
      • Scheduling mechanisms optimized for cost and performance
  • Architect and scale infrastructure to handle millions of API requests per second
  • Implement robust monitoring and alerting to ensure system health and 24/7 availability
  • Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap
  • Influence the long-term vision and architectural decisions of the platform
  • Contribute to open-source AI frameworks and actively participate in the AI community
  • Prototype and rapidly iterate on emerging technologies and new features
What we offer
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment type: Full-time

Senior Platform Engineer - CI/CD & AI Automation (AI-first)

Groupon is undergoing a critical platform transformation, modernizing its core d...
Location: Czechia, Prague
Salary: Not provided
Company: Groupon
Expiration Date: Until further notice
Requirements
  • 5+ years of dedicated experience in Platform Engineering, DevOps, or Infrastructure roles
  • Deep expertise building, scaling, and migrating CI/CD systems, with strong practical experience in Jenkins and/or GitHub Actions
  • Expertise in scripting and automation (Python, Go, or Bash)
  • Solid understanding of container technologies, Kubernetes, and cloud build systems
  • Proven experience leveraging AI tooling (e.g., Claude Code, code analysis) to meaningfully increase developer output and optimize platform work
  • Excellent communication and ability to drive technical decisions across multiple platform and product teams
Job Responsibility
  • Platform Transformation: Lead the design, planning, and execution of the Jenkins-to-GitHub Actions migration across a large portfolio of microservices
  • Pipeline Engineering: Design and optimize high-performance, secure, and observable CI/CD workflows across GitHub Actions, Jenkins, and Kubernetes environments
  • AI-First Automation: Drive an AI-First workflow by leveraging tools (e.g., Copilot, code generation) to eliminate infrastructure toil, accelerate development, and analyze pipeline failures
  • Core Automation: Develop robust platform automation (e.g., Python, Go, Bash) to improve build efficiency, artifact caching, reliability, and repository hygiene
  • Security & Compliance: Harden CI/CD infrastructure with robust controls for secrets management, RBAC, audit logging, and secure runner design
  • Observability: Implement and enhance CI/CD observability using tools like Prometheus, Grafana, and OpenTelemetry to provide deep insights into performance and reliability
  • Technical Leadership: Mentor engineers and partner across Cloud, Security, and Developer Experience teams to define and evolve our end-to-end delivery platform architecture