AI Platform Engineer, Infrastructure Job at Brain Co. (San Francisco Bay Area)

Senior AI Infrastructure Engineer - Training Platform

As a Software Engineer on the Machine Learning Infrastructure team, you will bui...

Location

United States, San Francisco; Seattle; New York

Salary:

216000.00 - 270000.00 USD / Year

Scale

Expiration Date

Until further notice

Requirements

5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling
Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
Proven ability to solve complex problems and work independently in fast-moving environments

Job Responsibility

Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
Design and implement scheduling primitives to optimize the lifecycle of training jobs
Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
Work closely with Finance and Procurement teams to drive our capacity planning process
Participate in our team's on-call process to ensure the availability of our services
Own projects end-to-end, from requirements, scoping, and design through implementation, in a highly collaborative and cross-functional environment

What we offer

Comprehensive health, dental and vision coverage
Retirement benefits
A learning and development stipend
Generous PTO
Commuter stipend (may be eligible)

Full-time

Senior Software Engineer - Data Platform, AI Infrastructure

We are building a large-scale, productized data platform that powers critical in...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Strong programming experience in Python
Experience building and operating large-scale distributed systems
Hands-on experience with backend services or APIs (e.g., FastAPI, Flask, or similar), cloud-based infrastructure (Azure, AWS, or GCP), and monitoring and observability systems (metrics, logging, alerting)
Experience designing systems with reliability, scalability, and operational clarity in mind
Proven ability to own and deliver production systems end-to-end
Ability to break down ambiguous problems, ask the right questions, and execute effectively

Job Responsibility

Design, build, and operate core components of a distributed data platform, including orchestration systems (e.g., Airflow or equivalent); backend services and APIs (Python/FastAPI or similar); and monitoring, alerting, and reliability systems
Own the end-to-end lifecycle of platform components - from design through deployment, scaling, and maintenance
Ensure systems meet requirements for availability, performance, and data reliability at large scale
Define and enforce standardized patterns for infrastructure, deployment, and observability across the platform
Partner with data engineering teams to enable efficient, reliable data processing workflows
Diagnose and resolve complex issues in distributed systems, including performance bottlenecks and failure modes
Contribute to infrastructure-as-code and deployment systems to support reproducibility and operational excellence
Drive continuous improvements in system robustness, cost efficiency, and operational clarity

What we offer

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Full-time

Staff Infrastructure Software Engineer - AI Platform

We are currently seeking a Staff Software Engineer to join the AI Platform team ...

Location

United Kingdom, Edinburgh

Salary:

Not provided

Addepar

Expiration Date

Until further notice

Requirements

Extensive experience as a Software/Backend Engineer, with a track record of taking on increasing responsibility
Experience across the full product lifecycle: designing, implementing, shipping, scaling, operationalizing, and maintaining technology/SaaS products
Exceptional programming skills and fundamentals in Python/Go/Java, with a proven track record of building large-scale production systems
Proficiency with diverse compute environments, including microservices (K8s), Databricks, and serverless architectures (e.g., AWS Lambda)
Demonstrable experience leading initiatives with infrastructure-as-code tools such as Terraform in complex, multi-account environments
Proficiency with comprehensive monitoring and alerting stacks (e.g., Prometheus/Grafana/Sentry/cloud-native tools), with a focus on observability strategy
Excellent interpersonal and communication skills to effectively collaborate with multi-functional teams, articulate complex technical concepts, and influence outcomes

Job Responsibility

Design and build the production runtime for LLM-based agents and products, creating the services and infrastructure that serve autonomous agents
Develop deep application-level knowledge to proactively inform and influence requirements, constraints and best practices for implementing composable, complex AI systems
Lead the design, implementation, and automation of production infrastructure on a variety of cloud environments (Kubernetes/Databricks), to enable us to ship and scale AI features instantly
Evangelize disciplined engineering best practices to enforce strong production hygiene and culture
Initiate and lead collaborations with cross-functional teams to identify and resolve complex application or infrastructure issues, serving as a technical subject matter expert
Architect, build, and maintain advanced, automated CI/CD pipelines (e.g., using Jenkins, ArgoCD, AWS CodeBuild/Pipeline, GitHub Actions, or similar), establishing best practices for deployment strategies (e.g., blue/green, canary)
Develop systems and best practices for monitoring, alerting on, and troubleshooting our probabilistic and AI-driven systems and broader software stack

AI Infrastructure Engineer, Model Serving Platform

As a Software Engineer on the ML Infrastructure team, you will design and build ...

Location

United States, San Francisco; New York

Salary:

179400.00 - 224250.00 USD / Year

Scale

Expiration Date

Until further notice

Requirements

4+ years of experience building large-scale, high-performance backend systems
Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++)
Experience with LLM serving and routing fundamentals (e.g., rate limiting, token streaming, load balancing, budgets)
Experience with LLM capabilities and concepts such as reasoning, tool calling, and prompt templates
Experience with containers and orchestration tools (e.g., Docker, Kubernetes)
Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
Proven ability to solve complex problems and work independently in fast-moving environments

Job Responsibility

Build and maintain fault-tolerant, high-performance systems for serving LLM workloads at scale
Build an internal platform to empower LLM capability discovery
Collaborate with researchers and engineers to integrate and optimize models for production and research use cases
Conduct architecture and design reviews to uphold best practices in system design and scalability
Develop monitoring and observability solutions to ensure system health and performance
Lead projects end-to-end, from requirements gathering to implementation, in a cross-functional environment

What we offer

Comprehensive health, dental and vision coverage
Retirement benefits
A learning and development stipend
Generous PTO

Full-time

Senior ML Platform Engineer, AI Platform

We are seeking a skilled and passionate ML Platform Engineer to join our team an...

Location

Singapore, Singapore

Salary:

Not provided

Airwallex

Expiration Date

Until further notice

Requirements

5+ years in backend software development
At least 2 years focused on AI/ML platform or MLOps infrastructure
Deep expertise in MLOps practices, including automated deployment pipelines, model optimization, and production lifecycle management
Proven experience designing and implementing low-latency model serving solutions
Proficiency in Python
Skill in writing high-quality, maintainable code
Experience designing and developing large-scale, distributed, high-concurrency, low-latency, high-availability inference systems
Excellent communication and mentoring abilities
A relevant degree in Computer Science, Mathematics, or a related field

Job Responsibility

Platform Development: Design, build, and maintain the end-to-end MLOps platform using Kubernetes and Cloud Services
Infrastructure as Code (IaC): Use Terraform or similar tools to manage, provision, and scale all ML-related infrastructure securely and efficiently
Pipeline Automation: Implement and optimize CI/CD/CT (Continuous Integration, Delivery, Training) pipelines to automate model training, testing, packaging, and deployment using tools like Argo and Kubeflow Pipelines
Serving Infrastructure: Build highly available, low-latency, and high-throughput model serving infrastructure
Observability: Implement robust monitoring, alerting, and logging solutions to track infrastructure health, model performance, and data/model drift
Tooling & Support: Evaluate, integrate, and support ML tools such as Feature Stores and distributed model training pipelines
Security & Compliance: Ensure platform security, implement RBAC (Role-Based Access Control), and manage secrets for sensitive data and production environments
Collaboration: Work closely with Data Scientists and ML Engineers to understand their needs and provide technical guidance on best practices for scaling their models

Full-time

Staff Software Engineer, Managed AI - AI Platform

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'...

Location

United States, San Francisco, CA; Sunnyvale, CA

Salary:

208725.00 - 253000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

Advanced degree in Computer Science/Engineering
8-10+ years of industry experience with a demonstrated history of consistent success leading a varied portfolio of initiatives across your function
Experience with distributed systems, cloud services (compute, storage, networking, database), and delivering early-stage projects quickly
Experience with Generative AI (LLMs, Multimodal) and familiarity with AI infrastructure (training, inference, ETL pipelines)
Proficiency with containers and container orchestration (e.g., Kubernetes), microservices, REST APIs, gRPC, and the full software development lifecycle, including CI/CD

Job Responsibility

Lead the design and implementation of core AI services, including resilient, fault-tolerant queues for efficient task distribution; model catalogs for managing and versioning AI models; and scheduling mechanisms optimized for cost and performance
Architect and scale infrastructure to handle millions of API requests per second
Implement robust monitoring and alerting to ensure system health and 24/7 availability
Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap
Influence the long-term vision and architectural decisions of the platform
Contribute to open-source AI frameworks and actively participate in the AI community
Prototype and rapidly iterate on emerging technologies and new features

What we offer

Restricted Stock Units in a fast growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement

Full-time

Senior Software Engineer, Managed AI - AI Platform

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'...

Location

United States, San Francisco, CA; Sunnyvale, CA

Salary:

172425.00 - 209000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

Advanced degree in Computer Science/Engineering
4-5+ years of industry experience with a demonstrated history of consistent success leading a varied portfolio of initiatives across your function
Experience with distributed systems, cloud services (compute, storage, networking, database), and delivering early-stage projects quickly
Experience with Generative AI (LLMs, Multimodal) and familiarity with AI infrastructure (training, inference, ETL pipelines)
Proficiency with containers and container orchestration (e.g., Kubernetes), microservices, REST APIs, gRPC, and the full software development lifecycle, including CI/CD

Job Responsibility

Lead the design and implementation of core AI services, including resilient, fault-tolerant queues for efficient task distribution; model catalogs for managing and versioning AI models; and scheduling mechanisms optimized for cost and performance
Architect and scale infrastructure to handle millions of API requests per second
Implement robust monitoring and alerting to ensure system health and 24/7 availability
Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap
Influence the long-term vision and architectural decisions of the platform
Contribute to open-source AI frameworks and actively participate in the AI community
Prototype and rapidly iterate on emerging technologies and new features

What we offer

Restricted Stock Units
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement

Full-time

Senior Platform Engineer - CI/CD & AI Automation (AI-first)

Groupon is undergoing a critical platform transformation, modernizing its core d...

Location

Czechia, Prague

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

5+ years of dedicated experience in Platform Engineering, DevOps, or Infrastructure roles
Deep expertise building, scaling, and migrating CI/CD systems, with strong practical experience in Jenkins and/or GitHub Actions
Expertise in scripting and automation (Python, Go, or Bash)
Solid understanding of container technologies, Kubernetes, and cloud build systems
Proven experience leveraging AI tooling (e.g., Claude Code, code analysis) to meaningfully increase developer output and optimize platform work
Excellent communication and ability to drive technical decisions across multiple platform and product teams

Job Responsibility

Platform Transformation: Lead the design, planning, and execution of the Jenkins-to-GitHub Actions migration across a large portfolio of microservices
Pipeline Engineering: Design and optimize high-performance, secure, and observable CI/CD workflows across GitHub Actions, Jenkins, and Kubernetes environments
AI-First Automation: Drive an AI-First workflow by leveraging tools (e.g., Copilot, code generation) to eliminate infrastructure toil, accelerate development, and analyze pipeline failures
Core Automation: Develop robust platform automation (e.g., Python, Go, Bash) to improve build efficiency, artifact caching, reliability, and repository hygiene
Security & Compliance: Harden CI/CD infrastructure with robust controls for secrets management, RBAC, audit logging, and secure runner design
Observability: Implement and enhance CI/CD observability using tools like Prometheus, Grafana, and OpenTelemetry to provide deep insights into performance and reliability
Technical Leadership: Mentor engineers and partner across Cloud, Security, and Developer Experience teams to define and evolve our end-to-end delivery platform architecture
