
Software Engineer - Data Infra Reliability

United States, Palo Alto · 220000.00 - 280000.00 USD / Year · Job Posted January 13, 2026

Job Description

Luma's mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

As our models scale to "omni" capabilities, our data infrastructure must be unbreakable. We are looking for a Data Reliability Engineer who brings a Site Reliability Engineering (SRE) mindset to the world of massive-scale data. You will be responsible for the resilience, automation, and scalability of the petabyte-scale pipelines that feed our research. This is not just about keeping the lights on; it’s about treating infrastructure as code and building self-healing data systems that allow our researchers to train on massive datasets without interruption. Whether you are a junior engineer with a passion for automation or a seasoned SRE veteran, you will play a critical role in hardening the backbone of Luma’s intelligence.

Job Responsibility

  • Automate Everything: Apply Infrastructure-as-Code (IaC) principles using Terraform to provision, manage, and scale our data infrastructure
  • Harden Data Pipelines: Build reliability and fault tolerance into our core data ingestion and processing workflows, ensuring high availability for research jobs
  • Scale Kubernetes & Ray: Operate and optimize large-scale Kubernetes clusters and Ray deployments to handle bursty, high-throughput workloads
  • Define Reliability: Establish Service Level Objectives (SLOs) and observability standards (Prometheus/Grafana) for our data platforms
  • Debug & Heal: Serve as the first line of defense for complex infrastructure failures, diagnosing root causes in distributed storage and compute systems

Requirements

  • Deep SRE/DevOps proficiency: You live and breathe Linux, networking, and automation
  • Infrastructure-as-Code Native: You have extensive experience with Terraform, Ansible, or similar tools to manage complex cloud environments (AWS/GCP)
  • Kubernetes Expert: You have managed Kubernetes in production and understand its internals, not just how to deploy containers
  • Python Proficiency: You can write high-quality Python code for automation, tooling, and infrastructure management
  • Data-Minded: You understand the specific challenges of stateful data systems and high-throughput storage (S3/Object Store)

Nice to have

  • Experience managing GPU clusters or AI/ML workloads
  • Background in both Software Engineering and Operations (DevOps)
  • Experience with high-performance networking (InfiniBand/RDMA)

Similar Jobs for Software Engineer - Data Infra Reliability

8 matching positions

Staff Software Engineer (Infra)

As a Staff Software Engineer (Infra) at Amigo, you'll own the technical directio...
Location: United States, New York City
Salary: 220000.00 - 260000.00 USD / Year
Company: Amigo
Expiration Date: Until further notice
Requirements
  • 7+ years of production infrastructure experience, with significant time at elite engineering organizations
  • Expert-level experience with Kubernetes and container orchestration at scale
  • Proven track record designing infrastructure that scales across multiple regions
  • Deep experience with cloud platforms (AWS, GCP, or Azure)
  • Strong understanding of infrastructure-level networking and security configurations
  • History of establishing engineering standards and mentoring engineers
  • Extremely high standards for reliability, security, and operational excellence
  • Both execution-oriented and defensive-minded: you ship infrastructure while anticipating failure modes
  • Deep knowledge of infrastructure as code tools (Terraform, Pulumi, or similar)
  • Experience with compliance requirements and data residency controls in regulated industries
Job Responsibility
  • Own technical architecture for infrastructure across cloud platforms, Kubernetes, Databricks, and supporting systems
  • Drive engineering standards for reliability, security, observability, and incident response
  • Architect multi-region deployment strategies with zero-downtime updates for critical systems
  • Design the compliance & security infrastructure for healthcare (HIPAA, SOC 2) and support future regulatory requirements
  • Own disaster recovery architecture and backup systems meeting healthcare compliance requirements
  • Make build vs. buy decisions for infrastructure tooling and evaluate technical tradeoffs
  • Design auto-scaling systems that handle traffic spikes while maintaining cost efficiency
  • Own the infrastructure-as-code definitions for our infrastructure, ensuring clearly documented, identical deployments across regions
  • Mentor engineers and establish patterns that raise the bar for the infrastructure team
  • Collaborate with backend, platform, and security teams to ensure system-wide coherence
What we offer
  • Comprehensive health, dental, and vision insurance
  • Mental health support and wellness coaching
  • Flexible wellness stipend for fitness, therapy, or personal growth
  • Daily catered lunch and dinner
  • Annual learning budget for courses, books, or conferences
  • Conference attendance budget for professional development
  • Development setup of your choice
  • Academic collaboration opportunities
  • Fulltime

Software Engineer, Maps Infra

We are looking for a Software Engineer to partner with our Mapping team to deliv...
Location: United States, San Francisco
Salary: 162000.00 - 260000.00 USD / Year
Company: Aurora Innovation
Expiration Date: Until further notice
Requirements
  • 4+ years experience building server side and data processing systems
  • Expert proficiency in C++ with a commitment to writing clean, testable, and production-ready code
  • Deep understanding of distributed systems principles, with a proven ability to deliver scalable, reliable backend systems
  • Strong understanding of cloud-native technologies (e.g., AWS, GCP, Kubernetes)
  • Excellent communication and collaboration skills
  • Proven ability to rapidly learn new technologies and adapt to evolving requirements
Job Responsibility
  • Design, develop, and maintain the scalable backend infrastructure and data processing pipeline for storing and serving map data as we onboard the Aurora Driver to more commercial routes
  • Establish and maintain robust testing and performance optimization practices to ensure the stability and scalability of the Atlas system
  • Partner closely with internal and external customers to influence existing and future designs and features
What we offer
  • Annual bonus
  • Equity compensation
  • Benefits
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Palo Alto
Salary: 90000.00 USD / Year
Company: Geico
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Software Engineer - Agent Infra

Perplexity is looking for a Software Engineer to build the core infrastructure t...
Location: United States, San Francisco; Palo Alto; New York City
Salary: 210000.00 - 385000.00 USD / Year
Company: Perplexity
Expiration Date: Until further notice
Requirements
  • 6+ years of industry experience building large scale systems or platforms
  • Experience building agent applications with tool calling, context engineering, or open connector integrations
  • Strong coding skills in one or more of: Python, Java, Go
  • Comfortable with service design, APIs, and data models for high throughput systems
  • Working knowledge of containers, virtualization, and sandboxing
  • Familiar with metrics, tracing, on call, and incident practices
  • Bias to own problems across layers, collaborate in fast moving teams, and ship
Job Responsibility
  • Design and implement a highly reliable and scalable agent runtime: orchestration, shared state and memory, tool calling interfaces, and scheduling for cost, latency, and quality
  • Build secure sandboxed execution for agent actions and code
  • Optimize cold start, isolation, and observability
  • Ship unified interfaces for multiple model sizes and providers
  • Integrate with open tool ecosystems such as MCP style connectors for data and actions
  • Develop an evaluation platform for online and offline assessments, A/B tests, safety checks, and regression gates that improve agent reliability over time
  • Partner with Research, Inference, and Search to land new agent capabilities end to end, from prototype to production
What we offer
  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts
  • Fulltime

Staff Software Engineer (Frontend), Infra

Staff Software Engineer role in Infrastructure team at Harmonic, a startup disco...
Location: United States, New York
Salary: 210000.00 - 280000.00 USD / Year
Company: Harmonic
Expiration Date: Until further notice
Requirements
  • 7+ years building frontend applications at scale
  • Deep expertise in React, TypeScript, and modern build tooling
  • Track record of fixing production performance issues at scale
  • Strong opinions on frontend architecture, backed by experience
  • NYC-based, in office 3 days/week
Job Responsibility
  • Fix core reliability issues: error boundaries, state management, data consistency
  • Optimize performance: virtualization for large datasets, bundle optimization, render performance
  • Build monitoring and observability to catch issues before users do
  • Establish testing strategies that prevent regressions
  • Create abstractions and patterns that help engineers ship faster without breaking things
  • Drive technical decisions and mentor the team through complex migrations
What we offer
  • Top of the line health, dental and vision insurance, with 100% premium covered
  • 401k matching
  • Free lunch in office
  • Monthly team dinner for each office
  • Commuter benefits
  • Fulltime

Software Engineer II (Backend, Healthcare Infra)

As a Software Engineer II on the Healthcare Infra team, you will help build and ...
Location: United States, Boston
Salary: 125000.00 - 170000.00 USD / Year
Company: Whoop
Expiration Date: Until further notice
Requirements
  • Professional experience in backend development, with a strong foundation in object-oriented programming, API design, and relational databases (RESTful APIs, Postgres)
  • Familiarity with asynchronous processing systems (Kafka, SQS)
  • Experience writing automated tests and documenting code for a variety of audiences
  • A passion for approaching large-scale problems guided by data-driven insights and a commitment to agile, iterative development
  • A proactive, collaborative team player, eager to take on new challenges, continuously learn, and adapt in a fast-paced, data-informed environment
Job Responsibility
  • Contribute to engineering efforts within a cross-functional team, collaborating with designers, product managers, other engineers, and our Digital Health team to refine and advance the WHOOP platform
  • Develop and maintain robust backend services using Java, Kafka, Postgres, and other AWS technologies, ensuring stability and performance
  • Contribute to the ideation, technical design, and implementation of new features and platforms, transforming complex requirements into reliable, scalable solutions
  • Work on scaling challenges that span multiple systems and demand high availability and reliability
  • Write clean, testable, and maintainable code, while participating in code reviews and documentation practices
What we offer
  • equity
  • benefits
  • Fulltime

Lead Specialty Software Engineer - AI Tooling & Enablement

Wells Fargo is seeking a Specialty Software Engineer to support the onboarding, ...
Location: United States, Charlotte; Irving
Salary: Not provided
Company: Wells Fargo
Expiration Date: June 19, 2026
Requirements
  • 5+ years of Specialty Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of AI Engineering experience
  • Hands-on experience using one or more of the following tools: Devin.AI, Claude Code, Cursor.AI, or Github Copilot
  • Hands-on experience with AI software development tools, CI/CD pipelines, and SDLC processes
  • Hands-on experience building or supporting GenAI applications, including prompt design and RAG concepts
  • Familiarity with AI/ML frameworks or orchestration tools (LangChain or similar)
  • Experience integrating applications with APIs, data platforms, or enterprise systems
  • Understanding of cloud-based AI services (GCP Vertex AI, AWS, or Azure equivalents)
  • Experience supporting or implementing developer tooling, platforms, or engineering enablement capabilities
  • Strong general software engineering: fluency across the organization's main stacks (Python/FastAPI, Scala/Spark, Java/Spring, TypeScript/React), in order to validate Devin's output across teams
Job Responsibility
  • Drive onboarding, adoption, and effective usage of AI tools and SDLC tooling across engineering teams
  • Provide technical leadership and guidance on tooling standards, best practices, and integration patterns
  • Lead complex initiatives related to developer platforms, AI tooling, and engineering productivity
  • Design, develop, test, and implement tools, utilities, and automation that enhance developer experience and operational efficiency
  • Mentor junior engineers and provide guidance on tool usage and engineering best practices
  • Influence team members and stakeholders to adopt new tools and capabilities
  • Onboarding repos — run repos through the qualifying criteria, create environment blueprints, generate codebase wikis, run proof-of-concept PRs
  • Building reusable assets — author org-wide and repo-specific playbooks, knowledge notes, and scheduled automations
  • Running the champions program — train and support per-team tools champions, run 30-min team onboarding workshops, hold office hours
  • Driving adoption — identify high-value use cases per team, pair with developers, lower the barrier to starting sessions (Teams/GitHub entry points)
  • Fulltime