AI Infrastructure Engineer, Core Infrastructure

Scale

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:
179400.00 - 310500.00 USD / Year

Job Description:

As a Software Engineer on the ML Infrastructure team, you will design and build the next generation of foundational systems that power all ML Infrastructure compute at Scale - from model training and evaluation to large-scale inference and experimentation. Our platform is responsible for orchestrating workloads across heterogeneous compute environments (GPU, CPU, on-prem, and cloud), optimizing for reliability, cost efficiency, and developer velocity.

Job Responsibility:

  • Design and maintain fault-tolerant, cost-efficient systems that manage compute allocation, scheduling, and autoscaling across clusters and clouds
  • Build common abstractions and APIs that unify job submission, telemetry, and observability across serving and training workloads
  • Develop systems for usage metering, cost attribution, and quota management, enabling transparency and control over compute budgets
  • Improve reliability and efficiency of large-scale GPU workloads through better scheduling, bin-packing, preemption, and resource sharing
  • Partner with ML engineers and API teams to identify bottlenecks and define long-term architectural standards
  • Lead projects end-to-end — from requirements gathering and design to rollout and monitoring — in a cross-functional environment

Requirements:

  • 4+ years of experience building large-scale backend or distributed systems
  • Strong programming skills in Python, Go, or Rust, and familiarity with modern cloud-native architecture
  • Experience with containers and orchestration tools (Kubernetes, Docker) and Infrastructure as Code (Terraform)
  • Familiarity with schedulers or workload management systems (e.g., Kubernetes controllers, Slurm, Ray, internal job queues)
  • Understanding of observability and reliability practices (metrics, tracing, alerting, SLOs)
  • A track record of improving system efficiency, reliability, or developer velocity in production environments

Nice to have:

  • Experience with multi-tenant compute platforms or internal PaaS
  • Knowledge of GPU scheduling, cost modeling, or hybrid cloud orchestration
  • Familiarity with LLM or ML training workloads, though deep ML expertise is not required

What we offer:

  • Comprehensive health, dental, and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
  • Equity-based compensation

Additional Information:

Job Posted:
February 20, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for AI Infrastructure Engineer, Core Infrastructure

Software Engineer, AI Infrastructure

As a Software Engineer on our AI Infrastructure team, you will help design the c...
Location:
United States, New York, NY; San Mateo, CA
Salary:
Not provided
Company:
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 3 years of experience in software engineering, with a focus on infrastructure or machine learning systems
  • Strong programming skills in Python, Go, or a similar language
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, MLflow, Vertex AI, SageMaker, Kubernetes)
  • Basic understanding of LLM concepts (e.g., context length, disaggregated prefill, KV cache memory estimation)
Job Responsibility:
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as LLM CI/CD pipeline, control plane, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Build frameworks and safeguards to ensure Fireworks AI has the best model quality in the industry
  • Collaborate with performance, training, and product teams to translate research and product needs into infrastructure solutions
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer:
  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure
  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation
Employment Type:
Fulltime

Senior Platform Engineer - CI/CD & AI Automation (AI-first)

Groupon is undergoing a critical platform transformation, modernizing its core d...
Location:
Czechia, Prague
Salary:
Not provided
Company:
Groupon
Expiration Date:
Until further notice
Requirements:
  • 5+ years of dedicated experience in Platform Engineering, DevOps, or Infrastructure roles
  • Deep expertise building, scaling, and migrating CI/CD systems, with strong practical experience in Jenkins and/or GitHub Actions
  • Expertise in scripting and automation (Python, Go, or Bash)
  • Solid understanding of container technologies, Kubernetes, and cloud build systems
  • Proven experience leveraging AI tooling (e.g., Claude Code, code analysis) to meaningfully increase developer output and optimize platform work
  • Excellent communication and ability to drive technical decisions across multiple platform and product teams
Job Responsibility:
  • Platform Transformation: Lead the design, planning, and execution of the Jenkins-to-GitHub Actions migration across a large portfolio of microservices
  • Pipeline Engineering: Design and optimize high-performance, secure, and observable CI/CD workflows across GitHub Actions, Jenkins, and Kubernetes environments
  • AI-First Automation: Drive an AI-First workflow by leveraging tools (e.g., Copilot, code generation) to eliminate infrastructure toil, accelerate development, and analyze pipeline failures
  • Core Automation: Develop robust platform automation (e.g., Python, Go, Bash) to improve build efficiency, artifact caching, reliability, and repository hygiene
  • Security & Compliance: Harden CI/CD infrastructure with robust controls for secrets management, RBAC, audit logging, and secure runner design
  • Observability: Implement and enhance CI/CD observability using tools like Prometheus, Grafana, and OpenTelemetry to provide deep insights into performance and reliability
  • Technical Leadership: Mentor engineers and partner across Cloud, Security, and Developer Experience teams to define and evolve our end-to-end delivery platform architecture

Engineering Manager, AI Platform

Lead Airtable's AI Platform pod, which builds the foundational infrastructure an...
Location:
United States, San Francisco; New York City
Salary:
240000.00 - 339900.00 USD / Year
Company:
Airtable
Expiration Date:
Until further notice
Requirements:
  • Platform builder at heart: think in systems and abstractions, with experience building infrastructure other teams depend on
  • Technical depth with strategic thinking
  • Systems thinker with shipping velocity
  • AI infrastructure experience: worked on ML platforms, agent frameworks, or AI infrastructure at scale
  • Quality through architecture
  • Strong technical and management growth trajectory: 5+ years experience as an engineer (previously in a staff or TL level IC position) and 1+ years as a manager, or a similar combination
Job Responsibility:
  • Build the AI platform foundation: own the core agent architecture, orchestration layer, and runtime
  • Design for platform scale: create robust abstractions and APIs
  • Establish AI reliability systems: build evaluation frameworks, monitoring, and quality assurance systems
  • Drive technical strategy: partner with Staff+ engineers to define the technical roadmap
  • Enable AI democratization: build platform capabilities that make sophisticated AI accessible to all Airtable users
What we offer:
  • Benefits
  • Restricted stock units
  • Incentive compensation
Employment Type:
Fulltime

Software Engineer, Infrastructure

As a Software Engineer on our Infrastructure team, you will help design and buil...
Location:
United States, New York; San Mateo; Redwood City
Salary:
140000.00 - 150000.00 USD / Year
Company:
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • Strong programming skills in Python, C++, or a similar language
  • Solid understanding of computer systems concepts such as networking, storage, and distributed computing
  • Familiarity with cloud platforms like AWS, GCP, or Azure, and containerization tools like Docker or Kubernetes
  • Knowledge and interest in cloud infrastructure, distributed systems, and machine learning
Job Responsibility:
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as job schedulers, autoscalers, resource managers, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Collaborate with ML, DevOps, and product teams to translate research and product needs into infrastructure solutions
  • Learn and apply modern cloud technologies including Kubernetes, Ray, Kubeflow, and MLflow
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary and comprehensive benefits package
Employment Type:
Fulltime

Director of AI Engineering

We are entering a hyper-growth phase of AI innovation and are hiring a Director ...
Location:
Canada; United States
Salary:
300000.00 - 450000.00 USD / Year
Company:
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 10–15+ years in software engineering, with significant leadership experience owning AI/ML or applied LLM systems at scale
  • Proven history shipping LLM-powered features, agentic workflows, or AI assistants used by real customers in production
  • Deep understanding of LLM orchestration frameworks (LangChain, LlamaIndex), RAG pipelines, vector search, embeddings, and prompt engineering
  • Expert in backend & distributed systems (Python strongly preferred) and cloud infrastructure (AWS/GCP)
  • Strong experience with telemetry, observability, and cost-aware real-time inference optimizations
  • Demonstrated ability to lead senior engineers, define technical roadmaps, and deliver outcomes aligned to business metrics
  • Experience building or scaling teams working on experimentation, optimization, personalization, or ML-powered growth systems
  • Exceptional ability to simplify complex problems, set clear standards, and drive alignment across Product, Data, Design, and Engineering
  • Strong product sense, ability to weigh novelty vs. impact, focus on user value, and prioritize speed with guardrails
  • Fluent in integrating AI tools into engineering workflows for code generation, debugging, delivery velocity, and operational efficiency
Job Responsibility:
  • Define the multi-year technical vision for Apollo’s AI stack, spanning agents, orchestration, inference, retrieval, and platformization
  • Prioritize high-impact AI investments by partnering with Product, Design, Research, and Data leaders to align engineering outcomes with business goals
  • Establish technical standards, evaluation criteria, and success metrics for every AI-powered feature shipped
  • Lead the architecture and deployment of long-horizon autonomous agents, multi-agent workflows, and API-driven orchestration frameworks
  • Build reusable, scalable agentic components that power GTM workflows like research, enrichment, sequencing, lead scoring, routing, and personalization
  • Own the evolution of Apollo’s internal LLM platform for high-scale, low-latency, cost-optimized inference
  • Oversee model-driven experiences for natural-language interfaces, RAG pipelines, semantic search, personalized recommendations, and email intelligence
  • Partner with Product & Design to build intuitive conversational UX that hides underlying complexity while elevating user productivity
  • Implement rigorous evaluation frameworks, including offline benchmarking, human-in-the-loop review, and online A/B experimentation
  • Ensure robust observability, monitoring, and safety guardrails for all AI systems in production
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA
Employment Type:
Fulltime

Software Engineer, Data Infrastructure

The Data Infrastructure team at Figma builds and operates the foundational platf...
Location:
United States, San Francisco; New York
Salary:
149000.00 - 350000.00 USD / Year
Company:
Figma
Expiration Date:
Until further notice
Requirements:
  • 5+ years of Software Engineering experience, specifically in backend or infrastructure engineering
  • Experience designing and building distributed data infrastructure at scale
  • Strong expertise in batch and streaming data processing technologies such as Spark, Flink, Kafka, or Airflow/Dagster
  • A proven track record of impact-driven problem-solving in a fast-paced environment
  • A strong sense of engineering excellence, with a focus on high-quality, reliable, and performant systems
  • Excellent technical communication skills, with experience working across both technical and non-technical counterparts
  • Experience mentoring and supporting engineers, fostering a culture of learning and technical excellence
Job Responsibility:
  • Design and build large-scale distributed data systems that power analytics, AI/ML, and business intelligence
  • Develop batch and streaming solutions to ensure data is reliable, efficient, and scalable across the company
  • Manage data ingestion, movement, and processing through core platforms like Snowflake, our ML Datalake, and real-time streaming systems
  • Improve data reliability, consistency, and performance, ensuring high-quality data for engineering, research, and business stakeholders
  • Collaborate with AI researchers, data scientists, product engineers, and business teams to understand data needs and build scalable solutions
  • Drive technical decisions and best practices for data ingestion, orchestration, processing, and storage
What we offer:
  • equity
  • health, dental & vision
  • retirement with company contribution
  • parental leave & reproductive or family planning support
  • mental health & wellness benefits
  • generous PTO
  • company recharge days
  • a learning & development stipend
  • a work from home stipend
  • cell phone reimbursement
Employment Type:
Fulltime

AI Engineer

In this role you will design and build intelligent, autonomous AI systems that e...
Location:
United States, San Diego
Salary:
199500.00 - 299300.00 USD / Year
Company:
Teradata
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Data Science, or a related field
  • 3–5+ years of experience in software architecture, backend development, or AI infrastructure
  • Strong Python skills and familiarity with Java, Go, and C++
  • Deep expertise in agent development, LLM integration, prompt engineering, runtime systems, and AI tooling
  • Experience with MCP servers, vector databases, RAG systems, graph-based memory, and NLP frameworks
  • Ability to design core agentic capabilities such as memory management, context handling, observability, and identity
  • Strong background in distributed systems, backend services, API design, and cloud-native deployments (AWS, Azure, GCP)
  • Proficiency with containerization, CI/CD pipelines, and scalable production infrastructures
  • Excellent communication skills, documentation habits, and ability to mentor or collaborate across teams
  • Passion for building safe, human-aligned, autonomous systems and extending open-source tools to innovate
Job Responsibility:
  • Design and build intelligent, autonomous AI systems that enable Teradata to push the boundaries of enterprise-scale agentic technology
  • Lead the development of scalable, secure, cloud-native frameworks that allow AI agents to reason, plan, act, and collaborate in real-world production environments
  • Create the foundational runtime components, automation capabilities, and infrastructure that power next-generation GenAI and Agentic AI solutions
  • Work closely with AI researchers, platform teams, and product leadership to bring advanced agentic capabilities from concept to production across Teradata’s data and AI platform
  • Succeed in this role by enabling enterprise customers to leverage powerful, resilient, and safely governed AI agents that drive measurable business value
What we offer:
  • Healthcare, life and disability insurance plans
  • 401(k) retirement savings plan
  • Time-off programs
  • Flexible work model
  • Well-being focus
  • Diversity, Equity, and Inclusion commitment
Employment Type:
Fulltime

Principal Engineer, AI Strategy and Innovation

Shape the architecture and execution of CLEAR’s AI platform strategy, from infra...
Location:
United States, New York
Salary:
250000.00 - 290000.00 USD / Year
Company:
Clear
Expiration Date:
Until further notice
Requirements:
  • 12+ years in software engineering and/or technical experience with deep expertise in AI systems, ML platforms, and data infrastructure
  • At least 5 years of experience with various AI technologies including GenAI, ML, Deep Learning, RPA or others
  • Proven ability to scale AI capabilities into high-throughput, low-latency environments
  • Strong technical background in cloud-native architectures (AWS or similar) and modern AI/ML stacks (TensorFlow/PyTorch, MLflow, RAG, MCP, etc.)
  • Experience leading AI strategy and platform adoption in enterprise-scale environments
  • Skilled at translating regulatory and compliance requirements into responsible AI practices
  • Track record of partnering closely with Product, Engineering, Analytics, and Security teams as well as business executives
  • Excellent communicator who can set a vision for AI, explain technical trade-offs, and influence executives, peers, and partners
  • Passionate about embedding AI into core products to deliver measurable impact for members and enterprise partners
Job Responsibility:
  • Define and scale CLEAR’s AI strategy, spanning data pipelines, ML lifecycle management, and intelligent applications
  • Lead engineering execution for AI models (development, deployment, monitoring, retraining) with a focus on reliability, observability, and ethical AI practices
  • Modernize analytics and intelligence systems to deliver predictive insights and partner-facing transparency in real time
  • Operationalize trust in AI by embedding privacy, compliance, and security into all platforms and workflows
  • Influence cross-functional stakeholders across the business, fostering a culture of technical rigor, collaboration, and innovation, and advising C-suite executives, leaders, and individual contributors
  • Lead the AI Governance group and drive best practices across business functions
  • Track and optimize KPIs on AI adoption, model performance, scalability, and business impact
What we offer:
  • Comprehensive healthcare plans
  • Family-building benefits (fertility and adoption/surrogacy support)
  • Flexible time off
  • Annual wellness stipend
  • Free OneMedical memberships for you and your dependents
  • A CLEAR Plus membership
  • A 401(k) retirement plan with employer match
  • Catered lunches every day
  • Fully stocked kitchens
  • Stipends and reimbursement programs for well-being and learning & development
Employment Type:
Fulltime