CrawlJobs Logo

Ai Infrastructure Engineer, Core Infrastructure

scale.com Logo

Scale

Location Icon

Location:
United States , San Francisco

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

179400.00 - 310500.00 USD / Year

Job Description:

As a Software Engineer on the ML Infrastructure team, you will design and build the next generation of foundational systems that power all ML Infrastructure compute at Scale - from model training and evaluation to large-scale inference and experimentation. Our platform is responsible for orchestrating workloads across heterogeneous compute environments (GPU, CPU, on-prem, and cloud), optimizing for reliability, cost efficiency, and developer velocity.

Job Responsibility:

  • Design and maintain fault-tolerant, cost-efficient systems that manage compute allocation, scheduling, and autoscaling across clusters and clouds
  • Build common abstractions and APIs that unify job submission, telemetry, and observability across serving and training workloads
  • Develop systems for usage metering, cost attribution, and quota management, enabling transparency and control over compute budgets
  • Improve reliability and efficiency of large-scale GPU workloads through better scheduling, bin-packing, preemption, and resource sharing
  • Partner with ML engineers and API teams to identify bottlenecks and define long-term architectural standards
  • Lead projects end-to-end — from requirements gathering and design to rollout and monitoring — in a cross-functional environment

Requirements:

  • 4+ years of experience building large-scale backend or distributed systems
  • Strong programming skills in Python, Go, or Rust, and familiarity with modern cloud-native architecture
  • Experience with containers and orchestration tools (Kubernetes, Docker) and Infrastructure as Code (Terraform)
  • Familiarity with schedulers or workload management systems (e.g., Kubernetes controllers, Slurm, Ray, internal job queues)
  • Understanding of observability and reliability practices (metrics, tracing, alerting, SLOs)
  • A track record of improving system efficiency, reliability, or developer velocity in production environments

Nice to have:

  • Experience with multi-tenant compute platforms or internal PaaS
  • Knowledge of GPU scheduling, cost modeling, or hybrid cloud orchestration
  • Familiarity with LLM or ML training workloads, though deep ML expertise is not required
What we offer:
  • Comprehensive health, dental and vision coverage
  • retirement benefits
  • a learning and development stipend
  • generous PTO
  • equity based compensation

Additional Information:

Job Posted:
February 20, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Ai Infrastructure Engineer, Core Infrastructure

Software Engineer, AI Infrastructure

As a Software Engineer on our AI Infrastructure team, you will help design the c...
Location
Location
United States , New York, NY; San Mateo, CA
Salary
Salary:
Not provided
fireworks.ai Logo
Fireworks AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 3 years of experience in software engineering, with a focus on infrastructure or machine learning systems
  • Strong programming skills in Python, Go, or a similar language
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, MLflow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Basic understanding of LLM knowledge (e.g., context length, disaggregated prefill, KV cache memory estimation, etc)
Job Responsibility
Job Responsibility
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as LLM CI/CD pipeline, control plane, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Building frameworks and safeguards to ensure Fireworks AI has the best model quality in the industry
  • Collaborate with performance, training, and product teams to translate research and product needs into infrastructure solutions
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer
What we offer
  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure
  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation
  • Fulltime
Read More
Arrow Right

Senior Engineering Manager - AI Core Platform

We’re hiring a Senior Engineering Manager (or high-potential EM2) for the Core P...
Location
Location
Ireland , Dublin
Salary
Salary:
Not provided
intercom.com Logo
Intercom
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience leading engineering teams, ideally across infrastructure or platform domains
  • Recent hands-on coding experience — you’ve shipped production code in the last couple of years
  • Strong technical judgment and the ability to coach senior engineers through complex architectural trade-offs
  • Adaptable leadership style suited to a group that will grow quickly, and change shape over time
  • Curiosity and enthusiasm for AI, with a desire to learn how ML systems are developed and operated in production
Job Responsibility
Job Responsibility
  • Lead a high-performing team building the platform and infrastructure that power Intercom’s AI capabilities
  • Contribute directly to production code, staying close to the work and building knowledge & context through first-hand experience
  • Support teams of ML Scientists and Engineers building AI powered capabilities
  • Plan, prioritize, and deliver high-impact roadmaps in partnership with the team’s most senior engineers, balancing delivery, quality, and innovation
  • Improve developer experience across the AI infrastructure stack, ensuring that systems are observable, scalable, and easy to build upon
  • Empower the engineers on the team to act with agency and maximize their impact
  • Expand your scope over time, potentially taking ownership of additional platform domains as the team and AI initiatives grow
What we offer
What we offer
  • Competitive salary and equity in a fast-growing start-up
  • We serve lunch every weekday, plus a variety of snack foods and a fully stocked kitchen
  • Regular compensation reviews - we reward great work
  • Pension scheme & match up to 4%
  • Peace of mind with life assurance, as well as comprehensive health and dental insurance for you and your dependents
  • Flexible paid time off policy
  • Paid maternity leave, as well as 6 weeks paternity leave for fathers, to let you spend valuable time with your loved ones
  • If you’re cycling, we’ve got you covered on the Cycle-to-Work Scheme. With secure bike storage too
  • MacBooks are our standard, but we also offer Windows for certain roles when needed
  • Fulltime
Read More
Arrow Right

Senior Platform Engineer - CI/CD & AI Automation (AI-first)

Groupon is undergoing a critical platform transformation, modernizing its core d...
Location
Location
Czechia , Prague
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of dedicated experience in Platform Engineering, DevOps, or Infrastructure roles
  • Deep expertise building, scaling, and migrating CI/CD systems, with strong practical experience in Jenkins and/or GitHub Actions
  • Expertise in scripting and automation (Python, Go, or Bash)
  • Solid understanding of container technologies, Kubernetes, and cloud build systems
  • Proven experience leveraging AI tooling (e.g., Claude Code, code analysis) to meaningfully increase developer output and optimize platform work
  • Excellent communication and ability to drive technical decisions across multiple platform and product teams
Job Responsibility
Job Responsibility
  • Platform Transformation: Lead the design, planning, and execution of the Jenkins-to-GitHub Actions migration across a large portfolio of microservices
  • Pipeline Engineering: Design and optimize high-performance, secure, and observable CI/CD workflows across GitHub Actions, Jenkins, and Kubernetes environments
  • AI-First Automation: Drive an AI-First workflow by leveraging tools (e.g., Copilot, code generation) to eliminate infrastructure toil, accelerate development, and analyze pipeline failures
  • Core Automation: Develop robust platform automation (e.g., Python, Go, Bash) to improve build efficiency, artifact caching, reliability, and repository hygiene
  • Security & Compliance: Harden CI/CD infrastructure with robust controls for secrets management, RBAC, audit logging, and secure runner design
  • Observability: Implement and enhance CI/CD observability using tools like Prometheus, Grafana, and OpenTelemetry to provide deep insights into performance and reliability
  • Technical Leadership: Mentor engineers and partner across Cloud, Security, and Developer Experience teams to define and evolve our end-to-end delivery platform architecture
Read More
Arrow Right

Engineering Manager, AI Platform

Lead Airtable's AI Platform pod, which builds the foundational infrastructure an...
Location
Location
United States , San Francisco; New York City
Salary
Salary:
240000.00 - 339900.00 USD / Year
airtable.com Logo
Airtable
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Platform builder at heart: think in systems and abstractions
  • experience building infrastructure other teams depend on
  • Technical depth with strategic thinking
  • Systems thinker with shipping velocity
  • AI infrastructure experience: worked on ML platforms, agent frameworks, or AI infrastructure at scale
  • Quality through architecture
  • Strong technical and management growth trajectory: 5+ years experience as an engineer (previously in a staff or TL level IC position) and 1+ years as a manager, or a similar combination
Job Responsibility
Job Responsibility
  • Build the AI platform foundation: own the core agent architecture, orchestration layer, and runtime
  • Design for platform scale: create robust abstractions and APIs
  • Establish AI reliability systems: build evaluation frameworks, monitoring, and quality assurance systems
  • Drive technical strategy: partner with Staff+ engineers to define the technical roadmap
  • Enable AI democratization: build platform capabilities that make sophisticated AI accessible to all Airtable users
What we offer
What we offer
  • Benefits
  • Restricted stock units
  • Incentive compensation
  • Fulltime
Read More
Arrow Right

Software Engineer, Infrastructure

As a Software Engineer on our Infrastructure team, you will help design and buil...
Location
Location
United States , New York; San Mateo; Redwood City
Salary
Salary:
140000.00 - 150000.00 USD / Year
fireworks.ai Logo
Fireworks AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • Strong programming skills in Python, C++, or a similar language
  • Solid understanding of computer systems concepts such as networking, storage, and distributed computing
  • Familiarity with cloud platforms like AWS, GCP, or Azure, and containerization tools like Docker or Kubernetes
  • Knowledge and interest in cloud infrastructure, distributed systems, and machine learning
Job Responsibility
Job Responsibility
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as job schedulers, autoscalers, resource managers, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Collaborate with ML, DevOps, and product teams to translate research and product needs into infrastructure solutions
  • Learn and apply modern cloud technologies including Kubernetes, Ray, Kubeflow, and MLFlow
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer
What we offer
  • Meaningful equity in a fast-growing startup
  • Competitive salary and comprehensive benefits package
  • Fulltime
Read More
Arrow Right

Director of AI Engineering

We are entering a hyper-growth phase of AI innovation and are hiring a Director ...
Location
Location
Canada; United States
Salary
Salary:
300000.00 - 450000.00 USD / Year
apollo.io Logo
Apollo.io
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10–15+ years in software engineering, with significant leadership experience owning AI/ML or applied LLM systems at scale
  • Proven history shipping LLM-powered features, agentic workflows, or AI assistants used by real customers in production
  • Deep understanding of LLM orchestration frameworks (LangChain, LlamaIndex), RAG pipelines, vector search, embeddings, and prompt engineering
  • Expert in backend & distributed systems (Python strongly preferred) and cloud infrastructure (AWS/GCP)
  • Strong experience with telemetry, observability, and cost-aware real-time inference optimizations
  • Demonstrated ability to lead senior engineers, define technical roadmaps, and deliver outcomes aligned to business metrics
  • Experience building or scaling teams working on experimentation, optimization, personalization, or ML-powered growth systems
  • Exceptional ability to simplify complex problems, set clear standards, and drive alignment across Product, Data, Design, and Engineering
  • Strong product sense, ability to weigh novelty vs. impact, focus on user value, and prioritize speed with guardrails
  • Fluent in integrating AI tools into engineering workflows for code generation, debugging, delivery velocity, and operational efficiency
Job Responsibility
Job Responsibility
  • Define the multi-year technical vision for Apollo’s AI stack, spanning agents, orchestration, inference, retrieval, and platformization
  • Prioritize high-impact AI investments by partnering with Product, Design, Research, and Data leaders to align engineering outcomes with business goals
  • Establish technical standards, evaluation criteria, and success metrics for every AI-powered feature shipped
  • Lead the architecture and deployment of long-horizon autonomous agents, multi-agent workflows, and API-driven orchestration frameworks
  • Build reusable, scalable agentic components that power GTM workflows like research, enrichment, sequencing, lead scoring, routing, and personalization
  • Own the evolution of Apollo’s internal LLM platform for high-scale, low-latency, cost-optimized inference
  • Oversee model-driven experiences for natural-language interfaces, RAG pipelines, semantic search, personalized recommendations, and email intelligence
  • Partner with Product & Design to build intuitive conversational UX that hides underlying complexity while elevating user productivity
  • Implement rigorous evaluation frameworks, including offline benchmarking, human-in-the-loop review, and online A/B experimentation
  • Ensure robust observability, monitoring, and safety guardrails for all AI systems in production
What we offer
What we offer
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA
  • Fulltime
Read More
Arrow Right

Software Engineer, Data Infrastructure

The Data Infrastructure team at Figma builds and operates the foundational platf...
Location
Location
United States , San Francisco; New York
Salary
Salary:
149000.00 - 350000.00 USD / Year
figma.com Logo
Figma
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of Software Engineering experience, specifically in backend or infrastructure engineering
  • Experience designing and building distributed data infrastructure at scale
  • Strong expertise in batch and streaming data processing technologies such as Spark, Flink, Kafka, or Airflow/Dagster
  • A proven track record of impact-driven problem-solving in a fast-paced environment
  • A strong sense of engineering excellence, with a focus on high-quality, reliable, and performant systems
  • Excellent technical communication skills, with experience working across both technical and non-technical counterparts
  • Experience mentoring and supporting engineers, fostering a culture of learning and technical excellence
Job Responsibility
Job Responsibility
  • Design and build large-scale distributed data systems that power analytics, AI/ML, and business intelligence
  • Develop batch and streaming solutions to ensure data is reliable, efficient, and scalable across the company
  • Manage data ingestion, movement, and processing through core platforms like Snowflake, our ML Datalake, and real-time streaming systems
  • Improve data reliability, consistency, and performance, ensuring high-quality data for engineering, research, and business stakeholders
  • Collaborate with AI researchers, data scientists, product engineers, and business teams to understand data needs and build scalable solutions
  • Drive technical decisions and best practices for data ingestion, orchestration, processing, and storage
What we offer
What we offer
  • equity
  • health, dental & vision
  • retirement with company contribution
  • parental leave & reproductive or family planning support
  • mental health & wellness benefits
  • generous PTO
  • company recharge days
  • a learning & development stipend
  • a work from home stipend
  • cell phone reimbursement
  • Fulltime
Read More
Arrow Right

AI Engineer

In this role you will design and build intelligent, autonomous AI systems that e...
Location
Location
United States , San Diego
Salary
Salary:
199500.00 - 299300.00 USD / Year
teradata.com Logo
Teradata
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Engineering, Data Science, or a related field
  • 3–5+ years of experience in software architecture, backend development, or AI infrastructure
  • Strong Python skills and familiarity with Java, Go, and C++
  • Deep expertise in agent development, LLM integration, prompt engineering, runtime systems, and AI tooling
  • Experience with MCP servers, vector databases, RAG systems, graph-based memory, and NLP frameworks
  • Ability to design core agentic capabilities such as memory management, context handling, observability, and identity
  • Strong background in distributed systems, backend services, API design, and cloud-native deployments (AWS, Azure, GCP)
  • Proficiency with containerization, CI/CD pipelines, and scalable production infrastructures
  • Excellent communication skills, documentation habits, and ability to mentor or collaborate across teams
  • Passion for building safe, human-aligned, autonomous systems and extending open-source tools to innovate
Job Responsibility
Job Responsibility
  • Design and build intelligent, autonomous AI systems that enable Teradata to push the boundaries of enterprise-scale agentic technology
  • Lead the development of scalable, secure, cloud-native frameworks that allow AI agents to reason, plan, act, and collaborate in real-world production environments
  • Create the foundational runtime components, automation capabilities, and infrastructure that power next-generation GenAI and Agentic AI solutions
  • Work closely with AI researchers, platform teams, and product leadership to bring advanced agentic capabilities from concept to production across Teradata’s data and AI platform
  • Succeed in this role by enabling enterprise customers to leverage powerful, resilient, and safely governed AI agents that drive measurable business value
What we offer
What we offer
  • Healthcare, life and disability insurance plans
  • 401(k)-retirement savings plan
  • Time-off programs
  • Flexible work model
  • Well-being focus
  • Diversity, Equity, and Inclusion commitment
  • Fulltime
Read More
Arrow Right