AI Infra Engineer

Perplexity

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

210000.00 - 385000.00 USD / Year

Job Description:

We are looking for an AI Infrastructure Engineer to join our growing team. Our stack includes Kubernetes, Slurm, Python, C++, and PyTorch, running primarily on AWS. As an AI Infrastructure Engineer, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.

Job Responsibility:

  • Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
  • Manage and optimize Slurm-based HPC environments for distributed training of large language models
  • Develop robust APIs and orchestration systems for both training pipelines and inference services
  • Implement resource scheduling and job management systems across heterogeneous compute environments
  • Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
  • Build monitoring, alerting, and observability solutions tailored to ML workloads running on Kubernetes and Slurm
  • Respond swiftly to system outages and collaborate across teams to maintain high uptime for critical training runs and inference services
  • Optimize cluster utilization and implement autoscaling strategies for dynamic workload demands

Requirements:

  • Strong expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
  • Hands-on experience with Slurm workload management, including job scheduling, resource allocation, and cluster optimization
  • Experience with deploying and managing distributed training systems at scale
  • Deep understanding of container orchestration and distributed systems architecture
  • High-level familiarity with LLM architecture and training processes (Multi-Head Attention, Multi-Query/Grouped-Query Attention, distributed training strategies)
  • Experience managing GPU clusters and optimizing compute resource utilization
  • Expert-level Kubernetes administration and YAML configuration management
  • Proficiency with Slurm job scheduling, resource management, and cluster configuration
  • Python and C++ programming with focus on systems and infrastructure automation
  • Hands-on experience with ML frameworks such as PyTorch in distributed training contexts
  • Strong understanding of networking, storage, and compute resource management for ML workloads
  • Experience developing APIs and managing distributed systems for both batch and real-time workloads
  • Solid debugging and monitoring skills with expertise in observability tools for containerized environments
  • Demonstrated experience managing large-scale Kubernetes deployments in production environments
  • Proven track record with Slurm cluster administration and HPC workload management
  • Previous roles in SRE, DevOps, or Platform Engineering with focus on ML infrastructure
  • Experience supporting both long-running training jobs and high-availability inference services
  • Ideally, 3-5 years of relevant experience in ML systems deployment with specific focus on cluster orchestration and resource management

Nice to have:

  • Experience with Kubernetes operators and custom controllers for ML workloads
  • Advanced Slurm administration including multi-cluster federation and advanced scheduling policies
  • Familiarity with GPU cluster management and CUDA optimization
  • Experience with other ML frameworks like TensorFlow or distributed training libraries
  • Background in HPC environments, parallel computing, and high-performance networking
  • Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
  • Experience with container registries, image optimization, and multi-stage builds for ML workloads

What we offer:

  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime

Similar Jobs for AI Infra Engineer

Senior Machine Learning Engineering Manager, Gen AI

We're seeking a Senior Machine Learning Manager (M60) to lead a cross-functional...
Location:
United States
Salary:
193500.00 - 303150.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 8+ years in ML, search, or backend engineering roles, with 3+ years leading teams
  • Strong track record of shipping ML-powered or LLM-integrated user-facing products
  • Experience with RAG systems (vector search, hybrid retrieval, LLM orchestration)
  • Deep experience in either modeling (e.g., LLMs, search, NLP) or engineering (e.g., backend infra, full-stack), with the ability to lead end-to-end
  • Deep understanding of LLM ecosystems (OpenAI, Claude, Mistral, OSS), orchestration frameworks (LangChain, LlamaIndex), and vector databases (Weaviate, Pinecone, FAISS, etc.)
  • Strong product intuition and ability to translate complex tech into valuable user features
  • Familiarity with GenAI evaluation methods: hallucination detection, groundedness scoring, and human-in-the-loop feedback loops
  • Master’s or PhD in Computer Science, Machine Learning, or related field preferred—or equivalent practical experience
Job Responsibility:
  • Lead the vision, design, and execution of LLM-powered AI products, leveraging advanced AI modeling (e.g. SLM post-training/fine-tuning), RAG architectures, and hybrid ranking systems
  • Define system architecture across retrievers, rankers, orchestration layers, prompt templates, and feedback mechanisms
  • Work closely with product and design teams to ensure delightful, fast, and grounded user experiences
  • Build and manage a cross-disciplinary team including ML engineers, backend/frontend engineers, and applied scientists
  • Foster a culture of E2E ownership — empowering the team to move from prototype to production quickly and iteratively
  • Mentor individuals to grow in both technical depth and product acumen
  • Shape the technical roadmap and long-term strategy for GenAI search across Atlassian’s product suite
  • Partner with platform and infra teams to scale inference, evaluate performance, and integrate usage signals for continuous improvement
  • Champion data quality, grounding, and responsible AI practices in all deployed features
What we offer:
  • health and wellbeing resources
  • paid volunteer days
  • Fulltime

Senior Machine Learning System Engineer

As a Senior ML System Engineer on the AI & ML Platform team, you will play a piv...
Location:
United States, Seattle; San Francisco; New York; Austin
Salary:
165500.00 - 265800.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Experience in building machine learning systems or ML infra / MLOps platform
  • Fluency in at least one modern object-oriented programming language (preferably Java/Kotlin and Python)
  • Experience with RESTful microservices
  • Experience using cloud tools such as Amazon Web Services (S3, Kinesis, CloudFormation, EKS, AWS Security and Networking)
  • Experience with Continuous Delivery and Continuous Integration
Job Responsibility:
  • Collaborate with your teammates to solve complex problems, from technical design to launch
  • Deliver cutting-edge solutions that are used by other Atlassian teams and products to build AI features that reach millions of customers
  • Deliver code reviews, documentation & bug fixes within a strong engineering culture
  • Partner across engineering teams to take on company-wide initiatives spanning multiple projects
  • Mentor junior members of the team
What we offer:
  • health and wellbeing resources
  • paid volunteer days
  • Fulltime

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
241200.00 - 400000.00 USD / Year
Plaid
Expiration Date:
Until further notice
Requirements:
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer, IC/TL
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility:
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer:
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
  • Fulltime

Senior Vue/Nuxt Frontend Engineer

Flanks is shaking up the wealth management industry by making it simpler and way...
Location:
Spain, Barcelona
Salary:
Not provided
Vue
Expiration Date:
Until further notice
Requirements:
  • Expert-level frontend skills with Vue + Nuxt, including scalable component architecture, state management, routing, and performance tuning
  • Real-time UI experience using WebSockets/SSE or other event-driven streaming patterns
  • Strong experience building complex dashboards and data visualizations (D3.js, Cytoscape, Vue Flow, or similar)
  • Comfortable using Docker / Docker Compose for local multi-service development
  • Familiarity with backend concepts, API design, and event schemas
  • 7+ years in software engineering (senior-level)
  • Fluent in Spanish and English
Job Responsibility:
  • Own the architecture of the AI/multi-agent frontend (Vue + Nuxt), from early design to production readiness
  • Maintain and evolve tooling, CI/CD, testing strategy, and internal component libraries relevant to the AI product area
  • Build event-driven UIs using WebSockets/SSE to show streaming agent responses, live logs, system state transitions, execution traces
  • Ensure all real-time views are smooth, performant, and reliable
  • Build rich, interactive components for conversation UIs, agent graphs/flows, timelines, status panels, and debugging views
  • Craft dense financial data dashboards that support auditing, validation, and decision-making
  • Work closely with backend, ML, and infra teams to define events, APIs, and schemas
  • Ensure the frontend reflects the underlying multi-agent system with accuracy and clarity
  • Partner with Product and Design to create UX patterns for AI interactions
  • Mentor engineers on frontend best practices, especially around real-time apps and visualization
What we offer:
  • A cool office between Sants Estació and Plaça Espanya with stunning views of Barcelona
  • Flexible working hours and hybrid work options
  • Paid day off on your birthday
  • Weekly fresh fruit, coffee, and tea on tap
  • Friday happy hours after our all-hands meetings
  • Team-building events to bond and have fun
  • Health insurance and flexible compensation with Alan
  • A digital canteen, thanks to Nora Real Food, subsidised at 50%
  • A yearly training budget to keep growing
  • Fulltime

Technical Sourcer

We’re looking for a Technical Sourcer to join our global Talent team at ElevenLa...
Location:
United States
Salary:
Not provided
ElevenLabs
Expiration Date:
Until further notice
Requirements:
  • Proven ability to globally source the top 1% engineering, product, and technical talent in competitive markets (we’d love to see ~2+ years, but impact > years)
  • Comfortable operating in high-volume, high-complexity technical searches — you stay focused, inventive, and energized even when the problem has many moving parts
  • Skilled in research-driven sourcing: market mapping, building signal-based profiles, and tailoring outreach with precision
  • Thrives in fast-paced, remote-first environments and collaborates effectively across global time zones
  • Excellent written communication — crisp, engaging outreach that resonates with technical audiences
  • Motivated by creativity, curiosity, and a deep interest in engineering, AI, and how people build things
Job Responsibility:
  • Mapping and engaging technical talent markets globally — product engineering, ML/AI, infra, research and more
  • Running multiple complex searches simultaneously without compromising depth or quality
  • Developing creative, non-obvious sourcing strategies that uncover exceptional engineers others miss
  • Partnering closely with Technical Recruiters and hiring leaders to refine requirements, identify signals, and raise the quality bar
  • Serving as an early ambassador for ElevenLabs — your outreach shapes candidates’ first impression and sets the tone for the entire process
What we offer:
  • Innovative culture
  • Growth paths
  • Learning & development: ElevenLabs proactively supports professional development through an annual discretionary stipend
  • Social travel: We also provide an annual discretionary stipend to meet up with colleagues each year, however you choose
  • Annual company offsite
  • Co-working: If you’re not located near one of our main hubs, we offer a monthly co-working stipend
  • Fulltime

Senior Vue/Nuxt Frontend Engineer

Flanks is shaking up the wealth management industry by making it simpler and way...
Location:
Spain, Barcelona
Salary:
Not provided
Flanks
Expiration Date:
Until further notice
Requirements:
  • Expert-level frontend skills with Vue + Nuxt, including scalable component architecture, state management, routing, and performance tuning
  • Real-time UI experience using WebSockets/SSE or other event-driven streaming patterns
  • Strong experience building complex dashboards and data visualizations (D3.js, Cytoscape, Vue Flow, or similar)
  • Comfortable using Docker / Docker Compose for local multi-service development
  • Familiarity with backend concepts, API design, and event schemas
  • 7+ years in software engineering (senior-level)
  • Fluent in Spanish and English
Job Responsibility:
  • Own the architecture of the AI/multi-agent frontend (Vue + Nuxt), from early design to production readiness
  • Maintain and evolve tooling, CI/CD, testing strategy, and internal component libraries relevant to the AI product area
  • Build event-driven UIs using WebSockets/SSE to show streaming agent responses, live logs, system state transitions, execution traces
  • Build rich, interactive components for conversation UIs, agent graphs/flows, timelines, status panels, and debugging views
  • Craft dense financial data dashboards that support auditing, validation, and decision-making
  • Work closely with backend, ML, and infra teams to define events, APIs, and schemas
  • Mentor engineers on frontend best practices, especially around real-time apps and visualization
  • Contribute to architectural discussions, standards, and documentation
What we offer:
  • A cool office between Sants Estació and Plaça Espanya with stunning views of Barcelona
  • Flexible working hours and hybrid work options
  • Paid day off on your birthday
  • Weekly fresh fruit, coffee, and tea on tap
  • Friday happy hours after our all-hands meetings
  • Team-building events to bond and have fun
  • Health insurance and flexible compensation with Alan
  • A digital canteen, thanks to Nora Real Food, subsidised at 50%
  • A yearly training budget to keep growing
  • Fulltime

HPC/AI Solution Architect

As part of the APAC/India HPC/AI group, your role will be to define, architect, ...
Location:
Singapore, Central Singapore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in engineering or from a technical university
  • 7+ years of experience in the technology industry with a focus on technical consulting and solution selling
  • HPC/AI-related experience preferred
  • Presales experience preferred
  • Deep understanding of AI infra technologies and AI workloads
  • Demonstrates expert technical skills in assigned area of specialization
  • Expert knowledge of the company offerings, strategic initiatives, current trends, competitor products, and strategies within the area of responsibility
  • Expert level written and verbal communication skills and mastery over English and local language
  • Demonstrates expert consultative selling techniques, including active listening, framing, whiteboarding, storytelling, etc.
  • Advanced experience with using social media, blogging, and related information-sharing technologies
Job Responsibility:
  • HPC/AI Presales tasks as a team member of the APAC/India HPC/AI Presales team
  • Solution architecting, system configuration, technical consulting, presentation delivery, and sales support for the general AI and HPC areas
  • Participates in deep-dive discussions and evaluates customers' current business needs and desired end-state infrastructure solutions to translate the technical view into the implementation view and determine implementation steps necessary to meet complex technical requirements
  • Delivers in-depth comparative analysis of alternative proposals to meet complex technical solution requirements
  • Maintains excellent communication with customers, with a key focus on IT managers, administrators, and specialists
  • Uses pipeline insights to prioritize high-potential deal pursuit activities with appropriate resource investment
  • Communicates to the account team the tangible offer value regarding financial return and achievement of business goals
  • Provides direction and guidance to improve processes and establish policies
  • Works with partners to identify gaps and address needs to accelerate channel business
  • Assesses the impact of new technologies on the company's technical solution portfolio and translates this understanding into practice and knowledge-sharing activities for broader presales peers, account managers, and partners
What we offer:
  • Comprehensive suite of benefits that supports physical, financial, and emotional wellbeing
  • Programs for personal and professional development
  • Inclusive work environment celebrating individual uniqueness
  • Fulltime

AI Engineering Leader—Robotics Innovation

Ready to architect the future of “physical AI”? Lead the buildout of next-gen da...
Location:
United States, Burlington
Salary:
Not provided
Nondestructive & Visual Inspection
Expiration Date:
Until further notice
Requirements:
  • 10+ years driving data infra, ML systems, or end-to-end AI engineering at scale
  • Hands-on with orchestration tools, feature stores, and cloud infra (AWS, GCP, Azure)
  • Deep software engineering skills (Python, Scala, Java) & streaming frameworks (Spark, Flink)
  • Background with robotics, CV data, and edge deployment preferred
Job Responsibility:
  • Spearhead full-stack data & ML pipelines for sensor, video, and telemetry data powering real-time robotics and vision systems
  • Design scalable infrastructure with strong foundations (schema, lineage, validation, and anomaly detection) for embedded AI
  • Integrate edge intelligence, observability, & feedback loops into robotics and perception
  • Set technical standards, mentor talent, and align architecture to real-world product goals
What we offer:
  • base + bonus + equity