Design Engineer - AI Infrastructure

Location: United States, Menlo Park
Salary: 113000.00 - 164000.00 USD / Year
Job Posted: January 23, 2026

Job Description

Want to build cutting-edge AI developer tools? The AI Infrastructure Experiences team designs the tooling and platforms that power Meta’s AI efforts, including model authoring, training, experimentation, deployment, and debugging. This work enables many of Meta’s top AI priorities, including recommendations on Instagram and Facebook, Ads ranking, research (FAIR) and more. As a Product Design Engineer, you'll have the opportunity to contribute to and influence everything from product strategy, to user experience design, to UI engineering.

Job Responsibility

  • Work with full-stack software teams to turn designs into experiences, and validate or discard hypotheses by testing new designs and concepts with advanced prototypes
  • Develop and iterate on rapid models of proposed design solutions for communication and evaluation across cross-functional teams
  • Improve tools and drive quality and craft initiatives in an effort to up-level the look and feel of the AI infrastructure products
  • Translate static designs (both hi-fidelity and wire-frame) into end-to-end, interactive prototypes that can be used in research, critique, reviews with leadership, and public-facing product announcements
  • Present work to peers, product teams and executive leadership/stakeholders

Requirements

  • Bachelor's degree in Design, Human-Computer Interaction, Computer Science, or a related field
  • 2+ years of experience shipping software products
  • Experience in the full lifecycle of design and development, including ideation, prototyping, development, backwards compatibility, failovers, validation, and user research
  • Understanding of design principles and the knowledge to build experiences that span mobile and desktop browsers and native platforms
  • Experience with front-end technologies and prototyping tools, including Figma, PHP, HTML, CSS, and JavaScript frameworks (e.g., React, React Native, NodeJS)
  • A solid portfolio of things you’ve designed and built across a range of platforms, including native and web-based applications

Nice to have

  • You've built (and shipped) something: an app, an MVP, a proof of concept, etc.
  • A portfolio that shows a range of technical skills and a creative use of code

What we offer

  • bonus
  • equity
  • benefits

Similar Jobs for Design Engineer - AI Infrastructure

8 matching positions

AI Infrastructure Engineer, Core Infrastructure

As a Software Engineer on the ML Infrastructure team, you will design and build ...
Location: United States, San Francisco; Seattle; New York
Salary: 179400.00 - 310500.00 USD / Year
Company: Scale
Expiration Date: Until further notice
Requirements
  • 4+ years of experience building large-scale backend or distributed systems
  • Strong programming skills in Python, Go, or Rust, and familiarity with modern cloud-native architecture
  • Experience with containers and orchestration tools (Kubernetes, Docker) and Infrastructure as Code (Terraform)
  • Familiarity with schedulers or workload management systems (e.g., Kubernetes controllers, Slurm, Ray, internal job queues)
  • Understanding of observability and reliability practices (metrics, tracing, alerting, SLOs)
  • A track record of improving system efficiency, reliability, or developer velocity in production environments
Job Responsibility
  • Design and maintain fault-tolerant, cost-efficient systems that manage compute allocation, scheduling, and autoscaling across clusters and clouds
  • Build common abstractions and APIs that unify job submission, telemetry, and observability across serving and training workloads
  • Develop systems for usage metering, cost attribution, and quota management, enabling transparency and control over compute budgets
  • Improve reliability and efficiency of large-scale GPU workloads through better scheduling, bin-packing, preemption, and resource sharing
  • Partner with ML engineers and API teams to identify bottlenecks and define long-term architectural standards
  • Lead projects end-to-end — from requirements gathering and design to rollout and monitoring — in a cross-functional environment
What we offer
  • Comprehensive health, dental and vision coverage
  • retirement benefits
  • a learning and development stipend
  • generous PTO
  • equity based compensation
Employment Type: Full-time

AI Infrastructure Engineer

We are seeking a DevOps / Platform Engineer to join our team building and operat...
Location: United States, San Jose
Salary: 204000.00 - 306000.00 USD / Year
Company: AMD
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in DevOps, Platform, or Infrastructure Engineering
  • Deep hands-on experience with Kubernetes and container orchestration at scale
  • Proven ability to design and deliver platform features that serve internal customers or developer teams
  • Experience building developer-facing platforms or internal developer portals (e.g., custom workflow tooling)
Job Responsibility
  • Build and extend platform capabilities to enable new classes of workloads (e.g., interactive development pods, CI pipelines, inference services, benchmarking jobs)
  • Design and operate scalable orchestration systems using Kubernetes across both on-prem and multi-cloud environments
  • Develop platform features such as secret management, configuration management, and deployment automation for customers
  • Partner with development teams to extend the GPU developer platform with features, APIs, templates, and self-service workflows that streamline job orchestration and environment management
  • Manage service lifecycle within Kubernetes using Helm and GitOps workflows (e.g., ArgoCD or Flux)
  • Apply expertise in storage and networking to design and integrate CSI drivers, persistent volumes, and network policies that enable high-performance GPU workloads
Employment Type: Full-time

Systems Design Engineer - AI Cluster Software

WHAT YOU DO AT AMD CHANGES EVERYTHING At AMD, our mission is to build great prod...
Location: United States, Austin
Salary: 163200.00 - 244800.00 USD / Year
Company: AMD
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in electrical or computer engineering
  • Evidence of end-to-end systems thinking, debugging, and tradeoff decisions
  • Hands-on familiarity with at least two schedulers and/or orchestration systems (e.g., Slurm, Kubernetes), MPI/OpenMP, distributed storage patterns, or performance analysis
  • Experience writing evaluation docs/RFCs with clear criteria, benchmarks, risks, and recommendations
  • Strong Linux fundamentals: Linux operating systems, networking, filesystems, containers, performance tooling (perf, flamegraphs, nvprof/rocprof, basic eBPF)
  • Ability to turn complex systems into accessible, structured documentation with diagrams and reproducible steps
  • ROCm, RCCL, Instinct GPUs, EPYC platforms, compiler/toolchain impacts, and performance tuning
  • DDP, collective comms, sharded/stateful optimizers
  • NCCL/RCCL behavior and transport considerations (PCIe, NVLink, IF)
  • Slurm configuration patterns, Kubernetes for HPC/AI (GPU operators, device plugins), Apptainer/Singularity
Job Responsibility
  • Apply your expertise to shape AI infrastructure by creating reference architectures, configuration guides, and deployment blueprints that help internal teams and customers make informed hardware and software decisions
  • Perform deep technical evaluations of AI stacks across compute, storage, networking, and observability layers, documenting how they work, where they fit, and the tradeoffs involved
  • Design and execute reproducible experiments and benchmarking harnesses to compare technologies such as schedulers, distributed training libraries, and observability stacks
  • Develop small reference implementations and tools to validate performance hypotheses, analyze system behavior and more
  • Build a library of technical artifacts, including presentations, design documents, and “how it works” guides, to support pre-sales engineers and enable others to skill up from an HPC perspective
  • Present findings through demos, documentation, and internal talks, and create templates and checklists to support repeatable evaluations and cluster designs
Employment Type: Full-time

Sr. Cloud Infrastructure Engineer (AI & LLM Platforms)

We are seeking a specialized Infrastructure Engineer to bridge the gap between o...
Salary: Not provided
Company: Q6 Cyber
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in DevOps, Platform Engineering, or SRE, with at least 1-2 years specifically focused on AI/ML infrastructure
  • Proven track record of building production-grade RAG pipelines or LLM-integrated applications
  • Thrives in 'day zero' environments where the tools and protocols (like MCP) are evolving weekly
  • Deep understanding of the security implications of LLMs (prompt injection, data leakage, and secure tool execution)
  • Experience working with substantial datasets (over 1bn objects, dozens or hundreds of TBs) and the challenges of leveraging AI tools with these data sets
  • Bachelor's degree or equivalent in computer science or related field
  • Cloud & Orchestration: AWS/GCP/Azure, Kubernetes, Terraform, Helm
  • AI Frameworks: LangChain, LlamaIndex, LangGraph
  • Data & Vectors: Pinecone, Milvus, Qdrant, or pgvector
  • Apache Kafka/Pulsar
Job Responsibility
  • Guide the architecture that will allow us to leverage AI tools with our large existing data stores and incoming streams of realtime intelligence
  • Work closely with other infrastructure engineers and software development teams to integrate AI tools into existing systems
  • Design, deploy, and maintain Model Context Protocol (MCP) servers to allow LLMs to securely interact with our internal databases, APIs, and external tooling
  • Build and orchestrate sandboxed, scalable environments (e.g., using Docker or specialized runtimes) where users can safely build and execute AI agents
  • Develop and manage the infrastructure for our internal RAG (Retrieval-Augmented Generation) pipeline, including vector database management (e.g., Pinecone, Weaviate, or pgvector) and automated embedding pipelines
  • Utilize Kubernetes (K8s) and Infrastructure as Code (Terraform/Pulumi) to deploy LLM-related tools, ensuring high availability and low latency for model inference and data retrieval
  • Implement strict guardrails for data privacy within LLM workflows, ensuring internal datasets remain secure while being accessible to authorized AI tools
What we offer
  • Competitive compensation package and comprehensive benefits package
Employment Type: Full-time

Vice President - Technology (Data & AI Infrastructure Engineer)

Our client's technology team is responsible for creating and continuously improv...
Location: United States, New York
Salary: 175000.00 - 215000.00 USD / Year
Company: Renner Brown
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, or related field (Master's degree is a plus)
  • 8+ years in infrastructure engineering, cloud platform engineering, or data engineering
  • Demonstrated experience building shared platforms or developer services in an enterprise environment
  • Azure expertise: Azure AI Foundry, Azure Data Factory, Azure Databricks, AKS, Azure API Management, Azure Key Vault, Azure Entra ID
  • Strong Python skills: backend services, REST APIs (FastAPI or Flask), and automation scripting
  • PowerShell for infrastructure tasks
  • Infrastructure-as-Code: Terraform and/or Bicep
  • Container orchestration with Docker and Kubernetes
  • Experience integrating LLM APIs (Anthropic Claude, Azure OpenAI) in production including token cost management and observability
  • RAG pipeline experience: vector search (Azure AI Search or pgvector), document processing, and retrieval patterns
Job Responsibility
  • Design, build, and operate the firm's AI platform, enabling developers to build and deploy Python-based AI applications
  • Implement and manage Azure AI Foundry environments: model deployments, AI hubs, project workspaces, and access controls
  • Integrate and operationalize third-party AI APIs (Anthropic Claude API, Azure OpenAI) with secure access patterns, API gateway controls, rate limiting, and cost monitoring
  • Build internal developer tooling and SDK scaffolding to accelerate AI application development across the firm
  • Build and maintain data pipelines using Azure Data Factory and Azure Databricks to serve AI application data needs
  • Implement vector search and document retrieval infrastructure (Azure AI Search) to support RAG-based applications
  • Manage structured and unstructured data stores including Azure Data Lake, Azure SQL, and Cosmos DB
  • Provision and maintain secure, scalable infrastructure on Azure (primary) and AWS using Infrastructure-as-Code (Terraform or Bicep)
  • Build and maintain CI/CD pipelines for AI application deployment via Azure DevOps or GitHub Actions
  • Manage containerized workloads using Docker and Kubernetes (AKS) for AI application hosting and API services
Employment Type: Full-time

Senior AI Infrastructure Engineer - Training Platform

As a Software Engineer on the Machine Learning Infrastructure team, you will bui...
Location: United States, San Francisco; Seattle; New York
Salary: 216000.00 - 270000.00 USD / Year
Company: Scale
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
  • Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
  • Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
  • Experience with distributed training infrastructure, such as EFA, Infiniband, and topology-aware scheduling
  • Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
  • Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments
Job Responsibility
  • Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs
  • Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
  • Work closely with Finance and Procurement teams to drive our capacity planning process
  • Participate in our team's on call process to ensure the availability of our services
  • Own projects end-to-end, from requirements, scoping, design, to implementation, in a highly collaborative and cross-functional environment
What we offer
  • Comprehensive health, dental and vision coverage
  • retirement benefits
  • a learning and development stipend
  • generous PTO
  • commuter stipend (may be eligible)
Employment Type: Full-time

Senior Data Engineer - AI Infrastructure

We are building a large-scale data platform that transforms raw system logs into...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 3+ years experience in business analytics, data science, software development, data modeling, or data engineering OR Bachelor's Degree in Computer Science, Math, Software Engineering, Computer Engineering, or related field AND 4+ years experience in business analytics, data science, software development, data modeling, or data engineering OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Design and implement large-scale data pipelines using PySpark and distributed processing frameworks
  • Build and maintain data models that accurately represent underlying system behavior and business logic
  • Ensure high standards of data correctness, completeness, and consistency across datasets
  • Develop validation, monitoring, and alerting mechanisms to detect data quality issues
  • Partner with data scientists to support experimentation and analytics use cases
  • Collaborate with platform engineers to ensure efficient data ingestion, processing, and storage
  • Optimize pipelines for performance, scalability, and cost efficiency
  • Define and enforce best practices for schema design, data transformations, and pipeline reliability
Employment Type: Full-time

Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI

The AI Platform organization builds the end-to-end Azure AI stack, from the infr...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript, OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large scale AI training and inferencing
  • Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters
  • Enhance systems and applications to ensure high stability, efficiency, and maintainability, low latency, and tight cloud security
  • Provide operational support and DRI (on-call) responsibilities for the service
  • Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers
  • Collaborate closely with service engineers, product managers, and internal applied research and data science teams within Microsoft to build better solutions together
  • Provide vision, expertise, and technical leadership to other team members
  • Help to grow talent in these areas
  • Embody our culture and values
Employment Type: Full-time