AI Infrastructure Engineer, Model Serving Platform

Company:
Scale

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:
179400.00 - 224250.00 USD / Year

Job Description:

As a Software Engineer on the ML Infrastructure team, you will design and build platforms for scalable, reliable, and efficient serving of LLMs. Our platform powers cutting-edge research and production systems, supporting both internal and external use cases across various environments. The ideal candidate combines strong ML fundamentals with deep expertise in backend system design. You’ll work in a highly collaborative environment, bridging research and engineering to deliver seamless experiences to our customers and accelerate innovation across the company.

Job Responsibility:

  • Build and maintain fault-tolerant, high-performance systems for serving LLM workloads at scale
  • Build an internal platform that enables discovery of LLM capabilities
  • Collaborate with researchers and engineers to integrate and optimize models for production and research use cases
  • Conduct architecture and design reviews to uphold best practices in system design and scalability
  • Develop monitoring and observability solutions to ensure system health and performance (see the metrics sketch after this list)
  • Lead projects end-to-end, from requirements gathering to implementation, in a cross-functional environment
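
As a purely illustrative aside (not part of the posting), the sketch below shows the kind of serving-side metrics such observability work typically starts from. It assumes the Python prometheus_client library; the metric names and the generate() handler are hypothetical.

    import time
    from prometheus_client import Counter, Histogram, start_http_server

    # Hypothetical metrics for a model-serving endpoint (names are illustrative only).
    REQUESTS = Counter("llm_requests_total", "Total generation requests", ["model"])
    LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency", ["model"])

    def generate(model: str, prompt: str) -> str:
        """Stand-in for a real model call; records request count and latency."""
        REQUESTS.labels(model=model).inc()
        start = time.perf_counter()
        try:
            return f"echo: {prompt}"  # placeholder for actual inference
        finally:
            LATENCY.labels(model=model).observe(time.perf_counter() - start)

    if __name__ == "__main__":
        start_http_server(9100)  # exposes /metrics for a Prometheus scraper to pull
        print(generate("demo-model", "hello"))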

Requirements:

  • 4+ years of experience building large-scale, high-performance backend systems
  • Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++)
  • Experience with LLM serving and routing fundamentals (e.g., rate limiting, token streaming, load balancing, budgets; see the illustrative sketch after this list)
  • Experience with LLM capabilities and concepts such as reasoning, tool calling, prompt templates, etc.
  • Experience with containers and orchestration tools (e.g., Docker, Kubernetes)
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments
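
As a brief, hedged illustration of the rate-limiting and token-streaming fundamentals named above (not a requirement of the role), the sketch below wraps a toy streaming generator in a minimal token-bucket limiter. It uses only the Python standard library, and the class and function names are invented for this example.

    import time
    from typing import Iterator

    class TokenBucket:
        """Minimal token-bucket limiter: allows `rate` requests/sec with bursts up to `capacity`."""
        def __init__(self, rate: float, capacity: int) -> None:
            self.rate = rate
            self.capacity = capacity
            self.tokens = float(capacity)
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    def stream_tokens(prompt: str) -> Iterator[str]:
        """Stand-in for a streaming LLM response: yields one 'token' at a time."""
        for word in f"echo of: {prompt}".split():
            yield word

    bucket = TokenBucket(rate=2.0, capacity=5)
    if bucket.allow():
        print(" ".join(stream_tokens("hello world")))
    else:
        print("429: rate limit exceeded")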

Nice to have:

Experience with modern LLM serving frameworks such as vLLM, SGLang, TensorRT-LLM, or text-generation-inference
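
For context on these frameworks (illustrative only, not part of the posting), a minimal offline-inference sketch in the style of vLLM's quickstart might look like the following; the model name and sampling parameters are placeholders.

    # Assumes the vLLM package is installed and a small model is available locally or via Hugging Face.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # placeholder model chosen only for illustration
    params = SamplingParams(temperature=0.8, max_tokens=64)

    outputs = llm.generate(["Explain token streaming in one sentence."], params)
    for output in outputs:
        print(output.outputs[0].text)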

What we offer:
  • Comprehensive health, dental, and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO

Additional Information:

Job Posted:
February 20, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for AI Infrastructure Engineer, Model Serving Platform

Senior ML Platform Engineer

At WHOOP, we're on a mission to unlock human performance and healthspan. WHOOP e...
Location:
United States, Boston
Salary:
150000.00 - 210000.00 USD / Year
Company:
Whoop
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s Degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 5+ years of experience in software engineering with a focus on ML infrastructure, cloud platforms, or MLOps
  • Strong programming skills in Python, with experience in building distributed systems and REST/gRPC APIs
  • Deep knowledge of cloud-native services and infrastructure-as-code (e.g., AWS CDK, Terraform, CloudFormation)
  • Hands-on experience with model deployment platforms such as AWS SageMaker, Vertex AI, or Kubernetes-based serving stacks
  • Proficiency in ML lifecycle tools (MLflow, Weights & Biases, BentoML) and containerization strategies (Docker, Kubernetes)
  • Understanding of data engineering and ingestion pipelines, with ability to interface with data lakes, feature stores, and streaming systems
  • Proven ability to work cross-functionally with Data Science, Data Platform, and Software Engineering teams, influencing decisions and driving alignment
  • Passion for AI and automation to solve real-world problems and improve operational workflows
Job Responsibility:
  • Architect, build, own, and operate scalable ML infrastructure in cloud environments (e.g., AWS), optimizing for speed, observability, cost, and reproducibility
  • Create, support, and maintain core MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry), ensuring reliability, scalability, and long-term sustainability
  • Develop, evolve, and operate MLOps platforms and frameworks that standardize model deployment, versioning, drift detection, and lifecycle management at scale
  • Implement and continuously maintain end-to-end CI/CD pipelines for ML models using orchestration tools (e.g., Prefect, Airflow, Argo Workflows), ensuring robust testing, reproducibility, and traceability
  • Partner closely with Data Science, Sensor Intelligence, and Data Platform teams to operationalize and support model development, deployment, and monitoring workflows
  • Build, manage, and maintain both real-time and batch inference infrastructure, supporting diverse use cases from physiological analytics to personalized feedback loops for WHOOP members
  • Design, implement, and own automated observability tooling (e.g., for model latency, data drift, accuracy degradation), integrating metrics, logging, and alerting with existing platforms
  • Leverage AI-powered tools and automation to reduce operational overhead, enhance developer productivity, and accelerate model release cycles
  • Contribute to and maintain internal platform documentation, SDKs, and training materials, enabling self-service capabilities for model deployment and experimentation
  • Continuously evaluate and integrate emerging technologies and deployment strategies, influencing WHOOP’s roadmap for AI-driven platform efficiency, reliability, and scale
What we offer:
  • Equity
  • Benefits
Employment Type:
Full-time

Software Engineer, Infrastructure

As a Software Engineer on our Infrastructure team, you will help design and buil...
Location:
United States, New York; San Mateo; Redwood City
Salary:
140000.00 - 150000.00 USD / Year
Company:
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • Strong programming skills in Python, C++, or a similar language
  • Solid understanding of computer systems concepts such as networking, storage, and distributed computing
  • Familiarity with cloud platforms like AWS, GCP, or Azure, and containerization tools like Docker or Kubernetes
  • Knowledge and interest in cloud infrastructure, distributed systems, and machine learning
Job Responsibility:
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as job schedulers, autoscalers, resource managers, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Collaborate with ML, DevOps, and product teams to translate research and product needs into infrastructure solutions
  • Learn and apply modern cloud technologies including Kubernetes, Ray, Kubeflow, and MLflow
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary and comprehensive benefits package
Employment Type:
Full-time

Machine Learning Platform / Backend Engineer

We are seeking a Machine Learning Platform/Backend Engineer to design, build, an...
Location:
Serbia, Belgrade; Romania, Timișoara
Salary:
Not provided
Company:
Everseen
Expiration Date:
Until further notice
Requirements:
  • 4-5+ years of work experience in ML infrastructure, MLOps, or Platform Engineering
  • Bachelor’s degree or equivalent, preferably with a focus on computer science
  • Excellent communication and collaboration skills
  • Expert knowledge of Python
  • Experience with CI/CD tools (e.g., GitLab, Jenkins)
  • Hands-on experience with Kubernetes, Docker, and cloud services
  • Understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex AI, Azure ML)
  • A demonstrated understanding of the ML lifecycle, model versioning, and monitoring
  • Experience with ML frameworks (e.g., TensorFlow, PyTorch)
Job Responsibility:
  • Design, build, and maintain scalable infrastructure that empowers data scientists and machine learning engineers
  • Own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure)
  • Design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring
  • Integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML or custom schedulers) to automate data processing, training, and deployment pipelines
  • Develop shared services for model behavior/performance tracking, data/datasets versioning, and artifact management (MLflow, DVC, or custom registries)
  • Build out documentation covering architecture, policies, and operations runbooks
  • Share skills, knowledge, and expertise with members of the data engineering team
  • Foster a culture of collaboration and continuous learning by organizing training sessions, workshops, and knowledge-sharing sessions
  • Collaborate and drive progress with cross-functional teams to design and develop new features and functionalities
  • Ensure that the developed solutions meet project objectives and enhance user experience
Employment Type:
Full-time

AI (Infrastructure & Pipelines) Architect

As an AI Architect and reporting to Everseen's CTO, you will be responsible for ...
Location:
Timișoara / Belgrade
Salary:
Not provided
Company:
Everseen
Expiration Date:
Until further notice
Requirements:
  • Strong knowledge of Machine Learning and Deep Learning
  • Generative AI & LLM Integration
  • Edge AI & Real-time Inference
  • Proficiency in Python, and ML frameworks like TensorFlow, PyTorch, Hugging Face Transformers
  • Experience with cloud platforms (Azure, GCP)
  • Understanding of data architecture, big data technologies, and model deployment
  • Strong understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Experience with GPU orchestration (e.g., NVIDIA GPU Operator, MIG)
  • Experience with MLOps and AI governance frameworks
  • Familiarity with ethical AI practices and data privacy regulations
Job Responsibility:
  • Systems & Infrastructure: Architect and oversee the adoption of scalable AI infrastructures and select appropriate frameworks, tools, and technologies
  • Performance & Monitoring Standardisation: Ensure all products and systems adhere to a commonly defined performance evaluation methodology and appropriate monitoring and reporting systems are in place
  • Compliance & Ethics: Ensure AI solutions adhere to ethical standards and regulatory requirements
  • Integration: Collaborate with data scientists, engineers, and business stakeholders to integrate AI into products and services
  • Leadership: Provide technical guidance and mentorship to cross-functional teams
Employment Type:
Full-time

Vice President - Big Data Engineer - AI & NLP

The Applications Development Technology Lead Analyst is a senior-level position ...
Location:
India, Chennai
Salary:
Not provided
Company:
Citi
Expiration Date:
Until further notice
Requirements:
  • 13+ years of relevant experience in Apps Development or systems analysis role
  • Extensive experience in system analysis and programming of software applications
  • Experience in managing and implementing successful projects
  • Expert-level Python skills for building machine learning and LLM-based applications in a professional environment
  • SQL skills with the ability to perform data interrogations
  • Proficiency in enterprise-level application development using Java 8, Scala, Oracle (or comparable database), and Messaging infrastructure like Solace, Kafka, Tibco EMS
  • Develop LLM solutions for querying structured data with natural language, including RAG architectures on enterprise knowledge bases
  • Build, scale, and optimize data science workloads, applying best MLOps practices for production
  • Lead the design and development of LLM-based tools to increase data accessibility, focusing on text-to-SQL platforms
  • Train and fine-tune LLM models to accurately interpret natural language queries and generate SQL queries
Job Responsibility:
  • Partner with multiple management teams to ensure appropriate integration of functions to meet goals
  • Identify and define necessary system enhancements to deploy new products and process improvements
  • Resolve a variety of high-impact problems/projects through in-depth evaluation of complex business processes, system processes, and industry standards
  • Provide expertise in area and advanced knowledge of applications programming
  • Ensure application design adheres to the overall architecture blueprint
  • Utilize advanced knowledge of system flow and develop standards for coding, testing, debugging, and implementation
  • Develop comprehensive knowledge of how areas of business, such as architecture and infrastructure, integrate to accomplish business goals
  • Provide in-depth analysis with interpretive thinking to define issues and develop innovative solutions
  • Serve as advisor or coach to mid-level developers and analysts, allocating work as necessary
  • Appropriately assess risk when business decisions are made, demonstrating particular consideration for the firm's reputation and safeguarding Citigroup, its clients and assets
What we offer:
  • Global Benefits
  • Best-in-class benefits to be well, live well and save well
Employment Type:
Full-time

AI Infrastructure Engineer

At BlackRock, technology underpins everything we do. AI is a core strategic prio...
Location:
United Kingdom, Edinburgh
Salary:
Not provided
Company:
BlackRock Investments
Expiration Date:
May 27, 2026
Requirements:
  • Strong experience in cloud infrastructure, platform engineering, or systems engineering roles
  • 4+ years of hands-on expertise with AWS, Azure, and/or GCP, including Azure ML, Azure Foundry, AWS Bedrock, and Google Vertex, as well as cloud compute, networking, storage, and security services
  • Understanding of ML platform operations and governance concepts, including model deployment strategies, lifecycle management, monitoring/observability, and Disaster Recovery
  • Experience supporting LLMs, generative AI platforms, or model serving infrastructure
  • Experience supporting AI and machine learning workloads, with exposure to managed compute for model training and fine-tuning, experimentation over large datasets, and end-to-end MLOps pipeline flow including data ingestion, training, validation, and deployment
  • Proficiency with Infrastructure as Code tools (e.g., Terraform, ARM/Bicep, CloudFormation)
  • Strong programming or scripting skills (e.g., Python, Bash, or similar)
  • Experience building and operating containerized and Kubernetes based platforms
  • Solid understanding of reliability, scalability, observability, and operational best practices
  • Ability to work effectively in cross-functional teams and communicate complex technical concepts clearly
Job Responsibility
Job Responsibility
  • Design, build, and operate AI-focused infrastructure platforms supporting model development, training, evaluation, and inference
  • Engineer scalable, reliable, and secure cloud-native services to support AI workloads across AWS, Azure, and hybrid environments
  • Partner with AI Engineering and Data Science teams to improve developer experience, performance, and operational stability of AI systems
  • Enable production deployment of ML models and LLMs within governed enterprise environments, aligned with firmwide risk and compliance standards
  • Implement and maintain infrastructure as code and automation to ensure repeatable, auditable platform provisioning
  • Build and operate observability, monitoring, and alerting solutions for AI platforms, ensuring availability, performance, and cost transparency
  • Collaborate with Security and Risk partners to integrate identity, access controls, data protection, and governance into AI infrastructure
  • Contribute to architectural decisions and technical standards for AI platforms across Aladdin
  • Participate in on-call rotations and operational support as required for critical platforms
  • Continuously evaluate emerging AI infrastructure technologies and apply them pragmatically within BlackRock’s enterprise context
What we offer:
  • Retirement investment and tools designed to help you in building a sound financial future
  • Access to education reimbursement
  • Comprehensive resources to support your physical health and emotional well-being
  • Family support programs
  • Flexible Time Off (FTO)
Employment Type:
Full-time

Machine Learning Engineer - Data Foundation and AI

You’ll be a machine learning engineer on the Data Foundation & AI team. In this ...
Location:
United States, San Francisco
Salary:
186000.00 - 236400.00 USD / Year
Company:
Plaid
Expiration Date:
Until further notice
Requirements:
  • 1-3 years of experience training, deploying, and scaling ML/AI models in production environments
  • Strong experience with distributed systems and ML operations — from large-scale training to low-latency serving and monitoring
  • Proficiency in Python and modern ML frameworks (e.g., PyTorch), with the ability to implement and optimize complex models
  • Hands-on experience building or scaling ML/AI infrastructure, pipelines, or reusable platforms that support multiple teams
  • Curiosity and drive to experiment with advanced AI techniques (e.g., embeddings, retrieval, generative modeling) while staying grounded in production impact
  • Ability to thrive in a collaborative environment, working with both technical and non-technical partners to drive measurable outcomes
Job Responsibility:
  • Building and scaling advanced ML/AI systems that power core Plaid products and applications used by millions of consumers
  • Driving impact at scale by improving distributed training, serving, and ML operations to make Plaid’s AI capabilities faster, more reliable, and more widely available
  • Developing new AI applications that enable innovative product experiences across fintech
  • Tackling 0 to 1 problems where you explore new approaches, as well as scaling 1 to 10 systems for reliability and efficiency
  • Collaborating with some of the strongest MLEs at Plaid in a high-ownership, bottom-up driven team
  • Experimenting with cutting-edge ML and AI techniques while balancing practical productionization and measurable business impact
What we offer:
  • Medical
  • Dental
  • Vision
  • 401(k)
  • Equity
  • Commission
Employment Type:
Full-time

Machine Learning Engineer - Data Foundation and AI

You’ll be a machine learning engineer on the Data Foundation & AI team. In this ...
Location:
United States, New York
Salary:
186000.00 - 236400.00 USD / Year
Company:
Plaid
Expiration Date:
Until further notice
Requirements:
  • 1-3 years of experience training, deploying, and scaling ML/AI models in production environments
  • Strong experience with distributed systems and ML operations — from large-scale training to low-latency serving and monitoring
  • Proficiency in Python and modern ML frameworks (e.g., PyTorch), with the ability to implement and optimize complex models
  • Hands-on experience building or scaling ML/AI infrastructure, pipelines, or reusable platforms that support multiple teams
  • Curiosity and drive to experiment with advanced AI techniques (e.g., embeddings, retrieval, generative modeling) while staying grounded in production impact
  • Ability to thrive in a collaborative environment, working with both technical and non-technical partners to drive measurable outcomes
Job Responsibility:
  • Building and scaling advanced ML/AI systems that power core Plaid products and applications used by millions of consumers
  • Driving impact at scale by improving distributed training, serving, and ML operations to make Plaid’s AI capabilities faster, more reliable, and more widely available
  • Developing new AI applications that enable innovative product experiences across fintech
  • Tackling 0 to 1 problems where you explore new approaches, as well as scaling 1 to 10 systems for reliability and efficiency
  • Collaborating with some of the strongest MLEs at Plaid in a high-ownership, bottom-up driven team
  • Experimenting with cutting-edge ML and AI techniques while balancing practical productionization and measurable business impact
What we offer:
  • Medical
  • Dental
  • Vision
  • 401(k)
  • Equity
  • Commission
Employment Type:
Full-time