AI Infrastructure Engineer, Model Serving Platform

Scale

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

179400.00 - 224250.00 USD / Year

Job Description:

As a Software Engineer on the ML Infrastructure team, you will design and build platforms for scalable, reliable, and efficient serving of LLMs. Our platform powers cutting-edge research and production systems, supporting both internal and external use cases across various environments. The ideal candidate combines strong ML fundamentals with deep expertise in backend system design. You’ll work in a highly collaborative environment, bridging research and engineering to deliver seamless experiences to our customers and accelerate innovation across the company.

Job Responsibility:

  • Build and maintain fault-tolerant, high-performance systems for serving LLM workloads at scale
  • Build an internal platform to empower LLM capability discovery
  • Collaborate with researchers and engineers to integrate and optimize models for production and research use cases
  • Conduct architecture and design reviews to uphold best practices in system design and scalability
  • Develop monitoring and observability solutions to ensure system health and performance
  • Lead projects end-to-end, from requirements gathering to implementation, in a cross-functional environment

Requirements:

  • 4+ years of experience building large-scale, high-performance backend systems
  • Strong programming skills in one or more languages (e.g., Python, Go, Rust, C++)
  • Experience with LLM serving and routing fundamentals (e.g., rate limiting, token streaming, load balancing, budgets)
  • Experience with LLM capabilities and concepts such as reasoning, tool calling, and prompt templates
  • Experience with containers and orchestration tools (e.g., Docker, Kubernetes)
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments

Nice to have:

  • Experience with modern LLM serving frameworks such as vLLM, SGLang, TensorRT-LLM, or text-generation-inference

What we offer:
  • Comprehensive health, dental, and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO

Additional Information:

Job Posted:
February 20, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for AI Infrastructure Engineer, Model Serving Platform

Senior ML Platform Engineer

At WHOOP, we're on a mission to unlock human performance and healthspan. WHOOP e...
Location:
United States, Boston
Salary:
150000.00 - 210000.00 USD / Year
Whoop
Expiration Date
Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 5+ years of experience in software engineering with a focus on ML infrastructure, cloud platforms, or MLOps
  • Strong programming skills in Python, with experience in building distributed systems and REST/gRPC APIs
  • Deep knowledge of cloud-native services and infrastructure-as-code (e.g., AWS CDK, Terraform, CloudFormation)
  • Hands-on experience with model deployment platforms such as AWS SageMaker, Vertex AI, or Kubernetes-based serving stacks
  • Proficiency in ML lifecycle tools (MLflow, Weights & Biases, BentoML) and containerization strategies (Docker, Kubernetes)
  • Understanding of data engineering and ingestion pipelines, with ability to interface with data lakes, feature stores, and streaming systems
  • Proven ability to work cross-functionally with Data Science, Data Platform, and Software Engineering teams, influencing decisions and driving alignment
  • Passion for AI and automation to solve real-world problems and improve operational workflows
Job Responsibility
  • Architect, build, own, and operate scalable ML infrastructure in cloud environments (e.g., AWS), optimizing for speed, observability, cost, and reproducibility
  • Create, support, and maintain core MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry), ensuring reliability, scalability, and long-term sustainability
  • Develop, evolve, and operate MLOps platforms and frameworks that standardize model deployment, versioning, drift detection, and lifecycle management at scale
  • Implement and continuously maintain end-to-end CI/CD pipelines for ML models using orchestration tools (e.g., Prefect, Airflow, Argo Workflows), ensuring robust testing, reproducibility, and traceability
  • Partner closely with Data Science, Sensor Intelligence, and Data Platform teams to operationalize and support model development, deployment, and monitoring workflows
  • Build, manage, and maintain both real-time and batch inference infrastructure, supporting diverse use cases from physiological analytics to personalized feedback loops for WHOOP members
  • Design, implement, and own automated observability tooling (e.g., for model latency, data drift, accuracy degradation), integrating metrics, logging, and alerting with existing platforms
  • Leverage AI-powered tools and automation to reduce operational overhead, enhance developer productivity, and accelerate model release cycles
  • Contribute to and maintain internal platform documentation, SDKs, and training materials, enabling self-service capabilities for model deployment and experimentation
  • Continuously evaluate and integrate emerging technologies and deployment strategies, influencing WHOOP’s roadmap for AI-driven platform efficiency, reliability, and scale
What we offer
  • equity
  • benefits
Employment Type:
Fulltime

Software Engineer, Infrastructure

As a Software Engineer on our Infrastructure team, you will help design and buil...
Location:
United States, New York; San Mateo; Redwood City
Salary:
140000.00 - 150000.00 USD / Year
Fireworks AI
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • Strong programming skills in Python, C++, or a similar language
  • Solid understanding of computer systems concepts such as networking, storage, and distributed computing
  • Familiarity with cloud platforms like AWS, GCP, or Azure, and containerization tools like Docker or Kubernetes
  • Knowledge and interest in cloud infrastructure, distributed systems, and machine learning
Job Responsibility
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as job schedulers, autoscalers, resource managers, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Collaborate with ML, DevOps, and product teams to translate research and product needs into infrastructure solutions
  • Learn and apply modern cloud technologies including Kubernetes, Ray, Kubeflow, and MLflow
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer
  • Meaningful equity in a fast-growing startup
  • Competitive salary and comprehensive benefits package
Employment Type:
Fulltime

Machine Learning Platform / Backend Engineer

We are seeking a Machine Learning Platform/Backend Engineer to design, build, an...
Location:
Serbia; Romania, Belgrade; Timișoara
Salary:
Not provided
Everseen
Expiration Date
Until further notice
Requirements
  • 4-5+ years of work experience in either ML infrastructure, MLOps, or Platform Engineering
  • Bachelor's degree or equivalent in a computer science field is preferred
  • Excellent communication and collaboration skills
  • Expert knowledge of Python
  • Experience with CI/CD tools (e.g., GitLab, Jenkins)
  • Hands-on experience with Kubernetes, Docker, and cloud services
  • Understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex AI, Azure ML)
  • A demonstrated understanding of the ML lifecycle, model versioning, and monitoring
  • Experience with ML frameworks (e.g., TensorFlow, PyTorch)
Job Responsibility
  • Design, build, and maintain scalable infrastructure that empowers data scientists and machine learning engineers
  • Own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure)
  • Design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring
  • Integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML or custom schedulers) to automate data processing, training, and deployment pipelines
  • Develop shared services for model behavior/performance tracking, data/dataset versioning, and artifact management (MLflow, DVC, or custom registries)
  • Build out documentation covering architecture, policies, and operations runbooks
  • Share skills, knowledge, and expertise with members of the data engineering team
  • Foster a culture of collaboration and continuous learning by organizing training sessions, workshops, and knowledge-sharing sessions
  • Collaborate and drive progress with cross-functional teams to design and develop new features and functionalities
  • Ensure that the developed solutions meet project objectives and enhance user experience
Employment Type:
Fulltime

AI (Infrastructure & Pipelines) Architect

As an AI Architect and reporting to Everseen's CTO, you will be responsible for ...
Location:
Timișoara / Belgrade
Salary:
Not provided
Everseen
Expiration Date
Until further notice
Requirements
  • Strong knowledge of Machine Learning, Deep Learning
  • Generative AI & LLM Integration
  • Edge AI & Real-time Inference
  • Proficiency in Python, and ML frameworks like TensorFlow, PyTorch, Hugging Face Transformers
  • Experience with cloud platforms (Azure, GCP)
  • Understanding of data architecture, big data technologies, and model deployment
  • Strong understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Experience with GPU orchestration (e.g., NVIDIA GPU Operator, MIG)
  • Experience with MLOps and AI governance frameworks
  • Familiarity with ethical AI practices and data privacy regulations
Job Responsibility
  • Systems & Infrastructure: Architect and oversee the adoption of scalable AI infrastructures and select appropriate frameworks, tools, and technologies
  • Performance & Monitoring Standardisation: Ensure all products and systems adhere to a commonly defined performance evaluation methodology and appropriate monitoring and reporting systems are in place
  • Compliance & Ethics: Ensure AI solutions adhere to ethical standards and regulatory requirements
  • Integration: Collaborate with data scientists, engineers, and business stakeholders to integrate AI into products and services
  • Leadership: Provide technical guidance and mentorship to cross-functional teams
Employment Type:
Fulltime

Vice President - Bigdata Engineer - AI & NLP

The Applications Development Technology Lead Analyst is a senior-level position ...
Location:
India, Chennai
Salary:
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • 13+ years of relevant experience in Apps Development or systems analysis role
  • Extensive experience in system analysis and programming of software applications
  • Experience in managing and implementing successful projects
  • Expert Python coding skills for building machine learning and developing LLM-based applications in a professional environment
  • SQL skills sufficient to perform data interrogations
  • Proficiency in enterprise-level application development using Java 8, Scala, Oracle (or comparable database), and Messaging infrastructure like Solace, Kafka, Tibco EMS
  • Develop LLM solutions for querying structured data with natural language, including RAG architectures on enterprise knowledge bases
  • Build, scale, and optimize data science workloads, applying best MLOps practices for production
  • Lead the design and development of LLM-based tools to increase data accessibility, focusing on text-to-SQL platforms
  • Train and fine-tune LLM models to accurately interpret natural language queries and generate SQL queries
Job Responsibility
  • Partner with multiple management teams to ensure appropriate integration of functions to meet goals
  • Identify and define necessary system enhancements to deploy new products and process improvements
  • Resolve a variety of high-impact problems/projects through in-depth evaluation of complex business processes, system processes, and industry standards
  • Provide expertise in area and advanced knowledge of applications programming
  • Ensure application design adheres to the overall architecture blueprint
  • Utilize advanced knowledge of system flow and develop standards for coding, testing, debugging, and implementation
  • Develop comprehensive knowledge of how areas of business, such as architecture and infrastructure, integrate to accomplish business goals
  • Provide in-depth analysis with interpretive thinking to define issues and develop innovative solutions
  • Serve as advisor or coach to mid-level developers and analysts, allocating work as necessary
  • Appropriately assess risk when business decisions are made, demonstrating particular consideration for the firm's reputation and safeguarding Citigroup, its clients and assets
What we offer
  • Global Benefits
  • Best-in-class benefits to be well, live well and save well
Employment Type:
Fulltime

AI Infrastructure Engineer

At BlackRock, technology underpins everything we do. AI is a core strategic prio...
Location:
United Kingdom, Edinburgh
Salary:
Not provided
BlackRock Investments
Expiration Date
May 27, 2026
Requirements
  • Strong experience in cloud infrastructure, platform engineering, or systems engineering roles
  • 4+ years of hands-on expertise with AWS and/or Azure and/or GCP, including Azure ML, Azure Foundry, AWS Bedrock, Google Vertex, as well as cloud compute, networking, storage, and security services
  • Understanding of ML platform operations and governance concepts, including model deployment strategies, lifecycle management, monitoring/observability, and Disaster Recovery
  • Experience supporting LLMs, generative AI platforms, or model serving infrastructure
  • Experience supporting AI and machine learning workloads, with exposure to managed compute for model training and fine-tuning, experimentation over large datasets, and end-to-end MLOps pipeline flow including data ingestion, training, validation and deployment
  • Proficiency with Infrastructure as Code tools (e.g., Terraform, ARM/Bicep, CloudFormation)
  • Strong programming or scripting skills (e.g., Python, Bash, or similar)
  • Experience building and operating containerized and Kubernetes-based platforms
  • Solid understanding of reliability, scalability, observability, and operational best practices
  • Ability to work effectively in cross-functional teams and communicate complex technical concepts clearly
Job Responsibility
  • Design, build, and operate AI-focused infrastructure platforms supporting model development, training, evaluation, and inference
  • Engineer scalable, reliable, and secure cloud-native services to support AI workloads across AWS, Azure, and hybrid environments
  • Partner with AI Engineering and Data Science teams to improve developer experience, performance, and operational stability of AI systems
  • Enable production deployment of ML models and LLMs within governed enterprise environments, aligned with firmwide risk and compliance standards
  • Implement and maintain infrastructure as code and automation to ensure repeatable, auditable platform provisioning
  • Build and operate observability, monitoring, and alerting solutions for AI platforms, ensuring availability, performance, and cost transparency
  • Collaborate with Security and Risk partners to integrate identity, access controls, data protection, and governance into AI infrastructure
  • Contribute to architectural decisions and technical standards for AI platforms across Aladdin
  • Participate in on-call rotations and operational support as required for critical platforms
  • Continuously evaluate emerging AI infrastructure technologies and apply them pragmatically within BlackRock’s enterprise context
What we offer
  • Retirement investment and tools designed to help you in building a sound financial future
  • Access to education reimbursement
  • Comprehensive resources to support your physical health and emotional well-being
  • Family support programs
  • Flexible Time Off (FTO)
Employment Type:
Fulltime

Distinguished Engineer – AI Security

We're building a world of health around every individual — shaping a more connec...
Location:
United States, Scottsdale
Salary:
175100.00 - 334750.00 USD / Year
CVS Health
Expiration Date
June 30, 2026
Requirements
  • 15+ years of AI experience, including significant depth in advanced technical or architectural roles
  • 5+ years of cybersecurity experience defining and integrating security standards and controls aligned to established frameworks such as NIST CSF
  • Deep expertise in AI security concepts such as adversarial ML, secure model deployment, AI agent authorization, AI data loss protection, AI safety, and AI risk management
  • Strong background in Zero Trust architecture and hybrid infrastructure security
  • Demonstrated ability to lead and influence large-scale, cross-functional security initiatives
  • Hands-on experience building, deploying, and securing AI systems and platforms in enterprise environments
  • Practical experience applying AI security and risk management frameworks in real-world engineering contexts
  • AI Security Frameworks: MITRE ATLAS, NIST RMF, ISACA AI Audit Toolkit, and emerging ISO/IEC AI security standards
  • AI Technologies: Expert conceptual and hands-on implementation knowledge of core ML and generative AI technologies including transformer-based NLP, LLM-based generative AI and agentic AI
  • AI Risk Management & Model Security: Threat modeling, adversarial defenses, model lifecycle security, and vulnerability management
Job Responsibility
  • Define and help execute the enterprise AI security strategy, spanning secure model selection, development, and deployment criteria, adversarial threat mitigation, and alignment with emerging AI governance requirements
  • Design, build, and maintain reusable AI security frameworks, reference patterns, and technical standards for model integrity, secure data pipelines, and privacy-preserving machine learning
  • Perform hands-on security assessments of AI systems, identify risks, and provide mitigation guidance based on AI security posture management and detection findings
  • Drive innovation in AI security techniques, controls, and tooling through applied research and practical implementation
  • Apply and guide the application of AI security frameworks such as MITRE ATLAS, NIST RMF, and emerging ISO/IEC AI standards to secure the end-to-end AI lifecycle
  • Apply Zero Trust principles to hybrid and cloud infrastructure environments supporting AI workloads, including workload identity, segmentation, and continuous verification
  • Partner closely with Enterprise Architecture and Platform Engineering to integrate AI security controls into infrastructure design patterns and shared services
  • Guide and, where appropriate, directly implement security capabilities across on-premises and cloud platforms to ensure consistent protection for AI and traditional systems
  • Hands-on Engineering & Prototyping: Design and build proof-of-concept solutions, reference implementations, and reusable components to validate AI security and infrastructure security approaches
  • Framework and Pattern Development: Architect repeatable security patterns and guardrails that can be adopted by data science, engineering, and platform teams
What we offer
  • Affordable medical plan options
  • 401(k) plan (including matching company contributions)
  • Employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
  • Tuition assistance
Employment Type:
Fulltime

Machine Learning Engineer - Data Foundation and AI

You’ll be a machine learning engineer on the Data Foundation & AI team. In this ...
Location:
United States, San Francisco
Salary:
186000.00 - 236400.00 USD / Year
Plaid
Expiration Date
Until further notice
Requirements
  • 1-3 years of experience training, deploying, and scaling ML/AI models in production environments
  • Strong experience with distributed systems and ML operations — from large-scale training to low-latency serving and monitoring
  • Proficiency in Python and modern ML frameworks (e.g., PyTorch), with the ability to implement and optimize complex models
  • Hands-on experience building or scaling ML/AI infrastructure, pipelines, or reusable platforms that support multiple teams
  • Curiosity and drive to experiment with advanced AI techniques (e.g., embeddings, retrieval, generative modeling) while staying grounded in production impact
  • Ability to thrive in a collaborative environment, working with both technical and non-technical partners to drive measurable outcomes
Job Responsibility
  • Building and scaling advanced ML/AI systems that power core Plaid products and applications used by millions of consumers
  • Driving impact at scale by improving distributed training, serving, and ML operations to make Plaid’s AI capabilities faster, more reliable, and more widely available
  • Developing new AI applications that enable innovative product experiences across fintech
  • Tackling 0 to 1 problems where you explore new approaches, as well as scaling 1 to 10 systems for reliability and efficiency
  • Collaborating with some of the strongest MLEs at Plaid in a high-ownership, bottom-up driven team
  • Experimenting with cutting-edge ML and AI techniques while balancing practical productionization and measurable business impact
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
Employment Type:
Fulltime