AI/ML Infrastructure Engineer

United States, San Francisco · 150000.00 - 240000.00 USD / Year · Job Posted December 31, 2025

Job Description

Zensors is the spatial intelligence platform for the physical world. Our AI platform provides real-time insights, from airport queue times to office utilization, helping organizations make smarter operational decisions. Zensors processes massive streams of video data 24/7 with human-level accuracy. To do this at scale, we rely on cutting-edge optimization to ensure our vision transformer and detection models run efficiently on both cloud and edge compute resources.

The AI Infrastructure team at Zensors builds the engine that powers our visual sensing platform. We provide the tools to automate the lifecycle of our AI workflow, including model development, evaluation, optimization, deployment, and monitoring across thousands of video streams.

As a Machine Learning Engineer in ML Runtime & Optimization, you will develop technologies to accelerate the training and inference of computer vision models that power smart spaces and cities.

Job Responsibility

  • Optimizing Core ML Pipelines: Identifying key bottlenecks in our current video analytics pipeline and performing in-depth analysis to ensure the best possible performance on current server and edge compute architectures
  • Cross-Stack Collaboration: Collaborating closely with AI research and platform engineering teams to optimize core parallel algorithms and influence the design of our next-generation inference infrastructure
  • Model Acceleration: Applying advanced model optimization techniques, such as quantization (INT8/FP16), pruning, and layer fusion, to our Vision Transformers (ViTs) and CNNs to maximize throughput and minimize latency
  • Building Efficient Operators: Working across the entire ML framework/compiler stack (e.g., PyTorch, CUDA, TensorRT, and NVIDIA DeepStream) to write custom optimized ML operator libraries
  • Resource Efficiency: Reducing the compute cost per video stream to enable massive scalability of our SaaS product
  • Data Management: Building, improving, maintaining, and operating systems to facilitate the collection, labeling, and use of visual data for ML training

Requirements

  • BS/MS or Ph.D. in Computer Science, Electrical Engineering, or a related discipline
  • Strong programming skills in C/C++ and Python
  • Experience with model optimization, quantization, and efficient deep learning techniques (e.g., knowledge distillation, pruning)
  • Deep understanding of GPU hardware performance, including execution models, thread hierarchy, memory/cache management, and the cost/performance trade-offs of video processing
  • Experience with profiling and benchmarking tools (e.g., Nsight Systems, Nsight Compute) to validate performance on complex architectures
  • Experience identifying and resolving compute and data flow bottlenecks, particularly in high-bandwidth video processing pipelines
  • Strong communication skills and the ability to work cross-functionally between research and infrastructure teams

Nice to have

  • Familiarity with database systems (e.g., SQL, Neo4j)
  • Prior work in computer vision, deep learning, and Vision Transformers
  • Experience with video processing frameworks such as NVIDIA DeepStream, DALI, or FFmpeg
  • Familiarity with ML compilers (e.g., TVM, MLIR) or inference engines like TensorRT or ONNX Runtime
  • Knowledge of distributed training systems or cloud-scale inference serving (e.g., Triton Inference Server)

Similar Jobs for AI/ML Infrastructure Engineer

8 matching positions

AI/ML Engineer

As an experienced engineer, you know that building reliable and scalable AI syst...
Location: United States, Fort Meade
Salary: 99000.00 - 225000.00 USD / Year
Booz Allen Hamilton
Expiration Date: Until further notice
Requirements
  • 6+ years of experience with software engineering
  • Experience with Python
  • Experience developing and maintaining AI and ML pipelines for model training, validation, and deployment
  • Experience integrating AI and ML capabilities into production systems and mission workflows
  • Experience implementing monitoring and observability solutions for model performance, resource utilization, and system health
  • Knowledge of hybrid compute environments such as cloud and on-prem
  • Ability to automate workflows and build reproducible, scalable, and efficient ML and AI systems
  • Ability to collaborate with cross-functional teams, such as AI Model Engineers, Data Engineers, Cloud Architects, or ISSE, to align workflows with infrastructure and security requirements
  • TS/SCI clearance with a polygraph
  • HS diploma or GED
Job Responsibility
  • Build, test, deploy, and operate end-to-end AI software pipelines
  • Guide the development of mission-critical AI solutions by building reproducible pipelines, applying automation best practices, and introducing emerging tools
  • Collaborate with a cross-functional team of AI Model Engineers, Data Engineers, Cloud Architects, and ISSEs
  • Contribute to solutions that optimize data flow, automate operational workflows, ensure monitoring and observability, and support resilient mission execution across hybrid compute environments
What we offer
  • Health, life, disability, financial, and retirement benefits
  • Paid leave
  • Professional development
  • Tuition assistance
  • Work-life programs
  • Dependent care
  • Recognition awards program
Employment type: Full-time

AI/ML Engineer

We are seeking a highly skilled AI/ML Engineer to design, develop, and deploy sc...
Location: India, Bangalore
Salary: Not provided
Codvo AI
Expiration Date: Until further notice
Requirements
  • Strong proficiency in Python for machine learning and software development
  • Hands-on experience with PyTorch for deep learning model development
  • Solid understanding of deep learning architectures (CNNs, transfer learning, etc.)
  • Practical experience in computer vision applications
  • Experience working with Databricks and large-scale data processing
  • Strong knowledge of AWS services for ML deployment (EC2, S3, SageMaker, etc.)
  • Experience with MLOps tools and practices (model deployment, monitoring, CI/CD)
  • Good understanding of software engineering principles and production-grade system design
Job Responsibility
  • Design, develop, and optimize machine learning and deep learning models using PyTorch
  • Build and deploy computer vision solutions for real-world use cases
  • Develop end-to-end ML pipelines, including data ingestion, preprocessing, training, validation, and deployment
  • Implement and maintain MLOps workflows for model versioning, monitoring, CI/CD, and retraining
  • Deploy and scale ML models on AWS cloud infrastructure
  • Work with large-scale datasets using Databricks and distributed computing frameworks
  • Collaborate with data scientists, product managers, and software engineers to translate business requirements into AI solutions
  • Ensure high code quality by following software engineering best practices (modular design, testing, documentation)
  • Monitor model performance in production and continuously improve accuracy, efficiency, and reliability
Employment type: Full-time

Infrastructure Engineer

Reducto is the agentic document platform for leading AI teams who demand enterpr...
Location: United States, San Francisco
Salary: 150000.00 - 300000.00 USD / Year
Reducto
Expiration Date: Until further notice
Requirements
  • Are your own worst critic—have an extremely high bar for quality and always aim for robust solutions rather than quick fixes
  • Have 5+ years of hands-on experience in building or supporting production-grade infrastructure and reliability processes for high-throughput systems
  • Are comfortable with Python or similar languages, and exceptional at working across cloud platforms, container orchestration (e.g., Kubernetes), networking, and storage technologies
  • Build your own tools on the fly to diagnose, experiment, and address reliability problems—whether it's an internal dashboard or an automated remediation workflow
  • Bring a quantitative, hands-on approach to system operations, automation, and continuous improvement
Job Responsibility
  • Designing, building, and maintaining highly available, scalable infrastructure to support intensive AI/ML workloads and real-time model deployments
  • Implementing robust monitoring, alerting, and observability systems to ensure system health, performance, and uptime across cloud and on-prem environments
  • Debugging, optimizing, and automating infrastructure for fast iteration and rapid deployment cycles, focusing on both reliability and developer velocity
  • Proactively identifying, investigating, and resolving incidents to minimize downtime and maintain world-class service levels for enterprise customers
  • Collaborating closely with engineers, ML specialists, and founders to shape product, infrastructure, and security strategies
What we offer
  • Unlimited PTO
  • Lunch
  • Reimbursed Transportation
  • Insurance: Generous health insurance covering medical, dental, and vision
  • Health and Wellness Budget: We provide up to $150/mo reimbursement for health and wellness spending, such as gym memberships, fitness classes, or similar
  • Parental Leave
Employment type: Full-time

Sr. Cloud Infrastructure Engineer (AI & LLM Platforms)

We are seeking a specialized Infrastructure Engineer to bridge the gap between o...
Salary: Not provided
Q6 Cyber
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in DevOps, Platform Engineering, or SRE, with at least 1-2 years specifically focused on AI/ML infrastructure
  • Proven track record of building production-grade RAG pipelines or LLM-integrated applications
  • Thrives in 'day zero' environments where the tools and protocols (like MCP) are evolving weekly
  • Deep understanding of the security implications of LLMs (prompt injection, data leakage, and secure tool execution)
  • Experience working with substantial datasets (over 1bn objects, dozens or hundreds of TBs) and the challenges of leveraging AI tools with these data sets
  • Bachelor's degree or equivalent in computer science or related field
  • Cloud & Orchestration: AWS/GCP/Azure, Kubernetes, Terraform, Helm
  • AI Frameworks: LangChain, LlamaIndex, LangGraph
  • Data & Vectors: Pinecone, Milvus, Qdrant, or pgvector
  • Apache Kafka/Pulsar
Job Responsibility
  • Guide the architecture that will allow us to leverage AI tools with our large existing data stores and incoming streams of real-time intelligence
  • Work closely with other infrastructure engineers and software development teams to integrate AI tools into existing systems
  • Design, deploy, and maintain Model Context Protocol (MCP) servers to allow LLMs to securely interact with our internal databases, APIs, and external tooling
  • Build and orchestrate sandboxed, scalable environments (e.g., using Docker or specialized runtimes) where users can safely build and execute AI agents
  • Develop and manage the infrastructure for our internal RAG (Retrieval-Augmented Generation) pipeline, including vector database management (e.g., Pinecone, Weaviate, or pgvector) and automated embedding pipelines
  • Utilize Kubernetes (K8s) and Infrastructure as Code (Terraform/Pulumi) to deploy LLM-related tools, ensuring high availability and low latency for model inference and data retrieval
  • Implement strict guardrails for data privacy within LLM workflows, ensuring internal datasets remain secure while being accessible to authorized AI tools
What we offer
  • We offer a competitive compensation package and comprehensive benefits package
Employment type: Full-time

AI/ML Engineer

Lead the design, development, and deployment of AI/ML models across the company’...
Location: Egypt, New Cairo
Salary: Not provided
Ethics HR
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Information Technology, AI, Computer Engineering or a related field from a reputable university
  • Proven expertise in designing and deploying end-to-end LLM systems in production
  • Deep proficiency in Python and modern LLM frameworks
  • Strong background in data engineering, feature engineering, and model optimization
  • Experience with cloud-based AI infrastructure (AWS, GCP, or Azure)
  • Ability to translate complex business requirements into scalable, ethical AI solutions
  • Demonstrated commitment to privacy, security, and ethical AI principles
  • Strong collaboration and communication skills, especially in cross-functional teams
Job Responsibility
  • Lead the design, development, and deployment of AI/ML models across the company’s product suite (core banking, PFM, SME, travel, shopping, and wealth)
  • Collaborate with product, engineering, and compliance teams to translate business and Sharia requirements into AI-driven solutions
  • Build and optimize data pipelines supporting real-time analytics, recommendations, and fraud detection
  • Ensure the ethical, explainable, and compliant use of AI/ML in all aspects of the company’s platform
  • Implement AI-powered personalization for customer experiences and financial wellness tools
  • Mentor and guide junior engineers and data scientists in best practices and advanced methodologies
  • Stay abreast of AI/ML research, tools, and ethical frameworks relevant to fintech and Islamic finance

Principal Software Consultant - AI/ML Engineer

As an ML Team Lead, you will be responsible for leading the technical direction ...
Location: Pakistan, Lahore, Karachi, Islamabad
Salary: Not provided
10Pearls
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Artificial Intelligence, Data Science, Software Engineering, or a related field
  • 7+ years of professional software engineering experience with at least 5 years of hands-on experience building and deploying ML systems into production
  • Prior experience as a Tech Lead, Staff Engineer, or hands-on lead for AI/ML engineering teams
  • Strong expertise in classical machine learning domains such as forecasting, ranking, classification, and optimization
  • Hands-on experience building modern LLM and agentic AI systems including RAG pipelines, tool-using agents, multi-step workflows, and evaluation systems
  • Strong proficiency in Python and backend system development
  • Experience with ML frameworks such as PyTorch or TensorFlow
  • Strong understanding of scalable distributed systems, APIs, system integration, architecture design, and production engineering practices
  • Experience operating ML services at scale, including SLO management, monitoring, on-call practices, and incident response
  • Experience working with Kubernetes-based deployments, CI/CD pipelines, and modern cloud-native engineering practices
Job Responsibility
  • Lead the technical direction for the team’s ML and LLM systems, including architecture patterns, platform choices, evaluation frameworks, and engineering standards
  • Stay hands-on by designing and implementing complex ML and agentic AI systems, writing production-grade code, and leading through technical execution
  • Design, develop, and deploy scalable ML and LLM-powered applications and services in production environments
  • Build and optimize AI-powered solutions such as RAG systems, multi-step agents, AI assistants, chatbots, forecasting systems, ranking models, classification models, and optimization systems
  • Drive architecture and design reviews to ensure scalability, reliability, security, and maintainability of AI/ML systems
  • Own the technical roadmap for ML/LLM initiatives and translate business objectives into execution plans and scalable solutions
  • Collaborate closely with Product Managers, Engineers, Data Engineers, MLOps Engineers, QA Engineers, and cross-functional stakeholders to deliver business-aligned AI solutions
  • Establish engineering best practices for prompt engineering, model evaluation, regression testing, observability, and production readiness
  • Define and implement quality standards, evaluation suites, acceptance metrics, and regression plans for all AI/ML features
  • Ensure high availability, scalability, and resilience of tier-1 ML services through SLOs, monitoring, incident response, failover strategies, circuit breakers, and multi-zone deployments
Employment type: Full-time

AI/ML Engineer

Location: United States, Riviera Beach
Salary: Not provided
Robert Half
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s in Computer Science, Engineering, Mathematics, or equivalent experience
  • 3–8+ years of hands-on experience shipping ML/AI systems to production
  • Expert-level Python and deep proficiency in PyTorch (preferred) or TensorFlow/JAX
  • Proven track record with modern ML infrastructure: Docker, Kubernetes, Ray, Triton Inference Server, cloud ML platforms (SageMaker, Vertex AI, Bedrock)
  • Strong MLOps experience (MLflow, Airflow, feature stores, model registries, monitoring tools)
  • Solid software engineering fundamentals: testing, code reviews, system design, versioning
  • Experience integrating models into larger systems (FastAPI, microservices, streaming pipelines)
Job Responsibility
  • Design and implement end-to-end ML pipelines: data ingestion, feature engineering, training, evaluation, deployment, and monitoring
  • Develop, optimize, and productionize models using PyTorch/TensorFlow/JAX (including LLMs, vision, multimodal, and custom architectures)
  • Optimize inference for latency, memory, and cost (quantization, pruning, distillation, TensorRT, ONNX, vLLM)
  • Integrate models into backend systems via REST/gRPC APIs, event-driven architectures, or real-time serving
  • Own MLOps practices: experiment tracking (MLflow, W&B), model registry, CI/CD for ML, canary deployments, drift detection, and observability
  • Collaborate with data scientists to harden research prototypes into clean, tested, production-ready code
  • Build and maintain retrieval-augmented generation (RAG), agentic workflows, and prompt-engineered systems when appropriate (LangChain, LlamaIndex)
  • Continuously monitor, retrain, and improve live models to maintain performance and reliability
What we offer
  • medical
  • vision
  • dental
  • life and disability insurance
  • company 401(k) plan
Employment type: Full-time

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location: United States, Chevy Chase; New York City; Palo Alto
Salary: 115000.00 USD / Year
Geico
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or for specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment type: Full-time