AI Engineer – Intelligent Operations (Infrastructure)

Canada, Toronto · 129150.00 USD / Year · Job Posted March 21, 2026

Job Description

We are seeking an experienced AI Engineer – Intelligent Operations (Infrastructure) to design and implement AI-driven solutions that enhance infrastructure monitoring, automation, and operational efficiency. The ideal candidate will work at the intersection of AI/ML, cloud infrastructure, and DevOps to build intelligent operational systems.

Job Responsibility

  • Develop and deploy AI/ML models for infrastructure monitoring and predictive maintenance
  • Automate incident detection, root cause analysis, and remediation workflows
  • Integrate AI solutions with cloud and on-prem infrastructure platforms
  • Build data pipelines for infrastructure logs and telemetry analysis
  • Collaborate with DevOps, SRE, and Cloud teams
  • Optimize system performance, scalability, and reliability
  • Implement MLOps practices for model deployment and lifecycle management
  • Provide technical leadership and documentation

Requirements

  • Strong experience in Python and AI/ML frameworks (TensorFlow, PyTorch, Scikit-learn)
  • Experience working with infrastructure monitoring data (logs, metrics, traces)
  • Knowledge of cloud platforms (AWS, Azure, or GCP)
  • Experience with Docker and Kubernetes
  • Understanding of DevOps and CI/CD practices
  • Strong analytical and problem-solving skills

Nice to have

  • Experience in AIOps or Intelligent Automation
  • Knowledge of monitoring tools (Splunk, Datadog, Prometheus, etc.)
  • Experience with MLOps tools (MLflow, SageMaker, Vertex AI)
  • Strong communication and stakeholder collaboration skills

Similar Jobs for AI Engineer – Intelligent Operations (Infrastructure)

8 matching positions

Software Engineer - AI Infrastructure

We’re looking for a software engineer to join our Infrastructure team—building a...
Location: United States, New York City
Salary: 135000.00 - 280000.00 USD / Year
Company: Assembled
Expiration Date: Until further notice
Requirements
  • Have 6+ years of engineering experience, with past ownership of high-scale, production-critical infrastructure
  • Have experience with distributed systems and container orchestration (especially Kubernetes)
  • Have worked with AI/ML platforms or are excited to build foundational infrastructure for LLM-based applications
  • Thrive in fast-paced environments with shifting requirements and ambiguous problem spaces
  • Are motivated by impact, enjoy deep technical challenges, and want to work cross-functionally across security, AI, and product
  • Have strong familiarity with one or more parts of our tech stack:
      ◦ Cloud provider: AWS
      ◦ Orchestration: Kubernetes + Karpenter
      ◦ LLM integration: Experience with OpenAI, Anthropic, or open-source model serving (e.g., vLLM, HuggingFace TGI, Ray Serve)
      ◦ Prompt & embedding infrastructure: Vector databases (e.g., Pinecone, Weaviate, PGVector), semantic search, prompt templating systems
      ◦ Datastores: Postgres + PgBouncer, Snowflake, Redis
Job Responsibility
  • Agent service reliability and scaling: We manage and scale the infrastructure that serves LLM-powered agents across chat, email, and voice. This includes selecting inference strategies, integrating with model providers (e.g. OpenAI, Anthropic), and dynamically routing traffic for performance and cost efficiency
  • Prompt and embedding storage systems: Assist relies heavily on dynamically generated prompts and semantic search across support content. The team owns highly-available, fast-access storage and indexing layers optimized for real-time AI interactions
  • Privacy and security: Enterprises expect strict guardrails around AI use. We’re building systems like network-level intrusion detection (IDS/IPS), audit logging, and LLM usage policy enforcement to meet these expectations and unlock new sales channels
  • Observability and usage analytics: We operate systems that surface key metrics—token usage, latency, cost per response, and quality signals—so the Assist team can continuously improve Assist’s performance and accuracy
  • AI-powered developer tools: We are beginning to explore and evangelize the use of AI to accelerate internal engineering workflows—through internal chat agents, pair programming tools, and intelligent automation for deployment, debugging, and on-call. Our goal is to empower engineers across the company to build faster and more confidently with AI
What we offer
  • Generous medical, dental, and vision benefits
  • Paid company holidays, sick time, and unlimited time off
  • Monthly credits to spend on each: professional development, general wellness, Assembled customers, and commuting
  • Paid parental leave
  • Hybrid work model with catered lunches every day (M-F), snacks, and beverages in our SF & NY offices
  • 401(k) plan enrollment
  • Stock options are provided as part of the compensation package
  • Full-time

Senior Software Engineer - AI & Intelligent Tooling

We build simple yet innovative consumer products and developer APIs that shape h...
Location: United States, San Francisco
Salary: 180000.00 - 270000.00 USD / Year
Company: Plaid
Expiration Date: Until further notice
Requirements
  • 5+ years of professional software engineering experience
  • Experience building and operating production systems
  • Hands-on experience using AI-powered developer tools
  • Strong problem-solving and collaboration skills
Job Responsibility
  • Design, build, and operate internal tools and services used by Plaid’s engineers
  • Integrate AI-powered tools and workflows into core engineering processes
  • Improve reliability, usability, and performance of existing internal platforms
  • Own systems end-to-end, including production support and iterative improvement
  • Collaborate with teammates and stakeholders to deliver practical, high-impact solutions
What we offer
  • Medical, dental, vision, and 401(k)
  • Full-time

AI Research Infrastructure Engineer

Block is scaling Customer Insights into an AI-powered insights accelerator that ...
Location: United States, Bay Area
Salary: 168300.00 - 297000.00 USD / Year
Company: Cash App
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in research, automation implementation, analytics, or related technical fields with hands-on workflow optimization experience
  • 3+ years implementing AI/ML solutions, with experience in automation, LLM integration, or applied AI/analytics workflows
  • Hands-on technical skills in programming languages (Python, R, SQL) for automation development, API/MCP integrations, cloud platforms, and research data pipeline creation
  • Experience with research and analytic platforms and tools (Qualtrics, Snowflake, etc.) or transferable experience with analytics and automation platforms
  • Strong technical communication and translation skills with ability to make complex AI/ML concepts, data architecture decisions, and automation workflows accessible and actionable for researchers, product managers, and business stakeholders
  • Proven ability to build stakeholder confidence and alignment during technology transformation
  • Strong project management skills with ability to coordinate multiple complex automation initiatives, manage competing priorities, and deliver measurable operational efficiency gains (reduced cycle times, improved quality outcomes, increased research capacity)
  • Familiarity with financial services, fintech, or payments industry research contexts and regulatory requirements preferred
Job Responsibility
  • Design, build, and deploy AI agents and agentic workflows that automate research operations from study design through insights delivery, using LLMs, prompt engineering, MCP (Model Context Protocol) integrations, and workflow orchestration integrated with existing research and analytics tech stack
  • Design, build, and maintain automated data pipelines that ingest, transform, and unify research data from diverse sources (surveys, transcripts, analytics, behavioral logs) into AI-ready repositories with RAG capabilities for instant insight access via tools like Goose
  • Architect ETL/ELT frameworks using Python, SQL or equivalent tools to ensure data consistency, traceability, and scalability
  • Develop data models and schemas for research metadata, participant data, and AI-generated insights to support efficient querying and analysis
  • Design and prototype research automation systems using AI/ML techniques, partnering with design & engineering teams to productionize solutions
  • Partner with engineering, design, and platform teams to integrate research automation systems with Block's tech stack (e.g., Goose, GitHub) and establish governance frameworks for quality, ethics, and compliance
  • Mentor team members on AI agent development, agentic system design, and research automation best practices to build organizational capabilities in intelligent automation
What we offer
  • Remote work
  • Medical insurance
  • Flexible time off
  • Retirement savings plans
  • Modern family planning
  • Full-time

AI Application Operations & Maintenance Engineer (Azure)

The organization is seeking a professional specialized in Application Maintenanc...
Location: Albania, Tirana
Salary: Not provided
Company: Business Integration Partners
Expiration Date: Until further notice
Requirements
  • Experience in Application Maintenance and Operations for enterprise applications
  • Solid knowledge of Python in an application context focused on AI functionalities
  • Operational knowledge of Microsoft Azure and its main PaaS services
  • Experience with Azure Kubernetes Service (AKS) and containerized workloads
  • Strong troubleshooting skills based on logs, metrics, and alerts
  • Knowledge of monitoring, logging, and observability principles
  • Familiarity with microservices architectures and multi-layer environments
  • Understanding of IAM concepts, Managed Identities, and secret management
  • Experience operating AI / Generative AI solutions in production
  • Knowledge of Azure OpenAI, embedding services, and vector search
Job Responsibility
  • Manage corrective and adaptive maintenance activities for AI applications in production
  • Analyze and resolve application incidents and anomalies across front-end, back-end, and service layers
  • Support application release activities and configuration management across different environments (Dev/Test/Prod)
  • Collaborate with development teams to analyze application issues and improve overall software quality
  • Provide operational support for solutions based on Azure Kubernetes Service (AKS), including management of containerized workloads
  • Continuously monitor application and infrastructure services using Azure Monitor, Log Analytics, and Application Insights
  • Analyze application logs, metrics, and alerts to ensure appropriate levels of reliability and performance
  • Perform advanced troubleshooting on data ingestion pipelines, AI services, search services, and databases
  • Provide operational support for data persistence services, including Azure SQL Database for structured data, Azure Cosmos DB for unstructured data and conversational history, and Azure Storage Accounts (Blob Storage) for document repositories
  • Verify and support correct content indexing and retrieval through Azure AI Search, including vector search and similarity search
  • Full-time

AI Research Engineer, Data Infrastructure

As a Research Engineer in Infrastructure, you will design and implement a robust...
Location: United States, Palo Alto
Salary: 180000.00 - 250000.00 USD / Year
Company: 1X Technologies
Expiration Date: Until further notice
Requirements
  • Strong experience in building data pipelines and ETL systems
  • Ability to design and implement systems for data collection and management from robotic fleets
  • Familiarity with architectures that span on-robot components, on-premise clusters, and cloud infrastructure
  • Experience with data labeling tools or building dataset visualization and annotation tooling
  • Proficiency in creating or applying machine learning models for dataset organization and automated labeling
Job Responsibility
  • Optimize operational efficiency of data collection across the NEO robot fleet
  • Design intelligent triggers to determine when and what data should be uploaded from the robots
  • Automate ETL pipelines to make fleet-wide data easily queryable and training-ready
  • Collaborate with external dataset providers to prepare diverse multi-modal pre-training datasets
  • Build frontend tools for visualizing and automating the labeling of large datasets
  • Develop machine learning models for automatic dataset labeling and organization
What we offer
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Full-time

AI Research Engineer, Data Infrastructure

As a Research Engineer in Infrastructure, you will design and implement a robust...
Location: United States, Palo Alto
Salary: 180000.00 - 250000.00 USD / Year
Company: 1X Technologies
Expiration Date: Until further notice
Requirements
  • Strong experience in building data pipelines and ETL systems
  • Ability to design and implement systems for data collection and management from robotic fleets
  • Familiarity with architectures that span on-robot components, on-premise clusters, and cloud infrastructure
  • Experience with data labeling tools or building dataset visualization and annotation tooling
  • Proficiency in creating or applying machine learning models for dataset organization and automated labeling
Job Responsibility
  • Optimize operational efficiency of data collection across the NEO robot fleet
  • Design intelligent triggers to determine when and what data should be uploaded from the robots
  • Automate ETL pipelines to make fleet-wide data easily queryable and training-ready
  • Collaborate with external dataset providers to prepare diverse multi-modal pre-training datasets
  • Build frontend tools for visualizing and automating the labeling of large datasets
  • Develop machine learning models for automatic dataset labeling and organization
What we offer
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Full-time

Full Stack Engineer (AI & Agentic AI Systems)

The Full Stack Engineer (AI & Agentic AI Systems) is a strategic professional wh...
Location: India, Pune; Chennai
Salary: Not provided
Company: Citi
Expiration Date: Until further notice
Requirements
  • 8+ years in a product development/product management environment
  • Strong analytical and quantitative skills
  • Data-driven and results-oriented
  • Experience delivering with an agile methodology
  • Experience in effecting large culture change
  • Experience leading infrastructure programs
  • Skilled at working with third-party service providers
  • Excellent written and oral communication skills
  • Bachelor’s/University degree or equivalent experience
  • Strong expertise in SQL (Oracle, PostgreSQL)
Job Responsibility
  • Design and deliver end‑to‑end solutions spanning architecture, system design, low‑level design, and high‑quality coding across modern full‑stack environments
  • Build responsive, modular UI applications using React, integrating complex AI-driven workflows and real‑time interactions
  • Develop scalable, high‑performance backend services in Java / Python, implementing resilient APIs, event‑driven patterns, and microservices architectures
  • Engineer AI‑powered features leveraging Google Gemini LLM, Vertex AI, ADK, vector databases (A2A), RAG pipelines, MCP, context engineering, and advanced prompt engineering techniques
  • Implement secure, well‑structured REST and GraphQL APIs, ensuring reliability, versioning discipline, and clean integration patterns across platforms
  • Optimize system performance and scalability, applying profiling, load‑testing insights, caching strategies, and distributed system tuning
  • Drive robust CI/CD practices, integrating automated testing, code quality gates, containerization, and cloud‑native deployment pipelines
  • Partner with QE to build and maintain automated test suites (UI, API, integration, and performance), improving release quality and reducing regression risk
  • Identify, diagnose, and remediate performance bottlenecks, penetration testing vulnerabilities, and production issues with precision and root‑cause clarity
  • Collaborate cross‑functionally with AI scientists, architects, and product teams to translate business challenges into production‑ready, intelligent agentic systems
  • Full-time

Artificial Intelligence (AI) Engineer

We are looking for an Artificial Intelligence (AI) Engineer to support the desig...
Location: United States, Albuquerque
Salary: Not provided
Company: Robert Half
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in computer science, software engineering, information technology, or a related technical field, or equivalent practical experience
  • Must have experience deploying Kubernetes and MCP servers integrated with AI data sources
  • At least 2 years of hands-on experience supporting AI or machine learning platforms, model deployment, MLOps processes, or AI-focused infrastructure
  • Demonstrated experience deploying and managing server-based workloads in Kubernetes environments
  • Strong programming and automation capabilities using Python, Bash, or similar scripting languages
  • Solid understanding of DevOps and MLOps practices, including Git-based development, CI/CD pipelines, containers, and Kubernetes orchestration
  • Experience working with AI and machine learning frameworks such as PyTorch, Hugging Face, or related ecosystems
  • Familiarity with enterprise security and compliance requirements, including authentication approaches such as OAuth and regulated operating environments
  • Ability to communicate effectively with both technical and non-technical teams and collaborate across multiple functions
  • Secret Security Clearance – Active or Inactive or ability to get a clearance
Job Responsibility
  • Direct the rollout and integration of AI platforms and services, ensuring they work effectively with existing enterprise technologies and operational standards
  • Architect, implement, and refine AI infrastructure in partnership with cloud, server, and platform engineering teams to support dependable system performance
  • Move machine learning solutions from development into production by establishing repeatable processes for deployment, maintenance, and long-term support
  • Create and manage CI/CD and MLOps workflows that cover model validation, packaging, release, rollback, and lifecycle oversight
  • Automate infrastructure and platform operations through scripting, infrastructure-as-code methods, and configuration management tools
  • Troubleshoot platform and service issues, perform root cause analysis, and produce clear technical documentation for support and maintenance activities
  • Strengthen system visibility by implementing logging, monitoring, alerting, and incident response practices across AI environments
  • Uphold security and compliance expectations by contributing to audits, remediation efforts, vulnerability management, and secure design reviews
  • Identify and deliver improvements that increase performance, scalability, reliability, and cost efficiency across AI-enabled systems
  • Work with technical and business stakeholders to align AI implementations with organizational priorities and evaluate emerging tools for long-term operational value
What we offer
  • Medical
  • Vision
  • Dental
  • Life and disability insurance
  • 401(k) plan