AI SRE / AI Ops engineer

Realign

Location:
Canada, Montreal

Contract Type:
Not provided

Salary:

140000.00 USD / Year

Requirements:

  • Production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
  • Excellent communication, documentation, and cross-team collaboration skills
  • Proven track record of reducing operational toil via automation

Nice to have:

Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

Additional Information:

Job Posted:
March 19, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for AI SRE / AI Ops engineer

Senior AI Engineer

We are seeking a Senior AI Engineer (L4, Individual Contributor) to design, buil...
Location:
India, Chennai
Salary:
Not provided
Arcadia
Expiration Date
Until further notice
Requirements
  • 12+ years of professional software engineering experience
  • 3+ years in AI/ML development
  • Strong expertise in Python, PyTorch/TensorFlow, scikit-learn, and ML tooling (MLflow, LangChain)
  • Proficiency with SQL, cloud services (AWS), containers (Docker, Kubernetes), and distributed systems
  • Understanding of modern AI research (LLMs, diffusion models, transformers)
  • Experience deploying ML models in production with CI/CD
  • Strong analytical skills, ability to balance speed and rigor in experimentation
  • A passion for sustainability and the clean-energy mission
  • Experience building agentic pipelines with the latest models from Anthropic, Google, OpenAI, and others
Job Responsibility
  • Integrate with LLMs and be an expert in prompt engineering to derive the right results from the models with limited hallucination
  • Design and train ML/AI models (forecasting, NLP, graph learning, generative AI) to improve data quality, cost effectiveness, and system scalability
  • Deploy and optimize models for large-scale production workloads using Python-based services in AWS/Kubernetes environments
  • Build robust, automated data pipelines and ML Ops workflows for continuous training and deployment
  • Research and experiment with modern AI methods (transformers, foundation models, reinforcement learning) and adapt them to energy-sector challenges, including but not limited to utility statements
  • Drive performance improvements in model accuracy, latency, and cost efficiency
  • Collaborate with Product, SRE, and Analytics teams to deliver AI-enabled features across Arcadia’s platform
  • Write clean, maintainable code, contribute to architecture reviews, and mentor junior engineers
  • Build true agentic workflows with multi-step processing incorporating RAG pipelines and MCPs
What we offer
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
Employment Type:
Full-time

SRE Design & Support Engineer

We are looking for a self-driven, software engineering mindset SRE engineer to •...
Location:
India, Hyderabad
Salary:
Not provided
Pepsico
Expiration Date
Until further notice
Requirements
  • 8-11 years of work experience, evolving into an SRE engineer
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE designing event diagnostics, performance measures, and alerting solutions to meet SLAs/SLOs/SLIs
  • The ideal engineer is highly quantitative, has great judgment, can connect the dots across ecosystems, and works efficiently across teams to ensure SRE orchestration solutions meet customer/end-user expectations
  • The candidate takes a pragmatic approach to resolving incidents, including the ability to systematically triangulate root causes and work effectively with external and internal teams to meet objectives
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes, with a track record of improving service offerings: proactively resolving incidents, providing a seamless customer/end-user experience, and identifying and mitigating areas of risk
  • Hands-on experience with Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and other SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
Job Responsibility
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
  • Ensure non-functional requirements (NFRs), including SLAs/SLOs/SLIs and error budgets, are embedded early in the product’s offerings as part of the engineering solution
  • Act as a proactive SRE support engineer: prevent P1, P2, and potential P3 incidents, diagnose anomalies before users notice them, and drive the necessary remediations across the teams responsible for the end-to-end availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
  • Shape the SRE orchestration platform design with inputs from Production Operations, Business usage & Product and engineering teams
  • Actively engage and drive AI Ops adoption across teams

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...
Location:
Mexico, Miguel Hidalgo
Salary:
Not provided
Pepsico
Expiration Date
Until further notice
Requirements
  • 8+ years of work experience, evolving into an SRE engineer
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE designing event diagnostics, performance measures, and alerting solutions to meet SLAs/SLOs/SLIs
  • Highly quantitative, with great judgment, able to connect the dots across ecosystems and work efficiently across teams
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands-on experience with Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and other SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: server-side languages and frameworks (Java, Spring Boot, and related technologies) for building server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, and Couchbase
Job Responsibility
  • Drive new shift-left activities that apply Site Reliability Engineering (SRE) and quality assurance principles within the application design and project roadmap to enable resilient outcomes
  • Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration to connect all components of the ecosystem, diagnose anomalies before users are affected, and remediate through automation
  • Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2, and potential P3 incidents
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
  • Be accountable for ensuring non-functional requirements (NFRs), including SLAs/SLOs/SLIs and error budgets, are embedded early in the product’s offerings as part of the engineering solution
  • Lead the team in diagnosing anomalies before users notice them and in driving the necessary remediations across the teams responsible for the end-to-end availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
What we offer
  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Director of Platform Engineering & Operations

NetApp is seeking a strategic and execution-oriented Director of Platform Engine...
Location:
United States, RTP
Salary:
199750.00 - 298100.00 USD / Year
NetApp
Expiration Date
Until further notice
Requirements
  • 12+ years of progressive experience in infrastructure engineering and operations
  • 7+ years of leadership experience managing global, distributed teams at scale
  • Deep expertise in: hybrid compute platforms (virtualization, containerization, public cloud IaaS/PaaS); enterprise storage technologies (block, file, object, hybrid architectures); global DDI services (enterprise DNS, DHCP, IPAM architectures)
  • Demonstrated experience implementing Infrastructure as Code and CI/CD-driven infrastructure delivery
  • Proven track record driving automation at scale across enterprise infrastructure
  • Strong experience with AI-Ops platforms, observability stacks, and operational analytics
  • Experience leading both engineering (build) and operations (run) functions within a unified organization
Job Responsibility
  • Define and execute the strategy for enterprise compute, storage, and DDI platforms across hybrid (on-prem and cloud) environments
  • Drive modernization of infrastructure services using IaC, GitOps, CI/CD automation, and policy-as-code frameworks
  • Lead the evolution toward self-service platform models with clear service catalogs, SLOs, and reliability metrics
  • Partner with executive stakeholders across IT, Security, Engineering, and Product to align platform capabilities with business priorities
  • Establish multi-year roadmaps for infrastructure transformation, cost optimization, resilience, and scalability
  • Oversee architecture, engineering, and lifecycle management of: on-prem and cloud-based compute platforms; on-prem and cloud-based storage platforms; global DDI services (DNS, DHCP, IPAM); certificate lifecycle management
  • Standardize infrastructure patterns across data centers and public cloud providers
What we offer
  • Health insurance
  • Life insurance
  • Retirement or pension plans
  • Paid time off
  • Various leave options
  • Employee stock purchase plan and/or restricted stock units (RSUs)
Employment Type:
Full-time

AI Platform Site Reliability Engineering Specialist

The AI Platform Site Reliability Engineering Specialist will operate and maintai...
Location:
India, Bengaluru
Salary:
Not provided
NTT DATA
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science or related field, or equivalent job experience
  • 5 years of production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Job Responsibility
  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling and resource forecasting
  • Optimize cost vs. performance trade-offs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

AI Applications Ops Lead

Scale’s rapidly growing International Public Sector team is focused on using AI ...
Location:
Qatar; United Kingdom, Doha; London
Salary:
Not provided
Scale
Expiration Date
Until further notice
Requirements
  • 6+ years in a high-impact technical role (SRE, FDE or MLOps) with experience in the public sector
  • Familiarity with international government security standards and the complexities of deploying sovereign AI
  • Proven experience maintaining production-grade applications, with a deep understanding of the full request lifecycle, connecting frontend/API layers to the backend and AI core
  • Proficiency in coding and modern AI infrastructure, including Kubernetes, vector databases, agentic development, and LLM observability tools
  • Ownership: You treat every production deployment as your own. You race toward solving hard problems before the customer even sees them
  • Reliability: You understand that in the public sector, a model failure may be a risk to public safety or privacy
  • Customer communication: The ability to explain to a high-ranking official why the performance of the system has degraded and how we are fixing it
Job Responsibility
  • Own the production outcome: Take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies
  • Ensure Full-Stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment
  • Scale the feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability
  • Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks
  • Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again
  • Bridge the gap: Translate deep technical performance metrics into clear insights for senior international government officials
  • Drive product evolution: Partner with our Engineering and ML teams to ensure the lessons learned in the field directly influence the technical architecture and decisions of future use cases

Senior Infrastructure & Platform Engineer

You’re turning innovative research and bespoke tooling into secure, scalable, ob...
Location:
Finland, Helsinki
Salary:
Not provided
ICEYE
Expiration Date
Until further notice
Requirements
  • Deep AWS experience operating production systems at scale
  • Strong Python skills for automation/tooling (reading service code as needed)
  • Batch/workflow orchestration
  • Containers at scale (Docker, ECR), CI/CD, artifact integrity and rollout strategies
  • IaC (Terraform/Terragrunt or CDK) and Git‑driven ops
  • Production observability and SRE practices (SLOs, incident response), CloudWatch/Datadog
  • Security fundamentals: IAM, network segmentation, encryption, vulnerability management
  • Kubernetes or equivalent
  • LLM IDE Tooling proficiency & curiosity (e.g. Cursor, Claude, Copilot)
Job Responsibility
  • Lead and evolve cloud foundations in AWS: multi‑account setup and guardrails (Organizations, IAM/SCP/SSO), secure networking, encryption (KMS), secrets, artifact governance
  • Support on-premise deployments when needed, working closely with other engineering teams
  • Choose the right compute & orchestration for the service/product needs we have
  • Codify everything: build reusable Terraform/Terragrunt/CDK modules, drift control, and environment promotion. Automate when possible
  • Harden CI/CD: SBOMs, image signing, policy gates, progressive delivery, fast rollback
  • Observability that matters: metrics/logs/traces (CloudWatch and/or Datadog), SLOs/error budgets, alerting, incident response with blameless postmortems
  • Security at speed: vulnerability management, supply‑chain hardening, least‑privilege by default, data‑access boundaries
  • Cost & performance: capacity planning, spot strategies, storage patterns for large rasters (S3, EFS/FSx), data‑locality aware processing
What we offer
  • Occupational healthcare, occupational and accident insurance
  • A yearly benefit budget to spend as you wish (e.g. on sport, transport, bike benefit, wellness, lunch, etc.)
  • Phone subscription with iPhone of choice
  • Relocation support (e.g. flight tickets, accommodation, relocation agency support)
  • Time for self-development, research, training, conferences, or certification schemes
  • Inspiring, collaborative offices and silent workspaces that enable you to focus
  • A wide variety of the best coffee, tea, snacks, and sweets to accompany your daily space mission
Employment Type:
Full-time

Senior Program Manager, Merchant Operations Success

At Uber, providing excellent customer support to our users is a core feature of ...
Location:
Netherlands, Amsterdam
Salary:
Not provided
Uber
Expiration Date
Until further notice
Requirements
  • 5+ years of experience in Customer Experience, Program Management, Operations, or Support Enablement, preferably in a fast-paced, tech-driven, or complex environment
  • Strong working knowledge of support performance metrics
  • Excellent stakeholder management
  • Strong communicator
  • Team player
  • Balance of strategic vision and operational rigor
  • Bias for action and problem-solving skills
  • Critical thinking
  • Data-driven and analytical approach
Job Responsibility
  • Drive Operational Excellence Across EMEA Merchant Operations
  • Own and Execute the EMEA Merchant Operations Strategy & Roadmap
  • Enhance the Merchant Experience Through Data & Insights
  • Optimize Support Delivery Through Specialization & Automation
  • Lead Market Expansion & Integration Initiatives
  • Act as a Primary EMEA Operations Point of Contact
  • Collaborate Cross-Functionally to Deliver Impact
  • Drive Governance, Reporting & Continuous Improvement
Employment Type:
Full-time