AI SRE / AI Ops engineer Job at Realign (Montreal)

AI Ops Platform Engineer

Join us as an AI Ops Engineer, to build and run an enterprise AI Factory within ...

Location

United Kingdom , London

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

LLMOps / MLOps at production scale, operating the full Generative AI lifecycle including models, prompts and agents, CI/CD pipelines, structured evaluation, drift and hallucination monitoring, and controlled, auditable release processes suitable for banking environments
Cloud‑native AI platform engineering on AWS, with hands‑on delivery using services such as Amazon Bedrock for foundation models, agent orchestration patterns, Lambda and Step Functions, alongside demonstrated Python engineering capability and secure microservices and API design
AI governance, observability and cost optimisation, embedding governance by design through policy as code, alignment to model risk framework expectations, lifecycle traceability and audit‑ready evidence, supported by SRE‑grade monitoring and ongoing optimisation of token usage and compute cost across AI workloads

Job Responsibility

Build and run an enterprise AI Factory within our Card Merchant Services organisation, enabling AI‑driven change across the merchant payments lifecycle
Accountable for the end‑to‑end operationalisation of AI, spanning model, prompt, and agent lifecycles; deployment and monitoring; guardrails; and cost optimisation, ensuring AI solutions are production‑ready, auditable, compliant, and scalable across merchant payment use cases
Accountable for the end‑to‑end engineering of GenAI and ML platforms, embedding governance, observability and operational resilience by design, while enabling teams to deploy and run AI solutions with clarity, assurance and accountability at scale
Lead and manage engineering teams, providing technical guidance, mentorship, and support to ensure the delivery of high-quality software solutions
Oversee timelines, team allocation, risk management and task prioritization
Mentor and support team members' professional growth, conduct performance reviews, provide actionable feedback, and identify opportunities for improvement
Evaluate and enhance engineering processes, tools, and methodologies

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Principal AI Ops Architect

Scale’s rapidly growing Global Public Sector team is focused on using AI to addr...

Location

Qatar; United Kingdom , Doha; London

Salary:

Not provided

Scale

Expiration Date

Until further notice

Requirements

6+ years in a high-impact technical role (SRE, FDE or MLOps) with experience in the public sector
Familiarity with international government security standards and the complexities of deploying sovereign AI
Proven experience maintaining production-grade applications with a deep understanding of the full request lifecycle, connecting frontend/API layers to the backend and AI core
Proficiency in coding and the modern AI infrastructure, including Kubernetes, vector databases, agentic development, and LLM observability tools
Ownership: You treat every production deployment as your own. You race toward solving hard problems before the customer even sees them
Reliability: You understand that in the public sector, a model failure may be a risk to public safety or privacy
Customer communication: The ability to explain to a high-ranking official why the performance of the system has degraded and how we are fixing it

Job Responsibility

Own the production outcome: Take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies
Ensure Full-Stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment
Scale the feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability
Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks
Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again
Bridge the gap: Translate deep technical performance metrics into clear insights for senior international government officials
Drive product evolution: Partner with our Engineering and ML teams to ensure the lessons learned in the field directly influence the technical architecture and decisions of future use cases

Senior AI Engineer

We are seeking a Senior AI Engineer (L4, Individual Contributor) to design, buil...

Location

India , Chennai

Salary:

Not provided

Arcadia

Expiration Date

Until further notice

Requirements

12+ years of professional software engineering experience
3+ years in AI/ML development
Strong expertise in Python, PyTorch/TensorFlow, scikit-learn, and ML tooling (MLflow, LangChain)
Proficiency with SQL, cloud services (AWS), containers (Docker, Kubernetes), and distributed systems
Understanding of modern AI research (LLMs, diffusion models, transformers)
Experience deploying ML models in production with CI/CD
Strong analytical skills, ability to balance speed and rigor in experimentation
A passion for sustainability and the clean-energy mission
Experience building agentic pipelines with the latest models from Anthropic, Google, OpenAI, and more

Job Responsibility

Integrate with LLMs and be an expert in prompt engineering to derive the right results from the models with limited hallucination
Design and train ML/AI models (forecasting, NLP, graph learning, generative AI) to improve data quality, cost effectiveness, and system scalability
Deploy and optimize models for large-scale production workloads using Python-based services in AWS/Kubernetes environments
Build robust, automated data pipelines and ML Ops workflows for continuous training and deployment
Research and experiment with modern AI methods (transformers, foundation models, reinforcement learning) and adapt them to energy-sector challenges, including but not limited to utility statements
Drive performance improvements in model accuracy, latency, and cost efficiency
Collaborate with Product, SRE, and Analytics teams to deliver AI-enabled features across Arcadia’s platform
Write clean, maintainable code, contribute to architecture reviews, and mentor junior engineers
Build true agentic workflows with multi-step processing incorporating RAG pipelines and MCPs

What we offer

Competitive compensation and employee stock options
Hybrid/remote-first working model (India-based role, with global collaboration)
Flexible leave policy
Comprehensive medical insurance (self + family members)
Annual performance cycle + quarterly recognition awards
A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation

Fulltime

AI Applications Ops Lead

Scale’s rapidly growing International Public Sector team is focused on using AI ...

Location

Qatar; United Kingdom , Doha; London

Salary:

Not provided

Scale

Expiration Date

Until further notice

Requirements

6+ years in a high-impact technical role (SRE, FDE or MLOps) with experience in the public sector
Familiarity with international government security standards and the complexities of deploying sovereign AI
Proven experience maintaining production-grade applications with a deep understanding of the full request lifecycle, connecting frontend/API layers to the backend and AI core
Proficiency in coding and the modern AI infrastructure, including Kubernetes, vector databases, agentic development, and LLM observability tools
Ownership: You treat every production deployment as your own. You race toward solving hard problems before the customer even sees them
Reliability: You understand that in the public sector, a model failure may be a risk to public safety or privacy
Customer communication: The ability to explain to a high-ranking official why the performance of the system has degraded and how we are fixing it

Job Responsibility

Own the production outcome: Take full accountability for the long-term performance and reliability of AI use cases deployed across international government agencies
Ensure Full-Stack integrity: Oversee the end-to-end health of the platform, ensuring seamless integration between the AI core and all full-stack components, from APIs to UI, to maintain a responsive and production-ready environment
Scale the feedback loop: Build automated systems to monitor model performance and data drift across geographically dispersed environments, ensuring the right levels of reliability
Navigate global compliance: Manage the technical lifecycle within diverse regulatory frameworks
Incident command: Lead the response for production issues in mission-critical environments, ensuring rapid resolution and building the guardrails to prevent them from happening again
Bridge the gap: Translate deep technical performance metrics into clear insights for senior international government officials
Drive product evolution: Partner with our Engineering and ML teams to ensure the lessons learned in the field directly influence the technical architecture and decisions of future use cases

SRE Design & Support Engineer

We are looking for a self-driven, software engineering mindset SRE engineer to •...

Location

India , Hyderabad

Salary:

Not provided

PepsiCo

Expiration Date

Until further notice

Requirements

8-11 years of work experience, evolving into an SRE engineer role
3-5 years of experience in continuously improving and transforming IT operations ways of working
Bachelor’s degree in Computer Science, Information Technology or a related field
Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
The ideal engineer will be highly quantitative, have great judgment, be able to connect dots across ecosystems, and work efficiently cross-functionally across teams to ensure SRE orchestration solutions meet customer/end-user expectations
The candidate will take a pragmatic approach to resolving incidents, including the ability to systematically triangulate root causes and work effectively with external and internal teams to meet objectives
Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes, with a track record of improving service offerings: proactively resolving incidents, providing a seamless customer/end-user experience, and identifying and mitigating areas of risk
Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
A firm understanding of cloud architecture for distributed environments
Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js

Job Responsibility

Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
Ensure non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
Act as a proactive SRE support engineer: prevent P1, P2 and potential P3 incidents, diagnose anomalies before they reach any user, and drive the necessary remediations across the teams involved in the end-to-end availability, performance and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
Work closely with customer-facing support teams to empower them with SRE insights and tooling
Observe, diagnose and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
Continuously optimize the L2/support operations work via SRE workflow automation
Shape the SRE orchestration platform design with inputs from Production Operations, Business usage & Product and engineering teams
Actively engage and drive AI Ops adoption across teams

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...

Location

Mexico , Miguel Hidalgo

Salary:

Not provided

PepsiCo

Expiration Date

Until further notice

Requirements

8+ years of work experience, evolving into an SRE engineer role
3-5 years of experience in continuously improving and transforming IT operations ways of working
Bachelor’s degree in Computer Science, Information Technology or a related field
Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
Highly quantitative, with great judgment, able to connect dots across ecosystems and work efficiently cross-functionally across teams
Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
A firm understanding of cloud architecture for distributed environments
Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)

Job Responsibility

Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design and project roadmap to enable resilient outcomes
Apply a pre-emptive approach in production to minimize business impact, via SRE-driven orchestration that connects all components of the ecosystem, diagnosing anomalies before they reach users and remediating through automation
Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2 and potential P3 incidents
Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
Accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
Lead the team in diagnosing anomalies before they reach any user and driving the necessary remediations across the teams involved in the end-to-end availability, performance and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
Work closely with customer-facing support teams to empower them with SRE insights and tooling
Observe, diagnose and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
Continuously optimize the L2/support operations work via SRE workflow automation

What we offer

Opportunities to learn and develop every day through a wide range of programs
Internal digital platforms that promote self-learning
Development programs focused on leadership skills
Specialized training according to the role
Learning experiences with internal and external providers
Recognition programs for seniority, behavior, leadership, moments of life, among others
Financial wellness programs that will help you reach your goals in all stages of life
A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Principal AIOps Engineer

We’re building a world of health around every individual — shaping a more connec...

Location

United States

Salary:

144200.00 - 288400.00 USD / Year

CVS Health

Expiration Date

July 01, 2026

Requirements

10+ years of experience in SRE/production operations supporting highly available services, along with experience with a product model
Proven technical leadership: ability to set direction, lead cross-team initiatives, and advise stakeholders through architecture reviews, tradeoffs, and operational readiness
Strong programming/scripting skills (Python preferred) and experience building automation, integrations, and APIs
Experience integrating observability platforms and event sources across hybrid environments (cloud/on-prem) and operating production-grade monitoring/event management at scale
Strong ServiceNow experience as an ITSM system of record (Incident/Problem/Change; CMDB/asset concepts). Ability to build and operate integrations at scale (REST, webhooks, event management) to support automation and auditability
Python (preferred) for automation and data/ML pipelines; experience building integrations, services, and operational tooling
Workflow orchestration and integrations (ServiceNow APIs, event pipelines, runbook automation) with strong reliability, security, and auditability practices
Observability: Prometheus/Grafana, OpenTelemetry, ELK/Splunk/Datadog (or equivalent)

Job Responsibility

Lead the AIOps strategy, roadmap, and operating model (intake, triage, automation lifecycle, KPIs) to measurably improve MTTR, alert quality, and operational efficiency
Own the observability-to-AIOps pipeline (metrics, logs, traces, events) and drive standardization of telemetry, service health models, and actionable alerting across teams and platforms
Design and implement event intelligence: correlation, deduplication, suppression, anomaly detection, incident clustering, and probable-cause analysis using topology/CMDB context
Advise operations, service owners, and leadership stakeholders; lead change enablement, adoption, and value measurement for AIOps and agentic automation across the organization
Develop ServiceNow-centric AIOps integrations (ITSM + ITOM/Event Management where applicable): event ingestion, alert-to-incident policies, enrichment, assignment/routing, approvals, change workflows, and closure updates for auditable closed-loop ops
Establish governance for operational AI (risk controls, approvals, auditability, data access, prompt/response logging, evaluation, and continuous improvement) in partnership with security, compliance, and operations
Build and operationalize agentic AI workflows for incident triage and resolution: signal summarization, similar-incident retrieval, knowledge article drafting, ticket updates, stakeholder communications, and human-in-the-loop remediation
Enable closed-loop automation and self-healing by connecting AIOps detections to orchestrated actions (runbooks/workflows), with clear approvals, safety checks, and rollback paths
Partner with NOC/SOC, infrastructure, and application owners to onboard services into AIOps, define service models, and improve signal quality, escalation paths, and operational readiness

What we offer

Medical, dental, and vision coverage
Paid time off
Retirement savings options
Wellness programs
Bonus, commission or short-term incentive program
Equity award program

Fulltime

Director of Platform Engineering & Operations

NetApp is seeking a strategic and execution-oriented Director of Platform Engine...

Location

United States , RTP

Salary:

199750.00 - 298100.00 USD / Year

NetApp

Expiration Date

Until further notice

Requirements

12+ years of progressive experience in infrastructure engineering and operations
7+ years of leadership experience managing global, distributed teams at scale
Deep expertise in: Hybrid compute platforms (virtualization, containerization, public cloud IaaS/PaaS)
Enterprise storage technologies (block, file, object, hybrid architectures)
Global DDI services (enterprise DNS, DHCP, IPAM architectures)
Demonstrated experience implementing Infrastructure as Code and CI/CD-driven infrastructure delivery
Proven track record driving automation at scale across enterprise infrastructure
Strong experience with AI-Ops platforms, observability stacks, and operational analytics
Experience leading both engineering (build) and operations (run) functions within a unified organization

Job Responsibility

Define and execute the strategy for enterprise compute, storage, and DDI platforms across hybrid (on-prem and cloud) environments
Drive modernization of infrastructure services using IaC, GitOps, CI/CD automation, and policy-as-code frameworks
Lead the evolution toward self-service platform models with clear service catalogs, SLOs, and reliability metrics
Partner with executive stakeholders across IT, Security, Engineering, and Product to align platform capabilities with business priorities
Establish multi-year roadmaps for infrastructure transformation, cost optimization, resilience, and scalability
Oversee architecture, engineering, and lifecycle management of: On-prem and cloud-based compute platforms
On-prem and cloud-based storage platforms
Global DDI services (DNS, DHCP, IPAM)
Certificate lifecycle management
Standardize infrastructure patterns across data centers and public cloud providers

What we offer

Health Insurance
Life Insurance
Retirement or Pension Plans
Paid Time Off
Various leave options
Employee stock purchase plan and/or restricted stock units (RSUs)

Fulltime
