AIOps Automation Engineering Lead

https://www.citi.com/ Logo

Citi

Location:
India, Chennai

Contract Type:
Employment contract

Salary:

Not provided

Job Description:

The Engineering Lead Analyst is a senior-level position responsible for leading a variety of engineering activities, including the design, acquisition and deployment of hardware, software and network infrastructure in coordination with the Technology team. The position sits within the Production Management AIOps Organization, which is at the forefront of transforming production management and operations through cutting-edge technologies. The incumbent will lead efforts to automate routine production tasks, enhance predictive capabilities, reduce manual intervention and ensure integration of AI into existing operational workflows.

Job Responsibility:

  • Serve as a technology subject matter expert for internal and external stakeholders and provide direction for all firm mandated controls and compliance initiatives, all projects within the group and in creating a technology domain roadmap
  • ensure that all integration of functions meet business goals
  • define necessary system enhancements to deploy new products and process enhancements
  • recommend product customization for system integration
  • identify problem causality, business impact and root causes
  • exhibit knowledge of how own specialty area contributes to the business and apply knowledge of competitors, products and services
  • advise or mentor junior team members
  • impact the engineering function by influencing decisions through advice, counsel or facilitating services
  • drive and implement rigorous quality standards for all aspects of the automation delivery from initial concept to final implementation
  • continually evolve the working practices within and services provided by Production Management (regionally and globally) to improve efficiency and productivity
  • maintain continuous forward compatibility and acquire competency in automation, Artificial Intelligence, Robotic Process Automation, predictive analytics, etc.
  • apply decision analytics and technology platforms to deliver immediate results and long-term business impact
  • develop predictive models that will form the basis of information-driven strategies executed with respect to services provided by Production Management

Requirements:

  • 10+ years of relevant experience in an Engineering role
  • experience working in Financial Services or a large complex and/or global environment
  • project management experience
  • J2EE/microservices development experience, including running applications in cloud-native environments (Google Cloud, AWS, API Gateway technologies)
  • strong proficiency in JavaScript, including experience with ReactJS and NodeJS
  • experience with MongoDB or other NoSQL databases
  • solid understanding of Python and experience with relevant libraries
  • experience with version control systems like Git
  • knowledge of CI/CD pipelines and DevOps practices is a plus
  • consistently demonstrates clear and concise written and verbal communication
  • comprehensive knowledge of design metrics, analytics tools, benchmarking activities and related reporting to identify best practices
  • demonstrated analytic/diagnostic skills
  • ability to work in a matrix environment and partner with virtual teams
  • ability to work independently, multi-task, and take ownership of various parts of a project or initiative
  • ability to work under pressure and manage to tight deadlines or unexpected changes in expectations or requirements
  • proven track record of operational process change and improvement

Nice to have:

  • knowledge of CI/CD pipelines and DevOps practices
  • project management experience

What we offer:

  • Equal opportunity employer
  • consideration without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law

Additional Information:

Job Posted:
May 03, 2025

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for AIOps Automation Engineering Lead

Executive Director, Digital SRE & Operations

We’re building a world of health around every individual — shaping a more connec...
Location:
United States, Austin, Texas
Salary:
175100.00 - 334750.00 USD / Year
https://www.cvshealth.com/ Logo
CVS Health
Expiration Date
March 31, 2026
Requirements
  • 18+ years of experience in software engineering, platform operations, or site reliability engineering
  • 8+ years leading large-scale SRE, DevOps, or platform reliability organizations
  • Experience leveraging AI/ML for operations, including anomaly detection, predictive alerts, log analysis, or automated remediation
  • Familiarity with AIOps tools such as Datadog Watchdog, Dynatrace Davis, Splunk AI, Elastic AIOps, or custom ML/LLM solutions
  • Understanding of how to safely operate and monitor AI-enabled production systems
  • Deep expertise in distributed systems, cloud infrastructure, and high-availability architectures
  • Strong knowledge of SRE principles, DevOps, and reliability engineering at scale
  • Experience implementing AIOps or AI-driven operational tooling
  • Executive-level communication skills with the ability to influence senior leaders and business stakeholders
  • Experience operating mission-critical digital platforms serving millions of users
Job Responsibility
  • Define and own the enterprise SRE strategy, including SLOs, SLIs, error budgets, and reliability roadmaps
  • Establish reliability standards and practices across web, mobile, backend services, APIs, data platforms, and AI workloads
  • Drive a culture of reliability-by-design and operational excellence across engineering teams
  • Lead adoption of AIOps capabilities for proactive issue detection, alert noise reduction, and predictive failure prevention
  • Implement AI-assisted incident triage, automated runbooks, root-cause analysis, and self-healing systems
  • Partner with the AI Platform team to integrate LLMs and ML models into operational workflows (log summarization, anomaly detection, remediation)
  • Own enterprise observability strategy across metrics, logs, traces, and user experience monitoring
  • Standardize tooling and practices using platforms such as Datadog, Splunk, Prometheus, Grafana, OpenTelemetry
  • Deliver real-time dashboards and executive reporting on uptime, performance, latency, and error budgets
  • Partner with DevOps and Platform teams to ensure safe, automated, and scalable CI/CD pipelines
What we offer
  • Affordable medical plan options
  • 401(k) plan (including matching company contributions)
  • Employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
  • Tuition assistance
Employment Type:
Fulltime

AI Operations Tech Leader

We are looking for an experienced AI Ops Tech Leader – Operations Support to lea...
Salary:
Not provided
lingarogroup.com Logo
Lingaro
Expiration Date
Until further notice
Requirements
  • 10+ years in data engineering, AI/ML engineering, or operations support technology roles
  • 4-6+ years in technical leadership positions within operations support / IT operations / service operations environments
  • Proven track record delivering production AI/ML/data solutions that measurably improved operations support KPIs
  • Strong hands-on expertise with modern data/AI stacks (Python, Spark, Kafka, Airflow, cloud data platforms, PyTorch/TensorFlow, LLM frameworks) and integration into operations support ecosystems
  • Deep practical experience with AIOps patterns in live operations support settings: event correlation, anomaly detection, automated actions, predictive analytics, GenAI for ops
  • Experience leading development or significant enhancement of AIOps/internal tooling platforms specifically for operations support teams
  • Ability to stay deeply technical while leading people and strategy in a high-velocity operations support context
  • Excellent communication: can explain complex AI concepts to operations support practitioners and translate operational pain into technical roadmaps for executives
  • Strong bias for action, production impact, and reducing operational toil through intelligent automation
Job Responsibility
  • Actively lead and contribute to high-impact data/AI projects that directly improve operations support outcomes
  • Design and deliver scalable features embedded into operations support workflows and platforms
  • Ensure solutions meet strict operations support SLAs for reliability, low latency, auditability, explainability, and zero-downtime deployment
  • Stay up to date with innovations and research in AIOps tools
  • Lead the architecture, development, and continuous enhancement of internal AIOps platforms and reusable components that power operations support teams
  • Serve as the lead AI technical authority and trusted advisor for all operations support programs, automation movements, and AI transformation efforts
  • Lead technical discussions, architecture reviews, PoCs, vendor evaluations, and solution selection
  • Identify, prioritize, and drive the highest-ROI AI use cases in operations support
  • Build, mentor, and lead a high-performing squad of AIOps specialists focused on operations support outcomes
  • Foster a culture of rapid experimentation, production-first mindset, and relentless focus on operational impact
Employment Type:
Fulltime

MTS, Systems Architecture Engineering

The System Architecture Engineer's role is to develop and evolve technical netwo...
Location:
United States, Bellevue; Overland Park; Frisco
Salary:
142800.00 - 257600.00 USD / Year
https://www.t-mobile.com Logo
T-Mobile
Expiration Date
Until further notice
Requirements
  • Master’s/Advanced degree in Computer Science, Engineering, or related field. Equivalent experience considered
  • 7–10 years in system, network, or reliability engineering roles
  • Deep expertise in network infrastructure (Cisco, Juniper, Check Point, F5, A10, Infoblox, BIND, DNS)
  • Hands-on experience with observability tools: Dynatrace, ThousandEyes, SevOne, Splunk, ServiceNow AIOps, OTEL
  • Proficiency with automation tools (Terraform, Ansible, Chef, Puppet) and cloud deployments (AWS preferred)
  • Programming/scripting in Python, Go, or Shell
  • Experience with CI/CD pipelines, Kubernetes, and containerized environments
  • Communication
  • Technical Writing
  • Analytics
Job Responsibility
  • Develop and evolve technical network and service architectures and design strategies
  • Improve and protect the software, infrastructure, and network systems that power T-Mobile’s IT and customer-facing services
  • Ensure scalability, availability, performance, security, and reliability across applications and networks
  • Proactively identify and prevent network issues before they impact customers
  • Play a critical role in outage bridges, leveraging KPIs, telemetry, and AI-driven analytics to pinpoint problems
  • Create new designs, architectures, and standards for delivering software and network services
  • Improve scalability, latency, and efficiency of T-Mobile’s applications and network services
  • Contribute to cloud enablement, containerization, and microservices reliability
  • Manage improvement work, PoCs, and future automation projects
  • Diagnose and resolve complex issues in routers, firewalls, load balancers, DNS, and global traffic managers
What we offer
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Annual bonus or periodic sales incentive or bonus
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off and up to 12 paid holidays
  • Paid parental and family leave
Employment Type:
Fulltime

Lead Platform Engineer

After the launch of its flagship product, a fast-growing scale-up is expanding i...
Location:
United Kingdom, Bradford and Leeds
Salary:
85000.00 - 95000.00 GBP / Year
lawrenceharvey.com Logo
Lawrence Harvey
Expiration Date
Until further notice
Requirements
  • Proven experience leading high-performing Platform/DevOps teams, including hybrid/offshore or partner resource models
  • Over 3 years of hands-on experience with Google Cloud Platform
  • Strong expertise in CI/CD design and build using GitHub, Terraform or similar
  • Experience supporting microservices / API-driven architectures
  • Comfortable working in fast-paced, product-led organisations with multiple stakeholders
Job Responsibility
  • Take ownership of its cloud platform and enable rapid, secure product delivery at scale
  • Lead a multi-disciplinary platform function across CI/CD, networking, security, AIOps, and observability
  • Build a robust self-service platform that empowers engineering squads
  • Shape platform strategy
  • Champion automation
  • Play an integral role in the development of future products
What we offer
  • 15% Bonus
Employment Type:
Fulltime

Senior Technical Architect – Site Reliability Engineering & AIOps

At Schwab, you’re empowered to make an impact on your career. Here, innovative t...
Location:
United States, Austin
Salary:
210000.00 - 240000.00 USD / Year
schwab.com Logo
Charles Schwab
Expiration Date
March 07, 2026
Requirements
  • 12+ years of experience in software development and engineering, infrastructure, or SRE
  • 5+ years in a senior architecture or technical leadership role
  • Deep expertise in distributed systems, cloud-native architectures, and large-scale production environments
  • Hands-on experience with Kubernetes, Docker, service mesh, CI/CD pipelines, and infrastructure-as-code tools
  • Strong understanding of ML/AI concepts and their application to operational intelligence
  • Proven experience designing observability platforms using OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, or equivalent
  • Expertise in incident management frameworks, chaos engineering, and SLO-driven reliability practices
  • Experience with major cloud platforms (AWS, GCP, Azure) at scale
  • Strong communication and executive presence with the ability to translate complex technical concepts for non-technical stakeholders
Job Responsibility
  • SRE Architecture & Reliability Strategy: Define and own the end-to-end reliability architecture, including SLO/SLI frameworks, error budget policies, observability standards, and resilience patterns across distributed microservices environments
  • AIOps Platform Architecture: Design and architect the AIOps platform encompassing ML-driven anomaly detection, predictive alerting, automated root cause analysis, event correlation, and intelligent remediation workflows
  • Infrastructure & Platform Design: Lead architecture decisions for cloud-native infrastructure (GCP/AWS/Azure), Kubernetes orchestration, service mesh (Istio/Envoy), infrastructure-as-code (Terraform/Pulumi), and multi-region disaster recovery strategies
  • Observability & Monitoring Architecture: Architect the unified observability stack integrating metrics, logs, traces, and events using technologies such as OpenTelemetry, Grafana, Datadog, and custom ML pipelines for intelligent alerting
  • Automation & Self-Healing Systems: Drive the architecture of automated remediation frameworks, self-healing infrastructure, chaos engineering pipelines, and progressive deployment strategies (canary, blue-green, feature flags) to achieve zero-touch operations
  • Technical Leadership & Governance: Establish architecture review boards, technical standards, design patterns, and reference architectures; lead technical due diligence and drive consistency across SRE and platform teams
  • Team Development & Mentorship: Build, mentor, and grow a team of senior SRE architects and engineers; foster a culture of engineering excellence, continuous learning, and innovation in reliability and AI-driven operations
  • Stakeholder & Executive Engagement: Partner with Engineering, Product, Security, and Infrastructure leadership to align reliability and AIOps investments with business priorities
What we offer
  • 401(k) with company match and Employee stock purchase plan
  • Paid time for vacation, volunteering, and 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave and family building benefits
  • Tuition reimbursement
  • Health, dental, and vision insurance
Employment Type:
Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location:
Canada, Mississauga
Salary:
115000.00 - 128000.00 CAD / Year
pointclickcare.com Logo
PointClickCare
Expiration Date
Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
Employment Type:
Fulltime

Principal Customer Success Manager

The Customer Success Architect position is a technical champion within the Custo...
Location:
United States, New York
Salary:
115500.00 - 266000.00 USD / Year
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • 10-15 years of experience, preferably in the IT management (ITOM)/APM fields
  • 5+ years of experience in senior customer-facing positions as an Implementation Architect, Service Delivery Architect, or Lead Solution Architect
  • In-depth knowledge and hands-on experience in one or more of the following: Observability, Process Automation, Patching, AIOps
  • An in-depth understanding of infrastructure management and intelligent automation is preferred
  • Familiarity with cloud-native design patterns, microservices, and modern web-scale architectures
  • Excellent written and oral communication skills, analytical, self-motivated, and quick on-the-job learning skills
  • Effectively multitask between initiatives with minimal oversight and provide a positive customer service attitude.
Job Responsibility
  • Being the trusted partner for the customer on use-case and product functionality
  • Lead customers in the application of OpsRamp products and services offerings to meet their Business Outcomes
  • Develop a deep understanding of OpsRamp IT Operations Platform, architecture, and its capabilities through training and hands-on experience
  • Build on the technical design and architecture developed during the implementation phase to maintain a point-in-time architecture for each customer
  • Serve as an important source for information regarding the customer’s technical needs and provide customer feedback
  • Perform and own the health checks during the customer success engagement lifecycle in a client environment
  • Understand and document client use cases and build best practice enablement and content packs for the various use cases
  • Track support and feature requirements and interface with the Product and Engineering team where required
  • Establish technical authority quickly with executive technical customer stakeholders
  • Invest time in documenting best practices, capturing and disseminating knowledge, and other initiatives.
What we offer
  • Flexibility to manage work and personal needs
  • Health and emotional wellbeing support
  • Personal and professional development programs
  • Unconditional inclusion
  • Career growth and skill application programs.
Employment Type:
Fulltime

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...
Location:
United States
Salary:
150000.00 - 225000.00 USD / Year
alpha-sense.com Logo
AlphaSense
Expiration Date
Until further notice
Requirements
  • 8+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • 3+ years in a Senior+ SRE position
  • Strong background in running production SaaS systems at scale
  • Proficiency in at least one programming/scripting language (Python, Go, or similar)
  • Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
  • Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
  • Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
  • Familiarity with advanced observability (OTEL, continuous profiling)
  • Proven incident management experience, including leading high-severity incidents and postmortems
  • Strong troubleshooting skills across the full stack
Job Responsibility
  • Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
  • Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
  • Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
  • Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
  • Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OTEL, Continuous Profiling) to optimize performance proactively
  • Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing
What we offer
  • Equity
  • Generous benefits program
Employment Type:
Fulltime