Site Reliability Engineer

India, Chennai · Employment contract · Job Posted June 09, 2026

Job Description

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Ops / ML Ops). Are you passionate about deploying, monitoring, and scaling machine learning systems in production environments? Trimble's Construction Management Solutions (CMS) division is looking for a driven, early-career Site Reliability Engineer to join our high-performing team in Chennai. In this role, you will help build and manage robust AI infrastructure, bridging the gap between cutting-edge data science and resilient cloud operations.

Job Responsibility

  • Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
  • Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
  • Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
  • Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
  • Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns

Requirements

  • 1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
  • Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
  • Direct experience with the Microsoft Azure cloud platform and its ecosystem services (such as Azure ML and Azure DevOps)
  • Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
  • Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
  • Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle

Nice to have

  • Familiarity with MLOps tracking tools and open-source frameworks (MLflow, Kubeflow, DVC, or similar)
  • Basic experience with monitoring software suites (Prometheus, Grafana, ELK stack)
  • Exposure to Infrastructure as Code (IaC) configuration tools like Terraform or Ansible
  • Knowledge of database systems, data pipeline technologies, or model serving frameworks (TensorFlow Serving, TorchServe, ONNX Runtime)
  • Experience with cross-platform (Windows/Linux) command-line administration and a basic understanding of cloud security best practices for AI workloads

What we offer

  • Structured environment to accelerate technical skills
  • Direct guidance from experienced engineering professionals
  • Projects that improve productivity, quality, safety, transparency and sustainability
  • Collaborative and supportive team
  • Entrepreneurial spirit empowering proactive doers
  • Flexible work arrangements

Similar Jobs for Site Reliability Engineer

8 matching positions

Site Reliability Engineer

At Schwab, you’re empowered to make an impact on your career. Here, innovative t...
Location: United States, Austin
Salary: 126000.00 - 140000.00 USD / Year
Charles Schwab (schwab.com)
Expiration Date: June 12, 2026
Requirements
  • Bachelor's degree in Computer Engineering, Computer Science, or related field
  • 6+ years of software development and site reliability engineering experience supporting production applications in cloud environments such as Pivotal Cloud Foundry (PCF) or Google Cloud Platform (GCP)
  • 4+ years of DevOps engineering leadership experience focused on automation, tooling, and improving production operations
  • 2+ years of technical leadership experience guiding engineering teams and driving operational efficiencies
  • 2+ years of experience implementing and maturing operational best practices, including SLOs, SLIs, error budgets, monitoring, capacity planning, and incident management processes
  • Proficiency in programming and automation using tools such as Python, CloudFormation, or Terraform to build infrastructure-as-code solutions
  • Strong knowledge of database technologies (SQL, Aerospike, Postgres)
  • Experience working with messaging and streaming platforms such as RabbitMQ and Kafka
Job Responsibility
  • Ensure availability, performance, and resiliency of highly visible cloud-based platforms and applications
  • Influence how systems are designed, built, and operated, driving measurable improvements in reliability and scalability
  • Partner closely with engineering and platform teams to define and implement sustainable operating models
  • Identify and execute opportunities to enhance service health and telemetry
  • Shape and deliver forward-looking resiliency and availability roadmaps
  • Lead adoption of cloud-native technologies aligned with established SRE standards
  • Promote a proactive shift-left approach embedding reliability, fault tolerance, and performance into the development lifecycle
  • Optimize systems, reduce operational toil, and improve key metrics such as MTTD and MTTR
What we offer
  • 401(k) with company match
  • Employee stock purchase plan
  • Paid time for vacation
  • Volunteering time
  • 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave
  • Family building benefits
  • Tuition reimbursement
  • Health insurance
  • Dental insurance
  • Fulltime

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: United Kingdom, London
Salary: 150000.00 GBP / Year
Hunter Bond (hunterbond.com)
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux & open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions & build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Invest heavily in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
  • Fulltime

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Bangalore
Salary: Not provided
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout
  • Work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale
  • Play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions
  • Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets
  • Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
  • Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
  • Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
  • Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
  • Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability
  • Fulltime

Site Reliability Engineer

Location: United Kingdom, Newcastle
Salary: Not provided
Trimble Inc. (trimble.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Engineering or a related field
  • At least 5 years of technical experience with a proven ability to take full ownership of production infrastructure
  • Excellent collaboration skills, with experience leading cross-functional work
  • Demonstrated success in managing infrastructure in production environments
  • Expertise in capacity planning and cost optimisation for efficient operations
  • Extensive expertise managing cloud provider hosted infrastructure, specifically with Microsoft Azure or AWS
  • Proficient in high-level scripting languages like Python and Infrastructure as Code tools (Terraform), along with containerisation
  • Demonstrated success with Kubernetes or other containerization technologies
  • Familiarity with CI/CD pipelines and tools such as Azure DevOps, Jenkins, Argo CD, Helm, GitHub
  • Experience with incident management processes and with monitoring tools like Prometheus, Grafana, New Relic, DataDog, Splunk, Cloudwatch, Sumologic, etc.
Job Responsibility
  • Develop and maintain scalable infrastructure as code (IaC) using Terraform to ensure reliable and scalable cloud environments
  • Implement and enhance observability solutions using tools like New Relic, DataDog, Sumologic and Splunk for monitoring, logging, and alerting
  • Perform code deployments and manage CI/CD pipelines using Jenkins, GitHub, and related tooling to ensure smooth and efficient delivery processes
  • Automate routine tasks and workflows to increase operational efficiency and reduce manual intervention
  • Evaluate system designs and architectures for reliability, performance, security, and efficiency, ensuring best practices are followed
  • Lead incident response efforts and conduct deep-dive root cause analysis to implement long-term, innovative technical solutions
  • Develop and maintain comprehensive runbooks and procedures for incident response and operational tasks
  • Collaborate with cross-functional teams to review and provide feedback on technical designs, ensuring alignment with SRE principles
  • Participate in on-call rotations and handle critical incidents with confidence and expertise
  • Continuously improve documentation for systems and services, contributing to a knowledge-sharing culture within the team

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Westlak...
Location: United States, Westlake
Salary: Not provided
NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, or related disciplines (understanding reliability engineering principles, SLIs, SLOs, error budgets, and operational excellence)
  • 5+ years’ hands-on Terraform experience
  • 5+ years’ experience supporting mission-critical enterprise applications in production environments
  • 5+ years’ experience with cloud networking, security, and infrastructure architecture
  • 5+ years of hands-on experience managing hybrid cloud environments
  • 5+ years of automation skills using Python, Ansible, Shell scripting, or similar technologies
  • 5+ years’ experience building reusable infrastructure modules and automated deployment frameworks
Job Responsibility
  • Design, implement, and support highly available load balancing solutions using F5 BIG-IP, Broadcom AVI, and cloud-native load balancing services
  • Build and maintain Infrastructure-as-Code (IaC) solutions using Terraform
  • Develop automation solutions for infrastructure provisioning, configuration management, and operational workflows
  • Support and enhance CI/CD pipelines using tools such as Jenkins, Azure DevOps, GitHub Actions, or similar platforms
  • Collaborate with application, cloud, network, and platform teams to improve reliability, performance, and scalability
  • Monitor production systems and proactively identify reliability, performance, and availability risks
  • Implement Site Reliability Engineering best practices including observability, incident management, capacity planning, and resiliency engineering
  • Troubleshoot complex issues across networking, cloud infrastructure, load balancing, and application environments
  • Support hybrid infrastructure environments spanning on-premises datacenters and public cloud platforms
  • Participate in on-call rotation and provide production support for critical business applications
  • Fulltime

Site Reliability Engineer

RED Global is currently supporting one of our international clients in their sea...
Location: Netherlands, Utrecht
Salary: Not provided
RED Commerce - The Global SAP Solutions Provider (redglobal.com)
Expiration Date: Until further notice
Requirements
  • Strong experience as a Site Reliability Engineer
  • Experience supporting and maintaining reliable, scalable production environments
  • Strong troubleshooting and incident management capabilities
  • Experience working within complex enterprise environments
  • Strong communication and stakeholder management skills

Site Reliability Engineer

Qargo is a cloud-based (SaaS) Transport Management Platform. We are a scale-up b...
Location: Belgium, Ghent
Salary: Not provided
Qargo (qargo.com)
Expiration Date: Until further notice
Requirements
  • Experience as a Software Engineer, with an interest in infrastructure, scalability, reliability
  • Strong programming skills (preferably Python or similar backend languages)
  • Experience working with cloud platforms, container orchestrators, serverless (preferably Google Cloud)
  • Familiarity with distributed systems and scalability challenges
  • Experience with CI/CD pipelines and automation
  • Solid understanding of databases and performance tuning (SQL and/or NoSQL)
  • Familiarity with monitoring and observability tools
  • A problem-solving mindset and the ability to think in systems
  • Strong collaboration skills and a proactive approach to improving systems
Job Responsibility
  • Build and maintain systems and tooling that improve the reliability, scalability, and performance of our platform
  • Improve software delivery cycle, focusing on automation and developer experience
  • Develop internal tools and services to reduce manual operational work
  • Improve observability by implementing monitoring, logging, and alerting across systems
  • Optimize system performance, including databases such as PostgreSQL and Firestore
  • Collaborate with backend engineers and other engineering teams to design reliable and scalable system architectures
  • Troubleshoot complex production issues and implement long-term fixes
  • Continuously improve infrastructure (Infrastructure as Code, automation, etc.)
What we offer
  • A fast-growing SaaS company with a strong mission and an impact-driven team
  • A flexible work environment with flexible hours and hybrid working
  • A green office with a great atmosphere and lots of initiatives
  • A role with a lot of responsibility, ownership, and tangible impact
  • The opportunity to grow with us and shape both your career and our platform
  • Fulltime

Site Reliability Engineer

We are looking for a Site Reliability Engineer (SRE) to support reliable, high-p...
Location: United States, Novi
Salary: Not provided
Robert Half (https://www.roberthalf.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Information Technology, Computer Science, Computer Engineering, or comparable practical experience
  • At least 5 years of experience supporting production environments in a corporate, startup, or similarly fast-paced technical setting
  • Hands-on expertise with infrastructure as code, including Terraform, along with experience in cloud platforms and related services
  • Working knowledge of container technologies such as Docker and orchestration platforms like Kubernetes
  • Experience supporting live systems, participating in on-call rotations, and contributing to incident reviews and corrective actions
  • Proficiency with automation and scripting using Bash and Python to reduce manual operational effort
  • Strong communication skills with the ability to explain technical decisions and tradeoffs to cross-functional or non-technical stakeholders
  • Willingness and ability to travel to customer or plant locations as business needs require
Job Responsibility
  • Maintain dependable and secure production environments across plant-edge and cloud-based systems, with a focus on uptime, responsiveness, and operational stability
  • Design, refine, and support monitoring dashboards, alerting frameworks, and operational runbooks using tools such as Prometheus, Grafana, and modern telemetry solutions
  • Build and manage infrastructure through code using Terraform, applying version control standards, peer reviews, and controlled deployment processes
  • Create automation scripts and lightweight tools in Bash and Python to streamline routine operations, recovery procedures, backup workflows, and environment setup
  • Take part in incident response and on-call coverage, troubleshoot service disruptions, coordinate initial communication, and document follow-up actions through blameless reviews
  • Establish and measure service reliability indicators and objectives, helping stakeholders balance system dependability with release speed and operational risk
  • Support secure connectivity between factory networks and cloud resources by configuring and maintaining VPNs, routing, private networking, and access controls
  • Administer and optimize relational or time-series databases, including backup planning, replication, performance tuning, and long-term storage health
  • Contribute to CI/CD delivery practices by improving deployment pipelines, supporting controlled release strategies, and preparing rollback procedures when needed
  • Partner with controls, software, and data teams to enable reliable data flow from industrial systems and ensure safe deployment to edge infrastructure
What we offer
  • Medical, vision, dental, life, and disability insurance
  • 401(k) plan