Lead Site Reliability Engineer

India, Chennai · Job Posted June 09, 2026

Job Description

Trimble is looking for a Site Reliability Engineering Lead to join Business Systems. Our team is building the platform fuelling Trimble's digital transformation. We take a cloud-first approach to deliver customer-centric experiences and platform web services that are used by Trimble product teams and Trimble partners. As a Solutions Engineer, you'll be a vital part of Trimble's Engagement Team for Digital Transformation. This team helps Trimble's product teams and partners adopt and integrate Trimble cloud services, always with the customer in mind. You will be an expert in Business Systems services, building, proving out, and communicating the value of the platform that is enabling Trimble's Digital Transformation.

Job Responsibility

  • Become well-versed in the opportunities and challenges of the business and Trimble's customers
  • Become an expert in Business Systems services, especially the interfaces—APIs, protocols (e.g. OAuth), and user interfaces
  • Establish and then draw on close working relationships with stakeholders across the company, especially Trimble's engineering community
  • Prototype and create proofs of concept as required
  • Scope and deploy new integrations
  • Investigate, diagnose, and solve customer integration issues
  • Effectively communicate technical issues with stakeholders in non-technical language
  • Contribute to utilities and SDKs to help integration and migration efforts

Requirements

  • Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field
  • 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles, with at least 2 years in a leadership or mentoring capacity
  • Deep AWS expertise (EC2, S3, RDS, IAM, VPC, Lambda, CloudFormation/Terraform, etc.)
  • Strong knowledge of Infrastructure-as-Code (IaC) using Terraform, AWS CDK, or CloudFormation
  • Proven experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, or similar)
  • Proficiency in containerization and orchestration (Docker, Kubernetes, ECS, or EKS)
  • Expertise in monitoring and observability tools (Datadog, New Relic, Prometheus, Grafana, ELK, CloudWatch, etc.)
  • Strong scripting or programming background (Python, Bash, or Go)
  • Sound understanding of networking, security, and identity/access management in the cloud
  • Experience designing high-availability and disaster recovery strategies for critical workloads
  • Excellent communication, problem-solving, and leadership skills with the ability to influence across teams

Nice to have

  • AWS or other Cloud Certification (Solutions Architect, DevOps Engineer, etc.)
  • Experience with AIOps, Serverless Architectures, and event-driven systems
  • Familiarity with FinOps practices and cost optimization frameworks
  • Experience with SaaS monitoring tools (Datadog, New Relic, Sumo Logic, PagerDuty)
  • Exposure to Atlassian tools (Jira, Confluence, Bitbucket)
  • Experience with SQL/NoSQL databases
  • Proven track record of leading cross-functional reliability initiatives or platform-wide automation projects

Similar Jobs for Lead Site Reliability Engineer

8 matching positions

Lead Site Reliability engineer

Solution, Reliability and Monitoring Entity main objective is to define, provide...
Location: India, Bangalore
Salary: Not provided
Airbus
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, information technology or other related discipline with 7+ years of experience
  • Solid experience designing and building secure solutions in AWS (Amazon Web Services)
  • Extensive experience in systems administration or a combination of software/systems experience
  • Some experience in scripting and automation of assets
  • Solid knowledge of Operating Systems & ability to perform troubleshooting required
  • Extensive knowledge of Cloud Technology concepts & ability to perform complex troubleshooting required
  • Solid knowledge of networking for enterprise environments required
  • Solid knowledge of Virtual Machine concepts and management of infrastructure
  • Demonstrated ability to identify root cause of issues and to recommend permanent, long term, fixes
  • Demonstrated ability to perform complex troubleshooting in AWS environment and providing guidance to other teams
Job Responsibility
  • Define, implement, and manage cloud-based infrastructure
  • Work closely with the Software Factory’s (SWF) Solution Architects to facilitate the transition from Development to In-Support phase
  • Creating/Animating a hosting network with SWF
  • Representing Hosting Group in the different Trains
  • Coordinating with Solution Architects (SAs) to support the technical architecture decisions related to Hosting
  • Supporting SWF for new components onboarding
  • Coordinate with SWF Systems & Architecture team for future planning
  • Contribute to Prioritization Reviews for the different trains
  • Guide products in Service Level Objectives (SLO) definitions & monitoring based on Hosting Operations feedback
  • Define, share and broadcast Guidelines and Non-Functional Requirements (NFR) related to: hosting, deployment and monitoring
  • Fulltime

Lead Site Reliability Engineer

Glean is seeking a Site Reliability Engineering Lead to foster a culture of engi...
Location: United States, Palo Alto
Salary: 200000.00 - 260000.00 USD / Year
Glean
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, a related field, or equivalent practical experience
  • 8+ years of experience in a senior-level role within Site Reliability Engineering or similar role, particularly in managing cloud-based services and infrastructure
  • 5+ years of experience with software development in one or more programming languages
  • 3+ years of experience managing people or teams, leading projects, and designing, analyzing, and troubleshooting distributed systems running in Cloud
  • Strong knowledge of cloud platforms such as Google Cloud Platform, AWS, or Azure
  • Practical experience with containerization technologies, including Docker and Kubernetes
  • Familiarity with infrastructure as code tools like Terraform is essential
  • Solid understanding of networking, security principles, and best SRE and security practices
  • Proficiency in using monitoring and alerting tools to detect and respond to potential issues effectively
Job Responsibility
  • Foster a culture of engineering excellence, drive technical strategy, and develop a high-performing, collaborative team
  • Ensure services meet stringent Service Level Objectives (SLOs)
  • Build resilient, automated production environments in the cloud
  • Lead a team and be responsible for products globally
  • Provide technical leadership to key projects
  • Manage the complex challenges of scale and fast growth
  • Keep Glean applications up and running
  • Drive technical excellence and foster a culture of reliability across engineering teams
  • Set best practices for incident management, performance optimization, and automation
  • Influence best practices, drive cross-team collaborations, and contribute to the execution of key objectives
What we offer
  • Comprehensive benefits package
  • Medical, Vision, and Dental coverage
  • Generous time-off policy
  • Opportunity to contribute to 401k plan
  • Home office improvement stipend
  • Annual education and wellness stipends
  • Vibrant company culture through regular events
  • Healthy lunches daily
  • Fulltime

Lead Site Reliability Engineer

Our client is committed to building trust and making the world more agreeable fo...
Salary: Not provided
N-iX
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in a relevant programming language
  • Extensive knowledge of Cosmos DB management and optimization
  • Strong Terraform IaC deployment experience
  • Proven ability to interact with stakeholders and promote best practices
  • Dashboarding/data visualization experience
Job Responsibility
  • Identify and assess Cosmos DB resource utilization and recommend optimization strategies
  • Engage directly with resource owners to present findings and implement rightsizing
  • Design, build, and maintain dashboards to visualize Cosmos DB usage and opportunities for improvement
  • Develop Terraform-based solutions for efficient cloud database management
  • Stay updated on best practices around cloud cost optimization and security
What we offer
  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Lead Site Reliability Engineer

Our client is committed to building trust and making the world more agreeable fo...
Salary: Not provided
N-iX
Expiration Date: Until further notice
Requirements
  • 8+ years in software development with languages such as C#, C++, or Java
  • Hands-on experience with Service Bus in a global enterprise setting
  • Proven expertise in Terraform and deployment automation
  • Experience with DR processes, dashboard creation, and resource rightsizing
  • Strong communication skills to drive engagement with service owners
Job Responsibility
  • Build and maintain DRI dashboards to identify resource utilization and optimization opportunities for Service Bus
  • Collaborate with service owners to recommend and implement right-sizing strategies
  • Author high-quality, scalable automation code to streamline disaster recovery processes
  • Develop and deploy IaC solutions using Terraform
  • Drive adoption of automation and robust monitoring for service health and disaster recovery
  • Participate in on-call rotations and refine processes for improved system reliability
What we offer
  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits
  • Fulltime

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location: India, Bangalore
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in systems engineering
  • At least 5 years in SRE or DevOps roles
  • Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • Proficiency in programming and scripting languages like Python, Go, and Bash
  • Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • Deep understanding of networking, DNS, load balancing, and security principles
  • Proven track record of managing high-availability systems in demanding environments
  • Exceptional analytical and problem-solving skills
Job Responsibility
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • Mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • Guide architectural decisions that drive innovation and enhance system reliability
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • A collaborative and innovative work environment that values your expertise and contributions
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Lead Site Reliability Engineer/ Expert

Responsible for ensuring highly reliable, scalable, and resilient production sys...
Location: Egypt; India, Cairo; Delhi
Salary: Not provided
SITA
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field. Master’s degree preferred for senior roles
  • Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SDWAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA)
  • Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies
  • Certifications in automation and IaC tools (Ansible, Terraform)
  • Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK)
  • Certifications in ServiceNow, Jira, or other operational tooling
  • 8+ years in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer
  • Strong experience with high availability systems, resilience engineering, and DR readiness
  • Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues
  • Hands-on experience with CI/CD, automation, IaC, and self-healing/auto-remediation workflows
Job Responsibility
  • Design & maintain resilient systems ensuring high availability, scalability, and fault tolerance
  • Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments
  • Improve platform reliability, observability, and performance across cloud and on‑premises systems
  • Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability
  • Take ownership of production availability, capacity planning, performance tuning, and long‑term reliability initiatives
  • Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows
  • Develop and implement auto‑remediation and self‑healing solutions to reduce manual intervention
  • Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments
  • Implement and manage zero‑downtime deployment strategies (blue‑green, canary, rolling)
  • Support containerized and cloud‑native platforms including Kubernetes, Docker, and distributed systems
What we offer
  • Work from home up to 2 days/week (depending on your team's needs)
  • Make your workday suit your life and plans
  • Take up to 30 days a year to work from any location in the world
  • Employee Assistance Program (EAP), for you and your dependents 24/7, 365 days/year
  • Champion Health - a personalized platform that supports a range of wellbeing needs
  • Access to world-class learning platforms and programs (LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Airport Council International, Pluralsight, Harvard Business Publishing, Stanford)
  • Competitive benefits that make sense with both your local market and employment status
  • Fulltime

Site Reliability Engineer (Lead)

10Pearls is an award-winning end-to-end digital innovation company that helps bu...
Location: Pakistan, Islamabad
Salary: Not provided
10Pearls
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in computer science or related field
  • 5–8 years in SRE or production-engineering roles running distributed systems at scale
  • Deep Kubernetes expertise — operators, RBAC, network policy, storage, upgrades
  • Hands-on with Keycloak / Vault / MinIO / Harbor / Kong or equivalent identity/secrets/storage/registry/gateway stacks
  • Strong Linux fundamentals and at least one systems language (Go, Rust) or shell/Python for tooling
  • Proven SLO/SLI authorship and error-budget-driven decision-making
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
  • Calm, clear communication during incidents; strong post-mortem writing
  • Hands-on with infra-as-code — Helm, Kustomize, Terraform
Job Responsibility
  • Substrate operation: own the Kubernetes cluster plus Keycloak (identity), Vault (secrets), MinIO (object storage), Harbor (registry), Kong (gateway), from bootstrap to day-2 operations
  • SLO framework: define, publish, and defend SLOs for every tier-1 service; own error budgets and burn-rate alerting
  • Incident response: build the on-call rotation, paging, runbook library, and post-mortem culture; lead incident command during P1/P2 events
  • Release operations: co-own the blue-green / canary release model with L6 Delivery; sign off production-bound releases
  • Air-gap operations: ensure every operational runbook works in a fully offline environment, with no assumption of external dependencies
  • Lead the Platform squad: technically lead 1 Infrastructure Engineer, 1 Observability Engineer, 2 DevOps Engineers; set standards for infra-as-code and automation
  • Fulltime

Technical Lead-Site Reliability Engineer

We are seeking an experienced Site Reliability Engineer to support Vodafone’s st...
Location: Egypt, Cairo
Salary: Not provided
Vodafone
Expiration Date: Until further notice
Requirements
  • Experienced in Site Reliability Engineering, DevOps, or production support roles within complex, enterprise-scale environments
  • Skilled in Unix/Linux administration with strong shell scripting experience
  • Experienced with CI/CD tools such as Git, Jenkins, Nexus, SonarQube, and configuration or automation tools
  • Proficient in infrastructure as code using tools such as Terraform or CloudFormation
  • Comfortable working with public cloud platforms such as AWS or Azure
  • Able to develop using one or more high-level programming languages, including Python, Java, or JavaScript
  • Experienced in containerisation and orchestration technologies, including Docker and Kubernetes
  • Familiar with monitoring and observability tools such as Prometheus, Grafana, CloudWatch, or Centreon
  • Knowledgeable in microservices architecture, APIs, and web services (REST, SOAP, JSON, XML)
  • Experienced with relational and NoSQL data stores such as PostgreSQL, MariaDB, Redis, MongoDB, or similar technologies
Job Responsibility
  • Drive reliability, availability, and performance across IoT platforms through proactive monitoring, automation, and operational improvements
  • Design, deploy, review, and troubleshoot technical integrations with multiple platforms, services, and connected devices
  • Implement and enhance CI/CD practices to enable high levels of operational automation and zero-touch operations
  • Partner with development teams to improve services through rigorous testing, release management, and operational readiness
  • Act as a technical subject matter expert, supporting and coaching team members to build capability across relevant technologies
  • Lead and support incident and problem management activities, ensuring timely resolution, root cause analysis, and preventive actions in line with agreed SLAs
  • Contribute to system design reviews, including HLDs and LLDs, translating architectural decisions into operational requirements
  • Balance feature delivery speed with platform reliability through clearly defined service level objectives
  • Design, implement, and continuously enhance monitoring, alerting, and observability solutions to maintain a holistic view of system health
  • Manage production environments through proactive capacity planning, performance optimisation, and release deployments
What we offer
  • The opportunity to work on large-scale, business-critical IoT platforms with global reach
  • Exposure to modern cloud-native architectures, DevOps practices, and automation at enterprise scale
  • Collaboration with international teams across Vodafone Group and strategic partners
  • A role that blends hands-on engineering with system design, reliability strategy, and continuous improvement
  • A supportive environment that values learning, knowledge sharing, and professional growth