Site Reliability Engineer (SRE) Job at Wissen (Bangalore South)

Cloud Engineer / Site Reliability Engineer (SRE)

Location

United States, Orlando

Salary:

75.00 USD / Hour

Beacon Hill

Expiration Date

Until further notice

Requirements

Strong hands-on AWS experience with solid understanding of core AWS services
Experience supporting and troubleshooting AWS and Azure cloud environments
Terraform experience for Infrastructure as Code
Docker/containerization experience
Strong troubleshooting and problem-solving skills
Ability to translate requirements into technical execution
Experience performing cloud architecture and diagramming
Experience supporting deployments, environments, and site standups
Strong communication and collaboration skills

Job Responsibility

Support cloud infrastructure and deployments across AWS and Azure
Troubleshoot infrastructure and application-related cloud issues
Build and maintain Terraform-based infrastructure
Support Docker/containerized environments
Create architecture diagrams and technical documentation
Work closely with engineering and project teams to execute cloud initiatives
Assist with automation and operational improvement efforts

Full-time

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Full-time

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...

Location

India, Chennai

Salary:

Not provided

Dalet

Expiration Date

Until further notice

Requirements

Cloud platforms: AWS, Azure
Containerisation & Orchestration: Kubernetes
Infrastructure as Code: Terraform
Configuration Management: Ansible
Packaging & Deployment: Helm
Databases: MariaDB, MongoDB
Monitoring, observability, networking, and cloud security

Job Responsibility

Act as a senior technical authority for APAC Site Reliability Engineering activities
Drive best practices in reliability, operations, and engineering standards
Promote technical excellence, collaboration, and accountability across stakeholders
Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
Collaborate closely with engineering to improve platform components, automation, and operational processes

What we offer

Great career opportunities around the world
Truly collaborative environment with supportive leadership
Cutting edge technologies (AI, Cloud, Cybersecurity...)
Talented and passionate team members
Fun working environment

Full-time

Site Reliability Engineer (SRE) - Identity Access Management IAM

Join us as a Site Reliability Engineer (SRE) - Identity Access Management. You w...

Location

India, Pune

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Experience in designing, implementing, deploying, and running highly available, fault-tolerant, auto-scaling and auto-healing systems
Strong expertise in AWS (essential; Azure and GCP (Google Cloud Platform) are a plus), including Kubernetes (ECS is essential; Fargate and GCE are a plus) and serverless architectures
Strong experience in running disaster recovery and zero-downtime solutions, and in designing and implementing continuous delivery across large-scale, distributed, cloud-based microservice and API service solutions with 99.9%+ uptime
Hands-on experience coding in Python, Bash, and JSON/YAML (Configuration as Code)
The ability to drive reliability best practices across engineering teams, embed SRE principles into the DevSecOps lifecycle, and partner with engineering, security, and product teams to balance reliability and feature velocity
Experience in hands-on configuration, deployment, and operation of ForgeRock COTS-based IAM (Identity Access Management) solutions (PingGateway, PingAM, PingIDM, PingDS) with embedded security gates, HTTP header signing, access token and data-at-rest encryption, PKI-based self-sovereign identity, or open source

Job Responsibility

Apply software engineering techniques, automation, and incident-response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology
Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
Resolve, analyse, and respond to system outages and disruptions, and implement measures to prevent similar incidents from recurring
Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning
Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Full-time

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...

Location

United States, Austin

Salary:

Not provided

Dutech Systems

Expiration Date

Until further notice

Requirements

8+ years of experience in SRE, DevOps, or Systems Engineering
Strong expertise in Linux/Unix systems and system internals
Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
Experience designing and operating distributed systems
Hands-on experience with cloud platforms (AWS or GCP)
Experience with Docker and Kubernetes
Strong understanding of monitoring, alerting, and logging concepts
Experience managing SLIs, SLOs, and error budgets
Experience with incident management and RCA processes

Job Responsibility

Design, implement, and manage highly available, distributed systems
Maintain and optimize cloud infrastructure (AWS/GCP)
Develop automation scripts using Python, Go, Java, or Bash
Manage containerized environments using Docker and Kubernetes
Define and monitor SLIs, SLOs, and error budgets
Implement monitoring, logging, and alerting solutions
Lead incident management, root cause analysis (RCA), and postmortems
Ensure system security and compliance within operational workflows
Improve system reliability through performance tuning and optimization
Collaborate with engineering teams to enhance deployment and release processes

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Full-time

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...

Location

Peru

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
Proficiency in Python or Go for automation and tooling
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
Strong communication and influencing skills — data over hierarchy

Job Responsibility

Architect and maintain self-healing systems with 99.9%+ availability targets
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
Implement adaptive SLIs/SLOs that evolve automatically from real-time data
Build AIOps-based observability and auto-remediation pipelines
Apply predictive modeling to forecast failures before they impact users
Lead chaos, performance, and resilience testing programs
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
Mentor engineers and drive reliability standards across teams
Partner with platform, data, and product teams to ensure stability aligns with business goals
Support major incident response and incident review, and participate in on-call rotations

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems
