Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems Job at Dutech Systems (Austin)

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...

Location

India, Chennai

Salary:

Not provided

Arcadia

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
Strong hands-on experience with:
Terraform & Infrastructure as Code
AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty)
Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD
Kubernetes troubleshooting and operations
Prometheus/Grafana/Datadog observability stacks
Proven ability to operate in high-scale, high-uptime, multi-environment production systems
Experience building automation via Python/Bash and reducing operational toil
Strong understanding of incident management, root cause analysis, and reliability engineering principles

Job Responsibility

Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch, ensuring actionable alerting, dashboards, and metrics aligned with SLOs/SLIs
Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems

What we offer

Competitive compensation and employee stock options
Hybrid/remote-first working model (India-based role, with global collaboration)
Flexible leave policy
Comprehensive medical insurance (self + family members)
Annual performance cycle + quarterly recognition awards
A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation

Fulltime

Senior Site Reliability Engineer

HiveWatch is seeking a Staff Site Reliability Engineer to join our Platform Team...

Location

United States, El Segundo

Salary:

183000.00 - 235000.00 USD / Year

HiveWatch

Expiration Date

Until further notice

Requirements

7+ years of software engineering experience with strong coding skills in production environments
5+ years of SRE, DevOps, or production operations experience
Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
Proficiency in at least one object-oriented programming language in our tech stack (Java, Kotlin, Python)
Hands-on experience with relational databases and SQL performance optimization
Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or equivalent)
Strong debugging skills across distributed systems and microservices architectures
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

Job Responsibility

Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
Debug and resolve complex production issues across the full stack, from infrastructure to application code
Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
Perform root cause analysis requiring deep code-level investigation and implement preventive measures
Build automation and tooling to reduce operational toil and improve system reliability
Maintain CI/CD pipelines, observability infrastructure, and database performance optimization
Increase the resiliency, scalability, and maintainability of production environments
Establish on-call procedures and disaster recovery processes
Provide technical leadership and mentorship to foster engineering excellence and reliability culture

What we offer

Comprehensive health coverage: medical, dental, vision, and life insurance
Cutting-edge work in an emerging field with huge growth potential
Competitive compensation packages designed to reward top talent
A modern, newly renovated HQ right on Main Street in El Segundo, CA
401(k) with a 4% company match to help you invest in your future (match launches in 2026)
Flexible paid time off so you can recharge when you need it
Additional benefits include ClassPass credits and a discount on pet insurance
A family-friendly, compassionate culture that values balance and belonging
Eligibility to participate in the HiveWatch Equity Incentive Plan

Fulltime

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...

Location

Salary:

175000.00 - 225000.00 USD / Year

Zilliz

Expiration Date

Until further notice

Requirements

4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
Proficiency in scripting languages such as Python, Go, or Java
Strong knowledge of container orchestration technologies like Kubernetes and Docker
Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
Experience with infrastructure as code tools such as Terraform or Ansible
Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
Proven ability to troubleshoot complex distributed systems and resolve issues promptly
Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously

Job Responsibility

Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
Develop and implement strategies for monitoring, incident management, and disaster recovery
Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
Collaborate with software engineers to enhance system reliability, scalability, and performance
Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency

Fulltime

Senior Site Reliability Engineer

We are looking for a Senior Site Reliability Engineer who is passionate about sc...

Location

Salary:

Not provided

Atlassian

Expiration Date

Until further notice

Requirements

5+ years experience operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring, tweaking dashboards, defining alerts, writing runbooks, etc.
5+ years of hands-on experience with public cloud offerings (AWS components like EC2, CloudFormation, RDS / Aurora, Caches, SQS, or equivalents, e.g. in GCP / Azure)
Familiarity with Unix / Linux operating systems
Strong emphasis on debugging, improving code, and automating routine tasks
Strong backend engineering experience in one or more prominent languages such as Java, Go or Python
Excellent communication skills in written and verbal forms, and an ability to communicate complex technical issues to a range of technical and non-technical audiences (management, peers, clients)
An ability and desire to mentor and coach engineers

Job Responsibility

Scaling Cloud services
Own the infrastructure, tooling and automation that Jira Cloud runs on
Analyse and help improve our services and processes to get us to an even higher level of reliability, performance, scalability, and cost efficiency

What we offer

Health and wellbeing resources
Paid volunteer days

Senior Security Operations Engineer II

As a Senior Security Operations Engineer, you’ll play a key role in ensuring the...

Location

United States, Scottsdale

Salary:

Not provided

Axon

Expiration Date

Until further notice

Requirements

7+ years of experience in operations, site reliability, or infrastructure engineering roles
Strong experience securing and managing cloud environments (e.g., AWS, Azure) and containerized workloads
Deep understanding of Linux systems, networking, distributed systems, and their associated security controls
Proficiency in automation, scripting, and security tooling integration to streamline operations and enforcement
Experience with security monitoring, alerting, SIEM platforms, and observability tools
Solid grasp of CI/CD practices with integrated security testing and compliance checks
Experience managing Kubernetes clusters and running containerized workloads in production
Experience deploying and administering any of the following: scalable cloud-native secrets solutions such as AWS KMS or Azure Key Vault; PKI solutions such as EJBCA, Smallstep, or Venafi; or vaulting solutions such as HashiCorp Vault

Job Responsibility

Implementing and improving automated security checks in CI/CD pipelines to prevent vulnerabilities from reaching production
Writing, reviewing, and maintaining security-focused infrastructure-as-code for scalable and compliant deployments
Investigating security incidents, performing root cause analysis, and implementing long-term mitigation strategies
Collaborating with developers to develop new features, services, and infrastructure requirements
Enhancing security observability through improved log collection, metrics, and alerting configurations
Maintaining and improving security runbooks, incident response playbooks, and internal security tooling for operational efficiency
Resolve security and infrastructure incidents by taking part in high-impact, high-visibility incidents as a responder and ideally as an incident commander
Maintain and secure critical infrastructure components such as PKI (Public Key Infrastructure) and IAM (Identity & Access Management) systems, ensuring reliability, scalability, and compliance with organizational and industry security standards
Build and maintain secure, reliable, and scalable infrastructure that protects core services and sensitive data
Troubleshoot and resolve complex operational and system-level issues across environments

What we offer

Competitive salary and 401k with employer match
Discretionary paid time off
Paid parental leave for all
Medical, Dental, Vision plans
Fitness Programs
Emotional & Mental Wellness support
Learning & Development programs
Snacks in our offices

Fulltime

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer you will lead crucial initiatives in the...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
7+ years technical experience working with large-scale cloud or distributed systems
Experience building or scaling incident response programs at organizational or enterprise scope
Background in SRE, production engineering, or platform reliability roles
Track record of reducing customer impact through improved incident handling, tooling, or prevention

Job Responsibility

Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
Coach and help develop a team of Site Reliability Engineers serving as incident responders
Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
Help hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
Communicate clearly and credibly with senior leadership during customer impacting events

Fulltime

Principal Service Reliability Engineer

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliab...

Location

United States, Redmond

Salary:

142800.00 - 304200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

8+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
Experience leading reliability efforts for enterprise-scale or globally distributed systems
Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
Demonstrated ability to mentor senior engineers and influence engineering culture at scale
Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks)
Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred)
Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards
Deep experience in observability, incident management, and production operations at scale
Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles

Job Responsibility

Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes
Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries

Fulltime
