CrawlJobs Logo

Senior Site Reliability Engineer

https://www.microsoft.com/ Logo

Microsoft Corporation

Location Icon

Location:
United States , Redmond

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

119800.00 - 234700.00 USD / Year

Job Description:

The Core Services Infrastructure and Security team in Microsoft Teams provides the foundational compute, network, security, routing, traffic management, and governance layers that power Microsoft’s real‑time communication experiences across the globe. As a Senior Site Reliability Engineer, you will help build and operate the mission‑critical components of our distributed systems platform ensuring that Teams remains fast, reliable, secure, and resilient at global scale. You will focus on areas such as network security, front door and gateway infrastructure, routing and load balancing, DNS and CDN systems, and end‑to‑end monitoring and observability. This is a unique opportunity to deepen your expertise across high‑scale, active‑active architectures while collaborating with seasoned engineers who obsess over reliability and defense‑in‑depth. You will contribute hands‑on implementation work, participate in design discussions, and apply engineering best practices to continually improve the reliability, performance, and security of the platform.

Job Responsibility:

  • Contribute to the design, implementation, and operation of secure, reliable network and infrastructure services supporting Microsoft Teams’ microservices environment
  • Improve reliability by developing and refining monitoring, alerting, dashboards, and automated recovery mechanisms across critical control‑plane and data‑plane systems
  • Troubleshoot complex issues involving traffic routing, gateway behavior, certificate and TLS issues, DNS, CDN interactions, and network security policies
  • Serve as a Designated Responsible Individual (DRI) on a rotational basis triaging incidents, driving mitigations, documenting learnings, and helping improve live‑site processes
  • Participate in root‑cause analyses (RCAs) and implement durable fixes that eliminate recurring issues
  • Optimize the performance, reliability, and availability of services through data‑driven analysis using metrics, logs, and distributed tracing
  • Work closely with partner engineering teams (security, networking, microservices, compliance, governance) to deliver integrated improvements across shared infrastructure layers
  • Influence project designs by providing SRE perspective on reliability, scalability, testability, operability, and security
  • Contribute to documentation, patterns, and best practices that raise the operational bar across the broader Teams engineering organization
  • Identify opportunities for automation using scripts, pipelines, policy‑driven guardrails, or AI‑enabled tooling to reduce manual toil and increase engineering productivity
  • Stay current on emerging technologies, cloud infrastructure patterns, and security developments
  • apply new techniques to strengthen service reliability and efficiency

Requirements:

  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role
  • These requirements include but are not limited to the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Nice to have:

  • Experience with core networking concepts including TCP/IP fundamentals, routing, load balancing, CDN, ACL/firewalling, TLS, certificate lifecycle management
  • Ability to diagnose and remediate performance or availability issues using logs, metrics, traces, and standard network troubleshooting tools
  • Hands‑on experience operating services in a cloud environment (Azure or equivalent)
  • 3+ years technical experience working with large-scale cloud or distributed systems
  • Experience with network security, cloud security controls, identity‑driven security policies, and certificate management at scale
  • Familiarity with large‑scale cloud infrastructure (IaaS), microservices patterns, API gateways, and global routing architectures
  • Experience leveraging automation frameworks, scripting, GitOps, Infrastructure‑as‑Code, or AI‑driven tooling to improve reliability and reduce operational load
  • Demonstrated ability to partner across teams, influence without authority, and contribute to multi‑team initiatives or modernization efforts

Additional Information:

Job Posted:
February 13, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Senior Site Reliability Engineer

Senior Site Reliability Engineer

We are looking for a Senior Site Reliability Engineer who is passionate about sc...
Location
Location
Salary
Salary:
Not provided
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years experience operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring, tweaking dashboards, defining alerts, writing runbooks, etc.
  • 5+ years of hands on experience with public cloud offerings (AWS components like EC2, CloudFormation, RDS / Aurora, Caches, SQS - or equivalents, e.g. in GCP / Azure)
  • Familiarity with Unix / Linux operating systems
  • Strong emphasis to debug, improve code, and automate routine tasks
  • Strong backend engineering experience in one or more prominent languages such as Java, Go or Python
  • Excellent communication skills in written and verbal forms, and an ability to communicate complex technical issues to a range of technical and non-technical audiences (management, peers, clients)
  • An ability and desire to mentor and coach engineers
What we offer
What we offer
  • health coverage
  • paid volunteer days
  • wellness resources
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

Baxter International is seeking a skilled Senior Principal Site Reliability Engi...
Location
Location
United States , Deerfield
Salary
Salary:
96000.00 - 132000.00 USD / Year
https://www.baxter.com/ Logo
Baxter
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
Job Responsibility
Job Responsibility
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes
What we offer
What we offer
  • Healthcare benefits
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Flexible Spending Accounts
  • Educational assistance programs
  • Paid holidays
  • Paid time off
  • Paid parental leave
  • Commuting benefits
  • Employee Discount Program
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...
Location
Location
United States , Deerfield
Salary
Salary:
96000.00 - 132000.00 USD / Year
https://www.baxter.com/ Logo
Baxter
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
  • Applicants must be authorized to work for any employer in the U.S.
  • Unable to sponsor or take over sponsorship of an employment visa at this time.
Job Responsibility
Job Responsibility
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes.
What we offer
What we offer
  • Support for Parents
  • Continuing Education/Professional Development
  • Employee Health & Well-Being Benefits
  • Paid Time Off
  • 2 Days a Year to Volunteer
  • Medical and dental coverage starting day one
  • Insurance coverage for basic life, accident, short-term and long-term disability
  • Business travel accident insurance
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...
Location
Location
United States , San Francisco
Salary
Salary:
180960.00 - 230900.00 USD / Year
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
  • four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash
  • networking technologies such as TCP/IP or security
  • four years of experience in automation development and infrastructure as code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
  • knowledge of Linux and Windows systems
  • cloud technologies within AWS, GCP, Azure
  • continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices
  • must pass technical interview
Job Responsibility
Job Responsibility
  • Architect, develop, and troubleshoot large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
  • provide real-time feedback on production systems
  • work with product family and platform developers to maintain and improve services and performance with a strong customer focus
  • utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
  • responsible for automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
  • build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud as well as automate repetitive work
  • help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
  • responsible for continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices
What we offer
What we offer
  • Health and wellbeing resources
  • paid volunteer days
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

As a Senior Site Reliability Engineer on the Platform team, you will identify is...
Location
Location
United States , Denver; San Francisco
Salary
Salary:
138000.00 - 191000.00 USD / Year
https://checkr.com Logo
Checkr
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Degree in Computer Science (or related field)
  • 6+ years of experience in building tools with Python (preferred), GoLang, or Ruby
  • 6+ years of experience in maintaining and observing production customer-facing environments in AWS or Azure
  • 6+ years of experience as a member of an incident response team
  • Deep understanding of the fundamental infrastructure and platform concepts behind a micro-service architecture, REST APIs, and asynchronous queueing models
  • Experience with observability platforms and frameworks like Datadog, Splunk, Grafana, Prometheus, or OpenTelemetry
  • Strong collaboration, documentation, communication, and project management skills
  • Experience with container orchestration using Kubernetes/Docker/Terraform
  • Experience driving platform adoption across engineering teams, guided by a self-service and product-first approach
  • A passion for customer-centricity and building relationships with other teams
Job Responsibility
Job Responsibility
  • Collaborate, drive, and execute architectural discussions with cross-functional teams
  • Lead cross-team projects and SREs' technical roadmap to enable engineering and help Checkr customers
  • Design, build, ship, and maintain the core observability libraries, tools, and patterns used by all of Checkr’s engineering teams
  • Proactively engage across teams to foster service reliability, efficiency, and scalability
  • Troubleshoot complex production issues across the stack, with respect to performance, availability, and data quality
  • Present detailed technical information and benefits of the Checkr platform to a wide array of customers, including operations, developers, technical architects, and executives
What we offer
What we offer
  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive cash and equity compensation and opportunities for advancement
  • 100% medical, dental, and vision coverage
  • Up to $25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend, home office stipend
  • In-office perks such as lunch four times a week, commuter stipend, and an abundance of snacks and beverages
  • Fulltime
Read More
Arrow Right

Senior Vice President, Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...
Location
Location
Singapore , Singapore
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree or equivalent work experience
  • 8+ years of relevant work experience
  • Highly motivated self-starter with excellent interpersonal and communication skills. Able to communicate efficiently at multiple levels of seniority
  • Certification or formal training in site reliability engineering concepts and practices
  • Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
  • 5+ years experience in Python (preferable) or Java, on large scale systems alongside Linux based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience of k8s and container technologies such as Docker, Openshift and EKS.
  • Experience with public cloud technologies such as AWS, GCP or Azure
  • Experience with Secrets products such as HashiCorp Vault or CyberArk
Job Responsibility
Job Responsibility
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organization
  • Actively owning production level incidents till resolution.
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...
Location
Location
India , Chennai
Salary
Salary:
Not provided
arcadia.com Logo
Arcadia
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
  • Strong hands-on experience with: Terraform & Infrastructure as Code
  • AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty)
  • Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD
  • Kubernetes troubleshooting and operations
  • Prometheus/Grafana/Datadog observability stacks
  • Proven ability to operate in high-scale, high-uptime, multi-environment production systems
  • Experience building automation via Python/Bash and reducing operational toil
  • Strong understanding of incident management, root cause analysis, and reliability engineering principles
Job Responsibility
Job Responsibility
  • Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
  • Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
  • Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
  • Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
  • Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch—ensuring actionable alerting, dashboards, and metrics alignment with SLO/SLIs
  • Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
  • Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
  • Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
  • Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
  • Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems
What we offer
What we offer
  • Competitive compensation and employee stock options
  • Hybrid/remote-first working model (India-based role, with global collaboration)
  • Flexible leave policy
  • Comprehensive medical insurance (self + family members)
  • Annual performance cycle + quarterly recognition awards
  • A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

AutoRABIT is the leader in DevSecOps for SaaS platforms such as Salesforce. Its ...
Location
Location
India , Hyderabad
Salary
Salary:
25.00 - 30.00 INR / Year
autorabit.com Logo
AutoRABIT
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 6+ years of experience in SRE, DevOps, or related roles
  • Solid hands-on experience with AWS services (EKS, ECS, EC2, RDS, S3, Redis, etc.)
  • Proficient in writing Terraform infrastructure scripts
  • Strong scripting skills in Python using Boto3
  • Deep understanding of monitoring/logging tools (ELK, CloudWatch, TrendMicro)
  • Experience building and managing CI/CD pipelines (CodeBuild, CodePipeline)
  • Knowledge of infrastructure security and incident response practices
  • Willing to work in rotational shifts and rotational week-offs
  • Bachelor’s in computers or any related field
  • AWS certifications is preferred
Job Responsibility
Job Responsibility
  • Provision and manage AWS infrastructure using Terraform
  • Write AWS Lambda functions (Python3 + Boto3) to automate operational tasks
  • Set up monitoring, logging, and alerting with ELK, TrendMicro, and AWS CloudWatch
  • Configure alerts for performance and security anomalies
  • Develop and maintain CI/CD pipelines using AWS CodeBuild and CodePipeline
  • Troubleshoot production issues and contribute to blameless postmortems
  • Contribute to system hardening and security compliance efforts
  • Responsibility to adhere to set internal controls
  • Fulltime
Read More
Arrow Right