Site Reliability Engineer

Microsoft Corporation

Location:
United States, Redmond

Contract Type:
Not provided

Salary:

84200.00 - 165200.00 USD / Year

Job Description:

The Site Reliability Engineer (SRE) for the Azure xDPU Storage Team – Hardware Enablement is responsible for ensuring the reliability, availability, and performance of Fungible DPU-based Azure Storage devices as they integrate next-generation networking and compute offload hardware. This role focuses on safe bring-up, validation, and scaled production operation of DPU-enabled platforms, bridging hardware, firmware, and software reliability and maintenance.

Job Responsibility:

  • Own end-to-end reliability for Azure Storage hardware running in on-prem lab environments
  • Partner with silicon, firmware, BIOS, networking, and OS teams to enable and validate DPU hardware for specific storage use cases
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for DPU-accelerated storage scenarios within our lab and pre-prod environments
  • Lead live-site incident response and mitigation for hardware-, firmware-, or DPU-related issues, including deep root-cause analysis across hardware/software boundaries within our lab and pre-prod environments
  • Build automation for provisioning, configuration, validation, canarying, rollback, patching, and recovery of DPU-enabled Azure Storage systems within our lab and pre-prod environments
  • Develop reliability validation strategies, including stress, fault-injection, and chaos testing for DPU hardware enablement and management
  • Create and maintain operational runbooks, diagnostics, telemetry, and health models specific to Fungible DPU platforms within our lab and pre-prod environments
  • Drive improvements in observability and alerting by extending Azure Monitor and internal systems with DPU- and hardware-level signals

Requirements:

  • Associate's Degree in Computer Science, Information Technology, or related field OR Bachelor's Degree in Computer Science, Information Technology, or related field OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
  • Bachelor's Degree in Computer Science, Electrical Engineering, Computer Engineering, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Experience operating large-scale, distributed systems in a lab/validation environment
  • Experience working close to hardware, including networking, storage, or accelerator technologies such as SmartNICs, DPUs, or offload engines
  • Proficiency in one or more programming or scripting languages (C++, C#, Python, Go, or PowerShell), with experience reading lower-level system code
  • Hands-on experience with Microsoft and Azure lab infrastructure and live-site operations
  • Demonstrated understanding of networking, operating systems, and performance characteristics of I/O-intensive distributed systems
  • Direct experience with Fungible DPU technology or similar SmartNIC/DPU platforms
  • Existing hands-on experience working in Microsoft MLS (Microsoft Lab Services) or equivalent internal lab environments, including lab-based hardware validation, performance testing, and bring-up workflows
  • Experience enabling new hardware platforms or accelerators in a Windows/mixed OS environment
  • Familiarity with firmware lifecycles, hardware validation, and silicon bring-up processes
  • Experience with infrastructure-as-code and CI/CD pipelines (ARM/Bicep, Terraform, Azure DevOps)

Additional Information:

Job Posted:
February 17, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Site Reliability Engineer

Senior Site Reliability Engineer

Baxter International is seeking a skilled Senior Principal Site Reliability Engi...
Location:
United States, Deerfield
Salary:
96000.00 - 132000.00 USD / Year
Baxter
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
Job Responsibility:
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes
What we offer:
  • Healthcare benefits
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Flexible Spending Accounts
  • Educational assistance programs
  • Paid holidays
  • Paid time off
  • Paid parental leave
  • Commuting benefits
  • Employee Discount Program
Employment Type:
Fulltime

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...
Location:
United States, Deerfield
Salary:
96000.00 - 132000.00 USD / Year
Baxter
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
  • Applicants must be authorized to work for any employer in the U.S.
  • Unable to sponsor or take over sponsorship of an employment visa at this time.
Job Responsibility:
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes.
What we offer:
  • Support for Parents
  • Continuing Education/Professional Development
  • Employee Health & Well-Being Benefits
  • Paid Time Off
  • 2 Days a Year to Volunteer
  • Medical and dental coverage starting day one
  • Insurance coverage for basic life, accident, short-term and long-term disability
  • Business travel accident insurance
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
Employment Type:
Fulltime

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...
Location:
United States, San Francisco
Salary:
180960.00 - 230900.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
  • Four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large-scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
  • Four years of experience in automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
  • Knowledge of Linux and Windows systems
  • Cloud technologies within AWS, GCP, Azure
  • Continuous integration and continuous delivery/deployment (CI/CD) practices and monitoring and observability practices
  • Must pass technical interview
Job Responsibility:
  • Architect, develop, and troubleshoot large-scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
  • Provide real-time feedback on production systems
  • Work with product family and platform developers to maintain and improve services and performance with a strong customer focus
  • Utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
  • Develop automation and implement infrastructure as code using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
  • Build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud, and automate repetitive work
  • Help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
  • Own continuous integration and continuous delivery/deployment (CI/CD) practices and monitoring and observability practices
What we offer:
  • Health and wellbeing resources
  • Paid volunteer days
Employment Type:
Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility:
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer:
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
Employment Type:
Fulltime

Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team responsible for Private and Public...
Location:
Singapore, Singapore
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree or equivalent work experience
  • 6+ years of relevant work experience
  • Highly motivated self-starter with excellent interpersonal and communication skills
  • Certification or formal training in site reliability engineering concepts and practices
  • Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
  • 4+ years of experience in Python (preferred) or Java on large-scale systems, alongside Linux-based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience with Kubernetes (k8s) and container technologies such as Docker, OpenShift, and EKS
  • Experience with public cloud technologies such as AWS, GCP or Azure
  • Experience with Secrets products such as HashiCorp Vault or CyberArk
Job Responsibility:
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
  • Actively owning production level incidents till resolution.
What we offer:
  • Equal opportunity employer
  • Accessibility support for persons with disabilities.
Employment Type:
Fulltime

Senior Software Engineer, Site Reliability

Babylist is looking for a Senior Software Engineer, Site Reliability to join our...
Location:
United States; Canada
Salary:
186818.00 - 224183.00 USD; CAD / Year
Babylist
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience as a Site Reliability Engineer or similar role
  • Experience supporting high-traffic consumer-facing websites
  • Proficiency with Terraform
  • Strong experience working with AWS cloud-based infrastructure and services
  • Proficiency with Docker and Kubernetes
  • Solid understanding of cloud-native systems design
  • Troubleshooting and debugging skills
  • Experience designing and supporting CI systems
  • Familiar with monitoring and alerting best practices
  • Proven experience in on-call management best practices
Job Responsibility:
  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Improve the speed and reliability of our Continuous Integration (CI) systems
  • Provide support to developers in troubleshooting issues
  • Establish, communicate, and support best practices for monitoring and alerting
What we offer:
  • Company-paid medical, dental, and vision insurance
  • Retirement savings plan with company matching and flexible spending accounts
  • Generous paid parental leave and PTO
  • Remote work stipend
  • Perks for physical, mental, and emotional health, parenting, childcare, and financial planning
Employment Type:
Fulltime

Principal Site Reliability Engineer

Location:
United States, Ft. Meade
Salary:
Not provided
CipherLogix
Expiration Date:
Until further notice
Requirements:
  • Fourteen (14) years experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution
  • Ten (10) years experience in system engineering/architecture
  • Ten (10) years experience working with products that support highly distributed, massively parallel computation needs such as HBase, Hadoop, Cloudbase/Accumulo, Bigtable, Cassandra, Scality, etc.
  • At least ten (10) years experience writing software scripts using scripting languages such as Perl, Python, or Ruby for software automation
  • At least four (4) years experience managing and monitoring large cloud systems (>200 nodes). Cloud Systems Administrator or Developer Certification
  • Experience in performing and providing technical direction for the development, engineering, interfacing, integration, and testing of complete hardware/software systems to include monitoring technical health of a system, improving organizational processes, implementation of postmortem (failure) analysis and incident management
  • Ten (10) years experience in the cleared environment
  • Ten (10) years demonstrated experience developing software for one of the following: Windows, UNIX, or Linux OS
  • Knowledge and experience with developing distributed storage routing and querying algorithms
  • Experience in developing documentation required to support a program’s technical issues and training situations
Employment Type:
Fulltime

Staff Site Reliability Engineer

We are looking for a Site Reliability Engineer to own our internal systems infra...
Location:
United States, Sunnyvale
Salary:
175000.00 - 250000.00 USD / Year
Figure
Expiration Date:
Until further notice
Requirements:
  • Strong experience with Linux/Unix systems administration
  • Proficiency in programming/scripting
  • Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
  • Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
  • Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
  • Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
  • Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
  • Experience defining Service Level Objectives (SLO), developing runbooks/incident response plans, facilitating post-mortems and managing systems assets
  • Ability to work in cross-functional teams with developers, infra, and product teams
  • Excellent verbal and written communication skills
Job Responsibility:
  • Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing and more
  • Migrate SaaS to self-hosted solutions to enhance security and reliability
  • Implement monitoring and alerting systems, and define incident response plans and runbooks
  • Reduce human workload by automating deployment and scaling
  • Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives
  • Use a data-driven approach to demonstrate service robustness and track optimization work
  • Partner with the security team to ensure that security remediations and updates are applied in a timely manner
Employment Type:
Fulltime