
Site Reliability Engineer 2

United States, Redmond 100600.00 - 199000.00 USD / Year · Job Posted February 16, 2026

Job Description

The M365 Copilot App Platform team provides the platform APIs, infrastructure, and backend web server for the Microsoft 365 Copilot app. All partner teams have built their AI-enabled experiences on our platform and depend on us for their success. We own everything from the application code to the platform APIs, deployment pipelines, and infrastructure, including the backend web server and middle-tier service that support the application on the web, Windows, and Mac. This role is central to enabling the M365 Copilot app, one of Microsoft’s key strategic products in the competitive AI landscape.

Job Responsibility

  • Leverage expertise in distributed systems, cloud technology layers, platform APIs, and infrastructure components to improve availability, reliability, performance, observability, and security of the middle-tier services
  • Identify opportunities to enhance service quality by analyzing production telemetry and applying insights to propose and implement engineering changes
  • Participate in on-call rotations and incident response, engaging with product engineering teams throughout the product lifecycle
  • Independently create, test, and deploy changes through safe deployment processes (SDP) to improve operability and code quality
  • Collaborate with engineers and architects to diagnose and resolve production issues and prevent recurrence
  • Develop and maintain the middle-tier service, platform APIs, deployment pipelines, and infrastructure supporting the M365 Copilot app
  • Work closely with partner teams to enable new capabilities and ensure the platform meets reliability and performance requirements
  • Contribute to the continuous evolution of infrastructure and tooling to support services at scale
  • Collaborate with cross-functional teams to enable the M365 Copilot app and drive innovation
  • Work closely with partner teams to build new capabilities into our application

Requirements

  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Nice to have

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience or experience as a Site Reliability Engineer building and shipping production software or services with code in languages including, but not limited to, C#, JavaScript, or TypeScript OR equivalent experience
  • Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker, containers ecosystem)
  • Proven ability to modify componentized, well-architected infrastructure software and collaborate across teams
  • 1+ years experience with incident management and reliability engineering in cloud or AI environments
  • Proficient in scripting (PowerShell, Shell script, etc.) and expertise in Linux
  • Technical experience working with large-scale cloud or distributed systems
  • Experience running highly-available, mission-critical large-scale distributed systems, including domain expertise in areas such as scalable & fault tolerant system design, observability & monitoring, safe change management, automation, reliability & security risk reduction
  • Motivated and self-driven
  • Strong cross-team communication and partnership skills
  • Creativity, insightfulness, and sensitivity for a dynamic approach to problem solving

Similar Jobs for Site Reliability Engineer 2

8 matching positions

Site Reliability Engineer 2

FreeWheel is seeking a Junior DevOps / SRE 2 to join Freewheel OPS team based in...
Location: United States, Chicago; Englewood; Philadelphia
Salary: 84478.50 - 126717.75 USD / Year
Company: Comcast Advertising
Expiration Date: Until further notice
Requirements
  • 1-3 years of experience as an SRE, DevOps or Operations Engineer
  • Proficient in at least one programming language, such as Python, Go, Java, or Scala
  • Experience with an automation tool or framework such as Ansible, Terraform, Kubernetes, Docker
  • Familiar with monitoring and log management tools such as Prometheus, Grafana, ELK Stack
  • Excellent communication skills
  • Proactive learner eager to grow in operations and governance
  • Bachelor’s degree or higher in Computer Science, Software Engineering, or a related field
Job Responsibility
  • Design and implement monitoring and alerting systems
  • Develop and maintain automation tools and scripts for deployment, monitoring, backup and disaster recovery
  • Analyze and optimize the performance of data storage, query performance, and data flows
  • Respond quickly to platform failures, perform troubleshooting, and coordinate cross-team efforts
  • Work with engineering teams to analyze and forecast capacity requirements
  • Maintain consistent cloud standards and support enforcement of governance and compliance practices
  • Document the architecture, configurations, and operational procedures
  • Ensure platforms meet security standards and compliance requirements
  • Collaborate with engineering team, product team, and project management team
What we offer
  • Medical, prescription, vision, and dental insurance
  • 401(k) savings plan with dollar-for-dollar matching up to the first 6% of your pay
  • Paid time off including eight observed company holidays and flex time
  • Tuition assistance
  • Commuter benefits
Employment type: Full-time

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand scripting languages: PowerShell, Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
Employment type: Full-time

Site Reliability Engineer II

Are you interested in working on cutting-edge cloud security products? Would you...
Location: United States, Redmond
Salary: 102100.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements
  • Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph)
  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • 2+ years technical experience working with large-scale cloud or distributed systems
  • Demonstrated experience applying software engineering principles to production systems, including designing, building, or improving services and platforms
  • Proficiency in one or more programming languages such as C#, Go, Java, or Python, with the ability to develop and maintain production-quality code
  • Experience with automation that results in measurable improvements (e.g., reduced toil, fewer manual steps, improved system reliability)
  • Experience with debugging and troubleshooting complex distributed systems in production environments
  • Ability to independently identify problems and implement solutions that improve system reliability and operational efficiency
Job Responsibility
  • Live Site Operations: Serve as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health and responding to incidents within SLA timelines
  • Automation & Deployment: Contribute to automation efforts and validate code functionality in non-production environments to ensure smooth deployments
  • Compliance & Security: Support compliance processes by verifying security, privacy, and accessibility standards during onboarding of new technologies
  • Continuous Learning: Stay current with industry trends and internal tools to improve reliability, performance, and observability at scale
  • Engineering Best Practices: Apply proven development and scaling practices to meet performance and customer requirements
  • Cross-Team Collaboration: Communicate effectively with engineering partners to align on goals and deliver user-centric solutions
  • Incident Response & Postmortems: Address complex live site issues, implement mitigations, and document learnings through postmortems
Employment type: Full-time

Lead Site Reliability Engineer

Trimble is looking for a Site Reliability Engineering Lead to join Business Syst...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field
  • 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles with at least 2+ years in a leadership or mentoring capacity
  • Deep AWS expertise (EC2, S3, RDS, IAM, VPC, Lambda, CloudFormation/Terraform, etc.)
  • Strong knowledge of Infrastructure-as-Code (IaC) using Terraform, AWS CDK, or CloudFormation
  • Proven experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, or similar)
  • Proficiency in containerization and orchestration (Docker, Kubernetes, ECS, or EKS)
  • Expertise in monitoring and observability tools (Datadog, New Relic, Prometheus, Grafana, ELK, CloudWatch, etc.)
  • Strong scripting or programming background (Python, Bash, or Go)
  • Sound understanding of networking, security, and identity/access management in the cloud
  • Experience designing high-availability and disaster recovery strategies for critical workloads
Job Responsibility
  • Become well-versed in the opportunities and challenges of the business and Trimble's customers
  • Become an expert in Business Systems services, especially the interfaces—APIs, protocols (e.g. OAuth), and user interfaces
  • Establish, then utilize tight working relationships with stakeholders across the company, especially Trimble's engineering community
  • Prototype and create proofs of concept as required
  • Scope and deploy new integrations
  • Investigate, diagnose, and solve customer integration issues
  • Effectively communicate technical issues with stakeholders in non-technical language
  • Contribute to utilities and SDKs to help integration and migration efforts
Employment type: Full-time

Site Reliability Engineer

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Op...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • 1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
  • Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
  • Direct experience with Microsoft Azure cloud platforms and its specialized ecosystem services (such as Azure ML and Azure DevOps)
  • Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
  • Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
  • Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle
Job Responsibility
  • Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
  • Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
  • Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
  • Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
  • Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns
What we offer
  • Structured environment to accelerate technical skills
  • Direct guidance from experienced engineering professionals
  • Projects that improve productivity, quality, safety, transparency and sustainability
  • Collaborative and supportive team
  • Entrepreneurial spirit empowering proactive doers
  • Flexible work arrangements
Employment type: Full-time

Site Reliability Engineer II

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Hyderabad
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Work with all aspects of a high throughput and multi-tenant service
  • Collaborate effectively within the team and with partner teams across Microsoft
  • Be part of the on-call rotation for maintaining service health
  • Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams
  • Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement
  • Document and define existing data engineering processes, data and technology, while evaluating them for optimization
  • System Reliability & Uptime – Ensuring high availability of services
  • Incident Management – Detecting, responding to, and mitigating system failures
  • Performance Monitoring – Tracking system health and resolving bottlenecks
  • Automation & Tooling – Reducing manual work through scripts and automation
Employment type: Full-time

Site Reliability Engineer

As a Site Reliability Engineer, you are passionate about experience innovation a...
Location: India, Bengaluru
Salary: Not provided
Company: Valtech
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
  • 2+ years in DevOps, SRE, or Support Engineering roles
  • Experience with incident management in high-traffic, public-facing platforms
  • Strong scripting skills (Python, Bash, or PowerShell)
  • Familiarity with CI/CD tools: GitHub Actions, Azure DevOps, GitLab, Jenkins
  • Experience with monitoring/APM tools: Datadog, New Relic, Dynatrace, Prometheus, Grafana
  • Basic knowledge of serverless services in AWS, Azure, or GCP
  • Proficiency with Docker and containerized environments
  • Excellent English communication skills (B2+ level)
  • Experience working in international, cross-cultural teams
Job Responsibility
  • Maintain and improve observability systems (monitoring, logging, alerting)
  • Define, adjust, and maintain Service Level Objectives (SLOs)
  • Participate in incident resolution and on-call rotations (max 1 week/month)
  • Drive proactive reliability improvements across platforms
  • Collaborate with teams to analyze failure scenarios and implement mitigations
  • Create and maintain runbooks for incident response and prevention
  • Eliminate non-value-adding tasks through automation and process optimization
What we offer
  • Flexibility, with hybrid work options (country-dependent)
  • Learning and development, with access to cutting-edge tools, training and industry experts
Employment type: Full-time

Site Reliability Engineer - CTJ - Poly

We are seeking a Senior Site Reliability Engineer to lead a team that builds and...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements. These requirements include, but are not limited to, the following specialized security screenings: The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
  • This position requires verification of U.S. citizenship due to citizenship-based legal restrictions
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Write secure, high-quality code that is maintainable, scalable, and performant
  • Architect, implement, and optimize hybrid and cloud infrastructure using Infrastructure as Code (e.g., Containers, Bicep, Terraform, AKS etc.) to improve availability, scale, security, and operational efficiency
  • Design and implement data governance, storage, backup, and disaster recovery for a multi-petabyte Azure environment, ensuring integrity, security, and performance
  • Build and operate large-scale data pipelines and data transformations to support analytics, governance, and operational needs
  • Evaluate emerging engineering tools and practices and incorporate them into the roadmap to continuously improve efficiency, reliability, and scale
  • Deliver automation to improve service health, manageability, reliability, telemetry, and alerting, with a focus on resiliency
  • Create and maintain clear technical documentation and design specifications aligned with best practices
  • Partner with engineering, project management, and operations to evolve services and optimize infrastructure in support of organizational goals
  • Participate in an on-call rotation to operate live services; troubleshoot and mitigate complex issues, escalate as needed, and write post-incident reviews to share learnings
Employment type: Full-time