Site Reliability Engineer II

Ireland, Dublin · Job Posted January 26, 2026

Job Description

Site Reliability Engineer II (Microsoft 365 Enterprise + Cloud). We are looking for Site Reliability Engineers (SREs) with the right mix of systems engineering, data science, software development, AI, and online services experience, and a passion for quality, to envision, design, and deliver Microsoft 365 (M365) Enterprise + Cloud service offerings.

Team Overview: Within the vast framework of M365 Office Engineering Direct (OED), our SRE team is instrumental to the success of Exchange Online. With the service spanning hundreds of components, our goal is clear: ensure unmatched service availability and continually elevate user satisfaction.

What We Do & Our Impact: Our approach is layered and precise. By implementing proactive engineering solutions, we identify and tackle incidents head-on, keeping disruptions limited. Monitoring, both comprehensive and nuanced, remains our cornerstone, capturing anomalies beyond the scope of conventional systems. With swift diagnostics steering our course, we channel our efforts towards automation, efficiently managing the incident lifecycle from detection to resolution. Additionally, with a commitment rooted in understanding our users, we meticulously prioritize and execute Design Change Requests, ensuring Exchange Online's evolution aligns with user expectations.

The Future – Artificial Intelligence (AI) & Machine Learning (ML) in Focus: As we look to the horizon, the fusion of AI and ML with our SRE practices promises a transformative era for Online Cloud Services in M365. We are in the initial stages of integrating predictive analytics to anticipate issues before they manifest, allowing us to stay a step ahead. Customized ML models are being developed to intelligently sift through vast data lakes, identifying patterns and correlations previously overlooked. Our journey with AI and ML is not just about enhancement; it is about redefining reliability, precision, and the user experience in the M365 suite.

Job Responsibility

  • Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies
  • Identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance
  • Drives the adoption of innovative solutions across engineering teams working with related products within an organization
  • Applies advanced statistical and machine learning techniques to analyze large datasets and extract meaningful insights
  • Experience working with all service aspects of high throughput and multi-tenant services, ability to understand and design workflows carefully, properly handle errors, write clean and well-factored code with good tests and good maintainability
  • Engages with product engineering teams by partaking in code/design reviews, participating in on-call rotations and incident responses throughout product development and operations cycles
  • Leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention
  • Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale
  • Reviews existing automation code and scripts to evaluate reusability, extendibility, and scalability within an organization
  • Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale
  • Contributes to the development of new tooling and/or predictive models to identify and test potential improvements in product development and/or operations and monitors the impact of changes on operations metrics (e.g., Time-to-X) within an organization
  • Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root cause(s)
  • Alerts product teams, owners, and leadership to issues with major customer/business impact and escalates the most complex, ambiguous, and impactful issues to other engineering teams and/or subject matter experts as needed
  • Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings
  • Mentors and coaches less experienced engineers to help them identify and propose relevant solutions

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Mid-level software development experience; automation-related experience is most valued
  • Scripting languages such as Bash, Python, and PowerShell, or compiled languages such as C and C#, are most relevant, but others are acceptable
  • Awareness of, and ability to reason about, modern software & systems architectures, including load-balancing, queueing, caching, distributed systems failure modes, microservices, and so on
  • Associated troubleshooting skills, including the ability to follow RPC (Remote Procedure Call) call-chains across arbitrary network steps
  • Consequent understanding of monitoring in distributed systems
  • Deep understanding of operating system level concepts such as processes, memory allocation, and the network stack
  • Understanding of how applications are affected by the above, and the ability to debug them
  • Experience with working in a team, including coordinating large projects, communicating well, and exercising initiative when presented with problems
  • Practical experience running large scale online systems is always an advantage
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Nice to have

Master's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND mid-level technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience

Similar Jobs for Site Reliability Engineer II

8 matching positions

Site Reliability Engineer II

Are you interested in working on cutting-edge cloud security products? Would you...
Location: United States, Redmond
Salary: 102100.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements
  • Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph)
  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • 2+ years technical experience working with large-scale cloud or distributed systems
  • Demonstrated experience applying software engineering principles to production systems, including designing, building, or improving services and platforms
  • Proficiency in one or more programming languages such as C#, Go, Java, or Python, with the ability to develop and maintain production-quality code
  • Experience with automation that results in measurable improvements (e.g., reduced toil, fewer manual steps, improved system reliability)
  • Experience with debugging and troubleshooting complex distributed systems in production environments
  • Ability to independently identify problems and implement solutions that improve system reliability and operational efficiency
Job Responsibility
  • Live Site Operations: Serve as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health and responding to incidents within SLA timelines
  • Automation & Deployment: Contribute to automation efforts and validate code functionality in non-production environments to ensure smooth deployments
  • Compliance & Security: Support compliance processes by verifying security, privacy, and accessibility standards during onboarding of new technologies
  • Continuous Learning: Stay current with industry trends and internal tools to improve reliability, performance, and observability at scale
  • Engineering Best Practices: Apply proven development and scaling practices to meet performance and customer requirements
  • Cross-Team Collaboration: Communicate effectively with engineering partners to align on goals and deliver user-centric solutions
  • Incident Response & Postmortems: Address complex live site issues, implement mitigations, and document learnings through postmortems
Employment type: Full-time

Site Reliability Engineer II

Location: United States, Exton
Salary: Not provided
Company: Bentley Systems (bentley.com)
Expiration Date: Until further notice
Requirements
  • U.S. Master of Science degree, or foreign equivalent, in Information Quality, Computer and Information Science, or a closely related field, and 3 years of DevOps Engineering experience
  • 3 years’ experience with Site Reliability Engineering and DevOps automation including designing, implementing and maintaining CI/CD pipelines for cloud-based production systems
Job Responsibility
  • Responsible for designing, implementing, and maintaining automated cloud infrastructure and CI/CD pipelines to support enterprise software applications
  • Perform DevOps automation, Infrastructure as Code, and containerized deployments to improve system reliability, scalability, and operational efficiency while reducing manual intervention
  • Cloud platforms Azure and Amazon Web Services (AWS), including infrastructure provisioning, networking architecture, identity management and security configuration
  • Developing and maintaining IaC using Terraform, along with automation and scripting using Python or PowerShell, and configuration management using Ansible to support scalable and reliable cloud environments
  • Containerization and orchestration technologies, including Docker, Kubernetes and Helm for deploying, scaling, and managing distributed cloud-native applications
  • Build and maintain monitoring, logging, and alerting systems (e.g., Prometheus, Grafana) and participate in a rotating on-call schedule for production support

Site Reliability Engineer II

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Hyderabad
Salary: Not provided
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Work with all aspects of a high throughput and multi-tenant service
  • Collaborate effectively within the team and with partner teams across Microsoft
  • Be part of the on-call rotation for maintaining service health
  • Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams
  • Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement
  • Document and define existing data engineering processes, data and technology, while evaluating them for optimization
  • System Reliability & Uptime – Ensuring high availability of services
  • Incident Management – Detecting, responding to, and mitigating system failures
  • Performance Monitoring – Tracking system health and resolving bottlenecks
  • Automation & Tooling – Reducing manual work through scripts and automation
Employment type: Full-time

Site Reliability Engineer II

Location: Canada
Salary: 170000.00 - 200000.00 CAD / Year
Company: Axon (axon.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of applicable experience
  • Experience managing cloud platforms such as Azure, AWS, or similar
  • Experience using managed languages such as Python, Go, C#, Java, or similar
  • Experience operating in Kubernetes platforms like AKS, EKS, or similar
  • Experience utilizing CI/CD platforms to automate provisioning infrastructure, software builds, tests, and releases
  • Experience using observability tools such as APM, logging, and metrics to assist with debugging issues
  • Experience using Infrastructure as Code tools for provisioning infrastructure such as Terraform, AWS CloudFormation, or similar
  • Builder-operator mindset with proven production ownership (uptime, SLOs, on-call, incident leadership)
  • Empathy to support the needs of software engineers
Job Responsibility
  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, securely, and cost-effectively
  • Exemplify cloud-native site reliability best practices
  • Write code that is performant, maintainable, clear, and concise
  • Employ strong problem-solving skills, with the ability to debug problems in cloud-native distributed systems
  • Influence and educate the engineering organization to adopt new and improved architectural patterns
  • Provide robust documentation for use by engineers to promote self-service
  • Take calculated risks, champion new ideas, and cultivate your craft
Employment type: Full-time

Site Reliability Engineer II

The IDEAS organization’s mission is to unlock the power of data to deliver actio...
Location: United States, Redmond
Salary: 100600.00 - 199000.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Bachelor's Degree in Computer Science, or related technical discipline with proven experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Experience with automation, live site operations, and incident response in large-scale cloud or distributed systems
  • Proficiency in at least one programming or scripting language (for example: C#, Java, Python, or PowerShell)
  • Strong analytical and problem-solving skills, including experience using telemetry and operational data to inform decisions
  • Effective written and verbal communication skills, and experience collaborating across teams and disciplines
  • Ability to meet Microsoft, customer, and/or government security screening requirements, including passing the Microsoft Cloud Background Check upon hire and periodically thereafter
  • The successful candidate must have an active U.S. Government Secret Security Clearance
Job Responsibility
  • Participate as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health, responding to incidents within defined SLAs, and contributing to post-incident reviews and learning
  • Design, build, and maintain automation for deployment, operations, and incident mitigation to improve reliability and reduce manual effort
  • Instrument services for observability; collect and analyze telemetry and health signals; and use data to guide reliability and performance improvements
  • Collaborate with engineering partners and stakeholders to align on goals, share operational insights, and deliver user-focused solutions
  • Apply engineering best practices for development, scaling, and operational excellence to meet performance and customer requirements
  • Support compliance with security, privacy, and accessibility requirements throughout service onboarding and ongoing operations
  • Continuously learn and adopt industry practices and internal tools to improve reliability, performance, and observability
Employment type: Full-time

Site Reliability Engineer II

Are you interested in working on cutting-edge cloud security products? Would you...
Location: United States, Multiple Locations
Salary: 100600.00 - 199000.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Active U.S. Government Top Secret Security Clearance
  • Ability to pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Live Site Operations: Serve as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health and responding to incidents within SLA timelines
  • Automation & Deployment: Contribute to automation efforts and validate code functionality in non-production environments to ensure smooth deployments
  • Compliance & Security: Support compliance processes by verifying security, privacy, and accessibility standards during onboarding of new technologies
  • Continuous Learning: Stay current with industry trends and internal tools to improve reliability, performance, and observability at scale
  • Engineering Best Practices: Apply proven development and scaling practices to meet performance and customer requirements
  • Cross-Team Collaboration: Communicate effectively with engineering partners to align on goals and deliver user-centric solutions
  • Incident Response & Postmortems: Address complex live site issues, implement mitigations, and document learnings through postmortems
Employment type: Full-time

Site Reliability Engineer II

As an intermediate Site Reliability Engineer on the Core Infrastructure team in ...
Location: Canada, Toronto
Salary: 115000.00 - 165000.00 CAD / Year
Company: PagerDuty (https://www.pagerduty.com)
Expiration Date: Until further notice
Requirements
  • 3+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
  • Hands-on experience operating Linux-based systems in production environments
  • Working knowledge of networking fundamentals, such as load balancing, DNS, TLS, and ingress traffic flow
  • Experience with container orchestration (e.g., EKS, Kubernetes)
  • Experience working on cloud-native infrastructure (e.g., AWS, GCP, Azure), including networking and compute concepts
  • Proficiency in at least one programming language (e.g., Python, Ruby, Go, etc.)
  • Experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
Job Responsibility
  • Support and improve foundational infrastructure, including networking, compute platforms, Kubernetes clusters, and ingress/traffic management systems
  • Contribute to the reliability and scalability of PagerDuty's core platform by hardening existing systems and supporting the rollout of new infrastructure capabilities
  • Participate in agile rituals (standups, planning, retros) and communicate progress/risks early
  • Stay current on technical trends to suggest innovative tools and approaches to interesting problems
  • Monitor system health using metrics, logs, and alerts, and participate in 24/7 on-call rotations to help detect, respond to, and resolve incidents
What we offer
  • Competitive salary
  • Comprehensive benefits package
  • Flexible work arrangements
  • Company equity
  • ESPP (Employee Stock Purchase Program)
  • Retirement or pension plan
  • Generous paid vacation time
  • Paid holidays and sick leave
  • Dutonian Wellness Days & HibernationDuty - companywide paid days off in addition to PTO
  • Paid parental leave: 22 weeks for pregnant parent, 12 weeks for non-pregnant parent
Employment type: Full-time

Site Reliability Engineer II

We are seeking an experienced Site Reliability Engineer II to help build, mainta...
Location: United States, Alpharetta
Salary: Not provided
Company: Robert Half (https://www.roberthalf.com)
Expiration Date: Until further notice
Requirements
  • 3+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
  • Strong hands-on experience with Microsoft Azure services
  • Advanced experience with Terraform and Terragrunt
  • Proficiency with Kubernetes/AKS and container orchestration
  • Experience with CI/CD tools including GitHub Actions and ArgoCD
  • Solid understanding of observability tooling, especially Grafana
  • Hands-on experience with Java environments (for app debugging/support)
Job Responsibility
  • Design, implement, and manage Azure cloud infrastructure using Terraform and Terragrunt
  • Maintain, monitor, and optimize Kubernetes clusters (AKS)
  • Build and manage CI/CD pipelines using GitHub Actions/Workflows and ArgoCD in a GitOps model
  • Enhance reliability through monitoring, alerting, and observability using Grafana (experience with Prometheus, Loki, and Tempo is a plus)
  • Automate operational tasks to reduce manual toil
  • Participate in on-call rotations, incident response, and post-mortem reviews
  • Collaborate with development teams to improve application reliability, performance, and scalability
  • Implement and advocate for SRE practices including SLIs, SLOs, and error budgets
  • Continuously improve infrastructure performance, cost efficiency, and security posture
What we offer
  • medical
  • vision
  • dental
  • life and disability insurance
  • company 401(k) plan