Site Reliability Engineer II Job at Microsoft Corporation (Multiple Locations)

Site Reliability Engineer II

Are you interested in working on cutting-edge cloud security products? Would you...

Location

United States, Redmond

Salary:

102100.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Candidates must be able to meet Microsoft, customer and/or government security screening requirements
Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph)
Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
2+ years technical experience working with large-scale cloud or distributed systems
Demonstrated experience applying software engineering principles to production systems, including designing, building, or improving services and platforms
Proficiency in one or more programming languages such as C#, Go, Java, or Python, with the ability to develop and maintain production-quality code
Experience with automation that results in measurable improvements (e.g., reduced toil, fewer manual steps, improved system reliability)
Experience with debugging and troubleshooting complex distributed systems in production environments
Ability to independently identify problems and implement solutions that improve system reliability and operational efficiency

Job Responsibility

Live Site Operations: Serve as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health and responding to incidents within SLA timelines
Automation & Deployment: Contribute to automation efforts and validate code functionality in non-production environments to ensure smooth deployments
Compliance & Security: Support compliance processes by verifying security, privacy, and accessibility standards during onboarding of new technologies
Continuous Learning: Stay current with industry trends and internal tools to improve reliability, performance, and observability at scale
Engineering Best Practices: Apply proven development and scaling practices to meet performance and customer requirements
Cross-Team Collaboration: Communicate effectively with engineering partners to align on goals and deliver user-centric solutions
Incident Response & Postmortems: Address complex live site issues, implement mitigations, and document learnings through postmortems

Fulltime

Site Reliability Engineer II

Location

United States, Exton

Salary:

Not provided

Bentley Systems

Expiration Date

Until further notice

Requirements

U.S. Master of Science degree, or foreign equivalent, in Information Quality, Computer and Information Science, or a closely related field, and 3 years of DevOps Engineering experience
3 years’ experience with Site Reliability Engineering and DevOps automation including designing, implementing and maintaining CI/CD pipelines for cloud-based production systems

Job Responsibility

Responsible for designing, implementing, and maintaining automated cloud infrastructure and CI/CD pipelines to support enterprise software applications
Perform DevOps automation, Infrastructure as Code, and containerized deployments to improve system reliability, scalability, and operational efficiency while reducing manual intervention
Work with the Azure and Amazon Web Services (AWS) cloud platforms, including infrastructure provisioning, networking architecture, identity management, and security configuration
Develop and maintain IaC using Terraform, along with automation and scripting using Python or PowerShell, and configuration management using Ansible to support scalable and reliable cloud environments
Use containerization and orchestration technologies, including Docker, Kubernetes, and Helm, for deploying, scaling, and managing distributed cloud-native applications
Build and maintain monitoring, logging, and alerting systems (e.g., Prometheus, Grafana) and participate in a rotating on-call schedule for production support

Site Reliability Engineer II

Microsoft is a company where passionate innovators come to collaborate, envision...

Location

India, Hyderabad

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Job Responsibility

Work with all aspects of a high throughput and multi-tenant service
Collaborate effectively within the team and with partner teams across Microsoft
Be part of the on-call rotation for maintaining service health
Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams
Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement
Document and define existing data engineering processes, data and technology, while evaluating them for optimization
System Reliability & Uptime – Ensuring high availability of services
Incident Management – Detecting, responding to, and mitigating system failures
Performance Monitoring – Tracking system health and resolving bottlenecks
Automation & Tooling – Reducing manual work through scripts and automation

Fulltime

Site Reliability Engineer II

Location

Canada

Salary:

170000.00 - 200000.00 CAD / Year

Axon

Expiration Date

Until further notice

Requirements

5+ years of applicable experience
Experience managing cloud platforms such as Azure, AWS, or similar
Experience using managed languages such as Python, Go, C#, Java, or similar
Experience operating in Kubernetes platforms like AKS, EKS, or similar
Experience using CI/CD platforms to automate infrastructure provisioning, software builds, tests, and releases
Experience using observability tools such as APM, logging, and metrics to assist with debugging issues
Experience using Infrastructure as Code tools such as Terraform, AWS CloudFormation, or similar to provision infrastructure
Builder-operator mindset with proven production ownership (uptime, SLOs, on-call, incident leadership)
Empathy to support the needs of software engineers

Job Responsibility

Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, securely, and cost-effectively
Exemplify cloud-native site reliability best practices
Write code that is performant, maintainable, clear, and concise
Employ strong problem-solving skills, with the ability to debug problems in cloud-native distributed systems
Influence and educate the engineering organization to adopt new and improved architectural patterns
Provide robust documentation for use by engineers to promote self-service
Take calculated risks, champion new ideas, and cultivate your craft

Fulltime

Site Reliability Engineer II

The IDEAS organization’s mission is to unlock the power of data to deliver actio...

Location

United States, Redmond

Salary:

100600.00 - 199000.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Bachelor's Degree in Computer Science, or related technical discipline with proven experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Experience with automation, live site operations, and incident response in large-scale cloud or distributed systems
Proficiency in at least one programming or scripting language (for example: C#, Java, Python, or PowerShell)
Strong analytical and problem-solving skills, including experience using telemetry and operational data to inform decisions
Effective written and verbal communication skills, and experience collaborating across teams and disciplines
Ability to meet Microsoft, customer, and/or government security screening requirements, including passing the Microsoft Cloud Background Check upon hire and periodically thereafter
The successful candidate must have an active U.S. Government Secret Security Clearance

Job Responsibility

Participate as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health, responding to incidents within defined SLAs, and contributing to post-incident reviews and learning
Design, build, and maintain automation for deployment, operations, and incident mitigation to improve reliability and reduce manual effort
Instrument services for observability; collect and analyze telemetry and health signals; and use data to guide reliability and performance improvements
Collaborate with engineering partners and stakeholders to align on goals, share operational insights, and deliver user-focused solutions
Apply engineering best practices for development, scaling, and operational excellence to meet performance and customer requirements
Support compliance with security, privacy, and accessibility requirements throughout service onboarding and ongoing operations
Continuously learn and adopt industry practices and internal tools to improve reliability, performance, and observability

Fulltime

Site Reliability Engineer II

As an intermediate Site Reliability Engineer on the Core Infrastructure team in ...

Location

Canada, Toronto

Salary:

115000.00 - 165000.00 CAD / Year

PagerDuty

Expiration Date

Until further notice

Requirements

3+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
Hands-on experience operating Linux-based systems in production environments
Working knowledge of networking fundamentals, such as load balancing, DNS, TLS, and ingress traffic flow
Experience with container orchestration (e.g., EKS, Kubernetes)
Experience working on cloud-native infrastructure (e.g., AWS, GCP, Azure), including networking and compute concepts
Proficiency in at least one programming language (e.g., Python, Ruby, Go, etc.)
Experience with Infrastructure as Code (e.g., Terraform, CloudFormation)

Job Responsibility

Support and improve foundational infrastructure, including networking, compute platforms, Kubernetes clusters, and ingress/traffic management systems
Contribute to the reliability and scalability of PagerDuty's core platform by hardening existing systems and supporting the rollout of new infrastructure capabilities
Participate in agile rituals (standups, planning, retros) and communicate progress/risks early
Stay current on technical trends to suggest innovative tools and approaches to interesting problems
Monitor system health using metrics, logs, and alerts, and participate in 24/7 on-call rotations to help detect, respond to, and resolve incidents

What we offer

Competitive salary
Comprehensive benefits package
Flexible work arrangements
Company equity
ESPP (Employee Stock Purchase Program)
Retirement or pension plan
Generous paid vacation time
Paid holidays and sick leave
Dutonian Wellness Days & HibernationDuty - companywide paid days off in addition to PTO
Paid parental leave: 22 weeks for pregnant parent, 12 weeks for non-pregnant parent

Fulltime

Site Reliability Engineer II

We are seeking an experienced Site Reliability Engineer II to help build, mainta...

Location

United States, Alpharetta

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

3+ years experience in SRE, DevOps, or Cloud Infrastructure roles
Strong hands-on experience with Microsoft Azure services
Advanced experience with Terraform and Terragrunt
Proficiency with Kubernetes/AKS and container orchestration
Experience with CI/CD tools including GitHub Actions and ArgoCD
Solid understanding of observability tooling, especially Grafana
Hands-on experience with Java environments (for app debugging/support)

Job Responsibility

Design, implement, and manage Azure cloud infrastructure using Terraform and Terragrunt
Maintain, monitor, and optimize Kubernetes clusters (AKS)
Build and manage CI/CD pipelines using GitHub Actions/Workflows and ArgoCD in a GitOps model
Enhance reliability through monitoring, alerting, and observability using Grafana (experience with Prometheus, Loki, and Tempo is a plus)
Automate operational tasks to reduce manual toil
Participate in on-call rotations, incident response, and post-mortem reviews
Collaborate with development teams to improve application reliability, performance, and scalability
Implement and advocate for SRE practices including SLIs, SLOs, and error budgets
Continuously improve infrastructure performance, cost efficiency, and security posture

What we offer

medical
vision
dental
life and disability insurance
company 401(k) plan

Site Reliability Engineer II

Under general supervision, the Site Reliability Systems Administrator II is resp...

Location

United States, Birmingham

Salary:

Not provided

Alliance Automotive UK LV Ltd

Expiration Date

Until further notice

Requirements

Typically requires a bachelor's degree and three (3) to five (5) years of related experience or an equivalent combination
Intermediate knowledge of appropriate networks, products, and protocols
Knowledge of Unix, Windows NT/2000/98, Internet security, Oracle ERP, and distributed computing systems
Knowledge of job-related databases, software, documentation, programming languages, and monitoring and version control tools
Troubleshooting skills
Problem solving skills
Demonstrated knowledge of and adherence to Change Management processes
Ability to interface well with customers, end users, partners, and associates

Job Responsibility

Defines, designs, and administers network systems used for data communications and recommends improvements to problems of moderate scope
Responsible for making sure that the company network works
Manages the load configuration of a central data communication processor under limited guidance and makes some recommendations for the purchase or upgrade of data networks
Exercises some discretion in proposing and implementing network system enhancements (software and hardware updates)
Serves as a point of contact for performance analysis, scalability, and service architecture/database administration issues
Coordinates equipment orders including terminals and cable installation, as well as upgrading, monitoring, testing, and servicing the database/systems
Helps to negotiate and place orders with common carriers
Performs other duties as assigned

What we offer

options for healthcare coverage, 401(k), tuition reimbursement, vacation, sick, and holiday pay

Fulltime
