Site Reliability Engineering Job at Microsoft Corporation (Redmond)

Loan IQ Product Development and Site Reliability Engineering Manager

The Applications Development Group Manager is a senior management level position...

Location

Singapore, Singapore

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

10+ years of relevant experience
5 years of experience with the Loan IQ product
10+ years of experience with Java-related technologies, including Spring Boot and Angular
5+ years of experience in platform management
Experience managing global technology teams
Working knowledge of industry practices and standards
Consistently demonstrates clear and concise written and verbal communication
Bachelor's degree/University degree or equivalent experience

Job Responsibility

Manage multiple teams of professionals to accomplish established goals and conduct personnel duties for team (e.g. performance evaluations, hiring and disciplinary actions)
Provide strategic influence and exercise control over resources, budget management and planning while monitoring end results
Utilize in-depth knowledge of concepts and procedures within own area and basic knowledge of other areas to resolve issues
Ensure essential procedures are followed and contribute to defining standards
Integrate in-depth knowledge of applications development with overall technology function to achieve established goals
Provide evaluative judgement based on analysis of facts in complicated, unique, and dynamic situations including drawing from internal and external sources
Influence and negotiate with senior leaders across functions, as well as communicate with external parties as necessary
Appropriately assess risk when business decisions are made, demonstrating particular consideration for the firm's reputation and safeguarding Citigroup, its clients and assets, by driving compliance with applicable laws, rules and regulations, adhering to Policy, applying sound ethical judgment regarding personal behavior, conduct and business practices, and escalating, managing and reporting control issues with transparency, as well as effectively supervise the activity of others and create accountability with those who fail to maintain these standards

Fulltime

Director, Site Reliability Engineering

As our Director of Infrastructure platform, you will be a key driver of Doctolib...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

12+ years in software engineering, including 6+ years leading large (30+) distributed, international platform or infrastructure teams
Proven experience driving platform-as-a-product transformations and modularizing large monolithic architectures at scale
Demonstrated ability to architect, deliver, and operate secure, reliable, and scalable developer platforms in SaaS, multi-product, or regulated environments
Strong process orientation: experience implementing OKRs, robust monitoring/observability, and best-in-class incident management
Measurable impact on developer productivity, platform adoption, reliability, and cost-efficiency
Effective communicator and influencer, with the ability to align and inspire cross-functional stakeholders
Experience leading change and building high-performing, people-first engineering cultures
Fluent in English and comfortable in fast-paced, international environments

Job Responsibility

Lead and scale a high-performing infrastructure organization of 30+ engineers across Infrastructure, Automation, SRE, and Database teams, while maintaining strong engagement and fostering a culture of excellence and ownership
Own the infrastructure platform strategy and roadmap that enables Doctolib's modularization journey, delivers on company OKRs, and ensures predictable execution across all infrastructure and automation initiatives
Champion platform-as-a-product by building self-service capabilities (infrastructure provisioning, CI/CD, observability, database management) that transform developer experience and unlock team autonomy across the engineering organization
Be the guardian of quality and reliability by establishing world-class incident management, driving measurable improvements in availability and performance, and ensuring infrastructure components operate at the highest standards of security and resilience
Accelerate engineering velocity by reducing platform friction, enabling faster modularization, and leveraging AI-augmented development tools to multiply productivity across feature teams
Drive the infrastructure transformation from monolith-supporting infrastructure to a modular, multi-service platform architecture - enabling international expansion, product velocity, and operational excellence at scale
Act as a senior technical leader within the Platform organization and broader Tech leadership team, bringing strong technical opinions and challenging architectural decisions while clearly articulating how infrastructure investments contribute to company strategy and business outcomes

What we offer

Free comprehensive health insurance for you and your children
Parent Care Program: receive additional leave on top of the legal parental leave
Free mental health and coaching services through our partner Moka.care
For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
Work from abroad for up to 10 days per year thanks to our flexibility days policy
Work Council subsidy to refund part of sport club membership or creative class
Up to 14 days of RTT
Lunch voucher with Swile card

Fulltime

Manager of Site Reliability Engineering (SRE)

The Manager of Site Reliability Engineering leads and develops a team of SRE pra...

Location

United States, Birmingham

Salary:

Not provided

Genuine Parts Company

Expiration Date

Until further notice

Requirements

Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role or an equivalent combination
Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent)
Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale
Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD
In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery
Championing DevOps practices and embedding reliability early in the SDLC
Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability
Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation
Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture
Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents

Job Responsibility

Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence
Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance
Define and track key SRE metrics such as service uptime, incident response and resolution times
Drive automation efforts including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning to increase deployment velocity while reducing manual toil
Own and continuously improve observability practices including system monitoring, logging, alerting, and diagnostics to ensure rapid issue detection and resolution
Participate in incident response processes including incident management, root cause analysis, post-mortems, and continuous improvement to enhance system resilience
Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC)
Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations
Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability
Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends

What we offer

Comprehensive benefit plans and programs designed to support your health and wellness, provide income protection and build financial security for your retirement

Fulltime

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
Strong proficiency in Infrastructure as Code (IaC) using Terraform
Solid understanding of cloud platforms including AWS, GCP, or Azure
Experience with automation/configuration management tools like Ansible, Chef, or Puppet
Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
Experience managing Kubernetes and containerized environments (Docker, Helm)
Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
Excellent leadership, communication, and collaboration skills

Job Responsibility

Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Fulltime

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...

Location

Canada, Mississauga

Salary:

120800.00 - 170800.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

6–10 years of relevant experience in a hands‑on technical role
Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
Experience working with senior stakeholders or technology partners
Demonstrated experience supporting IT service improvements or platform stability initiatives
Strong communication and presentation skills, with the ability to convey technical concepts clearly
Experience supporting or contributing to technical roadmaps or operational workstreams
Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
Ability to collaborate with cross‑functional support teams and technology groups
Strong organizational and workload‑planning skills
Consistently demonstrates clear and concise written and verbal communication skills

Job Responsibility

Demonstrates a strong understanding of how application support contributes to the overall technology function and organizational objectives
Assist with vendor relationship management, including coordination with offshore managed services
Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
Partner with development teams to guide improvements in application stability and supportability
Contribute to frameworks for managing capacity, throughput, and latency
Assist in defining and implementing application onboarding guidelines and standards
Support team members by fostering a collaborative environment and encouraging skill development
Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
Participate in business review meetings to help align technology tools and strategies with business requirements
Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program

Fulltime

Sr Associate Site Reliability Engineering

Workday is looking for a highly skilled SRE with a focus on Open-Source database...

Location

Australia, North Sydney

Salary:

Not provided

Workday

Expiration Date

Until further notice

Requirements

4+ years of experience managing databases for enterprise cloud applications at scale
3+ years of working in MySQL and/or PostgreSQL database environments in Private and Public Cloud (AWS and/or GCP)
Expertise with Python/Go
Experience with infrastructure automation (Terraform, Ansible, etc.), CI/CD pipelines (Git, Jenkins, Argo, etc.), and configuration management tools (Ansible, Chef, etc.)
Experience working with private and public clouds (IaaS, AWS, etc.) and capacity management principles
Working knowledge of technologies like Kubernetes and Docker
Great teammate with excellent interpersonal skills as well as the ability to prioritize multiple tasks in a fast-paced environment
Available for on-call support on a rotating basis
BS/MS or equivalent experience in Computer Science or a related technical field

Job Responsibility

Ensuring all of Workday's data-related needs are met with high performance and scale, while providing the high availability that our customers expect from Workday
Ensuring seamless operation of 1000s of production and non-production databases across multiple data centers, public clouds and geographies

Fulltime

Director, Site Reliability Engineering

We are seeking a Director of Site Reliability Engineering to lead a global organ...

Location

Finland, Helsinki

Salary:

Not provided

Aiven Deutschland GmbH

Expiration Date

Until further notice

Requirements

Proven experience leading and scaling global SRE or infrastructure organizations through managers, ideally across multiple regions and time zones
Strong track record of defining and executing reliability strategy at scale, including ownership of SLIs/SLOs, incident management frameworks, and operational excellence programs
Demonstrated ability to build, develop, and mentor senior leaders, creating high-performing, inclusive teams and strong leadership pipelines
Experience operating in a 24/7/365 production environment, with deep understanding of follow-the-sun models, on-call design, and large-scale incident response
Ability to partner cross-functionally at the executive level (Engineering, Product, Support) to influence architecture, prioritization, and long-term platform investments
Strong data-driven leadership approach, with experience defining SLI/SLOs and using metrics to drive prioritization, accountability, and continuous improvement
Solid technical foundation in distributed systems, cloud infrastructure, and automation, with the ability to engage credibly with senior engineers and influence technical direction
Experience driving large-scale change and organizational design, including scaling teams, evolving operating models, and improving efficiency and reliability at company level

Job Responsibility

Define and drive global SRE operating strategy in partnership with regional SRE leaders across EMEA, AMER and APAC, ensuring alignment on reliability goals, operating models, and execution across a 24/7/365 follow-the-sun organization
Build and lead a multi-regional SRE organization through managers, developing leadership capability, mentoring teams, and ensuring consistent performance, culture, and delivery across geographies
Set the vision and roadmap for reliability engineering, enabling teams to deliver high-impact tools, automation, and process initiatives that improve platform resilience, scalability, and efficiency
Own global incident management strategy and operating model, including on-call design, coverage, and escalation frameworks, ensuring seamless coordination and high availability across regions
Establish a metrics-driven operating cadence, defining KPIs, SLIs, SLOs, and error budgets, driving data-informed prioritization, and embedding operational rigor and continuous improvement across the SRE organization

What we offer

Participate in Aiven’s equity plan
Balance work and life with our hybrid work policy
Choose the equipment you need to set yourself up for success
Use your Professional Development Plan budget for learning opportunities
Receive holistic wellbeing support through our global Employee Assistance Program
Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
Enjoy country-specific benefits for our global cast

Fulltime

Site Reliability Engineering Specialist

The Site Reliability Engineering Specialist independently executes activities th...

Location

India, Bengaluru

Salary:

Not provided

Plusnet

Expiration Date

Until further notice

Requirements

A degree in IT, Maths or Science
A deep understanding of full stack monitoring solutions such as Dynatrace
Strong proficiency in one or more programming languages (e.g. Java, Python)
Experience with cloud platforms (AWS, Azure, or GCP)
Solid understanding of software architecture, design patterns, and microservices
Familiarity with CI/CD tools and DevOps practices
High levels of quality presentation and reporting capabilities
Resilience to ensure support teams are engaged 24x7x365
Ability to adapt to latest industry trends
CI/CD/CT Pipeline management

Job Responsibility

Executes the implementation of new software development life cycle automation tools, frameworks, and code pipelines
Coordinates a diverse team and creates the initial test schedule
Executes the implementation of automation technologies
Proactively identifies and manages risk
Leads scale testing to measure, tune and optimise system performance
Executes metric/monitoring analysis
Designs, analyses, develops and troubleshoots highly distributed large-scale production systems
Executes approaches that scale systems sustainably
Writes and delivers infrastructure as code software
Implements robust monitoring and alerting systems and performs root cause analysis

Fulltime
