Senior Site Reliability Engineer Job at Optimizely (Hanoi)

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...

Location

United Kingdom; United States; Canada

Salary:

Not provided

Mozilla

Expiration Date

Until further notice

Requirements

7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
Excellent async written communication skills; comfortable working with a geographically distributed team
Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes

Job Responsibility

Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
Contribute to runbooks, architecture documentation, and team processes

What we offer

Fully remote work & schedule flexibility
Company-provided laptop
Annual bonus program
Monthly remote work stipend
Annual professional development stipend
Industry conferences
Company all-hands and team gatherings
24 days PTO per year (prorated)
Birthday
Year-end company shutdown

Fulltime

Senior Site Reliability Engineer

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting language used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience with distributed caching systems, including their underlying algorithms and how to optimize their performance
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work both independently and as an effective part of a globally distributed team working across multiple time zones
Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement by automating the installation, configuration and maintenance of services on our platform
Working closely with product teams to help them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure.
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength

Fulltime

Senior Site Reliability Engineer

At bsport, the Senior Site Reliability Engineer is a role for someone who doesn’...

Location

Spain; France, Barcelona; Paris

Salary:

Not provided

Bsport

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, Platform Engineering, Infrastructure or Backend Engineering
Strong experience with cloud infrastructure (AWS preferred), Kubernetes and CI/CD
Experience building or maintaining high-availability, scalable systems
Solid Python experience (bonus points for Django)
Experience working with SQL databases, ideally PostgreSQL
A proactive mindset: you enjoy taking ownership and solving complex technical challenges
Strong communication skills and fluency in English

Job Responsibility

Scale infrastructure and design resilient systems supporting international growth
Improve deployment speed, CI/CD pipelines and developer experience
Shape platform architecture through modularisation and scalable deployment strategies
Enhance observability, reliability and incident response capabilities
Influence engineering practices and collaborate across teams to improve how we build and ship

What we offer

Competitive salary packages based on your experience and role
Hybrid model with 3 days in the office per week
Work from anywhere: up to 15 days of remote work from abroad each year
Exclusive fitness perks: discounted access to Wellhub for Spain and HelloCSE membership for France
Private health insurance and flexible remuneration for Spain
Diverse, fun-loving team: multicultural colleagues, after-work events, team-building & more

Fulltime

Senior Site Reliability Engineer

Embark on a transformative journey as a Senior Site Reliability Engineer - AVP. ...

Location

United States, Whippany

Salary:

120000.00 - 175000.00 USD / Year

Barclays

Expiration Date

Until further notice

Requirements

Considerable programming expertise in languages such as Python, Java, and others
Practical experience with Infrastructure as Code (IaC) tools, including Ansible, Chef, and Terraform
Validated experience with observability and monitoring platforms such as Observe, Elastic, InfluxDB, and Grafana
Solid understanding of containerization technologies and Unix/Linux environments
Demonstrates a Site Reliability Engineering (SRE) mindset, with good analytical skills, ownership, and a forward-thinking approach to problem-solving

Job Responsibility

Build and maintain infrastructure platforms and products that support applications and data systems
Ensure the reliability, availability, and scalability of the systems, platforms, and technology
Development, delivery, and maintenance of high-quality infrastructure solutions
Monitoring of IT infrastructure and system performance to measure, identify, address, and resolve any potential issues, vulnerabilities, or outages
Development and implementation of automated tasks and processes to improve efficiency and reduce manual intervention
Implementation of a secure configuration and measures to protect infrastructure against cyber-attacks, vulnerabilities, and other security threats
Cross-functional collaboration with product managers, architects, and other engineers to define IT Infrastructure requirements
Stay informed of industry technology trends and innovations

What we offer

medical, dental and vision coverage
401(k)
life insurance
other paid leave for qualifying circumstances

Fulltime

Senior Site Reliability Engineer

Are you interested in working on cutting-edge cloud security products? Would you ...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance
Ability to meet Microsoft, customer and/or government security screening requirements is required pre-offer and post-hire for this role
Failure to maintain or obtain the appropriate clearance and/or customer screening requirements may result in employment action up to and including termination
This position requires successful verification of the stated security clearance to meet federal government customer requirements
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Ensure 24x7 Service Reliability: Act as a Designated Responsible Individual (DRI) in an on-call rotation, leading incident response and resolution to maintain uptime and performance for Microsoft's most critical services
Support and Automate Deployments: Execute and improve manual operations and deployments for our products, while designing automation to scale and streamline those processes across environments
Build Scalable Systems: Develop automation for monitoring, alerting, debugging, and deployment to reduce manual effort and accelerate safe, reliable delivery
Drive Compliance and Security: Ensure systems meet Microsoft's standards for security, privacy, and accessibility, especially when onboarding new technologies
Lead Post-Incident Learning: Conduct postmortems, share insights, and implement solutions that prevent recurrence—fostering a culture of learning and continuous improvement
Collaborate Across Teams: Partner with engineering and product teams to align reliability goals with customer needs and deliver seamless user experiences
Stay Ahead Technically: Continuously invest in your technical growth to improve system availability, observability, and performance at scale
Embody our company's Culture and Values

Fulltime

Senior Site Reliability Engineer

Do you want to be at the heart of cloud computing? The Compute team is at the co...

Location

United States, Reston

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Verification of U.S. citizenship due to citizenship-based legal restrictions

Job Responsibility

Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of service fabric services while also driving consistency in monitoring and operations at scale
Drives development of design documents for a product, application, service, or platform
Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI)
Leverages subject-matter expertise of product features and partners with appropriate stakeholders to drive a workgroup's project plans, release plans, and work items
Take full ownership of assigned services, actively contributing to their enhancement across all cloud environments
Identify opportunities for automation and optimization within the cloud to better support customers

What we offer

Certain roles may be eligible for benefits and other compensation

Fulltime

Senior Site Reliability Engineer

Doctolib is looking for a Senior Site Reliability Engineer to keep Doctolib prod...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

Have strong hands-on experience (6+ years) on a production platform, ideally at scale
Have proven experience with cloud platforms such as AWS, Azure or Google Cloud
Have proven experience with datastores such as PostgreSQL and/or Kafka and/or Couchbase
Have a solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Have proficiency in at least one programming language (Ruby, Python, Go, Java, etc.) and an understanding of infrastructure as code principles
Are fluent in English

Job Responsibility

Design, build and maintain core infrastructure databases that allow Doctolib to scale to support hundreds of thousands of concurrent users
Automate deployment, scaling, and maintenance of databases to enhance system reliability and operational efficiency
Implement and improve monitoring, alerting, and incident response processes to identify and address potential issues before they impact both practitioners and patients
Provide documentation and tooling to empower the feature teams in their use of their databases, while ensuring their reliability
Mitigate production database issues during working hours when the issue cannot be fixed by the responsible feature team
Research and evaluate new technologies, tools, and best practices to continuously improve the reliability and availability of our systems and processes

What we offer

Free Health Insurance for you
Up to 14 days of RTT
A flexible workplace policy offering both hybrid and office-based modes
Flexibility days allowing you to work in EU countries and the UK for 10 days per year
Wellbeing program with free mental health and coaching through moka.care
Special support package for caregivers and workers with disabilities
Lunch voucher with Swile card
Work Council subsidy for sport club membership or creative activities
Bicycle subsidy
Public transportation reimbursement

Fulltime

Senior Site Reliability Engineer

Join us as a "Project Manager" at Barclays, where you'll spearhead the evolution...

Location

India, Pune; Bengaluru

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Proven experience managing business-critical projects of medium to large complexity and cross-functional teams spanning multiple regions and functions, with a project management qualification or accreditation (e.g. Prince 2, PMI, APM, Agile)
Excellent written and verbal communication skills, including presentation to Senior level stakeholders
Good IT skills, including proficiency in Microsoft PowerPoint, Word, and Project

Job Responsibility

Manage a single project with specific, defined objectives, deadlines, and deliverables
Operate more tactically, focusing on day-to-day management of resources, schedules, and deliverables for their individual project
Work with a shorter, more defined timeframe as projects have a set beginning and end
Primarily manage stakeholders related to their specific project, ensuring communication and expectations are clear for the project’s deliverables
Focus on risks and issues specific to their project and work to mitigate them within the project’s scope
Manage resources for their individual project, ensuring that the project team has the necessary skills, tools, and time to complete the work
Focus on managing the budget of their specific project, ensuring it is completed within the financial constraints
Measure success based on the timely completion of project deliverables within scope, time, and budget
Manage changes that directly impact their specific project, including scope changes, timelines, or resource allocation adjustments

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime
