SRE Manager Job at Randstad (Tokyo)

Manager, Site Reliability Engineering and Incident Management

Planet DDS is seeking a Manager, Site Reliability Engineering and Incident Manag...

Location

United States, Atlanta

Salary:

118000.00 - 160000.00 USD / Year

Planet DDS

Expiration Date

Until further notice

Requirements

7+ years in SRE, DevOps, or Infrastructure roles
3+ years in Incident Management leadership
Deep understanding of reliability, scalability, and performance optimization
Multi-cloud expertise in AWS, Azure, or GCP
Understanding of DNS, load balancing, firewalls, and compliance frameworks
Knowledge of fundamental cloud security (e.g., identity and access management, firewalls)
Deep understanding of logging, monitoring, and security best practices
Strong collaboration and communication skills
Bachelor’s Degree in a relevant major or equivalent years of experience is a plus

Job Responsibility

Lead and mentor a team of SREs and Incident Managers
Foster a culture of reliability, accountability, and continuous improvement
Collaborate with engineering teams to design resilient platform architectures
Oversee the incident response process for outages and service disruptions
Ensure timely detection, escalation, and resolution of incidents
Drive post-incident reviews (PIRs) and root cause analysis
Implement improvements based on lessons learned to prevent recurrence
Mature and enforce best practices for incident response and runbooks
Automate operational tasks to reduce toil and improve efficiency
Maintain observability tools (monitoring, alerting, logging)

Full-time

Site Reliability Engineering Manager

The Wikimedia Foundation is looking for an Engineering Manager to join our SRE t...

Location

United States of America

Salary:

132439.00 - 208378.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Prior experience managing teams
Prior hands-on experience with software or reliability engineering (within the last 3 years preferred)
Ability to analyze complex systems, troubleshoot issues, and devise effective solutions under pressure
Proficiency in project management methodologies to effectively plan, execute, and track new and existing initiatives
Strong understanding of cloud computing, networking, Linux systems administration, containerization (e.g., Docker, Kubernetes), and infrastructure as code (e.g., Terraform, Ansible) to be able to provide technical support to the team
Aptitude for automation and streamlining of tasks
Communicate effectively in both spoken and written English
Ability to work both independently and as an effective part of a globally distributed team
Ability to travel several times a year for occasional in-person meetings
B.S. or M.S. in Computer Science or the equivalent in related work experience

Job Responsibility

Managing one to two globally distributed teams within Wikimedia’s Site Reliability Engineering organization
Providing guidance, mentorship, and support to ensure the team's effectiveness and growth
Working with team members to set individual performance goals, and supporting them in meeting and evolving their goals and career path
Recruiting, hiring, and helping onboard new team members
Triaging incoming workload, maintaining focus on priorities, and setting realistic expectations for both peers and team members
Coordinating and communicating with other members of the Wikimedia product & engineering teams on relevant projects, executing complex projects and contributing to the organizational strategy
Continuously developing the roadmap of the team in alignment with other SRE and Product & Technology teams, and helping to draft and execute the team’s annual and quarterly plans
Project managing new and existing initiatives
Leading the definition, refinement, and execution of the processes through which the team manages and performs work
Leading incident response, diagnosis, and follow-up on system alerts and outages across Wikimedia’s production infrastructure

Full-time

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...

Location

Canada; United States

Salary:

195000.00 - 285000.00 USD / Year

Apollo.io

Expiration Date

Until further notice

Requirements

5+ years of hands-on software or infrastructure engineering experience
2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, New Relic) and performance tuning
Strong grounding in networking, security, and reliability principles
Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale

Job Responsibility

Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
Run effective 1:1s, career development conversations, and quarterly performance reviews
Support recruiting efforts to attract top engineering talent across time zones

What we offer

Equity
Company bonus or sales commissions/bonuses
401(k) plan
At least 10 paid holidays per year
Flex PTO
Parental leave
Employee assistance program and wellbeing benefits
Global travel coverage
Life/AD&D/STD/LTD insurance
FSA/HSA and medical, dental, and vision benefits

Full-time

Engineering Manager, Platform

We are looking for an engineering manager to help us scale, improve organisation...

Location

Salary:

Not provided

Airalo

Expiration Date

Until further notice

Requirements

Minimum 5 years of hands-on technical experience in cloud-native environments, specifically with distributed systems and platform development
Minimum 2 years of experience in directly leading and managing platform, DevOps, or SRE teams
Expertise in designing, building, refactoring, and operating distributed systems and cloud infrastructure at scale
Expertise in event-driven architecture and messaging systems (e.g., Kafka, SQS, RabbitMQ, Pub/Sub)
Strong knowledge of both relational (SQL) and NoSQL database technologies and their operational considerations in cloud environments
Extensive hands-on experience and deep understanding of core AWS services (e.g., EC2, EKS, Lambda, SQS, Security Groups, IAM, Aurora, DynamoDB, S3, RDS, CloudWatch, CloudTrail)
Proven expertise with Infrastructure as Code (e.g., Terraform, CloudFormation)
Strong experience with containerisation technologies (Docker) and orchestration platforms (Kubernetes), including Helm and related ecosystem tools
Extensive experience with modern monitoring, logging, and observability platforms (e.g., Datadog, Prometheus, Grafana, ELK Stack, Jaeger/OpenTelemetry)
Strong familiarity with DevSecOps practices and the implementation of automated security tooling throughout the CI/CD pipeline (e.g., SAST, DAST, secret management, vulnerability scanning)

Job Responsibility

Lead the strategy, architecture, and execution of our core platform technologies
Extend and improve engineering best practices across the organisation
Maintain and improve a collaborative environment, acting as a key bridge between application development teams and the platform team
Motivate and instil a strong sense of ownership in your team for the end-to-end lifecycle, stability, scalability, and performance of our core platform services
Mentor and guide the professional and technical development of your team members
Ensure that the team delivers high-quality products and solutions by following best practices
Build and scale teams that are collaborative, inclusive, and respectful of each other
Provide continuous, actionable feedback, address underperformance proactively, and recognise the individual strengths and contributions of your team members
Work closely with engineers and collaborate with key stakeholders to define and maintain a prioritised backlog and establish clear short-term and long-term goals for the platform roadmap
Own your team’s deliverables and ensure the continuous delivery of scalable, highly-available, and cost-efficient platform services and infrastructure

What we offer

Health Insurance
Work-from-anywhere stipend
Annual wellness & learning credits
Annual all-expenses-paid company retreat in a gorgeous destination

Full-time

FX Applications Support Senior Analyst

As an OpsTech Application Support Analyst, the candidate will play a pivotal rol...

Location

Australia, Sydney

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

5-8 years of experience in an Application Support role
Experience installing, configuring or supporting business applications
Experience with some programming languages and willingness/ability to learn
Advanced execution capabilities and ability to adjust quickly to changes and re-prioritization
Effective written and verbal communication, including the ability to explain technical issues in simple terms that non-IT staff can understand
Demonstrated analytical skills
Issue tracking and reporting using tools
Knowledge/experience of problem management tools
Good all-round technical skills
Ability to share information effectively with other support team members and with other technology teams

Job Responsibility

Provide technical and business support for users of Citi Applications
Maintain application systems
Manage, maintain and support applications
Perform start of day checks, continuous monitoring, and regional handover
Develop and maintain technical support documentation
Maximize the potential of applications
Assess risk and impact of production issues and escalate
Ensure storage and archiving procedures are functioning correctly
Formulate and define scope and objectives for complex application enhancements
Prioritize bug fixes and support tooling requirements

What we offer

Rewarding work in a supportive environment
Clear opportunities for progression
Exciting company benefits

Full-time

Senior Site Reliability Manager

RUCKUS Networks is seeking an experienced Site Reliability Engineering (SRE) Man...

Location

United States, Sunnyvale

Salary:

135600.00 - 200000.00 USD / Year

CommScope

Expiration Date

Until further notice

Requirements

12+ years in Site Reliability Engineering (SRE), with 6+ years leading SRE, DevOps, or infrastructure teams
Proven experience mentoring engineering managers and developing leadership talent
Track record of transforming traditional operations or NOC teams into modern SRE organizations
Strong project management skills with Agile/Kanban experience and JIRA proficiency
Excellent communication skills, including executive-level presentations
Deep SRE expertise: incident management, on-call systems, monitoring, and reliability engineering
Infrastructure automation experience with Terraform, Kubernetes, Docker, and CI/CD pipelines
Cloud platform proficiency (GCP/AWS), including networking, security, and cost optimization
Monitoring and observability experience with Prometheus, Grafana, APM tools, and log aggregation
24/7 operations experience with global team coordination and escalation management

Job Responsibility

Lead and develop engineering managers and technical operations engineers across India and APAC time zones
Build a collaborative team culture that emphasizes knowledge sharing, automation, and operational excellence
Mentor engineering managers to strengthen leadership capabilities and technical expertise
Set clear performance expectations and provide ongoing coaching for growth
Partner cross-functionally with Product, Security, Development, and global operations teams
Own 24/7 operational stability for India/APAC, including incident response, escalation, and post-incident reviews
Drive comprehensive incident management: alert handling, outage response, and root cause analysis (RCA/CAR)
Transform traditional operations into modern SRE practices using SLOs, error budgets, and reliability engineering
Implement robust monitoring and alerting with APM tools, dashboards, and automation frameworks
Lead technical project delivery with clear timelines, resource planning, and stakeholder communication

What we offer

Medical, dental, and vision plans
Life and accidental death insurance
A 401(k) plan
Participation in the Company’s Incentive Plan
Eleven paid holidays in a full calendar year
Two weeks of paid vacation (prorated based on start date)
Other leave options

Full-time

Engineer 4, Software Development & Engineering

Make your mark at Comcast -- a Fortune 30 global media and technology company. B...

Location

India, Chennai

Salary:

Not provided

Comcast Corporation

Expiration Date

Until further notice

Requirements

7+ years of hands-on troubleshooting experience in a complex, large-scale enterprise application server environment
7+ years of technical experience supporting Operational Planning, Operations, Incident Management, Problem Management, and Change Management
Understanding of Site Reliability Engineering (SRE) principles, Incident Management and Crisis Management
Outstanding written and oral communication skills, with ability to articulate sophisticated emergent situations clearly to all levels of the organization
Solid understanding of application servers and applications services, load balancing and database technologies
Knowledge of the AWS cloud environment, including API Gateway, SNS/SQS, Lambda, CloudWatch, DynamoDB
Embraces challenges, displays strong creative flexibility
Previous senior operational experience
Working knowledge of AWS, specifically API Gateway, Lambda, CloudWatch, SNS, SQS, Elasticsearch, DynamoDB
Excellent hands-on scripting skills (Python preferred)

Job Responsibility

Responsible for planning and designing new software and web applications
Analyzes, tests and assists with the integration of new applications
Oversees the documentation of all development activity
Trains non-technical personnel
Assists with tracking performance metrics
Integrates knowledge of business and functional priorities
Acts as a key contributor in a complex and crucial environment
May lead teams or projects and shares expertise
Designs new software and web applications, supports applications under development and customizes current applications
Develops software update process for existing applications

What we offer

Paid Time off
Physical Wellbeing benefits
Financial Wellbeing benefits
Emotional Wellbeing benefits
Life Events + Family Support benefits

Full-time

Director, Service Reliability Engineering

As Director of SRE, you will lead the team responsible for accelerating and auto...

Location

United States, Bethesda

Salary:

125600.00 - 203700.00 USD / Year

Marriott Bonvoy

Expiration Date

Until further notice

Requirements

Undergraduate degree in computer science, software engineering, or a related field (or equivalent experience)
10+ years of experience in SRE, DevSecOps or IT operations
At least 5 years’ experience in a previous leadership role within SRE, DevSecOps or IT Operations
At least five years of experience in the following technologies: Presentation Management: HTML, CSS, JS, Backbone, Node JS, Android, iOS; Application Platforms: NGINX, Java, Akana, Play Framework, Tomcat, Docker, OpenShift; Application Data: PostgreSQL, Couchbase, Cassandra; Integration Services: Apache Kafka, Apache Spark, Akana; Analytics Platforms: Hadoop, dashDB, Cognos, Tableau; Security: ForgeRock, OpenID, OAuth, Ping Identity; Public Cloud: Azure, Google Cloud, AliCloud, Amazon Web Services; CI/CD: Harness
Experience with test automation
Working knowledge and proven track record of implementing disaster indifferent architecture
Experience with CDN and Akamai tools
Linux/Unix system administration experience
Proficient in scripting and programming languages (like Python, Go, Bash, Shell)
Hands on experience with infrastructure as code (like Terraform), container orchestration (like Kubernetes), and reliability automation

Job Responsibility

Define and execute Marriott’s SRE vision, aligning with business objectives and technology roadmaps
Build, mentor and lead a high-performing SRE team, fostering a culture of collaboration and innovation
Establish reliability, observability and automation goals to improve system uptime, performance and scalability
Partner with engineering, operations and security teams to drive best practices and continuous improvement
Implement reliability-focused engineering practices, including SLAs, SLOs/SLIs and error budgets
Design and maintain resilient, scalable and fault-tolerant architectures across cloud and hybrid environments
Develop strategies to proactively identify and mitigate risks to system performance and availability
Drive root cause analysis (RCA) and post-mortem processes to prevent recurring incidents
Champion automation in monitoring, deployment and incident resolution to reduce toil and enhance efficiency
Lead and optimize incident response processes, ensuring rapid detection, diagnosis, and resolution of system failures

What we offer

Bonus program
Comprehensive health care benefits
401(k) plan with up to 5% company match
Employee stock purchase plan at 15% discount
Accrued paid time off (including sick leave where applicable)
Life insurance
Group disability insurance
Travel discounts
Adoption assistance
Paid parental leave

Full-time

SRE Manager

Randstad

Location:
Japan, Tokyo

Category:
IT - Software Development

Contract Type:
Employment contract

Job Posted:
May 09, 2025
