We are currently seeking an experienced professional to join our team in the role of Associate Director, Service Management Specialist.
Job Responsibilities:
Lead the design, implementation, and enhancement of service monitoring systems to ensure services operate within agreed Service Level Objectives (SLOs) and enable rapid response to performance indicator breaches
Drive automation initiatives by identifying opportunities to replace manual tasks with software solutions, improving efficiency and reliability across systems
Perform in-depth system analysis and configuration management, and implement improvements to enhance system software performance, availability, scalability, and reliability
Oversee and approve deployment changes, ensuring adherence to best practices and minimizing change-related incidents that could impact the error budget
Collaborate with cross-functional teams, including software engineers, testers, and product managers, to ensure systems meet non-functional requirements such as performance, security, and availability
Develop and enforce best practices for incident management, root cause analysis, and post-mortem processes to improve system resilience
Mentor and guide junior Site Reliability Engineers (SREs), fostering a culture of continuous learning and operational excellence
Maintain and expand system documentation, including runbooks, architecture diagrams, and operational procedures, ensuring critical knowledge is accessible to the team
Lead capacity planning and disaster recovery strategies to ensure system readiness for growth and unexpected events
Stay updated on industry trends and emerging technologies, driving innovation and improvements in reliability engineering practices
Requirements:
Minimum 10 years of experience in production support, SRE, or DevOps roles, with a proven track record of managing and improving large-scale, mission-critical systems
Advanced programming and scripting skills (e.g., Java, Python, Go, SQL), including API development and backend systems
Extensive experience with containerization (Docker) and orchestration platforms (Kubernetes), including designing and managing large-scale deployments
Proficiency in monitoring and observability tools such as Splunk, CloudWatch, AppDynamics, Prometheus, or Grafana
Strong expertise in Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible, with experience in managing cloud-based infrastructure (AWS, Azure, or GCP)
Demonstrable experience in designing and implementing automation pipelines for CI/CD and operational tasks
Proven ability to lead cross-functional teams to resolve complex technical issues and drive system improvements
Strong understanding of security best practices, including vulnerability management and secure system design
Excellent written and verbal communication skills in both Mandarin and English, with the ability to communicate complex technical concepts to diverse audiences
Experience in mentoring and leading junior engineers, fostering a collaborative and high-performing team environment
Strong analytical and problem-solving skills, with a focus on delivering scalable and reliable solutions
What we offer:
Continuous professional development
Flexible working
Opportunities to grow within an inclusive and diverse environment