This team, part of Engineering under the Platform Excellence pillar, combines unwavering attention to detail with a deep understanding of what platform-wide monitoring means for all merchants. In this role, you will be on call to monitor platform performance, coordinate and command incidents, communicate with our customers, work on monitoring frameworks, and provide feedback to product engineering teams to improve the reliability of the platform. You will start and lead initiatives across our platform offerings, prioritizing merchant impact, to detect issues proactively, inform merchants quickly, and increase the reliability of our platform.
Job Responsibilities
Participate in 24/7 on-call monitoring: observe platform and merchant performance, detect issues proactively, and mitigate risks in partnership with Engineering teams
Coordinate the mitigation, recovery, and resolution of high-impact incidents, ensuring a rapid and effective response across teams
Represent the customer perspective during incidents, maintaining a strong customer-centric approach
Communicate with merchants in real time during an incident, presenting the most accurate and up-to-date information to keep them informed
Escalate critical incidents when needed and provide structured communication to senior management
Go beyond reactive incident response by analyzing incident trends to identify recurring issues and systemic weaknesses, and partner with engineering and product teams to advocate for long-term fixes
Work together with Operations, Product, and Engineering teams to integrate, grow, and continuously improve monitoring strategy and increase reliability
Investigate alerts and provide feedback to engineering teams to build effective logging and alerts across the platform architecture
Mitigate the risk of merchant impact by acting on alerts in partnership with Engineering teams, and contribute to the monitoring playbook by documenting learnings
Improve operations by leading and project-managing initiatives, including the development of automation tools for effective monitoring
Focus on prioritizing, automating, and scaling every aspect of detection capabilities
Requirements
At least 5 years of experience with incident management, problem management, incident client communication, and platform monitoring operations
Experience with problem management practices: identifying trends across incidents, conducting root cause investigations, and driving preventive action
Solid communication skills and the ability to develop strong working relationships throughout the organization; able to explain technical situations clearly and concisely to a diverse audience through data-visualization dashboards and written documents
Willing to participate in the on-call rotation and work in a fast-paced, dynamic environment
Experience with monitoring and logging tools such as Prometheus, Grafana, and the ELK Stack
Experience with observability platforms such as Datadog, Dynatrace, and Splunk
Excellent analytical and problem-solving skills, with the ability to analyze complex systems and pinpoint the root cause of issues
Ability to thrive in an environment where collaboration is crucial and where a global approach is key to the successful implementation of processes and projects
Passion for defining and standardizing processes to drive strategic improvement, and the ability to explain complex technical concepts with ease to non-technical audiences
Natural ability to handle complex situations and multiple responsibilities simultaneously
Strong team player who thrives in a dynamic environment
Work schedule: Shifts run from 9:00 AM to 6:00 PM, with a 6-day workweek at least twice a month (Sunday–Friday or Monday–Saturday)