We’re passionate about building software that solves problems. We count on our site reliability engineers (SREs) to give users the rich feature set, high availability, and stellar performance they need to pursue their missions. As we expand customer deployments, we’re seeking an experienced SRE to deliver insights from massive-scale data in real time. Specifically, we’re searching for someone who has fresh ideas and a unique viewpoint, and who enjoys collaborating with a cross-functional team to develop real-world solutions and positive user experiences for every interaction.
Job Responsibilities:
Monitoring and reporting on application behavior analytics; conducting smart triage by identifying, diagnosing, and coordinating resolution of performance problems before they impact end users; and participating in rapid root cause diagnosis of problems occurring within the application and infrastructure
Identifying the functional domain in which problems reside (server utilization, network saturation, application tuning)
Participating in all Major Incident Management and Root Cause Analysis calls and providing expert troubleshooting support as needed
Understanding troubleshooting, incidents, and problems; working to resolve issues promptly and to determine the fault or underlying issue; working with both customer and vendor personnel
Monitoring high-value, business-centric transactions and managing response actions
Maintaining accurate documentation for the assigned workspace and procedures, and updating procedures covering, but not limited to, the software and hardware layers
Understanding and using de-escalation techniques when working with difficult customers
Monitoring application infrastructure and the network with tools such as Splunk, AppDynamics, and Dynatrace
Proactively detecting, reporting, logging, and responding to all network performance and availability problems in each part of the application
Following incident, problem, and change management processes for the supported technology infrastructure, and reviewing system requirements and application dependencies to determine monitoring configuration
Contributing to the creation of documentation
Requirements:
Master’s degree in Computer Science or related discipline
Ability to program (structured and OOP) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript
5 to 6 years of experience in Production Support
Minimum of 6 years of professional experience in SRE production support
Experience with NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN)
Must provide 24×7 support for the production servers on a rotating basis