Site Reliability Engineer Job at Skyhigh Security (San Jose)

Job Description

The Site Reliability Engineer (SRE) at Skyhigh Security is responsible for monitoring, maintaining, and troubleshooting operational issues in a high-availability production environment. The SRE also acts as a bridge between the Operations, Engineering, and Product Management teams, and represents the customer point of view to keep driving improvements to our products and uptime. SREs manage and improve the operational aspects of systems, such as monitoring, alerting, incident response, and vendor interactions.

Job Responsibilities

Perform Incident Management and Change Management to maintain the continuous availability of all Cloud Infrastructure services
Ensure all SRE and operating procedures are maintained and executed
Maintain a 24×7 production environment with a high level of service availability, perform quality reviews, and manage operational issues
Perform root cause analysis for major incidents and drive the process by involving required stakeholders
Perform problem management by analyzing metrics, alarms, and dashboards to troubleshoot problem areas, and report issues to assist in performance tuning and fault finding
Implement proactive monitoring, alerting, trend analysis, and self-healing solutions
Explore new technologies, features, and tools to improve the platform, and automate operational tasks using Bash, Python, or another programming language
Manage and maintain runbooks and standard operating procedures
Manage, coordinate, and document all types of maintenance activities and outages
Perform patching and upgrades for vulnerability management
Work closely with teams to develop new ideas into internal tools
Understand the existing architecture and work with various Engineering teams to develop and execute strategies to provide a high-quality production service
Work a flexible schedule in a 24×7 environment with rotational shifts

Requirements

Bachelor’s degree in computer science, electrical engineering or a related area, with 7+ years of SRE experience in a large enterprise organization
System admin experience on Linux environments
Experience with end-to-end monitoring setup for infra and applications
Experience with Prometheus, Grafana, ELK, OpenSearch, CloudWatch, PagerDuty, and other monitoring tools
Solid experience with Cloud Technologies such as AWS and OCI
Good experience with containerized workloads and tools such as Kubernetes
Network knowledge (TCP/IP, UDP, DNS, load balancing) and prior network administration experience are required
Experience with BGP, iBGP, NAT, proxies, and cross-connects
Experience with L2/L3 switching, knowledge of Juniper and Cisco routing devices
Experience understanding and managing web servers (Apache, Tomcat, Nginx)
Ability to script or program in one or more high-level languages, such as Python or Go
Experience with configuration management tools such as Salt, Puppet, or Ansible
Experience with source control tools such as GitHub and SVN
Experience with deployment tools such as Jenkins and Harness
Experience with SQL and NoSQL databases like Redis, Crate, Elasticsearch
Experience performing root cause analysis and writing Root Cause Analysis documents
Strong communication and analytical/problem-solving skills
Systematic approach and the ability to drive problems to resolution
Only US Citizens are eligible

Nice to have

Experience with or knowledge of GCP and Azure
Experience in the security domain is an added advantage
Experience with open-source technologies such as Kafka, Hadoop, HBase, ZooKeeper, and Oozie is an added advantage

What we offer

Retirement Plans
Medical, Dental and Vision Coverage
Paid Time Off
Paid Parental Leave
Support for Community Involvement
