Glean is seeking a Site Reliability Engineering Lead to foster a culture of engineering excellence, drive technical strategy, and develop a high-performing, collaborative team. Your role is pivotal in ensuring our services meet stringent Service Level Objectives (SLOs) and in building resilient, automated production environments in the cloud. You’ll lead a team and be responsible for products globally, providing technical leadership to key projects and empowering your team to do the same. Much of our software development focuses on building infrastructure to scale our operations in a hybrid cloud environment and eliminating manual work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale and fast growth that are unique to Glean, while using your expertise in coding, algorithms, problem solving, and SRE practices. We keep Glean applications up and running, ensuring our customers have the best and most reliable experience possible.
Job Responsibilities:
Foster a culture of engineering excellence, drive technical strategy, and develop a high-performing, collaborative team
Ensure services meet stringent Service Level Objectives (SLOs)
Build resilient, automated production environments in the cloud
Lead a team and be responsible for products globally
Provide technical leadership to key projects
Manage the complex challenges of scale and fast growth
Keep Glean applications up and running
Drive technical excellence and foster a culture of reliability across engineering teams
Set best practices for incident management, performance optimization, and automation
Influence best practices, drive cross-team collaborations, and contribute to the execution of key objectives
Establish strong technical credibility, shaping architectural decisions and ensuring the delivery of high-quality, reliable systems
Implement and maintain resilient cloud architectures, monitor system performance, and proactively identify and resolve potential bottlenecks or points of failure
Participate in the primary on-call rotation
Cultivate technical curiosity, a growth mindset, and a blameless postmortem culture
Continuously optimize the on-call process for sustainability and efficiency
Develop and maintain automation scripts, tools, and processes to streamline system deployment, monitoring, and management tasks
Optimize cloud infrastructure and applications for performance, scalability, and cost-effectiveness
Collaborate with security engineers to implement best practices and ensure compliance with security standards and policies
Design and configure advanced monitoring systems to gain insights into system behavior, set up alerts, and respond proactively to potential issues
Create and maintain comprehensive dashboards and playbooks for production on-call
Engage actively in the entire software development lifecycle
Participate in system design reviews and provide valuable SRE insights during launch reviews
Requirements:
Bachelor’s degree in Computer Science, a related field, or equivalent practical experience
8+ years of experience in a senior-level Site Reliability Engineering or similar role, particularly in managing cloud-based services and infrastructure
5+ years of experience with software development in one or more programming languages
3+ years of experience managing people or teams, leading projects, and designing, analyzing, and troubleshooting distributed systems running in the cloud
Strong knowledge of cloud platforms such as Google Cloud Platform, AWS, or Azure
Practical experience with containerization technologies, including Docker and Kubernetes
Familiarity with infrastructure-as-code tools such as Terraform is essential
Solid understanding of networking, security principles, and best SRE and security practices
Proficiency in using monitoring and alerting tools to detect and respond to potential issues effectively