Security Reliability Engineering Lead

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

293000.00 - 385000.00 USD / Year

Job Description:

This is a new, bootstrap team focused on applying strong Site Reliability Engineering discipline to environments where uptime, safety, recoverability, and security are non-negotiable. The team replaces bespoke, one-off infrastructure with standardized infrastructure-as-code building blocks that compound reliability and operational leverage as OpenAI scales.

We are looking for a Security Reliability Engineering Lead to design, build, and operate reliable, secure, and scalable infrastructure that underpins identity, access, endpoint, and shared platform services across the company. In this role, you will own infrastructure and identity systems end to end, from foundational design and provisioning through policy enforcement, upgrades, recovery, and day-two operations. You will establish durable, production-grade platforms that remove operational friction, enforce security by default, and enable teams to move faster with confidence.

This role is well suited for a senior engineer who thrives in ambiguity, enjoys owning complex systems end to end, and raises the reliability and security bar by replacing fragile implementations with standardized, repeatable infrastructure.

Job Responsibility:

Set direction and establish strong foundations:
  • Define and evolve infrastructure patterns for on-prem and hybrid environments, including self-hosted platforms, vendor-supported systems, and lab environments
  • Establish standardized, production-grade deployment and operational models that replace bespoke implementations
  • Partner with IT, Security, Identity, and Network teams to ensure infrastructure meets reliability, security, and access requirements by design
  • Design and mature the production architecture for IAM-adjacent platforms such as Microsoft Entra using SRE principles
  • Establish common management rules and shared resources within Azure subscriptions to ensure consistent, policy-aligned operations

Build, operate, and scale reliably:
  • Own the full lifecycle of infrastructure systems, including deployment, upgrades, patching, recovery, and ongoing operations
  • Operate and harden shared infrastructure provisioned through Infra Terraform, ensuring repeatability, auditability, and safe change management
  • Design and implement infrastructure as code and configuration management to support shared services, identity-adjacent systems, and endpoint platforms using tools like Chef, Ansible, and Terraform
  • Build and operate monitoring, alerting, and incident response mechanisms to meet high availability and recoverability targets
  • Lead incident response and postmortems across infrastructure, identity-adjacent platforms, and fleet systems, driving durable fixes and shared learning
  • Build and operate containerized and platform services, including Kubernetes and Docker-based workloads, using DevOps practices that emphasize reliability, repeatability, and safe change management
  • Use Git-based workflows as the source of truth for infrastructure and policy changes, enabling review, auditability, and safe, reversible automation

Automate for leverage and safety:
  • Identify high-leverage automation opportunities that eliminate manual toil and reduce operational risk across infrastructure and access-related systems
  • Implement guardrails, safety mechanisms, and progressive rollout patterns for infrastructure and policy enforcement changes
  • Ensure automation is safe, observable, and resilient under failure conditions, particularly for shared services and high-blast-radius systems

Partner and lead through influence:
  • Work closely with Security, Identity, Network, Client Platform, and Platform Engineering teams to operate secure, policy-enforced infrastructure
  • Support execution and enforcement of access management policies and privileged access mechanisms owned by partner teams, with a focus on reliability and operability
  • Coach and elevate engineers and partner teams through design reviews, incidents, and operational improvements
  • Drive reliability improvements across teams, even without direct authority

Requirements:

  • 10 or more years of experience operating and architecting mission-critical infrastructure in high-reliability environments
  • Have led the design and maturation of complex on-prem, hybrid, or cloud-integrated systems, setting durable architectural patterns used by multiple teams
  • Apply Site Reliability Engineering principles at scale, using observability, automation, and incident learnings to materially reduce risk and operational toil
  • Operate comfortably in ambiguity, making sound architectural decisions under pressure while staying close to technical detail
  • Influence cross-functional partners across security, identity, network, and platform teams to land reliability improvements without direct authority

Nice to have:

  • Experience operating infrastructure for R&D or specialized labs, manufacturing, or other safety-critical environments where uptime and recoverability are essential
  • Hands-on experience with fleet, endpoint, or virtual desktop platforms such as FleetDM, Chef, or Azure Virtual Desktop
  • Experience partnering closely with identity or security engineering teams on hardened, policy-enforced infrastructure at scale

What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Security Reliability Engineering Lead

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Citi
Location:
Ireland, Dublin
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Solid SRE process experience
  • 5+ years of leading a high-performance, 24x7 DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, backup, networking, and infrastructure
  • Experience with Microsoft Windows and Windows Server operating systems
  • Experience tracking tickets and resolving them on time
  • Hands-on experience with ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation, and interpersonal communication skills
  • Ability to make complex technical matters easy to comprehend for non-technical people
Job Responsibility:
  • Taking end-to-end ownership of application support for production system issue resolution
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities and coordinating implementation across a large number of teams, including infrastructure, developer tools, and information security
  • Influencing a culture of Site Reliability Engineering, and engaging in training and mentoring to help develop other engineers with an SRE mindset
  • Providing the first line of after-deployment technical support at the L1 and L2 levels for applications and/or associated production systems, diagnostics, and network health monitoring
  • Coordinating and/or deploying hands-on fixes, patches, and software updates at the application level and, as appropriate, at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within the organization, along with observations from investigative and diagnostic assessments
  • Coordinating the investigation of repeated technical issues affecting user systems and seeing them through to resolution
  • Escalating, resolving, and tracking production incidents to closure while guiding the team
What we offer:
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
Employment Type:
Full-time

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Groupon
Location:
India, Bangalore
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 10+ years in systems engineering
  • At least 5 years in SRE or DevOps roles
  • Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • Proficiency in programming and scripting languages like Python, Go, and Bash
  • Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • Deep understanding of networking, DNS, load balancing, and security principles
  • Proven track record of managing high-availability systems in demanding environments
  • Exceptional analytical and problem-solving skills
Job Responsibility:
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • Mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • Guide architectural decisions that drive innovation and enhance system reliability
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • A collaborative and innovative work environment that values your expertise and contributions
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Lead Site Reliability Engineer

As a Lead Site Reliability Engineer (SRE), you will ensure the stability, perfor...
Corelight
Location:
United States
Salary:
184000.00 - 229000.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience building and operating FedRAMP environments or similarly regulated systems
  • Expertise in AWS services (e.g., EC2, S3, RDS, Lambda, ECS/EKS, Glue, EMR, Redshift, OpenSearch, VPC)
  • Deep understanding of the FedRAMP framework, controls, and compliance requirements
  • Proficiency in programming languages such as Python, Go, or Java
  • Experience with big data technologies (Hadoop, Spark, Kafka)
  • Strong skills in Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible
  • Knowledge of containerization and orchestration tools like Docker and Kubernetes
  • Experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI
  • Proven track record in building and scaling platforms with high availability, resilience, and strict SLO objectives
  • Strong experience with Unix/Linux systems and cloud providers, ideally AWS
Job Responsibility:
  • Collaborate with software engineering teams to ensure the reliability, performance, and security of the Federal region’s infrastructure
  • Design, implement, and manage FedRAMP-compliant infrastructure and systems
  • Establish continuous monitoring, logging, and auditing processes to ensure compliance with FedRAMP controls
  • Partner with security teams to conduct security assessments and implement necessary controls
  • Design and implement scalable infrastructure solutions that support multi-region growth
  • Drive automation efforts, enabling infrastructure and platforms to scale efficiently with a focus on compliance
  • Stay up-to-date on best practices, evolving security threats, and FedRAMP guidelines to maintain a strong security posture
  • Deploy and maintain cloud-native services in AWS that are resilient and elastic
  • Participate in 24x7 incident response and on-call rotations
  • Plan for capacity and work with teams to prepare for platform growth
What we offer:
  • Equity and additional benefits will also be awarded
Employment Type:
Full-time

Senior Security Engineer, Sailpoint Development Lead - IAM

We are seeking an experienced and motivated Sr. Engineer to lead the Sailpoint d...
Marriott Bonvoy
Location:
United States, Bethesda
Salary:
108300.00 - 176300.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in computer science, information systems, cybersecurity or a related field or equivalent experience/certification
  • 7+ years of progressive Information Technology/Information Security experience in engineering and development of IGA features & Application integration including at least 4 years of experience in SailPoint IIQ Implementation, Configuration, Customization, and deployment in an enterprise environment
  • 4+ years of experience in technologies such as Java, JavaScript, JSON, XML, Python, and REST development
  • 4+ years of experience in writing and troubleshooting rules, workflows, and custom connectors
  • 4+ years of experience developing and understanding requirements, design, implementation, integration, and testing
  • 2+ years’ experience working in agile methodologies
Job Responsibility:
  • Makes decisions on the architecture and design of software projects, validating that the system design meets scalability, reliability, and performance requirements
  • Provides technical direction, mentoring, and support to team members
  • Solves complex technical issues and functions as an escalation for the team in problem-solving
  • Leads code reviews to ensure high-quality, maintainable, and efficient code
  • Establishes and ensures compliance with coding standards
  • Exercises strong interpersonal/relationship/communication skills, with the ability to convey technical concepts to non-technical stakeholders
  • Contributes to the codebase, particularly for critical or complex components
  • Participates in project planning, including estimation of tasks, defining milestones, and ensuring realistic timelines
  • Assigns tasks to team members based on their skills and project requirements
  • Monitors progress and adjusts plans as necessary
What we offer:
  • Bonus program
  • Comprehensive health care benefits
  • 401(k) plan with up to 5% company match
  • Employee stock purchase plan at 15% discount
  • Accrued paid time off (including sick leave where applicable)
  • Life insurance
  • Group disability insurance
  • Travel discounts
  • Adoption assistance
  • Paid parental leave
Employment Type:
Full-time

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Hewlett Packard Enterprise
Location:
India, Bangalore
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility:
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer:
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
Employment Type:
Full-time

Senior Security Operations Engineer II

As a Senior Security Operations Engineer, you’ll play a key role in ensuring the...
Axon
Location:
United States, Scottsdale
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 7+ years of experience in operations, site reliability, or infrastructure engineering roles
  • Strong experience securing and managing cloud environments (e.g., AWS, Azure) and containerized workloads
  • Deep understanding of Linux systems, networking, distributed systems, and their associated security controls
  • Proficiency in automation, scripting, and security tooling integration to streamline operations and enforcement
  • Experience with security monitoring, alerting, SIEM platforms, and observability tools
  • Solid grasp of CI/CD practices with integrated security testing and compliance checks
  • Experience managing Kubernetes clusters and running containerized workloads in production
  • Experience deploying and administering any of the following: scalable cloud-native secrets solutions such as AWS KMS or Azure Key Vault; PKI solutions such as EJBCA, Smallstep, or Venafi; or vaulting solutions such as HashiCorp Vault
Job Responsibility:
  • Implementing and improving automated security checks in CI/CD pipelines to prevent vulnerabilities from reaching production
  • Writing, reviewing, and maintaining security-focused infrastructure-as-code for scalable and compliant deployments
  • Investigating security incidents, performing root cause analysis, and implementing long-term mitigation strategies
  • Collaborating with developers to develop new features, services, and infrastructure requirements
  • Enhancing security observability through improved log collection, metrics, and alerting configurations
  • Maintaining and improving security runbooks, incident response playbooks, and internal security tooling for operational efficiency
  • Resolving security and infrastructure incidents by participating in high-impact, high-visibility incidents as a participant and, ideally, as an incident commander
  • Maintaining and securing critical infrastructure components such as PKI (Public Key Infrastructure) and IAM (Identity & Access Management) systems, ensuring reliability, scalability, and compliance with organizational and industry security standards
  • Building and maintaining secure, reliable, and scalable infrastructure that protects core services and sensitive data
  • Troubleshooting and resolving complex operational and system-level issues across environments
What we offer:
  • Competitive salary and 401k with employer match
  • Discretionary paid time off
  • Paid parental leave for all
  • Medical, Dental, Vision plans
  • Fitness Programs
  • Emotional & Mental Wellness support
  • Learning & Development programs
  • Snacks in our offices
Employment Type:
Full-time

Engineer Reliability Fixed Equipment

HF Sinclair in El Dorado, KS is seeking a Fixed Equipment Engineer. This positio...
HF Sinclair
Location:
United States, El Dorado
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • A minimum of eight years of progressive work experience in a specific engineering discipline and project management experience is required
  • Emphasis on plant or refinery engineering, fixed equipment and/or mechanical integrity is required
  • A minimum of a Bachelor's degree in an engineering discipline is required
  • Technical expert in area of specialty
  • Advanced ability to stay abreast of new technology developments and processes and apply knowledge analytically
  • Strong knowledge of Microsoft products and commonly used engineering concepts and experience with engineering software
  • Familiarity with standards and practices of the specific discipline
  • Ability to effectively communicate with others, both written and verbal communication, advanced reading and writing skills, with the ability to perform advanced mathematical calculations
  • Ability to operate and drive all assigned company vehicles at company standard insurance rates is essential
  • Valid state driver's license and proof of insurance required
Job Responsibility:
  • Defines engineering projects by determining objectives, evaluating technical strategies, and providing plant-engineering support to assigned business unit(s)
  • Plans and leads engineering work by writing specifications, developing schedules and budgets, and identifying improvements to existing equipment, inspection practices, and mechanical integrity programs
  • Implements engineering solutions by monitoring performance, coordinating with Operations/Inspection/Maintenance, taking corrective actions, updating procedures and reports, and securing materials, supplies, and services
  • Completes projects by delivering final outputs, closing administrative requirements, and evaluating overall project performance, including lessons learned for future inspection, maintenance, and reliability work
  • Analyzes the economics of each project where appropriate and calculates ROI for proposed projects
  • Provides engineering documentation, mechanical integrity analysis, operating analysis, and recommendations for management
  • Supports mechanical integrity initiatives, including fixed equipment evaluations, repair plans, inspection scope development, and root-cause investigations
  • Develops and improves procedures for inspection, maintenance, and engineering tasks to enhance consistency, effectiveness, and compliance with applicable standards
  • Collaborates with multi-disciplinary teams (Operations, Maintenance, Inspection, Process) to prioritize and execute reliability and mechanical integrity improvements
What we offer:
  • Medical Insurance
  • Vision Insurance
  • Dental Insurance
  • Paid Time-Off
  • 401(k) Retirement Plan with match
  • Educational Reimbursement
  • Parental Bonding Time
  • Employee Discounts
Employment Type:
Full-time