Senior Site Reliability Engineer Job at Onebrief (Honolulu, Oahu)

Job Description

Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. We are hiring a Senior Site Reliability Engineer to join our Infrastructure & Security team. You’ll work closely with fellow SREs, security, and customer success. You will be the first line of support for our mission-critical deployments and will be responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premises DoD environments and AWS cloud environments.

Job Responsibilities

Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana)
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs), increasing trust internally and externally
Leading Incident Response: Act as incident responder, and potentially incident commander, during critical incidents, and lead blameless post-mortems / After Action Reviews (AARs)
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible)
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation

Requirements

Active Top Secret clearance
5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus
Proven partner to DevOps/Platform and application teams who collaborates well across functions and shares context openly
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement
Infrastructure as Code: Terraform (or CloudFormation), Ansible
Containers and orchestration: Kubernetes design, deployment, and operations
CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions)
Scripting: proficiency with at least one of Python, Go, or Bash
Cloud: familiarity with AWS or AWS GovCloud
Observability: Grafana stack, ELK stack, or Datadog
Networking fundamentals: core protocols and secure configurations

Nice to have

Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503)
GitOps practices and toolchains
Security‑minded design for sensitive environments
Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems
Familiarity with on‑prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.)
Service mesh exposure (Istio, Linkerd)
Relevant certifications (e.g., AWS DevOps Engineer, CKA/CKAD)
Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment

What we offer

Equity: Share in the company's success
Flexible Work Environment: Remote-first organization* with flexible work hours and unlimited PTO
Comprehensive Health Coverage: Health, dental, vision, and life insurance
Retirement Plan: 401(k) plan with company match to secure your future
Parental Leave: 8 weeks at 100% regardless of state
Company Retreats: Annual company summit trips
Home Office Budget: $1,000 per year for home office improvements
Relocation assistance
