Onebrief is collaboration and AI-powered workflow software designed specifically for military staffs. We are hiring a Site Reliability Engineer to join our Infrastructure & Security team. You’ll work closely with fellow SREs, security, and customer success. You will be the first line of support for our mission-critical deployments and responsible for ensuring best-in-class service quality and issue resolution. You will work in both on-premise DoD environments and AWS cloud environments.
Job Responsibilities:
Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana)
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs), increasing trust internally and externally
Leading Incident Response: Act as an incident responder, and potentially incident commander, during critical incidents, and lead blameless post-mortems / After Action Reviews (AARs)
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code (Terraform, Ansible)
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation
Requirements:
Active Top Secret clearance
5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus
Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement
Infrastructure as Code: Terraform (or CloudFormation), Ansible
Containers and orchestration: Kubernetes design, deployment, and operations
CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions)
Scripting: proficiency with at least one of Python, Go, or Bash
Cloud: familiarity with AWS or AWS GovCloud
Observability: Grafana stack, ELK stack, or Datadog
Networking fundamentals: core protocols and secure configurations
Nice to have:
Experience in DoD environments and compliance frameworks (RMF, STIGs, ICD 503)
GitOps practices and toolchains
Security‑minded design for sensitive environments
Experience designing and implementing meaningful SLIs/SLOs (including error budgets) for complex, distributed systems
Familiarity with on‑prem virtualization (VMware, Proxmox, Nutanix, Hyper-V, etc.)