Senior Reliability Engineer

United States, Washington · Job Posted June 29, 2026

Job Description

Barbaricum is seeking an experienced Senior Site Reliability Engineer to support the reliability, availability, automation, and operational performance of IT and cloud systems under the Military Community and Family Policy (MC&FP) Outreach and Digital Enterprise Services (MODES) contract. You will help ensure MC&FP systems are reliable, scalable, resilient, and efficiently managed through proactive monitoring, automated incident response, performance optimization, and operational dashboards that support rapid decision-making.

Job Responsibility

  • Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid IT environments supporting MC&FP mission requirements
  • Implement proactive performance monitoring, automated alerting, incident response workflows, and resilience engineering practices to reduce downtime and improve operational visibility
  • Develop, maintain, and improve scalable automated infrastructure solutions that support reliable system operations and repeatable service delivery
  • Implement rollback strategies, recovery approaches, and chaos engineering practices to validate resilience, reduce operational risk, and improve system stability
  • Analyze usage patterns, capacity trends, and performance indicators to support dynamic scaling, resource optimization, and system improvement decisions
  • Develop and maintain real-time operational dashboards, reports, and metrics that enable rapid decision-making, leadership awareness, and system optimization
  • Respond to and resolve system outages, impairments, and service disruptions while coordinating with technical teams to minimize mission impact
  • Conduct post-incident reviews to identify root causes, document lessons learned, and implement preventative measures that reduce recurrence
  • Collaborate with software developers, cloud engineers, cybersecurity personnel, and operations teams to improve services, reliability patterns, deployment practices, and operational standards
  • Create and maintain system documentation, configuration standards, operational runbooks, monitoring procedures, and service reliability guidance
  • Automate common operations tasks to reduce manual workloads, improve consistency, and increase system efficiency
  • Implement security best practices across operational activities, infrastructure automation, monitoring, incident response, and system administration functions

Requirements

  • Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience
  • Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting practices
  • Experience with automation platforms and configuration management tools such as Ansible, Puppet, Chef, or similar technologies
  • Proficiency with scripting languages such as Python, Shell, PowerShell, or similar tools used to automate operational and infrastructure tasks
  • Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable cloud environments
  • Strong understanding of network troubleshooting, configuration, connectivity analysis, system dependencies, and performance bottleneck identification
  • Ability to design, interpret, and maintain dashboards, alerts, metrics, logs, and operational reporting that support service health and decision-making
  • Ability to conduct root cause analysis, post-incident reviews, and corrective action planning in complex technical environments
  • Strong problem-solving skills and the ability to work under pressure during outages, impairments, and time-sensitive operational issues
  • Excellent written and verbal communication skills, with the ability to explain technical findings, incident impacts, and reliability recommendations to technical and non-technical stakeholders
  • Bachelor’s degree in Computer Science, Information Technology, Systems Engineering, Cybersecurity, or a related field
  • Master’s degree preferred
  • Certifications related to cloud computing, system administration, site reliability engineering, DevSecOps, or automation are beneficial
  • 10+ years of experience in site reliability engineering, systems administration, infrastructure operations, cloud operations, DevSecOps, or a similar technical role, particularly in a government, federal, defense, or secure IT setting
  • Demonstrated experience maintaining reliable, scalable, and efficiently managed IT systems across on-premises, cloud, or hybrid environments
  • Experience developing automated infrastructure, operational scripts, monitoring solutions, dashboards, runbooks, and configuration standards
  • Experience supporting incident response, system outage resolution, post-incident reviews, root cause analysis, and operational improvement initiatives
  • Experience collaborating with development, infrastructure, cloud, cybersecurity, and program teams to improve reliability, security, and service performance
  • DoD Secret Security Clearance

Similar Jobs for Senior Reliability Engineer

8 matching positions

Senior Reliability Engineer

Are you looking for a career move that will put you at the heart of a global fin...
Location: Poland, Warsaw
Salary: 241750.00 - 411650.00 PLN / Year
Company: Citi
Expiration Date: Until further notice
Requirements
  • Senior software developer with scripting experience in Python/Shell/Bash/Java/Go
  • Experience with CI/CD in large enterprise tech-stack infrastructure architecture
  • Hands-on experience deploying and scaling the Model Context Protocol (MCP) in complex, enterprise environments
  • Understanding of SRE/DevOps/CI/CD
  • Strong analytical, algorithmic, and problem-solving skills
  • Excellent teamwork, proactive attitude, strong communication skills, both written and oral
Job Responsibility
  • You will work in an agile software development environment, developing quality and scalable software solutions using leading-edge technologies
  • You will work closely with developers, engineers and non-technology employees to help them be more productive with the use of the CI/CD tools
  • You will collaborate with Citi Developer Services engineers to automate manual and repetitive processes, integrate services with AI (by building and maintaining MCP - Model Context Protocol servers), enhance system resiliency, and coordinate service issue investigations by deploying best practices
  • Automate manual activities, repetitive processes, reporting, controls, etc., then configure and tune them
  • Build and maintain the foundational MCP servers that allow AI models to securely interact with enterprise systems
  • Continuously improve system resiliency, reliability, and business cost through the design and development of software solutions and streamlined processes
  • Mitigate risk by analyzing the root cause of production issues, impacts to business, and required corrective actions
What we offer
  • Employer paid Defined Contribution Pension Plan contribution of 6% of employee’s pensionable earnings (PPE Program)
  • Employer paid Private Medical Care Package for employees and Private Medical Care Packages for certain family members available at preferential rates
  • Employer paid Life Insurance Program for employees and Life Insurance for certain family members available at preferential rates
  • Employee Assistance Program financed by Employer
  • Paid Parental Leave Program (maternity and paternity leave: statutory and 2 weeks additional paid paternity leave)
  • Sport Card for employees subsidised via Social Benefits Fund and Sport Cards for certain family members available at preferential rates
  • Additional benefits from Company’s Social Benefit Fund, in particular: Holidays Allowance, support for sport and cultural activities, team building events
  • Additional day off for volunteering
  • Cafeteria/ flex benefit – a company benefits system which enables employees to select and purchase benefits offered by a provider and available for employees on the platform
Employment Type: Full-time

Senior Reliability Engineer

Be part of Amgen's newest and most advanced drug substance manufacturing plant. ...
Location: United States, Holly Springs
Salary: 123098.00 - 149145.00 USD / Year
Company: Amgen
Expiration Date: Until further notice
Requirements
  • High School Diploma / GED and 10 years of Engineering experience
  • Associate’s Degree and 8 years of Engineering experience
  • Bachelor’s Degree and 4 years of Engineering experience
  • Master’s Degree and 2 years of Engineering experience
  • Doctorate Degree
Job Responsibility
  • Lead all aspects of the delivery and continuous improvement of the Engineering Asset Management (AM), Reliability, Sustainability, and Continuous Improvement (CI) programs at Amgen North Carolina (ANC)
  • Serve as the primary system owner and subject matter expert for Engineering AM, CI, alarm management, data analytics, and audit readiness programs, acting as a key liaison between ANC Engineering, Global Asset Management, Sustainability, Quality, and Reliability organizations
  • Deploy and sustain a comprehensive, standardized Reliability Program aligned with corporate Reliability, Sustainability, and Industry 4.0 strategies
  • Establish and monitor standardized metrics across Manufacturing, Packaging, Laboratories, Maintenance, and Utilities to identify performance gaps, regulatory risks, and major reliability offenders, and to drive data-informed, risk-based improvement plans
  • Lead Asset Management, Reliability, CI, Sustainability, Alarm Management, and Audit Readiness programs
  • Establish data-driven reliability frameworks using analytics, dashboards, and KPIs
  • Lead MMP activities including PM optimization, job plans, and spare parts strategy
  • Develop risk-based action plans for reliable and sustainable utilities operations
  • Own alarm management and alarm review programs aligned with regulatory expectations
  • Drive sustainability initiatives (energy, waste, lifecycle optimization)
What we offer
  • Competitive and comprehensive Total Rewards Plans that are aligned with local industry standards
Employment Type: Full-time

Senior Reliability Engineer

The incumbent will be responsible to maintain site-wide equipment and facilities...
Location: Singapore, Tuas
Salary: Not provided
Company: Pfizer
Expiration Date: Until further notice
Requirements
  • Minimum 6 years for Degree or 15 years for Diploma in Mechanical or Chemical Engineering, with relevant experience in the pharmaceutical, chemical, or petrochemical industries
  • Experience with Root Cause Failure Analysis, Equipment Criticality Ranking, PM/PdM optimization, and/or Failure Modes and Effects Analysis
  • Strong knowledge and understanding of Current Good Manufacturing Practices (part of GxP)
  • Excellent oral and written communication skills
  • Working knowledge of MS Excel
  • Ability to manage complex issues and foster consensus among teams
  • Familiar with government code of practice, regulations, current Good Manufacturing Practice (cGMP), Good Documentation Practice (GDP) and Data Integrity (DI)
  • Good Mechanical Maintenance Troubleshooting, Repairs and Analysis Skills
  • Good Facilitation and Communication skills
  • Demonstrated problem-solving and relationship management skills
Job Responsibility
  • Maintain site-wide equipment and facilities, establishing optimization initiatives and a Right First Time strategy/technique to achieve high equipment and instrument reliability in compliance with cGMP, EHS, data integrity, and regulatory requirements in a cost-effective manner
  • Execute the reliability programs for plant mechanical equipment, instruments, and systems by establishing proactive, predictive, and preventive maintenance programs, conducting equipment inspections, analyzing data, providing recommendations, and following up with monitoring/improvements
  • Accountable for: cGMP and EH&S compliance
  • Mechanical System / Equipment failure and Root Cause Analysis
  • Equipment, instrument and system reliability and performance
  • Team performance
  • Maintenance PM work planning and data tracking
  • Report generation
  • Implementation, Execution, Commissioning/testing of new and obsolete Mechanical systems and equipment
Employment Type: Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment Type: Full-time

Senior Reliability Engineer - AV Labs

We are looking for a hardware focused Senior Reliability Engineer to focus on se...
Location: United States, Sunnyvale
Salary: 180000.00 - 200000.00 USD / Year
Company: Uber
Expiration Date: Until further notice
Requirements
  • 5+ years of relevant industry experience in software engineering, site reliability, or systems engineering
  • Experience with modern observability platforms (e.g., Prometheus, Grafana, ELK) in edge, IoT, or hardware-integrated environments
  • Coding skills in one or more of Go, Python, or C++, with experience building and operating production systems
  • Proficiency in Linux internals and shell scripting for triaging and debugging edge devices or hardware-adjacent systems
  • Ability to debug across services, containers (Docker), and networking stacks
  • Proven track record owning reliability, infrastructure, or platform systems for large-scale production workloads
  • Experience designing and operating observability systems (metrics, logging, alerting, and dashboards)
  • Experience defining and implementing SLIs and SLOs for system availability or data yield
  • Deep understanding of networking protocols (TCP/IP, gRPC, or MQTT) and data handling in bandwidth-constrained environments
  • Experience driving complex technical projects and architectural reviews across multiple teams from design through production
Job Responsibility
  • Architect Observability Systems: Design and scale an observability platform capable of ingesting and analyzing real-time health telemetry from thousands of distributed vehicle nodes
  • Build for Edge Constraints: Develop systems that remain performant despite hardware diversity, intermittent connectivity, and rapid fleet scaling
  • Define Criticality Models: Establish alerting strategies that distinguish transient anomalies from systemic issues impacting sensor uptime and data yield
  • Detect Complex Failure Modes: Design detection logic for 'silent' failures, such as sensor degradation, compute saturation, or recording pipeline stalls
  • Scale Through Automation: Design automated detection, triage, and mitigation mechanisms to eliminate manual intervention as the fleet grows
  • Partner on Mitigation: Collaborate with Operations and Engineering to build safe, automated responses to recurring hardware and software failure scenarios
  • Drive Operational Efficiency: Build technical interfaces to help Operations surface issues and Engineering diagnose and deploy mitigations rapidly (TTD/TTM)
  • Lead Technical Strategy: Drive reliability-focused design reviews and translate operational pain points into concrete technical requirements and roadmaps
  • Uncover Proactive Insights: Apply advanced data analytics to identify latent patterns in fleet telemetry, enabling the proactive detection of systemic regressions and hardware degradation before they impact operations
What we offer
  • Uber's bonus program
  • equity award & other types of comp
  • 401(k) plan
  • various benefits
Employment Type: Full-time

Senior Engineer, Reliability (Mechanical)

Location: Malaysia, Manjung
Salary: Not provided
Company: Airswift Sweden
Expiration Date: Until further notice
Requirements
  • Engineering Background: Preferably Mechanical, Electrical, or related disciplines
  • Experience: Minimum 8 years (at least 5 years in a relevant reliability or maintenance role)
  • Industry Exposure: Candidates from O&L, heavy machinery, mining, or cement industries are highly preferred
  • Strong background in mechanical conveyors and bulk material handling systems
  • Ability to interpret technical drawings and support simple fabrication needs
  • Familiarity with steel fabrication, welding, and vibration analysis
  • Proficient in root cause analysis (RCA), FMEA, LDA, and other reliability tools
  • Strong data analysis capabilities, especially using CMMS (e.g., SAP)
  • Coordinate across multiple teams: execution, planning, process inspection
  • Lead and develop a multi-skilled team, including mechanical and electrical engineers
Job Responsibility
  • 80% strategic focus on equipment lifecycle cost optimization
  • 20% tactical involvement in daily maintenance and reliability operations
Employment Type: Full-time

Senior Reliability Engineer - PCBA, Harness & Connectors

We are looking for a Senior Reliability Engineer in charge of developing and exe...
Location: United States, San Jose
Salary: 150000.00 - 225000.00 USD / Year
Company: Figure
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in relevant reliability engineering areas
  • Bachelor's degree or higher in relevant science and engineering fields
  • Strong knowledge of environmental reliability test principles, models, and methodologies, such as high temperature high humidity, thermal cycle/shock, mechanical vibration/shock
  • Strong knowledge of industry test standards such as AEC-Q, JEDEC, and IPC standards
  • Strong knowledge of electrical circuits, PCBA design and relevant SW tools (e.g. Altium)
  • Strong knowledge of PCBA, harness and connector failure modes, mechanisms, and FA techniques
  • Hands-on experience with field reliability risk analysis and failure prediction methods
  • Hands-on experience with Weibull++, JMP, or other reliability statistical analysis software
  • Hands-on experience with electronic circuit debugging and relevant tools, e.g. source meter, oscilloscope
  • Hands-on experience with a 3D CAD tool (e.g. CATIA)
Job Responsibility
  • Work with cross-functional teams, own hardware reliability requirements and validation strategy
  • Develop and execute accelerated life tests for PCBAs, electronic components, electrical harness and connectors
  • Lead DFMEA efforts with design engineers to assess design risks, impacts, controls, and corrective actions
  • Design reliability test flows and procedures, communicate with internal and external/CM teams to execute tests and report results
  • Work with test engineers to design setup and fixtures used in reliability testing
  • Guide and support PCBA, harness, connector failure analysis, design of experiments (DOEs) and corrective action processes with cross-functional teams
  • Analyze field data, assess field risks, and design tests that correlate to field usage conditions
Employment Type: Full-time

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...
Location: United Kingdom; United States; Canada
Salary: Not provided
Company: Mozilla
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
  • Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
  • Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
  • Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
  • Excellent async written communication skills; comfortable working with a geographically distributed team
  • Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
  • Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes
Job Responsibility
  • Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
  • Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
  • Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
  • Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
  • Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
  • Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
  • Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
  • Contribute to runbooks, architecture documentation, and team processes
What we offer
  • Fully remote work & schedule flexibility
  • Company-provided laptop
  • Annual bonus program
  • Monthly remote work stipend
  • Annual professional development stipend
  • Industry conferences
  • Company all-hands and team gatherings
  • 24 days PTO per year (prorated)
  • Birthday
  • Year-end company shutdown
Employment Type: Full-time