Barbaricum is seeking an experienced Senior Site Reliability Engineer to support the reliability, availability, automation, and operational performance of IT and cloud systems under the Military Community and Family Policy (MC&FP) Outreach and Digital Enterprise Services (MODES) contract. You will help ensure MC&FP systems are reliable, scalable, resilient, and efficiently managed through proactive monitoring, automated incident response, performance optimization, and operational dashboards that support rapid decision-making.
Job Responsibilities
Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid IT environments supporting MC&FP mission requirements
Implement proactive performance monitoring, automated alerting, incident response workflows, and resilience engineering practices to reduce downtime and improve operational visibility
Develop, maintain, and improve scalable automated infrastructure solutions that support reliable system operations and repeatable service delivery
Implement rollback strategies, recovery approaches, and chaos engineering practices to validate resilience, reduce operational risk, and improve system stability
Analyze usage patterns, capacity trends, and performance indicators to support dynamic scaling, resource optimization, and system improvement decisions
Develop and maintain real-time operational dashboards, reports, and metrics that enable rapid decision-making, leadership awareness, and system optimization
Respond to and resolve system outages, impairments, and service disruptions while coordinating with technical teams to minimize mission impact
Conduct post-incident reviews to identify root causes, document lessons learned, and implement preventative measures that reduce recurrence
Collaborate with software developers, cloud engineers, cybersecurity personnel, and operations teams to improve services, reliability patterns, deployment practices, and operational standards
Create and maintain system documentation, configuration standards, operational runbooks, monitoring procedures, and service reliability guidance
Automate common operations tasks to reduce manual workloads, improve consistency, and increase system efficiency
Implement security best practices across operational activities, infrastructure automation, monitoring, incident response, and system administration functions
Requirements
Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience
Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting practices
Experience with automation platforms and configuration management tools such as Ansible, Puppet, Chef, or similar technologies
Proficiency with scripting languages such as Python, Shell, PowerShell, or similar tools used to automate operational and infrastructure tasks
Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable cloud environments
Strong understanding of network troubleshooting, configuration, connectivity analysis, system dependencies, and performance bottleneck identification
Ability to design, interpret, and maintain dashboards, alerts, metrics, logs, and operational reporting that support service health and decision-making
Ability to conduct root cause analysis, post-incident reviews, and corrective action planning in complex technical environments
Strong problem-solving skills and the ability to work under pressure during outages, impairments, and time-sensitive operational issues
Excellent written and verbal communication skills, with the ability to explain technical findings, incident impacts, and reliability recommendations to technical and non-technical stakeholders
Bachelor’s degree in Computer Science, Information Technology, Systems Engineering, Cybersecurity, or a related field
Master’s degree preferred
Certifications related to cloud computing, system administration, site reliability engineering, DevSecOps, or automation are beneficial
10+ years of experience in site reliability engineering, systems administration, infrastructure operations, cloud operations, DevSecOps, or a similar technical role, particularly in a government, federal, defense, or secure IT setting
Demonstrated experience maintaining reliable, scalable, and efficiently managed IT systems across on-premises, cloud, or hybrid environments
Experience supporting incident response, system outage resolution, post-incident reviews, root cause analysis, and operational improvement initiatives
Experience collaborating with development, infrastructure, cloud, cybersecurity, and program teams to improve reliability, security, and service performance