Site Reliability Engineer Platform Engineer

United States, Reston · Job Posted March 25, 2026

Job Description

Join a mission-driven, national financial services organization at the heart of the U.S. housing finance ecosystem. This is a mid-sized, highly regulated enterprise operating at market scale, supporting platforms and analytics that enable trillions of dollars in annual economic activity. You’ll work in a modern tech environment with strong engineering partners, clear business impact, and a mandate for reliability, security, and continuous improvement.

Our client is hiring a hands-on SRE / Platform Engineer to operate, tune, and scale its OpenShift/Kubernetes platforms while bridging on-prem to Azure to power its analytics ecosystem. You’ll own reliability, automation, and observability across a hybrid estate, partnering closely with developers, data engineers, infrastructure operations, and security to deliver secure, performant platform services using modern DevSecOps practices.

Job Responsibility

  • Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
  • Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
  • Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
  • Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
  • Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
  • Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
  • Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement
  • Advance the hybrid service model—migrations, integrations, reliability/latency tuning, cost and performance optimization
  • Operate and optimize OpenShift/Kubernetes clusters, ingress (e.g., Nginx), and container networking/service mesh
  • Manage Azure services (compute, VNet, storage, data services) supporting analytics workloads
  • Build and maintain automated infrastructure with Terraform, Ansible, and GitOps workflows
  • Implement and evolve observability (Datadog, Prometheus, Grafana): metrics, traces, logs, alerting, SLOs, runbooks
  • Design, harden, and support delivery pipelines with ArgoCD/Jenkins/GitHub Actions
  • Provide platform tooling and enablement for application developers, data engineers, and operations teams
  • Ensure security and access management (HashiCorp Vault, secrets management, least privilege)
  • Lead incident response, coordinate cross-functional resolution, and drive corrective actions and platform improvements
  • Script or develop tools in Bash, Python, or Go to eliminate toil and improve developer experience

Requirements

  • 5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters
  • Strong experience with Microsoft Azure (compute, networking, storage, and data services)
  • Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
  • Proficiency with observability tooling (Datadog, Prometheus, Grafana)
  • Scripting/coding ability in Bash, Python, or Go

Nice to have

  • Experience bridging on-prem and cloud in a hybrid service model (migration, integration, optimization)
  • Expertise with Kafka/AMQ, HashiCorp Vault, and ArgoCD/Jenkins/GitHub Actions
  • Background leading incident response and postmortems with strong RCA and continuous improvement practices

Similar Jobs for Site Reliability Engineer Platform Engineer

8 matching positions

Development Platform Site Reliability Engineer

Join Barclays as a Development Platform Site Reliability Engineer role, where to...
Location: India, Pune
Salary: Not provided
Barclays
Expiration Date: Until further notice
Requirements
  • 3–7 years in SRE/DevOps/Platform Engineering (or Software engineer with production ops ownership)
  • Python or Java + Bash
  • JSON/YAML
  • Terraform and/or CloudFormation
  • Jenkins and/or GitLab CI/CD
  • Elastic/Grafana/Prometheus (monitoring, dashboards, alerts)
  • Linux troubleshooting; fundamentals of networking/security/distributed systems
  • Incident response + automation + documentation/runbooks
Job Responsibility
  • Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
  • Analyse, respond to, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring
  • Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
  • Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning
  • Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
  • Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime

Big Data/Data Platform Site Reliability Engineer

About PulsePoint: PulsePoint is a fast-growing healthcare technology company (wi...
Location: United Kingdom
Salary: Not provided
PulsePoint
Expiration Date: Until further notice
Requirements
  • Strong hands-on experience operating large-scale Linux infrastructure in production (Rocky Linux or equivalent)
  • Deep practical knowledge of Apache Hadoop-based data platforms, including HDFS architecture and failure modes, Kerberos-based security models, and operational lifecycle (upgrades, scaling, recovery)
  • Experience running Apache Kafka clusters in production, including KRaft-based setups
  • Proven ability to debug complex distributed system issues across storage, compute, and networking layers
  • Experience designing or improving automation, deployment, or GitOps-style workflows
  • Proficiency in scripting or automation (Python, Shell, etc.)
  • Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, basic network security concepts)
  • Comfortable taking technical ownership, driving reliability improvements, and participating in on-call / incident processes
  • Willing and able to work East Coast U.S. hours (9am–6pm EST)
Job Responsibility
  • Deploying, configuring, monitoring and maintaining multiple big data stores across multiple datacenters, with a strong focus on reliability, scalability, and operational excellence
  • Planning, configuring, deploying, and managing the lifecycle of critical data infrastructure
  • Managing large-scale Linux infrastructure to ensure maximum uptime and predictable performance
  • Developing and documenting system configuration standards, operational procedures, and best practices
  • Performance and reliability testing, including reviewing configuration, software choices, versions, and hardware specifications
  • Participating in incident response, root cause analysis, and driving long-term reliability improvements
  • Advancing our technology stack with innovative ideas and pragmatic solutions

Cloud Platform Engineer (Site Reliability)

We have an exciting opportunity for a Cloud Platform Engineer (Site Reliability)...
Location: United States, Houston
Salary: Not provided
Amentum
Expiration Date: Until further notice
Requirements
  • Typically requires a bachelor’s degree or equivalent certification in a related area, and normally 10 years of experience in the field or in a related area
  • Strong experience with Kubernetes in production
  • Ability to manage and use GitLab (preferably very proficient)
  • Hands-on experience with CI/CD pipeline tools
  • Observability and monitoring tools such as Grafana and Superset
  • Proficiency with Infrastructure-as-Code utilizing Terraform for infrastructure automation and/or open source alternatives (OpenTofu)
  • Extensive Linux experience (familiarity with Windows also preferred, but not required)
  • Expert in at least one programming language (Go or Python preferred)
  • Experience with Python and SQL (R is a plus)
  • Working understanding of Machine Learning Model Lifecycle management (preferred)
Job Responsibility
  • Developing new cloud-native platform services spanning all three major cloud environments
  • Developing best practices for cloud-native application development and promoting them within the organization
  • Administering NASA cloud networks and managing requests for deployment of COTS and Cloud Native applications into cloud environments
  • Writing quality code, providing quality and engaged code reviews for peers
  • Working with managed Kubernetes offerings across all three major cloud providers
  • Integrating cloud managed AI and data services with other bespoke and open-source Kubernetes applications
  • Identifying opportunities to abstract Prospective Project requirements and develop Enterprise-grade, multi-tenant Platform Services
  • Collaborating with NASA security and compliance teams to ensure teams adhere to industry best practices and regulatory requirements
  • Working directly with NASA human spaceflight missions like Orion, Lunar Gateway, Artemis
What we offer
  • Excellent personal and professional career growth
  • 9/80 work schedule (every other Friday off), when applicable
  • Onsite cafeteria (breakfast & lunch)
  • Health, dental, and vision insurance
  • Paid time off and holidays
  • Retirement benefits (including 401(k) matching)
  • Educational reimbursement
  • Parental leave
  • Employee stock purchase plan
  • Tax-saving options
  • Fulltime

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...
Salary: 175,000 - 225,000 USD / Year
Zilliz
Expiration Date: Until further notice
Requirements
  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
  • Proficiency in scripting languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
  • Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously
Job Responsibility
  • Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
  • Develop and implement strategies for monitoring, incident management, and disaster recovery
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
  • Collaborate with software engineers to enhance system reliability, scalability, and performance
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency
  • Fulltime

Site Reliability Engineer - Container Platform

Join Barclays as a Site Reliability Engineer - Container Platform role, where yo...
Location: India, Pune; Chennai
Salary: Not provided
Barclays
Expiration Date: Until further notice
Requirements
  • Minimum Qualification – bachelor’s degree
  • Experience configuring, using, or maintaining Kubernetes (OpenShift, EKS, AKS, or ArgoCD)
  • Experience in developing and coding software using Python or Golang
  • Experience with Docker, Containers and Cloud-Native utilities and software
Job Responsibility
  • Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
  • Analyse, respond to, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring
  • Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
  • Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning
  • Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
  • Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime

Site Reliability Engineer - Data Platform Operation

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform...
Location: Brazil, Sao Paulo
Salary: Not provided
Amaris Consulting
Expiration Date: Until further notice
Requirements
  • Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
  • Experience: 5+ years hands-on with cloud platforms (Azure, AWS, GCP), programming (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
  • English language: Professional working proficiency in English and the local language
  • Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
  • AI experience: Experience supporting enterprise Data & AI platforms
  • Soft skills: Analytical problem-solving
  • Effective communication and active listening
  • Team player with respect for others
  • Strong troubleshooting and platform monitoring skills
  • Automation (Python, PowerShell, CLI, KQL, Terraform)
Job Responsibility
  • Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
  • Monitor Azure workloads, troubleshoot incidents, alerts, and performance bottlenecks
  • Implement and manage RBAC, identity & access policies, and compliance controls
  • Optimize Azure cost and performance using Azure Monitor, Datadog, and Cost Management tools
  • Automate tasks using PowerShell, Azure CLI, Terraform, and Python
  • Utilize Git, GitHub Actions, and Airflow for workflow automation
  • Provide L2/L3 support for data pipelines, reporting, and cloud services
  • Conduct incident response, root cause analysis (RCA), and proactive issue resolution
  • Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
  • Follow ITSM processes: Incident, Change, and Problem Management
What we offer
  • An international community bringing together 110+ different nationalities
  • An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
  • A robust training system with our internal Academy and 250+ available modules
  • A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
  • Strong commitments to CSR, notably through participation in our WeCare Together program

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare!...
Location: France, Paris
Salary: Not provided
Doctolib
Expiration Date: Until further notice
Requirements
  • At least 5 years of site reliability engineering experience
  • Experience with AWS, Terraform, Kubernetes, and GitHub Actions supporting the deployment of applications developed on the JVM and/or TypeScript
  • Proactive, curious, collaborative and eager to learn
  • Proven experience with cloud services such as AWS, Azure or Google Cloud
  • Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
  • Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles
Job Responsibility
  • Collaborating with Feature teams to ensure services align with developer needs
  • Driving improvements by evaluating new technologies and processes
  • Defining best practices (golden paths) for software development and deployment
  • Developing and maintaining tools and services that facilitate implementation of best practices
  • Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
  • Collaborating on roadmap delivery
What we offer
  • Free Health Insurance for you
  • Up to 14 days of RTT
  • A flexible workplace policy offering both hybrid and office-based modes
  • Flexibility days allowing work from EU countries and the UK for 10 days per year
  • Wellbeing program with free mental health and coaching through moka.care
  • Special support package for caregivers and workers with disabilities
  • Lunch voucher with Swile card
  • Work Council subsidy for sport club membership or creative activities
  • Bicycle subsidy
  • Public transportation reimbursement
  • Fulltime

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare....
Location: Germany, Berlin
Salary: Not provided
Doctolib
Expiration Date: Until further notice
Requirements
  • At least 5 years of site reliability engineering experience
  • Experience with AWS, Terraform, Kubernetes, and GitHub Actions supporting the deployment of applications developed on the JVM and/or TypeScript
  • Proactive, curious, collaborative and eager to learn
  • Proven experience with cloud services such as AWS, Azure or Google Cloud
  • Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
  • Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles
Job Responsibility
  • Collaborating with Feature teams to ensure services align with developer needs
  • Driving improvements by evaluating new technologies and processes
  • Defining best practices ("golden paths") for software development and deployment
  • Developing and maintaining tools and services that facilitate best practices
  • Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
  • Collaborating on roadmap delivery
What we offer
  • Company health insurance through partner Allianz
  • Minimum 28 days of paid leave
  • Parent Care Program: one additional month of leave on top of legal parental leave
  • Free mental health and coaching services through partner Moka.care
  • For caregivers and workers with disabilities, a package including adaptation of remote policy, extra days off for medical reasons, and psychological support
  • Flexible workplace policy offering both hybrid and office-based mode
  • Work from EU countries and the UK for up to 10 days per year
  • Reimbursement of public transportation
  • Fulltime