The Senior Data Center Operations Engineer is responsible for the bedrock of our high-availability infrastructure. This role bridges the gap between physical hardware and the Red Hat OpenShift Container Platform (OCP). Your mission is to ensure 99.99% availability by architecting resilient physical layouts and automating the deployment, scaling, and self-healing capabilities of our production clusters.
Job Responsibilities:
Design and develop a scalable, distributed management plane infrastructure to manage Palo Alto Networks’ next-generation network security solutions
Ensure 99.99% availability by architecting resilient physical layouts and automating the deployment, scaling, and self-healing capabilities of our production clusters
Monitor and maintain data center systems with a focus on 'Zero Single Point of Failure' (ZSPoF) architecture for OpenShift control planes and worker nodes
Implement and manage OpenShift 4.x clusters across multiple power and cooling zones to ensure 99.99% uptime
Design, test, and execute automated failover strategies and backup/restore procedures using tools like OADP (Velero) and Red Hat ACM
Perform routine maintenance and upgrades using GitOps (Argo CD) and the Machine Config Operator to ensure zero-downtime node drains and patching
Resolve deep-stack hardware and software issues, from faulty GPU firmware to latency in the OpenShift cluster network (OVN-Kubernetes)
Coordinate with vendors for specialized hardware (e.g., NVIDIA, Dell, Cisco) while maintaining strict security and firmware compliance
Optimize rack density for high-performance GPU clusters while managing thermal loads and power distribution (PDU) to prevent circuit-trip outages
Maintain accurate documentation and integrate hardware health metrics (IPMI/SNMP) into Prometheus/Grafana for proactive alerting
Rack and stack high-density GPU servers, ensuring redundant power-pathing and high-speed (100G/200G) InfiniBand or Ethernet cabling
Perform precision physical installation and replacement of critical components (CPUs, GPUs, NVMe storage) in a live production environment without impacting cluster quorum
Requirements:
Bachelor's degree in Computer Science, IT, or equivalent experience
5+ years of experience specifically operating Red Hat OpenShift (OCP) in a production environment
Deep experience racking/stacking and cabling high-density GPU systems (e.g., NVIDIA DGX or similar) and specialized AI/ML hardware
Advanced proficiency in Ansible or Pulumi for automating bare-metal provisioning and cluster configuration
Strong Python and Bash skills for developing custom health-check scripts and API integrations
Expert-level CoreOS and RHEL administration, including kernel tuning and systemd management
Solid understanding of BGP, VLAN tagging, LACP, and load balancing (F5/NGINX), all of which are essential for cluster ingress
Experience with vSphere or KVM, and persistent storage solutions like OpenShift Data Foundation (ODF) or Ceph
Familiarity with DCIM tools (NetBox) and monitoring stacks (ELK, Loki, etc.)
Ability to lift and move equipment up to 50 pounds (e.g., high-density 2U/4U servers)
Comfortable working in high-decibel, climate-controlled data center aisles
Capable of standing, walking, and performing precision cabling in tight rack spaces for extended periods
May require occasional travel to remote data center sites or edge locations