HPE Operations is our innovative IT services organization. It provides the expertise to advise, integrate, and accelerate our customers’ outcomes from their digital transformation. Our teams collaborate to transform insight into innovation. In today’s fast-paced, hybrid IT world, being at business speed means overcoming IT complexity to match the speed of actions to the speed of opportunities. Deploy the right technology to respond quickly to market possibilities. Join us and redefine what’s next for you.
Job Responsibility:
Review and validate HPC solutions and environments through POCs and benchmarking
Architect and design HPC solutions tailored to the customer’s needs
Oversee solution implementation, integration and testing
Diagnose and correct solution issues during implementation
Provide training, documentation and ongoing support
Manage the life cycle of the HPC environment
Oversee the team operations and deliverables
Lead the team with technical expertise and ensure regular technical sessions and case reviews
Demonstrate a high level of technical and communication skills in critical situations
Take responsibility for end-to-end problem ownership and resolution
Should be a good team player
Requirements:
8-12 years of experience with different flavours of Linux such as SLES, RHEL and Ubuntu/Debian
5-8 years of experience managing HPC/Linux clusters, with a good understanding of their architecture
Skilled in installation and configuration of various applications on Linux
Install, administer, and maintain hardware, system software, networking, accounts, and security measures on VMware configurations
Diagnose and resolve system issues and performance issues
Should have experience in drafting technical SOPs, action plans and knowledge documents
Should have a good understanding of different cloud platforms
Restore system integrity as quickly as possible following an outage to minimize downtime
Triage and solve user-submitted tickets, especially when they relate to the infrastructure
Track resource usage using monitoring and queuing software
Willingness to assist peers is an added advantage
Demonstrated expertise with Linux system administration, including OS, networking, storage, Docker and security
Experience with high-speed networking such as InfiniBand and 10/40 Gigabit Ethernet
Familiarity with large storage systems (Scality, Weka, Lustre, GPFS, others)
Experience with HPC cluster managers (HPCM, Bright Cluster Manager)
Experience in server hardware patching and troubleshooting
Experience managing HPC clusters and GPUs
Experience using and supporting job schedulers such as SLURM, PBS or other schedulers
Familiar with Shell/Python scripting and Ansible
Familiar with monitoring tools like Grafana/Nagios/OpsRamp
Familiar with virtualization technologies like KVM, VMware, vCenter
Infrastructure Monitoring: Nagios, OpsRamp, HPE PCM, NVIDIA BCM, SolarWinds
Virtualization: Containers, Kubernetes, VMware and OpenShift