The person should have experience from a standard datacentre infrastructure point of view, but must also have the following experience and knowledge:
Extensive experience managing HPC networking at an expert or near-expert level, involving technologies and fabrics based on InfiniBand, Omni-Path, etc. This includes, but is not limited to, the common tools, monitoring, sampling of telemetry data, and best practices for these types of fabrics.
Specific knowledge of technologies used in HPC clusters, for example RDMA and NVMe-oF.
Knowledge of HPC network topologies and why they are used for different cluster types and sizes (fat tree, 3D torus, hypercube, etc.).
Knowledge of and experience with clustered / distributed filesystems and scale-out storage technologies (for example Lustre, Ceph, Isilon, etc.).
Previous experience in monitoring HPC infrastructure (the observer problem). Must understand sample rates and how to tap telemetry data that supports our mission without negatively affecting the client workloads.
Extensive experience managing and troubleshooting HPC environments from an infrastructure perspective. This covers everything from local node problems, blade chassis, and intra- and interconnect traffic in the topology to process / job management and tracing backwards to understand which workload is affecting the cluster abnormally.
Should have previous experience in job scheduling with, for example, IBM Spectrum LSF (used by the client), Slurm, MOAB or equivalent.
Even though this is out of scope from a solution perspective, it makes a great difference overall when it comes to managing the HPC infrastructure well.
For the Capgemini HPC team to engage and work together with the client HPC application team, this experience and knowledge is much preferred.
At least a medium level of computer science and computer architecture knowledge.
Bonus if the person has previous experience working directly with MPI or developer teams, or has done actual programming in the HPC area.
Bonus if the person has HPE cluster experience, since most of the client's clusters are delivered by HPE.
Extra bonus if the person has experience with hybrid HPC clusters or GPU-based / CUDA clusters, given the client's direction (autonomous cars, smart traffic systems, etc.).