Move It but Don’t Lose It: Re-Creating the Barkla HPC Cluster in the Cloud during a Complete HPC Systems Relocation.
Event Type
HPC Impact Showcase
Time
Thursday, 21 November 2019, 4:30pm - 5pm
Description
What do you do when you have to move a cluster but cannot afford to lose the resource time? You could rely on a Disaster Recovery (DR) plan, but maintaining a complete mirror can be cost-prohibitive in both physical hardware and systems management, especially if your backup site isn’t really geared up for HPC. So why not try public cloud? That’s what the team at the University of Liverpool decided to do. They were searching for a means to keep resources available during the move of the Barkla HPC cluster, a 5,000-core Dell EMC cluster containing Xeon Phi and Nvidia Tesla V100 GPUs, into its updated facilities at the Department for Advanced Computing, an on-site data center located on the university campus. This move, and the prospective downtime of up to two weeks that would come with it, sparked the need to completely overhaul how the team approached HPC resource management when a 'planned disaster' was on the way, particularly as their secondary on-site facility was not built to handle the HPC workload at the required level for that duration. The solution? Lean on the ephemeral strength of not one, but two public cloud platforms (AWS and Microsoft Azure) to architect a mission-critical variant of Barkla that would not only cover the cluster move, but also ensure users had (nearly) seamless access to cloud resources, lay the foundation for cloud bursting projects, and create a stronger, long-term DR process for their HPC facilities.