Workshop: Keynote 2: Toward Scaling Deep Learning to 100,000 Processors - The Fugaku Challenge
Abstract: Modern AI with deep learning incurs significant overhead when training over very large data sets, so the use of HPC techniques to compute in parallel on large machines is becoming increasingly popular. However, most efforts to date have targeted GPUs at relatively small scale, on the order of a few hundred devices, reaching a thousand only in a fairly limited set of cases, due to inherent difficulties. On Fugaku we plan to extend the capabilities of deep learning by allowing training to be done on the full machine, that is, on more than 100,000 nodes. This requires various technological underpinnings as well as new algorithms for scalable training; the current state of this ongoing effort will be described.
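The standard HPC approach to parallel training alluded to above is data parallelism: each node computes a gradient on its own data shard, and an allreduce collective averages the gradients so every replica applies the same update. The sketch below is illustrative only (not the Fugaku implementation); it simulates the allreduce in-process rather than using MPI/NCCL, and all function names are hypothetical:

```python
# Minimal sketch of data-parallel SGD. Each "worker" computes a gradient
# on its own shard; allreduce_mean simulates the collective that averages
# gradients across workers, keeping all model replicas in sync.
# (Real large-scale runs use MPI or NCCL allreduce across nodes.)

def local_gradient(weights, shard):
    # Toy gradient of mean squared error for the model y = w * x,
    # computed over one worker's data shard.
    return [
        sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        for w in weights
    ]

def allreduce_mean(grads_per_worker):
    # Simulated allreduce: element-wise mean over all workers' gradients.
    n = len(grads_per_worker)
    return [
        sum(g[i] for g in grads_per_worker) / n
        for i in range(len(grads_per_worker[0]))
    ]

def train_step(weights, shards, lr=0.1):
    grads = [local_gradient(weights, s) for s in shards]  # parallel in reality
    avg = allreduce_mean(grads)                           # global sync point
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, one weight; data consistent with the true value w = 2.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = [0.0]
for _ in range(200):
    w = train_step(w, shards)
print(round(w[0], 3))  # converges toward 2.0
```

At the scale the abstract targets, the allreduce itself becomes the bottleneck, which is one reason new algorithms for scalable training are needed alongside the communication infrastructure.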