Keynote 2: Toward Scaling Deep Learning to 100,000 Processors

SC19 Proceedings

Keynote 2: Toward Scaling Deep Learning to 100,000 Processors - The Fugaku Challenge

Abstract: Modern AI based on deep learning incurs significant training cost on very large data sets, so the use of HPC techniques to compute in parallel on a large machine is becoming increasingly popular. However, most efforts have targeted GPUs at relatively low scale, on the order of a few hundred and up to a thousand except in a fairly limited set of cases, due to inherent difficulties. On Fugaku we plan to extend the capabilities of deep learning by allowing training to be done on the full machine, that is, on more than 100,000 nodes. This requires various technological underpinnings as well as new algorithms for scalable training; the talk will describe the current state of this ongoing effort.

Back to 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems Archive Listing
