Tools and Best Practices for Distributed Deep Learning on Supercomputers
Time: Sunday, 17 November 2019, 1:30pm - 5pm
Description: This tutorial is a practical guide to running distributed deep learning effectively over multiple compute nodes. Deep learning (DL) has emerged as an effective analysis method and has been adopted quickly across many scientific domains in recent years. Domain scientists are embracing DL both as a standalone data science method and as an effective approach to reducing dimensionality in traditional simulations. However, because of its inherently high computational requirements, the application of DL is limited by the available computational resources. Recently, we have seen the fusion of DL and high-performance computing (HPC): supercomputers have shown an unparalleled capacity to reduce DL training time from days to minutes, and HPC techniques have been used to speed up parallel DL training. Distributed deep learning therefore has great potential to augment DL applications by leveraging existing high-performance computing clusters.
This tutorial consists of three sessions. First, we will give an overview of state-of-the-art approaches to enabling deep learning at scale. The second session is an interactive hands-on session that helps attendees run distributed deep learning on Frontera at the Texas Advanced Computing Center. The last session focuses on best practices for scaling, evaluating, and tuning performance.
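As a rough illustration of the kind of data-parallel training the hands-on session covers, the sketch below simulates synchronous SGD with gradient averaging, the allreduce pattern used by frameworks such as Horovod, on a toy linear model. This is a minimal single-process simulation, not code from the tutorial materials; all function names, data shapes, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def allreduce_mean(grads):
    # Stand-in for an MPI/NCCL allreduce: average gradients across workers
    # so every replica applies the same update.
    return sum(grads) / len(grads)

def local_gradient(w, X, y):
    # Gradient of mean squared error for a linear model on one worker's shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
n_workers, lr = 4, 0.2
w = np.zeros(3)

# Each worker holds its own shard of the (synthetic) training data.
shards = [(rng.normal(size=(32, 3)), rng.normal(size=32)) for _ in range(n_workers)]

for step in range(100):
    # 1. Every worker computes a gradient on its local shard.
    grads = [local_gradient(w, X, y) for X, y in shards]
    # 2. An allreduce averages the gradients, keeping all replicas in sync.
    g = allreduce_mean(grads)
    # 3. Identical parameter update on every worker.
    w -= lr * g
```

In a real multi-node run the `allreduce_mean` step is what the communication library performs over the interconnect, and its cost relative to the local gradient computation is a central theme when scaling and tuning performance.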