Advisor: Dhabaleswar K. Panda (Ohio State University)
Abstract: Recent advances in Machine/Deep Learning techniques have triggered key success stories in many application domains like Computer Vision, Speech Comprehension and Recognition, and Natural Language Processing. Large-scale Deep Neural Networks (DNNs), which are at the core of state-of-the-art AI technologies, have been the primary drivers of this success. Training very complicated and large DNN architectures on a large number of training examples (data) is compute-intensive and can take weeks to months to achieve state-of-the-art prediction capabilities. Making the DNN deeper is also a common strategy for achieving higher accuracy. These requirements have led to a simple but powerful approach called Data Parallelism to achieve shorter training times. This has resulted in various research studies and ML/DL software like TensorFlow and PyTorch, as well as distributed-training middleware like Horovod. In addition, DNNs that do not fit in GPU memory form an emerging class of workloads that we call Out-of-Core DNNs, and different strategies (out-of-core training and model parallelism) are needed to train them. Clearly, large-scale DNN training brings forward new requirements for computation runtimes like CUDA and communication middleware like the MVAPICH2 MPI library. In this thesis, we broadly explore three different strategies to train DNNs on modern CPU and GPU architectures: 1) Data Parallelism, 2) Model Parallelism, and 3) Out-of-Core Training. We address the key challenge: how can computation and communication in modern ML/DL frameworks be co-designed with execution runtimes like CUDA and communication middleware like MVAPICH2 to enable scalable, high-performance, and efficient training of DNNs on large-scale HPC systems?
Thesis Canvas: pdf