SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Abstract: Frontera is the largest NSF-funded cluster in the US and comprises 8,008 nodes equipped with the latest Intel Xeon processors (Cascade Lake). In this paper, we explore the potential of Frontera for training state-of-the-art Deep Learning (DL) models at scale. Most DL studies present performance data from large-scale clusters equipped with NVIDIA GPUs. However, our earlier performance characterization studies have helped us achieve comparable performance with CPU-only clusters as well. Based on this, we configure three important DL frameworks, 1) TensorFlow, 2) PyTorch, and 3) MXNet, using Horovod and two Message Passing Interface (MPI) libraries on Frontera: 1) MVAPICH2 and 2) Intel MPI. We provide a systematic performance comparison for TensorFlow using MVAPICH2 and Intel MPI on 2,048 Frontera nodes. Using a configuration of four processes per node, we observe near-linear scaling for ResNet-50 training with TensorFlow up to 8,192 MPI processes (on 2,048 nodes), offering a sustained performance of 250,000 images/second. In addition, we provide insights into processes-per-node and batch-size configurations for TensorFlow as well as for PyTorch and MXNet. Based on single-node performance behavior, we scale all three DL frameworks up to 1,024 processes (256 nodes) for various models such as ResNet-50/101/152 and Inception-v3/v4.
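As a rough sanity check on the headline figures, the following sketch recomputes the process count and the average throughput implied by the numbers quoted in the abstract. The per-process and per-node values are simple averages derived here, not measurements reported by the authors, and all variable and function names are illustrative.

```python
# Figures quoted in the abstract (ResNet-50, TensorFlow, largest run).
NODES = 2048
PROCESSES_PER_NODE = 4
SUSTAINED_IMAGES_PER_SEC = 250_000


def total_processes(nodes: int, ppn: int) -> int:
    """Total MPI processes for a given node count and processes per node."""
    return nodes * ppn


def average_throughput(total_images_per_sec: float, units: int) -> float:
    """Average images/second per unit (process or node); a derived estimate."""
    return total_images_per_sec / units


procs = total_processes(NODES, PROCESSES_PER_NODE)
per_process = average_throughput(SUSTAINED_IMAGES_PER_SEC, procs)
per_node = average_throughput(SUSTAINED_IMAGES_PER_SEC, NODES)

print(f"MPI processes:            {procs}")
print(f"Avg images/s per process: {per_process:.1f}")
print(f"Avg images/s per node:    {per_node:.1f}")
```

Under these assumptions, the 8,192-process run works out to roughly 30 images/second per process, or about 122 images/second per node, on average.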





