SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

An Evaluation of the CORAL Interconnects

Authors: Christopher Zimmer (Oak Ridge National Laboratory), Scott Atchley (Oak Ridge National Laboratory), Ramesh Pankajakshan (Lawrence Livermore National Laboratory), Brian E. Smith (Oak Ridge National Laboratory), Ian Karlin (Lawrence Livermore National Laboratory), Matt Leininger (Lawrence Livermore National Laboratory), Adam Bertsch (Lawrence Livermore National Laboratory), Brian S. Ryujin (Lawrence Livermore National Laboratory), Jason Burmark (Lawrence Livermore National Laboratory), André Walker-Loud (Lawrence Berkeley National Laboratory), M. A. Clark (Nvidia Corporation), Olga Pearce (Lawrence Livermore National Laboratory)

Abstract: In 2018, the Department of Energy deployed the Summit and Sierra supercomputers, both employing the latest interconnect technology. In this paper, we provide an in-depth assessment of the systems' interconnects, which are based on Enhanced Data Rate (EDR) 100 Gb/s Mellanox InfiniBand. Both systems use second-generation EDR Host Channel Adapters (HCAs) and switches, which add several new features such as Adaptive Routing (AR), switch-based collectives, HCA-based tag matching, and NVMe-over-Fabrics offload. Although based on the same components, Summit's network is "non-blocking" (i.e., fully provisioned), while Sierra's network has a 2:1 taper. We evaluate the two systems' interconnects using traditional communication benchmarks as well as real applications. We find that the new Adaptive Routing dramatically improves performance, but the other new features still need improvement.
