SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Understanding Congestion in High Performance Interconnection Networks Using Sampling

Authors: Philip A. Taffet (Rice University, Lawrence Livermore National Laboratory), John M. Mellor-Crummey (Rice University)

Abstract: Communication cost is an important factor for applications on clusters and supercomputers. To improve communication performance, developers need tools that enable them to understand how their application’s communication patterns interact with the network, especially when those interactions result in congestion. Since communication performance is difficult to reason about analytically and simulation is costly, measurement-based approaches are needed. This paper describes a new sampling-based technique to collect information about the path a packet takes and the congestion it encounters. We describe a variant of this scheme that requires only 5–6 bits of information in a monitored packet, making it practical for use in next-generation networks. Simulations of synthetic benchmarks, miniGhost, and pF3D show that this strategy provides precise application-centric quantitative information about traffic and congestion that can be used to distinguish between problems with an application’s communication patterns, its mapping onto a parallel system, and outside interference.

