Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms Over PaRSEC
Time: Monday, 18 November 2019, 3:50pm - 4:10pm
Description: This paper introduces a generic and flexible matrix-matrix
multiplication algorithm $C = A \times B$ for state-of-the-art
computing platforms. Typically, these platforms are
distributed-memory machines whose nodes are equipped with several
accelerators (e.g., 6 GPUs per node for Summit). To
the best of our knowledge, SLATE is the only library
that provides a publicly available implementation on such platforms,
and it is currently limited to problem instances where the $C$
matrix can entirely fit in the memory of the GPU accelerators. Our
algorithm relies on the classical tile-based outer-product
algorithm, but enhances it with several control dependences to
increase data re-use and to optimize communication flow from/to the
accelerators within each node. The algorithm is written within
the PaRSEC runtime system, which allows for a fast and generic
implementation, while achieving close-to-peak performance for a
large variety of situations.
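
For readers unfamiliar with the classical tile-based outer-product algorithm mentioned above, the following is a minimal sequential sketch in plain Python. It is not the paper's PaRSEC implementation: it involves no GPUs, no distribution across nodes, and none of the control dependences the abstract describes. The function name `tiled_outer_product_gemm` and the `tile` parameter are hypothetical names chosen for illustration. The sketch only shows the loop ordering: the outermost loop walks over panels of the shared dimension, and each panel contributes one update to every tile of $C$.

```python
def tiled_outer_product_gemm(A, B, tile):
    """Compute C = A x B with a tile-based outer-product loop ordering.

    A is an m-by-k list of rows, B is a k-by-n list of rows.
    `tile` is the (hypothetical) tile edge length; edge tiles may be smaller.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    if not A or not B or len(A[0]) != len(B):
        raise ValueError("incompatible or empty operands")

    m, k_dim, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]

    # Outer-product ordering: the loop over the shared dimension is outermost,
    # so each panel (A[:, k0:k1], B[k0:k1, :]) updates every tile of C once.
    for k0 in range(0, k_dim, tile):
        k1 = min(k0 + tile, k_dim)
        for i0 in range(0, m, tile):
            i1 = min(i0 + tile, m)
            for j0 in range(0, n, tile):
                j1 = min(j0 + tile, n)
                # Tile update: C[i0:i1, j0:j1] += A[i0:i1, k0:k1] * B[k0:k1, j0:j1]
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        acc = 0.0
                        for k in range(k0, k1):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(tiled_outer_product_gemm(A, B, tile=2))
```

In this ordering every tile of $C$ is revisited once per panel, which is consistent with the abstract's emphasis on data re-use and on whether $C$ fits in accelerator memory.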
