Test of Time: From Dense Linear Algebra to GPU Occupancy
Time: Tuesday, 19 November 2019, 3:30pm - 4:15pm
Description: The CUDA programming model was introduced in 2007 and featured a number of new concepts, such as occupancy and shared memory. In this work, we considered the performance implications of these concepts for dense matrix factorizations. Our findings ran contrary to the widely accepted recommendations of the day. (i) We found a performance optimization pattern that leads to lower occupancy, whereas the recommendation was to maximize occupancy in order to hide memory latencies. (ii) We found that instruction-level parallelism contributes to latency hiding on GPUs, which was believed not to be the case. (iii) We found that performance can be improved by using massive register blocking, whereas the recommendation was to minimize register use in order to maximize occupancy. (iv) We found that shared memory is slower than registers, so registers should be favored over shared memory where possible. These novel insights led us to design a matrix multiply routine that substantially outperformed the state-of-the-art vendor BLAS library. The optimization pattern we pointed out is found today in many high-performance GPU codes.
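The trade-off between findings (i) to (iii) can be illustrated with a little arithmetic. The sketch below is not taken from the work itself: the class HypotheticalSM, its limits (65536 registers, 64 resident warps, 32 threads per warp), and the function names are assumptions chosen for illustration, and the model ignores shared memory, block-size limits, and register allocation granularity. It only shows that using more registers per thread lowers occupancy, while the number of independent instructions available to hide latency can stay the same if each thread exposes more instruction-level parallelism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HypotheticalSM:
    """Illustrative per-multiprocessor limits; NOT the figures of any specific GPU."""
    registers: int = 65536   # assumed size of the register file (32-bit registers)
    max_warps: int = 64      # assumed upper bound on resident warps
    warp_size: int = 32      # threads per warp


def resident_warps(sm, regs_per_thread):
    """Warps that fit when the register file is the only limiting resource."""
    by_registers = sm.registers // (regs_per_thread * sm.warp_size)
    return min(sm.max_warps, by_registers)


def occupancy(sm, regs_per_thread):
    """Fraction of the maximum resident warps that is actually resident."""
    return resident_warps(sm, regs_per_thread) / sm.max_warps


def independent_ops_in_flight(sm, regs_per_thread, ilp):
    """Independent instructions available to cover latency: warps x per-thread ILP."""
    return resident_warps(sm, regs_per_thread) * ilp


if __name__ == "__main__":
    sm = HypotheticalSM()
    # (registers per thread, independent instructions per thread)
    for regs, ilp in [(32, 1), (64, 2), (128, 4)]:
        print(f"regs/thread={regs:3d}  ilp={ilp}  "
              f"occupancy={occupancy(sm, regs):.2f}  "
              f"independent ops={independent_ops_in_flight(sm, regs, ilp)}")
```

Under these assumed limits, quadrupling the registers per thread cuts occupancy from 1.00 to 0.25, yet the count of independent operations in flight is unchanged when each thread carries four independent instructions instead of one. Whether that translates into higher performance on real hardware depends on details this simplified model leaves out.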