Presentation

Test of Time Award: From Dense Linear Algebra to GPU Occupancy

Session

Test of Time Award Presentation

Event Type

Awards Presentation

Test of Time

Time

Tuesday, 19 November 2019, 3:30pm - 4:15pm

Location

Mile High Ballroom

Description

The CUDA programming model was introduced in 2007 and featured a number of new concepts, such as occupancy and shared memory. In this work, we considered the performance implications of these concepts as applied to dense matrix factorizations. Our findings ran contrary to the widely accepted recommendations of the day. (i) We found a performance optimization pattern that leads to lower occupancy, whereas the recommendation was to maximize occupancy in order to hide memory latencies. (ii) We found that instruction-level parallelism contributes to latency hiding on GPUs, which was believed not to be the case. (iii) We found that performance can be improved by using massive register blocking, whereas the recommendation was to minimize register use in order to maximize occupancy. (iv) We found that shared memory is slower than registers and that registers should be favored over shared memory where possible. These novel insights led us to a design of the matrix multiply routine that substantially outperformed the state-of-the-art vendor BLAS library. The optimization pattern we pointed out is found today in many high-performance GPU codes.
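The register blocking idea in points (ii) to (iv) can be sketched on the CPU. The C++ below is an illustrative analog only, not the presenters' GPU kernel: the names `matmul_naive`, `matmul_register_blocked`, and the tile size `TILE = 4` are assumptions made for this example. Each pass over `k` loads `2 * TILE` values and performs `TILE * TILE` independent multiply-adds into a small accumulator tile that a compiler can keep in registers. On a GPU, a thread holding such a tile would use more registers, which lowers occupancy, while the independent accumulations supply instruction-level parallelism.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: C += A * B for row-major n x n matrices.
void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Hypothetical tile size chosen for illustration; not taken from the talk.
constexpr std::size_t TILE = 4;

// Register-blocked version: each (i, j) step computes a TILE x TILE block of C.
// For simplicity this sketch requires n to be a multiple of TILE.
void matmul_register_blocked(const std::vector<float>& A,
                             const std::vector<float>& B,
                             std::vector<float>& C, std::size_t n) {
    assert(n % TILE == 0);
    for (std::size_t i = 0; i < n; i += TILE) {
        for (std::size_t j = 0; j < n; j += TILE) {
            // Accumulator tile: small enough to live in registers.
            float acc[TILE][TILE] = {};
            for (std::size_t k = 0; k < n; ++k) {
                float a[TILE];
                float b[TILE];
                // Load one column slice of A and one row slice of B.
                for (std::size_t r = 0; r < TILE; ++r) a[r] = A[(i + r) * n + k];
                for (std::size_t c = 0; c < TILE; ++c) b[c] = B[k * n + j + c];
                // TILE * TILE independent multiply-adds; each loaded value
                // is reused TILE times without going back to memory.
                for (std::size_t r = 0; r < TILE; ++r)
                    for (std::size_t c = 0; c < TILE; ++c)
                        acc[r][c] += a[r] * b[c];
            }
            // Write the finished block back once.
            for (std::size_t r = 0; r < TILE; ++r)
                for (std::size_t c = 0; c < TILE; ++c)
                    C[(i + r) * n + (j + c)] += acc[r][c];
        }
    }
}
```

Both functions take row-major `std::vector<float>` matrices and accumulate into `C`, so the blocked version can be checked against the naive one on a small input before any tuning of `TILE`.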

Presenters

Vasily Volkov

Nvidia Corporation

James Demmel

University of California, Berkeley