An Instruction Roofline Model for GPUs

SC19 Proceedings

Workshop: An Instruction Roofline Model for GPUs

Abstract: The Roofline performance model provides an intuitive way to identify performance bottlenecks and guide performance optimization. However, the classic FLOP-centric formulation is ill-suited to emerging applications that perform more integer operations than floating-point operations. In this paper, we propose an Instruction Roofline Model for NVIDIA GPUs. The Instruction Roofline combines instructions and memory transactions across all levels of the memory hierarchy and provides more performance insight than the FLOP-oriented Roofline Model, namely into instruction throughput, strided memory access patterns, bank conflicts, and thread predication. We use our Instruction Roofline methodology to analyze five proxy applications: HPGMG from AMReX, BatchSW from merAligner, Matrix Transpose benchmarks, cudaTensorCoreGemm, and cuBLAS. We demonstrate that the methodology exposes various aspects of performance and performance bottlenecks on NVIDIA GPUs and motivates code optimizations.
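As an illustration of the roofline idea the abstract refers to, the following is a minimal sketch of the generic roofline bound, in which attainable throughput is the smaller of a compute ceiling and intensity times a bandwidth ceiling. It assumes instruction-based units (instructions per memory transaction, giga-instructions per second, giga-transactions per second) in the spirit of the abstract. The function names and the ceiling values are hypothetical placeholders, not figures or definitions taken from the paper.

```python
def roofline_bound(intensity, peak_throughput, peak_bandwidth):
    """Attainable throughput under a generic roofline bound.

    intensity       -- work per unit of data movement
                       (assumed here: instructions per memory transaction)
    peak_throughput -- compute ceiling (assumed: giga-instructions per second)
    peak_bandwidth  -- memory ceiling (assumed: giga-transactions per second)
    """
    if intensity < 0 or peak_throughput <= 0 or peak_bandwidth <= 0:
        raise ValueError("intensity must be >= 0 and both ceilings must be > 0")
    return min(peak_throughput, intensity * peak_bandwidth)


def ridge_point(peak_throughput, peak_bandwidth):
    """Intensity at which the memory and compute ceilings meet."""
    return peak_throughput / peak_bandwidth


# Hypothetical ceilings chosen only for illustration, not measured values.
PEAK_GIPS = 400.0
PEAK_GTXNS = 25.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GIPS, PEAK_GTXNS)
    for intensity in (0.5, 4.0, 16.0, 64.0):
        bound = roofline_bound(intensity, PEAK_GIPS, PEAK_GTXNS)
        regime = "memory-bound" if intensity < ridge else "compute-bound"
        print(f"intensity={intensity:6.1f}  bound={bound:7.1f}  {regime}")
```

A kernel whose intensity falls below the ridge point is limited by the memory ceiling in this sketch, while one above it is limited by the compute ceiling. Which ceilings the paper actually uses for each memory level is not stated in the abstract.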

Venue: The 10th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems (PMBS19)
