Presentation
Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs
Session
GPU
Event Type
Paper
TP
Algorithms
Compiler Analysis and Optimization
Data Management
GPUs
Memory
Performance
Time
Wednesday, 20 November 2019, 4pm - 4:30pm
Location
405-406-407
Description
Stencil computations in real-world scientific applications may contain multiple interrelated stencils, have multiple input grids, and use higher order discretizations with high arithmetic intensity and complex expression structures. In combination, these properties place immense demands on the memory hierarchy that limit performance. Blocking techniques like tiling are used to exploit reuse in cache. Additional fine-grain data blocking can also reduce TLB, hardware prefetch, and cache pressure.
In this paper, we present a code generation approach designed to further improve tiled stencil performance by exploiting reuse within the block, increasing instruction-level parallelism, and exposing opportunities for the backend compiler to eliminate redundant computation. It also enables efficient vector code generation for CPUs and GPUs. For a wide range of complex stencil computations, we are able to achieve substantial speedups over tiled baselines for Intel KNL and Skylake-X and Nvidia P100 architectures.
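For readers unfamiliar with the tiled baseline that the description refers to, the following is a minimal sketch of loop tiling applied to a simple 2D 5-point averaging stencil. It is not the paper's code generator or any of its stencils: the grid size `N`, the tile size `TILE`, the coefficient 0.2, and the function names `sweep_naive` and `sweep_tiled` are all assumptions chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical sizes chosen for illustration only. */
#define N 64     /* grid edge length, including a one-cell boundary */
#define TILE 16  /* tile edge length */

/* Untiled reference: one 5-point averaging sweep over the interior. */
static void sweep_naive(const double *in, double *out) {
    for (size_t i = 1; i < N - 1; ++i)
        for (size_t j = 1; j < N - 1; ++j)
            out[i * N + j] = 0.2 * (in[i * N + j]
                                    + in[(i - 1) * N + j] + in[(i + 1) * N + j]
                                    + in[i * N + j - 1] + in[i * N + j + 1]);
}

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled sweep: the interior is visited in TILE x TILE blocks so that the
 * rows touched by neighbouring points are more likely to stay in cache.
 * Each point is computed exactly as in sweep_naive. */
static void sweep_tiled(const double *in, double *out) {
    for (size_t ii = 1; ii < N - 1; ii += TILE)
        for (size_t jj = 1; jj < N - 1; jj += TILE) {
            size_t i_end = min_sz(ii + TILE, N - 1);
            size_t j_end = min_sz(jj + TILE, N - 1);
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    out[i * N + j] = 0.2 * (in[i * N + j]
                                            + in[(i - 1) * N + j] + in[(i + 1) * N + j]
                                            + in[i * N + j - 1] + in[i * N + j + 1]);
        }
}
```

Both functions compute the same values; only the traversal order differs. On a grid this small the tiling is unlikely to show a measurable benefit, so the sketch illustrates the loop structure rather than the speedups reported for the paper's approach.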