SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs

Authors: Tuowen Zhao (University of Utah), Protonu Basu (Facebook), Samuel Williams (Lawrence Berkeley National Laboratory), Mary Hall (University of Utah), Hans Johansen (Lawrence Berkeley National Laboratory)

Abstract: Stencil computations in real-world scientific applications may contain multiple interrelated stencils, have multiple input grids, and use higher-order discretizations with high arithmetic intensity and complex expression structures. In combination, these properties place immense demands on the memory hierarchy that limit performance. Blocking techniques such as tiling are used to exploit reuse in cache. Additional fine-grained data blocking can also reduce pressure on the TLB, hardware prefetchers, and cache.
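As a rough illustration of the tiling idea mentioned above, the following minimal sketch applies a 5-point averaging stencil to a 2D grid, first with plain loops and then with the iteration space split into small tiles. It is not taken from the paper: the function names, the 5-point stencil, and the `tile` parameter are assumptions chosen for illustration, and pure Python shows only the loop structure, not the performance effect.

```python
def stencil_naive(a):
    """5-point average over interior points; boundary values are copied."""
    n, m = len(a), len(a[0])
    out = [row[:] for row in a]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1])
    return out


def stencil_tiled(a, tile=4):
    """Same stencil, but the interior is visited tile by tile.

    Each tile touches a small working set, which is the property that
    lets a compiled implementation keep neighbouring values in cache.
    `tile` is a hypothetical tuning parameter.
    """
    n, m = len(a), len(a[0])
    out = [row[:] for row in a]
    for ii in range(1, n - 1, tile):          # loop over tile origins
        for jj in range(1, m - 1, tile):
            for i in range(ii, min(ii + tile, n - 1)):      # loop inside one tile
                for j in range(jj, min(jj + tile, m - 1)):
                    out[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1])
    return out
```

Both functions evaluate the same expression for every interior point, so tiling changes only the order in which points are visited, not the result.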

In this paper, we present a code generation approach designed to further improve tiled stencil performance by exploiting reuse within the block, increasing instruction-level parallelism, and exposing opportunities for the backend compiler to eliminate redundant computation. The approach also enables efficient vector code generation for CPUs and GPUs. For a wide range of complex stencil computations, we achieve substantial speedups over tiled baselines on Intel KNL, Intel Skylake-X, and NVIDIA P100 architectures.
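To make "eliminating redundant computation" concrete, here is a small sketch of one generic form of reuse in a stencil: a 3x3 box sum in which each vertical three-point partial sum is computed once per row and then shared by the three outputs that need it. This is an illustrative assumption, not the code generation scheme described in the paper; the function names and the choice of a box stencil are hypothetical.

```python
def box3x3_naive(a):
    """Sum of the 3x3 neighbourhood at every interior point (9 reads per point)."""
    n, m = len(a), len(a[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = sum(a[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1))
    return out


def box3x3_reuse(a):
    """Same box sum, sharing vertical partial sums between neighbouring outputs."""
    n, m = len(a), len(a[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        # One vertical 3-point sum per column, computed once for this row.
        col = [a[i - 1][j] + a[i][j] + a[i + 1][j] for j in range(m)]
        for j in range(1, m - 1):
            # Each output combines three already-computed column sums.
            out[i][j] = col[j - 1] + col[j] + col[j + 1]
    return out
```

The two versions add the same nine values in a different order, so for general floating-point input they can differ by rounding; for integer-valued inputs they agree exactly.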

