Workshop: A Hardware Prefetching Mechanism for Vector Gather Instructions
Abstract: Vector gather instructions handle indirect memory accesses in vector processing. Because indirect memory accesses usually exhibit irregular access patterns, they have lower spatial and temporal locality than regular accesses. As a result, an application with many vector gather instructions suffers from long memory access latencies, which significantly degrade vector processing performance.
This paper proposes a hardware prefetching mechanism that hides the latencies of indirect memory accesses. The mechanism prefetches cacheable index data before a vector gather instruction executes, and uses that index data to predict the addresses of the memory requests the instruction will issue. It then prefetches the data at the predicted addresses, reducing the memory access latency of the vector gather instruction. This paper also discusses how many cache blocks should be loaded per prediction for a single vector gather instruction by varying two prefetching parameters, distance and degree. In the evaluation, the performance of a simple kernel is examined with two types of index data: sequential and random. The results show that the prefetching mechanism improves the performance of the sequential-indexed and random-indexed kernels by 2.2x and 1.2x, respectively.
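To make the address prediction step concrete, the following is a minimal software sketch of how predicted prefetch addresses could be derived from index data that is already available. It is not the hardware design described in this paper. The function name, the parameter names, the 128-byte block size, and the interpretation of distance (how many gather instructions ahead to predict) and degree (the maximum number of distinct cache blocks requested per prediction) are all assumptions made for illustration.

```python
def predict_prefetch_blocks(index, base, elem_size, vlen,
                            current_iter, distance, degree, block_size=128):
    """Return block-aligned addresses to prefetch for a future gather.

    Hypothetical model: each gather consumes `vlen` consecutive entries of
    `index`; the gather `distance` iterations ahead is the prefetch target,
    and at most `degree` distinct cache blocks are requested for it.
    """
    start = (current_iter + distance) * vlen
    future_indices = index[start:start + vlen]

    blocks = []
    seen = set()
    for i in future_indices:
        addr = base + i * elem_size          # address the gather would access
        block = addr - (addr % block_size)   # align down to the cache block
        if block not in seen:
            seen.add(block)
            blocks.append(block)
            if len(blocks) == degree:        # stop once the degree limit is hit
                break
    return blocks


# Sequential indices: 16 eight-byte elements fall in a single 128-byte block.
sequential = list(range(64))
seq_blocks = predict_prefetch_blocks(sequential, base=0, elem_size=8, vlen=16,
                                     current_iter=0, distance=1, degree=4)

# Irregular indices: the same number of elements touches several blocks.
irregular = [0, 0, 0, 0, 40, 3, 40, 17]
irr_blocks = predict_prefetch_blocks(irregular, base=0, elem_size=8, vlen=4,
                                     current_iter=0, distance=1, degree=2)
```

In this sketch a sequentially indexed gather maps to very few cache blocks, so a small degree covers it, whereas an irregularly indexed gather spreads over more blocks than the degree limit allows. This is one plausible way to see why the choice of distance and degree matters differently for the two kinds of index data.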