Authors:
Abstract: In this poster, we present the strategy, progress, and performance while GPU porting one of the major modules, epsilon, of the electronic structure code BerkeleyGW. Epsilon represents the most time-consuming routines in the BerkeleyGW workflow for large-scale material science simulations. Some of the porting/optimization strategies include, changing our original data layout to efficiently use libraries such as cuBLAS and cuFFT, implementation of specific CUDA kernels to minimize data copies between host/device and keeping data on device, efficient use of data streams to leverage high concurrency on the device, asynchronous memory copies and overlapping (MPI) communication on the host and computation on the device. Preliminary results are presented in terms of the speedup compare to the CPU-only implementation, strong/weak scaling, and power efficiency. Excellent acceleration is demonstrated: up to 30x for specific kernels. Our port also exhibits good scalability and about 16x higher FLOPs/watt efficiency compared to the CPU-only implementation.
Best Poster Finalist (BP): no
Poster: PDF
Poster summary: PDF
Back to Poster Archive Listing