Authors:
Abstract: Hierarchical low-rank approximation can reduce both the storage and computation costs of dense matrices, but its implementation is challenging. In this research, we tackle one of the most difficult problems of GPU parallelization of the factorization of these hierarchical matrices. To this end, we are developing a new runtime system for GPUs that can schedule all tasks into one GPU kernel. Other existing runtime systems, like cuGraph and Standford Legion, can only manage streams and kernel-level parallelism. Even without too much tuning, we achieved 4x better performance in H-LU factorization with a single GPU when comparing with a well-tuned CPU-based hierarchical matrix library, HLIBpro, on moderately sized matrices. Additionally, we have significantly less runtime overheads exposed when processing smaller matrices.
Best Poster Finalist (BP): no
Poster: PDF
Poster summary: PDF
Back to Poster Archive Listing