Workshop: Toward Half-Precision Computation for Complex Matrices: A Case Study for Mixed Precision Solvers on GPUs
Abstract: Low-precision computations are popular in machine learning and artificial intelligence (AI) applications. Hardware architectures, such as high-end GPUs, now support native 16-bit floating-point arithmetic (i.e., half precision). While half precision naturally provides a 2x/4x speedup over single/double precision, modern GPUs are also equipped with hardware accelerators that deliver even higher FP16 performance. These accelerators, called tensor cores, have a theoretical peak performance that is 8x/16x higher than FP32/FP64 performance, respectively. Such a high level of performance has encouraged researchers to harness the compute power of tensor cores outside AI applications.
This paper presents a mixed-precision dense linear solver (Ax = b) for complex matrices using the tensor core units of the GPU. Unlike similar efforts that have discussed accelerating Ax = b using real FP16 arithmetic, this paper focuses on complex arithmetic. The developed solution uses a "half-complex" precision to accelerate the solution of Ax = b while maintaining single-complex precision accuracy. The proposed solver requires a matrix multiplication kernel that accepts half-complex inputs. We discuss two possible designs for such a kernel and integrate both into a mixed-precision LU factorization. The other component of our solution is an iterative refinement solver, which recovers single-complex accuracy using a preconditioned GMRES solver. Our experiments, conducted on a V100 GPU, show that the mixed-precision solver can be up to 2.5x faster than a full single-complex precision solver.
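The overall idea of factorizing in low precision and recovering accuracy through refinement can be illustrated with a minimal CPU-only sketch. It is not the solver described above: it runs in pure Python without tensor cores, emulates FP16 by rounding the real and imaginary parts with the standard `struct` format `e`, uses Python's double-precision `complex` as a stand-in for single-complex working precision, and substitutes classical iterative refinement for the preconditioned GMRES refinement. All function names (`to_half`, `to_half_complex`, `lu_half`, `lu_solve`, `solve_mixed`) and the test matrix are hypothetical and chosen only for illustration.

```python
import struct


def to_half(x: float) -> float:
    """Round a float to the nearest IEEE binary16 value (emulated on the CPU)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_half_complex(z: complex) -> complex:
    """Assumed 'half-complex' value: real and imaginary parts each rounded to FP16."""
    return complex(to_half(z.real), to_half(z.imag))


def lu_half(a):
    """LU factorization with partial pivoting; every stored value is rounded to half-complex."""
    n = len(a)
    lu = [[to_half_complex(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Pick the largest-magnitude pivot in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = to_half_complex(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = to_half_complex(lu[i][j] - lu[i][k] * lu[k][j])
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve (P^T L U) x = b using the low-precision factors, in working precision."""
    n = len(lu)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):  # forward substitution with unit-diagonal L
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def residual(a, x, b):
    """Residual r = b - A x, evaluated in working precision."""
    n = len(x)
    return [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]


def solve_mixed(a, b, tol=1e-12, max_iter=50):
    """Low-precision LU followed by classical iterative refinement (not GMRES)."""
    lu, perm = lu_half(a)
    x = lu_solve(lu, perm, b)
    b_norm = max(abs(v) for v in b)
    for iteration in range(max_iter):
        r = residual(a, x, b)
        if max(abs(v) for v in r) <= tol * b_norm:
            return x, iteration
        d = lu_solve(lu, perm, r)  # correction from the cheap low-precision factors
        x = [xi + di for xi, di in zip(x, d)]
    return x, max_iter


# Small, diagonally dominant complex test system (illustrative data only).
A = [
    [4 + 1j, 1 - 1j, 0.5j],
    [1 + 2j, 5 - 1j, 1 + 0j],
    [0.3 - 0.2j, 1 + 1j, 3 + 2j],
]
X_TRUE = [1 + 1j, 2 - 1j, -1 + 0.5j]
B = [sum(A[i][j] * X_TRUE[j] for j in range(3)) for i in range(3)]

X, STEPS = solve_mixed(A, B)
print("refinement steps:", STEPS)
print("max error:", max(abs(u - v) for u, v in zip(X, X_TRUE)))
```

The design point the sketch tries to convey is that the expensive O(n^3) factorization is done once in the low precision, while each refinement step only needs an O(n^2) residual and two triangular solves in the working precision. For ill-conditioned matrices, classical refinement of this kind may fail to converge, which is one reason a GMRES-based refinement preconditioned by the low-precision factors can be preferable.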