Abstract: CPU-GPU hybrid computing is becoming a common configuration for supercomputers. The wide deployment of GPUs (and other hardware accelerators) raises a big question for the HPC community: are we using them effectively? Inappropriate use of GPUs can produce incorrect results in certain cases but, more often, slows the program down instead of speeding it up. This paper describes a tool that lets programmers analyze the runtime performance of kernels and obtain insights for better GPU utilization. Compared to existing GPU performance tools, ours provides unique features: data-centric profiling and generation of complete GPU call stacks. Guided by the tool, we improved the kernel performance of three widely studied GPU benchmarks by up to 46.6x with minor code modifications.