Advisor: Dhabaleswar K. Panda (Ohio State University)
Abstract: Modern multi-petaflop HPC systems are powered by dense multi-/many-core architectures, and this trend is expected to continue in next-generation supercomputing systems. The rapid adoption of high core-density architectures by current- and next-generation HPC systems is further fueled by emerging application trends such as Deep Learning. This puts more pressure on middleware designers to optimize communication protocols to meet the diverse needs of applications. While innovations in processor architectures have increased on-chip parallelism, they also cause the traditional designs employed by communication runtimes such as MPI to suffer from higher intra-node communication latencies. Tackling the computation and communication challenges that accompany these dense multi-/many-core systems warrants special design considerations. The work proposed in this thesis addresses the performance challenges posed by a diverse range of applications and the lack of support in state-of-the-art communication libraries such as MPI for exploiting high-concurrency architectures. The author first proposes a "shared-address-spaces"-based communication substrate to drive intra-node communication in MPI. Atop this framework, the author redesigns various MPI primitives, including point-to-point communication protocols (e.g., user-space zero-copy rendezvous transfer), collective communication (e.g., load/store-based collectives, truly zero-copy and partitioning-based reduction algorithms), and efficient MPI derived datatype processing (e.g., memoization-based "packing-free" communication), to exploit the potential of emerging multi-/many-core architectures and high-throughput networks. The proposed designs demonstrate significant improvement over the state of the art for various scientific and deep learning applications.
Thesis Canvas: pdf