Abstract: Recent advances in processor technologies have led to highly multi-threaded and dense multi- and many-core HPC systems. The adoption of such dense multi-core processors is widespread in the Top500 systems. Message Passing Interface (MPI) has been widely used to scale out scientific applications. The communication designs for intra-node communication in MPI are mainly based on shared memory communication. The increased core-density of modern processors warrants the use of efficient shared memory communication designs to achieve optimal performance. While there have been various algorithms and data-structures proposed for the producer-consumer like scenarios in the literature, there is a need to revisit them in the context of MPI communication on modern architectures to find the optimal solutions that work best for modern architectures.
In this paper, we first propose a set of low-level benchmarks to evaluate various data-structures such as Lamport queues, Fast-Forward queues, and Fastboxes (FB) for shared memory communication. Then, we bring these designs into the MVAPICH2 MPI library and measure their impact on the MPI intra-node communication for a wide variety of communication patterns. The benchmarking results are carried out on modern multi-/many-core architectures including Intel Xeon CascadeLake and Intel Knights Landing.