SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

SparCML: High-Performance Sparse Communication for Machine Learning


Authors: Cedric Renggli (ETH Zurich), Saleh Ashkboos (Institute of Science and Technology Austria), Mehdi Aghagolzadeh (Microsoft Corporation), Dan Alistarh (Institute of Science and Technology Austria, Neural Magic), Torsten Hoefler (ETH Zurich)

Abstract: Applying machine learning techniques to the quickly growing data in science and industry requires highly scalable algorithms. Large datasets are most commonly processed by distributing them across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication, and thus scalability, bottleneck for most machine learning workloads. We observe that, frequently, many gradient values are (close to) zero, leading to sparse or sparsifyable communication. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse and quantized input data, in conjunction with efficient machine learning algorithms that can leverage these primitives. Our communication protocols generalize standard collective operations by allowing processes to contribute sparse input data vectors. Our library extends MPI to support features such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML provides the basis for future highly scalable machine learning frameworks.
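
To make the idea of a collective that accepts sparse inputs concrete, the following is a minimal sketch of an allgather-based sparse allreduce in C++ with MPI. It is an illustration of the general technique only, not SparCML's actual API or implementation: each rank packs its nonzero gradient entries as (index, value) pairs, all ranks exchange the pairs, and every rank accumulates them into a dense result. The vector length N and the toy nonzero pattern are made up for the example.

    // Sketch: sparse allreduce (sum) via allgather of nonzero (index, value) pairs.
    // Not SparCML's implementation; a hypothetical illustration of the idea.
    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int N = 16;                      // toy gradient length
        std::vector<float> grad(N, 0.0f);
        grad[rank % N] = 1.0f;                 // each rank contributes one nonzero

        // Pack local nonzeros as parallel index/value arrays.
        std::vector<int> idx;
        std::vector<float> val;
        for (int i = 0; i < N; ++i)
            if (grad[i] != 0.0f) { idx.push_back(i); val.push_back(grad[i]); }

        // Exchange the number of nonzeros contributed by each rank.
        int mycount = static_cast<int>(idx.size());
        std::vector<int> counts(nprocs), displs(nprocs, 0);
        MPI_Allgather(&mycount, 1, MPI_INT, counts.data(), 1, MPI_INT, MPI_COMM_WORLD);
        int total = 0;
        for (int p = 0; p < nprocs; ++p) { displs[p] = total; total += counts[p]; }

        // Gather all (index, value) pairs on every rank.
        std::vector<int> all_idx(total);
        std::vector<float> all_val(total);
        MPI_Allgatherv(idx.data(), mycount, MPI_INT,
                       all_idx.data(), counts.data(), displs.data(), MPI_INT,
                       MPI_COMM_WORLD);
        MPI_Allgatherv(val.data(), mycount, MPI_FLOAT,
                       all_val.data(), counts.data(), displs.data(), MPI_FLOAT,
                       MPI_COMM_WORLD);

        // Accumulate into a dense vector: the result equals a dense allreduce (sum),
        // but only nonzero entries crossed the network.
        std::vector<float> result(N, 0.0f);
        for (int k = 0; k < total; ++k) result[all_idx[k]] += all_val[k];

        if (rank == 0)
            std::printf("result[0..3] = %.1f %.1f %.1f %.1f\n",
                        result[0], result[1], result[2], result[3]);

        MPI_Finalize();
        return 0;
    }

With highly sparse gradients this exchanges far less data than a dense allreduce of length N, which is the trade-off the paper's protocols are designed around.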



