Abstract: Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed when distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication and thus scalability bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse of sparsifyable communication. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse and quantized input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations, by allowing processes to contribute sparse input data vectors. Our library extends MPI to support features such as non-blocking (asynchronous) operations and low-precision data representation. As such, SparCML provides the basis for future highly-scalable machine learning frameworks.
Back to Technical Papers Archive Listing