Spread-n-Share: Improving Application Performance and Cluster Throughput with Resource-Aware Job Placement
Clouds and Distributed Computing
Time: Tuesday, 19 November 2019, 1:30pm - 2pm
Description: Traditional batch job schedulers adopt the Compact-and-Exclusive (CE) strategy, packing the processes of a parallel job into as few compute nodes as possible. While CE minimizes inter-node network communication, it often leads to self-contention among the tasks of a resource-intensive application. Recent studies have used virtual containers to balance CPU utilization and memory capacity across physical nodes, but the unbalanced use of memory bandwidth and the shared last-level cache (LLC) remains under-investigated.
In this work, we propose Spread-n-Share (SNS), a batch scheduling strategy that automatically scales resource-bound applications out onto more nodes to alleviate their performance bottlenecks, and co-locates jobs in a resource-compatible manner. We implement Uberun, a prototype scheduler, to validate SNS, considering memory bandwidth and LLC capacity as two types of performance-critical shared resources. Experimental results show that SNS improves the overall system throughput by 19.8% on average over CE, while achieving an average individual job speedup of 1.8%.
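
The spread-and-share idea can be illustrated with a small sketch. This is not Uberun's implementation: the names `Job`, `Node`, `max_fit`, and `place` are hypothetical, per-process bandwidth and LLC demands are assumed to be known in advance, and a simple greedy first-fit stands in for whatever placement policy the actual scheduler uses. It shows only how capping per-node memory bandwidth and LLC usage pushes a bandwidth-bound job onto more nodes, while the cores left over remain available to a compatible job.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    procs: int           # number of processes in the job
    bw_per_proc: float   # assumed memory bandwidth demand per process (GB/s)
    llc_per_proc: float  # assumed LLC demand per process (MB)


@dataclass
class Node:
    cores: int   # free cores
    bw: float    # free memory bandwidth (GB/s)
    llc: float   # free LLC capacity (MB)


def max_fit(node, job):
    """Most processes of `job` that fit on `node` without oversubscribing
    cores, memory bandwidth, or LLC capacity."""
    fit = node.cores
    if job.bw_per_proc > 0:
        fit = min(fit, int(node.bw // job.bw_per_proc))
    if job.llc_per_proc > 0:
        fit = min(fit, int(node.llc // job.llc_per_proc))
    return fit


def place(job, nodes):
    """Greedy first-fit placement (a simplification, not the SNS policy).

    Returns {node_index: process_count}, or None if the job does not fit.
    Nodes are only updated when the whole job can be placed.
    """
    plan = {}
    remaining = job.procs
    for idx, node in enumerate(nodes):
        if remaining == 0:
            break
        fit = min(max_fit(node, job), remaining)
        if fit > 0:
            plan[idx] = fit
            remaining -= fit
    if remaining > 0:
        return None
    # Commit the plan by reserving resources on each chosen node.
    for idx, count in plan.items():
        nodes[idx].cores -= count
        nodes[idx].bw -= count * job.bw_per_proc
        nodes[idx].llc -= count * job.llc_per_proc
    return plan


if __name__ == "__main__":
    cluster = [Node(cores=16, bw=100, llc=32) for _ in range(3)]
    memory_bound = Job("A", procs=16, bw_per_proc=10, llc_per_proc=1)
    compute_bound = Job("B", procs=12, bw_per_proc=0, llc_per_proc=1)
    print(place(memory_bound, cluster))   # {0: 10, 1: 6}
    print(place(compute_bound, cluster))  # {0: 6, 1: 6}
```

In this toy setup, CE would pack all 16 processes of job A onto one node and demand 160 GB/s from a node assumed to provide 100 GB/s. The sketch instead spreads A across two nodes and lets the compute-bound job B use the cores that A leaves idle.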