Authors:
Abstract: With the rise of deep learning, clusters for deep learning training are widely deployed in production. However, static task configuration and resource fragmentation in existing clusters result in low efficiency and poor quality of service. We propose ETL, an elastic training layer for deep learning, to address both problems once and for all. ETL adopts several novel mechanisms, such as a lightweight, configurable report primitive and asynchronous, parallel, IO-free state replication, to achieve both high elasticity and high efficiency. Our evaluation demonstrates the low overhead and high efficiency of these mechanisms and reveals the advantages of elastic deep learning supported by ETL.
Best Poster Finalist (BP): no