SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Poster 144: Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning


Authors: Tonmoy Dey (Florida State University), Kento Sato (RIKEN Center for Computational Science (R-CCS)), Jian Guo (RIKEN Center for Computational Science (R-CCS)), Bogdan Nicolae (Argonne National Laboratory), Jens Domke (RIKEN Center for Computational Science (R-CCS)), Weikuan Yu (Florida State University), Franck Cappello (Argonne National Laboratory), Kathryn Mohror (Lawrence Livermore National Laboratory)

Abstract: With the emergence of fast local storage, multi-level checkpointing (MLC) has become a common approach for efficient checkpointing. To use MLC efficiently, it is important to determine the optimal checkpoint/restart (CR) configuration. There are two main approaches to determining this configuration: modeling and simulation. However, MLC makes CR more complicated, so the modeling approach becomes inaccurate, and the simulation approach, though accurate, is very slow. In this poster, we focus on optimizing CR performance by predicting the optimal checkpoint count and interval. We achieve this by combining the simulation approach with machine learning and neural networks, leveraging the accuracy of simulation without spending time simulating different CR parameters. We demonstrate that our models predict the optimized parameter values with minimal error when compared to the simulation approach.
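
To make the "modeling approach" concrete, the following is a minimal sketch for the much simpler single-level case, not the poster's method. It assumes the classic first-order overhead model (checkpoint cost per interval plus expected rework after a failure) and Young's closed-form approximation of the optimal interval. The function names and the example values for checkpoint cost and mean time between failures (MTBF) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def expected_overhead(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time lost to checkpointing and rework.

    checkpoint_cost_s / interval_s : time spent writing checkpoints
    interval_s / (2 * mtbf_s)      : expected work redone after a failure
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def grid_search_interval(checkpoint_cost_s: float, mtbf_s: float, candidates) -> float:
    """Brute-force stand-in for a parameter sweep: evaluate every candidate interval."""
    return min(candidates, key=lambda t: expected_overhead(t, checkpoint_cost_s, mtbf_s))


if __name__ == "__main__":
    cost, mtbf = 50.0, 10_000.0  # hypothetical values, in seconds
    print("closed form:", young_interval(cost, mtbf))
    print("grid search:", grid_search_interval(cost, mtbf, range(100, 5001, 100)))
```

In this single-level setting the closed form and the sweep agree. With multiple checkpoint levels the parameter space grows, and a sweep of this kind would have to call a full simulator for each candidate, which is the cost the abstract proposes to avoid by predicting the parameters with a learned model.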

Best Poster Finalist (BP): no

Poster: PDF
Poster summary: PDF

