Poster 144: Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning
Time: Thursday, 21 November 2019, 8:30am - 5pm
Description: With the emergence of fast local storage, multi-level checkpointing (MLC) has become a common approach to efficient checkpointing. To use MLC efficiently, it is important to determine the optimal checkpoint/restart (CR) configuration. There are two main approaches to determining this configuration: modeling and simulation. With MLC, however, CR becomes more complicated, which makes the modeling approach inaccurate and the simulation approach, though accurate, very slow. In this poster, we focus on optimizing CR performance by predicting the optimal checkpoint count and interval. We achieve this by combining the simulation approach with machine learning and neural networks, retaining the accuracy of simulation without spending time simulating many different CR parameters. We demonstrate that our models predict the optimized parameter values with minimal error compared to the simulation approach.
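
As a point of reference for the modeling approach mentioned above, the single-level case is often illustrated with Young's first-order approximation, which estimates the compute time between checkpoints as sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint. The sketch below is not the poster's method and does not cover MLC; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints.

    Assumes a single checkpoint level and a checkpoint cost that is small
    relative to the mean time between failures (MTBF).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(total_compute_s: float, interval_s: float) -> int:
    """Number of checkpoint intervals needed to cover the whole run."""
    return max(1, math.ceil(total_compute_s / interval_s))


# Hypothetical values: 50 s per checkpoint, 10,000 s MTBF, 10,000 s of compute.
interval = young_interval(checkpoint_cost_s=50.0, mtbf_s=10_000.0)
count = checkpoint_count(total_compute_s=10_000.0, interval_s=interval)
print(f"interval = {interval:.0f} s, checkpoints = {count}")
```

A closed-form estimate like this is cheap to evaluate, which is why modeling is attractive; the abstract's point is that such formulas lose accuracy once several checkpoint levels interact.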