SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Poster 104: An Adaptive Checkpoint Model For Large-Scale HPC Systems

Authors: Subhendu S. Behera (North Carolina State University), Lipeng Wan (Oak Ridge National Laboratory), Frank Mueller (North Carolina State University), Matthew Wolf (Oak Ridge National Laboratory), Scott Klasky (Oak Ridge National Laboratory)

Abstract: Checkpoint/restart is a widely used fault tolerance technique for application resilience. However, failures and the overhead of saving application state for later recovery significantly reduce application efficiency. This work contributes a failure analysis and prediction model that informs decisions about checkpoint data placement and recovery, as well as techniques for reducing checkpoint frequency. We also demonstrate a reduction in application overhead by taking proactive measures guided by failure prediction.

Best Poster Finalist (BP): no
