Authors:
Abstract: Checkpoint/restart is a widely used fault tolerance technique for application resilience. However, failures and the overhead of saving application state for later recovery significantly reduce application efficiency. This work contributes a failure analysis and prediction model that informs decisions on checkpoint data placement and recovery, together with techniques for reducing checkpoint frequency. We also demonstrate a reduction in application overhead by taking proactive measures guided by failure prediction.
Best Poster Finalist (BP): no
Poster: PDF
Poster summary: PDF