Poster 104: An Adaptive Checkpoint Model For Large-Scale HPC Systems
Event Type: Posters, Research Posters
Registration Categories: TP, EX, EXH
Time: Thursday, 21 November 2019, 8:30am - 5pm
Location: E Concourse
Description: Checkpoint/Restart is a widely used fault tolerance technique for application resilience. However, failures and the overhead of saving application state for future recovery significantly reduce application efficiency. This work contributes a failure analysis and prediction model that guides decisions on checkpoint data placement and recovery, along with techniques for reducing checkpoint frequency. We also demonstrate a reduction in application overhead by taking proactive measures guided by failure prediction.