Paper: Replication Is More Efficient Than You Think
Event Type
Paper
Registration Categories
TP
Tags
Extreme Scale Computing
Reliability
Resiliency
Storage
Workflows
Time
Thursday, 21 November 2019, 4:30pm - 5pm
Description
This paper revisits replication coupled with checkpointing for fail-stop errors. Previously published works use replication with the no-restart strategy: (i) compute the application Mean Time To Interruption (MTTI) M as a function of the number of processor pairs and the individual processor MTBF (Mean Time Between Failures); (ii) use the checkpointing period P = \sqrt{2MC}, where C is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy, where failed processors are restarted after each checkpoint, which may introduce additional overhead but prevents the application configuration from degrading over successive checkpointing periods. We show how to compute the optimal checkpointing period P' for this restart strategy, and prove that its length is an order of magnitude higher than P. Furthermore, we show through simulations that using P' and the restart strategy significantly decreases the overhead induced by replication.
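The three steps of the no-restart strategy can be illustrated with a minimal sketch. It is not the paper's method: the function names are hypothetical, processor failures are assumed to be independent and exponentially distributed, and M is estimated by Monte Carlo simulation rather than by the closed-form expression the paper relies on. The restart period P' is not reproduced, since the abstract does not state its formula.

```python
import math
import random


def young_daly_period(mtti, checkpoint_cost):
    """Step (ii): first-order checkpointing period P = sqrt(2 * M * C)."""
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtti and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


def estimate_mtti_no_restart(num_pairs, proc_mtbf, trials=20000, seed=0):
    """Step (i), estimated by simulation (an assumption of this sketch).

    Each processor fails after an exponentially distributed time with mean
    proc_mtbf. Under no-restart (step iii), a pair dies when BOTH of its
    processors have failed, and the application is interrupted when the
    first pair dies.
    """
    rng = random.Random(seed)
    rate = 1.0 / proc_mtbf
    total = 0.0
    for _ in range(trials):
        # Time at which each pair loses its second processor; the
        # application crashes at the earliest such time.
        total += min(
            max(rng.expovariate(rate), rng.expovariate(rate))
            for _ in range(num_pairs)
        )
    return total / trials


if __name__ == "__main__":
    # Illustrative values only: 100 pairs, 5-year processor MTBF,
    # 60-second checkpoints.
    mtbf_seconds = 5 * 365 * 24 * 3600
    m = estimate_mtti_no_restart(num_pairs=100, proc_mtbf=mtbf_seconds, trials=2000)
    p = young_daly_period(m, checkpoint_cost=60.0)
    print(f"estimated M = {m:.0f} s, period P = {p:.0f} s")
```

With a single pair, the estimate should approach 1.5 times the processor MTBF, the expected maximum of two independent exponential lifetimes, which gives a quick sanity check of the simulation.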