SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Replication Is More Efficient Than You Think


Authors: Anne Benoit (ENS Lyon), Thomas Herault (University of Tennessee), Valentin Le Fèvre (ENS Lyon), Yves Robert (ENS Lyon, University of Tennessee)

Abstract: This paper revisits replication coupled with checkpointing for fail-stop errors. Previously published works use replication with the no-restart strategy: (i) compute the application Mean Time To Interruption (MTTI) M as a function of the number of processor pairs and the individual processor MTBF (Mean Time Between Failures); (ii) use the checkpointing period P = sqrt(2 M C), where C is the checkpoint duration; and (iii) never restart failed processors until the application crashes. We introduce the restart strategy, where failed processors are restarted after each checkpoint. This may introduce additional overhead, but it prevents the application configuration from degrading over successive checkpointing periods. We show how to compute the optimal checkpointing period P' for this restart strategy, and prove that its length is an order of magnitude longer than P. Furthermore, we show through simulations that using P' and the restart strategy significantly decreases the overhead induced by replication.
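
As a rough illustration of step (ii), the period P = sqrt(2 M C) can be evaluated directly once M and C are known. The sketch below is not taken from the paper: the function names and the sample values (M = 7200 s, C = 100 s) are assumptions chosen for illustration, M is treated as a given input rather than derived from the number of processor pairs, and the overhead estimate uses the classical first-order model C/P + P/(2M) for the no-restart setting only. It does not compute the restart-strategy period P'.

```python
import math


def no_restart_period(mtti: float, checkpoint_cost: float) -> float:
    """Checkpointing period P = sqrt(2 * M * C) from step (ii).

    mtti            -- application Mean Time To Interruption M (seconds), assumed given
    checkpoint_cost -- checkpoint duration C (seconds)
    """
    if mtti <= 0 or checkpoint_cost <= 0:
        raise ValueError("M and C must be positive")
    return math.sqrt(2.0 * mtti * checkpoint_cost)


def first_order_overhead(period: float, mtti: float, checkpoint_cost: float) -> float:
    """Classical first-order overhead estimate C/P + P/(2M).

    This is an assumed textbook approximation, valid only when P is
    small compared with M; it is not the paper's restart analysis.
    """
    return checkpoint_cost / period + period / (2.0 * mtti)


if __name__ == "__main__":
    M = 7200.0  # hypothetical MTTI: 2 hours
    C = 100.0   # hypothetical checkpoint duration: 100 seconds
    P = no_restart_period(M, C)
    print(f"P = {P:.1f} s")
    print(f"first-order overhead = {first_order_overhead(P, M, C):.4f}")
```

With these hypothetical values, P is 1200 s and the first-order overhead is about 0.167. Under this approximation, the period grows only with the square root of M, which is why a different scaling for P' can change the checkpointing frequency substantially.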



