SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Self-Stabilizing Connected Components


Workshop: Self-Stabilizing Connected Components

Abstract: This paper considers the design of connected-components algorithms that are resilient to transient hardware faults such as bit flips. Specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, starting from any state, valid or invalid, it is guaranteed to reach a valid state after a finite number of steps. Consequently, on a machine subject to transient faults, a self-stabilizing algorithm can recover when a fault drives the system into an invalid state.

We give a comprehensive analysis of the valid and invalid states that arise during label propagation and derive algorithms to verify the state and correct it when it is invalid. The self-stabilizing label propagation algorithm performs O(V log V) additional computation and requires O(V) additional storage over its conventional counterpart and, as such, does not increase the asymptotic complexity of conventional label propagation.
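To make the notion of valid and invalid states concrete, the following is a minimal Python sketch of conventional minimum-label propagation. It is not the paper's self-stabilizing algorithm. The function names `propagate` and `same_partition`, the example graph, and the injected fault values are all assumptions made for illustration. The sketch shows that plain label propagation happens to recover from a fault that raises a label, but not from one that lowers it, which is why a separate verification and correction step is needed.

```python
def propagate(adj, labels):
    """Run min-label propagation from the given labels until nothing changes."""
    labels = list(labels)
    changed = True
    while changed:
        changed = False
        for v, neighbours in enumerate(adj):
            # Each vertex adopts the smallest label among itself and its neighbours.
            best = min([labels[v]] + [labels[u] for u in neighbours])
            if best != labels[v]:
                labels[v] = best
                changed = True
    return labels


def same_partition(labels, reference):
    """True if two labelings group the vertices into the same components."""
    pairs = set(zip(labels, reference))
    return len(pairs) == len(set(labels)) == len(set(reference))


# Hypothetical graph with two components: {0, 1, 2} and {3, 4}.
adj = [[1], [0, 2], [1], [4], [3]]
clean = propagate(adj, range(len(adj)))   # fault-free run

too_high = list(clean)
too_high[2] = 7                           # simulated fault raises a label
too_low = list(clean)
too_low[4] = 0                            # simulated fault lowers a label

recovered = propagate(adj, too_high)      # converges back to the clean labels
merged = propagate(adj, too_low)          # wrongly merges the two components

print(same_partition(recovered, clean))
print(same_partition(merged, clean))
```

In this sketch the lowered label is indistinguishable, from a purely local view, from a legitimately small label, so the erroneous value spreads through the second component. Detecting that kind of state is the part that requires the additional bookkeeping the abstract refers to.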

When run against a battery of simulated fault-injection tests, the self-stabilizing label propagation algorithm exhibits more resilient behavior than a fault-tolerant algorithm based on triple modular redundancy (TMR) in 80% of cases. It also outperforms TMR, as it requires fewer iterations in total. Beyond their fault-tolerance properties, we believe self-stabilizing label propagation algorithms are useful from a theoretical perspective and may have other use cases.
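For readers unfamiliar with triple modular redundancy, the sketch below shows the generic idea in Python: the same computation is carried out three times and the results are combined by majority vote. The abstract does not say how the TMR baseline was implemented, so the function name `tmr_vote`, the per-vertex voting granularity, and the sample label arrays are assumptions for illustration only.

```python
from collections import Counter


def tmr_vote(replicas):
    """Per-vertex majority vote over three replica label arrays."""
    a, b, c = replicas
    voted = []
    for x, y, z in zip(a, b, c):
        value, count = Counter((x, y, z)).most_common(1)[0]
        if count < 2:
            # All three replicas disagree, so no majority exists.
            raise ValueError("no majority: all three replicas disagree")
        voted.append(value)
    return voted


# Hypothetical replicas; the second one carries a single corrupted label.
replicas = [
    [0, 0, 0, 3, 3],
    [0, 0, 5, 3, 3],
    [0, 0, 0, 3, 3],
]
voted = tmr_vote(replicas)
print(voted)
```

Voting masks a fault that corrupts one replica, but it triples the work, which is consistent with the abstract's remark that the self-stabilizing approach needs fewer iterations in total.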






Part of: Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)

