SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

FT-iSort: Efficient Fault Tolerance for Introsort

Authors: Sihuan Li (University of California, Riverside), Hongbo Li (University of California, Riverside), Xin Liang (University of California, Riverside), Jieyang Chen (Oak Ridge National Laboratory), Elisabeth Giem (University of California, Riverside), Kaiming Ouyang (University of California, Riverside), Kai Zhao (University of California, Riverside), Sheng Di (Argonne National Laboratory), Franck Cappello (Argonne National Laboratory), Zizhong Chen (University of California, Riverside)

Abstract: Introspective sorting is a ubiquitous sorting algorithm which underlies many large scale distributed systems. Hardware-mediated soft errors can result in comparison and memory errors, and thus cause introsort to generate incorrect output, which in turn disrupts systems built upon introsort; hence, it is critical to incorporate fault tolerance capability within introsort. This paper proposes the first theoretically-sound, practical fault tolerant introsort with negligible overhead: FT-iSort. To tolerate comparison errors, we use minimal TMR protection via exploiting the properties of the effects of soft errors on introsort. This algorithm-based selective protection incurs far less overhead than naïve TMR protection designed to protect an entire application. To tolerate memory errors that escape DRAM error correcting code, we propose XOR-based re-execution. We incorporate our fault tolerance method into the well-known parallel sorting implementation HykSort, and we find that fault tolerant HykSort incurs negligible overhead and obtains nearly the same scalability as unprotected HykSort.

Presentation: file

Back to Technical Papers Archive Listing