Advisor: Dong Li (University of California, Merced)
Abstract: As high-performance computing (HPC) systems scale in size and computational power, transient faults occur more often. Without efficient and effective fault tolerance mechanisms, transient errors can cause incorrect execution outcomes and can even lead to catastrophe in safety-critical HPC applications. Previous work attributes error resilience in HPC applications, at a high level, to either the probabilistic or the iterative nature of the application, but the community still lacks a fundamental understanding of the program constructs that result in natural error resilience. We design FlipTracker, a framework that analytically tracks error propagation and provides a fine-grained understanding of how errors propagate and are tolerated. After running FlipTracker on representative HPC applications, we summarize six resilience computation patterns that lead to natural error resilience in HPC applications. Building on this better understanding of natural resilience, we aim to model application resilience on data objects to transient faults. Many common application-level fault tolerance mechanisms focus on data objects, so understanding application resilience on data objects can help direct those mechanisms. The common practice for studying application resilience, random fault injection, gives little knowledge of how and where errors are tolerated. Understanding "how" and "where" is necessary to apply application-level fault tolerance mechanisms effectively and efficiently. We design a practical model, MOARD, that measures application resilience on data objects by analytically quantifying the error masking events that happen to each data object. Using our model, users can compare application resilience across different data objects with different data types.
Thesis Canvas: pdf