Advisor: Michela Taufer (University of Tennessee)
Abstract: As HPC applications migrate from the petascale systems of today to the exascale systems of tomorrow, the increasing need to embrace asynchronous, irregular, and dynamic communication patterns will lead to a corresponding decrease in application-level determinism. Two critical challenges emerge from this trend. First, unchecked non-determinism, coupled with the non-associativity of floating-point arithmetic, undermines the numerical reproducibility of scientific applications. Second, the prevalence of non-determinism amplifies the cost of debugging, in terms of both computing resources and human effort. In this thesis, we present a modeling methodology to quantify and characterize communication non-determinism in parallel applications. Our methodology consists of three core components. First, we build graph-structured models of relevant communication events from execution traces. Second, we apply similarity metrics based on graph kernels to quantify run-to-run variability and thus identify the regions of execution where non-determinism manifests most prominently. Third, we leverage our notion of execution similarity to characterize applications via clustering, anomaly detection, and extraction of representative patterns of non-deterministic communication, which we dub "non-determinism motifs". Our work will amplify the effectiveness of software tools that target mitigation or control of application-level non-determinism (e.g., record-and-replay tools) by providing them with a common metric for quantifying communication non-determinism in parallel applications and a common language for describing it.
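Illustrative sketch (not taken from the thesis): the first two components of the methodology can be pictured with a toy example. The code below assumes a hypothetical message race in which rank 0 receives one message from each sender in whatever order they arrive, builds a small directed event graph per run, and compares runs with a simplified Weisfeiler-Lehman-style subtree kernel. The choice of kernel, the labeling scheme, and all function names (`build_event_graph`, `wl_features`, `kernel_distance`) are assumptions made for illustration; the thesis may use different graph models and kernels.

```python
from collections import Counter
from math import sqrt


def build_event_graph(recv_order):
    """Build a toy event graph for one run (hypothetical trace format).

    recv_order lists the sender ranks in the order rank 0 matched them.
    Returns (labels, preds): node labels and, per node, its predecessors.
    """
    labels, preds = {}, {}
    prev_recv = None
    for i, sender in enumerate(recv_order):
        send, recv = ("send", sender), ("recv", i)
        labels[send] = f"send@{sender}"
        labels[recv] = "recv@0"
        preds[send] = set()
        preds[recv] = {send}              # message edge: send -> recv
        if prev_recv is not None:
            preds[recv].add(prev_recv)    # program-order edge on rank 0
        prev_recv = recv
    return labels, preds


def wl_features(labels, preds, iterations=2):
    """Count node labels after each round of predecessor-based relabeling."""
    current = dict(labels)
    feats = Counter(current.values())
    for _ in range(iterations):
        current = {
            n: current[n] + "(" + ",".join(sorted(current[p] for p in preds[n])) + ")"
            for n in preds
        }
        feats.update(current.values())
    return feats


def kernel(fa, fb):
    """Dot product of two label-count histograms."""
    return sum(count * fb[label] for label, count in fa.items())


def kernel_distance(graph_a, graph_b, iterations=2):
    """Distance induced by the kernel: sqrt(k(a,a) + k(b,b) - 2 k(a,b))."""
    fa = wl_features(*graph_a, iterations=iterations)
    fb = wl_features(*graph_b, iterations=iterations)
    squared = kernel(fa, fa) + kernel(fb, fb) - 2 * kernel(fa, fb)
    return sqrt(max(squared, 0))


run_1 = build_event_graph([1, 2, 3])  # messages matched in rank order
run_2 = build_event_graph([2, 1, 3])  # same messages, different arrival order
print(kernel_distance(run_1, run_1), kernel_distance(run_1, run_2))
```

Under these assumptions, two runs with the same matching order are at distance zero, while a reordered run is at a positive distance; that distance is the kind of run-to-run variability measure the abstract refers to.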
Thesis Canvas: pdf