Presentation
Assessing the Impact of Timing Errors on HPC Applications
Event Type
Paper
TP
Algorithms
Fault Tolerance
Machine Learning
Reliability
Resiliency
Time
Thursday, 21 November 2019, 11am - 11:30am
Location
405-406-407
Description
Timing errors are a growing concern for system resilience as technology continues to scale. Modeling realistic timing errors with low-fidelity errors such as single-bit flips is problematic. We address the lack of a holistic methodology and tools for evaluating the resilience of applications against timing errors. The proposed technique rapidly injects high-fidelity, configurable timing errors into applications at the instruction level. Our implementation has no runtime dependencies on proprietary tools, enabling full parallelism of error injection campaigns. Furthermore, because an injection point may not generate an actual error for a particular application run, we propose an acceleration technique that maximizes the likelihood of generating errors that contribute to the overall campaign, with a speedup of up to 7x. With our tool, we show that realistic timing errors lead to error profiles distinct from those of radiation-induced errors at both the instruction level and the application level.