Presentation
CARE: Compiler-Assisted Recovery from Soft Failures
Event Type
Paper
TP
Applications
Cancer
Compiler Analysis and Optimization
Compilers
Computational Biology
Exascale
Fault Tolerance
Heterogeneous Systems
Machine Learning
Parallel Application Frameworks
Portability
Reliability
Resiliency
Runtime Systems
Scalable Computing
Scientific Computing
Scientific Workflows
Simulation
Tools
Workflows
BSP Finalist
Time: Wednesday, 20 November 2019, 4pm - 4:30pm
Location: 401-402-403-404
Description: As new architecture designs continue to boost system performance through higher circuit density, shrinking process technology, and near-threshold voltage operation, hardware is projected to become more vulnerable to transient faults. Although relatively infrequent, crashes due to transient faults are highly disruptive and unpredictable, necessitating frequent check-pointing, which incurs substantial overhead.
In this paper, we present CARE, a lightweight, compiler-assisted technique that lets applications continue executing after crash-causing errors. CARE repairs corrupted state by recomputing the data for the crashed architectural state on the fly. We evaluated CARE on 5 scientific workloads with up to 3072 cores. During normal execution, CARE incurs near-zero overhead, and it recovers on average 83.5% of crash-causing errors within tens of milliseconds. Moreover, with such an effective error-recovery mechanism, frequent check-pointing can be relaxed to a relatively infrequent schedule, substantially reducing overhead.
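As a rough illustration of how in-place recovery could relax the checkpoint schedule, the sketch below applies Young's first-order approximation for the optimal checkpoint interval, sqrt(2 x checkpoint cost x MTBF). This model is an assumption made here for illustration and is not taken from the paper. The checkpoint cost and mean time between failures (MTBF) are hypothetical values, and the sketch further assumes that every failure is a crash-causing soft error and that a recovered crash costs nothing. Only the 83.5% recovery rate comes from the description above.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def effective_mtbf(mtbf_s: float, recovery_rate: float) -> float:
    """MTBF counting only the crashes that are NOT recovered in place."""
    if not 0.0 <= recovery_rate < 1.0:
        raise ValueError("recovery_rate must be in [0, 1)")
    return mtbf_s / (1.0 - recovery_rate)


RECOVERY_RATE = 0.835      # average recovery rate reported in the description
CHECKPOINT_COST_S = 60.0   # hypothetical: time to write one checkpoint
MTBF_S = 4 * 3600.0        # hypothetical: one crash every four hours

# Interval when every crash forces a rollback to the last checkpoint.
baseline = young_interval(CHECKPOINT_COST_S, MTBF_S)

# Interval when only the unrecovered crashes force a rollback.
relaxed = young_interval(CHECKPOINT_COST_S, effective_mtbf(MTBF_S, RECOVERY_RATE))

print(f"baseline interval: {baseline:.0f} s")
print(f"relaxed interval:  {relaxed:.0f} s")
print(f"stretch factor:    {relaxed / baseline:.2f}x")
```

Under these assumptions the stretch factor depends only on the recovery rate: it equals 1 / sqrt(1 - 0.835), or roughly 2.5, regardless of the checkpoint cost and MTBF chosen.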