Paper: Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice
Event Type
Paper
Registration Categories
TP
Tags
Data Analytics
MPI
Performance
Resiliency
Resource Management
State of the Practice
Time
Tuesday, 19 November 2019, 3:30pm - 4pm
Location
205-207
Description
As we near exascale, resilience remains a major technical hurdle. Techniques for achieving resilience are hampered by having to be reactive, since failures can occur at any time. A wide body of research therefore aims at predicting failures, i.e., forecasting them so that evasive action can be taken while the system is still fully functional and its global state can still be reasoned about.

This research area has grown very diverse, with a large number of approaches, yet it is currently poorly classified, making it hard to understand the impact of existing work. In this paper, we perform an extensive survey of the existing literature on failure prediction by analyzing and comparing more than 30 different failure prediction approaches. We develop a taxonomy that aids in categorizing the approaches, and show how it can help to understand the state-of-the-practice of this field and to identify opportunities, gaps, and future work.