Presentation

Description

Performance anomalies are difficult to detect because a “healthy system” is often only vaguely defined, and the ground truth for how a system should operate is elusive. As we move to exascale, however, detecting performance anomalies will become increasingly important as systems grow in size and complexity. The literature offers very few accepted methods for detecting anomalies, and there are no published, labeled data sets of anomalous HPC behavior. In this research, we develop a suite of applications that represent HPC workloads and use data from a lightweight metric collection service to train machine learning models that predict the future behavior of metrics. In the future, this work will be used to predict anomalous runs on compute nodes and to determine some root causes of performance issues, helping HPC system administrators and users work more efficiently.