SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Poster 150: A Machine Learning Approach to Understanding HPC Application Performance Variation


Authors: Burak Aksar (Boston University, Sandia National Laboratories), Benjamin Schwaller (Sandia National Laboratories), Omar Aaziz (Sandia National Laboratories), Emre Ates (Boston University), Jim Brandt (Sandia National Laboratories), Ayse K. Coskun (Boston University), Manuel Egele (Boston University), Vitus Leung (Sandia National Laboratories)

Abstract: Performance anomalies are difficult to detect because a “healthy system” is often vaguely defined and the ground truth for how a system should operate is elusive. As we move to exascale, however, detecting performance anomalies will become increasingly important as systems grow in size and complexity. The literature offers very few accepted methods for detecting anomalies, and there are no published, labeled data sets of anomalous HPC behavior. In this research, we develop a suite of applications that represent HPC workloads and use data from a lightweight metric collection service to train machine learning models to predict the future behavior of metrics. In the future, this work will be used to predict anomalous runs on compute nodes and determine some root causes of performance issues, helping HPC system administrators and users work more efficiently.

Best Poster Finalist (BP): no


