SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Poster 105: Holistic Measurement Driven System Assessment


Authors: Saurabh Jha (University of Illinois), Mike Showerman (National Center for Supercomputing Applications (NCSA), University of Illinois), Aaron Saxton (National Center for Supercomputing Applications (NCSA), University of Illinois), Jeremy Enos (National Center for Supercomputing Applications (NCSA), University of Illinois), Greg Bauer (National Center for Supercomputing Applications (NCSA), University of Illinois), Zbigniew Kalbarczyk (University of Illinois), Ann Gentile (Sandia National Laboratories), Jim Brandt (Sandia National Laboratories), Ravi Iyer (University of Illinois), William T. Kramer (University of Illinois, National Center for Supercomputing Applications (NCSA))

Abstract: HPC users deploy a suite of monitors to observe patterns of failures and performance anomalies in order to improve operational efficiency, achieve higher application performance, and inform the design of future systems. However, the promise and potential of monitoring data have largely not been realized because of challenges such as inadequate monitoring, limited availability of data, and a lack of methods for fusing monitoring data at the time scales necessary for human-in-the-loop or machine-in-the-loop feedback. To address these challenges, we developed a monitoring fabric, Holistic Measurement Driven System Assessment (HMDSA), for large-scale HPC facilities that is independent of major component vendor and fits within budget constraints of money, space, and power. We accomplish this through the development and deployment of scalable, platform-independent, open-source tools and techniques for monitoring, coupled with statistical and machine-learning-based runtime analysis and feedback, which enables highly efficient HPC system operation and usage and also informs future system improvements.

Best Poster Finalist (BP): no
