Authors:
Abstract: HPC users deploy a suite of monitors to observe patterns of failures and performance anomalies in order to improve operational efficiency, achieve higher application performance, and inform the design of future systems. However, the promise and potential of monitoring data have largely gone unrealized because of challenges such as inadequate monitoring, limited availability of data, and a lack of methods for fusing monitoring data at the time-scales necessary for human-in-the-loop or machine-in-the-loop feedback. To address these challenges, in this work we developed Holistic Measurement Driven System Assessment (HMDSA), a monitoring fabric for large-scale HPC facilities that is independent of major component vendor and fits within budget constraints of money, space, and power. We accomplish this through the development and deployment of scalable, platform-independent, open-source tools and techniques for monitoring, coupled with statistical and machine-learning-based runtime analysis and feedback, which enables highly efficient HPC system operation and usage and also informs future system improvements.
Best Poster Finalist (BP): no
Poster: PDF
Poster summary: PDF