SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Benchmarking Machine Learning Ecosystem on HPC Systems


Authors: Murali Emani (Argonne National Laboratory), Abid Malik (Brookhaven National Laboratory), Jacob Balma (Cray Inc.), Steven Farrell (Lawrence Berkeley National Laboratory)

Abstract: High-performance computing is seeing an upsurge in workloads that require data analysis. Machine learning and deep learning models are used in several science domains, such as cosmology, particle physics, and biology, with simulation data at unprecedented scale. These applications include tasks such as image detection, segmentation, synthetic data generation, and in-situ data analysis. Emerging HPC systems have diverse hardware, including many-core and multi-core processors and heterogeneous accelerators. It is critical to understand the performance of machine learning/deep learning models on HPC systems at scale. Benchmarking will help us better understand model-system interactions and inform the co-design of future HPC systems for ML workloads.

Long Description: With evolving system architectures, hardware and software stacks, scientific workloads, and data from simulations, it is important to understand how these components interact. Benchmarking helps evaluate and reason about the performance gains obtained from mapping workloads to systems. As machine learning (ML) becomes a critical component for running applications faster, improving throughput, and extracting insights from the data generated by simulations, benchmarking ML methods with scientific workloads at scale will be important as we progress towards exascale systems and beyond. The goal of this BoF session is to learn more about benchmarking different machine learning methods, frameworks, and metrics for HPC workloads, and to understand what various ongoing efforts can offer.

In particular, we plan to discuss the following questions as they relate to scientific workloads.

1) Why are standard HPC benchmarks needed for ML?
1.1) What capabilities are missing in current benchmark suites that address ML and HPC workloads?
1.2) How can benchmarks be used to characterize systems and project future system performance? Representative benchmarks will be critical in designing future HPC systems that run ML workloads.

2) What are the challenges in creating benchmarks that would be useful?
2.1) The field moves quickly, so representative workloads change with the state of the art.
2.2) On-node compute characteristics vs. off-node communication characteristics for various training schemes (a minimal measurement sketch follows this list).
2.3) Big datasets, I/O bottlenecks, reliability, and MPI vs. alternative communication backends.
2.4) Complex workloads in which model training or inference is coupled to simulations, high-dimensional data, hyperparameter optimization, or reinforcement learning frameworks.
2.5) Availability of and access to scientific datasets.
2.6) What metrics would help in comparing different systems and workloads?
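To make items 2.2 and 2.3 concrete, the sketch below separates the two effects with a pair of simple timings: a dense matrix multiply as a stand-in for the on-node compute of a training step, and an MPI allreduce of a gradient-sized buffer as a stand-in for off-node gradient exchange in data-parallel training. It is a minimal illustration only, not part of any existing benchmark suite; it assumes mpi4py and NumPy are available, and the matrix and buffer sizes are arbitrary placeholders.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
steps = 10

# On-node compute proxy: time repeated dense matrix multiplies
# (placeholder for the per-step compute of a training workload).
n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(steps):
    c = a @ b
t_compute = (MPI.Wtime() - t0) / steps
gflops = 2.0 * n ** 3 / t_compute / 1e9

# Off-node communication proxy: time an allreduce of a ~100 MB "gradient" buffer.
grad = np.random.rand(25_000_000).astype(np.float32)
out = np.empty_like(grad)
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(steps):
    comm.Allreduce(grad, out, op=MPI.SUM)
t_comm = (MPI.Wtime() - t0) / steps

if rank == 0:
    print(f"ranks={nprocs}  compute: {t_compute*1e3:.1f} ms/step ({gflops:.1f} GFLOP/s)  "
          f"allreduce: {t_comm*1e3:.1f} ms per ~100 MB")

Run with, for example, mpirun -n 4 python microbench.py (the script name is arbitrary). Comparing the two times gives a first-order view of whether data-parallel training on a given system would be compute-bound or communication-bound at that scale, which is the kind of characterization item 2.2 asks benchmarks to capture.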

3) How do we design benchmarks capable of characterizing HPC systems' suitability for ML/DL workloads?
3.1) We likely need to enumerate the types of ML workloads emerging in practice.
3.2) How do the needs of HPC facilities and labs differ from those of industry?
3.3) How can AI be integrated into HPC workflows?

4) How well does the current landscape of emerging benchmarks, such as MLPerf, Deep500, BigDataBench, and AI Matrix, represent industry and science use cases?

With contributions from diverse participants, the theme of the proposed session will help channel efforts within the SC community and build liaisons with domain scientists, academia, HPC facilities, and vendors. This is the first such session, and we plan to organize a BoF every year at Supercomputing focused on building a community around benchmarking. We anticipate strong attendance of 50-75 attendees this year.

The outcomes of these discussions will be summarized in a technical report that will be hosted online and made publicly available, laying the foundation for community-building efforts towards benchmarking. Interested participants will also be encouraged to contribute to organizing and curating scientific datasets in a public repository.


URL: https://wordpress.cels.anl.gov/mlhpcbench/

