HDF5 and Its Role in Exascale, Cloud, and Object Stores

Authors: Elena Pourmal (HDF Group), Quincey Koziol (Lawrence Berkeley National Laboratory), Suren Byna (Lawrence Berkeley National Laboratory)

Abstract: We will provide a forum for the HDF5 user community to learn about HDF5's role in moving science applications to Exascale systems, the Cloud, and Object Stores, and to share initiatives in this area. Elena Pourmal will present HDF5 features that target Exascale systems, the Cloud, and Object Stores; HDF5’s role in the DOE’s ECP and EOD projects; and the HDF5 roadmap. Quincey Koziol and Suren Byna will moderate a panel of representatives from research, commercial, and government organizations, who will present case studies on how they leverage, or plan to leverage, HDF5 technologies for HPC and the Cloud.

Long Description: HDF5 is a unique, high-performance, open-source technology suite that consists of an abstract data model, a software library, and a file format used for storing and managing extremely large and/or complex data collections. The technology is used worldwide by government, industry, and academia in a wide range of science, engineering, and business disciplines.

More than 1300 projects on GitHub currently use HDF5, drawn by its (1) versatile, self-describing data model that can represent very complex data objects, relationships between objects, and objects’ metadata; (2) completely portable binary file format with no limit on the number or size of data objects; (3) software library optimized for efficient I/O; and (4) tools for managing, manipulating, viewing, and analyzing HDF5 data.
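
As a small illustration of the self-describing data model described above, the following sketch uses the h5py Python bindings (an assumption on our part; the session itself does not prescribe a language binding). The group name, dataset name, and attribute values are hypothetical. It builds an in-memory file, attaches metadata to the objects, and then recovers each dataset's shape, type, and attributes from the file alone.

```python
import h5py
import numpy as np


def build_and_describe():
    """Create a tiny in-memory HDF5 file and read its structure back."""
    # driver="core" with backing_store=False keeps the file in memory only,
    # so nothing is written to disk. All names below are hypothetical.
    with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
        # Groups organize objects hierarchically, like directories.
        run = f.create_group("run_001")
        run.attrs["instrument"] = "hypothetical-detector"

        # A dataset is a typed, shaped array; attributes carry its metadata.
        data = np.arange(6, dtype="f8").reshape(2, 3)
        temps = run.create_dataset("temperature", data=data)
        temps.attrs["units"] = "kelvin"

        # Walk the file and collect what it says about itself.
        described = {}

        def record(name, obj):
            if isinstance(obj, h5py.Dataset):
                described[name] = (obj.shape, str(obj.dtype), dict(obj.attrs))

        f.visititems(record)
        return described


summary = build_and_describe()
print(summary)
```

A reader needs no external schema to interpret the result: the dataset path, dimensions, element type, and units are all stored alongside the data.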

In recent years, new features have been added to the HDF5 library to access data in Object Stores and in the Cloud. These features take full advantage of the new storage paradigms and require minimal changes to existing HDF5 HPC applications.

The HDF5 suite is included by every major HPC system vendor as part of their core software because of its broad adoption in science applications and its ability to improve I/O performance and data organization within HPC environments. In addition, for more than two decades The HDF Group has worked with researchers all over the globe, helping them capture, store, and analyze experimental data in HDF5, such as data collected at light sources and particle accelerators. In the past decade, the amount of modeling, experimental, and observational data stored in HDF5, and the rate at which this data is collected, have created new challenges for scientists and triggered requests to use new storage paradigms such as Object Stores and the Cloud.

The HDF Group team, in collaboration with the LBNL team, is excited to present new HDF5 features that will help applications move to Exascale systems, Object Stores, and the Cloud. These features include HDF5 VOL connectors that allow HDF5 applications to store data in Object Stores and the Cloud. We will present the latest HDF5 technology roadmap, and we will share how scientific, government, and industry users utilize HDF5 technologies to solve real-world problems. We will also discuss technologies that our teams are developing under the DOE Exascale Computing Project (ECP) and Experimental and Observational Data (EOD) projects, including async I/O, system topology-aware I/O, caching and prefetching, sparse data management, the split VFD, and the remote VFD.

The HDF Group will also demonstrate how current AWS Cloud technologies can be leveraged by HDF5 applications, including applications that use parallel HDF5 access.

The HDF5 BoF session format includes time for HDF5 community members to give 5-7 minute presentations and to discuss challenges they face when using HDF5. The discussion will help The HDF Group prioritize items on the HDF5 roadmap. The HDF Group also encourages discussion of how the HDF5 community could contribute to the maintenance and future development of HDF5. The session agenda will be published on The HDF Group website; BoF presentations will be made available too.


