Knowledge Is Power: Unleashing the Potential of Your Archives Through Metadata

Authors: Jay Lofstead (Sandia National Laboratories), Margaret Lawson (University of Illinois, Sandia National Laboratories), Julian Kunkel (University of Reading)

Abstract: Currently, HPC archiving is a largely hopeful enterprise because the community lacks the infrastructure needed to help users identify what an archived dataset contains and who generated it. Although metadata management offers a potential solution, the community has yet to realize a system that fully addresses the archiving problem. The goal of this BoF is to bring together the groups that have been working individually on this problem to gather requirements for an effective exascale solution and to try to generate a common framework for moving forward.

Long Description: Long-term archives offer the false hope that saved files and data will once again be discoverable, accessible, and usable. With HPC applications generating increasing volumes of data, the community must develop an effective archiving solution. Most archives are never read because they lack the basic information needed to identify what the data contains and who generated it. This represents significant lost research potential, since many irreplaceable datasets are lost in the ether and many replaceable datasets are regenerated unnecessarily. Until the HPC community develops the infrastructure to allow users to find the datasets they need, archiving is a largely wasted enterprise.

In recent years, metadata management has presented itself as a potential solution to the data discovery problem. Metadata captures what datasets contain so that users can query at various levels of granularity to find the data they need. At the dataset level, metadata management can tell users not just what information a dataset contains, but also who produced it and where it is located so that, if the creator retires or changes jobs, the data can still be identified and used. This is critical to avoiding duplicated research and to keeping data useful beyond a brief interval immediately after its initial creation or collection. Metadata management can also provide file-level indexing, allowing users to find all of the files associated with a particular project or generated from a particular simulation. At the finest level, metadata management can provide feature-level indexing that allows users to find the precise data they want from within a particular dataset or across collections of datasets. For example, users can search for all instances of combustion from an ensemble run. This feature-level metadata can also be used by machine learning methods to discover long-term trends across collections of datasets. Thus, developing metadata management systems is vital to enabling effective archiving and facilitating scientific discovery.
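To make the three levels of granularity concrete, the following is a minimal sketch of a metadata catalog with dataset-level, file-level, and feature-level tables, built on Python's standard sqlite3 module. It is an illustration only and does not reflect the design or API of any existing system. The schema, the function names, and all of the sample values (creator, paths, project name, feature kinds, timesteps) are hypothetical.

```python
import sqlite3

def build_catalog():
    """Create an in-memory catalog with one table per metadata level (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT,
                               creator TEXT, location TEXT);
        CREATE TABLE files    (id INTEGER PRIMARY KEY,
                               dataset_id INTEGER REFERENCES datasets(id),
                               path TEXT, project TEXT);
        CREATE TABLE features (id INTEGER PRIMARY KEY,
                               file_id INTEGER REFERENCES files(id),
                               kind TEXT, timestep INTEGER);
    """)
    return conn

def load_example(conn):
    """Insert made-up records purely for illustration."""
    conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)",
                     [(1, "ensemble_run_A", "alice", "/archive/ensemble_run_A")])
    conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)",
                     [(1, 1, "member_000.h5", "proj_x"),
                      (2, 1, "member_001.h5", "proj_x")])
    conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?)",
                     [(1, 1, "combustion", 40),
                      (2, 2, "combustion", 55),
                      (3, 2, "extinction", 90)])

def datasets_by_creator(conn, creator):
    """Dataset level: what did this person produce, and where is it stored?"""
    return conn.execute(
        "SELECT name, location FROM datasets WHERE creator = ? ORDER BY name",
        (creator,)).fetchall()

def files_for_project(conn, project):
    """File level: all files associated with a project."""
    rows = conn.execute(
        "SELECT path FROM files WHERE project = ? ORDER BY path", (project,))
    return [row[0] for row in rows]

def find_features(conn, kind):
    """Feature level: where does a given feature occur across all datasets?"""
    return conn.execute(
        """SELECT d.name, f.path, ft.timestep
           FROM features ft
           JOIN files f ON ft.file_id = f.id
           JOIN datasets d ON f.dataset_id = d.id
           WHERE ft.kind = ?
           ORDER BY f.path, ft.timestep""",
        (kind,)).fetchall()

conn = build_catalog()
load_example(conn)

if __name__ == "__main__":
    for name, path, step in find_features(conn, "combustion"):
        print(name, path, step)
```

Under these assumptions, the feature-level query corresponds to the combustion search described above: a single lookup returns every file and timestep where the feature was tagged, without opening the archived files themselves. A production system at exascale would need distributed storage and far richer schemas than this single-node sketch.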

Currently, many institutions are working on this problem individually rather than coming together to leverage each other’s efforts. For example, SNL has created EMPRESS, ORNL has created TagIt, LANL has created GUFI, and LLNL and NERSC both have other efforts underway. The goal of this BoF is to bring together this community to discuss what has been done and to gather requirements for an effective exascale metadata management system that can address both short-term computational needs and long-term archive accessibility and usefulness. We will establish a unified effort moving forward to minimize redundant work and to ensure that the exascale archiving challenges are being fully addressed. By generating a common framework, we also plan to create a system that more readily facilitates collaboration and data sharing between institutions. Once scientists have collected their metadata, they should have an easy way to share this knowledge with others to avoid duplicated research and thereby facilitate the advancement of science.

URL: https://github.com/gflofst/knowledge-is-power
