SC19 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Addressing Data Resiliency for Staging Based Scientific Workflows

Authors: Shaohua Duan (Rutgers University, Rutgers Discovery Informatics Institute), Pradeep Subedi (Rutgers University, Rutgers Discovery Informatics Institute), Philip E. Davis (Rutgers University, Rutgers Discovery Informatics Institute), Manish Parashar (Rutgers University, Rutgers Discovery Informatics Institute)

Abstract: As applications move toward extreme scales, data-related challenges are becoming significant concerns, and in-situ workflows based on data staging and in-situ/in-transit data processing have been proposed to address these challenges. Increasing scales are also expected to result in an increase in the rate of silent data corruption errors, which will impact both the correctness and performance of applications. Furthermore, this impact is amplified in the case of in-situ workflows due to the dataflow between the component applications of the workflow. While existing research has explored silent error detection at the application level, silent error detection for workflows remains an open challenge.

This paper addresses error detection for extreme-scale in-situ workflows. The presented approach leverages idle computational resources in data staging to enable timely detection of, and recovery from, silent data corruption, reducing both the propagation of corrupted data and the end-to-end workflow execution time in the presence of silent errors.
