Session
Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)
Session Chairs
Event Type: Workshop
Extreme Scale Computing
Fault Tolerance
Reliability
Resiliency
Time: Friday, 22 November 2019, 8:40am - 12pm
Location: 301-302-303
Description: Addressing failures in extreme-scale systems remains a significant challenge to reaching exascale. Current projections suggest that at the scale necessary to sustain exaflops of computation, systems could experience failures as frequently as once per hour. As a result, robust and efficient fault tolerance techniques are critical to obtaining reasonable application performance. It is also imperative that we develop an understanding of how trends in hardware devices may affect the reliability of future systems. The emergence of high-bandwidth memory devices, the continued deployment of burst buffers, and the development of near-threshold devices to address power concerns will all impact fault tolerance on new systems. These design trends, coupled with increases in the number, variety, and complexity of components required to compose an extreme-scale system, mean that systems will experience significant increases in aggregate fault rates, fault diversity, and the complexity of isolating root causes.
Due to the continued need for research on fault tolerance in extreme-scale systems, the 9th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2019) will present an opportunity for innovative research ideas to be shared, discussed, and evaluated by researchers in fault-tolerance, resilience, dependability, and reliability from academic, government, and industrial institutions. Building on the success of the previous editions of the FTXS workshop, the organizers will assemble quality publications, invited talks, and panels to facilitate a lively and thought-provoking group discussion.
https://sites.google.com/site/ftxsworkshop/home/ftxs-2019
Presentations
8:40am - 8:45am | Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)
8:45am - 9:10am | Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications
9:10am - 9:35am | Enforcing Crash Consistency of Scientific Applications in Non-Volatile Main Memory Systems
9:35am - 10:00am | FaultSight: A Fault Analysis Tool for HPC Researchers
10:00am - 10:30am | FTXS Morning Break
10:30am - 10:55am | Self-Stabilizing Connected Components
10:55am - 11:20am | Evaluating Compiler IR-Level Selective Instruction Duplication with Realistic Hardware Errors
11:20am - 11:45am | Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes
11:45am - 12:00pm | Closing remarks