Abstract: Addressing failures in extreme-scale systems remains a significant challenge to reaching exascale. Current projections suggest that at the scale necessary to sustain exaflops of computation, systems could experience failures as frequently as once per hour. As a result, robust and efficient fault tolerance techniques are critical to obtaining reasonable application performance. It is also imperative that we develop an understanding of how trends in hardware devices may affect the reliability of future systems. The emergence of high-bandwidth memory devices, the continued deployment of burst buffers, and the development of near-threshold devices to address power concerns will all impact fault tolerance on new systems. These design trends, coupled with increases in the number, variety, and complexity of components required to compose an extreme-scale system, mean that systems will experience significant increases in aggregate fault rates, fault diversity, and the complexity of isolating root causes.
Given the continued need for research on fault tolerance in extreme-scale systems, the 9th Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2019) will present an opportunity for innovative research ideas to be shared, discussed, and evaluated by researchers in fault tolerance, resilience, dependability, and reliability from academic, government, and industrial institutions. Building on the success of the previous editions of the FTXS workshop, the organizers will assemble quality publications, invited talks, and panels to facilitate a lively and thought-provoking group discussion.