Poster 133: Portable Resilience with Kokkos
Event Type: Posters, Research Posters
Registration Categories: TP, EX, EXH
Time: Thursday, 21 November 2019, 8:30am - 5pm
Location: E Concourse
Description: The Kokkos ecosystem is a programming environment that provides performance and portability to many scientific applications that run on DOE supercomputers as well as on smaller-scale systems. By building on the software abstractions already present in Kokkos, this work makes software resilience portable for end-user code while implementing the most efficient resilience algorithms internally. The addition lets an application manage hardware failures, reducing the cost of interruption without drastically increasing software maintenance cost. Two main resilience methodologies have been added to the Kokkos ecosystem to validate the resilience abstractions: (1) checkpointing, which includes an automatic mode that supports other checkpointing libraries and a manual mode that leverages the data abstraction and memory space concepts; and (2) a redundant execution model, which anticipates failures by replicating data and execution paths. The design and implementation of these additions are illustrated, and examples are included to demonstrate their simplicity of use.
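
The two methodologies named in the description can be pictured with a small standalone C++ sketch. It does not use Kokkos or its resilience interface; the names `Checkpoint` and `run_redundant` are hypothetical and exist only for illustration. It assumes that redundant execution means running the same kernel on independent copies of the data and keeping the element-wise majority, and that a manual checkpoint is an explicit copy of the data tagged with an iteration number.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical manual checkpoint (not a Kokkos API): keeps one copy of the
// data together with the iteration at which it was taken.
struct Checkpoint {
  int iteration = -1;        // -1 means nothing has been saved yet
  std::vector<double> data;

  void save(int it, const std::vector<double>& d) {
    iteration = it;
    data = d;
  }
  bool valid() const { return iteration >= 0; }
};

// Hypothetical redundant execution (not a Kokkos API): runs `kernel` on
// `replicas` independent copies of `input`, then takes an element-wise
// majority vote. The kernel receives the replica index so that a test can
// inject a fault into one replica. Returns false if any element has no
// strict majority; `result` is then left partially filled.
bool run_redundant(const std::vector<double>& input,
                   const std::function<void(std::vector<double>&, int)>& kernel,
                   std::vector<double>& result,
                   int replicas = 3) {
  assert(replicas >= 1);

  // Replicate the data, then replicate the execution path.
  std::vector<std::vector<double>> copies(replicas, input);
  for (int r = 0; r < replicas; ++r) kernel(copies[r], r);

  result.assign(input.size(), 0.0);
  for (std::size_t i = 0; i < input.size(); ++i) {
    bool found = false;
    for (int r = 0; r < replicas && !found; ++r) {
      int votes = 0;
      for (int s = 0; s < replicas; ++s) {
        // Exact comparison: replicas perform identical operations, so
        // healthy replicas are expected to agree bit for bit.
        if (copies[s][i] == copies[r][i]) ++votes;
      }
      if (2 * votes > replicas) {
        result[i] = copies[r][i];
        found = true;
      }
    }
    if (!found) return false;
  }
  return true;
}
```

In a Kokkos application the data would normally live in `Kokkos::View` objects and the kernel would be a parallel dispatch such as `Kokkos::parallel_for`; `std::vector` and a plain callback are used here only to keep the sketch self-contained. With three replicas, a single corrupted replica is outvoted by the other two, at roughly three times the compute and memory cost.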