Poster 28: A Framework for Resilient and Energy-Efficient Computing in GPU-Accelerated Systems
Event Type
ACM Student Research Competition: Graduate Posters
ACM Student Research Competition: Undergraduate Posters
Posters
Registration Categories
TP
EX
EXH
Tags
Student Program
Time
Wednesday, 20 November 2019, 8:30am - 5pm
Location
E Concourse
Description
High-performance computing systems must address resilience and power simultaneously. In heterogeneous systems, the trade-offs between resilience and energy efficiency are more complex for applications that use both CPUs and GPUs. Addressing both at once on such systems requires a deep understanding of the interplay among energy efficiency, resilience, and performance.

In this work, we present a new framework for resilient and energy-efficient computing in GPU-accelerated systems. The framework supports partial or full redundancy and checkpointing for resilience, and gives users flexible hardware resource selection, adjustable precision, and power management to improve performance and energy efficiency. We further use CUDA-aware MPI to reduce resilience overhead, mainly in message communication between GPUs. Using CG as an example, we show that our framework provides about 40% time savings and 45% energy savings compared with a simple extension of RedMPI, a redundancy-based resilience framework for homogeneous CPU systems.
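
The redundancy idea mentioned above can be illustrated with a minimal sketch. This is not the poster's framework and not RedMPI's API; it only assumes the general technique of running replicas and comparing their outputs by digest. The names `digest` and `vote` are hypothetical, and the sketch uses plain Python byte strings in place of real MPI or GPU buffers.

```python
import hashlib
from collections import Counter


def digest(payload: bytes) -> str:
    """Hash a message so replicas can be compared without exchanging full payloads."""
    return hashlib.sha256(payload).hexdigest()


def vote(replica_payloads):
    """Majority vote over copies of the same message produced by replicas.

    Returns (accepted_payload, corrupted_indices). Hypothetical helper:
    with two replicas a mismatch can only be detected, not corrected,
    so the function raises; with three or more, a strict majority wins.
    """
    digests = [digest(p) for p in replica_payloads]
    winner, count = Counter(digests).most_common(1)[0]
    if count <= len(replica_payloads) // 2:
        raise RuntimeError("no majority: corruption detected but not correctable")
    corrupted = [i for i, d in enumerate(digests) if d != winner]
    return replica_payloads[digests.index(winner)], corrupted


if __name__ == "__main__":
    good = b"residual=1.0e-06"
    flipped = b"residual=1.0e-05"  # simulated silent data corruption in one replica
    payload, bad = vote([good, flipped, good])
    print(payload, bad)
```

In this sketch, full redundancy would mean voting on every process's messages, while partial redundancy would replicate only a chosen subset; how the framework itself makes that selection is not described in the abstract.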