Advisor: Luca Benini (University of Bologna, ETH Zurich)
Abstract: In technical and scientific computing, the push toward larger simulations has so far been supported by the steady miniaturization of microprocessors, which has allowed the compute capacity of general-purpose architectures to grow at constant power. With the end of Dennard scaling, this process is now hitting its ultimate power limits and is about to come to an end. This implies an increase in the energy cost of computation, performance loss due to designs based on worst-case power consumption, and performance loss due to overheating and thermal gradients. As a result, thermally and power-bound supercomputers show performance degradation and heterogeneity that limit the peak performance of the system. This doctoral showcase presents software strategies to tackle the main bottlenecks induced by the power and thermal issues that affect next-generation supercomputers. To respond to these challenges, my work shows that propagating workload requirements from the application to the runtime and operating system levels is the key to providing efficiency. This is possible only if the proposed software methodologies cause little or no overhead in terms of application performance. With this in mind, I have designed application-aware node-level optimal thermal management algorithms and runtimes, lazy node-level power capping, and an energy-reduction runtime. The experimental results show a significant step forward with respect to current state-of-the-art solutions for power and thermal control of HPC systems.
Thesis Canvas: pdf