Abstract: This document presents a new containerized architecture that enables fine-grained control over the management of on-node resources for complex scientific high-performance workloads. Our approach introduces a node-local, application-specific resource manager by extending a container runtime; this manager coordinates with the global resource manager, i.e., the system-wide manager that assigns resources to jobs. The work extends the container runtime so that running containers can interface with global resource managers, and implements advanced resource management capabilities to address the needs of the running application.
Based on this design, the various runtimes required to execute scientific applications can interact with the container runtime under which they are running. This interaction enables the scalable and dynamic allocation of resources based on runtime requirements, as opposed to the job-level requirements traditionally handled by the global resource manager. It also enables fine-grained control over the placement of all processes and threads running in a container on specific hardware components, which is critical for achieving high performance.
Our approach therefore enables efficient, scalable, dynamic, and trackable management of resources on behalf of scientific applications, bridging a gap observed in current solutions.