Software and hardware codesign of SmartNIC-based heterogeneous HPC clusters with machine learning case studies

Date
2024
Authors
Guo, Anqi
Abstract
Machine learning has evolved rapidly in recent years and now permeates every aspect of science, technology, and daily life. As applications demand higher prediction accuracy and take on more complex tasks, ever-larger models are proposed to meet these requirements. Deep learning applications such as recommendation models and large language models have grown to trillions of parameters and consume terabytes of memory. These models have outpaced the growth of GPU memory; GPU clusters, which aggregate memory across devices, have therefore grown exponentially to accommodate them. The memory wall refers to the point at which the demand for memory exceeds the available capacity, creating a bottleneck for training ever-larger deep learning models.

Heterogeneous deep learning training has become a key approach to addressing the limitations of GPU clusters, especially as models grow in size and complexity. By combining the strengths of CPUs, GPUs, and NVMe storage, heterogeneous systems offload model states and parameters to reduce the required scale of GPU clusters, mitigating the memory wall and making it possible to train ever-growing models on limited resources. However, the performance of such heterogeneous systems is limited by the efficiency of data exchange, computation, and control. Advanced network interface cards, known as SmartNICs, have emerged to mitigate network challenges in scale-out data centers. Placed as network-facing computational components within a node, SmartNICs can efficiently manage communication between different parts of a distributed system, offloading tasks from the central processors and reducing network traffic bottlenecks. As SmartNICs continue to evolve, they are expected to play a crucial role in enabling more scalable and efficient operation of large-scale data centers, addressing the growing demands of modern applications such as machine learning and big data analytics.

In this thesis, we propose heterogeneous SmartNIC-based systems that couple software and hardware for machine learning applications. We explore the heterogeneous system design space in four steps: examining the practical capabilities of emerging SmartNICs, integrating host-detached SmartNICs into CPU-centric systems, deploying SmartNICs in GPU-centric systems, and extending SmartNICs beyond computation offload with heterogeneous global control and disaggregated memory systems. Our proposal involves software-hardware codesign of SmartNIC-based systems, enhancing system performance through dynamic scheduling and control so that both GPUs and CPUs can focus on computation with fewer interruptions. The SmartNIC serves as an intermediary layer, breaking down barriers between heterogeneous system components and providing seamless connectivity between GPUs and CPU offload engines. In addition, a caching system reduces the communication workload and memory bandwidth pressure. Finally, SmartNICs are attached at the switch level with disaggregated memory, forming a heterogeneous global control system that minimizes barrier and synchronization overhead while maximizing communication-computation overlap and model FLOPs utilization for higher system performance.
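
To make the offloading and communication-computation overlap ideas above concrete, the following is a minimal Python sketch, not the thesis's SmartNIC-based implementation: layer weights live in pinned host memory and are prefetched to the GPU one layer ahead of the computation, so host-to-device copies hide behind the previous layer's forward pass. The class and method names (OffloadedMLP, _prefetch) are illustrative assumptions only.

```python
# Minimal sketch of parameter offload with communication-computation overlap.
# Assumes a CUDA device is available; this is a software-only illustration of the
# staging/scheduling idea, not the SmartNIC-based system described in the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffloadedMLP:
    """Keeps layer weights in pinned CPU memory and streams them to the GPU on demand."""

    def __init__(self, layer_sizes, device="cuda"):
        assert torch.cuda.is_available(), "sketch assumes a CUDA device"
        self.device = device
        self.copy_stream = torch.cuda.Stream(device)  # side stream for H2D copies
        self.layers = [
            nn.Linear(n_in, n_out)
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
        ]
        for layer in self.layers:
            for p in layer.parameters():
                p.data = p.data.pin_memory()  # pinned memory enables async copies

    def _prefetch(self, idx):
        """Asynchronously copy layer idx's weights to the GPU on the side stream."""
        layer = self.layers[idx]
        done = torch.cuda.Event()
        with torch.cuda.stream(self.copy_stream):
            w = layer.weight.data.to(self.device, non_blocking=True)
            b = layer.bias.data.to(self.device, non_blocking=True)
            done.record(self.copy_stream)
        return w, b, done

    @torch.no_grad()
    def forward(self, x):
        x = x.to(self.device)
        nxt = self._prefetch(0)
        for i in range(len(self.layers)):
            w, b, done = nxt
            # Kick off the next layer's copy before computing with the current one.
            if i + 1 < len(self.layers):
                nxt = self._prefetch(i + 1)
            # The compute stream waits only for this layer's copy, not the next one.
            torch.cuda.current_stream().wait_event(done)
            w.record_stream(torch.cuda.current_stream())
            b.record_stream(torch.cuda.current_stream())
            x = F.relu(F.linear(x, w, b))
        return x


if __name__ == "__main__":
    model = OffloadedMLP([1024, 4096, 4096, 1024])
    print(model.forward(torch.randn(8, 1024)).shape)  # torch.Size([8, 1024])
```

In the thesis's setting, this kind of staging and scheduling is pushed off the host onto the SmartNIC, which is what allows the CPU and GPU to stay focused on computation.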
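
The abstract names model FLOPs utilization (MFU) as one of the metrics the switch-level global control system targets. A back-of-the-envelope sketch of that metric follows; the 6 x parameters FLOPs-per-token approximation for dense transformers and the example numbers are illustrative assumptions, not measurements from the thesis.

```python
# MFU = achieved model FLOPs per second / peak hardware FLOPs per second.
# Uses the common ~6 * parameters training-FLOPs-per-token approximation for
# dense transformer models; all numbers below are illustrative placeholders.
def mfu(params: float, tokens_per_second: float, peak_flops_per_second: float) -> float:
    model_flops_per_second = 6 * params * tokens_per_second
    return model_flops_per_second / peak_flops_per_second


# Example: a 10B-parameter model training at 2,000 tokens/s per GPU against a
# 312 TFLOP/s peak reaches roughly 0.38 MFU; offload stalls and synchronization
# barriers show up directly as lower achieved tokens/s and hence lower MFU.
print(f"MFU = {mfu(10e9, 2_000, 312e12):.2f}")
```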