FPGA acceleration of high performance computing communication middleware
High-Performance Computing (HPC) necessarily requires computing with a large number of nodes. As computing technology progresses, internode communication becomes an ever more critical performance bottleneck. The execution time of the software that supports communication is a dominant cost, often amounting to hundreds of times the actual time-of-flight latency. This software support comes in two types. The first is support for core functions as defined in middleware such as the ubiquitous Message Passing Interface (MPI). Over the last decades, this software overhead has been addressed through a number of advances such as eliminating data copies, improving drivers, and bypassing the operating system. However, an essential core still remains, including message matching, data marshaling, and handling collective operations. The second type of communication support is for new services not inherently part of the middleware. The most prominent of these is compression; it brings huge savings in transmission time, but much of this benefit is offset by a new level of software overhead. In this dissertation, we address the software overhead in internode communication using elements of emerging node architectures, which incorporate FPGAs in multiple configurations: closely coupled hardware support, programmable Network Interface Cards (NICs), and routers with programmable accelerators. While there has been substantial work on offloading communication software into hardware, we advance the state of the art in three ways. The first is to use an emerging hardware model that is, for the first time, both realistic and supportive of very high performance gains. Previous studies (and some products) have relied on hardware models that are either of limited benefit (a NIC processor) or not sustainable (a NIC augmented with ASICs). Our hardware model is based on the various emerging CPU-FPGA computing architectures. The second is to improve on previous work.
We have found this to be possible through a number of means: taking advantage of configurable hardware, taking advantage of close coupling, and devising novel improvements. The third is to examine problems that have so far been almost completely unexplored. One of these is hardware acceleration of application-aware, in-line, lossy compression. In this dissertation, we propose offload approaches and hardware designs for integrated FPGAs that bring communication latency down to ultra-low levels unachievable by today's software and hardware. We focus on improving performance in three respects: 1) accelerating middleware semantics within communication routines, such as message matching and derived datatypes; 2) optimizing complex communication routines, namely collective operations; 3) accelerating operations vital to new communication services independent of the middleware, such as data compression. The last aspect is somewhat broader than the others: it applies not only to HPC communication but is also vital to broader system functions such as I/O.