High-performance communication infrastructure design on FPGA-centric clusters
MetadataShow full item record
FPGA-Centric Clusters (FCCs) with the FPGAs directly linked through their Multi-Gigabit Transceivers (MGTs) have a proven advantage over other commodity architectures for communication bound applications. To date, however, communication infrastructure for such clusters has generally only taken one of two simple approaches: nearest-neighbor-only, which is fast but of limited utility, and processor-based, which is general but slow. The overall problem addressed in this dissertation is the architecture, design, and implementation of communication networks for FCCs. These network designs should take advantage of the decades of design experience in networks for High-Performance Computing (HPC) clusters, but should also account for, and take advantage of, unique characteristics of FCCs, in particular, the configurability of the FPGAs themselves. This dissertation has seven parts. We begin with in-depth implementations of two model applications, Directional Dark Matter (DM) Detection, and Molecular Dynamics (MD). These implementations expose the necessary characteristics of FCC networks from physical through application layers. The second is the systematic exploration of communication microarchitecture for FCCs, as has been done previously for HPC clusters and for Networks on Chips (NoCs) on both FPGAs and ASICs. One outcome of this part is to find the properties of FCCs that substantially influence the router design space. Another outcome is to create a selection of candidate routers and generalize it so that it is parameterized by routing algorithm, arbitration policy, number of virtual channels (VCs), and other parameters. The third part is to use the proposed application-aware framework to evaluate the resulting design space with respect to a number of common communication patterns and packet sizes. The results from this part enable two sets of designs. One is the selection of an optimal router for a given resource budget that accounts for all the workloads. The other is to take advantage of FPGA reconfigurability to select the optimal router accounting for both resource budget and a particular workload. The fourth part is to evaluate the advantages of this approach of adapting the router design to the application. We find that the optimality of the router design varies significantly with workloads. We observe that compared with the router configuration with the best average performance, application-aware router selection can lead to substantial improvement in performance or reduction in resources required. The fifth part is application-specific optimizations in which we develop several modules and functional units that can provide specific optimizations for certain types of communication workloads depending on the application it going to serve. The sixth part explores topology emulation, e.g., when a three-dimensional network is used in the computation of an application that is logically two dimensional. We propose a generalized fold-and-cut mechanism that both preserves the locality in logical mapping, while also making use of the extra links provided by our 3D-torus fixture. The seventh part is a table-based static-scheduled router for applications with a static or persistent communication pattern. The router supports various cases, including unicast, multicast, and reduction. By making routing decisions a priori, we can bring better load-balance to network links and reduce congestion.