Improving processor efficiency through thermal modeling and runtime management of hybrid cooling strategies
MetadataShow full item record
One of the main challenges in building future high performance systems is the ability to maintain safe on-chip temperatures in presence of high power densities. Handling such high power densities necessitates novel cooling solutions that are significantly more efficient than their existing counterparts. A number of advanced cooling methods have been proposed to address the temperature problem in processors. However, tradeoffs exist between performance, cost, and efficiency of those cooling methods, and these tradeoffs depend on the target system properties. Hence, a single cooling solution satisfying optimum conditions for any arbitrary system does not exist. This thesis claims that in order to reach exascale computing, a dramatic improvement in energy efficiency is needed, and achieving this improvement requires a temperature-centric co-design of the cooling and computing subsystems. Such co-design requires detailed system-level thermal modeling, design-time optimization, and runtime management techniques that are aware of the underlying processor architecture and application requirements. To this end, this thesis first proposes compact thermal modeling methods to characterize the complex thermal behavior of cutting-edge cooling solutions, mainly Phase Change Material (PCM)-based cooling, liquid cooling, and thermoelectric cooling (TEC), as well as hybrid designs involving a combination of these. The proposed models are modular and they enable fast and accurate exploration of a large design space. Comparisons against multi-physics simulations and measurements on testbeds validate the accuracy of our models (resulting in less than 1C error on average) and demonstrate significant reductions in simulation time (up to four orders of magnitude shorter simulation times). This thesis then introduces temperature-aware optimization techniques to maximize energy efficiency of a given system as a whole (including computing and cooling energy). The proposed optimization techniques approach the temperature problem from various angles, tackling major sources of inefficiency. One important angle is to understand the application power and performance characteristics and to design management techniques to match them. For workloads that require short bursts of intense parallel computation, we propose using PCM-based cooling in cooperation with a novel Adaptive Sprinting technique. By tracking the PCM state and incorporating this information during runtime decisions, Adaptive Sprinting utilizes the PCM heat storage capability more efficiently, achieving 29\% performance improvement compared to existing sprinting policies. In addition to the application characteristics, high heterogeneity in on-chip heat distribution is an important factor affecting efficiency. Hot spots occur on different locations of the chip with varying intensities; thus, designing a uniform cooling solution to handle worst-case hot spots significantly reduces the cooling efficiency. The hybrid cooling techniques proposed as part of this thesis address this issue by combining the strengths of different cooling methods and localizing the cooling effort over hot spots. Specifically, the thesis introduces LoCool, a cooling system optimizer that minimizes cooling power under temperature constraints for hybrid-cooled systems using TECs and liquid cooling. Finally, the scope of this work is not limited to existing advanced cooling solutions, but it also extends to emerging technologies and their potential benefits and tradeoffs. One such technology is integrated flow cell array, where fuel cells are pumped through microchannels, providing both cooling and on-chip power generation. This thesis explores a broad range of design parameters including maximum chip temperature, leakage power, and generated power for flow cell arrays in order to maximize the benefits of integrating this technology with computing systems. Through thermal modeling and runtime management techniques, and by exploring the design space of emerging cooling solutions, this thesis provides significant improvements in processor energy efficiency.