Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing
Many-core systems, ranging from small-scale many-core processors to large-scale high performance computing (HPC) data centers, have become the main trend in computing system design owing to their potential to deliver higher throughput per watt. However, power densities and temperatures increase following the growth in the performance capacity, and bring major challenges in energy efficiency, cooling costs, and reliability. These challenges require a joint assessment of performance, power, and temperature tradeoffs as well as the design of runtime optimization techniques that monitor and manage the interplay among them. This thesis proposes novel modeling and runtime management techniques that evaluate and optimize the performance, energy, and reliability of many-core systems. We first address the energy and thermal challenges in 3D-stacked many-core processors. 3D processors with stacked DRAM have the potential to dramatically improve performance owing to lower memory access latency and higher bandwidth. However, the performance increase may cause 3D systems to exceed the power budgets or create thermal hot spots. In order to provide an accurate analysis and enable the design of efficient management policies, this thesis introduces a simulation framework to jointly analyze performance, power, and temperature for 3D systems. We then propose a runtime optimization policy that maximizes the system performance by characterizing the application behavior and predicting the operating points that satisfy the power and thermal constraints. Our policy reduces the energy-delay product (EDP) by up to 61.9% compared to existing strategies. Performance, cooling energy, and reliability are also critical aspects in HPC data centers. In addition to causing reliability degradation, high temperatures increase the required cooling energy. Communication cost, on the other hand, has a significant impact on system performance in HPC data centers. This thesis proposes a topology-aware technique that maximizes system reliability by selecting between workload clustering and balancing. Our policy improves the system reliability by up to 123.3% compared to existing temperature balancing approaches. We also introduce a job allocation methodology to simultaneously optimize the communication cost and the cooling energy in a data center. Our policy reduces the cooling cost by 40% compared to cooling-aware and performance-aware policies, while achieving comparable performance to performance-aware policy.
Thesis (Ph.D.)--Boston University