Exploiting multi-threaded application characteristics to optimize performance and power of chip-multiprocessors
Abstract (Summary)

Chip multiprocessors (CMPs) are becoming a popular way of exploiting the ever-increasing number of on-chip transistors. Multi-threaded applications aim to utilize the raw computational power that CMPs provide more efficiently than is currently possible. However, current multi-threaded applications exhibit load imbalances at various levels. Moreover, the increasing capacity of on-chip storage and the rising cost of wire delays make the on-chip location of data critical: data must be placed in the right location at the right time in the on-chip cache hierarchy. In this study, we characterize the load imbalance at barriers, the imbalance among cache requests from different cores, and the demands on different blocks of the cache. Using the insights obtained from this characterization, we propose techniques that exploit such load imbalances to improve power and performance.

For the load imbalance at barriers, we observe that the imbalances are quite predictable. We propose a novel integrated hardware-software technique for optimizing the power consumption of CMPs. Using the barrier, a high-level synchronization construct, our technique tracks the idle time a processor spends waiting for other processors to arrive at the same point in the program. With this knowledge, processor frequencies can be modulated to reduce or eliminate idle time, providing power savings without compromising performance.

For the load imbalance imposed on the L2 cache by the different cores, we notice that the possible imbalance between the L2 demands across the cores favors a shared L2 organization, while the interference due to these demands favors a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, which captures the benefits of both organizations while avoiding many of their drawbacks.
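The barrier-driven frequency modulation described above can be illustrated with a minimal sketch. This is not the thesis's actual mechanism; it assumes a simple proportional policy in which each core's work for the next barrier interval is predicted (e.g., from the previous interval) and every non-critical core is slowed so that all cores arrive at the barrier together instead of spinning. All names and numeric values below are illustrative assumptions.

```python
# Hypothetical sketch: proportional per-core frequency scaling between
# barriers. The core with the most predicted work (the critical core)
# runs at nominal frequency; all others are slowed so they finish at
# the same wall-clock time, eliminating idle spin at the barrier.

F_NOMINAL = 2.0e9  # assumed nominal core frequency (Hz)
F_MIN = 0.5e9      # assumed lowest supported frequency (Hz)

def scale_frequencies(predicted_cycles, f_nominal=F_NOMINAL, f_min=F_MIN):
    """Return per-core frequencies that equalize barrier arrival times.

    predicted_cycles[i] is core i's predicted work for the next
    barrier interval, in cycles at nominal frequency.
    """
    critical = max(predicted_cycles)      # work on the critical core
    t_target = critical / f_nominal       # interval length to match
    freqs = []
    for cycles in predicted_cycles:
        f = cycles / t_target if t_target > 0 else f_nominal
        freqs.append(max(f_min, min(f_nominal, f)))  # clamp to DVFS range
    return freqs

# Example: core 2 is the bottleneck; cores 0, 1, and 3 are slowed.
freqs = scale_frequencies([1.0e9, 1.5e9, 2.0e9, 0.8e9])
```

Because dynamic power scales roughly with frequency (and super-linearly when voltage is scaled with it), the slowed cores save power while the interval's completion time, set by the critical core, is unchanged.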
We also study the demands on different blocks of the L2 cache, namely actively shared blocks and mostly privately accessed blocks. We show that, while a considerable number of L2 accesses target shared data, the actual volume of that data is relatively low. Consequently, it is important to keep the shared data close to all processor cores for both performance and power reasons. Motivated by this observation, we propose a small center cell cache residing in the middle of the processor cores that gives all cores fast access to its contents. We demonstrate that this cache organization can considerably lower the number of block migrations between the L2 portions that are closer to each core, thus providing better performance. Combined with sequential tag-data access, the power consumption of such a shared cache can be reduced further.
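The sequential tag-data access mentioned above is a standard set-associative cache power optimization: instead of reading all data ways in parallel with the tag lookup, the tag array is probed first and only the matching data way is then read, trading one cycle of latency for a large reduction in data-array energy. The sketch below is a toy energy model, not the thesis's evaluation; the per-array energy numbers are illustrative assumptions.

```python
# Hypothetical per-hit energy model for an N-way set-associative cache.
# Parallel access:   read the tag array and all N data ways at once.
# Sequential access: read the tag array first, then only the matching
#                    data way (one extra cycle of latency on a hit).

def access_energy(ways, e_tag, e_data_way, sequential):
    """Energy (arbitrary units) for one read hit.

    e_tag      -- energy to read one way's tag entry
    e_data_way -- energy to read one way's data subarray
    """
    tag_energy = ways * e_tag              # all tags are compared either way
    if sequential:
        return tag_energy + e_data_way     # exactly one data way read
    return tag_energy + ways * e_data_way  # all data ways read speculatively

# Example with assumed energies (say, picojoules) for an 8-way cache:
par = access_energy(ways=8, e_tag=1.0, e_data_way=5.0, sequential=False)
seq = access_energy(ways=8, e_tag=1.0, e_data_way=5.0, sequential=True)
```

Under these assumed numbers the sequential scheme reads one data way instead of eight per hit, which is why it pairs well with a shared cache whose access latency is already short, such as the proposed center cell cache.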
School Location:USA - Pennsylvania
Source Type:Master's Thesis
Date of Publication: