Critical Words Cache Memory
The major constraints on increasing computer performance are power dissipation and memory latency. These constraints have led to increases in secondary cache memory (L2$) capacity to minimize the occurrence of power-intensive and slow off-chip main memory accesses. However, as they have grown, secondary cache memories have become a large part of total processor power dissipation, and their access time has increased in terms of processor clock cycles. Most cache memory architecture research has focused on the primary cache memory (L1$) or the overall cache hierarchy. In contrast, architectural improvements to the L2$ have usually been simple increases in capacity and associativity.
Our research concerns two previously unexamined attributes of L1$ misses and a novel architectural means to reduce the average hit time and power dissipation of L2$ designs without negatively impacting their hit rates. We investigate both a form of sequence regularity in L1$ miss streams and the number of critical words within cache blocks as indicators of the potential for memory hierarchy speed and power improvements resulting from segregating the L2$ treatment of so-called critical and non-critical words. We call the form of sequence regularity "critical word regularity" (CWR), the number of critical words within a cache block the "critical footprint size" (CFS), and cache memories whose architectures exploit CWR and CFS "critical words cache" (CW$) memories. We describe practical CW$ architectures, operating methods, and implementation approaches. We show that CW$ memories offer dramatically higher performance than standard cache architectures employing the well-known critical-word-first bus protocols.
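As a concrete illustration of the two metrics just defined, the following sketch computes CWR and CFS from a stream of L1-miss byte addresses. It reflects our reading of the definitions above, not code from the thesis; the block size, word size, and the treatment of each miss's demand-target word as the critical word are assumptions for illustration.

```python
from collections import defaultdict

# Illustrative parameters; the thesis studies several configurations.
BLOCK_BYTES = 64   # assumed L1 block size
WORD_BYTES = 4     # assumed machine word size

def critical_word_stats(miss_addrs):
    """Compute CWR and CFS over a stream of L1-miss byte addresses.

    The critical word of a miss is taken to be the word the demand
    access targets. CWR is the fraction of repeat misses to a block
    whose critical word equals that block's previous critical word;
    CFS is the number of distinct critical words observed per
    repeatedly missed block.
    """
    last = {}                     # block -> critical word of last miss
    counts = defaultdict(int)     # block -> number of misses seen
    footprint = defaultdict(set)  # block -> distinct critical words
    repeats = matches = 0
    for addr in miss_addrs:
        block = addr // BLOCK_BYTES
        word = (addr % BLOCK_BYTES) // WORD_BYTES
        counts[block] += 1
        footprint[block].add(word)
        if block in last:
            repeats += 1
            matches += (last[block] == word)
        last[block] = word
    cwr = matches / repeats if repeats else 0.0
    cfs = {b: len(footprint[b]) for b in footprint if counts[b] > 1}
    return cwr, cfs
```

A CWR of 1.0 would mean every repeat miss to a block targeted the same word as the previous miss to that block; a CFS of 1 means only one word in the block was ever critical.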
Our investigation consisted of four major phases, each of which employed a trace-driven cache simulation experiment. The goal of the first phase was to determine whether significant CWR exists in the load miss stream of a primary data cache memory (L1D$). Having found this to be the case, we made initial estimates of potential CW$ performance. The second phase sought to quantify the CWR and CFS in the load miss streams of the SPEC CPU 2000 collection of benchmark applications across nine L1D$ configurations. The CWR results of the second experiment were then used to estimate both secondary CW$ coverage of L1D$ load misses and the overall performance of a computer system with a memory hierarchy that includes a CW$. The third phase of our investigation built on the second and measured CWR and CFS more completely. The range of benchmarks was expanded in the third phase, and the CWR of instruction fetch misses and data store misses was measured in addition to that of data load misses. The CFS distributions were also measured to better estimate the resource requirements for practical CW$ memories. The fourth and final phase of our investigation determined the workload performance improvements obtainable with practical CW$ memories of various capacities, configurations, operating methods, and implementations. We also further explored the cost and performance tradeoffs made possible by exploitation of CWR and CFS using a CW$ secondary cache architecture.
Our investigation shows that sufficient CWR exists in both data and instruction miss streams for the segregation of the critical words in L2$ blocks to be worthwhile. The average CWR for all miss types in both SPEC CPU 2000 and 2006 workloads was found to range from almost 40% up to 90% across a wide range of L1$ configurations. CWR was found to depend primarily on the workload and secondarily on the cache configuration. We also found that, on average, more than half of all cache blocks that are repeatedly missed in an L1$ have only one critical word, even in L1$ designs with large, 128-byte blocks. With one exception, in all of the L1$ configurations we examined, at most one quarter of the words were ever critical in more than 77% of the repeatedly missed cache blocks in the data load miss streams. We used our CWR and CFS results to estimate that exploitation of criticality in L1$ miss streams by a secondary CW$ has the potential to cover more than 60% of L1D$ load misses more quickly and efficiently than standard architecture cache memories. Several practical CW$ configurations were found that achieve average L2$ hit coverage in excess of 70%. CW$ hit coverage was also found to scale well, generally increasing with overall cache capacity.
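The way hit coverage translates into average L2$ service time can be sketched with a simple weighted-latency model. Only the 70% coverage figure echoes the text; the cycle counts and miss rate below are hypothetical placeholders, not measurements from the thesis.

```python
def avg_l2_service_cycles(cw_coverage, cw_hit, full_hit, miss_rate, miss_penalty):
    """Average cycles to service an L1 miss at the L2 level when a
    fraction `cw_coverage` of L2 hits is served by a fast
    critical-word path. All latencies are illustrative assumptions."""
    # Blend the fast critical-word hit path with the full-block hit path,
    # then weight hits against L2 misses that go to main memory.
    hit_cycles = cw_coverage * cw_hit + (1 - cw_coverage) * full_hit
    return (1 - miss_rate) * hit_cycles + miss_rate * miss_penalty

# Hypothetical example: 70% of hits on an 8-cycle critical-word path,
# the rest on a 14-cycle full path, with a 10% L2 miss rate and a
# 200-cycle memory penalty.
example = avg_l2_service_cycles(0.7, 8, 14, 0.1, 200)
```

Under these placeholder numbers the model shows how raising critical-word coverage pulls the hit component toward the fast path's latency without touching the miss penalty term.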
The CW$ architecture offers significant reductions in both L2$ average access time and dynamic power consumption. Our timing, power, and area estimates of practical CW$ memories indicate that the CW$ architecture is more power- and area-efficient even for a small, 128 KB capacity L2$, although at that capacity a CW$ offers no system-level execution time advantage over the standard serial L2$ architecture. In contrast, the CW$ advantages for a large capacity L2$ are profound. For example, compared to an 8 MB capacity standard serial access cache memory, a comparable CW$ would be more than 25% faster, while using 76% less energy per hit access and occupying 62% less area. Another 8 MB CW$ configuration would use 40% less energy per hit access, occupy 63% less die area, and be 32% faster, while dissipating only 3% more leakage power than the comparable standard serial cache memory. If these CW$ memories replaced a standard architecture critical-word-first 8 MB L2$ in a high-performance computer system, the resulting systems would be 19% and 22% faster, respectively.
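The step from a faster L2$ component to a faster system can be approximated with Amdahl's law. This is a standard relation, not the evaluation method used in the thesis, and the fraction of execution time attributed to L2$ hit accesses below is a hypothetical placeholder.

```python
def system_speedup(frac_l2_time, component_speedup):
    """Amdahl's-law estimate of overall speedup when the portion of
    execution time spent in L2$ hit accesses (frac_l2_time, an
    assumed fraction) is accelerated by component_speedup."""
    return 1.0 / ((1.0 - frac_l2_time) + frac_l2_time / component_speedup)

# Hypothetical example: if half of execution time were spent in L2$
# hit accesses and those accesses became 2x faster, the system-level
# speedup would be well below 2x.
example = system_speedup(0.5, 2.0)
```

This shape of relation explains why a 25-32% faster L2$ yields a smaller but still substantial system-level gain: the unaccelerated portion of execution time bounds the overall improvement.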
We also show that secondary CW$ memories provide significant architectural flexibility, enabling smaller, faster, and more power-efficient cache memories to be used without degrading overall memory hierarchy performance or efficiency. In turn, smaller and thus lower-latency cache memories may enable simpler (e.g., single-threaded, in-order) processor cores to be used. Taken together, these architectural possibilities would yield significantly more power-efficient and higher-performance computers for the same cost and fabrication technology.
School: University of Cincinnati
School Location: USA - Ohio
Source Type: Master's Thesis
Keywords: computer architecture, cache memory, critical word, criticality, regularity, footprint
Date of Publication: 01/01/2008