
Reducing Miss Rate And Penalty

Miss rate -- the fraction of memory accesses that are not found in the cache.
Cache -- chiefly the first level of the memory hierarchy encountered once the address leaves the CPU; the term applies more generally wherever buffering is employed to reuse commonly occurring items, e.g. file caches, name caches, and so on.

3 Cs of Cache Misses
Compulsory -- The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses. (Misses even in an infinite cache.)
Capacity -- If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X.)
Conflict -- If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative cache that would hit in a fully associative cache of size X.)

miss rate of a 1-way associative cache of size X ≈ miss rate of a 2-way associative cache of size X/2

Figure 1: Total miss rate (top) and distribution of miss rates (bottom) as a function of the three Cs.

1. Five Techniques to Reduce Miss Rate:
1. Larger block size
2. Larger caches
3. Higher associativity
4. Way prediction
5. Compiler optimizations

1. Larger Block Size:

Figure 2: Miss rate against block size for 5 different cache sizes.
Advantages -- Larger blocks exploit spatial locality of data/instructions and so reduce compulsory misses. By the principle of locality, the larger the block, the greater the chance that other parts of it will be used again.

Disadvantages -- Average memory access time becomes increasingly sensitive to the miss penalty as block size grows; conflict misses increase because larger blocks mean fewer blocks in the cache, so more main-memory blocks map to the same cache set (set associative).

2. Larger Caches:
Motivation -- Let more data/instructions reside in the cache; increasing the capacity of the cache reduces capacity misses.
Disadvantages -- Longer hit time and higher cost.
Table 1 lists average memory access time against block size for five cache sizes, with lowest time in bold.

Table 1 Average memory access time against block size for 5 cache sizes
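The entries of such a table follow from the relation: average memory access time = hit time + miss rate × miss penalty. The sketch below illustrates the calculation; the hit time, miss-penalty model and miss-rate figures are assumed values for illustration only, not the numbers behind Table 1.

#include <stdio.h>

/* AMAT = hit_time + miss_rate * miss_penalty.
 * All figures below are assumed for illustration. */
int main(void) {
    double hit_time = 1.0;                                   /* cycles  */
    double block_size[] = {16, 32, 64, 128, 256};            /* bytes   */
    double miss_rate[]  = {0.057, 0.047, 0.041, 0.041, 0.045};  /* assumed */
    for (int i = 0; i < 5; i++) {
        /* Assume larger blocks take longer to transfer from memory. */
        double miss_penalty = 80.0 + block_size[i] / 4.0;    /* cycles  */
        double amat = hit_time + miss_rate[i] * miss_penalty;
        printf("block %3.0f B: AMAT = %.2f cycles\n", block_size[i], amat);
    }
    return 0;
}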

3. Higher Associativity:
With respect to Figure 1, an 8-way set-associative cache is, for practical purposes, as effective as a fully associative one.
2:1 cache rule of thumb (for cache sizes < 128 KB): the miss rate of a direct-mapped cache of size N equals the miss rate of a 2-way set-associative cache of size N/2.

Comments -- Access time increases with associativity (Figure 3). Current practice is to use 1- to 4-way set-associative caches. A short CPU clock cycle rewards simple cache designs, e.g. an L1 cache; a high miss penalty rewards higher associativity, e.g. an L2 cache.

Figure 3: Access time as a function of associativity and cache size.

4. Way Prediction: As with dynamic branch prediction, each block in a set-associative cache has predictor bits (a sketch of the lookup follows the list below).

o Predictor selects which block of a set to try at the next cache access.
o The tag comparison on the predicted block is conducted in parallel with the read.
o If the prediction is incorrect, the CPU flushes the fetched instruction and the predictor is updated; tag comparisons continue for the remaining blocks of the set.
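A minimal sketch of the lookup flow, assuming a 2-way set-associative cache with one predicted way per set; the structure layout, block size and field names are illustrative assumptions, not taken from the text.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 256
#define NUM_WAYS 2

typedef struct { uint32_t tag; bool valid; } Line;
typedef struct { Line way[NUM_WAYS]; int predicted_way; } Set;

static Set cache[NUM_SETS];

/* Way-predicted lookup: probe the predicted way first (fast hit); on a
 * mispredict, check the remaining ways and retrain the predictor. */
bool lookup(uint32_t addr) {
    uint32_t index = (addr >> 6) & (NUM_SETS - 1);   /* assumed 64-byte blocks   */
    uint32_t tag   = addr >> 14;                     /* bits above offset+index  */
    Set *s = &cache[index];

    int guess = s->predicted_way;
    if (s->way[guess].valid && s->way[guess].tag == tag)
        return true;                                 /* fast hit on predicted way */

    for (int w = 0; w < NUM_WAYS; w++) {             /* slower path: other ways   */
        if (w == guess) continue;
        if (s->way[w].valid && s->way[w].tag == tag) {
            s->predicted_way = w;                    /* update the predictor      */
            return true;                             /* slow hit                  */
        }
    }
    return false;                                    /* miss                      */
}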

5. Compiler Optimizations: Independent of any specific hardware unit; concentrate on reorganising code to increase the degree of temporal and spatial locality.

Consider two specific examples: (a) loop interchange and (b) blocking.


(a) Loop Interchange
Objective -- Improve spatial locality.
Methodology -- The cache does not have the capacity for all of the matrix content. The order of loop nesting gives good spatial locality if data are accessed in the order in which they are laid out in memory; cache thrashing results if capacity or conflicts prevent all of the data residing in the cache at once. Consider the sketch below.
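A minimal sketch of the transformation, assuming a 5000 x 100 integer array and a simple scaling statement (both are illustrative choices, not taken from the text).

static int x[5000][100];   /* example array; C stores it in row-major order */

/* Before: the inner loop strides through x in steps of 100 words,
 * so spatial locality is poor. */
void scale_before(void) {
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 5000; i++)
            x[i][j] = 2 * x[i][j];
}

/* After loop interchange: the inner loop walks each row sequentially,
 * so every word of a fetched cache block is used before it is evicted. */
void scale_after(void) {
    for (int i = 0; i < 5000; i++)
        for (int j = 0; j < 100; j++)
            x[i][j] = 2 * x[i][j];
}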

(b) Blocking
Objective -- Improve temporal locality of data.

Problem
o We want to access the same data in both row and column order.
o The cache does not have the capacity for more than a row or a column, i.e. reading an entire row or column into the cache is right for one set of operations but wrong for the other.
Methodology
o Complete all row and column operations on a block of data in one pass.
Consider the following code,

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        r = 0;
        for (k = 0; k < N; k++) {
            r = r + y[i][k] * z[k][j];
        }
        x[i][j] = r;
    }
}

One value of r requires an entire row of y and an entire column of z. The process is summarised by Figure 4; there is potential for domination by capacity misses.

o If the cache can hold all three N by N matrices, there is no problem (barring conflicts).
o If the cache can hold one N by N matrix (z) plus one row, then the current row of y and the whole of z may stay in the cache; otherwise each row of y has to be re-fetched for every one of its N uses.
o If the cache holds fewer than N² + N elements, the likelihood of cache thrashing increases.
o Worst case: 2N³ + N² memory accesses for N³ operations -- every read of y and z misses (about N³ accesses each) and x contributes its N² writes.

Figure 4: Naive matrix multiplication (outer product) for index i = 1.
Solution
o Divide the matrices into sub-blocks of size B by B;
o The value of each result r is built up incrementally over all blocks;
o The inner loops compute over the sub-blocks, Figure 5.

Figure 5: Block based outer product multiplication.

o B, the blocking factor, is selected to ensure that the sub-blocks of all three variables (x, y, z) reside in the cache (or registers); a sketch of one way to choose B follows the code.
o The number of capacity misses is now about 2N³/B + N² accesses for N³ operations, i.e. an improvement by a factor of roughly B.
The following code illustrates a possible block-based implementation.

for (jj = 0; jj < N; jj = jj + B) {
    for (kk = 0; kk < N; kk = kk + B) {
        for (i = 0; i < N; i++) {
            for (j = jj; j < min(jj + B, N); j++) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k++) {
                    r = r + y[i][k] * z[k][j];
                }
                x[i][j] = x[i][j] + r;   /* accumulate across kk blocks; x must start at zero */
            }
        }
    }
}
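One way to choose the blocking factor B, assuming a known data-cache size and double-precision elements (the cache size here is an example; the 3·B² condition reflects the requirement above that sub-blocks of x, y and z all fit):

#include <stdio.h>

/* Largest B such that one B-by-B sub-block of each of x, y and z
 * fits in the data cache. */
int choose_blocking_factor(unsigned long cache_bytes, unsigned long elem_bytes) {
    int B = 1;
    while (3UL * (B + 1) * (B + 1) * elem_bytes <= cache_bytes)
        B++;
    return B;
}

int main(void) {
    /* e.g. a 32 KB data cache and 8-byte doubles (assumed figures) */
    printf("B = %d\n", choose_blocking_factor(32 * 1024, sizeof(double)));
    return 0;
}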

2. Reduction of Miss Penalty:
1. Early restart and critical word first
2. Merging write buffers

1. Early Restart And Critical Word First: Don't wait for the full block to arrive before restarting the CPU.
Early restart -- As soon as the requested word of the block arrives, send it to the CPU and continue execution. Spatial locality means the next sequential word is likely to be wanted soon, so it may still pay to fetch the rest of the block.
Critical word first -- Request the missed word from memory first and send it to the CPU right away; let the CPU continue while the rest of the block is filled.
Long blocks are more popular today, so critical word first is widely used; a sketch of the fill order follows.
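A minimal sketch of the wrap-around fill order used by critical word first, assuming an 8-word block; the function and its output are illustrative only.

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Fetch the missed word first, forward it to the CPU, then wrap around
 * and fill the remaining words of the block. */
void fill_critical_word_first(int requested_word) {
    for (int i = 0; i < WORDS_PER_BLOCK; i++) {
        int word = (requested_word + i) % WORDS_PER_BLOCK;
        if (i == 0)
            printf("word %d -> CPU (execution restarts here)\n", word);
        else
            printf("word %d -> cache (fill continues in background)\n", word);
    }
}

int main(void) {
    fill_critical_word_first(5);   /* miss on word 5 of the block */
    return 0;
}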

2. Merging Write Buffer


A write buffer lets the processor continue while waiting for writes to complete.
If the buffer contains modified blocks, the addresses can be checked to see whether the new data's address matches an existing write-buffer entry; if so, the new data are merged with that entry. For sequential writes in write-through caches, this increases the size of each write to memory, which is more efficient. The Sun T1 (Niagara) and many other processors use write merging; a sketch of the merging check follows.
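A minimal sketch of the merging check, assuming a four-entry buffer where each entry covers one 16-byte (4-word) block; the structure and field names are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define ENTRIES          4
#define WORDS_PER_ENTRY  4            /* each entry covers one 16-byte block */

typedef struct {
    bool     valid;
    uint32_t block_addr;              /* 16-byte-aligned block address       */
    uint32_t data[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];
} WriteEntry;

static WriteEntry buffer[ENTRIES];

/* Returns true if the write was merged or placed in a free entry;
 * false means the buffer is full and the processor must stall. */
bool write_buffer_put(uint32_t addr, uint32_t value) {
    uint32_t block  = addr & ~0xFu;       /* block address                   */
    int      offset = (addr >> 2) & 0x3;  /* word offset within the block    */

    /* Write merging: if an entry already holds this block, fold the word in. */
    for (int i = 0; i < ENTRIES; i++) {
        if (buffer[i].valid && buffer[i].block_addr == block) {
            buffer[i].data[offset]       = value;
            buffer[i].word_valid[offset] = true;
            return true;
        }
    }
    /* Otherwise claim a free entry for the new block. */
    for (int i = 0; i < ENTRIES; i++) {
        if (!buffer[i].valid) {
            memset(&buffer[i], 0, sizeof buffer[i]);
            buffer[i].valid              = true;
            buffer[i].block_addr         = block;
            buffer[i].data[offset]       = value;
            buffer[i].word_valid[offset] = true;
            return true;
        }
    }
    return false;                     /* buffer full: stall                  */
}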
