
Hitting the Memory Wall: Implications of the Obvious

Wm. A. Wulf and Sally A. McKee
Department of Computer Science
University of Virginia
{wulf | mckee}@virginia.edu
December 1994

This brief note points out something obvious: something the authors "knew" without really understanding. With apologies to those who did understand, we offer it to those others who, like us, missed the point.

We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed; each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one. How big and how soon? The answers to these questions are what the authors had failed to appreciate.

To get a handle on the answers, consider an old friend, the equation for the average time to access memory, where t_c and t_m are the cache and DRAM access times and p is the probability of a cache hit:
    t_avg = p × t_c + (1 − p) × t_m

We want to look at how the average access time changes with technology, so we'll make some conservative assumptions; as you'll see, the specific values won't change the basic conclusion of this note, namely that we are going to hit a wall in the improvement of system performance unless something basic changes.

First, let's assume that the cache speed matches that of the processor, and specifically that it scales with the processor speed. This is certainly true for on-chip cache, and allows us to easily normalize all our results in terms of instruction cycle times (essentially saying t_c = 1 cpu cycle). Second, assume that the cache is perfect. That is, the cache never has a conflict or capacity miss; the only misses are the compulsory ones. Thus (1 − p) is just the probability of accessing a location that has never been referenced before (one can quibble and adjust this for line size, but this won't affect the conclusion, so we won't make the argument more complicated than necessary).

Now, although (1 − p) is small, it isn't zero. Therefore as t_c and t_m diverge, t_avg will grow and system performance will degrade. In fact, it will hit a wall.
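To see the divergence numerically, here is a minimal sketch (ours, in C; the note itself contains no code) that evaluates the equation with t_c normalized to 1 cpu cycle and a 99% hit rate. The sample miss costs are illustrative values, not data from the note:

    #include <stdio.h>

    /* Average memory access time: hits cost t_c, misses cost t_m. */
    static double t_avg(double p, double t_c, double t_m)
    {
        return p * t_c + (1.0 - p) * t_m;
    }

    int main(void)
    {
        /* t_c normalized to 1 cpu cycle; 99% hit rate and the miss
           costs below are illustrative only. */
        double t_m;
        for (t_m = 4.0; t_m <= 256.0; t_m *= 4.0)
            printf("t_m = %3.0f cycles -> t_avg = %.2f cycles\n",
                   t_m, t_avg(0.99, 1.0, t_m));
        return 0;
    }

Even with only 1% of references missing, t_avg is eventually dominated by the miss term once t_m grows large enough.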


In most programs, 20-40% of the instructions reference memory [Hen90]. For the sake of argument let's take the lower number, 20%. That means that, on average, during execution every 5th instruction references memory. We will hit the wall when t_avg exceeds 5 instruction times. At that point system performance is totally determined by memory speed; making the processor faster won't affect the wall-clock time to complete an application.

Alas, there is no easy way out of this. We have already assumed a perfect cache, so a bigger/smarter one won't help. We're already using the full bandwidth of the memory, so prefetching or other related schemes won't help either. We can consider other things that might be done, but first let's speculate on when we might hit the wall.

Assume the compulsory miss rate is 1% or less [Hen90] and that the next level of the memory hierarchy is currently four times slower than cache. If we assume that DRAM speeds increase by 7% per year [Hen90] and use Baskett's estimate that microprocessor performance is increasing at the rate of 80% per year [Bas91], the average number of cycles per memory access will be 1.52 in 2000, 8.25 in 2005, and 98.8 in 2010. Under these assumptions, the wall is less than a decade away (see the sketch at the end of this discussion).

Figures 1-3 explore various possibilities, showing projected trends for a set of perfect or near-perfect caches. All our graphs assume that DRAM performance continues to increase at an annual rate of 7%. The horizontal axis shows various cpu/DRAM performance ratios, and the lines at the top indicate the dates these ratios occur if microprocessor performance (μ) increases at rates of 50% and 100%, respectively. Figure 1 assumes that cache misses are currently 4 times slower than hits; Figure 1(a) considers compulsory cache miss rates of less than 1%, while Figure 1(b) shows the same trends for caches with more realistic miss rates of 2-10%. Figure 2 is the counterpart of Figure 1, but assumes that the current cost of a cache miss is 16 times that of a hit. Figure 3 provides a closer look at the expected impact on average memory access time for one particular value of μ, Baskett's estimated 80%.

Even if we assume a cache hit rate of 99.8% and use the more conservative cache miss cost of 4 cycles as our starting point, performance hits the 5-cycles-per-access wall in 11-12 years. At a hit rate of 99% we hit the same wall within the decade, and at 90%, within 5 years. Note that changing the starting point (the "current" miss/hit cost ratio) and the cache miss rates doesn't change the trends: if the microprocessor/memory performance gap continues to grow at a similar rate, in 10-15 years each memory access will cost, on average, tens or even hundreds of processor cycles. Under each scenario, system speed is dominated by memory performance.

Over the past thirty years there have been several predictions of the imminent cessation of the rate of improvement in computer performance. Every such prediction was wrong. They were wrong because they hinged on unstated assumptions that were overturned by subsequent events. So, for example, the failure to foresee the move from discrete components to integrated circuits led to a prediction that the speed of light would limit computer speeds to several orders of magnitude slower than they are now.
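Here is the sketch promised above: a minimal reproduction (ours, in C) of the decade-away projection. The constants are the note's stated assumptions; taking 1995 as the base year for the 4-cycle miss cost is our reading of "currently", and the computed values (1.53, 8.25, 98.80) match the quoted 1.52/8.25/98.8 to within rounding:

    #include <stdio.h>
    #include <math.h>

    #define CPU_RATE   1.80  /* processor improves 80% per year [Bas91] */
    #define DRAM_RATE  1.07  /* DRAM improves 7% per year [Hen90]       */
    #define P_HIT      0.99  /* compulsory miss rate of 1% [Hen90]      */
    #define T_MISS_95  4.0   /* miss currently 4x the cost of a hit     */

    static double t_avg(int year)
    {
        /* The miss cost, measured in cpu cycles, grows each year by
           the ratio of the two improvement rates. */
        double t_m = T_MISS_95 * pow(CPU_RATE / DRAM_RATE, year - 1995);
        return P_HIT * 1.0 + (1.0 - P_HIT) * t_m;
    }

    int main(void)
    {
        int year;
        for (year = 2000; year <= 2010; year += 5)
            printf("%d: %.2f cycles\n", year, t_avg(year)); /* 1.53, 8.25, 98.80 */
        for (year = 1995; t_avg(year) <= 5.0; year++)
            ;
        printf("5-cycle wall reached in %d\n", year);       /* 2004 */
        return 0;
    }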


Our prediction of the memory wall is probably wrong too, but it suggests that we have to start thinking "out of the box". All the techniques that the authors are aware of, including ones we have proposed [McK94, McK94a], provide one-time boosts to either bandwidth or latency. While these delay the date of impact, they don't change the fundamentals.

The most "convenient" resolution to the problem would be the discovery of a cool, dense memory technology whose speed scales with that of processors. We are not aware of any such technology and could not affect its development in any case; the only contribution we can make is to look for architectural solutions. These are probably all bogus, but the discussion must start somewhere:

- Can we drive the number of compulsory misses to zero? If we can't fix t_m, then the only way to make caches work is to drive p to 100%, which means eliminating the compulsory misses. If all data were initialized dynamically, for example, possibly the compiler could generate special "first write" instructions. It is harder for us to imagine how to drive the compulsory misses for code to zero.

- Is it time to forgo the model that access time is uniform to all parts of the address space? It is false for DSM and other scalable multiprocessor schemes, so why not for single processors as well? If we do this, can the compiler explicitly manage a smaller amount of higher speed memory? (A toy sketch of this idea follows below.)

- Are there any new ideas for how to trade computation for storage? Alternatively, can we trade space for speed? DRAM keeps giving us plenty of the former. Ancient machines like the IBM 650 and Burroughs 205 used magnetic drum as primary memory and had clever schemes for reducing rotational latency to essentially zero; can we borrow a page from either of those books?

As noted above, the right solution to the problem of the memory wall is probably something that we haven't thought of, but we would like to see the discussion engaged. It would appear that we do not have a great deal of time.
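To make the second question concrete, here is a deliberately naive sketch (ours; nothing like it is prescribed by the note) of what explicitly staged access to a small fast memory might look like in source form. The scratchpad, its size, and the staging scheme are all invented for illustration; the hope in the question above is that a compiler would generate this pattern automatically:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical small, processor-speed memory ("scratchpad"). */
    enum { FAST_WORDS = 256 };
    static double fast_mem[FAST_WORDS];

    /* Scale an array that lives in slow memory by staging it through
       the fast memory one chunk at a time, rather than assuming a
       uniform access time across the whole address space. */
    void scale(double *slow, size_t n, double k)
    {
        size_t i, j, chunk;
        for (i = 0; i < n; i += chunk) {
            chunk = (n - i < FAST_WORDS) ? n - i : FAST_WORDS;
            memcpy(fast_mem, slow + i, chunk * sizeof *fast_mem); /* stage in  */
            for (j = 0; j < chunk; j++)
                fast_mem[j] *= k;                                 /* work fast */
            memcpy(slow + i, fast_mem, chunk * sizeof *fast_mem); /* stage out */
        }
    }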

[Figure: two log-scale panels, (a) and (b), plotting average cycles per memory access against the cache miss/hit cost ratio; panel (a) covers compulsory miss rates under 1%, panel (b) miss rates of 2-10%. The scales across the top date each ratio for processor improvement rates of 50% and 100% per year.]

Figure 1: Trends for a Current Cache Miss/Hit Cost Ratio of 4

[Figure: the same pair of panels as Figure 1, starting from a current miss cost 16 times that of a hit.]

Figure 2: Trends for a Current Cache Miss/Hit Cost Ratio of 16


[Figure: average cycles per memory access by year, for cache hit rates p of roughly 90.0%, 99.0%, and 99.9%.]

Figure 3: Average Access Cost for 80% Annual Increase in Processor Performance

References
[Bas91] F. Baskett, keynote address, International Symposium on Shared Memory Multiprocessing, April 1991.

[Hen90] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Mateo, CA, 1990.

[McK94] S. A. McKee et al., "Experimental Implementation of Dynamic Access Ordering", Proc. 27th Hawaii International Conference on System Sciences, Maui, HI, January 1994.

[McK94a] S. A. McKee et al., "Increasing Memory Bandwidth for Vector Computations", Proc. Conference on Programming Languages and System Architectures, Zurich, March 1994.

