You are on page 1of 100

Speculate Ambitiously

To speculate ambitiously requires three capabilities:


• The ability of the complier to find instructions


that, with the possible use of register renaming,
can be speculatively moved and not affect the
program data flow.
• The ability to ignore exceptions in speculated
instructions, until we know that such exceptions
should really occur.
• The ability to speculatively interchange loads and
stores, or stores and stores, which may have
address conflicts.
• The first of these is a complier capability, while
the last two require hardware support .
Hardware Support for
Preserving Exception Behavior
• To speculate ambitiously, we must be able to
move any type of instruction and still preserve
its exception behavior.
• The key to being able to do this is to observe that
the results of a speculated sequence that is
mispredicted will not be used in the final
computation , and such a speculated
instruction should not cause an exception.
Investigated for supporting
more ambitious speculation
• There are four methods that have been
investigated for supporting more ambitious
speculation without introducing erroneous
behavior:
– The hardware and operating system
cooperatively ignore exceptions for
speculative instructions. This approach
preserves exception behavior for correct
programs, but not for incorrect ones. This
approach may viewed as unacceptable for
some programs, but it has been used,
under program control , as a “fast mode”
in several processors.
– Speculative instructions that never raise
exceptions are used, and checks are
Investigated for supporting
more ambitious speculation
– A set of status bits, called poison bits, are
attached to the result registers written by
speculated instructions when the
instructions cause exceptions. The poison
bits cause a fault when a normal
instructions attempts to use the register.
– A mechanism is provided to indicate that an
instruction is speculative , and the
hardware buffers the instruction result
until it is certain that the instruction is no
longer speculative.
Schemes
• To know the schemes, we need to distinguish
between exceptions that indicate a program
error and would normally cause termination,
such as a memory protection violation, and
those that are handled and normally resumed,
such as a page default.
• Exception that can be resumed can be accepted
and processed for speculative instructions just
as if they were normal instructions.
• If the speculative instruction should not have
been executed, handling the unneeded
exception may have some negative
performance effects, but it cannot cause
incorrect execution.
Schemes..
• The cost of these exceptions may be high, however, and
some processors use hardware support to avoid taking
such exceptions, just as processors with hardware
speculation may take some exceptions in speculative
mode, while avoiding others until an instruction is known
not to be speculative.
• Exceptions that indicate a program error should not occur in
correct programs, and the result of a program that gets
such a exception is not well defined, except perhaps
when the program is running in a debugging mode.
• If such exceptions arise in speculated instructions, we
cannot take the exception until we know that the
instruction is no longer speculative.
Schemes …
• In the simplest method for preserving exceptions,
the hardware and the operating system handle
all resumable exceptions when the exception
occurs and simply return an undefined value for
any exception that would cause termination.
• If the instruction generating the terminating
exception was not speculative , then the
program is in error.
• Note the instead of terminating the program, the
program is allowed to continue, although it will
almost certainly generate incorrect results.
• If the instruction generating the terminating
exception is speculative, then the program may
be correct and the speculative result will simply
be unused.
Schemes …
• Thus, returning an undefined value for the
instruction cannot be harmful.
• This scheme can never cause a correct program
to fail, no matter how much speculation is
done.
• An incorrect program, which formerly might have
received a terminating exception, will get an
incorrect result.
• This a acceptable for some programs, assuming
the complier can also generate a normal
version of the program, which does not
speculate and can receive a terminating
exception.
Schemes …
• In such a scheme, it is not necessary to know that
an instruction is speculative. Indeed, it is
helpful only when a program is in error and
receives a terminating exception on a normal
instruction; in such cases, if the instruction
were not marked as speculative, the program
could be terminated.
• A second approach to preserving exception
behavior when speculating introduces
speculative versions of instructions that do not
generate terminating exceptions and
instructions to check for such exceptions. This
combines preserves the exception behavior
exactly.
Schemes …
• A third approach for preserving exception tracks exceptions
as they occur but postpones of the exception, although
not in a completely precise fashion.
• The scheme is simple: a poison bit is added to every
register, and another bit is added to every instruction to
indicate whether a speculative instructions results in a
terminating exception; all other exceptions are handled
immediately.
• If a speculative instruction uses a register with a poison bit
turned on, the destination register of the instruction
simply has its poison bit turned on,
• If a normal instruction attempts to use a register source
with its poison bit turned on, the instruction causes a
fault .
• In this way, any program that would have generated an
exception still generates one, albeit at the first instance
where a result is used by an instruction that is not
speculative.
Schemes …
• Since poison bits exist only on register values and not
memory values, stores are never speculative and thus
trap if either operand is “poison”.
• One complication thus must be overcome is how the OS
saves the user registers on a context switch if the poison
bit is set.
• A special instruction is needed to save and reset the state
of the poison bits to the avoid this problem.
• The fourth and final approach listed in earlier relies on a
hardware mechanism that operates like a reorder buffer.
• In such an approach, instructions are marked by the
complier as speculative and include an indicator of how
many branches the instruction was speculatively moved
across and what branch action (taken/not taken) the
complier assumed.
• The last piece of information basically tells the hardware
the location of the code block where the speculated
instruction originally was .
Schemes …
• In practice, most of the benefit of speculative
instruction is marked by a sentinel, which tells
the hardware that the earlier speculative
instruction is no longer speculative and values
may be committed.
• All instructions are placed in a reorder buffer
when issued and are forced to commit in order,
as in hardware speculation approach.(Notice ,
though, that no actual speculative branch
prediction or dynamic scheduling occurs).
• The reorder buffer tracks when instructions are
ready to commit and delays the “write-back”
portion of any speculative instruction.
Schemes …
• Speculative instructions are not allowed to
commit until the branches that have been
speculatively moved over are also ready to
commit, or ,alternatively until the
corresponding sentinel is reached.
• At that point, we know whether the speculated
instructions should have been executed or not.
• If it should have been executed and it generated
a terminating exception, then we know that the
program should be terminated.
• If the instruction should not have been executed
then the can be ignored.

Example 1
• Consider the code fragment from if-the-else
statement of the form:
 If (A==0) A=B; else A=a+4;Where A is at 0(R3) and B
is at 0(R2). Assume the then clause is almost always
executed. Compile the code using complier-based
speculation. Assume R14 is unused and available.
 LD R1,0(R3) ; load A
 B NE Z R1,L1 ;tes t A
 LD R1,0(R2) ; then c laus e
J L2 ; S k ip els e
L1: DA DDI R1,R1, #4; els e c laus e
L2: SD R0,0(R3) ; s tore A
Example 2
• Show how the code using a speculative load (sLD)
and a speculation check instruction (SPECCK) to
completely preserve exception behavior.
Assume R14 is unused and available.
LD R1,0(R3) ; load A

B NE Z R1,L1 ;tes t A
LD R1,0(R2) ; then c laus e
J L2 ; S k ip els e
L1: DA DDI R1,R1, #4; els e c laus e
L2: SD R0,0(R3) ; s tore A
Example 3
• Consider the code fragment and show how it
would be complied with speculative instructions
and poison bits. Show where an exception for
the speculative memory reference would be
recognized. Assume R14 is unused and
available. LD R1,0(R3) ; load A
B NE Z R1,L1 ;tes t A
LD R1,0(R2) ; then c laus e
J L2 ; S k ip els e
L1: DA DDI R1,R1, #4; els e c laus e
L2: SD R0,0(R3) ; s tore A
memory Reference
Speculation
• Moving loads across stores is usually done when
the complier is certain the address do not
conflict.
• A special instruction to check for address conflicts
can be included in the architecture,
• The special instruction is left at the original
location of the load instruction (and acts like a
guardian), and the load is moved up across or
more stores.
• When a speculated load is executed, the
hardware saves the address of the accessed
memory location.
• If a subsequent store the address of the location
before the check instruction, then the
speculation has failed.
memory Reference
Speculation …
• If the location has not been touched, then the
speculation is successful.
• Speculation failure can be handled in two ways.
– If only the load instruction was speculated,
then it suffices to redo the load at the
point of the check instruction (which could
supply the target register in addition to
the memory address).
– If additional instructions that depend on the
load were also speculated , then a fix-up
sequence that reexecutes all the
speculation instructions starting with the
load is needed.
– In this case, the check instruction specifies
the address where the fix-up code is
Sections To be covered from
Chapter 4
• Section 4.5
• Software versus Hardware based
scheduling
Memory Hierarchy

Memory Hierarchy & Cache


Memory
Entry Quiz
1.Primary cache or level 1 cache is
implemented in a separate chip

a.True
b.False
Entry Quiz
2.SRAM is implemented using

a.Flip-Flop
b.Magnetic core
c.Capacitor
d.Non-volatile Technology
Entry Quiz
3.Main memory (200 ns) is slower
compared register (0.2 ns) by an
order of

a.3
b.4
c.1000
d.10000
Entry Quiz
4.Virtual Memory is
5.
a.Same as caching
b.Same as associative memory
c.Different from caching
d.Same as disk memory
e.
Entry Quiz
5.Cache Miss occurs when the
a.Required instruction is not found in
the cache
b.Required data is not found in the
cache
c.Required instruction or data is not
found in the cache
d.Required instruction or data is not
found in the main memory
e.For all of the above conditions
Module Objective
• To understand
1. Memory requirements of different computers
2. Memory hierarchy and the motivation behind it 
3. Moore’s Law 
4. Principles of Locality 
5. Cache Memory and its implementation
6. Cache Performance
7. Terms: Cache, Cache Miss, Cache Hit, Latency,
Bandwidth, SRAM, DRAM, by an order of,
Direct Mapping, Associative Mapping, Set
Mapping, Write Through, Write Allocated,
Write Back, Dirty Bit, and Valid Bit
Memory Requirements
• In general we would like to have
– Faster Memory (lower access
time or latency)
– Larger (capacity and bandwidth)
Memory
– Simpler Memory

Memory Requirements - Server, Desktop,
and Embedded devices
nDesktop • Server nEmbedded
nLower – Lower Access nLower
time
Access time Access
– Higher time
nLarger
Bandwidth*
Memory nSimpler
– Better
Protection* Memory*
– Larger Memory
Processor-Memory Gap
Moore’s Law
• Transistor density on a chip dye
doubles every couple (1.5) of years.

• Short reference:
http://en.wikipedia.org/wiki/Moore's_law

What is Memory Hierarchy
and Why?
250 ns

Main
Memor
CPU y
0.25 ns
(Registe
rs)
Bus Storage &
Adapte I/O devices
r
2,500,000 ns!
Memory Hierarchy &
Cache
Cache
• Cache is a smaller, faster, and expensive
memory.
• Improves the througput/latency of slower
memory next to it in the memory
hierarchy.
• Blocking reads and delaying the writes to
slower memory offers better
performance.
• There are two cache memories L1 and L2
between CPU and main memory.
• L1 is built into CPU.
• L2 is an SRAM.
Cache Operation
• Cache Hit: CPU finds the required data item (or
instruction) in the cache.
• Cache Miss: CPU does not find the required item
in the cache.
– CPU Stalled
– Hardware loads the entire block that contains the
required data item from the memory into cache.
– CPU continues to execute once the cache loaded.
• Spatial Locality
• Principle of Locality
Hit and Miss
• Hit
– Hit Rate
– Hit Time
• Miss
– Miss Rate Higher Level
– Miss Penalty
• Hit Time << Miss Penalty Lower Level
Cache Performance
Program Execution Time
• CPU Clock Cycle
• Cycle Time
• IC – Instruction Count
• Program Execution Time (Simple
model)

 = (Useful CPU Cycles + Stalled CPU


Cycles) x
 Cycle Time
Stalled CPU Cycles
 Stalled CPU Cycles

 = Number of Cache Misses X Miss


Penalty

 = IC X Memory Access X Miss Rate X Miss


Penalty Number of Cache reference
 Instruction

 Note: Unit of Miss Penalty is in CPU


cycles
Separating Read Miss from
Write Miss
Stalled CPU Cycles

= IC X Memory Access X Miss Rate X Miss Penalty


Instruction

= IC X Read Access X Read Miss Rate X Read Penalty +


Instruction
IC X Write Access X Write Miss Rate X Write Penalty

Instruction
Example 1
• Assume we have a computer where Clock
cycles Per Instruction (CPI) is 1.0 when
all memory accesses are cache hits. The
only data accesses are loads and stores
and these total 50% of the instructions.
If the miss penalty is 25 clock cycles and
miss rate is 2%, how much faster would
the computer be if all instructions were
cache hits?
Example 1 …
1.Execution time for the program in a
computer with 100 % hits

 = (Useful CPU Cycles + Stalled CPU


Cycles) x Cycle Time
 = (IC X CPI + 0) X Cycle Time
 = IC X 1.0 X Cycle Time

Example 1 …
2.Stall Cycle when there is one or
more Cache Misses:
 = IC X Memory Access X Miss Rate X Miss Penalty
 Instruction
 = IC X (1 + 0.5) X 0.02 X 25 Cycle Time
 = IC X 0.75 Cycle Time

3.Total Execution Time:


 = (IC X 1 + IC X 0.75) Clock Cycles
 = 1.75 IC Clock Cycles
Example 1 …

4.Ratios
 = 1.75 IC Clock Cycles / IC X 1.0 X Cycle
Time

 = 1.75

Execution time is 1.75 faster in a computer


with no Misses.


Example 2
• Assume we have a computer where Clock
cycles Per Instruction (CPI) is 1.0 when
all memory accesses are cache hits. The
only data accesses are loads and stores
and these total 50% of the instructions.
Miss penalty for read miss is 25 clock
cycles and miss penalty for write miss is
50 clock cycles. Miss rate is 2% and out
of this 80% is Read miss. How much
faster would the computer be if all
instructions were cache hits?
Cache Operation
0
1
Line 2
Number Tag Block 3 Block
( K Words )
0    
1    
2    

CPU    
..
   
.
..
    .
C= 16    

Block Length
(K Words) Block

Cache 2N - 1
Word Length
(K Words)

Main Memory
Elements of Cache Design
• Cache Size
• Block Size
• Mapping Function
• Replacement Algorithm
• Write Policy
• Write Miss
• Number of caches
• Split versus Unified/Mixed Cache

Mapping Functions
Address
From CPU Tag Line Word Tag
Block 0

Line 0 Block 1

1. Select
Line 1
2. Copy
3.Compare
5. Load Line 2
+
4. Hit
Block n-1
Miss Line 3 (m)

To main memory Block n

Cache
Main Memory
Mapping Function
• Direct
– Line value in address uniquely points to a line in cache. 1
tag Comparison
• Set Associative
– Line value in address points to a set of lines in cache
(typically 2/4/8, so 2/4/8 tag comparisons). This is
known as 2/4/8 way Set Associative.
• Associative
– Line value is always 0. This means Line points to all the
lines in cache (4 (m) tag Comparisons)
– Uses Content Addressable Memory (CAM) for
comparison.
– Needs non-trivial replacement algorithm

Address 0
From CPU Tag Line Word Tag 1
2 Block 0
3
5 2 2 3 Line 0
4
5 Block 1
6
Line 1 7
1. Select
SET 0
2. Copy
3.Compare Line 2
5. Load
+ 504
4. Hit 505
Line 3 Block 126
506
Miss
507
SET 1 508

To main memory 509


Block 127
510
Cache 511
Memory
Mapping Function
Comparison

Cache Type Hit Ratio Search Speed


Direct Mapping Good Be s t
Fully Associative Be s t M ode rate
Ve ry Good, Bette r Good, Wors e as N
N-Way Set Associative as N Incre as e s Incre as e s
Replacement Algorithm
• Least Recently Used
• First In First Out
• Least Frequently Used
• Random
Write Policy
• Write-through
– Information written to cache and the memory

• Write back
– Information written only to cache. Content of the
cache is written to the main memory only when
this cache block is replaced or the program
terminates.

• Dirty bit is used to indicate that a cache block


needs “Write back”
Write Miss
• Write-Allocate
• No Write-Allocate
Number of Caches
• Primary Cache (CPU)
• Secondary Cache (SRAM)
• L3 Cache (Cheaper SRAM)

Split versus Unified/Mixed
Cache
nS in g le o r tw o Le ve l
nU n ifie d o r S p lit

nM isse s p e r 1 0 0 0 in stru ctio n s w ith va rio u s C a ch e

size
n

Instruct
Size (KB) Cache
8 8.
Improving Cache
Performance
• Reducing Penalty
• Reducing Misses
– Compiler optimization attempts to reduce
the Cache Misses falls under this category

Stalled Cycles

= IC X Memory Access X Miss Rate X Miss Penalty


 Instruction
Improving Cache
Performance Using Compiler
• Compilers are built with the following
optimization:
 Instructions:
 Reordering instructions to avoid conflict misses
 Data :
 Merging Arrays
 Loop Interchange
 Loop Fusion
 Blocking

Merging Arrays
/* Conflict */ Key

 int key[size]; value


 int value [size];

/* Instead no conflict*/ Key, value pairs


struct merge {

 int key;
 int value;
} ;
Loop Interchange
 for (k = 0; k < 100; k= k+1) for (k = 0; k < 100; k= k+1)
 for (j = 0; j < 100; j= j+1) for (i = 0; i < 5000; i=
 for (i = 0; i < 5000; i= i+1)
i+1) for (j = 0; j < 100; j=
 x[i][j] = 2*x[i][j] j+1)
 /*instead */ Memory x[i][j] = 2*x[i][j]
X(0,0) Address
X(0,1)
X(0,2)
X(0,3)
X(0,4)
X(0,5)

X(0,4999)
X(1,0)
X(1,1)

Blocking
Loop Fusion
for (i = 0; i < 5000; i= i+1)
 a[i] = i;
…

for (i = 0; i < 5000; i= i+1)

 b[i] = b[i]+a[i];

/*instead */

for (i = 0; i < 5000; i= i+1)


 a[i]= i;
 b[i] = b[i]+a[i];

Example 2
• Assume a fully associative write-back cache with
many cache entries that starts empty. Below is
a sequence of five memory operations (the
address is in square brackets)
• What are the number of hits and misses using no-
write allocate versus write allocate?

 WriteMem[100];
 WriteMem[100];
 ReadMem[200];
 WriteMem[200];
 WriteMem[100];
Exit Quiz
1.In memory hierarchy top layer is
occupied by the

a.Fastest and the most expensive


memory
b.Slowest and the most expensive
memory
c.High band width and the fastest
memory
d.Slowest and the least expensive
memory
Exit Quiz
2.The memory technology suitable for
L2 cache is

a.EPROM
b.DRAM
c.SRAM
d.Flash
e.
Exit Quiz
3.Server’s memory requirements are
different from desktop’s memory
requirements
a.True
b.False

Exit Quiz
4.Hit Rate is (1 – Miss Rate)
a.True
b.False
c.
6.Gordon Moore’s law states that the
transistor density
a.Triples every year
b.Doubles every two years
c.Doubles every 1.5 years
d.Doubles every year
Improving Cache
Performance
• Two options
– By Reducing Miss Penalty
– By Reducing Miss Rate
Reducing Cache Miss
Penalty
1.Multilevel caches
2.Critical word first & early restart
3.Priority for Read misses over Write
4.Merging Write Buffers
5.Victim Cache

Reducing Miss Penalty -
Multilevel caches (1)
100 ns, 128M 100 ns, 128M

2 ns, 16K 10 ns, 512K


Cache Main
CPU 1 CPU Cache Main
Memory 1 Memory
0.04
0.02
Faster but smaller Slower but larger

100 ns, 128M


? ns, 528
K

CPU Cache Cache Main


1 2 Memory
0.04
0.02

Faster and Larger


Reducing Miss Rate &
Global Miss Rate (1)
2.6 ns, 528K 100 ns, 128M

10 ns, 512K
2 ns, 16K

CPU Cache 1 Cache 2 Main


Memory
0.05
0.02
Faster and Larger

Average memory access timeeff = Hit TimeL1 + Miss RateL1 x


Miss
PenaltyL1
Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2

Global Miss Rate = Miss RateL1 x Miss RateL2


R e d u cin g M iss Pe n a lty – C ritica lW o rd
First a n d E a rly R e sta rt ( 2 )

• Critical word first


• Early restart
• Lets discuss the following:
– How is Miss Penalty improved?
– Under what conditions this is useful?
– Is it worth going through all the
trouble?

Reducing Miss Penalty –
Priority To Read Miss (3)
• Read Miss takes priority
• Write to Memory is put on hold
• Lets discuss the following:
– Is there a problem?
– How is Miss Penalty improved?
– How this is done in Write-through?
– How this is done in Write-back?
– Under what conditions this is useful?
– Is it worth going through all the
trouble?
Reducing Miss Penalty –
Merging Write Buffer (4)
• Write-through is sent to a write buffer
• Buffer & Write possible options:
1. Buffer is empty – write address and data to buffer
2. Buffer is not full – write or merge data
3. Buffer is full – Stall until buffer is written to memory.
– Note: Block write to memory is more efficient then
multiple writes.
• Lets Discuss the following:
– How many different ways do we optimize Miss
Penalty in this scheme?
– Explain the Fig 5.12 in the book
– When there is no block write (like I/O device), no
merging
Fully Associative Victim
Cache (5)
• Key word is Recycle
• Another implicit key word is complex
(hardware)
• Aimed at handling conflict misses
• 1 to 6 victim caches are ideal
Reducing Misses
• Misses due to 3Cs: Compulsory, Capacity,
and Conflict
• Reducing Misses Techniques
1.Larger Lines/Blocks
2.Larger Caches
3.Higher Associativity
4.Way Prediction Pseudoassociative caches
5.Compiler Techniques

 Notes: copy tables 5.14 and 5.17 to two slides


Reducing Cache Misses – Larger Block Size
(1)
• Larger block improves number of misses
up to a point!
• Points to discuss
– Why number of misses starts increasing for
larger blocks?
– Low latency encourages smaller blocks and
higher latency encourages larger blocks.
– Low bandwidth encourages smaller blocks
and higher bandwidth encourages larger
blocks.
– Effect on Miss Penalty with larger blocks
• What is the other name for 1-way set associative mapping?
Reducing Cache Misses – Larger Caches (2)

• Obviously!
Reducing Cache Misses – Higher
Associativity (3)

• Miss rate improves with associativity


• Points to discuss
– Complexity of set associative
mapping versus the improvement.
– Rule of thumb 1: 8-way associative is
as good as fully associative.
– Rule of thumb 2: 2-way associative is
as good as direct mapping.
– Greater associativity increases Hit
time
Pseudo-Associative Caches
(4)
Cache Memory – ith set
• Lets understand this using an tags
8-way set associative Block 0
Instruction Cache.
• Instruction exhibits better Block 1
locality
• Each access to instruction
cache normally needs 8
comparisons.
• Using locality predict the next
block to access in the set to Block 6
reduce the number of
comparisons.
Block 8
• Effect on Hit-time and Cache
Reducing Hit Time – Small and Simple
Cache (1)

• Tag comparison is complex,


specifically in associative mapping
• Tag checking can be overlapped with
data transfer to reduce the hit time.
• Tag checking in CPU chip, cache in a
chip by itself. Provides better hit
time and larger cache capacity
Reducing Hit Time – Virtual Cache (2)

• Addressing
– VA -> Physical Address -> cache
• Skip two levels, VA maps to cache
• Problems:
– No page boundary checks
– Building direct mapping between VA
and Cache for every process is not
easy.
Reducing Hit Time – Pipelined Cache (3)

• Increases the cache throughput not


access time

Reducing Hit Time – Trace Cache

• Increasing instruction level


parallelism
• Instead of 4 consecutive locations of
cache, load the next 4 instruction
required by the CPU using trace.
• Folding branch prediction into cache
Improving Parallelism with
Cache
• Cache Hit under Cache Miss
• Cache Miss under Cache Miss
• Both require non-blocking cache, out
of order execution CPU
• Other methods to improve
performance using parallelism:
– Hardware pre-fetching of instruction
– Software (compiler) pre-fetching of
data
Preparation
Write-back
 Write Through

• Complicates cache • Simplifies cache


coherency problem coherency problem
• Low overhead memory • High overhead
access overhead • If cache is blocking then
• Better cache access time higher access time
than write -through
• Requires Lower memory
• Requires higher memory bandwidth
bandwidth if blocking

• N o te : N o
n n e e d to d e scrib e th e se tw o p o licie s
nW rite - th ro u g h d o e s n o t b u y a n yth in g extra fo r a

sin g le p ro ce sso r syste m d u e to a b se n ce o f


ca ch e co h e re n cy
CPU Execution Time
CPU Execution Time
 = (CPU Cycle time + Stalled Cycle) X
Cycle Time

• Stalled Cycle = misses x penalty


• Misses given either as misses/1000
instruction or misses/memory-access AKA
miss rate.
• Instruction Count , Cycles per Instruction,
Miss are also required to compute CPU
execution time.

Average Access Time with
Cache
Average Access Time
= Hit Time + Miss Rate X Penalty

Multi-level Cache

 Avg Access Time = Hit Time


L1 + Miss RateL1 X
PenaltyL1
 PenaltyL1 = Hit TimeL2 + Miss RateL2 X
PenaltyL2
Addressing
Address 0
From CPU Tag Set Word Tag 1
2 Block 0
Set 0 3
4
Line 0 5 Block 1
6
7
1. Select
Line 1
2. Copy
3.Compare Set 1

5. Load Line 2
+
4. Hit
Block 126
4. Miss Line 3 (m)
508
509
To main memory 510 Block 127
Cache
511
Main Memory
Assignment I – Due same
day next week
• Mapping functions
• Replacement algorithms
• Write policies
• Write Miss policies
• Split Cache versus Unified Cache
• Primary Cache versus Secondary
Cache
• Compiler cache optimization
techniques with examples
Assignment II - Due same
day next week
• Multilevel Cache
• Cache Inclusion/Exclusion Property
• Thumb rules of cache
• Compiler pre-fetch
• Multi-level Caching + one another
Miss Penalty Optimization
technique
• Two miss Rate Optimization
Techniques
Assignment III - Due 2nd
class of next week
• All odd numbered problems from
cache module of your text book.
Assignment IV - Due 2nd
class of next week
• All even numbered problems from
cache module of your text book.
CPU Execution Time & Average Access Time
Memory

Cache

CPU
1 CC 100 CC
With Multi-level Cache
Memory

Cache

CPU
10 CC 100 CC
1 CC
1000
Memory Hierarchy

Main Memory
Main Memory
• Module Objective
– To understand Main memory latency
and bandwidth
– Techniques to improve latency and
bandwidth
Memory Hierarchy &
Cache
Main Memory – Cache – I/O
250 ns

Main
Memor
y
ca
CPU ch
0.25 ns e
Bus Storage &
Adapte I/O devices
r
2,500,000 ns!

§Cache prefers low latency main memory


§I/O & Multiprocessors prefer high bandwidth/throughput
main memory
Main Memory Access Time
4 cc
Main
Memory

ca 56 cc
CPU ch
e 4 cc
CC – Clock Cycle

Data
Bus Address
Bus
Access Time per word = 4+56+4 CC
One word is 8 Bytes
Latency is 1 bit/CC
Improving Memory
Performance
• Improving Latency ( time to access 1
memory unit - word)
• Improving bandwidth (bytes
accessed in unit time)


Improving Memory
Simple Design
Bandwidth
Wider Bus Interleaved

CPU CPU CPU


64 bits
64 bits 64 bits
Cache
Cache Cache

4x64 bits 64 bits


64 bits

Memory Memory 0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Cache Block 4 words
One word 8 bytes Bank 0 Bank 1 Bank 2 Bank 3

Bandwidth, Latency, Penalty


Interleaving Factor
Address allocation with Interleaving