• Interleaving factor
– What is the optimal interleaving factor if the memory cycle is 8 CC (1 + 6 + 1)?
– Assume 4 banks
(Timing diagram: clock cycles 1–20 across banks b1–b4; the 1st through 4th read requests are issued one cycle apart, each holding its bank for the full 8 CC memory cycle; X marks a request that stalls waiting for its bank to become free.)
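To sanity-check the question, here is a small simulation sketch (the timing model – one new address per clock cycle, each access occupying its bank for the full 8 CC – and all names are my assumptions, not from the slide):

    # Sketch: completion cycle of back-to-back word reads over interleaved banks.
    def read_completion_cycles(n_banks, n_words, cycle=8):
        bank_free = [1] * n_banks      # earliest CC at which each bank can start
        bus_free = 1                   # one new address can be issued per CC
        done = []
        for w in range(n_words):
            bank = w % n_banks         # sequential words interleave round-robin
            start = max(bus_free, bank_free[bank])
            done.append(start + cycle - 1)
            bank_free[bank] = start + cycle
            bus_free = start + 1
        return done

    print(read_completion_cycles(4, 8))  # [8, 9, 10, 11, 16, 17, 18, 19] - gaps
    print(read_completion_cycles(8, 8))  # [8, 9, 10, 11, 12, 13, 14, 15] - no gaps

With 4 banks the fifth request stalls until its bank finishes; with 8 banks a word is delivered every cycle, which suggests 8 as the optimal interleaving factor for an 8 CC memory cycle.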
Problem 2 (page 452 in your textbook)
(Diagram: CPU – Cache – simple memory; a memory access takes 64 CC.)
• Block size = 1 word
• Memory bus size = 1 word
• Miss rate = 3%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 64 CC
• Avg cycles per instruction = 2
• Assume 1000 instructions in your program
If there are no misses, the execution time is 2000 CC.
Each instruction makes 1.2 memory accesses, so 1000 instructions make 1200 accesses. With a 3% miss rate, 1200 accesses produce 36 misses.
The execution time is
= 2000 + 36 × 64 = 4304 CC
Average cycles per instruction = 4304/1000 ≈ 4.3
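The same computation as a quick sketch (the helper and its argument names are mine):

    def exec_time_cc(instructions, base_cpi, accesses_per_instr, miss_rate, miss_penalty):
        base = instructions * base_cpi                          # 1000 x 2 = 2000 CC
        misses = instructions * accesses_per_instr * miss_rate  # 1200 x 0.03 = 36
        return base + misses * miss_penalty

    total = exec_time_cc(1000, 2, 1.2, 0.03, 64)
    print(total, total / 1000)  # 4304.0 CC, CPI = 4.304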
Problem 2 (wider bus – 2 words)
(Diagram: CPU – Cache – memory; 64 CC per access over a 2-word bus.)
• Block size = 4 words
• Memory bus size = 2 words
• Miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 128 CC (two 64 CC accesses per 4-word block)
• Avg cycles per instruction = 2
• Assume 1000 instructions in your program
If there are no misses, the execution time is 2000 CC.
Each instruction makes 1.2 memory accesses, so 1000 instructions make 1200 accesses. With a 2.5% miss rate, 1200 accesses produce 30 misses.
The execution time is
= 2000 + 30 × 128 = 5840 CC
Average cycles per instruction = 5840/1000 ≈ 5.84
Problem 2 (2-way interleaving)
(Diagram: CPU – Cache – interleaved memory; 64 CC per access over a 1-word bus.)
• Block size = 4 words
• Memory bus size = 1 word
• Miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 68 × 2 CC
• Avg cycles per instruction = 2
• Assume 1000 instructions in your program
If there are no misses, the execution time is 2000 CC.
Each instruction makes 1.2 memory accesses, so 1000 instructions make 1200 accesses. With a 2.5% miss rate, 1200 accesses produce 30 misses.
The execution time is
= 2000 + 30 × 68 × 2 = 2000 + 4080 = 6080 CC
Average cycles per instruction = 6080/1000 ≈ 6.08
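Reusing the hypothetical exec_time_cc helper from the first configuration reproduces both results:

    print(exec_time_cc(1000, 2, 1.2, 0.025, 128))     # wider bus: 5840.0 CC
    print(exec_time_cc(1000, 2, 1.2, 0.025, 68 * 2))  # 2-way interleaved: 6080.0 CC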
DRAM (Dynamic Random Access Memory)
• Address multiplexing
• RAS and CAS (row and column address strobes)
• Dynamic refreshing and its implications
– A read is destructive: the data must be written back after a READ!
• Amdahl's rule of thumb
– A 1000 MIPS machine should have 1000 MB of memory (1 MB per MIPS)
• Read your textbook for DRAM, SRAM, RAMBUS, and Flash.
SRAM (Static Random Access Memory)
• Uses six transistors per bit.
• Static
– NO NEED to write data back after a READ
• Comparison versus DRAM
– More transistors per bit
• Lower density – 1/4 to 1/8 the capacity
• 8 to 10 times more expensive
– No refresh
• 8 to 16 times faster
– Used for cache instead of main memory
Flash
• Similar to EEPROM
• Ideal choice for embedded systems
– Low power requirement, no moving parts
– Faster READ time, slower WRITE time
• Comparison versus EEPROM
– Finer-grained and simpler control
– Better capacity
• WRITE operation – an entire block must be erased and rewritten.
– Writes are 50 to 100 times slower than reads
• NOR-type flash and NAND-type flash
Virtual Memory
(Figure: address translation. Virtual pages A (0–4K), B (4K–8K), C (8K–12K), and D (12K–16K) map into a 28K physical main memory: C at physical 4K, A at 16K, B at 24K, while D resides on disk.)
Virtual Memory versus First-Level Cache

Parameter         First-level cache                 Virtual memory
Block/page size   16–128 B                          4K–64K B
Hit time          1–3 CC                            50–150 CC
Miss penalty      8–150 CC                          1–10 M CC
  Access time     6–130 CC                          0.8–8 M CC
  Transfer time   2–20 CC                           0.2–2 M CC
Miss rate         0.1–10%                           0.00001–0.001%
Address mapping   25–45 bit physical address to     32–64 bit virtual address to
                  14–20 bit cache address           25–45 bit physical address
Memory Hierarchy Questions
• Where can a block be placed?
• How is a block found if it is in the upper level?
• Which block should be replaced?
• What happens on a write?
Segmentation versus Paging
(Figure: an address space divided into code and data, managed as fixed-size pages versus variable-size segments.)

                        Paging                            Segmentation
Efficient disk traffic  Yes (easy to tune the page size)  Not always (small segments)
CPU Execution Time

CPU execution time = (CPU clock cycles + memory stall cycles) × clock cycle time
Multi-level Cache

Avg Access Time = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1

where Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2, so a second-level cache improves performance by reducing the effective L1 miss penalty.
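As a quick sketch (hypothetical helper; the L1/L2 numbers are invented only to exercise the formula):

    def amat_cc(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, mem_penalty):
        miss_penalty_l1 = hit_l2 + miss_rate_l2 * mem_penalty  # an L1 miss is served by L2
        return hit_l1 + miss_rate_l1 * miss_penalty_l1

    print(amat_cc(1, 0.05, 10, 0.2, 100))  # 1 + 0.05 x (10 + 0.2 x 100) = 2.5 CC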
Disk System
• Disk pack & cache
• Disk embedded controller
• SCSI host adapter
• OS (kernel/driver)
• User program
(Figure: the SCSI host adapter sits on the host I/O bus and connects over the SCSI bus to the embedded controller of each individual disk; T marks the SCSI bus terminator.)
Reference: http://www.stjulians.com/cs/diskstoragenotes.html
Components of Disk Access Time
• Seek Time
• Rotational Latency
• Internal Transfer Time
• Other Delays
• That is,
– Avg access time = avg seek time + avg rotational delay + transfer time + other overhead
Problem (page 684)
• Average seek time = 5 ms
• Rotation speed = 10000 RPM
• Transfer rate = 40 MB/sec
• Other delays = 0.1 ms
• Sector size = 512 B (0.5 KB)

Average access time
= average seek time (5 ms)
+ average rotational delay (time for half a revolution)
+ transfer time (512 / (40 × 10^6) s)
+ other overhead (0.1 ms)
= 5 + (60 × 10^3)/(2 × 10000) + (512 × 10^3)/(40 × 10^6) + 0.1 ms
= 5 + 3 + 0.0128 + 0.1 ≈ 8.11 ms
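The same arithmetic as a small sketch (function and argument names are mine):

    def disk_access_ms(seek_ms, rpm, sector_bytes, rate_bytes_per_s, overhead_ms):
        rotation_ms = 0.5 * 60_000 / rpm                       # half a revolution, in ms
        transfer_ms = sector_bytes / rate_bytes_per_s * 1_000  # sector transfer, in ms
        return seek_ms + rotation_ms + transfer_ms + overhead_ms

    print(disk_access_ms(5, 10_000, 512, 40e6, 0.1))  # 8.1128 ms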
RAID
• RAID stands for Redundant Array of Inexpensive/Independent Disks
• Uses a number of small disks instead of one large disk.
• Six+ levels of RAID (0–5).
Terminologies
• Mean Time Between Failures (MTBF)
• Mean Time To Data Loss (MTTDL)
• Mean Time To Data Inaccessibility (MTTDI)
RAID-0 Striping
(Figure: logical storage 01 02 … 12 is divided into chunks and striped, one stripe width at a time, across four physical disks: stripe 1 = 01 02 03 04, stripe 2 = 05 06 07 08, stripe 3 = 09 10 11 12.)

RAID 0+1
(Figure: two RAID 0 arrays of n disks each, mirrored under RAID 1; each array holds the chunks 01 02 / 03 04 / 05 06.)
Performance RAID-0+1
• Let the reliability of a RAID 0 sub-tree be R':
– Then the reliability of the RAID 1 tree is R = 1 – (1 – R')(1 – R')
• Reliability R' is:
– R' = r² (where the reliability of a single disk is r)
• Throughput is the same as RAID-0, however with 2 × n disks
• Utilization is lower than RAID-0 due to mirroring
• "Write" is marginally slower due to atomicity
• When r = 0.9, R' = 0.81, and R = 1 – (0.19)² ≈ 0.96
RAID 1+0 – Mirroring & Striping
(Figure: RAID 0 striped over two RAID 1 mirrored pairs – pair 1 holds chunks 01 03 05 on both of its disks, pair 2 holds chunks 02 04 06 on both of its disks.)
Performance RAID-1+0
• Let the reliability of a RAID 1 sub-tree be R':
– Then the reliability of the RAID 0 tree is R = (R')²
• Reliability R' is:
– R' = 1 – (1 – r)² (where the reliability of a single disk is r)
• Throughput is the same as RAID-0, however with 2 × n disks
• Utilization is lower than RAID-0 due to mirroring
• "Write" is marginally slower due to its atomicity
• When r = 0.9, R' = 0.99, and R = (0.99)² ≈ 0.98
RAID-2 Hamming Code Arrays
• Low commercial interest due to the complex nature of Hamming-code computation.
RAID-3 Striping with Parity
(Figure: logical storage 01 02 … 12 is striped in single bits or words across four data disks plus a dedicated parity disk: 01 02 03 04 | P0, 05 06 07 08 | P1, 09 10 11 12 | P2.)
RAID-3 Operation
• Based on the principle of a reversible form of parity computation.
• Parity P = C0 ⊕ C1 ⊕ … ⊕ Cn-1 ⊕ Cn
• Missing stripe Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm-1 ⊕ Cm+1 ⊕ … ⊕ Cn-1 ⊕ Cn
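A tiny sketch of this reversibility (data values are arbitrary):

    def xor_parity(chunks):
        p = bytes(len(chunks[0]))  # start with all zeros
        for c in chunks:
            p = bytes(a ^ b for a, b in zip(p, c))
        return p

    data = [b'\x01\x02', b'\x0a\x0b', b'\x10\x20']
    p = xor_parity(data)                         # stored on the parity disk
    rebuilt = xor_parity([p, data[0], data[2]])  # disk holding data[1] lost
    assert rebuilt == data[1]                    # XOR of survivors restores it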
RAID-3 Performance
• RAID-1's 1-to-1 redundancy cost is addressed by a 1-for-n parity disk, so it is less expensive than RAID-1.
• The rest of the performance is similar to RAID-0.
• It can withstand the failure of one of its disks.
• Reliability = P(all n disks working) + P(exactly one disk failed)
= r^n + C(n,1) × r^(n-1) × (1 – r)
• When r = 0.9 and n = 5:
= 0.9⁵ + 5 × 0.9⁴ × (1 – 0.9)
= 0.59 + 5 × 0.656 × 0.1
≈ 0.59 + 0.33 ≈ 0.92
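Checking the arithmetic with a throwaway sketch:

    from math import comb

    def raid3_reliability(r, n):
        # survives with zero failed disks, or with exactly one of n failed
        return r**n + comb(n, 1) * r**(n - 1) * (1 - r)

    print(raid3_reliability(0.9, 5))  # 0.91854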
RAID-4 Performance
• Similar to RAID-3, but supports larger chunks.
• Performance measures are similar to RAID-3.
RAID-5 (Distributed Parity)
(Figure: logical storage 01 02 … 12 striped in chunks across five disks, with the parity chunks rotated among the disks: 01 02 03 04 | P0, then P1 | 05 06 07 08, then 09 | P2 | 10 11 12. No single disk is dedicated to parity.)
RAID-5 Performance
• In RAID-3 and RAID-4, the parity disk is a bottleneck for write operations.
• This issue is addressed in RAID-5 by distributing the parity across all disks.
Reading Assignment
• Optical disk
• RAID study material
• Worked-out problems from the textbook: p 452, p 537, p 539, p 561, p 684, p 691, p 704
Amdahl's Speedup for Multiprocessors

Speedup = 1 / (Fraction_e / Speedup_e + (1 – Fraction_e))

Speedup_e – the number of processors
Fraction_e – the fraction of the program that runs in parallel on Speedup_e processors
Assumption: the program either runs fully in parallel on all processors or runs serially.
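The formula transcribes directly (a sketch; names are mine):

    def speedup(fraction_parallel, processors):
        return 1 / (fraction_parallel / processors + (1 - fraction_parallel))

    print(speedup(0.9, 8))     # ~4.71 with 8 processors and 90% parallel code
    print(speedup(0.9, 1000))  # ~9.91: capped near 1 / (1 - 0.9) = 10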
(Figure: distributed-memory multiprocessor – multiple processors, each with a private cache, connected by an interconnection network.)
Symmetric versus Distributed Memory MP
• SMP: uses shared memory for inter-process communication
– Advantages:
• Close coupling due to shared memory
• Sharing of data is faster between processors
– Disadvantages:
• Scaling: memory is a bottleneck
• High and unpredictable memory latency
(Figure: SMP – processors, each with a cache and a cache controller (cc), on a shared bus together with main memory and I/O systems.)
Cache Coherence (CC) Protocols
• CC protocols implement cache coherence
• Two types:
– Snooping (replicated): there are multiple copies of the sharing status. Every cache that has a copy of the data from a block of physical memory also has a copy of the block's sharing status, and no centralized state is kept.
– Directory based (logically centralized): there is only one copy of the sharing status of a block of physical memory, and all processors use it. This copy could reside in any of the participating processors.
Snooping Protocol
(Figure: centralized shared-memory architecture (SMP) – caches, each with a cache controller (cc), snoop a shared bus that also connects main memory and the I/O systems.)
Invalidation with a Snooping Protocol

Processor activity    Bus activity         CPU A's cache  CPU B's cache  Memory at X
(initially)                                                              0
CPU A reads X         Cache miss for X     0                             0
CPU B reads X         Cache miss for X     0              0              0
CPU A writes 1 to X   Invalidation for X   1                             0
CPU B reads X         Cache miss for X     1              1              1
Write Broadcast with a Snooping Protocol

Processor activity    Bus activity            CPU A's cache  CPU B's cache  Memory at X
(initially)                                                                 0
CPU A reads X         Cache miss for X        0                             0
CPU B reads X         Cache miss for X        0              0              0
CPU A writes 1 to X   Write broadcast for X   1              1              1
CPU B reads X                                 1              1              1
Reading Assignment
• Section 7.4 (Reliability, Availability, & Dependability) in your textbook
• Pages 554 and 555
• Section 7.11 (I/O Design) – attempt all 5 problems in this section.
Invalidation versus Write Update
• Multiple writes to the same data item with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation in a write invalidate protocol.
• With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol.
• An invalidation protocol works on cache blocks, while an update protocol must work on individual words.
• The delay between writing a word in one processor and reading the written value in another processor is usually less in a write update scheme, since the written data are immediately updated in the reader's cache. By comparison, in an invalidation protocol, the reader is invalidated first, then later reads the data and is stalled until a copy can be read and returned to the processor.
Coherence Misses
• Write invalidate causes read cache misses:
– True miss: if an item X in a block is made exclusive by a write invalidate, and a read of the same item in another processor then causes a read miss, it is a true miss.
– False miss: if the miss is caused by accessing other, unwritten items that merely share the same block, it is a false miss.
Use of Valid, Shared, and Dirty Bits
• Valid bit: every time a block is loaded into a cache from memory, the tag for the block is saved in the cache and the valid bit is set to TRUE. A write to the same block by a different processor may reset this valid bit through write invalidation. Thus, when a cache block is accessed for READ or WRITE, the tag should match AND the valid bit should be TRUE. If the tag matches but the valid bit is reset, it is a cache miss.
• Shared bit: when a memory block is loaded into a cache block for the first time, the shared bit is set to FALSE. When some other cache loads the same block, it is turned to TRUE. When this block is updated, write invalidate uses the value of the shared bit to decide whether to send a write-invalidate message: if the shared bit is set, an invalidate message is sent; otherwise not.
• Dirty bit: the dirty bit is set to FALSE when a block is loaded into cache memory. It is set to TRUE when the block is updated for the first time. When another processor wants to load this block, the block is migrated from this cache instead of being loaded from memory.
Summary of Snooping Mechanism

Request     Source     State of addressed cache block  Function and explanation
Read hit    processor  shared or exclusive   Read data in cache
Read miss   processor  invalid               Place read miss on bus
Read miss   processor  shared                Address conflict miss: place read miss on bus
Read miss   processor  exclusive             Address conflict miss: write back block, then place read miss on bus
Write hit   processor  exclusive             Write data in cache
Write hit   processor  shared                Place write miss on bus
Write miss  processor  invalid               Place write miss on bus
Write miss  processor  shared                Address conflict miss: place write miss on bus
Write miss  processor  exclusive             Address conflict miss: write back block, then place write miss on bus
Read miss   bus        shared                No action; allow memory to service read miss
Read miss   bus        exclusive             Attempt to share data: place cache block on bus and change state to shared
Write miss  bus        shared                Attempt to write shared block; invalidate the block
Write miss  bus        exclusive             Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid
State Transition
(Figure 1: cache state transitions based on requests from the CPU. Invalid → Shared on a CPU read miss: place read miss on bus. Invalid → Exclusive on a CPU write: place write miss on bus. Shared → Exclusive on a CPU write: place write miss on bus. Shared → Shared on a CPU read miss: place read miss on bus. Exclusive → Shared on a CPU read miss: write back the block, then place read miss on bus. CPU read hits and CPU write hits leave the state unchanged.)
(Figure 2: cache state transitions based on requests from the bus. Exclusive → Shared on a read miss for this block: write back the block; abort memory access. Exclusive → Invalid on a write miss for this block: write back the block; abort memory access. Shared → Invalid on a write miss for this block.)
Some Terminologies
• Polling: a process periodically checks whether there is a message it needs to handle. This method of awaiting a message is called polling. The repeated checks waste cycles, reducing processor utilization.
• Interrupt: a process is notified when a message arrives through the built-in interrupt mechanism. Interrupts increase processor utilization in comparison to polling.
• Synchronous: a process sends a message and waits for the response to arrive before sending another message or carrying out other tasks. This way of waiting is referred to as synchronous communication.
• Asynchronous: a process sends a message and continues to carry out other tasks while the request is being processed. This is referred to as asynchronous communication.
Communication Infrastructure
• Multiprocessor systems with shared memory can have two types of communication infrastructure:
– Shared bus
– Interconnect
(Figure: a shared bus topology versus a point-to-point interconnect.)
Directory Based Protocol
(Figure: nodes 1, 2, … n-1, n, each with its own directory, connected by an interconnect.)
State of the Block
• Shared: one or more processors have the block cached, and the value in memory is up to date.
• Uncached: no processor has a copy of the block.
• Exclusive: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. The processor is called the owner of the block.
Local, Remote, and Home Nodes
• Local node: the node where the request originates.
• Home node: the node where the memory location and the directory entry of an address reside.
• Remote node: a node that has a copy of a cache block.
Shared State Operation
• Read miss: the requesting processor is sent the requested data from memory, and the requestor is added to the sharing set.
• Write miss: the requesting processor is sent the value. All the processors in the Sharers set are sent invalidate messages, and the Sharers set is updated to contain only the identity of the requesting processor. The state of the block is made exclusive.
Uncached State Operation
• Read miss: the requesting processor is sent the requested data from memory, and the requestor is made the only sharing node. The state of the block is made shared.
• Write miss: the requesting processor is sent the requested data and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Exclusive State Operation
• Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the Sharers set, which still contains the identity of the processor that was the owner (since it still has a readable copy).
• Data write-back: the owner processor is replacing the block and therefore must write it back. This write-back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.
• Write miss: the block has a new owner. A message is sent to the old owner, causing the cache to invalidate the block and send the value to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains exclusive.
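The three state slides above can be condensed into a toy directory entry. This is my own sketch: the message traffic described in the text is abstracted into comments, and all names are invented:

    class DirectoryEntry:
        def __init__(self):
            self.state = 'uncached'
            self.sharers = set()

        def read_miss(self, p):
            if self.state == 'exclusive':
                pass               # fetch from owner; memory updated; owner stays a sharer
            self.sharers.add(p)    # requestor joins the sharing set
            self.state = 'shared'

        def write_miss(self, p):
            if self.state == 'shared':
                pass               # send invalidates to all current sharers
            elif self.state == 'exclusive':
                pass               # old owner invalidates and supplies the data
            self.sharers = {p}     # the new owner is the only sharer
            self.state = 'exclusive'

        def write_back(self):
            self.state = 'uncached'  # memory copy is up to date again
            self.sharers = set()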
Cache State Transition
(Figure: directory states for a memory block – Uncached, Shared (read only), and Exclusive (read/write).
Uncached → Shared on a read miss: data value reply; Sharers = {P}.
Uncached → Exclusive on a write miss: data value reply; Sharers = {P}.
Shared → Shared on a read miss: data value reply; Sharers += {P}.
Shared → Exclusive on a write miss: send invalidate messages; data value reply; Sharers = {P}.
Exclusive → Shared on a read miss: fetch from owner; data value reply; Sharers += {P}.
Exclusive → Uncached on data write-back: Sharers = {}.)
Without Interrupt Mask
(Figure: Process 1 and Process 2 share memory with X = 0; time runs downward.)

P1: TOP: READ X
         BNEZ TOP
(P2 is scheduled)
P2: TOP: READ X
         BNEZ TOP
         WRITE 1
(P1 is scheduled)
P1:      WRITE 1

Both processes see X = 0 and both write 1, so both enter the critical section.
With Interrupt Mask
(Figure: Process 1 and Process 2 share memory with X = 0; time runs downward.)

P1: MASK
    TOP: READ X
         BNEZ TOP
         WRITE 1
    UNMASK
P2: MASK
    TOP: READ X
         BNEZ TOP
         WRITE 1
    UNMASK

Masking interrupts keeps a process from being descheduled between testing X and writing it, so on a single processor only one process acquires the lock.
Multiprocessor System
• Masking the interrupt will not work
(Figure: Processor 1 and Processor 2, each running one process, share memory with X = 0.)
On a multiprocessor, P1 and P2 run on different processors at the same time, so masking interrupts on one processor cannot keep the other from entering the critical section.
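What does work on a multiprocessor is an atomic read-modify-write (test-and-set or exchange), which closes the window between READ X and WRITE 1. A sketch of the idea follows; since Python exposes no hardware atomics, a host threading.Lock stands in for the atomic exchange instruction, purely for illustration:

    import threading

    class SpinLock:
        def __init__(self):
            self._x = 0                      # the lock variable "X"
            self._atomic = threading.Lock()  # models the hardware atomic exchange

        def _exchange(self, new):
            with self._atomic:               # read old X and write new X in one step
                old, self._x = self._x, new
                return old

        def acquire(self):
            while self._exchange(1) == 1:    # like READ X / BNEZ TOP, but atomic
                pass                         # spin until we observe X == 0

        def release(self):
            self._exchange(0)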
Predictor for SPEC92 Benchmarks
(Chart: instructions between mispredictions, 0–300, for SPEC92 benchmarks including gcc, li, ear, compress, espresso, eqntott, doduc, su2cor, mdljdp2, and hydro2d, comparing a static "predicted taken" scheme against profile-based prediction.)