
Problem 1

• Interleaving factor
– What is the optimal interleaving factor if the memory
cycle is 8 CC (1+6+1)?
– Assume 4 banks
(Timing diagram: banks b1-b4 across clock cycles 1-20, showing when the 1st, 2nd, and 3rd read requests can start and complete; a request to a still-busy bank, marked X, must wait.)
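The slides work this out on the timing diagram above; as a rough illustration (not the textbook's solution), here is a minimal Python simulation of the same setup. Assumptions: sequential word addresses map round-robin across the banks, one address can be issued per clock cycle, and each bank is busy for the full 8 CC memory cycle (1 CC address + 6 CC access + 1 CC data transfer). It suggests why 8 banks (memory cycle / transfer time) keep the bus fully busy, while 4 banks deliver only 4 words every 8 CC.

def simulate(n_banks, n_reads, cycle=8):
    bank_free = [1] * n_banks               # earliest cycle each bank accepts an address
    bus_free = 1                            # one address can be issued per cycle
    done = []
    for i in range(n_reads):
        bank = i % n_banks                  # sequential words, round-robin banks
        start = max(bus_free, bank_free[bank])
        bus_free = start + 1
        bank_free[bank] = start + cycle     # bank busy for the whole memory cycle
        done.append(start + cycle - 1)      # data word arrives on the last CC
    return done

for n in (4, 8):
    d = simulate(n, 16)
    print(n, "banks -> last 8 words take", d[-1] - d[-9], "CC")
    # 4 banks -> 16 CC per 8 words; 8 banks -> 8 CC per 8 words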
Problem 2 (page 452 in your text book)

(Diagram: CPU - Cache - Simple Memory, 64 CC per access)

• Block size = 1 word
• Memory bus size = 1 word
• Miss rate = 3%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 64 CC
• Avg cycles per instruction = 2
• Assume 1000 instructions in your program

If there were no misses, the execution time would be 2000 CC.
One instruction needs 1.2 memory accesses, so 1000 instructions -> 1200 accesses. With a 3% miss rate, 1200 accesses produce 36 misses.
The execution time is
= 2000 + 36 x 64 = 4304 CC
Average cycles per instruction = 4304/1000 ≈ 4.3
Problem 2 (wider bus – 2 words)

(Diagram: CPU - Cache - Memory, 64 CC per access, 2-word bus)

• Block size = 4 words
• Memory bus size = 2 words
• Miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 128 CC (two 2-word transfers at 64 CC each)
• Avg cycles per instruction = 2
• Assume 1000 instructions in your program

If there were no misses, the execution time would be 2000 CC.
1000 instructions -> 1200 accesses; with a 2.5% miss rate, that is 30 misses.
The execution time is
= 2000 + 30 x 128 = 5840 CC
Average cycles per instruction = 5840/1000 = 5.84
Problem 2 (2-way interleaving)

(Diagram: CPU - Cache - Interleaved Memory, 64 CC, 1-word bus)

• Block size = 4 words
• Memory bus size = 1 word
• Miss rate = 2.5%
• Memory accesses per instruction = 1.2
• Cache miss penalty = 68 x 2 CC
• Avg cycles per instruction = 2
• Assume 1000 instructions in your program

If there were no misses, the execution time would be 2000 CC.
1000 instructions -> 1200 accesses; with a 2.5% miss rate, that is 30 misses.
The execution time is
= 2000 + 30 x 68 x 2 = 6080 CC
Average cycles per instruction = 6080/1000 ≈ 6.08
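All three configurations follow the same formula: total cycles = IC x base CPI + misses x penalty. A quick Python check with the slide values (using the corrected 6080 CC total for the interleaved case):

def exec_time(miss_rate, penalty, instr=1000, base_cpi=2, accesses_per_instr=1.2):
    misses = instr * accesses_per_instr * miss_rate
    cycles = instr * base_cpi + misses * penalty
    return cycles, cycles / instr

print(exec_time(0.03, 64))       # (4304.0, 4.304)  simple memory
print(exec_time(0.025, 128))     # (5840.0, 5.84)   wider bus
print(exec_time(0.025, 68 * 2))  # (6080.0, 6.08)   2-way interleaving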
DRAM (Dynamic Random Access Memory)

• Address multiplexing
• RAS and CAS
• Dynamic refreshing and its implications
– Data must be written back after a READ!
• Amdahl's rule of thumb
– A 1000 MIPS machine should have 1000 MB of memory (1 MB per MIPS)
• Read your text book for DRAM, SRAM, RAMBUS, and Flash.
SRAM (Static Random Access Memory)

• Uses six transistors per bit.
• Static
– NO NEED to write data back after a READ
• Comparison versus DRAM
– More transistors
• Less capacity – 1/4 to 1/8 of DRAM
• More expensive – 8 to 10 times
– No refresh
• Faster – 8 to 16 times
– Used in caches instead of main memory

Flash
• Similar to EEPROM
• Ideal choice for embedded systems
– Low power requirement, no moving parts
– Faster READ time, slower WRITE time
• Comparison versus EEPROM
– Finer and simpler control
– Better capacity
• WRITE operation – an entire block must be erased and rewritten.
– Writes are 50 to 100 times slower than reads
• NOR type flash and NAND type flash

Virtual Memory

(Diagram: virtual pages A (0K), B (4K), C (8K), and D (12K) map into physical main memory - C at 4K, A at 16K, B at 24K - while page D resides on disk.)
Virtual Memory versus First-Level Cache

Parameter        First-level cache   Virtual memory
Block/page size  16-128 B            4K-64K B
Hit time         1-3 CC              50-150 CC
Miss penalty     8-150 CC            1-10 M CC
Access time      6-130 CC            0.8-8 M CC
Transfer time    2-20 CC             0.2-2 M CC
Miss rate        0.1-10%             0.00001-0.001%
Address mapping  25-45 bit physical address to 14-20 bit cache address   32-64 bit virtual address to 25-45 bit physical address
Memory Hierarchy
Questions
• Where can a block be placed?
• How is a block found if it is in the upper level?
• Which block should be replaced?
• What happens on a write?
Segmentation versus Paging

(Diagram: code and data divided into fixed-size pages versus variable-size code and data segments.)

Remark                   Page                           Segment
Words per address        One                            Two (segment and offset)
Programmer visible       Invisible                      Visible to the programmer
Replacing a block        Trivial                        Hard
Memory use inefficiency  Internal fragmentation         External fragmentation
Efficient disk traffic   Yes (easy to tune page size)   Not always (small segments)
CPU Execution Time

CPU execution time
= (CPU clock cycles + memory stall cycles) x clock cycle time

• Memory stall cycles = misses x miss penalty
• Misses are given either as misses/1000 instructions or as misses/memory access, AKA miss rate.
• Instruction count, cycles per instruction, and the miss penalty are also required to compute CPU execution time.
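For illustration, a small sketch of the formula with made-up numbers (10^6 instructions, CPI of 2, 30 misses per 1000 instructions, 64 CC penalty, 1 ns clock - these values are hypothetical, not from the slide):

def cpu_time_ns(instr_count, cpi, misses_per_1000, penalty, cycle_time_ns):
    stall_cycles = (instr_count / 1000) * misses_per_1000 * penalty
    return (instr_count * cpi + stall_cycles) * cycle_time_ns

print(cpu_time_ns(1_000_000, 2, 30, 64, 1.0))  # 3,920,000 ns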

Average Access Time with Cache

Average access time
= Hit time + Miss rate x Miss penalty

Multi-level Cache

Avg access time = Hit time(L1) + Miss rate(L1) x Penalty(L1)

Penalty(L1) = Hit time(L2) + Miss rate(L2) x Penalty(L2)
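A sketch of the two-level computation with hypothetical parameters (1 CC L1 hit, 5% L1 miss rate, 10 CC L2 hit, 20% L2 miss rate, 100 CC memory penalty - none of these come from the slide):

def amat_two_level(hit_l1, mr_l1, hit_l2, mr_l2, penalty_l2):
    penalty_l1 = hit_l2 + mr_l2 * penalty_l2   # L1 miss penalty served via L2
    return hit_l1 + mr_l1 * penalty_l1

print(amat_two_level(1, 0.05, 10, 0.20, 100))  # 1 + 0.05 * 30 = 2.5 CC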
I/O Systems
• Objective
– Mainly, to understand disk storage technologies.
Storage Systems
• Individual Disks
• SCSI Subsystem
• RAID
• Networked storage (NAS & SAN)

Disk Storage

• Disk storage is slower than memory.
• Disk offers greater capacity at a lower unit price.
• Fits nicely into the virtual memory concept.
• A critical piece of overall performance.

(Image courtesy: www.shortcourses.com/choosing/storage/06.htm)
Disk System

• Disk pack & cache
• Disk embedded controller
• SCSI host adapter
• OS (kernel/driver)
• User program

(Diagram: the SCSI host adapter sits on the host I/O bus and connects over the SCSI bus to per-disk controllers and their disks; T marks the bus terminator.)
Individual Disks


Reference:
http://www.stjulians.com/cs/diskstoragenotes.html
Components of Disk Access Time

• Seek Time
• Rotational Latency
• Internal Transfer Time
• Other Delays
• That is,
– Avg Access Time = Avg Seek Time + Avg Rotational Delay + Transfer Time + Other Overhead
Problem (page 684)

• Seek time = 5 ms/100 tracks
• RPM = 10000
• Transfer rate = 40 MB/sec
• Other delays = 0.1 ms
• Sector size = 512 B

Average access time =
Average seek time (5 ms) +
Average rotational delay (time for 1/2 revolution) +
Transfer time (512/(40 x 10^6) s) +
Other overhead (0.1 ms)

= 5 + (60 x 10^3)/(2 x 10000) + (512/(40 x 10^6)) x 10^3 + 0.1 ms
= 5 + 3 + 0.0128 + 0.1 ≈ 8.11 ms
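A quick check of the arithmetic above in Python (512 B sector, as the worked numbers assume):

def disk_access_ms(seek_ms=5, rpm=10000, sector_bytes=512,
                   rate_bytes_per_s=40e6, overhead_ms=0.1):
    rotational_ms = 0.5 * 60e3 / rpm              # half a revolution, in ms
    transfer_ms = sector_bytes / rate_bytes_per_s * 1e3
    return seek_ms + rotational_ms + transfer_ms + overhead_ms

print(disk_access_ms())  # 8.1128 -> about 8.11 ms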
RAID
• RAID stands for Redundant Array of Inexpensive/Independent Disks
• Uses a number of small disks instead of one large disk.
• Six basic RAID levels (0-5), plus combinations.
Terminologies
• Mean Time Between Failures (MTBF)
• Mean Time To Data Loss (MTTDL)
• Mean Time To Data Inaccessibility (MTTDI)

RAID-0 Striping

Logical storage: chunks 01 02 03 04 05 06 07 08 09 10 11 12 (one row of chunks = stripe width)

Physical storage (n independent disks):
01 02 03 04
05 06 07 08
09 10 11 12

RAID-0 Performance
• Throughput: best case – nearly n x the single-disk value
• Utilization: worst case – nearly (1/n) x the single-disk value
• Data reliability: r^n, where r is the reliability of one disk (r <= 1).
• Sequential access: fast
• Random access: multithreaded random access offers better performance.
• When r = 0.8 and n = 2, reliability is 0.64.

RAID-1 Mirroring
• Issue addressed: RAID-0's reliability problem.
• Shadowing or mirroring is used.
• Data is not lost when a disk goes down.
• One or more disks can be used to mirror the primary disk.
• Writes are posted to the primary and all shadowing disks.
• Reads can be served from any of them.

(Diagram: blocks 01-03 on the primary disk are duplicated on a mirror disk.)
RAID-1 Performance
• Reliability is improved with mirroring:
– 1 - (1-r)(1-r) = 1 - (1-r)^2
• Example: when r is 0.8, the reliability of RAID-1 is 0.96.
• Writes are more complex – they must be committed to the primary and all shadowing disks.
• Writes are much slower due to the atomicity requirement.
• Expensive – due to 1-to-1 redundancy.
RAID 0+1 – Striping & Mirroring

(Diagram: a RAID 1 root mirrors two RAID 0 stripe sets; each set of n disks holds blocks 01 02 / 03 04 / 05 06.)
Performance RAID-0+1
• Let the reliability of a RAID 0 sub-tree be R':
– Then the reliability of the RAID 1 tree = 1 - (1-R')(1-R')
• Reliability R' is:
– R' = r^n (where r is the reliability of a single disk)
• Throughput is the same as RAID-0, but with 2 x n disks
• Utilization is lower than RAID-0 due to mirroring
• "Write" is marginally slower due to atomicity
• When r = 0.9 and n = 2: R' = 0.81, and R = 1 - (0.19)^2 ≈ 0.96
RAID 1+0 – Mirroring & Striping

(Diagram: a RAID 0 root stripes across RAID 1 mirrored pairs; each pair holds identical copies: 01 01 / 03 03 / 05 05 and 02 02 / 04 04 / 06 06.)
Performance RAID-1+0
• Let the reliability of a RAID 1 sub-tree be R':
– Then the reliability of the RAID 0 tree = (R')^n
• Reliability R' is:
– R' = 1 - (1-r)^2 (where r is the reliability of a single disk)
• Throughput is the same as RAID-0, but with 2 x n disks
• Utilization is lower than RAID-0 due to mirroring
• "Write" is marginally slower due to its atomicity
• When r = 0.9 and n = 2: R' = 0.99, and R = (0.99)^2 ≈ 0.98

RAID-2 Hamming Code
Arrays
• Low commercial interest due to
complex nature of Hamming code
computation.
RAID-3 Striping with Parity

(Striping unit: single bits or words.)

Logical storage: 01 02 03 04 05 06 07 08 09 10 11 12 (stripe width = 4)

Physical storage:
01 02 03 04  P0
05 06 07 08  P1
09 10 11 12  P2
(P0-P2 reside on a dedicated parity disk)
RAID-3 Operation
• Based on the principle of a reversible form of parity computation.
• Parity P = C0 ⊕ C1 ⊕ … ⊕ Cn-1 ⊕ Cn
• Missing stripe Cm = P ⊕ C0 ⊕ C1 ⊕ … ⊕ Cm-1 ⊕ Cm+1 ⊕ … ⊕ Cn-1 ⊕ Cn
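A minimal sketch of this reversible parity in Python, with made-up 4-bit chunks: XOR of all data chunks gives P, and XOR of P with the surviving chunks recovers a lost chunk.

from functools import reduce
from operator import xor

chunks = [0b1011, 0b0110, 0b1110, 0b0001]   # hypothetical data chunks C0..C3
parity = reduce(xor, chunks)                # P = C0 ^ C1 ^ C2 ^ C3

lost = 2                                    # pretend the disk holding C2 fails
survivors = [c for i, c in enumerate(chunks) if i != lost]
recovered = reduce(xor, survivors, parity)  # P ^ C0 ^ C1 ^ C3
assert recovered == chunks[lost]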

RAID-3 Performance
• RAID-1's 1-to-1 redundancy issue is addressed by a 1-for-n parity disk. Less expensive than RAID-1.
• The rest of the performance is similar to RAID-0.
• It can withstand the failure of one of its disks.
• Reliability = P(all disks working) + P(exactly one disk failed)
= r^n + nC1 x r^(n-1) x (1-r)
• When r = 0.9 and n = 5:
= 0.9^5 + 5 x 0.9^4 x (1 - 0.9)
= 0.590 + 5 x 0.656 x 0.1
= 0.590 + 0.328 ≈ 0.92
RAID-4 Performance
• Similar to RAID-3, but supports larger
chunks.
• Performance measures are similar to
RAID-3.
RAID-5 (Distributed Parity)

Logical storage: chunks 01 02 03 04 05 06 07 08 09 10 11 12 (stripe width = 4)

Physical storage:
01 02 03 04 P0
P1 05 06 07 08
09 P2 10 11 12
(parity blocks are distributed across all disks; there is no dedicated parity disk)
RAID-5 Performance
• In RAID-3 and RAID-4, Parity Disk is a
bottleneck for Write operations.
• This issue is addressed in RAID-5.
Reading Assignment
• Optical Disk
• RAID study material
• Worked out problems from the text
book: p 452, p 537, p 539, p561, p
684, p 691, p 704
Amdahl's Speedup – Multiprocessors

Speedup = 1 / (Fraction_e/Speedup_e + (1 - Fraction_e))

• Speedup_e – number of processors
• Fraction_e – fraction of the program that runs in parallel on Speedup_e processors

Assumption: the program runs either in fully parallel (enhanced) mode, making use of all the processors, or in non-enhanced (serial) mode.
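A one-line check of the formula (the 90%/8-processor values are hypothetical):

def amdahl_speedup(fraction_e, speedup_e):
    return 1 / (fraction_e / speedup_e + (1 - fraction_e))

print(amdahl_speedup(0.9, 8))  # ~4.71 on 8 processors, 90% parallel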

Multiprocessor Architectures
• Single Instruction Stream, Single Data Stream (SISD)
• Single Instruction Stream, Multiple Data Streams (SIMD)
• Multiple Instruction Streams, Single Data Stream (MISD)
• Multiple Instruction Streams, Multiple Data Streams (MIMD)
Classification of MIMD
Architectures
• Shared Memory:
– Centralized shared memory architecture OR
Symmetric (shared-memory) Multiprocessors
(SMP) OR Uniform Memory Access (UMA).
– Distributed shared memory architecture OR Non-
Uniform Memory Access (NUMA) architecture.
• Message Passing:
– Multiprocessor Systems based on
messaging
Shared Memory versus Message Passing

No  Shared Memory                                      Message Passing
1   Compatibility with the well-understood shared-     Simpler hardware compared to scalable shared
    memory mechanism used in centralized               memory
    multiprocessor systems
2   Simplifies compiler design                         Communication is explicit, which makes it
                                                       easier to understand
3   No need to learn a messaging protocol              Improved modularity
4   Low communication overhead                         No need for the expensive and complex
                                                       synchronization mechanisms used in shared
                                                       memory
5   Caching to improve latency
Two types of Shared Memory architectures

Centralized Shared-Memory Architecture
(Diagram: several processors, each with its own cache, share a single bus to main memory and I/O systems.)

Distributed Shared-Memory Architecture
(Diagram: nodes consisting of processor + cache + local memory + I/O, connected by an interconnection network.)
Symmetric versus Distributed Memory MP
• SMP: uses shared memory for inter-process communication
– Advantages:
• Close coupling due to shared memory
• Sharing of data is faster between processors
– Disadvantages:
• Scaling: memory is a bottleneck
• High and unpredictable memory latency

• Distributed: uses message passing for inter-process communication
– Advantages:
• Low memory latency
• Scaling: scales better than SMP
– Disadvantages:
• Control and management are more complex due to distributed memory
Performance Metrics
• Communication bandwidth
• Communication latency: ideally as low as possible. Communication latency
= Sender overhead + Time of flight + Transmission time + Receiver overhead
• Communication latency hiding
Cache Coherence Problem

Time  Event                 Cache contents  Cache contents  Memory contents
                            for CPU A       for CPU B       for location X
0                                                           1
1     CPU A reads X         1                               1
2     CPU B reads X         1               1               1
3     CPU A stores 0 in X   0               1               0
Cache Coherence
A memory system is coherent if:

1. A read by a processor to location X that follows a write by another processor to X returns the written value, if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses. This defines a coherent view of memory.

2. Writes to the same location are serialized; that is, two writes to the same location by any two processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.

Coherency and Consistency
• Coherency and consistency are complementary.
• Coherency defines the behavior of reads and writes to the same memory location, while consistency defines the behavior of reads and writes with respect to other memory locations.
Features supported by Coherent Multiprocessors
• Migration: shared data in a cache can be moved to another cache directly (without going through main memory). This is referred to as migration. It is used in a transparent fashion. It reduces both latency (from going to another cache every time the data item is accessed) and precious memory bandwidth.

• Replication: caches also provide replication for shared data items that are being simultaneously read, since each cache makes a copy of the data item locally. Replication reduces both latency of access and contention for a read-shared data item.
Migration and Replication

Centralized Shared-Memory Architecture (SMP)
(Diagram: each processor's cache has a cache controller (cc); the controllers sit on a shared bus with main memory and I/O systems.)

cc = Cache Controller
Cache Coherence (CC) Protocols
• CC protocols implement cache coherency
• Two types:
– Snooping (replicated): there are multiple copies of the sharing status. Every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block, and no centralized state is kept.
– Directory based (logically centralized): there is only one copy of the sharing status of a block of physical memory. All the processors use this one copy. This copy could be in any of the participating processors.

Snooping Protocol

Centralized Shared-Memory Architecture (SMP)
(Diagram: processors with caches and cache controllers (cc) on a shared bus with main memory and I/O systems.)

Two ways:
1. Write Invalidation
2. Write Broadcast

cc = Cache Controller
Invalidation with a Snooping Protocol

Processor activity  Bus activity          Cache contents  Cache contents  Memory contents
                                          for CPU A       for CPU B       for location X
                                                                          0
CPU A reads X       Cache miss for X      0                               0
CPU B reads X       Cache miss for X      0               0               0
A writes 1 to X     Invalidation for X    1                               0
B reads X           Cache miss for X      1               1               1
Write Broadcast with a Snooping Protocol

Processor activity  Bus activity             Cache contents  Cache contents  Memory contents
                                             for CPU A       for CPU B       for location X
                                                                             0
CPU A reads X       Cache miss for X         0                               0
CPU B reads X       Cache miss for X         0               0               0
A writes 1 to X     Write broadcast for X    1               1               1
B reads X                                    1               1               1
Reading Assignment
• Section 7.4 (Reliability, Availability, &
Dependability) in your text book
• Page 554 and 555
• Section 7.11 I/O Design – attempt all five problems in this section.


Invalidation versus Write Distribute
• Multiple writes to the same data item with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation in a write-invalidate protocol.

• With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol.

• An invalidation protocol works on cache blocks, while an update protocol must work on individual words.

• The delay between writing a word in one processor and reading the written value in another processor is usually less in a write-update scheme, since written data are immediately updated in the reader's cache. By comparison, in an invalidation protocol the reader is invalidated first, then later reads the data and is stalled until a copy can be read and returned to the processor.
Coherence Misses
• Write invalidate causes read cache misses:
– True miss: if an item X in a block is made exclusive by a write invalidate, and a read of the same item in another processor then causes a read miss, it is a true miss.
– False miss: if the miss is caused by non-exclusive items in the same block, it is a false miss.
Use of Valid, Shared, and Dirty bits
• Valid bit: every time a block is loaded into a cache from memory, the tag for the block is saved in the cache and the valid bit is set to TRUE. A write to the same block in a different processor may reset this valid bit due to write invalidate. Thus, when a cache block is accessed for READ or WRITE, the tag should match AND the value of the valid bit should be TRUE. If the tag matches but the valid bit is reset, then it's a cache miss.

• Shared bit: when a memory block is loaded into a cache block for the first time, the shared bit is set to FALSE. When some other cache loads the same block, it is turned to TRUE. When this block is updated, write invalidate uses the value of the shared bit to decide whether to send a write-invalidate message or not. If the shared bit is set, an invalidate message is sent; otherwise not.

• Dirty bit: the dirty bit is set to FALSE when a block is loaded into cache memory. It is set to TRUE when the block is updated for the first time. When another processor wants to load this block, the block is migrated to that processor instead of being loaded from memory.
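A small sketch of the per-block bookkeeping just described (field and function names are illustrative, not from any real cache):

from dataclasses import dataclass

@dataclass
class CacheBlock:
    tag: int
    valid: bool = True     # set on load; cleared by a remote write invalidate
    shared: bool = False   # set when another cache loads the same block
    dirty: bool = False    # set on the first local write

def is_hit(block: CacheBlock, tag: int) -> bool:
    # The tag must match AND the valid bit must still be set.
    return block.valid and block.tag == tag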
Summary of Snooping Mechanism

Request     Source     State of addressed    Function and explanation
                       cache block
Read hit    processor  shared or exclusive   Read data in cache
Read miss   processor  invalid               Place read miss on bus
Read miss   processor  shared                Address conflict miss: place read miss on bus
Read miss   processor  exclusive             Address conflict miss: write back block, then place read miss on bus
Write hit   processor  exclusive             Write data in cache
Write hit   processor  shared                Place write miss on bus
Write miss  processor  invalid               Place write miss on bus
Write miss  processor  shared                Address conflict miss: place write miss on bus
Write miss  processor  exclusive             Address conflict miss: write back block, then place write miss on bus
Read miss   bus        shared                No action; allow memory to service read miss
Read miss   bus        exclusive             Attempt to share data: place cache block on bus and change state to shared
Write miss  bus        shared                Attempt to write shared block; invalidate the block
Write miss  bus        exclusive             Attempt to write block that is exclusive elsewhere: write back the cache block and make its state invalid
State Transition: CPU Requests

(Figure: cache state transitions based on requests from the CPU. The transitions recoverable from the diagram:)

• Invalid -> Shared (read only): CPU read miss; place read miss on bus.
• Invalid -> Exclusive (read/write): CPU write; place write miss on bus.
• Shared -> Shared: CPU read hit; on a CPU read miss, place read miss on bus.
• Shared -> Exclusive: CPU write; place write miss on bus.
• Exclusive -> Exclusive: CPU read hit or CPU write hit; on a CPU write miss, write back the cache block and place write miss on bus.
• Exclusive -> Shared: CPU read miss; write back the block, then place read miss on bus.
State Transition: Bus Requests

(Figure: cache state transitions based on requests from the bus. The transitions recoverable from the diagram:)

• Shared (read only) -> Invalid: write miss for this block.
• Exclusive (read/write) -> Invalid: write miss for this block; write back the block and abort the memory access.
• Exclusive (read/write) -> Shared: read miss for this block; write back the block and abort the memory access.
Some Terminologies
• Polling: a process periodically checks if there is a message that it needs to handle. This method of awaiting a message is called polling. Polling reduces processor utilization.
• Interrupt: a process is notified when a message arrives using the built-in interrupt mechanism. Interrupts increase processor utilization in comparison to polling.
• Synchronous: a process sends a message and waits for the response to come before sending another message or carrying out other tasks. This way of waiting is referred to as synchronous communication.
• Asynchronous: a process sends a message and continues to carry out other tasks while the requested message is processed. This is referred to as asynchronous communication.
Communication Infrastructure
• Multiprocessor systems with shared memory can have two types of communication infrastructure:
– Shared bus
– Interconnect

(Diagram: processors connected by an interconnect versus processors on a shared bus.)
Directory Based Protocol

(Diagram: nodes 1 … n, each with its own directory, connected by an interconnect.)
State of the Block
• Shared: One or more processors have the block cached,
and the value in memory is up to date.
• Uncached: No processor has a copy of the block.
• Exclusive: Exactly one processor has a copy of the
cache block, and it has written the block, so the
memory copy is out of date. The processor is called
the owner of the block.
Local, Remote, Home Node
• Local node: the node where the request originates.
• Home node: the node where the memory location and the directory entry of an address reside.
• Remote node: a node that has a copy of a cache block.
Shared State Operation
• Read miss: the requesting processor is sent the requested data from memory, and the requestor is added to the sharing set.
• Write miss: the requesting processor is sent the value. All the processors in the Sharers set are sent invalidate messages, and the Sharers set is updated to contain only the identity of the requesting processor. The state of the block is made exclusive.
Uncached State Operation
• Read miss: the requesting processor is sent the requested data from memory, and the requestor is made the only sharing node. The state of the block is made shared.
• Write miss: the requesting processor is sent the requested data and becomes the sharing node. The block is made exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
Exclusive State Operation
• Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the Sharers set, which still contains the identity of the processor that was the owner (since it still has a readable copy).

• Data write-back: the owner processor is replacing the block and therefore must write it back. This write-back makes the memory copy up to date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.

• Write miss: the block has a new owner. A message is sent to the old owner, causing the cache to invalidate the block and send the value to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains exclusive.
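The three state-operation slides above describe a small state machine; here is a hedged Python sketch of just the directory side (message sending is reduced to comments; the names are illustrative, not from the text):

class DirectoryEntry:
    def __init__(self):
        self.state = "uncached"
        self.sharers = set()

    def read_miss(self, p):
        if self.state == "exclusive":
            pass                   # fetch from the owner; owner keeps a readable copy
        self.sharers.add(p)        # data value reply sent to p
        self.state = "shared"

    def write_miss(self, p):
        # invalidate all sharers (or fetch/invalidate the old owner)
        self.sharers = {p}
        self.state = "exclusive"   # p becomes the owner

    def write_back(self, p):
        self.sharers = set()       # owner replaced the block; memory is up to date
        self.state = "uncached"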
Cache State Transition (Directory)

(Figure: directory state transitions. The transitions recoverable from the diagram:)

• Uncached -> Shared: read miss; data value reply; Sharers = {P}.
• Uncached -> Exclusive: write miss; data value reply; Sharers = {P}.
• Shared -> Shared: read miss; data value reply; Sharers += {P}.
• Shared -> Exclusive: write miss; invalidate sharers; data value reply; Sharers = {P}.
• Exclusive -> Uncached: data write-back; Sharers = {}.
• Exclusive -> Shared: read miss; fetch; data value reply; Sharers += {P}.
• Exclusive -> Exclusive: write miss; fetch/invalidate; data value reply; Sharers = {P}.
True Sharing Miss & False Sharing Miss

(Assume X1 and X2 reside in the same cache block.)

Time  Processor P1  Processor P2
1     Write X1
2                   Read X2
3     Write X1
4                   Write X2
5     Read X2
Synchronization
• We need a set of hardware primitives
with the ability to atomically read
and modify a memory location.
Example 1: Atomic Exchange
• Interchanges a value in a register for a value in memory.
• You could build a lock with this. If the memory location contains 1, then the lock is held.

(Diagram: register A holding 0 is atomically swapped with the lock value 1 in physical main memory.)
Other Operations
• Test-and-Set
• Fetch-and-Increment
Load Linked & Set Conditional
• Load Linked: load linked is a primitive operation that loads the content of a specified location into a register.
• Set Conditional: the set (store) conditional operation is paired with a load linked operation on the same memory location. It sets the location to a new value only if the location still contains the same value read by the load linked; otherwise it fails.
Operation of LL & SC

try: LL R2, 0(R1)      ; load linked
     DADDUI R3, R2, #1 ; increment
     SC R3, 0(R1)      ; store conditional
     BEQZ R3, try      ; branch if store fails

1. The address of the memory location loaded by LL is kept in a link register.
2. If there is an interrupt or an invalidate, this register is cleared.
3. SC checks whether this register has been cleared:
   a. If it has, the SC fails (R3 is set to 0).
   b. Otherwise, it stores the content of R3 at that memory address (and returns 1 in R3).
Single Processor System
• Concurrent execution is the source of the problem.

(X = 0 in shared memory; processes P1 and P2 contend for it.)

P1:  TOP: READ X
          BNEZ TOP
                         (P2 is scheduled)
P2:  TOP: READ X
          BNEZ TOP
          WRITE 1
                         (P1 is scheduled)
P1:       WRITE 1

(Both processes see X = 0 and both write 1 – the lock is acquired twice.)
With Interrupt Mask

(X = 0 in shared memory.)

P1:  TOP: READ X
          BNEZ TOP
          MASK
          WRITE 1
          UNMASK
P2:       MASK
     TOP: READ X
          BNEZ TOP
          WRITE 1
          UNMASK
Multiprocessor System
• Masking the interrupt will not work.

(X = 0 in shared memory.)

P1 (Processor 1):        P2 (Processor 2):
TOP: READ X              TOP: READ X
     BNEZ TOP                 BNEZ TOP
     MASK                     MASK
     WRITE 1
     UNMASK                   WRITE 1
                              UNMASK
Multiprocessor System
• Uses Cache Coherence
– Bus Based Systems or Interconnect
Based System
• Coherence system arbitrates writes
(Invalidation/write distribute)
• Thus, serializes writes!
Spin Locks
• Locks that a processor continuously tries to acquire, spinning around a loop until it succeeds.
• Used when a lock is expected to be held for only a short amount of time.
Spin Lock Implementation

lockit: LD R2, 0(R1)       ; load the lock
        BNEZ R2, lockit    ; not available - spin
        DADDUI R2, R0, #1  ; load locked value
        EXCH R2, 0(R1)     ; swap
        BNEZ R2, lockit    ; branch if lock wasn't 0
Cache Coherence Steps

Step  P0              P1                               P2                               Lock state      Bus/Directory activity
1     Has lock        Spins, testing if lock = 0       Spins, testing if lock = 0       Shared          None
2     Sets lock to 0  Invalidate received              Invalidate received              Exclusive (P0)  Write invalidate of lock variable from P0
3                     Cache miss                       Cache miss                       Shared          Bus/directory services P2 cache miss; write back from P0
4                     Waits for bus                    Lock = 0                         Shared          Cache miss for P2 satisfied
5                     Lock = 0                         Executes swap, gets cache miss   Shared          Cache miss for P1 satisfied
6                     Executes swap, gets cache miss   Completes swap: returns 0        Exclusive (P2)  Bus/directory services P2 cache miss; generates invalidate
                                                       and sets lock = 1
7                     Swap completes and returns 1,    Enters critical section          Exclusive (P1)  Bus/directory services P1 cache miss; generates write back
                      and sets lock = 1
8                     Spins, testing if lock = 0                                                        None
Multi-threading
• Multi-threading – why and how?
• Fine-grain
– Almost round robin
– High overhead (compared to coarse grain)
– Better throughput than coarse grain
• Coarse-grain
– Threads switched only on long stalls (e.g., an L2 cache miss)
– Low overhead
– Individual threads get better performance
Reading Assignment
• Memory Consistency
– Notes (Answers)
• Problem on page 596
• An example that uses LL (Load
Linked) and Set Conditional (SC).
End!
IETF – Internet Engineering Task Force
• Loosely self-organized groups (working groups) of …
• Two types of documents: I-Ds and RFCs
• An I-D's life is short compared to an RFC's
• RFCs include proposed standards and standards
• RFC examples: RFC 1213, RFC 2543
• I-D -> Proposed Standard -> Standard
SIP Evolution
• SIP I-D v1 '97, v2 '98
• SIP RFC 2543 '99
• SIP RFC 3261 '02, obsoletes RFC 2543
• SIP working groups – SIPPING, SIMPLE, PINT, SPIRITS, …
• SIPit – SIP interoperability test


Predictor for SPEC92 Benchmarks

(Chart: instructions between mispredictions, 0-300, for the SPEC92 benchmarks, comparing a predicted-taken scheme against profile-based prediction.)
