Q1
Time: 9:00 AM - 12:00 Noon
Full Marks: 60
Speed difference: cache and main memory: one order of magnitude; main memory and HDD: several orders of magnitude.
Response to a miss: cache: must be handled by h/w, can't afford to switch context; VM: can't keep the CPU waiting, must switch context, can be conveniently handled by s/w.
Miss rate: VM can only afford extremely low miss rates.
Policies: pages much larger than blocks (~4 KB-16 KB) capture large spatial locality and amortize transfer time; fully flexible mapping (not with associative memory); always write-back; good approximation to LRU.
Virtual address = 32 bit, page size = 4 KB = 4 x 2^10 B = 2^12 B
Number of page table entries = 2^32 / 2^12 = 2^20
Each entry requires 8 B, so total page table size = 2^20 x 8 B = 8 MB
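The page table arithmetic above can be checked with a short Python sketch (a verification aid, not part of the original solution):

```python
# Page table size for a 32-bit virtual address space with 4 KB pages
# and 8-byte page table entries (values from the solution above).
virtual_address_bits = 32
page_size = 4 * 2**10          # 4 KB = 2^12 bytes
pte_size = 8                   # bytes per page table entry

num_entries = 2**virtual_address_bits // page_size   # 2^32 / 2^12 = 2^20
table_size = num_entries * pte_size                  # 2^20 * 8 B = 8 MB

print(num_entries)   # 1048576 = 2^20
print(table_size)    # 8388608 bytes = 8 MB
```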
Q2
[8 Marks] [Topic: I/O Organization]
A. [1+2] What is the generalized mechanism to connect multiple I/O devices to a system? How does the system handle it when an I/O device sends an interrupt to the CPU?
Sol: An interrupt controller with a priority resolver is the generalized mechanism to connect multiple I/O devices; the priority resolver lets the controller prioritize requests.
When a device sends an interrupt to the CPU, the CPU stops its normal execution and transfers program control to a specific location to service the interrupt (ISR: Interrupt Service Routine). It performs the task in the ISR; at the end of the ISR it enables interrupts (EI), and after return from the ISR it continues normal execution.
B. [1+4] What is the difference between interrupt and polling? Calculate the maximum bandwidth of a USB/SATA hard disk connected to a high-performance PC system, where the USB controller speed is 2 MHz (T = 0.5 µs) and uses asynchronous serial communication with a framing of 1 start bit, 2 stop bits and 8 data bits.
Sol: Polling is the CPU checking the status of a device at regular intervals, whereas with an interrupt the device sends a ready signal to the CPU. Polling is CPU-initiated but an interrupt is device-initiated.
Minimum time to send 8 data bits = 11 cycles, that is 1 (start bit) + 8 (data) + 2 (stop bits), so 1 byte takes 11 cycles.
Bandwidth = number of bytes sent per second = (number of cycles per second) / (number of cycles per byte)
= clock speed 2x10^6 / (11 cycles per byte) = 2/11 x 10^6 byte/s = 2/11 MB/s ≈ 0.1818 MB/s
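The bandwidth figure follows directly from the framing overhead; a minimal Python check:

```python
# Max bandwidth of the asynchronous serial link: each byte costs
# 1 start + 8 data + 2 stop = 11 clock cycles at a 2 MHz controller clock.
clock_hz = 2_000_000
cycles_per_byte = 1 + 8 + 2    # start + data + stop bits

bytes_per_sec = clock_hz / cycles_per_byte
print(bytes_per_sec)               # ≈ 181818 B/s
print(bytes_per_sec / 1_000_000)   # ≈ 0.1818 MB/s
```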
Q3
[10 Marks] [Topic: Processor Pipeline and Data Forwarding]
A. [3] Explain the different types of data forwarding paths (P1, P2, P3 and P4) and their performance improvement in terms of delay cycles, assuming a 5-stage pipelined processor. The example figure shows one data forward path, P1 (A-A); the other paths are M-A, A-M and M-M.
B. [7=2+3+2] Given the assembly program below (a loop with one branch instruction), identify all dependent pairs of instructions and the forward paths required to resolve these data hazard conditions (if possible, optionally draw a dependency graph). Assuming no branch performance improvement, calculate the performance of the given program in terms of number of cycles. Think: how can you remove the Path 2 delay by interchanging the instruction sequence?
L:  move $t0, $zero
    addi $t2, $zero, 100
    lw   $t2, 0($7)
                            P2
    add  $t1, $t2, $s1
                            P1
    add  $a, $t1, $s5
    sw   $a, 32($s3)
                            P3
    add  $6, $3, $a
    addi $t0, $t0, 1
    lw   $7, 0($8)
                            P4
    sw   $7, 8($0)
    add  $s9, $s9, 1
    beq  $t0, $t2, L
    hlt
P2 has 1 cycle of extra delay even with forwarding; the other delays are 0 with forwarding.
(b) All available paths P1, P2, P3 and P4 in the program are identified in the program diagram.
As the program executes once, each instruction takes 1 cycle except the 1st instruction (4 extra cycles of pipeline fill), plus one P2 dependency stall, so the performance of the program is 13 + 4 + 1 = 18 cycles.
Instruction interchange to improve performance: either of the independent instructions add $s9, $s9, 1 or addi $t0, $t0, 1 can be placed between the P2-dependent lw and add instructions.
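The cycle count above can be expressed as a one-line calculation (a sketch of the solution's accounting, not a full pipeline simulation):

```python
# Cycle count for the 13-instruction program executed once on a
# 5-stage pipeline: 1 cycle per instruction, +4 fill cycles before
# the first instruction completes, +1 stall for the single P2
# (load-use) hazard that forwarding cannot fully hide.
num_instructions = 13
pipeline_fill = 4
p2_stalls = 1

total_cycles = num_instructions + pipeline_fill + p2_stalls
print(total_cycles)    # 18
```

Interchanging an independent instruction into the load-use slot removes `p2_stalls`, giving 17 cycles.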
Q4
[12 Marks] [Topic: Branch Performance Optimization]
A. [6=1.5x4] What are the four basic techniques of branch performance improvement? Write concisely, with at least one example for each case.
Sol: Branch elimination/predicated instructions (e.g. the slt instruction); branch speed-up (filling the gap between the CC and branch instructions by early condition checking, or delayed branch by inserting independent instructions between the CC and B instructions); branch prediction (static, dynamic, or history-based using two-bit/bimodal counters); and the 4th one is branch target capture using BTC. (The 3rd & 4th methods are already given in the 2nd part of the question.)
B. [6=4+2] Suppose a processing system with a dynamic branch predictor using Branch Target Capture (BTC: BTAC/BTIC). A BTC hit is used as the prediction: if there is a hit in the BTC then it predicts going to the target, else it predicts going inline.
A program has 10,000 instructions to run, the branch probability is 0.2 (so only 2,000 are branch instructions), and it uses the BTAC scheme for branch prediction. The BTAC hit ratio = 0.8, the program's probability of going to the target = 0.7, delay of GIGT = 5, delay of GIGI = 0, delay of GTGI = 4, delay of GTGT = 3. Calculate the overall branch performance and the program performance in terms of number of cycles.
What will the performance be if BTIC is used instead of BTAC, where we save two cycles in the going-to-target case?
Sol: Average branch delay =
0.2 miss (0.3*0 + 0.7*5) + 0.8 hit (0.3*4 + 0.7*3) = 0.7 + 0.8(1.2 + 2.1) = 0.7 + 0.8*3.3 = 0.7 + 2.64 = 3.34
Probability 0.2 = 2,000 branches. Total extra delay of branches = 3.34 * 2,000 = 6,680
Total program execution time = 10,000 + 6,680 = 16,680 cycles
If we use BTIC we save two cycles in the BTIC hit, going-to-target case (GTGT = 3 becomes 1).
So average branch delay = 0.2(0.3*0 + 0.7*5) + 0.8(0.3*4 + 0.7*1) = 0.7 + 0.8*1.9 = 0.7 + 1.52 = 2.22
Total program execution time = 10,000 + 2,000*2.22 = 14,440 cycles
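The BTAC/BTIC expected-delay calculation above can be checked numerically (same numbers as the solution):

```python
# Average branch delay for the BTAC predictor (Q4.B numbers).
# BTC miss (prob 0.2): predicted inline; GIGI = 0, GIGT = 5
# BTC hit  (prob 0.8): predicted target; GTGI = 4, GTGT = 3
p_hit, p_target = 0.8, 0.7

btac_delay = (1 - p_hit) * ((1 - p_target) * 0 + p_target * 5) \
           + p_hit * ((1 - p_target) * 4 + p_target * 3)
print(btac_delay)                        # ≈ 3.34 cycles per branch

branches = 2000                          # 20% of 10,000 instructions
print(10_000 + branches * btac_delay)    # ≈ 16,680 cycles

# BTIC saves 2 cycles on a hit that goes to target: GTGT = 3 -> 1
btic_delay = (1 - p_hit) * ((1 - p_target) * 0 + p_target * 5) \
           + p_hit * ((1 - p_target) * 4 + p_target * 1)
print(btic_delay)                        # ≈ 2.22 cycles per branch
print(10_000 + branches * btic_delay)    # ≈ 14,440 cycles
```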
Q5
B. [5=2+3] Given configuration data for L1 (#set=512, Assoc=4, Line size=32B, Hit time=1, WT), L2 (#set=1024,
Assoc=2, line size=1024byte, hit time = 10cycle, WB) and DRAM memory (hit time=1000cycle).
i. Calculate the sizes of L1 and L2.
ii. Calculate the AMAT, where the probability of a write (probability of dirty) is 0.2 for an L2 cache line, and replacement from L2 cache to memory or vice versa takes 10,000 cycles per 128 bytes.
Sol: L1 size = 512 * 4 * 32 B = 64 KB // B is byte
L2 size = 1024 * 2 * 1024 B = 2 MB
AMAT = 1 + 0.2(10 + 0.2(1000 + 1024/128 * (0.2 + 1))) // (0.2+1) is for WB: transfer of the old dirty block C->M and the new block M->C
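The two size calculations (part i) are simple products of the configuration parameters; a quick Python check:

```python
# Cache capacity = number of sets * associativity * line size
# (configuration values from Q5.B).
l1_size = 512 * 4 * 32        # bytes
l2_size = 1024 * 2 * 1024     # bytes

print(l1_size // 1024)        # 64  -> L1 is 64 KB
print(l2_size // 2**20)       # 2   -> L2 is 2 MB
```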
Q6
[12 Marks] [Topic: Cache Performance Optimization]
A. [2] Explain the techniques available to reduce the hit time of a set-associative cache.
Sol: Pseudo Associativity and Way Prediction
B. [2] Explain the pre-fetch buffer and the victim buffer in a cache system.
Sol: Pre-fetch buffer: while accessing the current cache block, we can prefetch a future required (next) block beforehand and store it in a buffer, so possibly there will not be a miss.
Victim buffer: evicted blocks are recycled; this is much faster than getting a block from the next level, and a significant fraction of misses may be found in the victim cache.
C. [3+2+3] Suppose we want to run the following matrix multiplication program on a computer system whose cache line (or block) can hold 4 matrix elements. Matrices are stored row-wise. The sizes of L, M and N are so large w.r.t. the cache size that the cache can't accommodate a full row/column, which implies that after iterating along any of the three indices, when an element is accessed again it results in a miss.
Ignore misses due to conflicts between matrices, as if there were a separate cache for each matrix.
(a) Calculate the number of misses for the given matrix multiplication code.
(b) Recalculate the number of cache misses if we introduce three pre-fetch buffers, where on a miss the cache can pre-fetch two new blocks into the pre-fetch buffer. (Demand pre-fetch: every miss fetches two new blocks, of which one is the currently demanded one and the other is the next pre-fetched one.)
Sol: LN(2M+1)/8; each miss also brings the next block, so it removes one miss per miss (halving the miss count).
(c) Recalculate the number of cache misses if the introduced pre-fetch buffer can pre-fetch in the background before the processor needs the next data line. Suppose the processor is accessing the 10th block; the buffer will start fetching the 11th one in the background. (There will always be a miss at the beginning of a matrix access; if the matrix is accessed sequentially along a row then there will not be any further misses.)
Sol:
Misses C: LN
Misses A: 1 (array A is accessed completely row-wise, one block after another)
Misses B: L
Total: L(N+1) + 1
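The per-matrix counts in part (c) can be checked against the closed form (a small sketch of the solution's tally, not a cache simulation):

```python
# Miss counts for Q6.C(c) with background pre-fetch: sequential
# row-wise streams miss only at the start of each fresh traversal.
# Per the solution: C misses L*N, A misses 1, B misses L.
def total_misses(L, N):
    return L * N + 1 + L      # C + A + B

# Matches the closed form L*(N+1) + 1 for any sizes.
for L, N in [(8, 8), (16, 4), (100, 100)]:
    assert total_misses(L, N) == L * (N + 1) + 1

print(total_misses(100, 100))   # 10101
```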