Q1
Time: 9:00 AM - 12:00 Noon
Full Marks: 60
Speed difference: cache and main memory: one order of magnitude; main memory and HDD: several orders of magnitude.
Response to a miss: cache: must be handled by h/w, can't afford to switch context; VM: can't keep the CPU waiting, must switch context, can be conveniently handled by s/w.
Miss rate: VM can only afford extremely low miss rates.
Policies: pages much larger than blocks (~4 KB-16 KB) capture large spatial locality and amortize transfer time; fully flexible mapping (not with associative memory); always write-back; good approximation to LRU.
Virtual address = 32 bit, page size = 4 KB = 4 x 2^10 B = 2^12 B
Number of page table entries = 2^32 / 2^12 = 2^20
Each entry requires 8 B, so total page table size = 2^20 x 8 B = 8 MB
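The page table arithmetic above can be checked with a short Python sketch (a verification aid, not part of the original solution):

```python
# Page table size for a 32-bit virtual address space with 4 KB pages
# and 8-byte page table entries (values from the solution above).
virtual_address_bits = 32
page_size = 4 * 2**10          # 4 KB = 2^12 bytes
pte_size = 8                   # bytes per page table entry

num_entries = 2**virtual_address_bits // page_size   # 2^32 / 2^12 = 2^20
table_size = num_entries * pte_size                  # 2^20 * 8 B = 8 MB

print(num_entries)   # 1048576 = 2^20
print(table_size)    # 8388608 bytes = 8 MB
```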
Q2
[8 Marks] [Topic: I/O Organization]
A. [1+2] What is the generalized mechanism to connect multiple I/O devices to a system? How does the system handle it when an I/O device sends an interrupt to the CPU?
Sol: An interrupt controller with a priority resolver is the generalized mechanism to connect multiple I/O devices; the priority resolver lets the controller prioritize requests.
When a device sends an interrupt to the CPU, the CPU stops its normal execution and transfers program control to a specific location to service the interrupt (ISR: Interrupt Service Routine). It performs the task in the ISR; at the end of the ISR it enables interrupts (EI), and after return from the ISR it continues normal execution.
B. [1+4] What is the difference between interrupt and polling? Calculate the maximum bandwidth of a USB/SATA hard disk connected to a high-performance PC system, where the USB controller speed is 2 MHz (T = 0.5 µs) and uses asynchronous serial communication with a framing of 1 start bit, 2 stop bits and 8 data bits.
Sol: Polling is the CPU checking the status of a device at regular intervals, whereas with an interrupt the device sends a ready signal to the CPU. Polling is CPU-initiated but an interrupt is device-initiated.
Minimum time to send 8 data bits = 11 cycles, that is 1 (start bit) + 8 (data) + 2 (stop bits), so 1 byte takes 11 cycles.
Bandwidth = number of bytes sent per second = (number of cycles per second) / (number of cycles per byte)
= clock speed 2x10^6 / (11 cycles per byte) = 2/11 x 10^6 byte/s = 2/11 MB/s ≈ 0.1818 MB/s
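The bandwidth figure follows directly from the framing overhead; a minimal Python check:

```python
# Max bandwidth of the asynchronous serial link: each byte costs
# 1 start + 8 data + 2 stop = 11 clock cycles at a 2 MHz controller clock.
clock_hz = 2_000_000
cycles_per_byte = 1 + 8 + 2    # start + data + stop bits

bytes_per_sec = clock_hz / cycles_per_byte
print(bytes_per_sec)               # ≈ 181818 B/s
print(bytes_per_sec / 1_000_000)   # ≈ 0.1818 MB/s
```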
Q3
[10 Marks] [Topic: Processor Pipeline and Data Forwarding]
A. [3] Explain the different types of data forwarding paths (P1, P2, P3 and P4) and their performance improvement in terms of delay cycles, assuming a 5-stage pipelined processor. The example figure shows one data forward path, P1 (A-A); the other paths are M-A, A-M and M-M.
B. [7=2+3+2] Given the assembly program below (a loop with one branch instruction), identify all dependent pairs of instructions and the forward paths required to resolve these data hazard conditions (if possible, optionally draw a dependency graph). Assuming no branch performance improvement, calculate the performance of the given program in terms of number of cycles. Think: how can you remove the Path 2 delay by interchanging the instruction sequence?
L:  move $t0, $zero
    addi $t2, $zero, 100
    lw   $t2, 0($7)
                            P2
    add  $t1, $t2, $s1
                            P1
    add  $a, $t1, $s5
    sw   $a, 32($s3)
                            P3
    add  $6, $3, $a
    addi $t0, $t0, 1
    lw   $7, 0($8)
                            P4
    sw   $7, 8($0)
    add  $s9, $s9, 1
    beq  $t0, $t2, L
    hlt
P2 has 1 cycle of extra delay even with forwarding; the other delays are 0 with forwarding.
(b) All available paths P1, P2, P3 and P4 in the program are identified in the program diagram.
As the program executes once, each instruction takes 1 cycle except the 1st instruction (4 extra cycles of pipeline fill), plus one P2 dependency stall, so the performance of the program is 13 + 4 + 1 = 18 cycles.
Instruction interchange to improve performance: either of the independent instructions add $s9, $s9, 1 or addi $t0, $t0, 1 can be placed between the P2-dependent lw and add instructions.
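The cycle count above can be expressed as a one-line calculation (a sketch of the solution's accounting, not a full pipeline simulation):

```python
# Cycle count for the 13-instruction program executed once on a
# 5-stage pipeline: 1 cycle per instruction, +4 fill cycles before
# the first instruction completes, +1 stall for the single P2
# (load-use) hazard that forwarding cannot fully hide.
num_instructions = 13
pipeline_fill = 4
p2_stalls = 1

total_cycles = num_instructions + pipeline_fill + p2_stalls
print(total_cycles)    # 18
```

Interchanging an independent instruction into the load-use slot removes `p2_stalls`, giving 17 cycles.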
Q4
[12 Marks] [Topic: Branch Performance Optimization]
A. [6=1.5x4] What are the four basic techniques of branch performance improvement? Write concisely, with at least one example for each case.
Sol: Branch elimination/predicated instructions (e.g. the slt instruction); branch speed-up (filling the gap between the CC and branch instructions by early condition checking, or delayed branch by inserting independent instructions between the CC and B instructions); branch prediction (static, dynamic, or history-based using two-bit/bimodal counters); and the 4th one is branch target capture using BTC. (The 3rd & 4th methods are already given in the 2nd part of the question.)
B. [6=4+2] Suppose a processing system with a dynamic branch predictor using Branch Target Capture (BTC: BTAC/BTIC). A BTC hit is used as the prediction: if there is a hit in the BTC then it predicts going to the target, else it predicts going inline.
A program has 10,000 instructions to run, the branch probability is 0.2 (so only 2,000 are branch instructions), and it uses the BTAC scheme for branch prediction. The BTAC hit ratio = 0.8, the program's probability of going to the target = 0.7, delay of GIGT = 5, delay of GIGI = 0, delay of GTGI = 4, delay of GTGT = 3. Calculate the overall branch performance and the program performance in terms of number of cycles.
What will the performance be if BTIC is used instead of BTAC, where we save two cycles in the going-to-target case?
Sol: Average branch delay =
0.2 miss (0.3*0 + 0.7*5) + 0.8 hit (0.3*4 + 0.7*3) = 0.7 + 0.8(1.2 + 2.1) = 0.7 + 0.8*3.3 = 0.7 + 2.64 = 3.34
Probability 0.2 = 2,000 branches. Total extra delay of branches = 3.34 * 2,000 = 6,680
Total program execution time = 10,000 + 6,680 = 16,680 cycles
If we use BTIC we save two cycles in the BTIC hit, going-to-target case (GTGT = 3 becomes 1).
So average branch delay = 0.2(0.3*0 + 0.7*5) + 0.8(0.3*4 + 0.7*1) = 0.7 + 0.8*1.9 = 0.7 + 1.52 = 2.22
Total program execution time = 10,000 + 2,000*2.22 = 14,440 cycles
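The BTAC/BTIC expected-delay calculation above can be checked numerically (same numbers as the solution):

```python
# Average branch delay for the BTAC predictor (Q4.B numbers).
# BTC miss (prob 0.2): predicted inline; GIGI = 0, GIGT = 5
# BTC hit  (prob 0.8): predicted target; GTGI = 4, GTGT = 3
p_hit, p_target = 0.8, 0.7

btac_delay = (1 - p_hit) * ((1 - p_target) * 0 + p_target * 5) \
           + p_hit * ((1 - p_target) * 4 + p_target * 3)
print(btac_delay)                        # ≈ 3.34 cycles per branch

branches = 2000                          # 20% of 10,000 instructions
print(10_000 + branches * btac_delay)    # ≈ 16,680 cycles

# BTIC saves 2 cycles on a hit that goes to target: GTGT = 3 -> 1
btic_delay = (1 - p_hit) * ((1 - p_target) * 0 + p_target * 5) \
           + p_hit * ((1 - p_target) * 4 + p_target * 1)
print(btic_delay)                        # ≈ 2.22 cycles per branch
print(10_000 + branches * btic_delay)    # ≈ 14,440 cycles
```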
Q5
B. [5=2+3] Given configuration data for L1 (#set=512, Assoc=4, Line size=32B, Hit time=1, WT), L2 (#set=1024,
Assoc=2, line size=1024byte, hit time = 10cycle, WB) and DRAM memory (hit time=1000cycle).
i. Calculate the sizes of L1 and L2.
ii. Calculate the AMAT, where the probability of a write (probability of dirty) is 0.2 for an L2 cache line, and replacement from L2 cache to memory or vice versa takes 10,000 cycles per 128 bytes.
Sol: L1 size = 512 * 4 * 32 B = 64 KB // B is byte
L2 size = 1024 * 2 * 1024 B = 2 MB
AMAT = 1 + 0.2(10 + 0.2(1000 + 1024/128 * (0.2 + 1))) // (0.2+1) is for WB: transfer of the old dirty block C->M and the new block M->C
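The two size calculations (part i) are simple products of the configuration parameters; a quick Python check:

```python
# Cache capacity = number of sets * associativity * line size
# (configuration values from Q5.B).
l1_size = 512 * 4 * 32        # bytes
l2_size = 1024 * 2 * 1024     # bytes

print(l1_size // 1024)        # 64  -> L1 is 64 KB
print(l2_size // 2**20)       # 2   -> L2 is 2 MB
```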
Q6
[12 Marks] [Topic: Cache Performance Optimization]
A. [2] Explain the techniques available to reduce the hit time of a set-associative cache.
Sol: Pseudo Associativity and Way Prediction
B. [2] Explain the pre-fetch buffer and the victim buffer in a cache system.
Sol: Pre-fetch buffer: while accessing the current cache block, we can prefetch a future required (next) block beforehand and store it in a buffer, so possibly there will not be a miss.
Victim buffer: evicted blocks are recycled; this is much faster than getting a block from the next level, and a significant fraction of misses may be found in the victim cache.
C. [3+2+3] Suppose we want to run the following matrix multiplication program on a computer system whose cache line (or block) can hold 4 matrix elements. Matrices are stored row-wise. The sizes of L, M and N are so large w.r.t. the cache size that the cache can't accommodate a full row/column, which implies that after iterating along any of the three indices, when an element is accessed again it results in a miss.
Ignore misses due to conflicts between matrices, as if there were a separate cache for each matrix.
(a) Calculate the number of misses for the given matrix multiplication code.
(b) Recalculate the number of cache misses if we introduce three pre-fetch buffers, where on a miss the cache can pre-fetch two new blocks into the pre-fetch buffer. (Demand pre-fetch: every miss fetches two new blocks, of which one is the currently demanded one and the other is the next pre-fetched one.)
Sol: LN(2M+1)/8; each miss also brings the next block, so it removes one miss per miss (halving the miss count).
(c) Recalculate the number of cache misses if the introduced pre-fetch buffer can pre-fetch in the background before the processor needs the next data line. Suppose the processor is accessing the 10th block; the buffer will start fetching the 11th one in the background. (There will always be a miss at the beginning of a matrix access; if the matrix is accessed sequentially along a row then there will not be any further misses.)
Sol:
Misses C: LN
Misses A: 1 (array A is accessed completely row-wise, one block after another)
Misses B: L
Total: L(N+1) + 1
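The per-matrix counts in part (c) can be checked against the closed form (a small sketch of the solution's tally, not a cache simulation):

```python
# Miss counts for Q6.C(c) with background pre-fetch: sequential
# row-wise streams miss only at the start of each fresh traversal.
# Per the solution: C misses L*N, A misses 1, B misses L.
def total_misses(L, N):
    return L * N + 1 + L      # C + A + B

# Matches the closed form L*(N+1) + 1 for any sizes.
for L, N in [(8, 8), (16, 4), (100, 100)]:
    assert total_misses(L, N) == L * (N + 1) + 1

print(total_misses(100, 100))   # 10101
```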