
Sub-code CS1601

Subject Name Computer Architecture


Semester I
Degree ME
Branch Computer Science
Staff Name Dr. Hari T. S. Narayanan
Date of Update 1.10.08
S.No Part Unit Group No Question
1 A 2&3 1 1 Describe the three types of instruction dependences with appropriate examples.
2 A 2&3 1 2 Give examples using assembly-level code segments for each of the three instruction dependences. No definitions required.
3 A 1 1 3 If a computer supports an address space of 4G, how many bits are required for the address register?
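As a hedged illustration of question 3, the sketch below counts the bits needed so that every location in the address space has a distinct address. It assumes that "4G" means 4 x 2^30 addressable locations; the helper name `address_bits` is made up for this example.

```python
def address_bits(locations: int) -> int:
    """Smallest number of bits that can distinguish `locations` addresses."""
    # The highest address is locations - 1; its bit length is the register width.
    return max(1, (locations - 1).bit_length())

four_g = 4 * 2**30  # assumption: 4G = 4 * 2^30 locations
print(address_bits(four_g))  # 32
```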

4 A 4 1 4 Draw the memory hierarchy that is used in a typical desktop computer. List typical size and performance values for each of these levels.
5 A 4 1 5 Compare the three cache mapping functions in terms of access speed and cache misses.
6 A 5 2 1 Compare coarse-grain and fine-grain multithreading with their performance trade-offs.
7 A 5 2 2 Why is invalidation preferred over write distribution? State exactly two reasons.
8 A 4 2 3 What is the average rotational delay for a disk system with 10000 RPM?
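A minimal sketch for question 8, assuming the usual definition of average rotational delay as the time for half a revolution. The function name is hypothetical.

```python
def average_rotational_delay_ms(rpm: float) -> float:
    """Average rotational delay: the time for half a revolution, in ms."""
    ms_per_revolution = 60_000 / rpm  # 60,000 ms in one minute
    return ms_per_revolution / 2

print(average_rotational_delay_ms(10_000))  # 3.0
```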

9 A 1 2 4 Describe the relationship between response time, user time, elapsed time, and system time in the context of a process.
10 A 1 2 5 What are Big Endian and Little Endian representations?
11 A 2&3 3 1 Draw a 6-stage pipeline with multiple Functional Units.
12 A 2&3 3 2 Why is Read After Write (RAW) not a problem with extended ID (Issue & RO)?
13 A 2&3 3 3 Describe the three types of instruction dependences with appropriate examples.
14 A 2&3 3 4 Illustrate Write After Read and Write After Write hazards with appropriate examples.
15 A 2&3 3 5 Compare static scheduling of instructions with dynamic scheduling.
16 A 2&3 4 1 Describe the working of a branch prediction algorithm using a prediction buffer of size 1 bit.
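One way to picture the 1-bit scheme in question 16 is the sketch below. It models a single buffer entry only (a real prediction buffer is indexed by low-order branch address bits, which is omitted here), and the function name and the sample loop pattern are assumptions made for illustration.

```python
def simulate_one_bit(outcomes, initial=False):
    """Count mispredictions for one branch under a 1-bit predictor."""
    prediction = initial
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken  # the single bit remembers only the last outcome
    return mispredictions

# Hypothetical loop branch: taken 9 times, not taken on exit, run twice.
loop = [True] * 9 + [False]
print(simulate_one_bit(loop * 2))  # 4
```

Under these assumptions the predictor misses twice per execution of the loop: once on the exit and once on the first iteration of the next execution.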
17 A 2&3 4 2 Provide an example to show that data dependence by itself is not sufficient and control dependence needs to be considered as well. Why is the Data Flow and Exception Behavior combination preferred over the Control and Data Dependence combination in scheduling instructions?
18 A 2&3 4 3 Draw a 5-stage pipeline with multiple Functional Units.
19 A 2&3 4 4 Describe Write After Read Hazard and Write After Write Hazard.
20 A 2&3 4 5 Explain Read After Write (RAW) with an appropriate example.

21 A 2&3 5 1 What is the steady-state best-case throughput from a pipelined architecture where each stage takes 1 clock cycle of a 4 GHz clock? Give your answer in number of instructions per second.
22 A 4 5 2 Draw the memory hierarchy that is used in a typical desktop computer. List typical size and access speed for each of these levels.
23 A 4 5 3 Describe cache direct mapping using an example. For instance, a memory with 512 blocks and a cache with 32 blocks.
24 A 4 5 4 A program accesses cache 4 million times during its execution. How many of these accesses are hits if the miss rate is 0.01%?
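The arithmetic behind question 24 (and the similar question 33) can be sketched as follows. The helper name is hypothetical, and the miss rate is passed as a percentage string so that `Fraction` keeps the calculation exact.

```python
from fractions import Fraction

def hits_and_misses(accesses: int, miss_rate_percent: str):
    """Split a number of cache accesses into (hits, misses)."""
    miss_rate = Fraction(miss_rate_percent) / 100  # exact, avoids float rounding
    misses = int(accesses * miss_rate)
    return accesses - misses, misses

print(hits_and_misses(4_000_000, "0.01"))   # (3999600, 400)
print(hits_and_misses(2_000_000, "0.025"))  # (1999500, 500)
```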
25 A 4 5 5 A program includes 5 million instructions. On the average, each instruction takes 1.5 cycles if the entire program were to be loaded into the cache of a computer with a 1 GHz clock. If the program takes 8 million CPU cycles to execute, how much time is spent in stalling?
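A small sketch of the reasoning for question 25, assuming that every cycle beyond the all-in-cache cycle count is a stall cycle. The function name is made up for this example.

```python
def stall_time_seconds(instructions, ideal_cpi, total_cycles, clock_hz):
    """Stall time = (actual cycles - ideal cycles) / clock rate."""
    ideal_cycles = instructions * ideal_cpi   # cycles with no cache misses
    stall_cycles = total_cycles - ideal_cycles
    return stall_cycles, stall_cycles / clock_hz

cycles, seconds = stall_time_seconds(5_000_000, 1.5, 8_000_000, 1e9)
print(cycles, seconds)
```

With the numbers in the question this gives 0.5 million stall cycles, which is 0.5 ms at 1 GHz.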
26 A 4 6 1 A program accesses cache 4 million times during its execution. The miss rate is 0.01%. If the CPU stalls 2000 cycles for each cache miss, compute the number of misses.
27 A 5 6 2 What are the two different shared-memory arrangements used in MIMD architecture?
28 A 5 6 3 Why is a shared-bus MIMD system referred to as a symmetric multiprocessor system? What is the other common name for this arrangement?
29 A 1 6 4 Compare RISC and CISC architectures.
30 A 2&3 6 5 What is the basic principle behind Tomasulo’s algorithm? Illustrate that with an example.

31 A 4 7 1 List the essential memory requirements. Compare the memory requirements of Desktop, Server, and Embedded Systems.
32 A 4 7 2 Draw the memory hierarchy that is used in a typical desktop computer. List size and performance values for each of these levels.
33 A 4 7 3 A program accesses cache 2 million times during its execution. How many of these accesses are hits if the miss rate is 0.025%?
34 A 4 7 4 A program includes 4 million instructions. On the average, each instruction takes 1.5 cycles if the entire program were to be loaded into cache. If the program takes 7 million CPU cycles to execute, how many CPU cycles are spent in stalling?
35 A 4 7 5 Do we need a cache replacement scheme for direct mapping? Justify your choice.
36 A 4 8 1 Compare the two Write Miss schemes.
37 A 4 8 2 Compare Write Back and Write Through objectively.
38 A 1 8 3 What are benchmarks?
39 A 2&3 8 4 In the 6-stage pipeline, if there are three consecutive arithmetic instructions and each arithmetic instruction takes 3 cycles, then the third instruction has to be stalled due to the limited number (2) of ALUs. This is referred to as structural dependence.
40 A 4 8 5 Compare Split cache and Unified cache.

42 Part B

43 B 1 1 1
i. Write program segments to multiply two integer numbers that are in memory (A and B) using different internal storage types (stack, register-register, register-memory, and accumulator). The result of this operation is to be stored in memory. (4)
ii. Draw and describe the 32-bit floating-point representation. (4)
iii. List the steps in converting a given decimal number to a binary 32-bit floating-point representation. (4)
iv. Convert the following number to 32-bit floating-point representation: 255.625. (4)
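A hand conversion for part iv can be cross-checked with the sketch below. It assumes that "32-bit floating-point representation" means IEEE 754 single precision (1 sign bit, 8 exponent bits with bias 127, 23 fraction bits); the helper name `float32_fields` is hypothetical. By hand, 255.625 = 11111111.101 in binary = 1.1111111101 x 2^7, so the biased exponent is 127 + 7 = 134.

```python
import struct

def float32_fields(value: float):
    """Return (sign, biased exponent, 23-bit fraction) of an IEEE 754 single."""
    # Pack as a big-endian 32-bit float, then reinterpret as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(255.625)
print(sign, format(exponent, "08b"), format(fraction, "023b"))
```

The same helper can be applied to the other conversion exercises in this bank (for example 125.5 or 155.5).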
44 B 1 1 2
i. Do we need both auto-decrement and auto-increment instructions? Can we implement one using the other? (2)
ii. Compare register-register and register-memory internal storages. (2)
iii. Describe floating-point arithmetic operations (add, subtract, multiply, and divide) using appropriate examples. (4)
iv. Convert the following decimal number to 32-bit floating-point representation: 20482.875 (6)
v. Describe briefly the two locality properties. (2)
45 B 1 1 3
i. Describe Amdahl’s law on speedup. (4)
ii. Deduce the limit of speedup suggested by Amdahl’s law. (4)
iii. Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 25%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 2%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation. (8)
46 B 1 1 4
iv. Consider the following program, which includes two parts, A and B. Part A must be executed serially; Part B can be broken down into components that can be executed in parallel. If Part A and Part B were to be executed serially in a computer, they take 20 and 70 million CPU cycles respectively.
a. If the above computer is using a 2 GHz clock, what is the total CPU time required to complete the program? (3)
b. If the total number of instructions executed were 30 million, what is the average number of Clock cycles Per Instruction (CPI)? (3)
c. What is the average time taken to complete an instruction? (2)
d. What is the maximum speedup that is possible for this program? (4)
v. Is it possible to have an average CPI of less than 1? Justify your answer. (4)
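The quantities asked for in question 46 (parts a to d) follow from the CPU performance equation and Amdahl's law. The sketch below is one way to lay out the arithmetic; it assumes that Part B can be parallelised without limit, so only the serial cycles remain in the best case. The function name is hypothetical.

```python
def program_metrics(serial_cycles, parallel_cycles, clock_hz, instructions):
    """Return (CPU time in s, CPI, time per instruction in s, max speedup)."""
    total_cycles = serial_cycles + parallel_cycles
    cpu_time = total_cycles / clock_hz
    cpi = total_cycles / instructions
    time_per_instruction = cpu_time / instructions
    # Amdahl's limit: with unlimited parallelism only the serial part remains.
    max_speedup = total_cycles / serial_cycles
    return cpu_time, cpi, time_per_instruction, max_speedup

print(program_metrics(20e6, 70e6, 2e9, 30e6))
```

For the numbers given, this works out to about 45 ms of CPU time, a CPI of 3, about 1.5 ns per instruction, and a maximum speedup of 4.5.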
47 B 2&3 1 5
i. Describe how the following terms are related: Dependence, Hazard, and Stall in the context of Instruction Level Parallelism (ILP). (4)
ii. Describe the four types of data dependence hazards. Provide examples for RAW, WAW, and WAR using simple assembly code. (4)
iii. Consider an un-pipelined processor. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 30%, 30% and 40% respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.25 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline? (8)
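A sketch of the calculation in part iii, under the assumption that the pipelined machine completes one instruction per (lengthened) clock cycle with no stalls. The function and variable names are illustrative only.

```python
def pipeline_speedup(mix, clock_ns, overhead_ns):
    """mix: (frequency, cycles) pairs for the un-pipelined machine."""
    unpipelined_ns = clock_ns * sum(freq * cycles for freq, cycles in mix)
    pipelined_ns = clock_ns + overhead_ns  # one instruction per slower cycle
    return unpipelined_ns / pipelined_ns

mix = [(0.30, 4), (0.30, 4), (0.40, 5)]  # ALU, branch, memory
print(round(pipeline_speedup(mix, 1.0, 0.25), 2))  # 3.52
```

The average un-pipelined instruction takes 4.4 ns, the pipelined clock is 1.25 ns, and the ratio is 3.52.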

48 B 2&3 2 1
i. Describe how the following terms are related: stall, bypass, dynamic scheduling. (4)
ii. Describe the Scoreboard algorithm. (4)
iii. Describe how this algorithm solves WAR, WAW, and RAW hazards. (4)
iv. Identify all the dependences in the following code. (4)
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
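The dependences in part iv can be enumerated mechanically. The sketch below is a naive pairwise check, not a full dependence analysis: it encodes each instruction by hand as (opcode, destination, sources), treats the store as having no register destination, and ignores intervening redefinitions of a register (none occur in this sequence). All names are hypothetical.

```python
# (opcode, destination register or None, source registers), encoded by hand
program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),   # a store writes memory, not a register
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]

def find_dependences(instrs):
    """Return (kind, earlier index, later index, register) tuples."""
    found = []
    for j, (_, dest_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            _, dest_i, srcs_i = instrs[i]
            if dest_i is not None and dest_i in srcs_j:
                found.append(("RAW", i, j, dest_i))   # true data dependence
            if dest_j is not None and dest_j in srcs_i:
                found.append(("WAR", i, j, dest_j))   # antidependence
            if dest_j is not None and dest_j == dest_i:
                found.append(("WAW", i, j, dest_j))   # output dependence
    return found

for kind, i, j, reg in find_dependences(program):
    print(f"{kind}: {program[i][0]} -> {program[j][0]} on {reg}")
```

For this sequence the check reports three true dependences (F0, F6, F8), two antidependences (F8, F6) and one output dependence (F6).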
49 B 2&3 2 2
i. Describe control dependence. (4)
ii. Write a code segment that illustrates control, data, and name dependence. (4)
iii. Consider a pipelined processor with 5 stages. There is only one Functional Unit to execute all the arithmetic and logical operations. The clock that drives this processor runs at 4 GHz. Each stage can be completed in a single clock cycle.
a. What will be the throughput in the best case? (2)
b. Express your answer for (a) in instructions per second. (2)
c. What is the best-case speedup compared to a non-pipelined computer?
d. If the execution stage were to take 2 clock cycles to complete, what will be the best-case throughput? Express your answer in instructions per second. Assume static scheduling. (4)
e. If we add one more Arithmetic & Logic unit and continue to use static scheduling, will there be any difference in the average throughput?
50 B 2&3 2 3
i. Describe the Scoreboard algorithm and describe how it solves WAR, RAW, and WAW hazards. (6)
ii. Describe Tomasulo’s algorithm and describe how it solves WAR, RAW, and WAW hazards. (6)
iii. Eliminate the name dependence in the following code with minimal change. (1)
DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8
iv. Is it possible for a compiler to eliminate the name dependence found in the above code segment? Is there any limitation in doing so? (3)

51 B 2&3 2 4
i. Why can MIPS not be used to compare two computers? (2)
ii. Discuss the role of compilers in achieving instruction-level parallelism. (4)
iii. Describe the role of benchmarking in evaluating computers. (4)
iv. Choose one of the following computers (C1, C2, and C3) for a set of applications with the following profile; justify your answer. The applications within the set are classified into 3 types: P1, P2, and P3. The integer value in the ith row and jth column indicates the average time (in microseconds) taken to complete Pi-type programs on computer Cj. 80% of the programs in the set are of type P1, 10% are of type P2, and the rest are of type P3. (6)
Profile/Computer   C1    C2    C3
P1                 5     10    20
P2                 100   50    20
P3                 10    10    20
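One possible basis for the choice in part iv is the weighted arithmetic mean of the execution times, sketched below. Other criteria (for example worst-case time) could lead to a different choice; the dictionary and function names are made up for this example.

```python
times_us = {  # rows: program type, columns: computer (from the table above)
    "P1": {"C1": 5,   "C2": 10, "C3": 20},
    "P2": {"C1": 100, "C2": 50, "C3": 20},
    "P3": {"C1": 10,  "C2": 10, "C3": 20},
}
weights_percent = {"P1": 80, "P2": 10, "P3": 10}  # "the rest" = 10% for P3

def weighted_time(computer):
    """Weighted arithmetic mean execution time in microseconds."""
    total = sum(weights_percent[p] * times_us[p][computer] for p in times_us)
    return total / 100

for c in ("C1", "C2", "C3"):
    print(c, weighted_time(c))  # C1 15.0, C2 14.0, C3 20.0
```

Under this measure C2 has the lowest weighted mean time.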
52 B 2&3 2 5
i. Describe the Scoreboard algorithm and compare it with Tomasulo’s algorithm. (6)
ii. Why does dynamic scheduling not make sense with a single functional unit? (2)
iii. There is a simple pipeline with n stages. Assume all stages occupy an equal number of clock cycles and there is only one ALU.
a. What is the maximum possible speed-up? (2)
b. If the average stall per instruction is 0.2 CPU cycles, what is the maximum possible speed-up? (2)
c. The effect of having additional Functional Units is to decrease the average stall by 20%. What is the speed-up possible? (4)
53 B 1&3 3 1
i. Compare register-register and register-memory internal storages. (2)
ii. Write program segments to add two integer numbers (in memory locations A and B) using different internal storage types (stack, register-register, register-memory, and accumulator). The result should be stored in A. (4)
iii. What are the conditions under which two consecutive multiplication commands could be scheduled without stalling? (2)
iv. Describe the 32-bit floating-point representation. (3)
v. Describe the steps in converting a given decimal number to a binary 32-bit floating-point representation. (2)
vi. Convert the following number to 32-bit floating-point representation: 155.5 (3)
54 B 2&3 3 2
i. Describe Amdahl’s law. (2)
ii. Deduce the limit of speedup suggested by Amdahl’s law. (2)
iii. Suppose we have made the following measurements:
Frequency of FP operations (other than FPSQR) = 20%
Average CPI of FP operations = 4.0
Average CPI of other instructions = 1.33
Frequency of FPSQR = 5%
CPI of FPSQR = 20
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the CPU performance equation. (4)
i. Describe how the following terms are related: Dependence, Hazard, and Stall in the context of Instruction Level Parallelism (ILP). (2)
ii. Describe the four types of hazards that arise when you try to exploit ILP. Provide examples for RAW, WAW, and WAR using simple assembly code. (2)
iii. Consider an un-pipelined processor. Assume that it has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 40% and 20% respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.25 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from a pipeline? (4)
55 B 1&3 3 3
i. What is the use of cache coherence protocols? What are the two classes of cache coherence protocols? (4)
ii. What is a snooping protocol? Justify the name snooping protocol.
iii. What is a directory-based coherence protocol? (4)
iv. What are the inherent features of bus-based systems that are missing in interconnect-based systems? (4)
56 B 2&3 3 4

57 B 1&3 3 5
i. Explain the program execution time in terms of Miss Rate, Miss Penalty, Memory Accesses Per Instruction, CPU Cycles Per Instruction, and Number of Instructions. (4)
ii. Compute the program execution time where the number of instructions is 12500, average CPU cycles per instruction is 2, average memory accesses per instruction is 1.5, miss rate is 2%, and miss penalty is 200 CC (CPU cycles). Compare the execution time for a miss penalty of 150 CC. (4)
iii. Describe the Least Recently Used (LRU) and Least Frequently Used (LFU) cache replacement algorithms. (4)
iv. Describe MTBF, MTTR, & MTTDL. (4)
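The relationship asked for in parts i and ii can be written as CPU cycles = IC x (CPI + memory accesses per instruction x miss rate x miss penalty). The sketch below evaluates it in cycles only, since the question gives no clock cycle time; the function name is an assumption for illustration.

```python
def execution_cycles(instructions, cpi, accesses_per_instr, miss_rate, miss_penalty):
    """CPU cycles = IC * (CPI + accesses/instr * miss rate * miss penalty)."""
    stall_cpi = accesses_per_instr * miss_rate * miss_penalty
    return instructions * (cpi + stall_cpi)

for penalty in (200, 150):
    print(penalty, execution_cycles(12_500, 2, 1.5, 0.02, penalty))
```

With the given values the totals are about 100,000 cycles for a 200-cycle penalty and about 81,250 cycles for a 150-cycle penalty. Multiplying by the clock cycle time would give the execution time.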
58 B 4 4 1
i. Describe the differences between SRAM and DRAM in the context of main memory and cache memory. (4)
ii. Describe and compare unified and split cache schemes. What is the appropriate cache type (split or unified) to use for Level 1 and Level 2 cache implementations? Justify your choice. (4)
iii. Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 45 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits? (4)
iv. What is the processor-memory gap? Explain Gordon Moore’s law in this context. (4)
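For part iii, a hedged sketch of the usual approach: count one instruction fetch per instruction plus the data accesses, and assume the same miss rate applies to both kinds of access. The function name is hypothetical.

```python
def speedup_with_perfect_cache(base_cpi, data_access_fraction, miss_rate, miss_penalty):
    """Ratio of CPI with cache misses to CPI when every access hits."""
    # Every instruction is fetched (1 access) and some also access data.
    accesses_per_instr = 1 + data_access_fraction
    real_cpi = base_cpi + accesses_per_instr * miss_rate * miss_penalty
    return real_cpi / base_cpi

print(round(speedup_with_perfect_cache(1.0, 0.5, 0.02, 45), 2))  # 2.35
```

Under these assumptions the all-hits machine would be about 2.35 times faster.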
59 B 2&3 4 2
i. Describe the technical and logistical problems of the VLIW model. (4)
ii. Describe loop-carried dependence with a dependence distance of n with an example. (4)
iii. Consider the following loop. What are the dependences between S1 and S2? Is this loop parallel? If not, make it parallel. (4)
for (i = 1; i <= 100; i = i + 1)
{
    A[i] = A[i] + B[i];       /* S1 */
    B[i + 1] = C[i] + D[i];   /* S2 */
}
iv. The following loop has multiple types of dependences. Find all the true dependences and anti-dependences, and eliminate the output dependences and anti-dependences by renaming. (4)
for (i = 1; i <= 100; i = i + 1)
{
    y[i] = x[i] / c;   /* S1 */
    x[i] = x[i] + c;   /* S2 */
    z[i] = y[i] + c;   /* S3 */
    y[i] = c - y[i];   /* S4 */
}
60 B 4 4 3
i. Describe how having a multiple-level cache reduces the miss penalty. (4)
ii. Describe Critical Word First and Early Restart. Explain how these two techniques reduce the miss penalty. (4)
iii. Describe the effect of having a larger line size on cache misses. (4)
iv. Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of eight memory operations (the address is in square brackets). What are the number of hits and misses using no-write allocate versus write allocate? (4)
WriteMem[100];
WriteMem[100];
ReadMem[200];
WriteMem[200];
WriteMem[100];
WriteMem[100];
ReadMem[300];
WriteMem[200];
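The two write-miss policies in part iv can be compared with a small simulation. The sketch below assumes, as the question states, a fully associative cache that starts empty and never evicts, and it treats each address as its own block. The names are illustrative.

```python
ops = [("W", 100), ("W", 100), ("R", 200), ("W", 200),
       ("W", 100), ("W", 100), ("R", 300), ("W", 200)]

def count_hits_misses(ops, write_allocate):
    """Return (hits, misses) for the operation sequence."""
    cached = set()  # large fully associative cache: nothing is ever evicted
    hits = misses = 0
    for kind, addr in ops:
        if addr in cached:
            hits += 1
            continue
        misses += 1
        # Reads always allocate; writes allocate only under write allocate.
        if kind == "R" or write_allocate:
            cached.add(addr)
    return hits, misses

print(count_hits_misses(ops, write_allocate=False))  # (2, 6)
print(count_hits_misses(ops, write_allocate=True))   # (5, 3)
```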
61 B 4 4 4
i. Describe the principle of locality; discuss how spatial and temporal localities are made use of in caching. (4)
ii. Describe loop interchange with an example. (4)
iii. Describe array merging with an example. (4)
iv. Describe loop fusion with an example. (4)
62 B 1 4 5
i. Write program segments to multiply two integer numbers that are in memory (A and B) using different internal storage types (stack, register-register, register-memory, and accumulator). The result of this operation is to be stored in memory (A). (4)
ii. Draw and describe the 32-bit floating-point representation. (4)
iii. List the steps in converting a given decimal number to a binary 32-bit floating-point representation. (4)
iv. Convert the following number to 32-bit floating-point representation: 125.5. (4)
63 B 4 5 1
i. Given the data below, what is the impact of second-level cache associativity on its miss penalty? (4)
Hit time L2 for direct mapped = 10 clock cycles
Two-way set associativity increases hit time by 0.1 clock cycles to 10.1 clock cycles.
Local miss rate L2 for direct mapped = 30%
Local miss rate L2 for two-way set associative = 20%
Miss penalty L2 = 200 clock cycles
ii. Describe the effect of a larger cache size on cache misses. (4)
iii. Describe direct mapping, set-associative mapping, and fully associative mapping. Use one diagram that illustrates all three. (4)
iv. A program includes 4 million instructions. On the average, each instruction takes 2.0 cycles (1 GHz) if the entire program were to be loaded into cache. If the program takes 9 million CPU cycles to execute, how much CPU time is spent in stalling? If the CPU stalls 4000 nanoseconds for each cache miss, compute the number of misses. (4)
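Part i can be approached with the two-level cache relation: the miss penalty seen by the first-level cache equals the L2 hit time plus the L2 local miss rate times the L2 miss penalty. The sketch below simply evaluates that relation for both organisations; the function name is hypothetical.

```python
def l1_miss_penalty(l2_hit_time, l2_local_miss_rate, l2_miss_penalty):
    """Miss penalty seen by L1 = L2 hit time + L2 miss rate * L2 miss penalty."""
    return l2_hit_time + l2_local_miss_rate * l2_miss_penalty

print(l1_miss_penalty(10, 0.30, 200))    # direct mapped
print(l1_miss_penalty(10.1, 0.20, 200))  # two-way set associative
```

With the data given this is about 70 clock cycles for the direct-mapped L2 and about 50.1 clock cycles for the two-way set-associative L2.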
64 B 2&3 5 2
i. Unroll and schedule the following piece of code. This code adds a constant value to each element of a floating-point vector. The required latency information is listed in the following table. Assume there are 4 FP adders in your CPU. (10)
L: L.D   F0, 0(R1)      1
   stall                2
   ADD.D F4, F0, F2     3
   stall                4
   stall                5
   S.D   F4, 0(R1)      6
   ADDI  R1, R1, #-8    7
   stall                8   (note latency of 1 for ALU, BNE)
   BNE   R1, R2, L      9
   stall                10
ii. Calculate the average number of instructions to process one addition in your code. (2)
iii. Why does dynamic scheduling not make sense with a single functional unit? (2)
iv. There is a simple pipeline with n stages. Assume all stages occupy an equal number of clock cycles and there is only one ALU. What is the maximum possible speed-up? (2)
65 B 4 5 3
i. Represent the following cache system specification in a diagram: (5)
Word size: 4 bytes
Number of words per memory block: 8
Number of blocks: 512
Size of a line: 8 words
Number of lines: 16
Number of sets: 4
ii. How many bits are required to address a word in memory?
iii. How many of these bits are used to address a block in memory?
iv. How many bits are required to address a line?
v. Draw a diagram that illustrates your address word (size & bit allocation).
vi. If you are using direct mapping, where in cache is the last memory block mapped?
vii. Where in cache is the first memory block mapped?
viii. Where in cache is the 24th memory block mapped?
ix. How many bits are required to address the sets in cache?
x. How many comparisons are made in direct mapping?
xi. How many comparisons are made in fully associative mapping?
xii. How many comparisons are made in set-associative mapping?

66 B 4 5 4
i. Compare Write-back and Write-through algorithms in terms of their complexity and average cache access time. Would you recommend write-through for a single-processor system? Give reason(s) for your answer. (4)
ii. Describe Direct Mapping and Fully Associative Mapping using Set Associative mapping. (4)
iii. Compute the number of bits required for coding tag, line, and word offset for direct mapping for the following hypothetical cache-memory system. (4)
The size of the main memory is 512 words
Size of each block is 4 words
Size of the cache is 4 lines
Each line size is 4 words
In set-associative mapping each set contains 2 blocks
There are 2 sets in cache
iv. Where are the memory blocks 0, 8, 23, 52 mapped (in cache) in the above mapping? (4)
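Parts iii and iv can be checked with the sketch below. It assumes a word-addressable memory, power-of-two sizes, and block numbering that starts at 0; the function names are made up for this example.

```python
def direct_mapped_layout(memory_words, block_words, cache_lines):
    """Return (tag_bits, line_bits, word_bits); sizes must be powers of two."""
    address_bits = (memory_words - 1).bit_length()
    word_bits = (block_words - 1).bit_length()
    line_bits = (cache_lines - 1).bit_length()
    return address_bits - line_bits - word_bits, line_bits, word_bits

def direct_mapped_line(block_number, cache_lines):
    """Direct mapping places a block in line (block number mod number of lines)."""
    return block_number % cache_lines

print(direct_mapped_layout(512, 4, 4))                     # (5, 2, 2)
print([direct_mapped_line(b, 4) for b in (0, 8, 23, 52)])  # [0, 0, 3, 0]
```

The 9-bit address splits into a 5-bit tag, a 2-bit line field and a 2-bit word offset; blocks 0, 8 and 52 all compete for line 0, while block 23 maps to line 3.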
67 B 5 5 5
i. What is a directory-based coherence protocol? (8)
ii. What constraints restrict coherence protocols in an interconnect-based system? (8)
68 B 2&3 6 1
i. Describe software pipelining. (4)
ii. Suppose we have a VLIW that could issue two memory references, two FP operations, and one integer operation or branch in every clock cycle. Show an unrolled version of the loop x[i]=x[i]+s for such a processor. Unroll as many times as necessary to eliminate any stalls. Ignore the branch delay slot. (4)
iii. Describe software pipelining with an illustration. (4)
iv. Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2: (4)
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop
You can omit the start-up and clean-up code.
69 B 5 6 2
i. What is memory consistency? How is memory consistency guaranteed in a multiprocessor system? (8)
ii. What is multithreading? Describe two types of multithreading. (8)
70 B 5 6 3
i. What is the mechanism used in a multiprocessor system to implement atomic operations? (8)
ii. What are spin-locks? Give an example. (8)
71 B 4 6 4
i. Explain the program execution time in terms of Miss Rate, Miss Penalty, Memory Accesses Per Instruction, CPU Cycles Per Instruction, and Number of Instructions. (4)
ii. Compute the program execution time where the number of instructions is 12000, average CPU cycles per instruction is 2, average memory accesses per instruction is 1.5, miss rate is 2%, and miss penalty is 150 CC (CPU cycles). Compare the execution time for a miss penalty of 100 CC. (4)
iii. Describe the Least Recently Used (LRU) and Least Frequently Used (LFU) cache replacement algorithms. (4)
iv. Describe MTBF, MTTR, MTTDL, and MTTDI. (4)
72 B 5 6 5
i. What is multithreading? Describe the two types of multithreading. (4)
ii. What are spin-locks? Give an example. (4)
iii. Describe the use of LL and SC in implementing atomic operations with an example. How is SC implemented? Specifically, describe how SC decides to change the memory location only when the combination of LL and SC is atomic. (4)
iv. What mechanism is used in a single-processor system to implement atomic operations? Can we use this in a multiprocessor system? (4)
73 B 5 7 1
i. Describe Write Invalidate with an example. (4)
ii. Describe Write Distribute with an example. (4)
iii. Compare the above two snooping implementations. (4)
iv. Describe how valid, shared, & dirty bits are used to improve the performance of a cache coherence system. (4)

74 B 5 7 2
i. How are interrupts used to create software synchronization primitives? Why can’t this be used in a multiprocessor environment? (8)
ii. What mechanism is used in a single-processor system to implement atomic operations? Can we use this in a multiprocessor system? (8)
75 B 4 7 3
i. Describe the two empirical rules that are followed in caching. (4)
ii. Describe Direct Mapping and Fully Associative Mapping using Set Associative mapping. (4)
iii. Compute the number of bits required for coding tag, line, and word offset for direct mapping for the following hypothetical cache-memory system. (4)
The size of the main memory is 512 words
Size of each block is 4 words
Size of the cache is 4 lines
Each line size is 4 words
In set-associative mapping each set contains 2 blocks
There are 2 sets in cache
iv. Where are the memory blocks 0, 8, 23, 52 mapped (in cache) in the above mapping? (4)
76 B 4 7 4
i. Describe and compare unified and split cache schemes. Which one of these caches is used for Level 1 and Level 2 caches? Justify your choice. (4)
ii. Assume we have a computer where Clock cycles Per Instruction (CPI) is 1.0 when all memory accesses are cache hits. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits? (4)
iii. What is the processor-memory gap? Explain Gordon Moore’s law in this context. (4)
iv. Calculate the average cycles per instruction for the following two scenarios of memory systems:
                                  Scenario 1   Scenario 2
Block size                        1 word       4 words
Memory bus size                   1 word       2 words
Miss rate                         3%           2.5%
Memory accesses per instruction   1.2          1.2
Cache miss penalty                64 CC        128 CC
Avg cycles per instruction        2            2
77 B 2&3 7 5
i. Why can MIPS not be used to compare two computers? (2)
ii. Discuss the role of compilers in achieving instruction-level parallelism. (4)
iii. Describe branch prediction using branch correlation. Use an appropriate example. (4)
iv. Choose one of the following computers (C1, C2, and C3) for a set of applications with the following profile; justify your answer.
78 B 2&3 8 1
i. Unroll and schedule the following piece of code. This code adds a constant value to each element of a floating-point vector. The required latency information is listed in the following table. Assume there are 4 FP adders in your CPU. (10)
ii. Calculate the average number of instructions to process one addition in your code. (2)
iii. Why does dynamic scheduling not make sense with a single functional unit? (2)
iv. There is a simple pipeline with n stages. Assume all stages occupy an equal number of clock cycles and there is only one ALU. What is the maximum possible speed-up? (2)
79 B 4 8 2
i. What are the essential memory requirements? Compare the memory requirements of Desktop, Server, and Embedded Systems. (8)
ii. What is the processor-memory gap? Explain Gordon Moore’s law in this context. (8)
80 B 4 8 3
i. What is a memory hierarchy and why do we need a memory hierarchy? (8)
ii. What is cache memory? How does cache memory work? (8)
81 B 4 8 4
i. In general, how does cache memory operate? Explain the terms cache hit, cache miss, spatial locality, and temporal locality. (8)
ii. Explain the program execution time in terms of Miss Rate, Miss Penalty, Memory Accesses Per Instruction, CPU Cycles Per Instruction, and Number of Instructions. (8)
82 B 4 8 5
i. Compute the program execution time where the number of instructions is 12000, average CPU cycles per instruction is 2, average memory accesses per instruction is 1.5, miss rate is 2%, and miss penalty is 150 CC (CPU cycles). Compare the execution time for a miss penalty of 100 CC. (8)
ii. Why do we need cache-mapping functions? Describe the working of the direct-mapping function. (8)

83 B 4 9 1
i. Compare Write-back and Write-through algorithms in terms of their complexity and average cache access time. Would you recommend write-through for a single-processor system? Give reason(s) for your answer. (8)
ii. Describe Direct Mapping and Fully Associative Mapping using Set Associative mapping. (8)
84 B 4 9 2
i. Compare mapping functions in terms of their hit ratio and search speed. (8)
ii. Describe loop-merging arrays with appropriate examples. (8)
85 B 4 9 3
i. Compute tag values, line values, and word offsets for direct mapping for the following hypothetical cache-memory system. (8)
ii. Why do we need cache replacement algorithms? (8)
86 B 4 9 4
i. Describe the Least Recently Used (LRU) cache replacement algorithm. (8)
ii. Describe the Least Frequently Used (LFU) cache replacement algorithm. Describe MTBF, MTTR, MTTDL, and MTTDI. (8)
87 B 4 9 5
i. Describe and compare unified and split cache schemes. (8)
ii. Describe the differences between SRAM and DRAM in the context of main memory and cache memory. (8)
88 B 5 10 1
i. Why do we need multiprocessor architectures? Describe Flynn’s classification of multiprocessor architectures. (8)
ii. What are the two classes of MIMD architectures? Describe them briefly. (8)
89 B 5 10 2
i. Explain the terms Polling, Interrupt, Synchronous, and Asynchronous in the context of message passing. (8)
ii. Describe the 3 important metrics of communication mechanisms. (8)
90 B 5 10 3
i. Describe the advantages of the different communication mechanisms: Shared Memory and Message Passing. (8)
ii. To achieve a speedup of 80 with 100 processors, what should be the fraction of the program that can be executed in parallel or enhanced mode? (8)
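Part ii is an application of Amdahl's law. The sketch below rearranges speedup = 1 / ((1 - f) + f / n) to solve for the parallel fraction f, assuming the parallel part scales perfectly across the n processors. The function name is hypothetical.

```python
def required_parallel_fraction(target_speedup, processors):
    """Solve speedup = 1 / ((1 - f) + f / n) for the parallel fraction f."""
    return (1 - 1 / target_speedup) / (1 - 1 / processors)

f = required_parallel_fraction(80, 100)
print(f"{f:.4%}")  # 99.7475%
```

In other words, only about 0.25% of the original computation may remain sequential.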
91 B 5 10 4
i. Suppose we have an application running on a 32-processor multiprocessor, which takes 400 ns to handle a reference to a remote memory. For this application, assume that all the references except those involving communication hit in the local memory hierarchy, which is slightly optimistic. Processors are stalled on a remote request, and the processor clock rate is 1 GHz. If the base Instructions Per Cycle (IPC) (assuming that all references hit in the cache) is 2, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference? (8)
ii. Explain the cache coherence problem using the following diagram where there are 2 CPUs. What kind of cache-write policy is used? (8)
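A sketch of the calculation for part i of question 91: the effective CPI is the base CPI plus the remote-request rate times the remote-request cost in cycles. It assumes the processor is fully stalled for the whole remote access; the function name is illustrative.

```python
def communication_slowdown(base_ipc, remote_fraction, remote_ns, clock_ghz):
    """How many times faster the machine is with no remote communication."""
    base_cpi = 1 / base_ipc
    remote_cycles = remote_ns * clock_ghz  # 400 ns at 1 GHz = 400 cycles
    effective_cpi = base_cpi + remote_fraction * remote_cycles
    return effective_cpi / base_cpi

print(round(communication_slowdown(2, 0.002, 400, 1.0), 2))  # 2.6
```

The effective CPI rises from 0.5 to 1.3, so the multiprocessor with no communication is 2.6 times faster.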
92 B 5 10 5
i. What are the different properties (conditions or requirements) that cache coherency must satisfy? Compare coherency and consistency. (8)
ii. What are the two features offered by cache memory in SMP systems? Why do we need these features? (8)
93 B 5 11 1
i. What are the two different ways in which snooping protocols maintain the cache coherence properties? (8)
ii. Compare Write Invalidate and Write Broadcast protocols. (8)
94 B 5 11 2
i. Describe how valid, shared, & dirty bits are used in cache coherence. (8)
ii. What is a coherence miss? What are the two types of coherence misses? What is the effect of block size on false misses? (8)