Answers for Advanced Computer Architecture 1st Quiz 2006/2007

1. The following code segment, consisting of six instructions, needs to be executed 128
times for the evaluation of the vector arithmetic expression D(I) = A(I) + B(I) x C(I) for
0 ≤ I ≤ 127.

Load R1, B(I)      /R1 ← Memory(α + I)/
Load R2, C(I)      /R2 ← Memory(β + I)/
Multiply R1, R2    /R1 ← (R1) x (R2)/
Load R3, A(I)      /R3 ← Memory(γ + I)/
Add R3, R1         /R3 ← (R3) + (R1)/
Store D(I), R3     /Memory(θ + I) ← (R3)/

where R1, R2 and R3 are CPU registers, (R1) is the content of R1, and α, β, γ and θ
are the starting memory addresses of arrays B(I), C(I), A(I) and D(I), respectively.

Assume:
a. 4 clock cycles for each Load or Store.
b. 2 clock cycles for Add.
c. 8 clock cycles for Multiply.
These timings apply on either a uniprocessor or a single PE in an SIMD machine.

a. Calculate the total number of CPU cycles needed to execute the above code
segment repeatedly 128 times on an SISD uniprocessor computer sequentially,
ignoring all other time delays.
b. Consider the use of an SIMD computer with 32 PEs to execute the above vector
operations in six synchronized vector instructions over 32-component vector data, with
both machines driven by the same-speed clock. Calculate the total execution time on the SIMD
machine, ignoring instruction broadcast and other delays.
c. What is the speedup gain of the SIMD computer over the SISD computer?

Answer

a. The total number of CPU cycles if executed on an SISD machine is 3,328
cycles.

Instruction   Clock cycles   Times per loop   Loops   Total cycles
Load          4              3                128     1,536
Multiply      8              1                128     1,024
Add           2              1                128     256
Store         4              1                128     512
Total clock cycles                                    3,328

b. With 32 PEs, each vector instruction handles 32 components at once, so the
six-instruction sequence (26 cycles) runs 128/32 = 4 times. The total number of CPU
cycles if executed on an SIMD machine with 32 PEs is 3,328/32 = 104 clock cycles.

c. The speedup gain is the total number of CPU cycles on the SISD machine divided by
the total number of CPU cycles on the SIMD machine = 3,328/104 = 32 times.
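The cycle counts for problem 1 can be checked with a short script. This is only a sketch under the stated assumptions (per-instruction costs as given in the problem, no broadcast or other delays); the names `CYCLES`, `LOOP_BODY`, `sisd_cycles` and `simd_cycles` are illustrative and do not come from the quiz.

```python
# Cycle cost of each instruction type, taken from the problem's assumptions.
CYCLES = {"Load": 4, "Store": 4, "Add": 2, "Multiply": 8}

# The six-instruction loop body, in program order.
LOOP_BODY = ["Load", "Load", "Multiply", "Load", "Add", "Store"]


def cycles_per_iteration(body=LOOP_BODY, costs=CYCLES):
    """Cycles for one pass through the loop body."""
    return sum(costs[op] for op in body)


def sisd_cycles(n_elements):
    """One loop iteration per vector element, executed sequentially."""
    return n_elements * cycles_per_iteration()


def simd_cycles(n_elements, n_pes):
    """Each pass handles n_pes elements at once (ceiling division for leftovers)."""
    passes = -(-n_elements // n_pes)
    return passes * cycles_per_iteration()


sisd = sisd_cycles(128)
simd = simd_cycles(128, 32)
speedup = sisd / simd
print(sisd, simd, speedup)
```

The speedup is computed as SISD cycles divided by SIMD cycles, so a value above 1 means the SIMD machine is faster.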

Arwin@23206008

2. Consider the execution of the following code segment consisting of seven statements.
Use Bernstein's conditions to detect the maximum parallelism embedded in this code.
Justify the portions that can be executed in parallel and the remaining portions that must be
executed sequentially.

S1: A = B + C
S2: C = D + E
S3: F = G + E
S4: C = A + F
S5: M = G + C
S6: A = L + C
S7: A = E + A

Answer

Let Ii and Oi denote the input and output sets of statement Si. The following pairs
cannot be executed in parallel (∦) because at least one of Bernstein's conditions fails:

o S1 ∦ S2 because I1 ∩ O2 ≠ ∅
o S1 ∦ S4 because I1 ∩ O4 ≠ ∅
o S1 ∦ S6 because O1 ∩ O6 ≠ ∅
o S1 ∦ S7 because O1 ∩ O7 ≠ ∅
o S2 ∦ S4 because O2 ∩ O4 ≠ ∅
o S2 ∦ S5 because I5 ∩ O2 ≠ ∅
o S2 ∦ S6 because I6 ∩ O2 ≠ ∅
o S4 ∦ S3 because I4 ∩ O3 ≠ ∅
o S4 ∦ S5 because I5 ∩ O4 ≠ ∅
o S4 ∦ S6 because I4 ∩ O6 ≠ ∅
o S4 ∦ S7 because I4 ∩ O7 ≠ ∅
o S6 ∦ S7 because O6 ∩ O7 ≠ ∅ and I7 ∩ O6 ≠ ∅

The statements that can therefore be executed in parallel are:

o S1||S3 because I1 ∩ O3 = ∅; I3 ∩ O1 = ∅; O1 ∩ O3 = ∅; likewise S1||S5
o S2||S3 because I2 ∩ O3 = ∅; I3 ∩ O2 = ∅; O2 ∩ O3 = ∅; likewise S2||S7
o S3||S5, S3||S6 and S3||S7
o S5||S6 and S5||S7

Collectively, the statements that can be executed in parallel are S1||S3||S5
and S2||S7, whereas S4 and S6 cannot join either group and therefore must be
executed sequentially.

[Figure: execution schedule, Cycle 1 to Cycle 4]

This parallel schedule saves 3 cycles (from 7 cycles to 4 cycles).
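The pairwise checks in problem 2 can be reproduced mechanically. The sketch below assumes the input and output sets are read directly off the seven statements; `STATEMENTS` and `bernstein_parallel` are illustrative names, not part of the quiz.

```python
from itertools import combinations

# (input set, output set) for each statement, read off the code segment.
STATEMENTS = {
    "S1": ({"B", "C"}, {"A"}),  # A = B + C
    "S2": ({"D", "E"}, {"C"}),  # C = D + E
    "S3": ({"G", "E"}, {"F"}),  # F = G + E
    "S4": ({"A", "F"}, {"C"}),  # C = A + F
    "S5": ({"G", "C"}, {"M"}),  # M = G + C
    "S6": ({"L", "C"}, {"A"}),  # A = L + C
    "S7": ({"E", "A"}, {"A"}),  # A = E + A
}


def bernstein_parallel(a, b, stmts=STATEMENTS):
    """True when all three Bernstein conditions hold for statements a and b."""
    in_a, out_a = stmts[a]
    in_b, out_b = stmts[b]
    return not (in_a & out_b) and not (in_b & out_a) and not (out_a & out_b)


# Every unordered pair that passes the three conditions.
parallel_pairs = [
    (a, b) for a, b in combinations(STATEMENTS, 2) if bernstein_parallel(a, b)
]
for a, b in parallel_pairs:
    print(f"{a}||{b}")
```

This tests each pair in isolation only. Building an actual schedule also requires respecting program order between dependent statements, which the sketch does not attempt.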


3. Consider a 32-node hypercube network. Based on the E-cube routing algorithm,
determine the optimal routing paths to multicast a message from node (10110) to each of
the nodes (01101), (01110), (10001) and (11110) (all intermediate nodes must be
identified on the routing paths).

Answer

To solve this problem, one can refer to Chapter 2 page 79, which uses bit-position
routing, and Chapter 7 pages 384-385, which use an algorithmic
approach. The two methods give the same answer.

The general formula is N = 2^n, where N is the number of processors and n is the
dimension of the cube. In this case N = 32 and n = 5. We can represent each node
by a 5-bit sequence written as C4 C3 C2 C1 C0. The source node is:

 1  0  1  1  0
C4 C3 C2 C1 C0

a. Node (01101).

First → routing via least significant bit C0 → 10111
Second → followed by middle bit C1 → 10101
Third → followed by middle bit C3 → 11101
Fourth → followed by most significant bit C4 → 01101

The routing is 10110 → 10111 → 10101 → 11101 → 01101

Using the algorithmic approach

The source node s is 10110 and the destination node d is 01101.

1) Calculate r_i = s_{i-1} ⊕ d_{i-1} for i = 1, ..., 5, which gives r = r5 r4 r3 r2 r1 = 11011:

1 0 1 1 0
0 1 1 0 1   ⊕   (note the bits that differ)
1 1 0 1 1

2) Start from dimension i = 1 with v = s.

3) The steps (if r_i = 0, skip that dimension):

r1 = 1 → v = v ⊕ 2^0 → 10111
r2 = 1 → v = v ⊕ 2^1 → 10101
r3 = 0 → skip
r4 = 1 → v = v ⊕ 2^3 → 11101
r5 = 1 → v = v ⊕ 2^4 → 01101

The routing is 10110 → 10111 → 10101 → 11101 → 01101

b. Node (01110).

First → routing via middle bit C3 → 11110
Second → followed by most significant bit C4 → 01110

The routing is 10110 → 11110 → 01110

c. Node (10001).

First → routing via least significant bit C0 → 10111
Second → followed by middle bit C1 → 10101
Third → followed by middle bit C2 → 10001

The routing is 10110 → 10111 → 10101 → 10001

d. Node (11110). Routing via middle bit C3 → 11110, so the routing is
10110 → 11110
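The algorithmic approach for problem 3 can be sketched in a few lines. The code assumes nodes are 5-bit integers and that dimensions are corrected from the least significant bit upward, as in the steps above; `ecube_route` is an illustrative name, not taken from the textbook.

```python
def ecube_route(src, dst, n_bits=5):
    """Return the nodes visited from src to dst as n_bits-wide bit strings."""
    r = src ^ dst            # bits set where source and destination differ
    v = src                  # current node
    path = [v]
    for i in range(n_bits):  # dimension i + 1 corresponds to bit C_i
        if r & (1 << i):     # skip dimensions where the addresses agree
            v ^= 1 << i      # move across dimension i + 1
            path.append(v)
    return [format(node, f"0{n_bits}b") for node in path]


source = 0b10110
for dest in (0b01101, 0b01110, 0b10001, 0b11110):
    print(" -> ".join(ecube_route(source, dest)))
```

Each printed line lists the source, every intermediate node and the destination, so the hop count is one less than the number of nodes shown.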


4. Consider a vector computer which can operate in one of two execution modes at a
time: one is the vector mode with an execution rate of Rv = 10 Mflops, and the other is
the scalar mode with an execution rate of Rs = 2 Mflops. Let α be the percentage of code
that is vectorizable in a typical program mix for this computer.

a. Derive an expression for the average execution rate Ra for this computer.
b. Plot Ra as a function of α in the range (0,1).
c. Determine the vectorization ratio α needed in order to achieve an average
execution rate of Ra = 7.5 Mflops.
d. Suppose Rs = 1 Mflops and α = 0.7, what value of Rv is needed to achieve
Ra = 2 Mflops?

Answer

a. Deriving Ra

Refer to equation (8.11) in Hwang's book, Chapter 8 page 413. This
equation calculates the relative performance P, that is, the speedup
of vector processing over scalar processing:

\[ P = \frac{r}{(1-\alpha)\,r + \alpha}, \qquad r = \frac{R_v}{R_s} \]

Note: equation (8.11) on its own is not consistent with the question, because P is a
dimensionless speedup rather than a rate. Since P is measured relative to scalar mode,
multiplying it by the scalar rate gives the average execution rate:

\[ R_a = R_s P = \frac{R_v}{(1-\alpha)\frac{R_v}{R_s} + \alpha} = \frac{10}{(1-\alpha)\,5 + \alpha} = \frac{10}{5 - 4\alpha} \ \text{Mflops} \]

b. Ra as a function of α:

α      Ra (Mflops)
0      2.00
0.1    2.17
0.2    2.38
0.3    2.63
0.4    2.94
0.5    3.33
0.6    3.85
0.7    4.55
0.8    5.56
0.9    7.14
1      10.00

[Figure: Ra vs alpha; x-axis alpha from 0 to 1, y-axis Ra]

c. Setting Ra = 7.5 Mflops:

\[ R_a = \frac{10}{5 - 4\alpha} \;\rightarrow\; 7.5 = \frac{10}{5 - 4\alpha} \]

so that

37.5 − 30α = 10
27.5 = 30α
α = 27.5/30 = 0.917

d. Refer to the equation (8.11) above.

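With Rs = 1 Mflops, r = Rv/Rs = Rv, so the expression for Ra reduces to

\[ R_a = \frac{R_v}{(1-\alpha)\,R_v + \alpha} \]

Substituting Ra = 2 and α = 0.7:

\[ 2 = \frac{R_v}{(1-0.7)\,R_v + 0.7} = \frac{R_v}{0.3\,R_v + 0.7} \]

0.6 Rv + 1.4 = Rv
1.4 = 0.4 Rv
Rv = 1.4/0.4 = 3.5 Mflops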
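The calculations in problem 4 can be cross-checked numerically. This sketch assumes the average rate is the weighted harmonic mean of the two mode rates, Ra = 1 / (α/Rv + (1 − α)/Rs), which is algebraically the same as Rs times the speedup r / ((1 − α) r + α) of equation (8.11). The function names are illustrative only.

```python
def average_rate(alpha, rv, rs):
    """Average rate in Mflops, assuming a weighted harmonic mean of rv and rs."""
    return 1.0 / (alpha / rv + (1.0 - alpha) / rs)


def alpha_for_rate(ra, rv, rs):
    """Vectorization ratio that yields average rate ra.

    From 1/ra = alpha/rv + (1 - alpha)/rs, solved for alpha.
    """
    return (1.0 / rs - 1.0 / ra) / (1.0 / rs - 1.0 / rv)


def rv_for_rate(ra, alpha, rs):
    """Vector rate needed for average rate ra at a given alpha and rs.

    From alpha/rv = 1/ra - (1 - alpha)/rs, solved for rv.
    """
    return alpha / (1.0 / ra - (1.0 - alpha) / rs)


# Part b: tabulate Ra for alpha = 0.0, 0.1, ..., 1.0 with Rv = 10 and Rs = 2.
for k in range(11):
    alpha = k / 10
    print(f"{alpha:.1f}  {average_rate(alpha, 10, 2):.2f}")

# Parts c and d.
print(round(alpha_for_rate(7.5, 10, 2), 3))
print(round(rv_for_rate(2, 0.7, 1), 3))
```

For part d the sketch returns approximately 3.5 Mflops, matching the hand calculation. Because Rs = 1 there, the rate and the speedup coincide numerically.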
