1. The following code segment, consisting of six instructions, needs to be executed 128 times to evaluate the vector arithmetic expression D(I) = A(I) + B(I) × C(I) for 0 ≤ I ≤ 127.

    Load R1, B(I)       /R1 ← Memory(α + I)/
    Load R2, C(I)       /R2 ← Memory(β + I)/
    Multiply R1, R2     /R1 ← (R1) × (R2)/
    Load R3, A(I)       /R3 ← Memory(γ + I)/
    Add R3, R1          /R3 ← (R3) + (R1)/
    Store D(I), R3      /Memory(θ + I) ← (R3)/

Assume the following timings on either a uniprocessor or a single PE in an SIMD machine:
a. 4 clock cycles for each Load or Store.
b. 2 clock cycles for Add.
c. 8 clock cycles for Multiply.

a. Calculate the total number of CPU cycles needed to execute the above code segment repeatedly 128 times on an SISD uniprocessor computer sequentially, ignoring all other time delays.
b. Consider an SIMD computer with 32 PEs that executes the above vector operations in six synchronized vector instructions over 32-component vector data, with both machines driven by the same speed clock. Calculate the total execution time on the SIMD machine, ignoring instruction broadcast and other delays.
c. What is the speedup gain of the SIMD computer over the SISD computer?

Answer

a. The total number of CPU cycles if executed on an SISD machine is 3,328 cycles.

Instruction   Clock cycles   Times per loop   Loops   Total cycles
Load          4              3                128     1,536
Multiply      8              1                128     1,024
Add           2              1                128     256
Store         4              1                128     512
Total clock cycles                                    3,328

b. The total number of CPU cycles if executed on an SIMD machine with 32 PEs is 3,328/32 = 104 clock cycles.
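The cycle counts for parts a and b can be reproduced with a few lines of Python. This is only a sketch of the arithmetic. The names `CYCLES`, `SEGMENT`, `sisd_cycles` and `simd_cycles` are illustrative and not from the original. It assumes the SIMD machine processes the 128 elements in 128/32 = 4 passes of six vector instructions each.

```python
# Cycle costs taken from the problem statement.
CYCLES = {"Load": 4, "Store": 4, "Add": 2, "Multiply": 8}

# The six instructions of one loop iteration, in program order.
SEGMENT = ["Load", "Load", "Multiply", "Load", "Add", "Store"]


def cycles_per_iteration(segment=SEGMENT, cycles=CYCLES):
    """Clock cycles needed for one pass through the six instructions."""
    return sum(cycles[op] for op in segment)


def sisd_cycles(n=128):
    """SISD: every one of the n iterations runs sequentially."""
    return n * cycles_per_iteration()


def simd_cycles(n=128, pes=32):
    """SIMD (assumption): each pass handles `pes` elements in lock step."""
    passes = -(-n // pes)  # ceiling division
    return passes * cycles_per_iteration()


if __name__ == "__main__":
    print("SISD cycles:", sisd_cycles())
    print("SIMD cycles:", simd_cycles())
    print("Speedup:", sisd_cycles() / simd_cycles())
```

Under these assumptions the ratio of the two counts, which is what part c asks for, comes to 3,328/104 = 32.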
Arwin@23206008
2. Consider the execution of the following code segment consisting of seven statements. Use Bernstein's conditions to detect the maximum parallelism embedded in this code. Justify the portions that can be executed in parallel and the remaining portions that must be executed sequentially.

S1: A = B + C
S2: C = D + E
S3: F = G + E
S4: C = A + F
S5: M = G + C
S6: A = L + C
S7: A = E + A

Answer

Let Ii and Oi denote the input set and output set of statement Si. The following pairs cannot be executed in parallel (∦):

o S1 ∦ S2 because I1 ∩ O2 ≠ ∅
o S1 ∦ S4 because I1 ∩ O4 ≠ ∅
o S1 ∦ S6 because O1 ∩ O6 ≠ ∅
o S1 ∦ S7 because O1 ∩ O7 ≠ ∅
o S2 ∦ S4 because O2 ∩ O4 ≠ ∅
o S4 ∦ S3 because I4 ∩ O3 ≠ ∅
o S4 ∦ S5 because I5 ∩ O4 ≠ ∅
o S4 ∦ S6 because I4 ∩ O6 ≠ ∅
o S4 ∦ S7 because I4 ∩ O7 ≠ ∅

[Figure: execution schedule, Cycle 1 to Cycle 4]

Collectively, the statements that can be executed in parallel are S1 || S3 || S5 and S2 || S7, whereas S4 and S6 cannot be parallelized with the other statements and must therefore be executed sequentially.
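The pairwise checks above can be automated. The sketch below applies Bernstein's three conditions (Ii ∩ Oj = ∅, Ij ∩ Oi = ∅, Oi ∩ Oj = ∅) to any two statements. The input and output sets are read off the code segment. The function names are illustrative. Note that this is only the pairwise test. It does not account for program order or for dependences carried through intermediate statements, so it is not a complete scheduler.

```python
from itertools import combinations

# (input set, output set) of each statement, read off the code segment.
STATEMENTS = {
    "S1": ({"B", "C"}, {"A"}),
    "S2": ({"D", "E"}, {"C"}),
    "S3": ({"G", "E"}, {"F"}),
    "S4": ({"A", "F"}, {"C"}),
    "S5": ({"G", "C"}, {"M"}),
    "S6": ({"L", "C"}, {"A"}),
    "S7": ({"E", "A"}, {"A"}),
}


def bernstein_conflicts(a, b, stmts=STATEMENTS):
    """Return the violated Bernstein conditions for statements a and b.

    The result maps a label such as 'I4 ∩ O7' to the shared variables.
    An empty dict means the pair passes all three conditions.
    """
    (in_a, out_a), (in_b, out_b) = stmts[a], stmts[b]
    i, j = a[1:], b[1:]
    checks = {
        f"I{i} ∩ O{j}": in_a & out_b,
        f"I{j} ∩ O{i}": in_b & out_a,
        f"O{i} ∩ O{j}": out_a & out_b,
    }
    return {label: shared for label, shared in checks.items() if shared}


def can_run_in_parallel(a, b):
    """True if the pair satisfies all three of Bernstein's conditions."""
    return not bernstein_conflicts(a, b)


if __name__ == "__main__":
    for a, b in combinations(STATEMENTS, 2):
        conflicts = bernstein_conflicts(a, b)
        status = "parallel" if not conflicts else f"dependent: {conflicts}"
        print(a, b, status)
```

For example, `can_run_in_parallel("S1", "S3")` holds, while `bernstein_conflicts("S4", "S7")` reports the shared variable A between I4 and O7.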
To solve this problem, one can refer to Chapter 2, page 79, which uses bit-position routing, and to Chapter 7, pages 384-385, which uses an algorithmic approach. The two methods give the same answer.

The general formula is N = 2^n, where N is the number of processors and n is the dimension of the cube. In this case N = 32 and n = 5. Each node can be represented by a 5-bit sequence written as C4 C3 C2 C1 C0.

The source node s is 10110 and the destination node d is 01101.

1) Calculate ri = s(i−1) ⊕ d(i−1) for i = 1, ..., 5, which gives r = r5 r4 r3 r2 r1 = 11011.

      1 0 1 1 0    (s)
    ⊕ 0 1 1 0 1    (d)    note the bits that differ
      1 1 0 1 1    (r)

The routing steps for the last two dimensions are:

    r4 = 1: s3 ⊕ 2^3 → 11101
    r5 = 1: s4 ⊕ 2^4 → 01101

The routing is 10110 → 10111 → 10101 → 11101 → 01101.

b. Node (01110).
d. Node (11110). Routing via middle bit C4 → 11110, so that the routing is 10110 → 11110.
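The bit-by-bit routing from 10110 to 01101 can be checked with a short Python sketch. It assumes dimension-ordered routing from the lowest dimension upward: compute r = s ⊕ d, then flip bit C(i−1) of the current node whenever ri = 1. The function name `ecube_route` is illustrative.

```python
def ecube_route(src, dst):
    """Return the list of nodes visited when routing from src to dst.

    src and dst are bit strings of equal length, e.g. '10110'.
    Dimensions are resolved from the least significant bit upward.
    """
    n = len(src)
    current = int(src, 2)
    r = current ^ int(dst, 2)  # direction bits: 1 where s and d differ
    path = [src]
    for i in range(n):  # bit i corresponds to r_(i+1)
        if (r >> i) & 1:
            current ^= 1 << i  # move along dimension i
            path.append(format(current, "0{}b".format(n)))
    return path


if __name__ == "__main__":
    print(" -> ".join(ecube_route("10110", "01101")))
```

The number of hops equals the number of 1 bits in r, which is four for r = 11011.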
4. Consider a vector computer which can operate in one of two execution modes at a time: one is the vector mode with an execution rate of Rv = 10 Mflops, and the other is the scalar mode with an execution rate of Rs = 2 Mflops. Let α be the percentage of code that is vectorized.
a. Derive an expression for the average execution rate Ra for this computer.
b. Plot Ra as a function of α in the range (0,1).
c. Determine the vectorization ratio α needed in order to achieve an average execution rate of Ra = 7.5 Mflops.
d. Suppose Rs = 1 Mflops and α = 0.7. What value of Rv is needed to achieve Ra = 2 Mflops?

Answer

a. Deriving Ra

Refer to equation (8.11) in Hwang's book, Chapter 8, page 413. This equation is meant to calculate the relative performance, that is, the speedup of vector processing over scalar processing. It is not intended to calculate the average execution rate Ra (?). Using this equation, Ra can be derived as follows:

$$R_a = P = \frac{r}{(1-\alpha)\,r + \alpha}, \qquad \text{where } r = \frac{R_v}{R_s}$$

hence

$$R_a = \frac{R_v/R_s}{(1-\alpha)\,\dfrac{R_v}{R_s} + \alpha} = \frac{5}{(1-\alpha)\,5 + \alpha} = \frac{5}{5 - 4\alpha}$$

Note: I do not really agree with it; it is not consistent. How can P become Ra?

b. Ra as a function of α

α      Ra (Mflops)
0      1
0.1    1.09
0.2    1.19
0.3    1.32
0.4    1.47
0.5    1.67
0.6    1.92
0.7    2.27
0.8    2.78
0.9    3.57
1      5

[Figure: plot "Ra vs alpha", with alpha from 0 to 1 on the horizontal axis and Ra on the vertical axis]

c. Setting Ra = 7.5 in the expression from part a:

$$R_a = \frac{5}{5 - 4\alpha} \;\rightarrow\; 7.5 = \frac{5}{5 - 4\alpha}$$

so that

$$37.5 - 30\alpha = 5$$
$$32.5 = 30\alpha$$
$$\alpha = \frac{32.5}{30} = 1.083$$

Since α is a fraction of the code, a value above 1 is not attainable, which is in line with the doubt raised in the note to part a.

d. Refer to equation (8.11) above.
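$$R_a = \frac{R_v/R_s}{(1-\alpha)\,\dfrac{R_v}{R_s} + \alpha} \;\rightarrow\; R_a = \frac{R_v}{(1-\alpha)\,R_v + \alpha}$$

$$2 = \frac{R_v}{(1 - 0.7)\,R_v + 0.7}$$
$$2 = \frac{R_v}{0.3\,R_v + 0.7}$$
$$0.6\,R_v + 1.4 = R_v$$
$$1.4 = 0.4\,R_v$$
$$R_v = \frac{1.4}{0.4} = 3.5 \text{ Mflops}$$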
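The numbers in parts b, c and d can be reproduced with the sketch below. `ra_speedup_form` is the expression used in this answer, r / ((1 − α) r + α) with r = Rv/Rs. As a clearly labeled assumption that is not part of the original answer, `ra_harmonic` evaluates the time-weighted harmonic mean 1 / (α/Rv + (1 − α)/Rs), which is one common way to define an average rate in Mflops. The two agree when Rs = 1, as in part d. All function names are illustrative.

```python
def ra_speedup_form(alpha, rv=10.0, rs=2.0):
    """Expression used in the answer: r / ((1 - alpha) * r + alpha)."""
    r = rv / rs
    return r / ((1 - alpha) * r + alpha)


def ra_harmonic(alpha, rv=10.0, rs=2.0):
    """ASSUMPTION, not from the answer: time-weighted harmonic mean."""
    return 1.0 / (alpha / rv + (1 - alpha) / rs)


def alpha_for_speedup_form(target, rv=10.0, rs=2.0):
    """Solve ra_speedup_form(alpha) = target for alpha."""
    r = rv / rs
    return (r - r / target) / (r - 1)


def alpha_for_harmonic(target, rv=10.0, rs=2.0):
    """Solve ra_harmonic(alpha) = target for alpha (same assumption)."""
    return (1 / rs - 1 / target) / (1 / rs - 1 / rv)


def rv_for_target(target, alpha, rs=1.0):
    """Solve the answer's expression for Rv, given Ra = target."""
    r = target * alpha / (1 - target * (1 - alpha))
    return r * rs


if __name__ == "__main__":
    for k in range(11):
        a = k / 10
        print(a, round(ra_speedup_form(a), 2), round(ra_harmonic(a), 2))
    print("part c, answer's form:", alpha_for_speedup_form(7.5))
    print("part c, harmonic form:", alpha_for_harmonic(7.5))
    print("part d:", rv_for_target(2.0, 0.7))
```

With the answer's expression, part c yields α of about 1.083. Under the harmonic-mean assumption the same target gives α of about 0.917, which lies inside the valid range.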