
Two Equations to Evaluate Alternatives

Amdahl's Law
The performance gain that can be obtained by improving some portion
of a computer can be calculated using Amdahl's Law.
Amdahl's Law defines the speedup that can be gained by using a
particular feature.
It is used to find the maximum expected improvement to an overall system
when only part of the system is improved.

Improvement rate N = (original execution time) / (new execution time)


New execution time = time unaffected + time affected / (Improvement
Factor)

The CPU Performance Equation


Essentially all computers are constructed using a clock running at a
constant rate.
CPU time can then be expressed as a number of clock cycles.

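Amdahl's Law
Speedup is the ratio

$$\text{Speedup} = \frac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without using the enhancement}}$$

Alternatively,

$$\text{Speedup} = \frac{\text{Execution time for entire task without using the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$$

The speedup from an enhancement depends on two factors: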


Fraction_enhanced: the fraction of the computation time in the original
computer that can be converted to take advantage of the enhancement (always ≤ 1).
If 20 seconds of the execution time of a program that takes 60 seconds in
total can use an enhancement, the fraction is 20/60.

Speedup_enhanced: the improvement gained by the enhanced execution mode,
that is, how much faster the task would run if the enhanced mode were used
for the entire program (always > 1).
This value is the time of the original mode over the time of the enhanced
mode.
If the enhanced mode takes 2 seconds for a portion of the program that takes
5 seconds in the original mode, the improvement is 5/2.

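Thus, the overall execution time is the time of the unenhanced portion of the
machine plus the time spent using the enhancement, that is,

$$\text{Execution time}_{new} = \text{Execution time}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{Execution time}_{old}}{\text{Execution time}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

For a multiprocessor with p processors:

$$S(p) = \frac{\text{Execution time using one processor (best sequential algorithm)}}{\text{Execution time using a multiprocessor with } p \text{ processors}} = \frac{t_s}{t_p}$$

Even with an infinite number of processors, the maximum speedup is limited to 1/f,
where f is the fraction of the computation that must run sequentially.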

Suppose that we want to enhance the processor used for Web serving. The
new processor is 10 times faster on computation in the Web serving
application than the original processor. Assuming that the original processor
is busy with computation 40% of the time and is waiting for I/O 60% of the
time, what is the overall speedup gained by incorporating the enhancement?
Answer
Fraction_enhanced = 0.4, Speedup_enhanced = 10

$$\text{Speedup}_{overall} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.6 + 0.04} = \frac{1}{0.64} \approx 1.56$$
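
The Web-serving calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `amdahl_speedup` and its parameter names are illustrative choices, not part of the original slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is improved."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # Unenhanced share of the time plus the enhanced share, shrunk by its speedup.
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# Web-serving example: computation is 40% of the time and becomes 10x faster.
web_speedup = amdahl_speedup(0.4, 10.0)
print(f"Overall speedup: {web_speedup:.2f}")  # 1/0.64, about 1.56
```

Although the computation itself runs 10 times faster, the 60% of time spent waiting for I/O keeps the overall gain near 1.56.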

Amdahl's Law can serve as a guide to how much an enhancement will
improve performance and how to distribute resources to improve cost-performance.

A common transformation required in graphics processors is square root.


Implementations of floating-point (FP) square root vary significantly in
performance, especially among processors designed for graphics.
Suppose FP square root (FPSQR) is responsible for 20% of the execution time
of a critical graphics benchmark. One proposal is to enhance the FPSQR
hardware and speed up this operation by a factor of 10.
The other alternative is just to try to make all FP instructions in the graphics
processor run faster by a factor of 1.6; FP instructions are responsible for half
of the execution time for the application. The design team believes that they
can make all FP instructions run 1.6 times faster with the same effort as
required for the fast square root.
Compare these two design alternatives.
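Answer

We can compare these two alternatives by comparing the speedups:

$$\text{Speedup}_{FPSQR} = \frac{1}{(1 - 0.2) + \dfrac{0.2}{10}} = \frac{1}{0.82} \approx 1.22$$

$$\text{Speedup}_{FP} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{1.6}} = \frac{1}{0.8125} \approx 1.23$$

Improving the performance of the FP operations overall is slightly better
because of the higher frequency.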
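
The comparison of the two design alternatives can be sketched in Python as follows. The helper name `amdahl_speedup` and the dictionary labels are assumptions made for illustration; the fractions and factors are the ones given in the problem statement.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Amdahl's Law: `fraction` of the time is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# (fraction of execution time affected, speedup of that fraction)
alternatives = {
    "FPSQR 10x faster": amdahl_speedup(0.2, 10.0),
    "all FP 1.6x faster": amdahl_speedup(0.5, 1.6),
}

for name, s in alternatives.items():
    print(f"{name}: {s:.2f}")

best = max(alternatives, key=alternatives.get)
print("Better alternative:", best)
```

The modest 1.6x improvement wins narrowly because it applies to half of the execution time rather than a fifth.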

Performance Enhancement: Amdahl's Law

[Figure: speedup (s) versus enhancement factor (p), both axes running from 0 to 50, with one curve for each of f = 0, 0.01, 0.02, 0.05, and 0.1, where f = fraction unaffected and p = speedup of the rest.]

$$s = \frac{1}{f + (1 - f)/p} \le \min(p,\, 1/f)$$

Amdahl's law: speedup achieved if a fraction f of a task is unaffected
and the remaining 1 − f part runs p times as fast.

Computer Architecture,
Background and Motivation
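
The saturation behavior plotted in the figure can be reproduced numerically. The sketch below tabulates s for the same values of f and a few values of p, and checks the bound min(p, 1/f); the function names `speedup` and `upper_bound` are illustrative assumptions.

```python
def speedup(f: float, p: float) -> float:
    """s = 1 / (f + (1 - f) / p), where f is the unaffected fraction."""
    return 1.0 / (f + (1.0 - f) / p)


def upper_bound(f: float, p: float) -> float:
    """min(p, 1/f); with f = 0 the only limit is p itself."""
    return p if f == 0 else min(p, 1.0 / f)


F_VALUES = (0.0, 0.01, 0.02, 0.05, 0.1)
P_VALUES = (10, 20, 30, 40, 50)

# One row per curve in the figure, one column per enhancement factor.
table = {f: [speedup(f, p) for p in P_VALUES] for f in F_VALUES}

print("f      " + "  ".join(f"p={p:<4}" for p in P_VALUES))
for f, row in table.items():
    print(f"{f:<6} " + "  ".join(f"{s:6.2f}" for s in row))
```

Even a small unaffected fraction matters: with f = 0.1, the speedup stays below 10 no matter how large p becomes.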

A processor spends 30% of its time on floating-point (flp) addition, 25% on
flp multiplication, and 10% on flp division. Evaluate the following
enhancements, each costing the same to implement:

a. Redesign of the flp adder to make it twice as fast.
b. Redesign of the flp multiplier to make it three times as fast.
c. Redesign of the flp divider to make it 10 times as fast.

Solution
a. Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18
b. Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10

Generalized Amdahl's Law

Original running time of a program = 1 = f1 + f2 + . . . + fk

New running time after the fraction fi is sped up by a factor pi:

$$\frac{f_1}{p_1} + \frac{f_2}{p_2} + \cdots + \frac{f_k}{p_k}$$

Speedup formula:

$$S = \frac{1}{\dfrac{f_1}{p_1} + \dfrac{f_2}{p_2} + \cdots + \dfrac{f_k}{p_k}}$$

If a particular fraction is slowed down rather than sped up,
use sj fj instead of fj / pj, where sj > 1 is the slowdown factor.
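
The generalized formula translates directly into a short function. This is a sketch under stated assumptions: the name `generalized_speedup` is illustrative, a slowdown by factor s is passed as p = 1/s (which makes f/p equal to s·f), and the combined scenario below, applying all three flp redesigns at once, is a hypothetical illustration rather than a case worked in the slides.

```python
import math


def generalized_speedup(parts: list[tuple[float, float]]) -> float:
    """parts holds (fraction f_i, factor p_i) pairs; fractions must sum to 1.

    Use p_i = 1.0 for an unchanged part and p_i = 1/s_i for a part that is
    slowed down by a factor s_i > 1.
    """
    total_fraction = sum(f for f, _ in parts)
    if not math.isclose(total_fraction, 1.0, abs_tol=1e-9):
        raise ValueError("fractions must sum to 1")
    new_time = sum(f / p for f, p in parts)  # f1/p1 + f2/p2 + ... + fk/pk
    return 1.0 / new_time


# Hypothetical: adder 2x, multiplier 3x, divider 10x, remaining 35% unchanged.
combined = generalized_speedup([(0.30, 2.0), (0.25, 3.0), (0.10, 10.0), (0.35, 1.0)])
print(f"Combined speedup: {combined:.2f}")
```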

The CPU Performance Equation

Essentially all computers are constructed using a clock (also called ticks, clock
ticks, clock periods, clocks, cycles, or clock cycles) running at a constant rate.
Clock rate: today in GHz
Clock cycle time: clock cycle time = 1/clock rate
Ex. 1 GHz clock rate = 1 ns cycle time
Thus, the CPU time for a program can be expressed two ways:

$$\text{CPU time} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$$

where "CPU clock cycles for a program" is the number of clock cycles needed to execute the program, or

$$\text{CPU time} = \frac{\text{CPU clock cycles for a program}}{\text{Clock rate}}$$

We can also count the number of instructions executed, called the instruction path
length or instruction count (IC).
If we know the number of clock cycles and the IC, we can calculate the average number of
clock cycles per instruction (CPI).
CPI is computed as

$$\text{CPI} = \frac{\text{CPU clock cycles for a program}}{\text{IC}}$$

$$\text{CPU clock cycles} = \text{IC} \times \text{CPI}$$

This figure of merit (CPI) provides insight into different styles of instruction sets and
implementations.

$$\text{CPU time} = \text{IC} \times \text{CPI} \times \text{Clock cycle time} = \frac{\text{IC} \times \text{CPI}}{\text{Clock rate}}$$
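
As a quick numerical check of the CPU performance equation, the sketch below computes CPU time both ways. The function names and the sample workload (one billion instructions, CPI of 2.0, 1 GHz clock) are hypothetical values chosen only for illustration.

```python
def cpu_time_from_rate(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time in seconds = IC x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def cpu_time_from_cycle(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time in seconds = IC x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical workload: 1e9 instructions, CPI = 2.0, 1 GHz clock (1 ns cycle).
ic, cpi, clock_rate = 1e9, 2.0, 1e9
cycle_time = 1.0 / clock_rate  # clock cycle time = 1 / clock rate

t_rate = cpu_time_from_rate(ic, cpi, clock_rate)
t_cycle = cpu_time_from_cycle(ic, cpi, cycle_time)
print(f"{t_rate:.3f} s, {t_cycle:.3f} s")  # both forms agree: 2 s
```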

Processor performance is dependent upon three characteristics:
instruction count,
clock cycles per instruction,
and clock cycle time (or rate).
A given percentage improvement in any one of the three leads to the same
percentage improvement in CPU time.
Unfortunately, it is difficult to change one parameter in complete isolation
from the others, because the underlying technologies are interdependent:
Clock cycle time: hardware technology and organization;
CPI: organization and instruction set architecture;
Instruction count: instruction set architecture and compiler technology.

Computer architecture focuses on the CPI and IC parameters.


The total number of processor clock cycles can be calculated as

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i$$

IC_i: the number of times instruction i is executed in a program.
CPI_i: the average number of clocks per instruction for instruction i.
This form is useful in designing the processor.
To express CPU time again:

$$\text{CPU time} = \left( \sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i \right) \times \text{Clock cycle time}$$

And overall CPI as

$$\text{CPI} = \frac{\sum_{i=1}^{n} \text{IC}_i \times \text{CPI}_i}{\text{Instruction count}} = \sum_{i=1}^{n} \frac{\text{IC}_i}{\text{Instruction count}} \times \text{CPI}_i$$

Hint: CPI_i should be measured, because it must include pipeline effects, cache misses, and
any other memory system inefficiencies.

IC_i / Instruction count represents the fraction of occurrences of that instruction in a program.

Suppose we have made the following measurements:


Frequency of FP operations = 25%,
Average CPI of FP operations =4.0,
Average CPI of other instructions = 1.33,
Assume that the design alternative is to decrease the average CPI
of all FP operations to 2.5. Evaluate this alternative using the processor performance equation.

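Answer
First, observe that only the CPI changes; the clock rate and
instruction count remain identical. We start by finding the original CPI
without the enhancement:

$$\text{CPI}_{original} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{IC}_i}{\text{Instruction count}} = (4 \times 25\%) + (1.33 \times 75\%) \approx 2.0$$

$$\text{CPI}_{new\,FP} = (75\% \times 1.33) + (25\% \times 2.5) \approx 1.625$$

Improvement rate N = (original execution time) / (new execution time)

$$\text{Speedup}_{new\,FP} = \frac{\text{CPU time}_{original}}{\text{CPU time}_{new\,FP}} = \frac{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{original}}{\text{IC} \times \text{Clock cycle} \times \text{CPI}_{new\,FP}} = \frac{\text{CPI}_{original}}{\text{CPI}_{new\,FP}} = \frac{2.00}{1.625} \approx 1.23$$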
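
The weighted-CPI arithmetic in this answer can be reproduced with a small sketch. The function name `average_cpi` and the list layout are assumptions for illustration; the frequencies and CPI values are the measurements given above. Using 1.33 exactly (rather than 4/3) yields CPIs slightly below the rounded values 2.0 and 1.625, and the speedup still rounds to 1.23.

```python
def average_cpi(mix: list[tuple[float, float]]) -> float:
    """Overall CPI as the sum of (instruction frequency x CPI) over all classes."""
    return sum(freq * cpi for freq, cpi in mix)


# (frequency, CPI) pairs: FP operations first, then all other instructions.
cpi_original = average_cpi([(0.25, 4.0), (0.75, 1.33)])
cpi_new_fp = average_cpi([(0.25, 2.5), (0.75, 1.33)])

# IC and clock cycle time are unchanged, so the speedup is just the CPI ratio.
speedup_new_fp = cpi_original / cpi_new_fp

print(f"CPI original = {cpi_original:.4f}")
print(f"CPI new FP   = {cpi_new_fp:.4f}")
print(f"Speedup      = {speedup_new_fp:.2f}")
```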

Amdahl's Law vs. CPU Performance

The CPU performance equation is often more practical than Amdahl's Law,
because its components can be measured:
With Amdahl's Law, it can be difficult to measure the fraction of execution
time for which a set of instructions is responsible;
For an existing processor, measuring execution time and clock speed is
easy;
The challenge lies in discovering the instruction count or the CPI.
Most new processors include counters for both instructions executed
and clock cycles.

Dynamic Instruction Count

How many instructions are executed in this program fragment?
Each "for" consists of two instructions: increment index, check exit condition.

    250 instructions
    for i = 1, 100 do
        20 instructions
        for j = 1, 100 do
            40 instructions
            for k = 1, 100 do
                10 instructions
            endfor
        endfor
    endfor

Static count = 326

k loop: 2 + 10 instructions per iteration, 100 iterations, 1200 instructions in all
j loop: 2 + 40 + 1200 instructions per iteration, 100 iterations, 124,200 instructions in all
i loop: 2 + 20 + 124,200 instructions per iteration, 100 iterations, 12,422,200 instructions in all
Total: 250 + 12,422,200 = 12,422,450 instructions

Other loop headers shown on the slide: for i = 1, n; while x > 0
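
The dynamic count can be derived mechanically from the loop structure. The sketch below works from the innermost loop outward; the helper name `loop_count` and the variable names are illustrative, while the per-loop overhead of 2 instructions and the body sizes come from the fragment above.

```python
LOOP_OVERHEAD = 2  # each "for": increment index + check exit condition


def loop_count(iterations: int, body_instructions: int) -> int:
    """Instructions executed by a loop: every iteration pays overhead plus body."""
    return iterations * (LOOP_OVERHEAD + body_instructions)


k_loop = loop_count(100, 10)           # innermost loop
j_loop = loop_count(100, 40 + k_loop)  # 40 instructions, then the k loop
i_loop = loop_count(100, 20 + j_loop)  # 20 instructions, then the j loop
dynamic_count = 250 + i_loop           # straight-line code before the loops

# Static count: every instruction in the program text counted once.
static_count = 250 + 3 * LOOP_OVERHEAD + 20 + 40 + 10

print(f"k loop: {k_loop:,}")
print(f"j loop: {j_loop:,}")
print(f"i loop: {i_loop:,}")
print(f"dynamic: {dynamic_count:,}  static: {static_count}")
```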

Faster Clock ≠ Shorter Running Time

Suppose addition takes 1 ns.
At 1 GHz: clock period = 1 ns; 1 cycle
At 2 GHz: clock period = 0.5 ns; 2 cycles

In this example, addition time does not improve in going from a
1 GHz to a 2 GHz clock.

[Figure: 4 steps at 1 GHz versus 20 steps at 2 GHz. Faster steps do not necessarily mean shorter travel time.]

Peak performance is often expressed in units of instructions per second (IPS),
commonly as MIPS (millions of IPS) or GIPS (billions of IPS).
For floating-point calculation, floating-point operations per second (FLOPS) is
used as the unit, commonly as MFLOPS or GFLOPS.
MIPS is a method of measuring the raw speed of a computer's processor and is
defined as the number of machine instructions (in millions) that a processor
can execute in one second.

$$\text{MIPS} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

This number gives an idea of the speed of a CPU, as faster processors
have a higher MIPS rating than slower ones.
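
A minimal sketch of the MIPS formula follows. The function name `mips` is an assumption, and the sample inputs (a 600 MHz clock with CPIs of 2.0 and 2.95) are used purely as illustrative values.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI x 10^6)."""
    if cpi <= 0:
        raise ValueError("cpi must be positive")
    return clock_rate_hz / (cpi * 1e6)


# Illustrative values: the same 600 MHz clock with two different average CPIs.
best_case = mips(600e6, 2.0)
typical = mips(600e6, 2.95)
print(f"{best_case:.0f} MIPS at CPI 2.0, {typical:.0f} MIPS at CPI 2.95")
```

The same clock rate gives very different MIPS ratings depending on CPI, so a MIPS figure is only meaningful for a stated instruction mix.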

CPI and IPS Calculations


Consider two implementations M1 (600 MHz) and M2 (500 MHz) of
an instruction set containing three classes of instructions:

Class   CPI for M1   CPI for M2   Comments
F       5.0          4.0          Floating-point
I       2.0          3.8          Integer arithmetic
N       2.4          2.0          Nonarithmetic

a. What are the peak performances of M1 and M2 in MIPS?

Solution
a. Peak MIPS for M1 = 600 / 2.0 = 300 (assume all class I)
   Peak MIPS for M2 = 500 / 2.0 = 250 (assume all class N)

b. If 50% of instructions executed are class N, with the rest divided equally among
F and I, which machine is faster? By what factor?

Solution
b. Average CPI for M1 = (5.0 x 0.25) + (2.0 x 0.25) + (2.4 x 0.5) = 2.95
   Average CPI for M2 = (4.0 x 0.25) + (3.8 x 0.25) + (2.0 x 0.5) = 2.95
Average CPIs are the same, so M1 is faster than M2 by a factor of 1.2
(the ratio of the clock rates, 600 / 500).
c. Designers of M1 plan to redesign the machine for better performance. With the
assumption of part b, which of the following redesign options has the greatest
performance impact and why?
1. Using a faster floating-point unit with double the speed
(class F CPI = 2.5)
2. Adding a second integer ALU to reduce the integer CPI to 1.20
3. Using faster logic that allows a clock rate of 750 MHz with the same
CPIs

Solution:
Option 1:
Average CPI for M1 = (2.5 x 0.25) + (2.0 x 0.25) + (2.4 x 0.5) = 2.325
MIPS = 600 / 2.325 = 258
Option 2:
Average CPI for M1 = (5 x 0.25) + (1.2 x 0.25) + (2.4 x 0.5) = 2.75
MIPS = 600 / 2.75 = 218
Option 3:
CPI = 2.95
MIPS = 750 / 2.95 = 254
Option 1 has the greatest impact because it yields the highest MIPS rating.
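
The three redesign options can be compared programmatically. This is a sketch under the assumptions of part b (50% class N, 25% each for F and I); the names `MIX`, `average_cpi`, `mips`, and the option labels are illustrative, not from the slides.

```python
MIX = {"F": 0.25, "I": 0.25, "N": 0.50}       # instruction mix from part b
M1_CPI = {"F": 5.0, "I": 2.0, "N": 2.4}       # baseline CPIs for M1


def average_cpi(cpi_by_class: dict[str, float]) -> float:
    """Mix-weighted average CPI."""
    return sum(MIX[c] * cpi_by_class[c] for c in MIX)


def mips(clock_mhz: float, cpi: float) -> float:
    """With the clock in MHz, MIPS is simply clock rate / CPI."""
    return clock_mhz / cpi


# Each option: (clock rate in MHz, per-class CPIs after the redesign).
options = {
    "1: faster FP unit": (600, {**M1_CPI, "F": 2.5}),
    "2: second integer ALU": (600, {**M1_CPI, "I": 1.2}),
    "3: 750 MHz clock": (750, M1_CPI),
}

results = {name: mips(clk, average_cpi(cpis)) for name, (clk, cpis) in options.items()}
for name, rating in results.items():
    print(f"Option {name}: {rating:.0f} MIPS")

best = max(results, key=results.get)
print("Greatest impact: option", best)
```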

Prefixes for large units:
Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15
For memory:
K = 2^10 = 1024, M = 2^20, G = 2^30, T = 2^40, P = 2^50
Prefixes for small units:
micro = 10^-6, nano = 10^-9, pico = 10^-12, femto = 10^-15
