You are on page 1of 10

UEEA4653: Computer Architecture

Assessing &
Understanding Performance

UUEA4653 May 2015

Major Goals

To tell which computer is faster

From computer architects perspective:


Designing high-performance computers is one of the major goals
In this chapter, well
Learn the components and factors influencing performance
Introduce methodologies to evaluate performance
Metric for evaluation performance measure
Basis for comparison benchmark
Understand cost vs. performance implications of different architectures

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Questions We Usually Ask

What factors of system performance are hardware related?

e.g., do we need a new machine, or a new operating system?

Why is some hardware better than others for different programs?


How does the machines instruction set affect performance?

Answers to these questions hinge on our ability to evaluate


performance!

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Measuring Performance

CPU time:

Time spent running the program only


Does not count I/O and time spent on running other programs
Our focus in this course
System CPU time: CPU time spent in operating system (OS)

Response time (execution time, elapsed time or wall-clock time):

Time between the start and completion of a task


i.e. CPU time + waiting time
Waiting time disk access, memory access, I/O, OS overhead, etc
Computer users care about it

Throughput:
Total number of jobs completed per unit time
Computer administrators care about it

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Response Time vs. Throughput

Question: Throughput = 1/ Response time?


Only if each component in the system does not overlap
Example
No overlap
Job A

Job B

5s

5s

Throughput = 2/10 = 0.2 = 1/5

Overlap

Job A
3s

Job B

Throughput = 2/8 = 0.25 1/5

2s
3s
(overlap)

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Response Time vs. Throughput (2)

Assume each instruction takes 5 stages to execute


No Pipelining, (takes 5*4 = 20 stages to finish 4 instructions)

With Pipelining

(takes 5+1+1+1 = 8 stages)


Copyright Dr. Lai An Chow

Two Pipelines

(takes 5+1 = 6 stages)


Assessing & Understanding Performance

Response Time vs. Throughput (3)

Example:
Do the following changes to a computer system increase throughput,
decrease response time, or both?
Replacing the processor in a computer with a faster version
Both
Adding additional processors to a system
Throughput increases
Individual response time does not decrease
How about average response time when heavily loaded?
Decreasing response time almost always improves throughput
In real system, execution time and throughput often affect each other
We will focus primarily on execution time for a single job

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

1. Methodologies

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Execution Time & Performance

For some program running on machine X:

performance X =

1
execution time X

Execution time: smaller the value is better


Performance: bigger the value is better

If the performance of machine X is greater than that of Y, i.e.

performance X > performanc e Y


then we can conclude that
execution time X < execution time Y

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Performance Comparison

10

Machine X is n times faster than machine Y means:

n=

performance X execution time Y


=
performanc e Y execution time X

Machine X is m% faster than Y means:

n=

performance X execution time Y


=
= 1 + m / 100
performanc e Y execution time X

Example:

Machine X runs a program in 10 sec


Machine Y runs the same program in 15 sec

15 / 10 = 1.5 = 1 + 50/100 X is 50% faster than Y

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Clock Cycles

11

Digital computers have a digital clock that runs at a constant rate


Clock to synchronize occurrence of different events in the system
Clock cycles refer to clock ticks
time
Clock cycle time (or clock period)

time between ticks (in second)

Clock rate

number of ticks per second (in Hz)

Example
A 200 MHz clock has a cycle time of 1 / (200x106) = 5x10-9 = 5 ns

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Cycles Per Instruction

12

Wrong assumption
# of CPU clock cycles in a program = # of instructions in the program

Reasons
Execution times for different type of instructions may vary
Multiplication takes more time than addition
Floating-point operations take more time than integer ones
Accessing memory takes more time than accessing registers

Problems of using # of instructions as the metric
When comparing two machines with same ISA,
Same # of instructions do not imply same performance!
What should we use then? Answer: Cycles per instructions (CPI)

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Cycles Per Instruction (2)

13

Meaning
Average # of clock cycles each instruction takes to execute in a program
Definition

CPI =

total number of CPU clock cycles


total number of instruction executed

Smaller CPI better performance

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Cycles Per Instruction (3)

14

Let there be n different instruction classes (with different CPIs)


For a given program, suppose we know:

CPIi = CPI for instruction class i


Ci = number of instructions of class i
The formula

CPU clock cycles = CPI instruction count


can be generalized to
n
CPU clock cycles = (CPIi C i )
i=1

Thus, CPI for the entire program


n

CPI = (CPIi C i )
i=1

Copyright Dr. Lai An Chow

Ci

i=1

Assessing & Understanding Performance

Example Comparing Code Segments

15

Description

A particular machine has the following hardware facts:


Instruction class
A
B
C

CPI for this instruction class


1
2
3

For a given C++ statement, a compiler designer considers two


code sequences with the following instruction counts:
Code sequence
1
2

Instruction counts for instruction classes

A
2
4

B
1
1

C
2
1

Problem to solve

Which code sequence executes the most instructions? Which is


faster? What is the CPI for each sequence?

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Example - Answer

16

Instruction count: Sequence 1 : 2 + 1 + 2 = 5 instructions

Sequence 2 : 4 + 1 + 1 = 6 instructions (more)


CPU clock cycles:

CPU clock cycles = (CPIi C i )


i=1

CPU clock cycles1 = 1 2 + 2 1 + 3 2 = 10


CPU clock cycles 2 = 1 4 + 2 1 + 3 1 = 9 (faster)

CPI1 =

CPI:

10
=2
5

CPI2 =

9
= 1.5
6

Remarks:

Sequence 2 has more instructions, but it is actually more efficient


Instruction count alone is not a reliable measure, should use CPI
If these 2 codes are for different machines with different clock rates,
Is CPI still a good enough metric to tell which code is better?

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

IRON LAW !

17

The only complete and reliable measure is


CPU execution time
Other measures are unreliable because theyre not consistent
e.g., changing the instruction set to CISC to lower instruction count
Leading to a larger CPI
Or, complex hardware results in slower clock rate
Either case can offset the improvement in instruction count
Other inconsistent measurements
# of clock cycles to execute program
# of instructions in program
# of cycles per second
# of cycles per instruction, or # of instructions per second
Copyright Dr. Lai An Chow

Assessing & Understanding Performance

IRON LAW: Performance Equation

18

A program is broken down into instructions


Each instruction takes multiple clock cycles

Time =

seconds
program

# of instructions
a program

# of clocks
# of instructions

seconds
clocks

= instruction count * CPI * clock cycle time


= instruction count * CPI / clock rate
Instruction count (instruction executed, not static code)

Mostly determined by compiler and ISA

CPI

Mostly determined by ISA and CPU organization

Clock rate

Mostly determined by technology and CPU organization

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Example: Using the Performance Equation

19

Description
Two implementations of the same ISA, for a program
In computer A with clock cycle time 250ps, CPI is 2.0
In computer B with clock cycle time 500ps, CPI of 1.2
Question
Which computer is faster for this program, and by how much?
Answer (assume the program has I instructions)

CPU time A = I CPIA clock cycle time A


= I 2.0 250ps = 500 I ps

CPU timeB = I 1.2 500ps = 600 I ps


CPU Time B 600 I ps
=
= 1.2
CPU Time A 500 I ps
Copyright Dr. Lai An Chow

A is faster than B, by 20%


Assessing & Understanding Performance

Common Ways to Improve Performance

20

Referring to the three components in the iron law, we can either


Shorten clock cycle time (i.e. increase clock rate)
e.g. run the processor at a higher clock frequency
Reduce the CPI
e.g. pipelining
Reduce the number of instructions of a program
e.g. use a compiler that can optimize the code generation better
Remarks
Fine to use CPI when two conditions below are true in the comparison
# of instructions in the program remain unchanged, and
Clock rate remain unchanged
These two conditions are often not true when comparing two ISAs

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

21

Some performance measures commonly used by


computer manufacturers and sellers are
misleading!

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

A Misleading Measure MIPS

22

MIPS (Million Instructions Per Second)

As a joke, meaningless indicator of processor speed


For a given program:

MIPS =

instruction count

execution time 10 6
MIPS can vary inversely with performance

Example: (assume clock rate = R)


3 types of instructions; A,B,C; take 1,2,3 cycles respectively
contradicting
Before optimization, instruction count: A=10, B=1, C=1
After optimization, instruction count: A=5, B=1, C=1
Clocks (before) = 15 > Clocks (after) = 10 optimization is better
MIPS (before) = 0.8R > MIPS (after) = 0.7R optimization is worse

A machine cant have a single MIPS rating


Because different benchmarks have different instruction counts
Cannot compare two different ISAs
A program have different instruction counts on different ISAs
Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Another Misleading Measure - MFLOPS

23

MFLOPS (Million Floating-point Operations Per Second)

For a given program:

MFLOPS =

number of floating point operations


execution time 10 6

MFLOPS considers only floating-point operations


Not applicable to integer applications as MFLOPS = 0
Number of floating-point operations depends on:
Compiler; different compilers give different numbers of FP instr.
No support of FP division in some ISAs
Different FP operations have different execution times
FP multiplication takes longer time than FP add
Different programs have different mixtures of FP operations
Because of the above reasons, MFLOPS doesnt give a fair comparison

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

24

2. Benchmarks

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Evaluating Performance of Two Computers

25

What should we use to test their performance?


Ideally, we should use real applications that we use everyday
Programs typical of expected class of applications
e.g., software development tools, scientific applications, graphics, etc.
In reality, benchmarks
e.g. SPEC (Standard Performance Evaluation Cooperation) suite
Save money and effort
Smaller than real programs, easier to standardize
Not representative of real workload
To improve the quality of evaluation
Run a set of benchmarks

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Comparing and Summarizing Performance


1 n
Timei
n i =1

Arithmetic mean (AM):

26

n: number of benchmarks in the workload


Timei: execution time for program i

Example:
Computer

Benchmark

1 sec

10 sec

20 sec

1000 sec

100 sec

20 sec

AM

500.5 sec

55 sec

20 sec

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Comparing and Summarizing Performance


Weighted arithmetic mean:

Weight
i =1

Weight
i =1

27

Timei

=1

Timei: execution time for program i

Example:

Benchmark

Computer

Weight

W1

W2

10

20

0.91

0.999

1000

100

20

0.09

0.001

1001

110

40

AM

500.5

55

20

Weighted AM using W1

90.91

18.1

20

Weighted AM using W2

1.999

10.09

20

Total time (sum)

Use weighted AM when benchmarks are not equally important


Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Comparing and Summarizing Performance


Geometric Mean (GM):

28

ratio

i =1

n: number of benchmarks in the workload


Ratioi: ratio for program i

Example:

Benchmark

Speedup (ratios)

Computer

A/B

1 sec

10 sec

0.1

B/A
10

1000 sec

100 sec

10

0.1

AM

5.05

5.05

GM

1.0

1.0

A/B = 5.05 while B/A also = 5.05: Cannot be true!


Should use GM for ratios; only use AM for raw numbers
GM: Independent of reference machine (math property)

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Key Concepts to Remember

29

Iron Law: execution time is a consistent summary of performance

Time = (# of instructions in a program) * CPI * (clock cycle time)

Performance is always relative to a specific program


For a given ISA, performance increases come from:

Increases in clock rate


Improvements in processor organization that lower CPI
Compiler enhancements that give lower instruction count
or generate instructions with lower CPI

The best programs to use for benchmarks are real applications

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

10

You might also like