Performance 2015

UEEA4653: Computer Architecture
Assessing &
Understanding Performance
UUEA4653 May 2015
Major Goals
To tell which computer is faster
From computer architects perspective:

Designing high-performance computers is one of the major goals
In this chapter, well
Learn the components and factors influencing performance
Introduce methodologies to evaluate performance
Metric for evaluation performance measure
Basis for comparison benchmark
Understand cost vs. performance implications of different architectures
Copyright Dr. Lai An Chow
Assessing & Understanding Performance
Questions We Usually Ask
What factors of system performance are hardware related?
e.g., do we need a new machine, or a new operating system?
Why is some hardware better than others for different programs?

How does the machines instruction set affect performance?
Answers to these questions hinge on our ability to evaluate

performance!
Measuring Performance
CPU time:
Time spent running the program only

Does not count I/O and time spent on running other programs
Our focus in this course
System CPU time: CPU time spent in operating system (OS)
Response time (execution time, elapsed time or wall-clock time):
Time between the start and completion of a task

i.e. CPU time + waiting time
Waiting time disk access, memory access, I/O, OS overhead, etc
Computer users care about it
Throughput:
Total number of jobs completed per unit time
Computer administrators care about it
Response Time vs. Throughput
Question: Throughput = 1/ Response time?

Only if each component in the system does not overlap
Example
No overlap
Job A
Job B
5s
5s
Throughput = 2/10 = 0.2 = 1/5
Overlap
Job A
3s
Job B
Throughput = 2/8 = 0.25 1/5
2s
3s
(overlap)
Response Time vs. Throughput (2)
Assume each instruction takes 5 stages to execute

No Pipelining, (takes 5*4 = 20 stages to finish 4 instructions)
With Pipelining
(takes 5+1+1+1 = 8 stages)

Two Pipelines
(takes 5+1 = 6 stages)

Response Time vs. Throughput (3)
Example:
Do the following changes to a computer system increase throughput,
decrease response time, or both?
Replacing the processor in a computer with a faster version
Both
Adding additional processors to a system
Throughput increases
Individual response time does not decrease
How about average response time when heavily loaded?
Decreasing response time almost always improves throughput
In real system, execution time and throughput often affect each other
We will focus primarily on execution time for a single job
1. Methodologies
Execution Time & Performance
For some program running on machine X:
performance X =
1
execution time X
Execution time: smaller the value is better

Performance: bigger the value is better
If the performance of machine X is greater than that of Y, i.e.
performance X > performanc e Y

then we can conclude that
execution time X < execution time Y
Performance Comparison
10
Machine X is n times faster than machine Y means:
n=
performance X execution time Y

=
performanc e Y execution time X
Machine X is m% faster than Y means:
n=
performance X execution time Y

=
= 1 + m / 100
performanc e Y execution time X
Example:
Machine X runs a program in 10 sec

Machine Y runs the same program in 15 sec
15 / 10 = 1.5 = 1 + 50/100 X is 50% faster than Y
Clock Cycles
11
Digital computers have a digital clock that runs at a constant rate

Clock to synchronize occurrence of different events in the system
Clock cycles refer to clock ticks
time
Clock cycle time (or clock period)
time between ticks (in second)
Clock rate
number of ticks per second (in Hz)
Example
A 200 MHz clock has a cycle time of 1 / (200x106) = 5x10-9 = 5 ns
Cycles Per Instruction
12
Wrong assumption
# of CPU clock cycles in a program = # of instructions in the program
Reasons
Execution times for different type of instructions may vary
Multiplication takes more time than addition
Floating-point operations take more time than integer ones
Accessing memory takes more time than accessing registers

Problems of using # of instructions as the metric
When comparing two machines with same ISA,
Same # of instructions do not imply same performance!
What should we use then? Answer: Cycles per instructions (CPI)
Cycles Per Instruction (2)
13
Meaning
Average # of clock cycles each instruction takes to execute in a program
Definition
CPI =
total number of CPU clock cycles

total number of instruction executed
Smaller CPI better performance
Cycles Per Instruction (3)
14
Let there be n different instruction classes (with different CPIs)

For a given program, suppose we know:
CPIi = CPI for instruction class i

Ci = number of instructions of class i
The formula
CPU clock cycles = CPI instruction count

can be generalized to
n
CPU clock cycles = (CPIi C i )
i=1
Thus, CPI for the entire program

n
CPI = (CPIi C i )
i=1
Ci
i=1
Example Comparing Code Segments
15
Description
A particular machine has the following hardware facts:

Instruction class
A
B
C
CPI for this instruction class

1
2
3
For a given C++ statement, a compiler designer considers two

code sequences with the following instruction counts:
Code sequence
1
2
Instruction counts for instruction classes
A
2
4
B
1
1
C
2
1
Problem to solve
Which code sequence executes the most instructions? Which is

faster? What is the CPI for each sequence?
Example - Answer
16
Instruction count: Sequence 1 : 2 + 1 + 2 = 5 instructions
Sequence 2 : 4 + 1 + 1 = 6 instructions (more)

CPU clock cycles:
CPU clock cycles = (CPIi C i )

i=1
CPU clock cycles1 = 1 2 + 2 1 + 3 2 = 10

CPU clock cycles 2 = 1 4 + 2 1 + 3 1 = 9 (faster)
CPI1 =
CPI:
10
=2
5
CPI2 =
9
= 1.5
6
Remarks:
Sequence 2 has more instructions, but it is actually more efficient

Instruction count alone is not a reliable measure, should use CPI
If these 2 codes are for different machines with different clock rates,
Is CPI still a good enough metric to tell which code is better?
IRON LAW !
17
The only complete and reliable measure is

CPU execution time
Other measures are unreliable because theyre not consistent
e.g., changing the instruction set to CISC to lower instruction count
Leading to a larger CPI
Or, complex hardware results in slower clock rate
Either case can offset the improvement in instruction count
Other inconsistent measurements
# of clock cycles to execute program
# of instructions in program
# of cycles per second
# of cycles per instruction, or # of instructions per second
IRON LAW: Performance Equation
18
A program is broken down into instructions

Each instruction takes multiple clock cycles
Time =
seconds
program
# of instructions
a program
# of clocks
# of instructions
seconds
clocks
= instruction count * CPI * clock cycle time

= instruction count * CPI / clock rate
Instruction count (instruction executed, not static code)
Mostly determined by compiler and ISA
CPI
Mostly determined by ISA and CPU organization
Clock rate
Mostly determined by technology and CPU organization
Example: Using the Performance Equation
19
Description
Two implementations of the same ISA, for a program
In computer A with clock cycle time 250ps, CPI is 2.0
In computer B with clock cycle time 500ps, CPI of 1.2
Question
Which computer is faster for this program, and by how much?
Answer (assume the program has I instructions)
CPU time A = I CPIA clock cycle time A

= I 2.0 250ps = 500 I ps
CPU timeB = I 1.2 500ps = 600 I ps

CPU Time B 600 I ps
=
= 1.2
CPU Time A 500 I ps
A is faster than B, by 20%

Common Ways to Improve Performance
20
Referring to the three components in the iron law, we can either

Shorten clock cycle time (i.e. increase clock rate)
e.g. run the processor at a higher clock frequency
Reduce the CPI
e.g. pipelining
Reduce the number of instructions of a program
e.g. use a compiler that can optimize the code generation better
Remarks
Fine to use CPI when two conditions below are true in the comparison
# of instructions in the program remain unchanged, and
Clock rate remain unchanged
These two conditions are often not true when comparing two ISAs
21
Some performance measures commonly used by

computer manufacturers and sellers are
misleading!
A Misleading Measure MIPS
22
MIPS (Million Instructions Per Second)
As a joke, meaningless indicator of processor speed

For a given program:
MIPS =
instruction count
execution time 10 6
MIPS can vary inversely with performance
Example: (assume clock rate = R)

3 types of instructions; A,B,C; take 1,2,3 cycles respectively
contradicting
Before optimization, instruction count: A=10, B=1, C=1
After optimization, instruction count: A=5, B=1, C=1
Clocks (before) = 15 > Clocks (after) = 10 optimization is better
MIPS (before) = 0.8R > MIPS (after) = 0.7R optimization is worse
A machine cant have a single MIPS rating

Because different benchmarks have different instruction counts
Cannot compare two different ISAs
A program have different instruction counts on different ISAs
Another Misleading Measure - MFLOPS
23
MFLOPS (Million Floating-point Operations Per Second)
For a given program:
MFLOPS =
number of floating point operations

execution time 10 6
MFLOPS considers only floating-point operations

Not applicable to integer applications as MFLOPS = 0
Number of floating-point operations depends on:
Compiler; different compilers give different numbers of FP instr.
No support of FP division in some ISAs
Different FP operations have different execution times
FP multiplication takes longer time than FP add
Different programs have different mixtures of FP operations
Because of the above reasons, MFLOPS doesnt give a fair comparison
24
2. Benchmarks
Evaluating Performance of Two Computers
25
What should we use to test their performance?

Ideally, we should use real applications that we use everyday
Programs typical of expected class of applications
e.g., software development tools, scientific applications, graphics, etc.
In reality, benchmarks
e.g. SPEC (Standard Performance Evaluation Cooperation) suite
Save money and effort
Smaller than real programs, easier to standardize
Not representative of real workload
To improve the quality of evaluation
Run a set of benchmarks
Comparing and Summarizing Performance

1 n
Timei
n i =1
Arithmetic mean (AM):
26
n: number of benchmarks in the workload

Timei: execution time for program i
Example:
Computer
Benchmark
1 sec
10 sec
20 sec
1000 sec
100 sec
20 sec
AM
500.5 sec
55 sec
20 sec

Weighted arithmetic mean:
Weight
i =1
Weight
i =1
27
Timei
=1
Timei: execution time for program i
Example:
Benchmark
Computer
Weight
W1
W2
10
20
0.91
0.999
1000
100
20
0.09
0.001
1001
110
40
AM
500.5
55
20
Weighted AM using W1
90.91
18.1
20
Weighted AM using W2
1.999
10.09
20
Total time (sum)
Use weighted AM when benchmarks are not equally important


Geometric Mean (GM):
28
ratio
i =1
n: number of benchmarks in the workload

Ratioi: ratio for program i
Example:
Benchmark
Speedup (ratios)
Computer
A/B
1 sec
10 sec
0.1
B/A
10
1000 sec
100 sec
10
0.1
AM
5.05
5.05
GM
1.0
1.0
A/B = 5.05 while B/A also = 5.05: Cannot be true!

Should use GM for ratios; only use AM for raw numbers
GM: Independent of reference machine (math property)
Key Concepts to Remember
29
Iron Law: execution time is a consistent summary of performance
Time = (# of instructions in a program) * CPI * (clock cycle time)
Performance is always relative to a specific program

For a given ISA, performance increases come from:
Increases in clock rate

Improvements in processor organization that lower CPI
Compiler enhancements that give lower instruction count
or generate instructions with lower CPI
The best programs to use for benchmarks are real applications
10

Performance 2015

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Performance 2015

Uploaded by

Copyright:

Available Formats

UEEA4653: Computer Architecture

UUEA4653 May 2015

To tell which computer is faster

From computer architects perspective:

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Questions We Usually Ask

What factors of system performance are hardware related?

e.g., do we need a new machine, or a new operating system?

Why is some hardware better than others for different programs?

Answers to these questions hinge on our ability to evaluate

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Time spent running the program only

Response time (execution time, elapsed time or wall-clock time):

Time between the start and completion of a task

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Response Time vs. Throughput

Question: Throughput = 1/ Response time?

Throughput = 2/10 = 0.2 = 1/5

Throughput = 2/8 = 0.25 1/5

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Response Time vs. Throughput (2)

Assume each instruction takes 5 stages to execute

(takes 5+1+1+1 = 8 stages)

(takes 5+1 = 6 stages)

Response Time vs. Throughput (3)

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Execution Time & Performance

For some program running on machine X:

Execution time: smaller the value is better

If the performance of machine X is greater than that of Y, i.e.

performance X > performanc e Y

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Machine X is n times faster than machine Y means:

performance X execution time Y

Machine X is m% faster than Y means:

performance X execution time Y

Machine X runs a program in 10 sec

15 / 10 = 1.5 = 1 + 50/100 X is 50% faster than Y

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Digital computers have a digital clock that runs at a constant rate

time between ticks (in second)

number of ticks per second (in Hz)

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Cycles Per Instruction

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Cycles Per Instruction (2)

total number of CPU clock cycles

Smaller CPI better performance

Copyright Dr. Lai An Chow

Assessing & Understanding Performance

Cycles Per Instruction (3)

Let there be n different instruction classes (with different CPIs)