
Basic Computer Organization

[Block diagram: CPU (ALU, registers, control unit), MMU, memory with cache, a bus, and several I/O devices]

Performance of Processor

Which is more important?

- execution time of a single instruction
- throughput of instruction execution, i.e., the number of instructions executed per unit time

Cycles Per Instruction (CPI)

- Currently: CPI between 3 and 5
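
To make the throughput-vs-latency distinction concrete, here is a small back-of-the-envelope calculation; the instruction count, CPI and clock values are made-up numbers, not from the lecture.

# Execution time = instruction_count * CPI * clock_cycle_time  (standard relation)
instr_count = 1_000_000     # assumed program size
cpi = 4                     # cycles per instruction (within the 3-5 range above)
clock_hz = 100_000_000      # assumed 100 MHz clock

exec_time = instr_count * cpi / clock_hz          # seconds
throughput = clock_hz / cpi                       # instructions per second
print(f"execution time = {exec_time * 1e3:.1f} ms")                 # 40.0 ms
print(f"throughput     = {throughput / 1e6:.1f} M instructions/s")  # 25.0 M
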
Pipelining

- Why keep the Fetch hardware idle while an instruction is being decoded?
- Inspired by petroleum pipelines?

Pipelines

- Used for transporting liquids or gases over long distances (1000s of km)
- Built with periodic pump/compressor stations to keep the fluid flowing

[Figure: a roughly 1000 km pipeline from a refinery to a city]

Inside the Processor


[Datapath diagram: PC, instruction memory, IR, register file, sign extend, NPC, A, Imm, ALU with Zero?/Cond logic and ALUout, data memory and LMD, organized into the stages Instruction Fetch (IF), Instruction Decode (ID), Execution (EX), Memory (MEM) and Write Back (WB)]

Processor Pipelining
[Pipeline diagram: instructions i1 to i4 enter the pipeline on successive clock cycles; each occupies IF, ID, EX, MEM, WB in consecutive cycles]

- Execution time of each instruction is still 5 cycles, but the throughput is now 1 instruction per cycle
- Initial pipeline fill time (4 cycles), after which 1 instruction completes every cycle
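
The fill-then-one-per-cycle behaviour is easy to check with a small simulation. The sketch below is a hypothetical helper, not from the lecture; it prints the stage occupied by each instruction in each cycle of an ideal 5-stage pipeline with no hazards.

# Minimal sketch of an ideal 5-stage pipeline timing chart (no hazards assumed).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def timing_chart(n_instructions):
    """Map each instruction to {cycle: stage} for an ideal pipeline."""
    chart = {}
    for i in range(n_instructions):
        # Instruction i enters IF in cycle i and advances one stage per cycle.
        chart[i] = {i + s: STAGES[s] for s in range(len(STAGES))}
    return chart

n = 4
chart = timing_chart(n)
total_cycles = (n - 1) + len(STAGES)       # 4-cycle fill, then 1 completion per cycle
for i in range(n):
    row = [chart[i].get(c, "") for c in range(total_cycles)]
    print(f"i{i + 1}: " + " ".join(f"{s:>4}" for s in row))
print(f"total cycles for {n} instructions = {total_cycles}")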

MIPS 1 Instructions: 3, 4 or 5 cycles


[Pipeline diagram: LW R1, 0(R2) followed by ADD R3, R1, R2, each shown going through the IF, ID, EX, MEM, WB stages on successive cycles]

MIPS 1 Instructions: 3, 4 or 5 cycles

[Pipeline diagram: LW R1, 0(R2) followed by JR R6, each shown going through the IF, ID, EX, MEM, WB stages on successive cycles]

Pipelined Processor Datapath

[Datapath diagram: PC, instruction memory, register file, sign extend, ALU with Zero? logic and a PC + offset adder, data memory, divided into the IF, ID, EX, MEM and WB stages]

Some Terminology
- Pipeline stages: IF, ID, EX, MEM, WB
- We describe this as a 5-stage pipeline, or a pipeline of depth 5
- Assume that the time delay through each stage is the same (say 1 clock cycle)

Pipeline Speedup = time_non-pipelined / time_pipelined

Pipeline Speedup

For a 5-stage pipeline taking 1 cycle per stage:

- Let us compute the speedup over a non-pipelined processor that takes 5 cycles for every instruction
- Calculate how much time each of these processors takes to run a program involving the execution of n instructions
- Non-pipelined processor: 5n cycles
- Pipelined processor: 4 + n cycles

Speedup = 5n / (4 + n), which tends to 5 as n grows large
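
A quick way to see the limit is to evaluate the formula for a few program sizes; the numbers below assume the ideal case with no hazard stalls.

# Speedup of a 5-stage pipeline over a 5-cycle non-pipelined processor (ideal, no stalls).
def speedup(n, stages=5):
    non_pipelined = stages * n          # 5n cycles
    pipelined = (stages - 1) + n        # 4 + n cycles (fill time + 1 per instruction)
    return non_pipelined / pipelined

for n in (10, 100, 10_000):
    print(f"n = {n:>6}: speedup = {speedup(n):.2f}")
# n =     10: speedup = 3.57
# n =    100: speedup = 4.81
# n =  10000: speedup = 5.00  (approaches 5 as n grows)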

Pipeline Speedup

- A pipeline with p stages could give a speedup of p (compared to a non-pipelined processor that takes p cycles for each instruction)
- i.e., a program would run p times faster on the pipelined processor than on the non-pipelined processor
- provided that on every clock cycle, an instruction completes execution

Problem: Pipeline Hazards


- A situation where an instruction cannot proceed through the pipeline as it should
- Hazard: a dangerous (hazardous) situation, from the perspective of correct program execution

Problem: Pipeline Hazards

A situation where an instruction cannot proceed through the pipeline as it should

1. Structural hazard: when 2 or more instructions in the pipeline need to use the same resource at the same time
2. Data hazard: when an instruction depends on the data result of a prior instruction that is still in the pipeline
3. Control hazard: a hazard that arises due to control transfer instructions

Structural Hazard
[Pipeline diagram: instruction i is LW R3, 8(R2); in the cycle where its MEM stage accesses memory, the IF stage of instruction i+3 also needs memory, so MEM and IF use memory at the same time]

Petroleum pipeline analogy?

[Figure: refinery-to-city pipeline carrying Diesel followed by Kerosene]

Petroleum pipeline analogy?

[Figure: the same pipeline with Air separating the Diesel and Kerosene, i.e., a bubble in the pipeline]

Structural Hazard
[Pipeline diagram: instruction i is LW R3, 8(R2); since its MEM stage and the IF of instruction i+3 would use memory at the same time, a bubble (B) is inserted and the IF of instruction i+3 is delayed by one cycle]

Solving Structural Hazards

- This hazard can be overcome by designing main memory so that it can handle 2 memory requests at the same time
  - Double-ported memory
- Or, since we are assuming that memory delays are hidden by cache memories,
  - include a separate instruction cache (for use by the IF pipeline stage) and data cache (for use by the MEM stage)
- Identify the possible structural hazards and design so as to eliminate them

Problem: Pipeline Hazards

A situation where an instruction cannot proceed through the pipeline as it should

1. Structural hazard: when 2 or more instructions in the pipeline need to use the same resource at the same time
2. Data hazard: when an instruction depends on the data result of a prior instruction that is still in the pipeline
3. Control hazard: a hazard that arises due to control transfer instructions

Data Hazard
[Pipeline diagram: instruction i is add R3, R1, R2 and instruction i+1 is sub R4, R3, R8; R3 is updated by instruction i but read by instruction i+1, so instruction i+1 is held back with bubbles (B) in the pipeline]

- Idea: delay (or stall) the progress of instruction i+1 through the pipeline until the data is available in register R3

Solving Data Hazards


1. Interlock: hardware that is included in the processor to detect such a data dependency and stall the dependent instruction

   time:  0    1    2      3      4    5    6
   Add:   IF   ID   EX     MEM    WB
   Sub:        IF   stall  stall  ID   EX   MEM
   Or:              stall  stall  IF   ID   EX

Solving Data Hazards


1. Interlocks & stalling dependent instructions

   The result is available at the output of the ALU now (in the special-purpose register ALUout)

[Pipeline diagram: add R3, R1, R2 followed by sub R5, R3, R4 and or R7, R3, R6, showing their IF, ID, EX, MEM, WB stages]

Solving Data Hazards


1. Interlocks & stalling dependent instructions
2. Forwarding or Bypassing: forward the result to EX as soon as it is available anywhere in the pipeline

[Pipeline diagram: add R3, R1, R2 followed by sub R5, R3, R4 and or R7, R3, R6; the add's ALU result is forwarded directly to the EX stages of the dependent instructions, so no stalls are needed]
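
As a rough illustration of what the forwarding logic checks, the sketch below uses a hypothetical toy instruction encoding (not the lecture's hardware): it flags a read-after-write dependence between two instructions and reports whether a bypass into EX can cover it.

# Sketch of the check a forwarding (bypass) unit performs, using a toy
# instruction record: (opcode, dest_reg, src_regs). Hypothetical encoding.
from collections import namedtuple

Inst = namedtuple("Inst", ["op", "dest", "srcs"])

def needs_forwarding(producer, consumer):
    """True if 'consumer' reads a register that 'producer' writes (RAW dependence)."""
    return producer.dest is not None and producer.dest in consumer.srcs

def can_bypass_to_ex(producer, consumer):
    """ALU results can be bypassed to the consumer's EX stage with no stall.
    A load's value only exists after MEM, so a back-to-back load-use still needs a stall."""
    if not needs_forwarding(producer, consumer):
        return True                      # no dependence, nothing to forward
    return producer.op != "LW"           # ALU result: bypass works with no stall

add = Inst("ADD", "R3", ("R1", "R2"))
sub = Inst("SUB", "R5", ("R3", "R4"))
lw  = Inst("LW",  "R3", ("R2",))

print(needs_forwarding(add, sub))   # True: sub reads R3 written by add
print(can_bypass_to_ex(add, sub))   # True: ALU result can be forwarded to EX
print(can_bypass_to_ex(lw, sub))    # False: load-use needs one stall even with forwarding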

Modified Processor Datapath

[Datapath diagram: the EX and MEM stages (NPC, A, B, Imm, ALU, data memory) with bypass paths added so that results can be fed back to the ALU inputs]

But Forwarding is Not Always Possible


[Pipeline diagram: LW R3, -4(R1) followed by SUB R5, R3, R4 and OR R7, R3, R6; the loaded value only exists after the LW's MEM stage, which is too late for the SUB's EX stage in the very next cycle]

Solving Data Hazards


1. Interlocks & stalling dependent instructions
2. Forwarding or Bypassing
3. Load delay slot: build the hardware to assume that an instruction that uses a load value is separated from the load instruction

Recall: Notes from the ISA Manual

- For load instructions: the loaded value might not be available in the destination register for use by the instruction immediately following the load (LOAD DELAY SLOT)
- For control transfer instructions: the transfer of control takes place only following the instruction immediately after the control transfer instruction (BRANCH DELAY SLOT)

Solving Data Hazards


1. Interlocks & stalling dependent instructions
2. Forwarding or Bypassing
3. Load delay slot
4. Instruction Scheduling: reorder the instructions of the program so that dependent instructions are far enough apart

This could be done either

- by the compiler, before the program runs: Static Instruction Scheduling
- by the hardware, when the program is running: Dynamic Instruction Scheduling

Static Instruction Scheduling

- Reorder the instructions of the program to eliminate data hazards
  - or, in general, to reduce the execution time of the program
- Reordering must be safe

  ADD R1, R2, R3    /* R1 = R2 + R3 */
  SUB R2, R4, R5    /* R2 = R4 - R5 */

Static Instruction Scheduling

- Reorder the instructions of the program to eliminate data hazards
  - or, in general, to reduce the execution time of the program
- Reordering must be safe
  - it should not change the meaning of the program
- Two instructions can be exchanged if they are independent of each other

Example: Static Instruction Scheduling


Program fragment:

  LW   R3, 0(R1)
       (1 stall)
  ADDI R5, R3, 1
  ADD  R2, R2, R3
  LW   R13, 0(R11)
       (1 stall)
  ADD  R12, R13, R3
                          2 stalls

Scheduling:

  LW   R3, 0(R1)
  LW   R13, 0(R11)
  ADDI R5, R3, 1
  ADD  R2, R2, R3
  ADD  R12, R13, R3
                          0 stalls
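
A compiler's scheduler essentially counts and minimizes these load-use stalls. The sketch below uses toy instruction tuples (hypothetical, assuming a one-cycle load delay and forwarding for ALU results) to count the stalls in the original and scheduled sequences.

# Count load-use stalls in a straight-line instruction sequence.
# Assumption: one stall whenever an instruction reads the destination of an
# immediately preceding LW (ALU-to-ALU dependences are covered by forwarding).
def count_load_use_stalls(program):
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest, srcs = prev
        if op == "LW" and dest in curr[2]:
            stalls += 1
    return stalls

original = [
    ("LW",   "R3",  ("R1",)),
    ("ADDI", "R5",  ("R3",)),
    ("ADD",  "R2",  ("R2", "R3")),
    ("LW",   "R13", ("R11",)),
    ("ADD",  "R12", ("R13", "R3")),
]
scheduled = [original[0], original[3], original[1], original[2], original[4]]

print(count_load_use_stalls(original))   # 2
print(count_load_use_stalls(scheduled))  # 0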


Kinds of Data Dependence

True dependence

  ADD R1, R2, R3
  SUB R4, R1, R5

Anti-dependence

  ADD R1, R2, R3
  SUB R2, R4, R5

Output dependence

  ADD R1, R2, R3
  SUB R1, R4, R5
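
These three cases can be told apart mechanically by comparing which registers each instruction reads and writes. The sketch below uses toy (dest, srcs) instruction tuples, a hypothetical representation, to classify the dependence between two instructions.

# Classify the dependence between two instructions, each given as (dest, srcs).
def classify(first, second):
    deps = []
    if first[0] in second[1]:
        deps.append("true (read after write)")     # second reads what first writes
    if second[0] in first[1]:
        deps.append("anti (write after read)")     # second writes what first reads
    if first[0] == second[0]:
        deps.append("output (write after write)")  # both write the same register
    return deps or ["independent"]

print(classify(("R1", ("R2", "R3")), ("R4", ("R1", "R5"))))  # true dependence
print(classify(("R1", ("R2", "R3")), ("R2", ("R4", "R5"))))  # anti-dependence
print(classify(("R1", ("R2", "R3")), ("R1", ("R4", "R5"))))  # output dependence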

Dynamic Instruction Scheduling


[Diagram: the simple in-order pipeline IF, ID, EX, MEM, WB versus a pipeline with dynamic instruction scheduling: IF and ID feed an Instruction Window (Instruction Queue), from which instructions are issued to several functional units (floating point adder, floating point multiplier, integer ALU, integer multiplier, memory unit) before WB]

Dynamic Instruction Scheduling

- The hardware dynamically schedules instructions from the Instruction Window for execution on the functional units
- The instructions could execute in an order that is different from that specified by the program
  - with the same result
- Such processors are called out-of-order processors
  - as opposed to in-order processors
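
A very rough model of "scheduling from the instruction window": each cycle, pick an instruction whose source registers have already been produced. The sketch below is hypothetical and ignores functional-unit types and register renaming; it simply shows an independent instruction overtaking a stalled one.

# Toy out-of-order issue from an instruction window.
# Each instruction: (name, dest, srcs, result latency in cycles).
window = [
    ("LW  R3, 0(R1)",  "R3", ("R1",), 2),       # load result ready 2 cycles after issue
    ("ADD R2, R2, R3", "R2", ("R2", "R3"), 1),  # depends on the LW
    ("MUL R6, R7, R8", "R6", ("R7", "R8"), 1),  # independent, may overtake the ADD
]
ready = {"R1": 0, "R2": 0, "R7": 0, "R8": 0}    # register -> cycle its value is ready

cycle = 0
while window:
    cycle += 1
    issued = None
    for inst in window:                          # oldest-first among ready instructions
        name, dest, srcs, lat = inst
        if all(r in ready and ready[r] <= cycle for r in srcs):
            print(f"cycle {cycle}: issue {name}")
            ready[dest] = cycle + lat            # result available 'lat' cycles later
            issued = inst
            break
    if issued:
        window.remove(issued)
    else:
        print(f"cycle {cycle}: no instruction ready")
# The MUL issues before the ADD: execution order differs from program order.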

Problem: Pipeline Hazards

A situation where an instruction cannot proceed through the pipeline as it should

1. Structural hazard: when 2 or more instructions in the pipeline need to use the same resource at the same time
2. Data hazard: when an instruction depends on the data result of a prior instruction that is still in the pipeline
3. Control hazard: a hazard that arises due to control transfer instructions

Recall: Execution of Branch Instruction

[Datapath diagram: PC, instruction memory, register file, sign extend, ALU with Zero? logic and a PC + offset adder, data memory, with the IF, ID, EX, MEM, WB stages marked; the parts used to execute a branch are the PC + offset adder and the Zero? condition check]

Control Hazards
[Pipeline diagram: BEQZ R3, out goes through IF, ID, EX, MEM, WB; its condition and target are resolved only at the end of EX. For the next two cycles the question "fetch instruction i+1 or fetch from the target?" cannot be answered, so bubbles (B) are inserted; once the branch is resolved, the appropriate instruction is fetched]

Control Hazards

- Observation: since the branch is resolved only in the EX stage, there must be 2 stall cycles after every conditional branch instruction

Reducing Impact of Branch Stall


- The execution of a conditional branch instruction involves 2 activities:
  1. evaluating the branch condition (determine whether it is to be taken or not-taken)
  2. computing the branch target address
- To reduce the branch stall effect we could
  - evaluate the condition earlier (in the ID stage)
  - compute the target address earlier (in the ID stage)
- The number of stall cycles would then be reduced to 1 cycle

Control Hazard Solutions


1. Static Branch Prediction

- Prediction? Reasoning about the future; guessing what is going to happen
- Static: the behaviour of a branch instruction is predicted once, before the program starts executing

Prediction and Correctness

- Prediction: guessing what is going to happen
- What if the guess is incorrect?
  - The pipelined processor hardware must be built to detect the misprediction and take appropriate corrective action

Control Hazard Solutions


1. Static Branch Prediction

Example: Static Not-Taken policy

- The hardware is built to fetch next from PC + 4
- After the ID stage, if it is found that the branch condition is false (i.e., not taken), continue with the fetched instruction (from PC + 4)
- Else, squash the fetched instruction and re-fetch from the branch target address
  - squash: cancel, annul the processing of that instruction

Static Not-Taken Branch Prediction


[Pipeline diagram: BEQZ R3, out is fetched; instructions i+1, i+2, ... are fetched from PC + 4 on the following cycles. The condition evaluates to FALSE (not taken), so the fetched instructions simply continue: NO BRANCH STALL CYCLES]

Static Not-Taken Branch Prediction


[Pipeline diagram: BEQZ R3, out is fetched and instruction i+1 is fetched from PC + 4. The condition evaluates to TRUE (taken), so instruction i+1 is SQUASHED and the instruction at the branch target address is fetched instead: ONE BRANCH STALL CYCLE]

Control Hazard Solutions


1. Static Branch Prediction

Example: Static Not-Taken policy

- The hardware is built to fetch next from PC + 4
- After the ID stage, if it is found that the branch condition is false (i.e., not taken), continue with the fetched instruction (from PC + 4): 0 stall cycles
- Else, squash the fetched instruction and re-fetch from the branch target address: 1 stall cycle
- Thus, the average branch penalty is < 1 cycle
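
The "average penalty < 1" claim is just a weighted average over taken and not-taken outcomes. A quick check, assuming some made-up fractions of taken branches (not figures from the lecture):

# Average branch penalty under the static not-taken policy:
# 0 stall cycles when not taken, 1 stall cycle when taken.
def average_penalty(taken_fraction):
    return taken_fraction * 1 + (1 - taken_fraction) * 0

for f in (0.3, 0.6, 0.9):
    print(f"taken fraction {f:.1f}: average penalty = {average_penalty(f):.1f} cycles")
# Always below 1 cycle, approaching 1 only if nearly every branch is taken.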

Control Hazard Solutions


1. Static Branch Prediction
2. Delayed Branching: design the hardware so that the control transfer takes place only after a few of the following instructions

   BEQ R1, R2, target
   ADD R3, R2, R3

Recall: Interesting ISA Notes

- For load instructions: the loaded value might not be available in the destination register for use by the instruction immediately following the load (LOAD DELAY SLOT)
- For control transfer instructions: the transfer of control takes place only following the instruction immediately after the control transfer instruction (BRANCH DELAY SLOT)

Control Hazard Solutions


1. Static Branch Prediction
2. Delayed Branching: design the hardware so that the control transfer takes place only after a few of the following instructions

   BEQ R1, R2, target
   ADD R3, R2, R3

- Delay slots: the following instructions that are executed whether or not the branch is taken
- Stall cycles are avoided if the delay slots are filled with useful instructions

Delayed Branching: Filling Delay Slots

- Instructions that do not affect the branching condition can be put in the delay slot (by the compiler)
- Where to get instructions to fill delay slots?
  - From the branch target address: only valuable when the branch is taken
  - From the fall-through (branch not-taken path): only valuable when the branch is not taken
  - From before the branch: useful whether the branch is taken or not

Delayed Branching: Compiler's Role

- When filled from the branch target or fall-through, patch-up code may be needed

  BEQZ R1, target
  (branch delay slot)
fall through:

target: ADDI R7, R7, 1
        LW   R8, -8(R29)

Delayed Branching: Compiler's Role

- When filled from the branch target or fall-through, patch-up code may be needed

  BEQZ R1, target
  ADDI R7, R7, 1          / branch delay slot, filled from the target
fall through:
  SUBI R7, R7, 1          / patch-up code

target: LW R8, -8(R29)

Delayed Branching: Compiler's Role

- When filled from the branch target or fall-through, patch-up code may be needed
  - It may still be beneficial, depending on branching frequency
- The more delay slots there are, the harder it is to fill them usefully
- If no instruction can be found:
  - The compiler must insert an instruction that does nothing other than occupy the delay slot and be fetched and decoded
  - Example: ADD R0, R0, R0
  - If an instruction that does nothing is included in the instruction set, it is called a No-Operation instruction, or NOP for short; NOP might be included in the assembly language
  - It has practically the same effect as a STALL cycle
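
To summarise how a compiler might fill a single branch delay slot, here is a toy sketch with hypothetical instruction records. It implements only the "from before the branch" strategy, checks only that the moved instruction does not feed the branch condition, and falls back to a NOP.

# Toy delay-slot filler: move the last suitable instruction before the branch
# into the delay slot; otherwise insert a NOP (ADD R0, R0, R0).
def fill_delay_slot(block, branch):
    """block: list of (text, dest, srcs) before the branch; branch: (text, srcs)."""
    nop = ("ADD R0, R0, R0", "R0", ())
    for i in range(len(block) - 1, -1, -1):
        text, dest, srcs = block[i]
        if dest not in branch[1]:                 # does not feed the branch condition
            return block[:i] + block[i + 1:], [branch, block[i]]
    return block, [branch, nop]                   # nothing suitable: use a NOP

code = [
    ("LW  R4, 0(R2)",  "R4", ("R2",)),
    ("ADD R1, R1, R3", "R1", ("R1", "R3")),       # R1 feeds the branch, cannot move
]
branch = ("BEQZ R1, target", ("R1",))

before, branch_and_slot = fill_delay_slot(code, branch)
for text, *_ in before + branch_and_slot:
    print(text)
# Output:
# ADD R1, R1, R3
# BEQZ R1, target
# LW  R4, 0(R2)        <- moved into the delay slot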
