
Appendix A

Pipelining: Basic and Intermediate Concepts

Overview
Introduction
Pipeline concepts
Basics of RISC instruction set
Classic 5-stage pipeline

Pipeline Hazards
Stalls, structural hazards, data hazards
Branch hazards

Pipeline Implementation
Simple MIPS pipeline

Implementation Difficulties for Pipelines
Exceptions, instruction set complications

Extending MIPS Pipeline to Multicycle operations
Example: MIPS R4000 Pipeline

Pipelining
Similar to an assembly line
Widget Definition
Add partA, then B, then C, then D

[Figure: assembly line for the widget definition, one station per step: Add partA, Add partB, Add partC, Add partD]

CPU Pipelining
Unpipelined: multiple clock cycles, or one LONG clock cycle
Fetch, then decode, then execute, then access memory if needed, then write results if needed
[Figure: Instruction 2 and Instruction 1 passing one at a time through a single block that performs the whole sequence]

Pipelined: Fetch, Decode, Execute, Memory, Write Results, 1 cycle each
[Figure: Instructions 1 through 6 flowing through the five stages]

Performance and Pipelining


For the following assumptions:
N stages in the pipeline
Unpipelined execution time for 1 instruction is T
Pipeline stages are equal and perfectly balanced

Then
Execution time per instruction for pipelined version = T / N (one instruction completes every T / N)
Throughput increase is N
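The ideal-case claim above can be checked numerically. The following is a minimal sketch, not taken from the slides: it assumes a perfectly balanced pipeline with clock period T / N and adds the standard pipeline fill time (N - 1 extra cycles for the first instruction), which the slide's steady-state figure leaves out. The function names are illustrative.

```python
def unpipelined_time(num_instructions: int, t: float) -> float:
    """Total time when each instruction finishes before the next one starts."""
    return num_instructions * t


def pipelined_time(num_instructions: int, t: float, n_stages: int) -> float:
    """Total time on an ideal N-stage pipeline with clock period T / N.

    The first instruction needs N cycles to pass through the pipeline;
    after that, one instruction completes every cycle.
    """
    cycle = t / n_stages
    return (n_stages + num_instructions - 1) * cycle


def throughput_speedup(num_instructions: int, t: float, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined total execution time."""
    return unpipelined_time(num_instructions, t) / pipelined_time(
        num_instructions, t, n_stages
    )


if __name__ == "__main__":
    for count in (1, 10, 1000):
        print(count, throughput_speedup(count, t=5.0, n_stages=5))
```

For a single instruction there is no gain; as the instruction count grows, the speedup approaches N.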

Advantages of Pipelining
Significant speedup without much additional hardware.
Invisible to the programmer

RISC (MIPS) Pipeline


All ALU operations operate on registers
Only load and store affect memory
Load and store of 8-, 16-, and 32-bit items available
Few instruction formats
All instructions the same size

Non-pipelined Implementation
Multi-cycle implementation
Simplified to better understand transition to pipelined version
Not the most efficient implementation

Datapath

Control

Simplified Datapath
[Figure: simplified datapath with Program Counter, Instruction Register, Branch Target, Memory (instructions and data), Register File, ALU]
Multiplexors not shown
Control signals not shown
Sign extend and shift modules not shown

Simplified Control State Diagram


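[Figure: control state diagram; the states are listed here as paths]
IFetch, IDecode, then by instruction type:
lw/sw: AddrCal, then LWmem and LWwrite (load) or SWmem (store)
Rtype: Rexec, Rfinish
Immed: ImmExec, ImmFinish
branch: Brcomplete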

Multi-cycle Implementation
At most 5 cycles to implement an instruction
Branch: 3 cycles
Load: 5 cycles
Others: 4 cycles

Assume the following instruction frequencies:
Branch 12%
Load 10%
Others 78%

CPI = (0.12 × 3) + (0.10 × 5) + (0.78 × 4) = 3.98 cycles/instruction
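The CPI figure above is a frequency-weighted average of cycle counts. A small sketch of that calculation follows, using the frequencies and cycle counts from the slide; the dictionary layout and the function name `average_cpi` are illustrative choices, not part of the original material.

```python
# Instruction class -> (frequency, cycles), as given on the slide.
MIX = {
    "branch": (0.12, 3),
    "load": (0.10, 5),
    "other": (0.78, 4),
}


def average_cpi(mix):
    """Weighted average cycles per instruction for a multi-cycle machine."""
    total_frequency = sum(freq for freq, _ in mix.values())
    if abs(total_frequency - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix.values())


if __name__ == "__main__":
    print(f"CPI = {average_cpi(MIX):.2f}")
```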

Pipelined Version
Each of the 5 clock cycles becomes a pipe stage
IF, ID, EX, MEM, WB

Use separate data and instruction memories


Implemented with two caches
Eliminates conflicts between instruction fetch and memory access

Stages
IF: use PC to address current instruction from memory; update PC
ID: decode instruction and read registers from register file; do equality test on register; sign-extend offset field; compute possible branch target
EX: ALU operates on operands (memory address calculation, register-register operation, or register-immediate operation)
MEM: if a load, read memory; if a store, write memory
WB: for register-register or load, write register result back to register file

Simplified Pipelined Datapath


Instruction Memory Data Memory

Pipeline Registers

Register file (just one)
Read register after fetch
Write register after data memory access

Pipeline Execution

Pipeline Registers
Stages: IF, ID, EX, Mem, WB
Pipeline register names: IF/ID, ID/EX, EX/Mem, Mem/WB

Some Issues
Register file used in two stages:
two register reads (two operands) and one register write during a single clock cycle

PC needed in IF stage and must be updated on every clock cycle
Adder needed in ID to compute branch target in cases of branch/jump instructions
Branch does not change PC until ID stage; the next instruction has already been fetched at that point

Instruction Timing
Throughput is increased approximately by 5
Execution time of an individual instruction INCREASES due to pipelining overhead
Pipeline register delay
Clock skew
T = TCL + Tsu + Treg + Tskew

Important to balance pipeline stages, since clock is matched to slowest stage (TCL)

Example
Unpipelined: 1 GHz clock (T = 1 ns)
ALU        4 cycles   40%
Branches   4 cycles   20%
Memory     5 cycles   40%
If pipelined, increase T by: Tskew + Tsu + Treg = 0.2 ns
How much speedup from a 5-stage pipeline?

Unpipelined execution time: E.Time_u = T × CPI
CPI = (0.4 × 4) + (0.2 × 4) + (0.4 × 5) = 4.4
E.Time_u = 1 ns × 4.4 = 4.4 ns

Pipelined execution time: E.Time_p = T, with T = 1.2 ns

$$\text{Speedup} = \frac{\text{E.Time}_u}{\text{E.Time}_p} = \frac{4.4\ \text{ns}}{1.2\ \text{ns}} \approx 3.7$$
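The worked example can be reproduced in a few lines. This is a sketch under the example's own assumptions (one instruction completes per pipelined clock, and the clock period grows by the stated overhead); the names `weighted_cpi` and `pipeline_speedup` are illustrative.

```python
def weighted_cpi(mix):
    """mix: iterable of (frequency, cycles) pairs for each instruction class."""
    return sum(freq * cycles for freq, cycles in mix)


def pipeline_speedup(t_unpipelined_ns, mix, overhead_ns, pipelined_cpi=1.0):
    """Average time per instruction, unpipelined versus pipelined."""
    time_unpipelined = t_unpipelined_ns * weighted_cpi(mix)
    time_pipelined = (t_unpipelined_ns + overhead_ns) * pipelined_cpi
    return time_unpipelined / time_pipelined


# ALU 40% at 4 cycles, branches 20% at 4 cycles, memory 40% at 5 cycles.
EXAMPLE_MIX = [(0.4, 4), (0.2, 4), (0.4, 5)]
speedup = pipeline_speedup(1.0, EXAMPLE_MIX, overhead_ns=0.2)

if __name__ == "__main__":
    print(f"speedup = {speedup:.2f}")
```

Passing a `pipelined_cpi` above 1.0 is one way to fold stall cycles into the same calculation.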

Pipeline Hazards
Structural hazards: resource conflicts when more than one instruction needs a resource
Data hazards: an instruction depends on a result from a previous instruction that is not yet available
Control hazards: conflicts from branches and jumps that change the PC

Pipeline Stall

stall

One solution to some hazards

Performance with Stalls


$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{stall cycles per instruction} = 1 + \text{stall cycles per instruction}$$

$$\text{Pipelining speedup (ideal)} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{stall cycles per instruction}}$$

Simple case: CPI unpipelined = # of pipeline stages

$$\text{Pipelining speedup (ideal)} = \frac{\text{\# of pipeline stages}}{1 + \text{stall cycles per instruction}}$$

For no stalls, pipeline speedup = # of stages

$$\text{Pipelining speedup (actual)} = \frac{T_{\text{unpipelined}}}{T_{\text{pipelined}}} \times \frac{\text{\# of pipeline stages}}{1 + \text{stall cycles per instruction}}$$
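The speedup formulas above translate directly into code. The sketch below assumes the simple case from the slide (unpipelined CPI equals the number of stages) and treats T as the clock period of each design; the function names and the sample numbers in the usage lines are hypothetical.

```python
def pipelined_cpi(stalls_per_instruction: float) -> float:
    """Ideal CPI of 1 plus the average stall cycles per instruction."""
    return 1.0 + stalls_per_instruction


def ideal_speedup(stages: int, stalls_per_instruction: float) -> float:
    """Speedup when both designs run at the same clock period."""
    return stages / pipelined_cpi(stalls_per_instruction)


def actual_speedup(stages, stalls_per_instruction, t_unpipelined, t_pipelined):
    """Ideal speedup scaled by the ratio of the two clock periods."""
    return (t_unpipelined / t_pipelined) * ideal_speedup(
        stages, stalls_per_instruction
    )


if __name__ == "__main__":
    print(ideal_speedup(5, 0.0))                 # no stalls: equals the stage count
    print(ideal_speedup(5, 0.25))                # stalls reduce the gain
    print(actual_speedup(5, 0.25, 1.0, 1.25))    # slower pipelined clock as well
```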

Structural Hazards (Resource Conflicts)

If Data Memory and Instruction Memory are implemented with a single memory, then this can cause a structural hazard

Data Hazards (Instruction Dependencies)

DADD R1, R2, R3

Hazards
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9

No hazards
XOR  R10, R1, R11

Forwarding
Solution for hazards
Also called bypassing or short-circuiting
Create potential datapath from where result is calculated to where it is needed by another instruction
Detect hazard to route the result

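Example:
DADD R1, R2, R3
DSUB R4, R1, R5    (forwarding path from the DADD result to the DSUB)

[Figure: pipeline diagram for the sequence below, with the register file read and register file write points marked]
DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9
XOR  R10, R1, R11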

More Forwarding

Remaining Stalls
Some data hazards cannot be resolved by forwarding:
LD   R1, 0(R2)
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9

Hazard detection causes stall until hazard is cleared (a hardware interlock).

Stall to Solve Data Hazard


Clock cycle       1     2     3     4     5     6     7     8     9
LD   R1, 0(R2)    IF    ID    EX    MEM   WB
DSUB R4, R1, R5         IF    ID    stall EX    MEM   WB
AND  R6, R1, R7               IF    stall ID    EX    MEM   WB
OR   R8, R1, R9                     stall IF    ID    EX    MEM   WB
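The stall pattern for the load sequence can be derived with a very small timing model. This is a sketch only: it assumes full forwarding, so the one remaining data stall is a load followed immediately by an instruction that reads the loaded register. The `Instr` record and the function names are illustrative, and the program list must be non-empty.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str                  # mnemonic, e.g. "LD" or "DSUB"
    dest: Optional[str]      # destination register, if any
    srcs: Tuple[str, ...]    # source registers


def ex_cycles(program: List[Instr]) -> List[int]:
    """Clock cycle in which each instruction occupies EX (first IF is cycle 1)."""
    cycles: List[int] = []
    for i, ins in enumerate(program):
        if i == 0:
            cycles.append(3)         # IF in cycle 1, ID in cycle 2, EX in cycle 3
            continue
        prev = program[i - 1]
        ex = cycles[-1] + 1          # in order: at best one EX per cycle
        if prev.op == "LD" and prev.dest in ins.srcs:
            # Load data is available only after the load's MEM stage, so the
            # dependent instruction waits one extra cycle (the load interlock).
            ex = cycles[-1] + 2
        cycles.append(ex)
    return cycles


def stall_cycles(program: List[Instr]) -> int:
    """Total bubbles: last EX cycle minus the last EX cycle of a stall-free run."""
    return ex_cycles(program)[-1] - (len(program) + 2)


PROGRAM = [
    Instr("LD", "R1", ("R2",)),
    Instr("DSUB", "R4", ("R1", "R5")),
    Instr("AND", "R6", ("R1", "R7")),
    Instr("OR", "R8", ("R1", "R9")),
]

if __name__ == "__main__":
    print(ex_cycles(PROGRAM), stall_cycles(PROGRAM))
```

The EX cycles it reports for the four instructions are 3, 5, 6, and 7, matching a single bubble after the load.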

Branch Hazards
BEQZ R1, Name
Instr. 1           (next if branch not taken)
Instr. 2
Instr. 3
Name: Instr. 4     (next if branch taken)

Pipeline hazard solution using a fetch redo after branch:

Clock cycle             1    2    3    4    5    6    7
BEQZ R1, Name           IF   ID   EX   MEM  WB
Instr. 1                     IF
Instr. 1 or Instr. 4              IF   ID   EX   MEM  WB

Branch redo penalties


If branch is not taken, second fetch is redundant
Always stalling after branch results in 10% to 30% performance loss

Reducing Branch Penalty


Compile time solutions
Decide on a hardware action
Compiler tries to use this knowledge

1. Pipeline freeze (or flush)
2. Predicted-not-taken
3. Predicted-taken
4. Delayed branch

Pipeline Freeze
Hold or delete all instructions after a branch until the target address is known.
Simple to implement
Results in 1 cycle stall for MIPS
Longer stalls for other pipeline architectures

Predicted-not-taken
Execute successor instructions in sequence
Squash instructions in pipeline if branch actually taken
Must be careful not to alter state of registers until actual branch target is known
Only slightly more complicated than pipeline freeze to implement
Compiler can modify loops to favor branches not taken

Predicted-taken
Treat every branch as taken
As soon as branch is decoded and target address is computed, begin fetching at the target
No advantage for MIPS because target address is not known any earlier than branch outcome
Only makes sense for machines that compute target address before determining branch outcome

Delayed Branch
Execute instruction after branch no matter what
Fetch subsequent instruction depending on branch outcome
Execution order:
branch
sequential successor instruction
branch target if taken

Compiler's job is to put a useful instruction as the sequential successor instruction
Otherwise a NOP is used

Performance with Branches

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

For the schemes just mentioned, penalty is at most 1 cycle.
Penalty is more for deeper pipelines
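A quick numeric sketch of the branch speedup formula follows. The 20% branch frequency is borrowed from the earlier instruction-mix example; the 3-cycle penalty is an arbitrary illustrative value for a deeper pipeline, and the function name is not from the slides.

```python
def branch_pipeline_speedup(depth: int, branch_frequency: float,
                            branch_penalty: float) -> float:
    """Pipeline speedup when branches are the only source of stalls."""
    return depth / (1.0 + branch_frequency * branch_penalty)


if __name__ == "__main__":
    # 5-stage pipeline, 20% branches, 1-cycle penalty.
    print(branch_pipeline_speedup(5, 0.20, 1))
    # Same pipeline depth with a larger penalty.
    print(branch_pipeline_speedup(5, 0.20, 3))
```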

Pipeline Implementation
Details of pipeline implementation
So that other issues can be explored

Look at non-pipelined implementation first


Focus on integer subset of MIPS
Load-store word
Branch equal zero
Integer ALU operations

Basic principles can be extended to all instructions

Multi-cycle Implementation
1. Instruction fetch cycle (IF)
IR ← Mem[PC]
NPC ← PC + 4

2. Instruction decode/register fetch cycle (ID)
Decode opcode
A ← Regs[rs]
B ← Regs[rt]
Imm ← sign-extended immediate field of IR

3. Execution/effective address cycle (EX)
ALUOutput ← A + Imm (memory reference), OR
ALUOutput ← A func B (register-register ALU op), OR
ALUOutput ← A op Imm (register-immediate ALU op), OR
ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0) (branch)

Multi-cycle Implementation
4. Memory access/branch completion cycle (MEM)
LMD ← Mem[ALUOutput] (load), OR
Mem[ALUOutput] ← B (store)

5. Write-back cycle (WB)
Regs[rd] ← ALUOutput (register-register ALU), OR
Regs[rt] ← ALUOutput (register-immediate), OR
Regs[rt] ← LMD (load)
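The five steps can be sketched as a tiny interpreter that uses the same temporaries (IR, NPC, A, B, Imm, ALUOutput, LMD). Several things here are assumptions for illustration: instructions are plain dictionaries rather than encoded bit fields, registers and memory are Python dicts, and the branch completion step (PC takes ALUOutput when Cond is true) follows the usual textbook treatment because the list above does not spell it out.

```python
def execute(pc, imem, regs, dmem):
    """Run one instruction through IF, ID, EX, MEM, WB and return the next PC."""
    # 1. IF: fetch the instruction and compute the sequential PC.
    ir = imem[pc]
    npc = pc + 4

    # 2. ID: read both register operands and the immediate.
    a = regs[ir.get("rs", "R0")]
    b = regs[ir.get("rt", "R0")]
    imm = ir.get("imm", 0)
    kind = ir["kind"]

    # 3. EX: effective address, ALU result, or branch target and condition.
    cond = False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "alu":
        alu_output = ir["func"](a, b)
    elif kind == "alu_imm":
        alu_output = ir["func"](a, imm)
    elif kind == "beqz":
        alu_output = npc + (imm << 2)
        cond = (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")

    # 4. MEM: memory access, or branch completion (assumed, see lead-in).
    lmd = None
    if kind == "load":
        lmd = dmem[alu_output]
    elif kind == "store":
        dmem[alu_output] = b
    elif kind == "beqz":
        return alu_output if cond else npc

    # 5. WB: write the result to the register file.
    if kind == "alu":
        regs[ir["rd"]] = alu_output
    elif kind == "alu_imm":
        regs[ir["rt"]] = alu_output
    elif kind == "load":
        regs[ir["rt"]] = lmd
    return npc


# LD R1, 55(R2) with Regs[R2] = 100 and Mem[155] = 7 (sample values).
imem = {0: {"kind": "load", "rs": "R2", "rt": "R1", "imm": 55}}
regs = {"R0": 0, "R1": 0, "R2": 100}
dmem = {155: 7}
next_pc = execute(0, imem, regs, dmem)
```

After the call, R1 holds the word read from address Regs[R2] + 55 and the PC has advanced by 4.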

Multicycle Datapath

[Figure: multicycle datapath for LD R1, 55(R2), annotated CYCLE 1 through CYCLE 5. Labels: rs = R2, rd = R1, Imm = 55, Regs[R2], Regs[R2]+55 (address for LD instruction), Mem(Regs[R2]+55)]

Example: Datapath for LD R1, 55(R2)

Add Pipeline Registers

PC can also be considered a pipeline register

Pipeline registers take place of IR, A, B, Imm, ALUoutput, LMD

Stage by Stage Operation


Figure A.19, p. A-32
Each pipeline register has fields
Example: IF/ID.IR is the IR field of the IF/ID pipeline register

Pipeline Control

Control signals needed for:
MUXs
Register (write)
ALU (function)
Data memory (read/write)

Control Complications
Instruction issue: when an instruction transfers from the ID stage into the EX stage
Data hazards checked in ID stage
If stall is required, instruction is stalled before it is issued
If forwarding is needed, controls are set

Hazard Detection Situations


Requires stall:
LD   R1, 45(R2)
DADD R5, R1, R7
Comparators detect the use of R1 in the DADD and stall the DADD (and future instructions) before the DADD begins EX.

Requires forwarding:
LD   R1, 45(R2)
DADD R5, R6, R7
DSUB R8, R1, R7
Comparators detect the use of R1 in DSUB and forward result of load to ALU in time for DSUB to begin EX.

Load Interlocks
Recall that the following code requires a stall (load interlock) to prevent a Read After Write (RAW) hazard:
LD   R1, 0(R2)
DSUB R4, R1, R5
AND  R6, R1, R7
OR   R8, R1, R9

Hazard can be detected in the ID stage by comparing the rt field of the load with the rs and rt fields of the instruction being decoded

Load Interlock Detection Logic


Opcode field of ID/EX   Opcode field of IF/ID                   Matching operand fields
(ID/EX.IR0..5)          (IF/ID.IR0..5)
Load                    Register-register ALU                   ID/EX.IR[rt] == IF/ID.IR[rs]
Load                    Register-register ALU                   ID/EX.IR[rt] == IF/ID.IR[rt]
Load                    Load, store, ALU immediate, or branch   ID/EX.IR[rt] == IF/ID.IR[rs]
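The three comparisons in the detection table can be written as a single predicate. This is a sketch: the pipeline registers are modeled as dictionaries with hypothetical keys `op`, `rs`, and `rt`, and the opcode classes are plain strings rather than 6-bit opcode fields.

```python
RS_ONLY_CLASSES = {"load", "store", "alu_imm", "branch"}


def needs_load_interlock(id_ex, if_id):
    """True when the instruction in IF/ID must stall behind a load in ID/EX."""
    if id_ex["op"] != "load":
        return False
    load_dest = id_ex["rt"]          # a load writes the register named by rt
    if if_id["op"] == "rr_alu":
        # Register-register ALU ops read both rs and rt.
        return load_dest in (if_id["rs"], if_id["rt"])
    if if_id["op"] in RS_ONLY_CLASSES:
        # These classes are checked against rs only, as in the table.
        return load_dest == if_id["rs"]
    return False


if __name__ == "__main__":
    ld = {"op": "load", "rs": "R2", "rt": "R1"}            # LD R1, 45(R2)
    dadd_dep = {"op": "rr_alu", "rs": "R1", "rt": "R7"}    # DADD R5, R1, R7
    dadd_ind = {"op": "rr_alu", "rs": "R6", "rt": "R7"}    # DADD R5, R6, R7
    print(needs_load_interlock(ld, dadd_dep))
    print(needs_load_interlock(ld, dadd_ind))
```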

Implementing a Stall after Detection


Change opcode in ID/EX pipeline register to 00000 (NOP)
Recirculate contents of IF/ID register to hold stalled instruction

Forwarding Logic
Detection is similar to detecting RAW, but more cases
All forwarding values originate at ALU or data memory output
Terminate at ALU input, data memory input or zero detection unit
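One way to picture the forwarding decision for a single ALU input is as a priority choice among the two later pipeline registers and the register file. The sketch below is an illustration only, not the textbook's full forwarding table: it models EX/MEM and MEM/WB as dictionaries with hypothetical keys `dest` and `value`, and it assumes any load-use case has already been handled by the interlock.

```python
def select_alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Choose the value for one ALU input, preferring the most recent producer.

    ex_mem and mem_wb describe the instructions held in those pipeline
    registers: "dest" is the register they write (or None), "value" is the
    result available for forwarding.
    """
    if src_reg == "R0":
        return 0                       # R0 always reads as zero in MIPS
    if ex_mem["dest"] == src_reg:
        return ex_mem["value"]         # forwarded from the ALU output
    if mem_wb["dest"] == src_reg:
        return mem_wb["value"]         # forwarded ALU result or load data
    return regfile_value               # no hazard: use the register file


if __name__ == "__main__":
    # DADD R1, R2, R3 is in EX/MEM while DSUB R4, R1, R5 enters EX.
    ex_mem = {"dest": "R1", "value": 42}
    mem_wb = {"dest": None, "value": None}
    print(select_alu_operand("R1", regfile_value=0, ex_mem=ex_mem, mem_wb=mem_wb))
```

Checking EX/MEM before MEM/WB matters when both hold a write to the same register, since the EX/MEM value is the newer one.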

Forwarding Logic

Additional MUX inputs and paths

Branches in Pipeline
Consider only BEQZ and BNEZ (branch if equal to zero or not equal to zero)
For these it is possible to move the test to the ID stage
To take advantage of early decision, target address must also be computed early
Must add another adder for computing target address in ID
Result is a 1-cycle stall on branches
A branch that tests a register written by the previous ALU operation incurs a data hazard stall

Logic for Early Test and Target Address


Additional adder

Zero test

Compare to Previous Version

Exceptions and Pipelines


Exceptions can come from several sources and can be classified several ways
Sources
I/O device interrupt
Invoking OS from user program
Tracing program execution
Breakpoint
Integer arithmetic overflow or underflow, FP trap
Page fault
Misaligned memory accesses
Memory protection violation
Undefined instruction
Hardware malfunction
Power failure

Exceptions and Pipelines


Exception characteristics
Synchronous (from within cpu) vs asynchronous
Asynchronous caused by devices external to cpu and memory

User requested vs coerced


User requests from a program are predictable
Coerced requests are from some hardware event outside control of user program

User maskable vs user non-maskable


Masks control whether hardware responds to exception or not

Within vs between instructions


Within: usually synchronous, since the instruction triggers the exception
Asynchronous exceptions within an instruction are catastrophic and cause program termination

Resume vs terminate
Terminating: program execution always stops after the interrupt
Resuming: program execution continues after the interrupt is handled
Resuming exceptions are harder to handle

Classifications of Exception Types


Figure A.27 in text
Most difficult:
Synchronous, coerced exceptions that occur within instructions and can be resumed

Restartable pipelines or processors can handle these

Example: Virtual Memory Page Fault


[Figure: pipeline diagram marking where page faults occur]
Classification: synchronous, coerced, within instruction, resume

Saving Pipeline State


Force a trap instruction into the pipeline on the next IF
Until the trap is taken, turn off all writes for the faulting instruction and all that follow
After the trap receives control, immediately save the PC of the faulting instruction, to return from the trap later
For delayed branch pipelines, need to save and restore as many PCs as the length of the branch delay plus one. (Compilers put instructions out of order)

After exception is handled, return from exception by reloading PCs and restarting the instruction stream.

Precise exceptions: the pipeline can always be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted.
Floating point instructions tend to take many cycles,
difficult to have precise exceptions

Some CPUs have 2 modes of operation


Precise exception mode: allows less overlap among floating point instructions, so it is slower
Fast performance mode

Almost all integer pipelines support precise exceptions

Precise Exceptions in MIPS


Possible exceptions in MIPS stages
IF: page fault on instruction fetch; misaligned memory access; memory protection violation
ID: undefined or illegal opcode
EX: arithmetic exception
MEM: page fault on data fetch; misaligned memory access; memory protection violation
WB: none

Multiple exceptions can occur on the same clock cycle

Example
Clock cycle   1    2    3    4    5    6
LD            IF   ID   EX   MEM  WB
DADD               IF   ID   EX   MEM  WB

Cycle 4: data page fault (LD in MEM) and arithmetic exception (DADD in EX)

1. Deal with the page fault, redo the DADD
2. Deal with the DADD arithmetic exception that will occur again
But: exceptions can occur out of order
Alternate solution:
Hardware posts all exceptions in a status vector
Control signals that write data are turned off
When instruction enters WB, exception status vector is checked
Exceptions of earliest instructions handled first
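The status-vector idea can be sketched as follows: exceptions are posted against the instruction that raised them, in whatever order they occur, but they are only acted on when each instruction reaches WB, which happens in program order. The function name, the tuple layout, and the second sample event (an instruction-fetch fault in cycle 2) are hypothetical illustrations.

```python
def handling_order(num_instructions, raised):
    """Order in which posted exceptions are handled.

    raised: list of (cycle_raised, program_index, description) tuples.
    Returns a list of (program_index, description), earliest instruction first.
    """
    # One status vector per instruction; post exceptions in time order.
    vectors = [[] for _ in range(num_instructions)]
    for _cycle, index, description in sorted(raised):
        vectors[index].append(description)

    # Instructions reach WB in program order, so the vector of the earliest
    # instruction is checked (and its exception handled) first.
    order = []
    for index in range(num_instructions):
        if vectors[index]:
            order.append((index, vectors[index][0]))
    return order


if __name__ == "__main__":
    events = [
        (2, 1, "page fault on instruction fetch"),   # later instruction, earlier cycle
        (4, 0, "data page fault"),                   # earlier instruction, later cycle
    ]
    print(handling_order(2, events))
```

Even though the second instruction faults first in time, the first instruction's data page fault is handled first.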

Helpful MIPS Pipeline Features


No instruction updates the state of the processor (registers or memory) before the MEM stage

ISA Complications for Pipelining


Instructions that change processor state at multiple stages in pipeline.
Ex: autoincrement, autodecrement
Must provide a way to back out of instruction even after it is partially completed
Usually requires the storage of extra state

The use of condition codes


Restricts the reordering of instructions that is often useful for delay slots after branches
Deciding when condition codes are fixed complicates exception handling as well as hazard detection

ISA Complications for Pipelining


Multi-cycle operations (operations that take a variable number of cycles based on operands)
Example: Move Character String where instruction specifies address and length of string

Some ISAs are just too complex to be pipelined efficiently

MIPS Multicycle Operations


Floating Point operations
Load/Store
Add
Multiply, Divide (these are multicycle for integers too)

Multiple Cycles in Execution Stage

IF

ID

EX

MEM

WB

One Approach
4 separate function units for EX stage
Integer takes 1 clock cycle
FP units take multiple cycles
Instruction issue: allowing an instruction to move from ID to EX phase

Could Pipeline the Function Units


Allows some overlap of instructions
Difficult to pipeline divider
[Figure: pipelined function units]

Divide unit takes 24 clock cycles, but is NOT pipelined.

Definitions
Latency: the number of cycles between when an instruction produces a result and when the next instruction can use the result
(generally 1 cycle less than stages in function unit pipeline)
Results consumed at beginning of EX stage

Integer ALU: latency = 0
Loads: latency = 1
FP Add: latency = 3
FP Mult: latency = 6
FP Div: latency = 24

Definitions
Initiation Interval: the number of cycles that must elapse between issuing two operations of a given type
Integer ALU, Loads, FP Add, FP Mult: Initiation Interval = 1

Divide: Initiation Interval = 25
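The two definitions lead to simple stall estimates. The sketch below uses the latency and initiation interval values listed above; the dictionary keys and function names are illustrative, and the model ignores every other hazard.

```python
LATENCY = {"int_alu": 0, "load": 1, "fp_add": 3, "fp_mult": 6, "fp_div": 24}
INITIATION_INTERVAL = {"int_alu": 1, "load": 1, "fp_add": 1, "fp_mult": 1, "fp_div": 25}


def raw_stalls(producer: str, independent_between: int = 0) -> int:
    """Stall cycles for a consumer that follows `producer`.

    independent_between is the number of unrelated instructions issued
    between the producer and the consumer; each one hides a cycle of latency.
    """
    return max(0, LATENCY[producer] - independent_between)


def issue_stalls(unit: str, cycles_since_last_issue: int) -> int:
    """Stall cycles before another operation can be issued to the same unit."""
    return max(0, INITIATION_INTERVAL[unit] - cycles_since_last_issue)


if __name__ == "__main__":
    print(raw_stalls("load"))            # dependent instruction right after a load
    print(raw_stalls("fp_mult", 2))      # two independent instructions in between
    print(issue_stalls("fp_div", 1))     # back-to-back divides
```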

Example Pipeline Timing


Clock cycle   1    2    3    4    5    6    7    8    9    10   11
MUL.D         IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
ADD.D              IF   ID   A1   A2   A3   A4   MEM  WB
L.D                     IF   ID   EX   MEM  WB
S.D                          IF   ID   EX   MEM  WB

[Figure legend: the original figure marks the stages where data is needed and the stages where results are available]

Hazards and Forwarding


Multi-cycle, non-pipelined divide unit can cause structural hazards, which must be detected.
Varying run times mean that there can be multiple register writes in a clock cycle.
Instructions do not reach WB in order, so Write After Write (WAW) hazards are possible.
Exceptions are complicated by out-of-order completion of instructions.
Longer latency results in more frequent stalls for Read After Write (RAW) hazards.

Multiple Register Writes


              Clock Cycle Number
Instruction   1    2    3    4    5    6    7    8    9    10   11
MUL.D         IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
...                IF   ID   EX   MEM  WB
...                     IF   ID   EX   MEM  WB
ADD.D                        IF   ID   A1   A2   A3   A4   MEM  WB
...                               IF   ID   EX   MEM  WB
...                                    IF   ID   EX   MEM  WB
L.D                                         IF   ID   EX   MEM  WB

In cycle 11, MUL.D, ADD.D, and L.D all reach WB.

Possible Solutions
Add write ports: probably not a good idea because it is not a common scenario.
Detect structural hazard and implement interlock.
Track scheduled write ports in ID and stall there, OR
Stall conflicting instruction in the MEM or WB stage

Pros and cons to each approach; stalls in the ID phase will be assumed
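Tracking the write port in ID amounts to predicting the cycle in which each instruction will reach WB. A minimal sketch follows, assuming the pipeline lengths from the timing diagrams (MUL.D reaches WB 10 cycles after its IF, ADD.D after 7, integer instructions and L.D after 4); the table `WB_OFFSET`, the `"int"` stand-in for the unnamed instructions, and the function name are illustrative.

```python
from collections import defaultdict

# Cycles from IF to WB for each operation type, read off the pipeline diagrams.
WB_OFFSET = {"MUL.D": 10, "ADD.D": 7, "L.D": 4, "int": 4}


def write_port_conflicts(program):
    """Find cycles in which more than one instruction wants the write port.

    program: list of (if_cycle, op) pairs.
    Returns {wb_cycle: [ops...]} for every cycle with two or more writers.
    """
    by_cycle = defaultdict(list)
    for if_cycle, op in program:
        by_cycle[if_cycle + WB_OFFSET[op]].append(op)
    return {cycle: ops for cycle, ops in by_cycle.items() if len(ops) > 1}


PROGRAM = [
    (1, "MUL.D"), (2, "int"), (3, "int"),
    (4, "ADD.D"), (5, "int"), (6, "int"),
    (7, "L.D"),
]
conflicts = write_port_conflicts(PROGRAM)

if __name__ == "__main__":
    print(conflicts)
```

An interlock in ID would stall any instruction whose predicted WB cycle is already claimed.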

WAW Hazards
                   Clock Cycle Number
Instruction        1    2    3    4    5    6    7    8    9    10   11
MUL.D              IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
...                     IF   ID   EX   MEM  WB
...                          IF   ID   EX   MEM  WB
ADD.D F2, F4, F6                  IF   ID   A1   A2   A3   A4   MEM  WB
...                                    IF   ID   EX   MEM  WB
L.D F2, 0(R2)                               IF   ID   EX   MEM  WB
...                                              IF   ID   EX   MEM  WB

L.D writes F2 in cycle 10, one cycle before the earlier ADD.D writes F2 in cycle 11.

Review of Hazards
Caused by different lengths of execution unit pipelines.
Structural hazards: multiple instructions need the same function unit at the same time
RAW data hazards: an instruction needs to read a value that has not been written yet
WAW data hazards: writes occur out of order

Handling Hazards
Structural Hazards
Wait to issue instructions if divider is busy, or if the register write port will not be available.

RAW Hazards
Check source registers against pending destinations, stall issue if necessary.

WAW Hazards
Determine if any instruction in the MULT, ADD, or DIV pipeline has the same destination as the instruction being issued; stall issue if necessary.
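The three issue checks can be combined into one function that reports which hazards would force a stall. This is a conservative sketch: instructions are dictionaries with hypothetical keys (`unit`, `dest`, `srcs`, `wb_cycle`), and any in-flight producer of a source register counts as a RAW hazard regardless of forwarding.

```python
def issue_hazards(instr, in_flight, divider_busy=False):
    """Return the set of hazards that prevent `instr` from issuing this cycle.

    in_flight: older instructions still in the MULT, ADD, or DIV pipelines.
    """
    hazards = set()
    # Structural: the non-pipelined divider is still busy.
    if instr["unit"] == "div" and divider_busy:
        hazards.add("structural")
    for older in in_flight:
        # Structural: the register write port is already claimed for that cycle.
        if older["wb_cycle"] == instr["wb_cycle"]:
            hazards.add("structural")
        # RAW: a source register is still pending as a destination.
        if older["dest"] in instr["srcs"]:
            hazards.add("RAW")
        # WAW: an older in-flight instruction writes the same destination.
        if instr["dest"] is not None and older["dest"] == instr["dest"]:
            hazards.add("WAW")
    return hazards


if __name__ == "__main__":
    add_d = {"unit": "add", "dest": "F2", "srcs": ("F4", "F6"), "wb_cycle": 11}
    l_d = {"unit": "int", "dest": "F2", "srcs": ("R2",), "wb_cycle": 10}
    print(issue_hazards(l_d, [add_d]))
```

With ADD.D F2, F4, F6 still in flight, issuing L.D F2, 0(R2) is flagged as a WAW hazard.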

Precise Exceptions
Out of order completion makes precise exceptions difficult
Instruction              Completion time (starting at 0, 1, 2)
DIV.D F0, F3, F5         cycle 28
ADD.D F9, F9, F7         cycle 9
SUB.D F10, F10, F14      cycle 10

No data hazards, so no stalls
If SUB.D causes an exception, ADD.D is already done, but DIV.D is NOT complete.
Saving PC and starting over at SUB.D will not work.

Solution Options
Buffer results until all previous instructions are complete.
OK as long as the difference in completion times is reasonable. (Lots of storage otherwise)

Allow exceptions to be imprecise, have trap routines create precise sequence.


Requires some buffering also.
Trap routine finishes instructions preceding the latest instruction completed before returning.

MIPS FP Pipeline
Stalls to avoid structural and RAW hazards
Stalls per FP operation
# stalls depends on latency
# stalls also depends on how many cycles before results are used
Divide frequency is low, but number of stalls needed is high due to latency
Average for add/sub/conv = 1.7 (56% of latency)
Average for mult = 2.8 (46%)
Average for div = 14.2 (59%)

MIPS R4000 Pipeline


Implements MIPS-64
Deeper pipeline (8 stages)
Superpipeline
Higher clock rate (smaller logic in each stage)
Additional stages from decomposing memory accesses

MIPS R4000 8-Stage Pipeline

Stages:
IF: first half of instruction fetch
IS: second half of instruction fetch
RF: instruction decode and register fetch, hazard checking, instruction cache hit detection
EX: execution (address calculation, ALU operation, condition evaluation)
DF: data fetch, first half of data cache access
DS: second half of data fetch, completion of cache access
TC: tag check, determine whether the data cache access hit
WB: write back for loads and register-register operations

Effects of Deeper Pipeline


More forwarding required
Load and branch delays increased
2-cycle stall for RAW hazards with load
Branch delay is 3 cycles
Single-cycle branch delay used along with forwarding
Predicted-not-taken approach used

MIPS R4000 Floating-point Pipeline


3 functional units
FP adder
FP multiplier
FP divider

8 stages, used 0 or many times, in different orders, by different instructions
Large range of completion times (2-112 cycles)

FP Pipeline Stages
Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers

Instruction Latencies and Initiation Intervals


FP instruction    Latency   Initiation interval
Add, subtract     4         3
Multiply          8         4
Divide            36        35
Square root       112       111
Negate            2         1
Absolute value    2         1
FP compare        3         2

MIPS R4000 Pipeline Performance


Four major causes of pipeline stalls
Load stalls
Branch stalls
FP result stalls (RAW hazards)
FP structural stalls

CPI for 10 SPEC92 benchmarks

Branch stalls from the longer pipeline are substantial
FP structural stalls are sometimes masked by result stalls

Appendix A Summary
For ideal N-stage pipeline, throughput increase is N over a non-pipelined architecture
Ideal pipelined CPU has CPI = 1
Pipelining has advantages of
significant speedup with moderate hardware costs
invisible to programmer

Pipeline challenges include


Structural hazards
Data hazards
Control hazards
Exceptions
Floating point operations

Appendix A Summary
Solutions include
Stalls
Forwarding
Buffering state (for exceptions)
Branch delay slots
Branch prediction
Several multi-cycle execution units for FP
