
Embedded Processor Architecture

RISC
Instruction Set Implementation Alternatives
== using MIPS as example ==

TU/e 5kk73
Henk Corporaal Bart Mesman

Topics

MIPS ISA: Instruction Set Architecture
MIPS single cycle implementation
MIPS multi-cycle implementation
MIPS pipelined implementation
Pipeline hazards
Recap of RISC principles
Other architectures
Based on the book: ch2-4 (4th ed)
Many slides; I'll go quick and skip some

H.Corporaal EmbProcArch 5kk73

Main Types of Instructions

Arithmetic

Integer Floating Point

Memory access instructions

Load & Store

Control flow

Jump Conditional Branch Call & Return


MIPS arithmetic

Most instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code:    A = B + C

MIPS code: add $s0, $s1, $s2
($s0, $s1 and $s2 are associated with variables by compiler)

MIPS arithmetic
C code:    A = B + C + D;
           E = F - A;

MIPS code: add $t0, $s1, $s2
           add $s0, $t0, $s3
           sub $s4, $s5, $s0

Operands must be registers, only 32 registers provided
Design Principle: smaller is faster. Why?

Registers vs. Memory

Arithmetic instruction operands must be registers, only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables?

[Figure: CPU (with register file), Memory and IO]

Register allocation

Compiler tries to keep as many variables in registers as possible
Some variables cannot be allocated to registers:
  large arrays (too few registers)
  aliased variables (variables accessible through pointers in C)
  dynamically allocated variables (heap, stack)

Compiler may run out of registers => spilling

Memory Organization

Viewed as a large, single-dimension array, with an address
A memory address is an index into the array
"Byte addressing" means that successive addresses are one byte apart

Address  Contents
0        8 bits of data
1        8 bits of data
2        8 bits of data
3        8 bits of data
4        8 bits of data
5        8 bits of data
6        8 bits of data
...

Memory Organization

Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.

Address  Contents
0        32 bits of data
4        32 bits of data
8        32 bits of data
12       32 bits of data
...

Registers hold 32 bits of data

2^32 bytes with byte addresses from 0 to 2^32 - 1
2^30 words with byte addresses 0, 4, 8, ... 2^32 - 4

Memory layout: Alignment


[Figure: memory drawn as rows of 32-bit words (bit positions 31, 23, 15, 7, 0) at byte addresses 0, 4, 8, 12, 16, 20, 24; one word is marked "this word is aligned; the others are not!"]

Words are aligned
What are the 2 least significant bits of a word address?

Instructions: load and store


Example:

C code:    A[8] = h + A[8];

MIPS code: lw  $t0, 32($s3)
           add $t0, $s2, $t0
           sw  $t0, 32($s3)

Store word operation has no destination (reg) operand
Remember arithmetic operands are registers, not memory!
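The load/add/store sequence can be traced with a small model. This is only a sketch: the memory is a Python dict from byte address to word, and the base address of A (1000) and the value of h (5) are made-up values for illustration. It shows why element 8 sits at byte offset 32.

```python
# Minimal model of the lw / add / sw sequence for A[8] = h + A[8].
# Assumptions (not from the slides): memory is a dict from byte address to
# 32-bit word, A starts at byte address 1000, and h = 5.
WORD = 4  # bytes per MIPS word

def run_example(memory, regs):
    offset = 8 * WORD                                        # index 8 -> byte offset 32
    regs["$t0"] = memory[regs["$s3"] + offset]               # lw  $t0, 32($s3)
    regs["$t0"] = (regs["$s2"] + regs["$t0"]) & 0xFFFFFFFF   # add $t0, $s2, $t0
    memory[regs["$s3"] + offset] = regs["$t0"]               # sw  $t0, 32($s3)
    return memory

base = 1000
memory = {base + i * WORD: i * 10 for i in range(16)}        # A[i] = 10 * i
regs = {"$s2": 5, "$s3": base, "$t0": 0}
run_example(memory, regs)
print(memory[base + 32])  # A[8] was 80, now 85
```

Only the word at offset 32 changes; the neighbouring elements keep their values.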

Let's translate some C-code

Can we figure out the code?

swap(int v[], int k)
{
  int temp;
  temp = v[k];
  v[k] = v[k+1];
  v[k+1] = temp;
}

swap: muli $2, $5, 4
      add  $2, $4, $2
      lw   $15, 0($2)
      lw   $16, 4($2)
      sw   $16, 0($2)
      sw   $15, 4($2)
      jr   $31

Explanation:
index k: $5
base address of v: $4
address of v[k] is $4 + 4*$5

Machine Language

Instructions, like registers and words of data, are also 32 bits long

Example: add $t0, $s1, $s2
Registers have numbers: $t0=8, $s1=17, $s2=18

Instruction Format:
op      rs      rt      rd      shamt   funct
000000  10001   10010   01000   00000   100000
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits

Can you guess what the field names stand for?

Machine Language

Consider the load-word and store-word instructions,
  What would the regularity principle have us do?
  New principle: Good design demands a compromise

Introduce a new type of instruction format
  I-type for data transfer instructions
  other format was R-type for register

Example: lw $t0, 32($s2)

op      rs      rt      16 bit number
35      18      8       32
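The two formats can be checked by packing the fields into a 32-bit word. The sketch below assumes the register numbers from the MIPS convention ($t0 = 8, $s1 = 17, $s2 = 18); `encode_r` and `encode_i` are hypothetical helper names, not part of any toolchain.

```python
# Sketch: pack MIPS R-type and I-type fields into 32-bit instruction words.
# encode_r / encode_i are hypothetical helper names, not from the slides.

def encode_r(op, rs, rt, rd, shamt, funct):
    """op(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)"""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(op, rs, rt, imm):
    """op(6) | rs(5) | rt(5) | 16-bit immediate (two's complement)"""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

T0, S1, S2 = 8, 17, 18  # register numbers from the MIPS convention

add_word = encode_r(op=0, rs=S1, rt=S2, rd=T0, shamt=0, funct=32)  # add $t0, $s1, $s2
lw_word = encode_i(op=35, rs=S2, rt=T0, imm=32)                    # lw  $t0, 32($s2)

print(f"{add_word:032b}")   # the six fields, concatenated
print(f"{add_word:#010x}")
print(f"{lw_word:#010x}")
```

Both formats keep op, rs and rt in the same bit positions, which is the compromise that keeps decoding simple.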

Stored Program Concept


[Figure: the CPU connected to a memory that holds the OS, Program 1 (code, global data, stack, heap), Program 2, and unused regions]

Control

Decision making instructions
  alter the control flow, i.e., change the "next" instruction to be executed

MIPS conditional branch instructions:
  bne $t0, $t1, Label
  beq $t0, $t1, Label

Example: if (i==j) h = i + j;

       bne $s0, $s1, Label
       add $s3, $s0, $s1
Label: ....

Control

MIPS unconditional branch instructions:
  j label

Example:
  if (i!=j)
    h=i+j;
  else
    h=i-j;

      beq $s4, $s5, Lab1
      add $s3, $s4, $s5
      j   Lab2
Lab1: sub $s3, $s4, $s5
Lab2: ...

Can you build a simple for loop?

So far:

Instruction         Meaning
add $s1,$s2,$s3     $s1 = $s2 + $s3
sub $s1,$s2,$s3     $s1 = $s2 - $s3
lw  $s1,100($s2)    $s1 = Memory[$s2+100]
sw  $s1,100($s2)    Memory[$s2+100] = $s1
bne $s4,$s5,L       Next instr. is at Label if $s4 ≠ $s5
beq $s4,$s5,L       Next instr. is at Label if $s4 = $s5
j   Label           Next instr. is at Label

Formats:
R   op | rs | rt | rd | shamt | funct
I   op | rs | rt | 16 bit address
J   op | 26 bit address

Control Flow

We have: beq, bne, what about Branch-if-less-than?
New instruction:
  slt $t0, $s1, $s2
  meaning: if $s1 < $s2 then $t0 = 1 else $t0 = 0

Can use this instruction to build "blt $s1, $s2, Label"
  can now build general control structures

Note that the assembler needs a register to do this,
  use conventions for registers
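How an assembler might build blt from slt can be sketched as a small text expansion. This assumes the usual convention of reserving $at (register 1) as the assembler's scratch register; `expand_blt` is a hypothetical helper name.

```python
# Sketch: expand the pseudo-instruction "blt rs, rt, label" into real MIPS
# instructions. Using $at (register 1) as the scratch register is the usual
# assembler convention; expand_blt is a hypothetical helper name.

def expand_blt(rs, rt, label, scratch="$at"):
    return [
        f"slt {scratch}, {rs}, {rt}",       # scratch = 1 if rs < rt else 0
        f"bne {scratch}, $zero, {label}",   # branch when scratch != 0
    ]

for line in expand_blt("$s1", "$s2", "Label"):
    print(line)
```

The scratch register is overwritten on every expansion, which is why compiled code has to leave it alone.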

MIPS compiler/assembler Conventions


Name      Register number   Usage
$zero     0                 the constant value 0
$v0-$v1   2-3               values for results and expression evaluation
$a0-$a3   4-7               arguments
$t0-$t7   8-15              temporaries
$s0-$s7   16-23             saved (by callee)
$t8-$t9   24-25             more temporaries
$gp       28                global pointer
$sp       29                stack pointer
$fp       30                frame pointer
$ra       31                return address

Constants

Small constants are used quite frequently (50% of operands)
  e.g., A = A + 5;
        B = B + 1;
        C = C - 18;
Solutions? Why not?
  put 'typical constants' in memory and load them
  create hard-wired registers (like $zero) for constants like one

MIPS Instructions:
  addi $29, $29, 4
  slti $8, $18, 10
  andi $29, $29, 6
  ori  $29, $29, 4

How about larger constants?


We'd like to be able to load a 32 bit constant into a register
Must use two instructions; new "load upper immediate" instruction

  lui $t0, 1010101010101010

  $t0 after lui (lower half filled with zeros):
  1010101010101010 0000000000000000

Then must get the lower order bits right, i.e.,

  ori $t0, $t0, 1010101010101010

        1010101010101010 0000000000000000
  ori   0000000000000000 1010101010101010
        ---------------------------------
        1010101010101010 1010101010101010
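The two-instruction sequence can be modelled with plain integer operations. In this sketch `lui` and `ori` are Python functions that mimic the instructions, and `load_constant` is a hypothetical helper that splits a 32-bit value into the two immediates.

```python
# Sketch: model lui followed by ori to build a 32-bit constant.
# lui / ori here are plain Python functions that mimic the instructions.

def lui(imm16):
    """Load the 16-bit immediate into the upper half; the lower half is zero."""
    return (imm16 & 0xFFFF) << 16

def ori(reg, imm16):
    """OR the register with the zero-extended 16-bit immediate."""
    return reg | (imm16 & 0xFFFF)

def load_constant(value):
    upper, lower = (value >> 16) & 0xFFFF, value & 0xFFFF
    t0 = lui(upper)       # lui $t0, upper
    t0 = ori(t0, lower)   # ori $t0, $t0, lower
    return t0

t0 = load_constant(0b1010101010101010_1010101010101010)
print(f"{t0:032b}")
```

ori works here because it zero-extends its immediate, so the upper half written by lui is left untouched.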

Assembly Language vs. Machine Language

Assembly provides convenient symbolic representation
  much easier than writing down numbers
  e.g., destination first

Machine language is the underlying reality
  e.g., destination is no longer first

Assembly can provide 'pseudoinstructions'
  e.g., move $t0, $t1 exists only in Assembly
  would be implemented using add $t0,$t1,$zero

When considering performance you should count real instructions

Addresses in Branches and Jumps

Instructions:
bne $t4,$t5,Label     Next instruction is at Label if $t4 ≠ $t5
beq $t4,$t5,Label     Next instruction is at Label if $t4 = $t5
j Label               Next instruction is at Label

Formats:
I   op | rs | rt | 16 bit address
J   op | 26 bit address

Addresses are not 32 bits
How do we handle this with load and store instructions?

What's the next address?

Instructions:
bne $t4,$t5,Label     Next instruction is at Label if $t4 ≠ $t5
beq $t4,$t5,Label     Next instruction is at Label if $t4 = $t5

Formats:
I   op | rs | rt | 16 bit address

Could specify a register (like lw and sw) and add it to address
  use Instruction Address Register (PC = program counter)
  most branches are local (principle of locality)

Jump instructions just use high order bits of PC
  address boundaries of 256 MB
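Both address calculations fit in a few lines. The sketch below assumes the standard MIPS rule that the offset is counted in words relative to the instruction after the branch (PC + 4), and that a jump keeps the upper 4 bits of PC + 4; the function names are hypothetical.

```python
# Sketch: next-address computation for MIPS branches and jumps.
# Function names are hypothetical; all addresses are byte addresses.

def sign_extend16(imm):
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def branch_target(pc, imm16):
    """PC-relative: the word offset is relative to the instruction after the branch."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

def jump_target(pc, addr26):
    """Pseudodirect: upper 4 bits of PC + 4, then the 26-bit field shifted left by 2."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)

print(hex(branch_target(0x00400000, 25)))      # 25 words past PC + 4
print(hex(branch_target(0x00400000, 0xFFFF)))  # offset -1: back to the branch itself
print(jump_target(0x00400000, 2500))           # word address 2500 -> byte address 10000
```

The 26-bit field shifted left by 2 gives 28 address bits, which is where the 256 MB boundary comes from.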

To summarize:
MIPS assembly language

Category            Instruction              Example              Meaning                                    Comments
Arithmetic          add                      add  $s1, $s2, $s3   $s1 = $s2 + $s3                            Three operands; data in registers
                    subtract                 sub  $s1, $s2, $s3   $s1 = $s2 - $s3                            Three operands; data in registers
                    add immediate            addi $s1, $s2, 100   $s1 = $s2 + 100                            Used to add constants
Data transfer       load word                lw   $s1, 100($s2)   $s1 = Memory[$s2 + 100]                    Word from memory to register
                    store word               sw   $s1, 100($s2)   Memory[$s2 + 100] = $s1                    Word from register to memory
                    load byte                lb   $s1, 100($s2)   $s1 = Memory[$s2 + 100]                    Byte from memory to register
                    store byte               sb   $s1, 100($s2)   Memory[$s2 + 100] = $s1                    Byte from register to memory
                    load upper immediate     lui  $s1, 100        $s1 = 100 * 2^16                           Loads constant in upper 16 bits
Conditional branch  branch on equal          beq  $s1, $s2, 25    if ($s1 == $s2) go to PC + 4 + 100         Equal test; PC-relative branch
                    branch on not equal      bne  $s1, $s2, 25    if ($s1 != $s2) go to PC + 4 + 100         Not equal test; PC-relative
                    set on less than         slt  $s1, $s2, $s3   if ($s2 < $s3) $s1 = 1; else $s1 = 0       Compare less than; for beq, bne
                    set less than immediate  slti $s1, $s2, 100   if ($s2 < 100) $s1 = 1; else $s1 = 0       Compare less than constant
Unconditional jump  jump                     j    2500            go to 10000                                Jump to target address
                    jump register            jr   $ra             go to $ra                                  For switch, procedure return
                    jump and link            jal  2500            $ra = PC + 4; go to 10000                  For procedure call

MIPS (3+2) addressing modes overview


1. Immediate addressing
   op | rs | rt | Immediate
2. Register addressing
   op | rs | rt | rd | ... | funct      operand: Register (in Registers)
3. Base addressing
   op | rs | rt | Address               operand: Memory at Register + Address (Byte, Halfword or Word)
4. PC-relative addressing
   op | rs | rt | Address               target: Memory word at PC + Address
5. Pseudodirect addressing
   op | Address                         target: Memory word at PC combined with Address

MIPS Datapath

Building a datapath

support a subset of the MIPS-I instruction-set

A single cycle processor datapath

all instruction actions in one (long) cycle

A multi-cycle processor datapath

each instruction takes multiple (shorter) cycles

For details see book (ch 5)

Datapath and Control


Datapath:
  Registers & Memories
  Multiplexors
  Buses
  ALUs

Control:
  FSM or Microprogramming

The Processor: Datapath & Control

Simplified MIPS implementation to contain only:


  memory-reference instructions:    lw, sw
  arithmetic-logical instructions:  add, sub, and, or, slt
  control flow instructions:        beq, j

Generic Implementation:
  use the program counter (PC) to supply instruction address
  get the instruction from memory
  read registers
  use the instruction to decide exactly what to do

All instructions use the ALU after reading the registers
Why? memory-reference? arithmetic? control flow?

More Implementation Details

Abstract / Simplified View:


[Figure: abstract datapath with PC, Instruction memory, Registers (three register numbers and data), ALU and Data memory]

Two types of functional units:
  elements that operate on data values (combinational)
  elements that contain state (sequential)

State Elements

Unclocked vs. Clocked
Clocks used in synchronous logic
  when should an element that contains state be updated?

[Figure: clock waveform showing the cycle time, rising edge and falling edge]

An unclocked state element

The set-reset (SR) latch

  output depends on present inputs and also on past inputs

[Figure: SR latch with inputs R and S and output Q]

Truth table:
R  S  Q
0  0  Q
0  1  1     (state change)
1  0  0     (state change)
1  1  ?

Latches and Flip-flops

Output is equal to the stored value inside the element (don't need to ask for permission to look at the value)
Change of state (value) is based on the clock
  Latches: whenever the inputs change, and the clock is asserted
  Flip-flop: state changes only on a clock edge (edge-triggered methodology)

A clocking methodology defines when signals can be read and written
  wouldn't want to read a signal at the same time it was being written

D-latch

Two inputs:
  the data value to be stored (D)
  the clock signal (C) indicating when to read & store D

Two outputs:
  the value of the internal state (Q) and its complement

[Figure: D latch with inputs C and D and outputs Q and its complement]

D flip-flop

Output changes only on the clock edge


[Figure: D flip-flop built from two D latches in series, with inputs D and C and outputs Q and its complement]

Our Implementation

An edge triggered methodology
Typical execution:
  read contents of some state elements,
  send values through some combinational logic,
  write results to one or more state elements

[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]

Register File

3-ported: one write, two read ports

[Figure: register file with inputs Read reg. #1, Read reg. #2, Write reg. #, Write data and the Write signal; outputs Read data 1 and Read data 2]

Register file: read ports


Register file built using D flip-flops

Implementation of the read ports

[Figure: Register 0, Register 1, ..., Register n-1, Register n feed two multiplexors; Read register number 1 selects Read data 1 and Read register number 2 selects Read data 2]

Register file: write port

Note: we still use the real clock to determine when to write

[Figure: the Register number drives an n-to-1 decoder with outputs 0, 1, ..., n-1, n; each output, together with the Write signal, drives the C input of Register 0 ... Register n, and Register data drives every D input]

Building the Datapath

Use multiplexors to stitch them together


[Figure: single-cycle datapath combining PC, Instruction memory, two adders (one feeding Shift left 2 for the branch target), Registers, Sign extend (16 to 32 bits), ALU (Zero, ALU result) and Data memory, with multiplexors controlled by PCSrc, ALUSrc and MemtoReg, and the signals RegWrite, ALU operation (3 bits), MemWrite and MemRead]

Our Simple Control Structure


All of the logic is combinational
We wait for everything to settle down, and the right thing to be done
  ALU might not produce right answer right away
  we use write signals along with clock to determine when to write

Cycle time determined by length of the longest path

[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]

We are ignoring some details like setup and hold times!

Control

Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction

Example: add $8, $17, $18

Instruction Format:
op      rs      rt      rd      shamt   funct
000000  10001   10010   01000   00000   100000

ALU's operation based on instruction type and function code

Control: 2 level implementation


[Figure: two-level control driven by the instruction register]

Control 2: Opcode (instruction register bits 31-26, 6 bits) -> ALUop (2 bits)
  00: lw, sw
  01: beq
  10: add, sub, and, or, slt

Control 1: ALUop and Funct. (instruction register bits 5-0) -> ALUcontrol (3 bits) -> ALU
  000: and
  001: or
  010: add
  110: sub
  111: set on less than

Datapath with Control


[Figure: single-cycle datapath with control. The Control unit decodes Instruction [31-26] and drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite. Instruction [25-21] and [20-16] select the read registers, a multiplexor picks Instruction [20-16] or [15-11] as the write register, Instruction [15-0] is sign-extended from 16 to 32 bits, and Instruction [5-0] feeds ALU control.]

ALU Control1

What should the ALU do with this instruction
example: lw $1, 100($2)

op   rs   rt   16 bit offset
35   2    1    100

ALU control input
000   AND
001   OR
010   add
110   subtract
111   set-on-less-than

Why is the code for subtract 110 and not 011?

ALU Control1

Must describe hardware to compute 3-bit ALU control input
  given instruction type (ALUOp, computed from instruction type)
    00 = lw, sw
    01 = beq
    10 = arithmetic
  function code for arithmetic

Describe it using a truth table (can turn into gates):

inputs                                          outputs
ALUOp1  ALUOp0   F5  F4  F3  F2  F1  F0         Operation
0       0        X   X   X   X   X   X          010
X       1        X   X   X   X   X   X          110
1       X        X   X   0   0   0   0          010
1       X        X   X   0   0   1   0          110
1       X        X   X   0   1   0   0          000
1       X        X   X   0   1   0   1          001
1       X        X   X   1   0   1   0          111
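The truth table translates directly into a small decoding function. This is a sketch only; `alu_control` is a hypothetical name, `alu_op` is the 2-bit ALUOp and `funct` the 6-bit function field.

```python
# Sketch: the ALU control truth table as a function.
# alu_control is a hypothetical name; alu_op is the 2-bit ALUOp, funct the 6-bit field.

def alu_control(alu_op, funct=0):
    if alu_op == 0b00:          # lw, sw: add to form the address
        return 0b010
    if alu_op & 0b01:           # beq (ALUOp = X1): subtract to compare
        return 0b110
    # ALUOp = 1X: R-type, decode F3..F0 (F5 and F4 are don't-cares)
    table = {
        0b0000: 0b010,  # add
        0b0010: 0b110,  # subtract
        0b0100: 0b000,  # AND
        0b0101: 0b001,  # OR
        0b1010: 0b111,  # set on less than
    }
    return table[funct & 0b1111]

print(f"{alu_control(0b10, 0b100000):03b}")  # funct of add
print(f"{alu_control(0b10, 0b101010):03b}")  # funct of slt
```

The don't-care entries are what keep the resulting gate network small: only F3 to F0 and the two ALUOp bits are ever inspected.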

ALU Control1

Simple combinational logic (truth tables)


[Figure: ALU control block with inputs ALUOp (ALUOp1, ALUOp0) and F (5-0), of which F3, F2, F1 and F0 are used; outputs Operation2, Operation1 and Operation0]

Deriving Control2 signals


Input: 6 bits (the opcode)
9 control (output) signals

Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1

Determine these control signals directly from the opcodes:
  R-format: 0    lw: 35    sw: 43    beq: 4
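Since the control signals depend only on the opcode, the main control unit behaves like a lookup table. The sketch below uses `None` for a don't-care (X); `CONTROL` and `control_signals` are hypothetical names.

```python
# Sketch: main control as a lookup from opcode to the nine control signals.
# None stands for a don't-care (X). CONTROL and control_signals are hypothetical names.
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0")

X = None
CONTROL = {
    0:  (1, 0, 0, 1, 0, 0, 0, 1, 0),   # R-format
    35: (0, 1, 1, 1, 1, 0, 0, 0, 0),   # lw
    43: (X, 1, X, 0, 0, 1, 0, 0, 0),   # sw
    4:  (X, 0, X, 0, 0, 0, 1, 0, 1),   # beq
}

def control_signals(instruction):
    opcode = (instruction >> 26) & 0x3F   # instruction bits 31-26
    return dict(zip(SIGNALS, CONTROL[opcode]))

lw = control_signals(0x8E480020)   # encoding of lw $t0, 32($s2)
print(lw["MemRead"], lw["RegWrite"], lw["ALUSrc"])
```

A PLA implements the same table in hardware: one product term per instruction class, ORed into each output.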

Control 2

PLA example implementation

[Figure: PLA with inputs Op5, Op4, Op3, Op2, Op1, Op0; intermediate terms R-format, lw, sw, beq; outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0]

Single Cycle Implementation

Calculate cycle time assuming negligible delays except:
  memory (2ns), ALU and adders (2ns), register file access (1ns)

[Figure: complete single-cycle datapath with the control signals PCSrc, RegDst, RegWrite, ALUSrc, ALUOp, MemWrite, MemRead and MemtoReg]

Single Cycle Implementation

Memory (2ns), ALU & adders (2ns), reg. file access (1ns)
Fixed length clock: longest instruction is the lw which requires 8 ns
Variable clock length (not realistic, just as exercise):

  R-instr:  6 ns
  Load:     8 ns
  Store:    7 ns
  Branch:   5 ns
  Jump:     2 ns

Average depends on instruction mix
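The per-instruction times follow from adding up the units each instruction class passes through. The sketch below derives them from the three delays; the instruction mix at the end is a made-up illustration and not taken from the slides.

```python
# Sketch: derive per-instruction latency from the unit delays on the slide.
# The instruction mix below is a made-up illustration, not from the slides.
MEM, ALU, REG = 2, 2, 1   # ns

LATENCY = {
    "R-type": MEM + REG + ALU + REG,        # fetch, reg read, ALU, reg write
    "load":   MEM + REG + ALU + MEM + REG,  # fetch, reg read, address, data read, reg write
    "store":  MEM + REG + ALU + MEM,        # fetch, reg read, address, data write
    "branch": MEM + REG + ALU,              # fetch, reg read, compare
    "jump":   MEM,                          # fetch only
}

def average_time(mix):
    """Weighted average instruction time for a variable-length clock."""
    return sum(LATENCY[kind] * share for kind, share in mix.items())

fixed_clock = max(LATENCY.values())   # a fixed clock must fit the slowest instruction
mix = {"R-type": 0.5, "load": 0.25, "store": 0.125, "branch": 0.125, "jump": 0.0}  # hypothetical
print(LATENCY)
print(fixed_clock, average_time(mix))
```

With this hypothetical mix the variable clock averages 6.5 ns against 8 ns for the fixed clock; a different mix gives a different ratio.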

Where we are headed

Single Cycle Problems:
  what if we had a more complicated instruction like floating point?
  wasteful of area: NO Sharing of Hardware resources

One Solution:
  use a smaller cycle time
  have different instructions take different numbers of cycles
  a multicycle datapath:

[Figure: multicycle datapath with PC, a single Memory for instructions and data, Instruction register (IR), Memory data register (MDR), Registers, registers A and B, ALU and ALUOut]

Multicycle Approach

We will be reusing functional units
  ALU used to compute address and to increment PC
  Memory used for instruction and data

Add registers after every major functional unit
Our control signals will not be determined solely by instruction
  e.g., what should the ALU do for a subtract instruction?

We'll use a finite state machine (FSM) or microcode for control

Review: finite state machines

Finite state machines:
  a set of states and
  next state function (determined by current state and the input)
  output function (determined by current state and possibly input)

[Figure: Current state and Inputs feed the Next-state function, which produces the Next state on each Clock; the Output function produces the Outputs]

We'll use a Moore machine (output based only on current state)

Multicycle Approach

Break up the instructions into steps, each step takes a cycle
  balance the amount of work to be done
  restrict each cycle to use only one major functional unit

At the end of a cycle
  store values for use in later cycles (easiest thing to do)
  introduce additional internal registers

Notice: we distinguish
  processor state: programmer visible registers
  internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout)

Multicycle Approach
[Figure: multicycle datapath. PC and ALUOut are multiplexed onto the Memory address; Memory data goes to the Instruction register (fields [25-21], [20-16], [15-11], [15-0]) and the Memory data register; Registers feed A and B; the first ALU input is selected from PC and A, the second from B, the constant 4, the sign-extended Instruction [15-0] and its shifted-left-by-2 version; the ALU result is held in ALUOut.]

Multicycle Approach

Note that previous picture does not include:
  branch support
  jump support
  Control lines and logic

Tclock > max (ALU delay, Memory access, Regfile access)
See book for complete picture

Five Execution Steps

1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Step 1: Instruction Fetch


Use PC to get instruction and put it in the Instruction Register
Increment the PC by 4 and put the result back in the PC
Can be described succinctly using RTL "Register-Transfer Language"

  IR = Memory[PC];
  PC = PC + 4;

Can we figure out the values of the control signals?
What is the advantage of updating the PC now?

Step 2: Instruction Decode and Register Fetch


Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a branch
Previous two actions are done optimistically!!
RTL:

  A = Reg[IR[25-21]];
  B = Reg[IR[20-16]];
  ALUOut = PC + (sign-extend(IR[15-0]) << 2);

We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Step 3 (instruction dependent)

ALU is performing one of four functions, based on instruction type Memory Reference:

ALUOut = A + sign-extend(IR[15-0]);

R-type: ALUOut = A op B;

Branch: if (A==B) PC = ALUOut;

Jump: PC = PC[31-28] || (IR[25-0]<<2)


Step 4 (R-type or Memory-access)

Loads and stores access memory

  MDR = Memory[ALUOut];
    or
  Memory[ALUOut] = B;

R-type instructions finish

  Reg[IR[15-11]] = ALUOut;

The write actually takes place at the end of the cycle on the edge

Write-back step

Memory read completion step

Reg[IR[20-16]]= MDR;
What about all the other instructions?


Summary execution steps


Steps taken to execute any instruction class

Instruction fetch (all classes)
  IR = Memory[PC]
  PC = PC + 4

Instruction decode/register fetch (all classes)
  A = Reg[IR[25-21]]
  B = Reg[IR[20-16]]
  ALUOut = PC + (sign-extend(IR[15-0]) << 2)

Execution, address computation, branch/jump completion
  R-type:            ALUOut = A op B
  memory-reference:  ALUOut = A + sign-extend(IR[15-0])
  branches:          if (A == B) then PC = ALUOut
  jumps:             PC = PC[31-28] || (IR[25-0] << 2)

Memory access or R-type completion
  R-type:            Reg[IR[15-11]] = ALUOut
  memory-reference:  Load: MDR = Memory[ALUOut] or Store: Memory[ALUOut] = B

Memory read completion
  memory-reference:  Load: Reg[IR[20-16]] = MDR

Simple Questions

How many cycles will it take to execute this code?

    lw  $t2, 0($t3)
    lw  $t3, 4($t3)
    beq $t2, $t3, L1     #assume not taken
    add $t5, $t2, $t3
    sw  $t5, 8($t3)
L1: ...

What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
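Questions like these reduce to adding up the number of steps per instruction class. The sketch below assumes the step counts implied by the five execution steps (load 5, store and R-type 4, branch and jump 3); `instruction_at_cycle` is a hypothetical helper.

```python
# Sketch: count multicycle clock cycles per instruction class.
# Cycle counts follow the five execution steps (3 to 5 steps per instruction).
CYCLES = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3, "j": 3}

program = ["lw", "lw", "beq", "R-type", "sw"]   # the five instructions, beq not taken

def instruction_at_cycle(program, cycle):
    """Return (index, step) of the instruction active in the given 1-based cycle."""
    start = 0
    for index, kind in enumerate(program):
        if cycle <= start + CYCLES[kind]:
            return index, cycle - start
        start += CYCLES[kind]
    raise ValueError("cycle is past the end of the program")

total = sum(CYCLES[kind] for kind in program)
print(total)
print(instruction_at_cycle(program, 8))   # (instruction index, step number)
```

The returned step number can be looked up in the list of execution steps to see which functional unit is busy in that cycle.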

Implementing the Control

Value of control signals is dependent upon:


what instruction is being executed which step is being performed

Use the information we have accumulated to specify a finite state machine (FSM)

specify the finite state machine graphically, or use microprogramming

Implementation can be derived from specification


Graphical Specification of FSM


[Figure: FSM state diagram; the states and the control signals they assert are listed below]

State 0, Instruction fetch (Start): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode/register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (Op = 'LW'): MemRead, IorD = 1
State 4, Write-back step: RegDst = 0, RegWrite, MemtoReg = 1
State 5, Memory access: MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemtoReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion (Op = 'J'): PCWrite, PCSource = 10

How many state bits will we need?

Finite State Machine for Control


Implementation:

[Figure: a block of Control logic with a State register]
Inputs: Op5, Op4, Op3, Op2, Op1, Op0 (instruction register opcode field) and S3, S2, S1, S0 (state register)
Outputs: PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, and the next state NS3, NS2, NS1, NS0

PLA Implementation
(see book)

[Figure: PLA with inputs Op5-Op0 (opcode) and S3-S0 (current state); outputs PCWrite, PCWriteCond, IorD, MemRead, MemWrite, IRWrite, MemtoReg, PCSource1, PCSource0, ALUOp1, ALUOp0, ALUSrcB1, ALUSrcB0, ALUSrcA, RegWrite, RegDst (datapath control) and NS3-NS0 (next state)]

If I picked a horizontal or vertical line could you explain it?
What type of FSM is used? Mealy or Moore?

Pipelined implementation

Pipelining
Pipelined datapath
Pipelined control
Hazards:
  Structural
  Data
  Control
  Exceptions
Scheduling

For details see the book (chapter 6)

Pipelining
Improve performance by increasing instruction throughput
[Figure: execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0) over time. Without pipelining each instruction goes through Instruction fetch, Reg, ALU, Data access and Reg before the next one starts, so instructions start 8 ns apart. With pipelining the stages overlap, each stage takes 2 ns, and a new instruction starts every 2 ns.]

Pipelining

Ideal speedup = number of stages Do we achieve this?
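The timing of the three-lw example gives a quick check on the ideal speedup. The sketch below uses 8 ns per instruction without pipelining and five stages of 2 ns with pipelining; the function names are hypothetical.

```python
# Sketch: total execution time with and without pipelining.
# 8 ns per instruction (single cycle) versus five 2 ns stages (pipelined).

def nonpipelined_time(n, instr_ns=8):
    return n * instr_ns

def pipelined_time(n, stages=5, stage_ns=2):
    """The first instruction fills the pipeline, then one completes per stage time."""
    return (stages + n - 1) * stage_ns

for n in (3, 1000):
    t_single, t_pipe = nonpipelined_time(n), pipelined_time(n)
    print(n, t_single, t_pipe, round(t_single / t_pipe, 2))
```

For long instruction sequences the ratio approaches 8 / 2 = 4 rather than 5, because the stages are not perfectly balanced: the register file needs only 1 ns but still occupies a 2 ns stage.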


Pipelining

What makes it easy
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores

What makes it hard?
  structural hazards: suppose we had only one memory
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction

We'll build a simple pipeline and look at these issues
We'll talk about modern processors and what really makes it hard:
  exception handling
  trying to improve performance with out-of-order execution, etc.

Basic idea: start from single cycle impl.


What do we need to add to actually split the datapath into stages?

[Figure: single-cycle datapath divided into five stages]
  IF: Instruction fetch
  ID: Instruction decode/register file read
  EX: Execute/address calculation
  MEM: Memory access
  WB: Write back

Pipelined Datapath
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?

[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages]

Corrected Datapath
[Figure: corrected pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]

Graphically Representing Pipelines


Time (in clock cycles)        CC 1  CC 2  CC 3  CC 4  CC 5  CC 6
Program execution order (in instructions)
lw  $10, 20($1)               IM    Reg   ALU   DM    Reg
sub $11, $2, $3                     IM    Reg   ALU   DM    Reg

Can help with answering questions like:
  how many cycles does it take to execute this code?
  what is the ALU doing during cycle 4?
  use this representation to help understand datapaths

Pipeline Control
[Figure: pipelined datapath (pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB) with the control signals PCSrc, RegWrite, Branch, MemWrite, ALUSrc, MemtoReg, MemRead, RegDst and ALUOp (via ALU control) attached to their stages; Instruction [15-0] is sign-extended from 16 to 32 bits, and Instruction [20-16] or [15-11] selects the write register]

Pipeline control

We have 5 stages. What needs to be controlled in each stage?
  Instruction Fetch and PC Increment
  Instruction Decode / Register Fetch
  Execution
  Memory Stage
  Write Back

How would control be handled in an automobile plant?
  a fancy control center telling everyone what to do?
  should we use a finite state machine?

Pipeline Control
             Execution/Address Calculation       Memory access stage         Write-back stage
             stage control lines                 control lines               control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc      Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format     1       1       0       0           0       0        0          1         0
lw           0       0       0       1           0       1        0          1         1
sw           X       0       0       1           0       0        1          0         X
beq          X       0       1       0           1       0        0          0         X

Pass control signals along just like the data:
(compare single cycle control!)

[Figure: Control derives the EX, M and WB signal groups from the instruction in IF/ID; ID/EX holds WB, M and EX; EX/MEM holds WB and M; MEM/WB holds WB]

Datapath with Control


[Figure: pipelined datapath with control. The Control unit produces the WB, M and EX signal groups, which travel through the ID/EX, EX/MEM and MEM/WB pipeline registers and drive PCSrc, RegWrite, Branch, MemWrite, MemRead, ALUSrc, ALUOp (via ALU control), RegDst and MemtoReg in their stages.]

Hazards: problems due to pipelining


Hazard types:
Structural
  same resource is needed multiple times in the same cycle
Data
  data dependencies limit pipelining
Control
  next executed instruction may not be the next specified instruction

Structural hazards
Examples:
  Two accesses to a single ported memory
  Two operations need the same function unit at the same time
  Two operations need the same function unit in successive cycles, but the unit is not pipelined
Solutions:
  stalling
  add more hardware

Structural hazards on MIPS


Q: Do we have structural hazards on our simple MIPS pipeline?
time -->

IF   ID   EX   MEM  WB
     IF   ID   EX   MEM  WB
          IF   ID   EX   MEM  WB
               IF   ID   EX   MEM  WB
                    IF   ID   EX   MEM  WB

Data hazards

Data dependencies:
RaW (read-after-write)
WaW (write-after-write)
WaR (write-after-read)

Hardware solution:
Forwarding / Bypassing
Detection logic
Stalling

Software solution:
Scheduling

Data dependences
Three types: RaW, WaR and WaW

add r1, r2, 5     ; r1 := r2+5
sub r4, r1, r3    ; RaW of r1

add r1, r2, 5
sub r2, r4, 1     ; WaR of r2

add r1, r2, 5
sub r1, r1, 1     ; WaW of r1

st  r1, 5(r2)     ; M[r2+5] := r1
ld  r5, 0(r4)     ; RaW if 5+r2 = 0+r4

WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom!
Problems for your compiler and Pentium!
use register renaming to solve this!
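The three dependence types can be checked mechanically. The sketch below is a minimal illustration: the `(dest, sources)` tuple encoding and the function name `dependences` are assumptions made here, and it looks at register dependences only (the memory case `st`/`ld` needs address comparison and is left out).

```python
def dependences(first, second):
    """Return the register dependences of `second` on an earlier `first`.

    Each instruction is a (dest, sources) pair; dest is None when the
    instruction writes no register.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("RaW")  # second reads what first writes
    if dest2 is not None and dest2 in sources1:
        found.add("WaR")  # second overwrites what first reads
    if dest1 is not None and dest1 == dest2:
        found.add("WaW")  # both write the same register
    return found


add = ("r1", ["r2"])                    # add r1, r2, 5
print(dependences(add, ("r4", ["r1", "r3"])))  # sub r4, r1, r3
print(dependences(add, ("r2", ["r4"])))        # sub r2, r4, 1
print(dependences(add, ("r1", ["r1"])))        # sub r1, r1, 1
```

For the third pair the function reports both WaW and RaW, because `sub r1, r1, 1` also reads r1.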

RaW on MIPS pipeline


Time (in clock cycles)    CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8    CC9
Value of register $2:     10     10     10     10     10/20  20     20     20     20

Program execution order (in instructions)
sub $2, $1, $3            IM     Reg    ALU    DM     Reg
and $12, $2, $5                  IM     Reg    ALU    DM     Reg
or $13, $6, $2                          IM     Reg    ALU    DM     Reg
add $14, $2, $2                                IM     Reg    ALU    DM     Reg
sw $15, 100($2)                                       IM     Reg    ALU    DM     Reg

Forwarding
Use temporary results, don't wait for them to be written

register file forwarding to handle read/write to same register
ALU forwarding

Time (in clock cycles)    CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8    CC9
Value of register $2:     10     10     10     10     10/20  20     20     20     20
Value of EX/MEM:          X      X      X      20     X      X      X      X      X
Value of MEM/WB:          X      X      X      X      20     X      X      X      X

Program execution order (in instructions)
sub $2, $1, $3            IM     Reg    ALU    DM     Reg
and $12, $2, $5                  IM     Reg    ALU    DM     Reg
or $13, $6, $2                          IM     Reg    ALU    DM     Reg
add $14, $2, $2                                IM     Reg    ALU    DM     Reg
sw $15, 100($2)                                       IM     Reg    ALU    DM     Reg

What if this $2 was $13?

Forwarding hardware
ALU forwarding circuitry principle:

[Figure: ALU with both inputs coming from the register file and its result going to the register file; bypass paths feed the result back to the ALU inputs.]

Note: there are two options
buf - ALU - bypass mux - buf
buf - bypass mux - ALU - buf

Forwarding
[Figure: pipelined datapath with a forwarding unit. The Rs and Rt fields (IF/ID.RegisterRs, IF/ID.RegisterRt) are carried into ID/EX. The forwarding unit compares them with EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ForwardA and ForwardB multiplexers at the ALU inputs.]

Forwarding check

Check for matching register-ids: For each source-id of operation in the EX-stage check if there is a matching pending dest-id

Example:
if (EX/MEM.RegWrite) and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs) then ForwardA = 10

Q. How many comparators do we need?
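The check above can be sketched as a small function. The EX/MEM case with select value 10 is taken from the example; the MEM/WB case with select value 01 and the 00 default follow the textbook's convention and should be read as assumptions here, as should the dictionary field names.

```python
def forward_select(source_reg, ex_mem, mem_wb):
    """Return the 2-bit mux select for one ALU operand."""
    # Most recent result first: the instruction now in MEM.
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == source_reg:
        return 0b10
    # Otherwise the instruction now in WB (assumed encoding 01).
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == source_reg:
        return 0b01
    return 0b00  # no forwarding: use the value read from the register file


def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction in the EX stage."""
    return (forward_select(id_ex["Rs"], ex_mem, mem_wb),
            forward_select(id_ex["Rt"], ex_mem, mem_wb))


# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM.
id_ex = {"Rs": 2, "Rt": 5}
ex_mem = {"RegWrite": 1, "Rd": 2}
mem_wb = {"RegWrite": 0, "Rd": 0}
print(forwarding_unit(id_ex, ex_mem, mem_wb))
```

Testing EX/MEM before MEM/WB makes the newest value win when both pending instructions write the same register. The test against register 0 prevents forwarding a "result" for $zero.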



Can't always forward

Load word can still cause a hazard:


an instruction tries to read register r following a load to the same r
Need a hazard detection unit to stall the instruction that follows the load

Time (in clock cycles)    CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8    CC9

Program execution order (in instructions)
lw $2, 20($1)             IM     Reg    ALU    DM     Reg
and $4, $2, $5                   IM     Reg    ALU    DM     Reg
or $8, $2, $6                           IM     Reg    ALU    DM     Reg
add $9, $4, $2                                 IM     Reg    ALU    DM     Reg
slt $1, $6, $7                                        IM     Reg    ALU    DM     Reg

Stalling
We can stall the pipeline by keeping an instruction in the same stage
Time (in clock cycles)    CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8    CC9    CC10

Program execution order (in instructions)
lw $2, 20($1)             IM     Reg    ALU    DM     Reg
and $4, $2, $5                   IM     Reg    Reg    ALU    DM     Reg
or $8, $2, $6                           IM     IM     Reg    ALU    DM     Reg
add $9, $4, $2                                        IM     Reg    ALU    DM     Reg
slt $1, $6, $7                                               IM     Reg    ALU    DM     Reg

In CC4 the ALU is not used (bubble); Reg and IM are redone

Hazard Detection Unit


[Figure: pipelined datapath with forwarding unit and hazard detection unit. The hazard detection unit reads ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt. It drives PCWrite and IF/IDWrite, and a multiplexer that replaces the control signals entering ID/EX by 0.]
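The load-use check performed by the hazard detection unit can be sketched as follows. The condition is the textbook one, built from the signals named in the figure; the dictionary keys and the output name `ControlZero` (for the multiplexer that zeroes the control signals) are hypothetical names chosen for this illustration.

```python
def load_use_hazard(id_ex, if_id):
    """True when the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex["MemRead"]
                and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"]))


def hazard_detection_unit(id_ex, if_id):
    """Return the three outputs used to stall the pipeline for one cycle."""
    stall = load_use_hazard(id_ex, if_id)
    return {
        "PCWrite": 0 if stall else 1,      # 0: keep the PC, IF is redone
        "IFIDWrite": 0 if stall else 1,    # 0: keep IF/ID, ID is redone
        "ControlZero": 1 if stall else 0,  # 1: insert a bubble into ID/EX
    }


# lw $2, 20($1) in EX, and $4, $2, $5 in ID.
print(hazard_detection_unit({"MemRead": 1, "Rt": 2}, {"Rs": 2, "Rt": 5}))
```

Holding PC and IF/ID repeats the IM and Reg steps of the two younger instructions, and the zeroed control signals form the bubble that travels down the pipeline.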

Software only solution?


Have compiler guarantee that no hazards occur
Example: where do we insert the NOPs?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $13, 100($2)

sub $2, $1, $3
nop
nop
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
nop
sw  $13, 100($2)

Problem: this really slows us down!
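The NOP placement in the example can be reproduced with a short scheduling pass. This sketch assumes a pipeline without forwarding in which a register written in WB can be read in ID during the same cycle, so a consumer must sit at least three instruction slots after its producer. The constant `MIN_DISTANCE`, the tuple encoding and the function name are assumptions made for this illustration.

```python
MIN_DISTANCE = 3  # assumed: producer in WB while consumer is in ID


def insert_nops(program):
    """Insert nops so every register read is MIN_DISTANCE slots after its write.

    Each instruction is (text, dest, sources); dest is None for stores.
    """
    out = []
    last_write = {}  # register -> index in `out` of its most recent writer
    for text, dest, sources in program:
        earliest = max((last_write[r] + MIN_DISTANCE
                        for r in sources if r in last_write), default=0)
        while len(out) < earliest:
            out.append("nop")
        if dest is not None:
            last_write[dest] = len(out)
        out.append(text)
    return out


PROGRAM = [
    ("sub $2, $1, $3", "$2", ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2", "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $13, 100($2)", None, ["$13", "$2"]),
]
print("\n".join(insert_nops(PROGRAM)))
```

Under these assumptions the pass places two nops after `sub` and one before `sw`, which stores the `$13` written by `or` two slots earlier.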

Control hazards

Control operations may change the sequential flow of instructions


branch
jump
call (jump and link)
return
(exception/interrupt and rti / return from interrupt)

Control hazard: Branch


Branch actions:
Compute new address
Determine condition
Perform the actual branch (if taken): PC := new address

Branch example
Time (in clock cycles)    CC1    CC2    CC3    CC4    CC5    CC6    CC7    CC8    CC9

Program execution order (in instructions)
40 beq $1, $3, 7          IM     Reg    ALU    DM     Reg
44 and $12, $2, $5               IM     Reg    ALU    DM     Reg
48 or $13, $6, $2                       IM     Reg    ALU    DM     Reg
52 add $14, $2, $2                             IM     Reg    ALU    DM     Reg
72 lw $4, 50($7)                                      IM     Reg    ALU    DM     Reg

Branching
Squash pipeline:
When we decide to branch, other instructions are in the pipeline!
We are predicting branch not taken
need to add hardware for flushing instructions if we are wrong

Branch with predict not taken

Clock cycles -->

Branch L             IF   ID   EX   MEM  WB
Predict not taken         IF   ID   EX   MEM  WB
                               IF   ID   EX   MEM  WB
                                    IF   ID   EX   MEM  WB
L:                                       IF   ID   EX   MEM  WB

Branch speedup

Earlier address computation
Earlier condition calculation
Put both in the ID pipeline stage:
adder
comparator

Clock cycles -->

Branch L             IF   ID   EX   MEM  WB
Predict not taken         IF   ID   EX   MEM  WB
L:                             IF   ID   EX   MEM  WB

Improved branching / flushing IF/ID


[Figure: pipelined datapath with the branch address adder (Sign extend, Shift left 2) and the comparator moved to the ID stage, an IF.Flush line that clears the IF/ID register, and the hazard detection and forwarding units.]

Exception support
Types of exceptions:
Overflow
I/O device request
Operating system call
Undefined instruction
Hardware malfunction
Page fault

Precise exception:
finish previous instructions (which are still in the pipeline)
flush excepting and following instructions, redo them after handling the exception(s)

Exceptions
Changes needed for handling overflow exception of an operation in EX stage (see book for details):

Extend PC input mux with extra entry with fixed address
Add EPC register recording the ID/EX stage PC
  this is the address of the next instruction!
Cause register recording exception type

E.g., in case of overflow exception insert 3 bubbles; flush the following stages:
IF/ID stage
ID/EX stage
EX/MEM stage

Scheduling, why?
Let's look at the execution time:
Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle

Scheduling may reduce Texecution

Reduce CPI (cycles per instruction)
  early scheduling of long latency operations
  avoid pipeline stalls due to structural, data and control hazards
  allow Nissue > 1 and therefore CPI < 1
Reduce Ninstructions
  compact many operations into each instruction (VLIW)
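A quick numeric check of the formula may help. All numbers below are made up for illustration (one million instructions, a 2 ns cycle, and a CPI that drops from 1.5 to 1.2 after scheduling); only the formula itself comes from the slide.

```python
def execution_time(n_instructions, cpi, t_cycle):
    """Texecution = Ninstructions x CPI x Tcycle."""
    return n_instructions * cpi * t_cycle


# Hypothetical program and machine.
N_INSTRUCTIONS = 1_000_000
T_CYCLE_NS = 2

before = execution_time(N_INSTRUCTIONS, 1.5, T_CYCLE_NS)  # unscheduled
after = execution_time(N_INSTRUCTIONS, 1.2, T_CYCLE_NS)   # fewer stalls
print(f"before: {before / 1e6:.1f} ms, after: {after / 1e6:.1f} ms, "
      f"speedup: {before / after:.2f}")
```

With Ninstructions and Tcycle unchanged, the speedup equals the ratio of the two CPI values.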

Scheduling data hazards: example 1


Try and avoid RaW stalls (in this case load interlocks)!
E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

into:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t0, 4($t1)
sw $t2, 0($t1)

Scheduling data hazards example 2


Avoiding RaW stalls:
Reordering instructions for following program (by you or the compiler)

Code:
a = b + c
d = e - f

Unscheduled code:
Lw  R1,b
Lw  R2,c
Add R3,R1,R2     interlock
Sw  a,R3
Lw  R1,e
Lw  R2,f
Sub R4,R1,R2     interlock
Sw  d,R4

Scheduled code:
Lw  R1,b
Lw  R2,c
Lw  R5,e         extra reg. needed!
Add R3,R1,R2
Lw  R2,f
Sw  a,R3
Sub R4,R5,R2
Sw  d,R4
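The effect of the reordering can be checked by counting load interlocks. The sketch assumes a pipeline with forwarding, where only an instruction that immediately follows a load and reads the loaded register costs one stall cycle. The tuple encoding `(op, dest, sources)` and the function name are assumptions made for this illustration.

```python
def count_load_interlocks(program):
    """Count instructions that directly follow a load and read its result."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        op, dest, _ = previous
        if op == "lw" and dest in current[2]:
            stalls += 1
    return stalls


UNSCHEDULED = [
    ("lw", "R1", []), ("lw", "R2", []),
    ("add", "R3", ["R1", "R2"]), ("sw", None, ["R3"]),
    ("lw", "R1", []), ("lw", "R2", []),
    ("sub", "R4", ["R1", "R2"]), ("sw", None, ["R4"]),
]
SCHEDULED = [
    ("lw", "R1", []), ("lw", "R2", []), ("lw", "R5", []),
    ("add", "R3", ["R1", "R2"]), ("lw", "R2", []),
    ("sw", None, ["R3"]), ("sub", "R4", ["R5", "R2"]), ("sw", None, ["R4"]),
]
print(count_load_interlocks(UNSCHEDULED), count_load_interlocks(SCHEDULED))
```

Both versions contain eight instructions; the scheduled one pays for the removed interlocks with the extra register R5.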

Scheduling control hazards


Texecution = Ninstructions x CPI x Tcycle
CPI = CPIideal + fbranch x Pbranch
Pbranch = Ndelayslots x miss_rate

Modern processors tend to have large branch penalty, Pbranch, due to:
many pipeline stages
multi-issue

Note that penalties have larger effect when CPIideal is low
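The last remark can be made concrete with the two formulas above. The numbers are invented for illustration: 20% branches, 3 delay slots and a miss rate of 0.5, applied once to a single-issue machine (CPIideal = 1) and once to a hypothetical 4-issue machine (CPIideal = 0.25).

```python
def branch_penalty(n_delay_slots, miss_rate):
    """Pbranch = Ndelayslots x miss_rate."""
    return n_delay_slots * miss_rate


def cpi(cpi_ideal, f_branch, p_branch):
    """CPI = CPIideal + fbranch x Pbranch."""
    return cpi_ideal + f_branch * p_branch


P_BRANCH = branch_penalty(n_delay_slots=3, miss_rate=0.5)  # 1.5 cycles

for cpi_ideal in (1.0, 0.25):
    actual = cpi(cpi_ideal, f_branch=0.2, p_branch=P_BRANCH)
    print(f"CPIideal={cpi_ideal}: CPI={actual:.2f}, "
          f"slowdown={actual / cpi_ideal:.2f}x")
```

The same absolute penalty of 0.3 cycles per instruction is a 1.3x slowdown at CPIideal = 1 but a 2.2x slowdown at CPIideal = 0.25.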

Scheduling control hazards


What can we do about control hazards and CPI penalty?

Keep penalty Pbranch low:
  Early computation of new PC
  Early determination of condition
  Visible branch delay slots filled by compiler (MIPS)
Branch prediction
Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro95]
Remove branches: if-conversion
  Conditional instructions: CMOVE, cond skip next
  Guarding all instructions: TriMedia

Branch delay slot

Add a branch delay slot:


the next instruction after a branch is always executed
rely on compiler to fill the slot with something useful

Is this a good idea?

let's look how it works


Branch delay slot scheduling


Q. What to put in the delay slot?

        op 1
        beq r1,r2, L
        .............
        op 2              'fall-through'
        .............
L:      op 3              branch target
        .............

Summary

Modern processors are (deeply) pipelined, to reduce Tcycle and aim at CPI = 1
Hazards increase CPI
Several software and hardware measures are taken to avoid or reduce hazards

Not discussed, but important developments:
Multi-issue further reduces CPI
Branch prediction to avoid high branch penalties
Dynamic scheduling

In all cases: a scheduling compiler is needed

Recap of MIPS

RISC architecture
Register space
Addressing
Instruction format
Pipelining

Why RISC? Keep it simple


RISC characteristics:

Reduced number of instructions
Limited addressing modes
  load-store architecture
  enables pipelining
Large register set
  uniform (no distinction between e.g. address and data registers)
Limited number of instruction sizes (preferably one)
  know directly where the following instruction starts
Limited number of instruction formats
Memory alignment restrictions
......
Based on quantitative analysis
  "the famous MIPS one percent rule": don't even think about it when it's not used more than one percent

Register space
32 integer (and 32 floating point) registers of 32-bit

Name       Register number   Usage
$zero      0                 the constant value 0
$v0-$v1    2-3               values for results and expression evaluation
$a0-$a3    4-7               arguments
$t0-$t7    8-15              temporaries
$s0-$s7    16-23             saved (by callee)
$t8-$t9    24-25             more temporaries
$gp        28                global pointer
$sp        29                stack pointer
$fp        30                frame pointer
$ra        31                return address

Addressing

1. Immediate addressing
   op | rs | rt | Immediate
2. Register addressing
   op | rs | rt | rd | ... | funct     operand: Register (in Registers)
3. Base addressing
   op | rs | rt | Address              Register + Address -> Memory (Byte, Halfword or Word)
4. PC-relative addressing
   op | rs | rt | Address              PC + Address -> Memory (Word)
5. Pseudodirect addressing
   op | Address                        PC : Address -> Memory (Word)
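How the memory-addressing modes form an address can be sketched as below. The details not visible in the figure (sign extension of the 16-bit field, word offsets shifted left by 2, PC + 4 as the base, and the upper 4 PC bits in pseudodirect addressing) follow the standard MIPS definition. The function names are chosen for this illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of `value` as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def base_address(register_value, imm16):
    """Base addressing (lw/sw): register + sign-extended offset."""
    return (register_value + sign_extend16(imm16)) & MASK32


def pc_relative_address(pc, imm16):
    """PC-relative addressing (beq/bne): (PC + 4) + word offset * 4."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32


def pseudodirect_address(pc, addr26):
    """Pseudodirect addressing (j): upper 4 bits of PC + 4, then addr26 * 4."""
    return ((pc + 4) & 0xF0000000) | ((addr26 & 0x03FFFFFF) << 2)


# The branch example earlier: beq at address 40 with offset 7.
print(pc_relative_address(40, 7))
```

For the earlier branch example this yields address 72, the location of the `lw` that follows the taken branch.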

Instruction format
R    op | rs | rt | rd | shamt | funct
I    op | rs | rt | 16 bit address
J    op | 26 bit address

Example instructions

Instruction           Meaning
add $s1,$s2,$s3       $s1 = $s2 + $s3
addi $s2,$s3,4        $s2 = $s3 + 4
lw $s1,100($s2)       $s1 = Memory[$s2+100]
bne $s4,$s5,L         if $s4<>$s5 goto L
j Label               goto Label
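The three formats can be turned into 32-bit words with a few shifts. The field widths are the standard MIPS ones (6-bit op, 5-bit rs, rt, rd and shamt, 6-bit funct) and the opcode and funct values for add, addi and lw are the standard MIPS encodings; the function names are assumptions made for this sketch. Register numbers come from the register table ($s0-$s7 are 16-23).

```python
REG = {"$s1": 17, "$s2": 18, "$s3": 19, "$s4": 20, "$s5": 21}


def encode_r(op, rs, rt, rd, shamt, funct):
    """R format: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(op, rs, rt, imm16):
    """I format: op(6) rs(5) rt(5) 16-bit immediate or address."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm16 & 0xFFFF)


def encode_j(op, addr26):
    """J format: op(6) 26-bit address."""
    return (op << 26) | (addr26 & 0x03FFFFFF)


# add $s1,$s2,$s3: op 0, funct 0x20; rd is the destination.
add_word = encode_r(0, REG["$s2"], REG["$s3"], REG["$s1"], 0, 0x20)
# addi $s2,$s3,4: op 8; rt is the destination.
addi_word = encode_i(8, REG["$s3"], REG["$s2"], 4)
# lw $s1,100($s2): op 35; rs is the base register.
lw_word = encode_i(35, REG["$s2"], REG["$s1"], 100)
print(f"{add_word:08x} {addi_word:08x} {lw_word:08x}")
```

Note that the destination register sits in the rd field for the R format but in the rt field for the I format.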

Pipelining
All integer instructions fit into the following pipeline
time -->

IF   ID   EX   MEM  WB
     IF   ID   EX   MEM  WB
          IF   ID   EX   MEM  WB
               IF   ID   EX   MEM  WB
                    IF   ID   EX   MEM  WB

Other architecture styles

Accumulator architecture
  one operand (in register or memory), accumulator almost always implicitly used
Stack
  zero operand: all operands implicit (on TOS)
Register (load store)
  three operands, all in registers
  loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode)
Register-Memory
  two operands, one in memory
Memory-Memory
  three operands, may be all in memory

(there are more varieties / combinations)

Accumulator architecture
[Figure: accumulator datapath with Accumulator, ALU, latches, registers, address and Memory.]

Example code: a = b+c;

load b;     // accumulator is implicit operand
add c;
store a;
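The example code can be executed by a tiny interpreter. This is a minimal sketch with only the three operations used above; the function name, the tuple encoding and the sample values for b and c are assumptions made for this illustration.

```python
def run_accumulator(program, memory):
    """Run (operation, variable) pairs on a one-accumulator machine."""
    accumulator = 0
    for operation, name in program:
        if operation == "load":
            accumulator = memory[name]
        elif operation == "add":
            accumulator += memory[name]   # accumulator is the implicit operand
        elif operation == "store":
            memory[name] = accumulator
        else:
            raise ValueError(f"unknown operation: {operation}")
    return memory


memory = run_accumulator(
    [("load", "b"), ("add", "c"), ("store", "a")],
    {"b": 2, "c": 3},
)
print(memory["a"])
```

Every instruction names a single memory operand; the accumulator is never mentioned explicitly.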

Stack architecture
[Figure: stack datapath with ALU, latches, top of stack, stack pointer and Memory.]

Example code: a = b+c;

push b;
push c;
add;
pop a;

stack (top first):
after push b:   b
after push c:   c, b
after add:      b+c
after pop a:    (empty)
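The stack contents after each instruction can be reproduced with a small interpreter. This is a minimal sketch with only the three operations used above; the function name, the tuple encoding and the sample values for b and c are assumptions made for this illustration. The Python list keeps the top of stack at its end.

```python
def run_stack(program, memory):
    """Run a zero-operand stack program; return the stack after each step."""
    stack = []
    trace = []
    for instruction in program:
        operation = instruction[0]
        if operation == "push":
            stack.append(memory[instruction[1]])
        elif operation == "add":
            right = stack.pop()           # both operands are implicit (on TOS)
            left = stack.pop()
            stack.append(left + right)
        elif operation == "pop":
            memory[instruction[1]] = stack.pop()
        else:
            raise ValueError(f"unknown operation: {operation}")
        trace.append(list(stack))
    return trace


memory = {"b": 2, "c": 3}
trace = run_stack([("push", "b"), ("push", "c"), ("add",), ("pop", "a")], memory)
print(trace, memory["a"])
```

Only `push` and `pop` name a memory location; `add` takes both operands from the stack and leaves the result there.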

Other architecture styles


Let's look at the code for C = A + B
Stack           Accumulator     Register-       Memory-         Register
Architecture    Architecture    Memory          Memory          (load-store)

Push A          Load A          Load r1,A       Add C,B,A       Load r1,A
Push B          Add B           Add r1,B                        Load r2,B
Add             Store C         Store C,r1                      Add r3,r1,r2
Pop C                                                           Store C,r3

Q: What are the advantages / disadvantages of load-store (RISC) architecture?