COA Module 2 Notes

08.503 Computer Organization and Architecture
Module 2
Design of Data path and Control (based on MIPS instruction set)
Basic MIPS Implementation: Consider the subset of the core MIPS instruction set:
The key principles used to create the datapath and design the control for other instructions are similar. The implementation ideas are common to general-purpose microprocessors, processors in high-performance servers, embedded processors, etc.
For every instruction, the first two steps of execution are identical:
1. Send the PC to the memory that contains the code and fetch the instruction from that memory.
2. Read one or two registers, using fields of the instruction to select the registers.
After these steps, the actions needed to complete the instruction depend on the instruction class. Even so, the actions for the memory-reference, arithmetic-logical and branch classes are largely the same. Every instruction class except jump uses the ALU. A high-level view of a MIPS implementation, focusing on the functional units and their interconnection, is shown below:
It shows that the value written to the PC can come from one of two adders, and that the data written into the register file can come from either the ALU or the data memory. Wherever a unit has more than one possible source, a multiplexer (also called a data selector) chooses one of the input lines; control lines determine which input is selected.
Department of ECE, VKCET
The datapath of the basic MIPS implementation, with the required multiplexers, is shown below:
Three multiplexers and a few control lines are required. A control unit takes the instruction as input and determines how to set the control lines for the functional units and two of the multiplexers. The third multiplexer determines whether PC+4 or the branch destination address is written into the PC. It is based on the Zero output of the ALU, which performs the comparison for the beq instruction.
This design approach is easy to understand but not practical, because it is slower than an implementation that allows different instruction classes to take different numbers of clock cycles. There are two design concepts: the single-cycle datapath and the multicycle datapath. In the single-cycle concept, separate instruction and data memories are required, because:
Logic design conventions: The functional units of the MIPS implementation consist of two types of logic elements:
1. Combinational elements: They operate on data values, and their outputs depend only on the current inputs. They have no internal storage. Adders, the ALU and multiplexers are examples.
2. State elements: They contain state in some internal storage and preserve the value stored earlier. The instruction memory, the data memory and the registers are examples.
A state element has at least two inputs and one output. The required inputs are the data value to be written and the clock, which determines when the data value is written. The simplest state element is a D flip-flop. A state element can be read at any time; the clock controls only when it is written.
A state element is also called a sequential element, because its output depends on both its inputs and its internal state.
Clocking Methodology: It defines when signals can be read and when they can be written. A simple methodology is edge triggering: any value stored in a sequential logic element is updated only on a clock edge. Consider two state elements surrounding a block of combinational logic that operates in a single clock cycle:
All signals must propagate from state element 1, through the combinational logic, to state element 2 within one clock cycle. Edge triggering makes this possible: state element 1 is read after one clock edge, and the result is written into state element 2 on the next clock edge. Edge triggering may be positive-edge or negative-edge.
Building data path: Start with the major components needed to fetch instructions: two state elements (the instruction memory and the PC) and an adder, as shown below:
These elements are combined into a datapath that fetches the instruction and increments the PC to point to the next instruction.
Consider R-type instructions. They require the processor's set of 32-bit registers, called the register file, and they have three register operands.
For a read operation: the register file needs an input that specifies the register number to be read and an output that carries the value read from that register. Two registers are read, so two such inputs and two outputs are required.
To write data: one input specifies the register number to be written and one supplies the data to be written into the register.
The register number inputs are 5 bits wide to specify one of 32 registers (32 = 2^5). In total the register file needs four inputs (three for register numbers and one for data) and two data outputs.
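The register file ports described above can be sketched in Python as a small behavioural model. This is only an illustration under stated assumptions: the class and method names are hypothetical, the write-enable argument stands in for a write control signal, and the rule that register 0 always reads as zero is the usual MIPS convention rather than something stated in these notes.

```python
class RegisterFile:
    """Toy model of a 32-entry, 32-bit register file (names are illustrative)."""

    def __init__(self):
        self.regs = [0] * 32  # 32 registers, each selected by a 5-bit number

    def read(self, read_reg1, read_reg2):
        # Two read ports: two 5-bit register numbers in, two data values out.
        return self.regs[read_reg1 & 0x1F], self.regs[read_reg2 & 0x1F]

    def write(self, write_reg, write_data, write_enable):
        # One write port: a 5-bit register number, the data, and a write enable.
        # Assumption: register 0 is hardwired to zero, as is usual in MIPS.
        if write_enable and (write_reg & 0x1F) != 0:
            self.regs[write_reg & 0x1F] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.write(9, 7, write_enable=True)    # register 9 is $t1
rf.write(10, 5, write_enable=False)  # ignored: write enable not asserted
print(rf.read(9, 10))                # (7, 0)
```

Masking the register numbers with 0x1F mirrors the 5-bit width of the register number inputs.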
The elements required for R-type instructions are:
The ALU takes two 32-bit inputs and 4 control lines, and produces a 32-bit result and a 1-bit Zero signal that indicates a zero result.
Consider the memory-reference instructions lw $s1, offset_value($s2) and sw $s1, offset_value($s2). Both require a sign-extension unit to extend the 16-bit offset_value to 32 bits, an ALU operation, and the data memory. The two additional elements are shown below:
Consider the branch instruction beq: It has three operands: two registers that are compared for equality and a 16-bit offset used to calculate the target address. To implement this instruction, the branch target address is computed by adding the sign-extended 32-bit offset field to PC+4, the address of the instruction following the branch. Before the addition, the offset field is shifted left by 2 bits so that it is a word offset. If the condition is true, the branch is taken; otherwise it is not taken.
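The branch target calculation can be sketched as follows. This is a minimal illustration, not a description of any particular hardware: the function names are hypothetical, and the base address is taken to be PC+4, as in the branch execution steps given later in these notes.

```python
def sign_extend16(value):
    """Sign-extend a 16-bit field to a Python int (negative if bit 15 is set)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    # Target = (PC + 4) + (sign-extended offset shifted left by 2), kept to 32 bits.
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc, rs_val, rt_val, offset16):
    # The branch is taken only when the two register values are equal.
    if rs_val == rt_val:
        return branch_target(pc, offset16)
    return (pc + 4) & 0xFFFFFFFF


print(hex(branch_target(0x1000, 3)))       # 0x1010
print(hex(branch_target(0x1000, 0xFFFF)))  # 0x1000 (offset of -1 word)
```

An offset of -1 word brings the PC back to the branch itself, which shows why the offset must be sign-extended rather than zero-extended.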
The structure of the datapath that handles the branch instruction is:
To compute the branch target address and evaluate the condition, the branch datapath includes a sign-extension unit, a shift-left-by-2 unit, an adder, and the ALU, which compares the two register file operands. The ALU provides an output signal that indicates whether the result is 0. If the two operands are equal, the Zero output is 1; otherwise it is 0.
The jump instruction operates by replacing the lower 28 bits of the PC with the lower 26 bits of the instruction shifted left by 2 bits. This unit is not shown here.
Creating single data path: Combine the datapath components of the individual instruction classes into a single datapath and add the control to complete the implementation. The simplest attempt executes every instruction in one clock cycle, so any element needed more than once by an instruction must be duplicated. The operations of memory-reference and arithmetic-logical instructions are largely the same, with some differences:
1. Memory instructions use the ALU for address calculation, with the second input taken from the sign-extended 16-bit offset field of the instruction; arithmetic-logical instructions use the ALU with both inputs taken from registers.
2. For memory instructions the ALU result is always used as the data memory address; for arithmetic-logical instructions it is always written to a register.
The simple datapath of the MIPS architecture for the three instruction classes is shown below:
The first ALU input always comes from a register. The second ALU input comes from a register for R-type instructions or from the sign-extended 16-bit offset field of the instruction for memory-reference instructions, which use the ALU for address calculation, so it is selected by a MUX. The control signal of this MUX is ALUSrc. The value stored in the destination register (write data) comes from the ALU result (for an R-type instruction) or from the data memory (for a load instruction), so it is selected by another MUX. The control signal for this MUX is MemtoReg. An additional MUX selects whether the address of the next sequential instruction, PC+4, or the branch target address is written to the PC. Its control line is PCSrc.
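The three selections described above can be sketched with a single 2-to-1 multiplexer helper. The function names are hypothetical, and the select polarity (0 picks the first input) is an assumption chosen to match the control-line definitions listed later in these notes.

```python
def mux2(select, in0, in1):
    """2-to-1 multiplexer: returns in0 when select is 0 and in1 when select is 1."""
    return in1 if select else in0


def alu_second_operand(alu_src, read_data2, sign_ext_offset):
    # ALUSrc = 0: register operand; ALUSrc = 1: sign-extended offset.
    return mux2(alu_src, read_data2, sign_ext_offset)


def register_write_data(mem_to_reg, alu_result, mem_data):
    # MemtoReg = 0: ALU result (R-type); MemtoReg = 1: data memory (load).
    return mux2(mem_to_reg, alu_result, mem_data)


def next_pc(pc_src, pc_plus_4, branch_target):
    # PCSrc = 0: sequential address; PCSrc = 1: branch target.
    return mux2(pc_src, pc_plus_4, branch_target)


print(alu_second_operand(1, 25, 8))     # 8: a load uses the offset
print(register_write_data(0, 33, 99))   # 33: an R-type writes the ALU result
print(next_pc(1, 0x1004, 0x1010))       # 4112 (0x1010): a taken branch
```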
Simple Implementation: To add a simple control function to the datapath, consider the instructions lw, sw, beq, add, sub, and, or and slt.
ALU control: The ALU has 4 control lines, so 16 ALU functions are possible. Only 6 of them are used here; they are shown in the following table:
For the three instruction classes, the ALU needs to perform the first five functions. For memory-reference instructions, the ALU computes the memory address by addition. For arithmetic-logical instructions, the ALU performs one of the five functions, depending on the value of the 6-bit funct field in the low-order bits of the instruction. For branch equal, the ALU must perform subtraction. The 4-bit ALU control input can be generated by a small control unit that has two inputs: a 2-bit control field called ALUOp (ALU operation) and the 6-bit funct field of the instruction. The following table shows how the 4-bit ALU control lines are related to the 2-bit ALUOp and the 6-bit funct field:
The table shows multilevel decoding: an 8-bit input (the 2-bit ALUOp plus the 6-bit funct field) generates the 4-bit output. To optimize the design, input bits that do not affect the output can be replaced by don't-care (X) conditions. The resulting truth table for the ALU control inputs is shown below:
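The two-level decoding can be sketched as a lookup. The ALUOp values (00 for load/store, 01 for branch equal, 10 for R-type) are the ones used elsewhere in these notes. The 4-bit ALU control codes and the 6-bit funct codes below are assumptions taken from the standard MIPS textbook design and encoding, and the function name is hypothetical.

```python
# Assumed 4-bit ALU control encodings (standard MIPS textbook design).
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b0000, 0b0001, 0b0010, 0b0110, 0b0111

# Assumed funct field values from the standard MIPS encoding.
FUNCT_TO_ALU = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and 6-bit funct field to the 4-bit ALU control."""
    if alu_op == 0b00:   # lw / sw: add for the address, funct is a don't care
        return ALU_ADD
    if alu_op == 0b01:   # beq: subtract, funct is a don't care
        return ALU_SUB
    if alu_op == 0b10:   # R-type: the funct field selects the operation
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("unsupported ALUOp")


print(format(alu_control(0b00, 0b111111), "04b"))  # 0010
print(format(alu_control(0b10, 0b101010), "04b"))  # 0111
```

The first call shows the don't-care behaviour: for ALUOp 00 the funct bits have no effect on the output.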
Designing of main control unit: Consider the instruction format of R-Type, memory-reference and branch instructions shown below:
The major observations about these instructions are:
1. The opcode field op is always in bits 31:26 (6 bits), usually referred to as Op[5:0]. This is common to all three instruction classes.
2. The two registers to be read are always specified by the rs and rt fields, at bit positions 25:21 and 20:16 respectively. This is also common to all three instruction classes.
3. The base register for load/store instructions is always in bits 25:21 (rs).
4. The 16-bit offset for branch equal and load/store instructions is always in bits 15:0.
5. The destination register for load and R-type instructions is in one of two places. For a load it is in bits 20:16 (rt), while for an R-type instruction it is in bits 15:11 (rd). A MUX is therefore needed to select which field of the instruction indicates the register number to be written.
The control unit implementation along with the datapath is shown below:
This implementation has seven control lines and a 2-bit ALUOp control signal. The functions of the seven control lines are:
1. RegDst: If 0, the destination register number comes from the rt field (bits 20:16); if 1, from the rd field (bits 15:11).
2. RegWrite: If 1, the register on the write register input is written with the value on the write data input; if 0, nothing happens.
3. ALUSrc: If 0, the second ALU operand comes from the second register file output (Read data 2); if 1, from the sign-extended lower 16 bits of the instruction.
4. PCSrc: If 0, the PC is replaced by the output of the adder that computes PC+4; if 1, by the output of the adder that computes the branch target.
5. MemRead: If 1, the data memory contents at the address input are put on the read data output; if 0, nothing happens.
6. MemWrite: If 1, the data memory contents at the address input are replaced by the value on the write data input; if 0, nothing happens.
7. MemtoReg: If 0, the value fed to the register write data input comes from the ALU; if 1, from the data memory.
The simple datapath design with the control unit is shown below:
The control unit generates nine control signal bits (including the 2-bit ALUOp) from the instruction opcode. PCSrc is not generated directly from the opcode: it must be 1 only when the instruction is branch equal and the ALU reports equal operands. The control unit therefore generates a Branch signal, which is ANDed with the Zero signal from the ALU to produce PCSrc.
The setting of control lines determined by the opcode field of the instruction is shown below:
The operation of the datapath for an R-type instruction such as add $t1, $t2, $t3 is shown in the following figure.
Everything occurs in one clock cycle, but the execution can be viewed as four steps that follow the flow of information:
1. The instruction is fetched and the PC is incremented.
2. Two registers, $t2 and $t3, are read from the register file, and the main control unit computes the setting of the control lines during this step.
3. The ALU operates on the data read from the register file, using the function code (bits 5:0 of the instruction) to generate the ALU function.
4. The result from the ALU is written into the register file, using bits 15:11 of the instruction to select the destination register $t1.
Illustration of the load instruction is shown below:
The five steps for the load instruction are:
1. The instruction is fetched from the instruction memory and the PC is incremented.
2. The value of register $t2 is read from the register file.
3. The ALU computes the sum of the value read from the register file and the sign-extended lower 16 bits of the instruction (the offset).
4. The sum from the ALU is used as the address for the data memory.
5. The data from the memory unit is written into the register file; the destination register ($t1) is given by bits 20:16 of the instruction.
Illustration of the branch equal instruction is shown below:
The four steps in the execution of the branch instruction are:
1. The instruction is fetched from the instruction memory and the PC is incremented.
2. Two registers, $t1 and $t2, are read from the register file.
3. The ALU performs a subtraction on the values read from the register file. The value of PC + 4 is added to the sign-extended lower 16 bits of the instruction (the offset) shifted left by two. The result is the branch target address.
4. The Zero result from the ALU is used to decide which adder result to store into the PC.
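The three walkthroughs (R-type, load and branch equal) can be tied together in a small behavioural sketch of one single-cycle step. It is an illustration only: instructions are passed as already-decoded dictionaries instead of 32-bit words, the field names and the function name are hypothetical, memory is modelled as a dictionary of words, and the example program and its values are made up for the demonstration.

```python
def sign_extend16(value):
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def single_cycle_step(pc, regs, mem, instr):
    """Execute one pre-decoded instruction and return the next PC.

    instr is a hypothetical dict with keys "op", "rs", "rt", and "rd" or "imm".
    regs is a list of 32 ints; mem maps byte addresses to 32-bit words.
    """
    op = instr["op"]
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF            # fetch and increment the PC
    a, b = regs[instr["rs"]], regs[instr["rt"]]  # read the two registers
    offset = sign_extend16(instr.get("imm", 0))

    if op in ("add", "sub"):                     # R-type: ALU works on two registers
        result = a + b if op == "add" else a - b
        regs[instr["rd"]] = result & 0xFFFFFFFF  # destination is rd (bits 15:11)
        return pc_plus_4
    if op == "lw":                               # ALU forms base + offset
        address = (a + offset) & 0xFFFFFFFF
        regs[instr["rt"]] = mem.get(address, 0)  # destination is rt (bits 20:16)
        return pc_plus_4
    if op == "sw":
        mem[(a + offset) & 0xFFFFFFFF] = b
        return pc_plus_4
    if op == "beq":                              # ALU subtracts; Zero picks the adder
        zero = ((a - b) & 0xFFFFFFFF) == 0
        target = (pc_plus_4 + (offset << 2)) & 0xFFFFFFFF
        return target if zero else pc_plus_4
    raise ValueError(f"unsupported op: {op}")


regs = [0] * 32
mem = {104: 42}
regs[10] = 100  # $t2 holds the base address
pc = 0
pc = single_cycle_step(pc, regs, mem, {"op": "lw", "rs": 10, "rt": 9, "imm": 4})   # lw $t1, 4($t2)
pc = single_cycle_step(pc, regs, mem, {"op": "add", "rs": 9, "rt": 9, "rd": 11})   # add $t3, $t1, $t1
pc = single_cycle_step(pc, regs, mem, {"op": "beq", "rs": 9, "rt": 9, "imm": 2})   # beq $t1, $t1, 2
print(regs[9], regs[11], pc)  # 42 84 20
```

The taken branch at address 8 lands on 8 + 4 + (2 x 4) = 20, which matches step 3 of the branch walkthrough.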
Finalizing the control: The input signals and the corresponding outputs of the control unit are shown in the following truth table:

Inputs
Signal name   R-Type (0)   lw (35)   sw (43)   beq (4)
Op5           0            1         1         0
Op4           0            0         0         0
Op3           0            0         1         0
Op2           0            0         0         1
Op1           0            1         1         0
Op0           0            1         1         0

Outputs
Signal name   R-Type (0)   lw (35)   sw (43)   beq (4)
RegDst        1            0         X         X
ALUSrc        0            1         1         0
MemtoReg      0            1         X         X
RegWrite      1            1         0         0
MemRead       0            1         0         0
MemWrite      0            0         1         0
Branch        0            0         0         1
ALUOp1        1            0         0         0
ALUOp0        0            0         0         1
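The truth table can be sketched as a lookup keyed by the opcode, together with the AND gate that forms PCSrc from Branch and Zero. The opcode numbers (0, 35, 43, 4) and the output values come from the table; the tuple layout, the use of None for a don't care, and the function names are illustrative assumptions.

```python
from collections import namedtuple

Control = namedtuple(
    "Control",
    "reg_dst alu_src mem_to_reg reg_write mem_read mem_write branch alu_op",
)

X = None  # don't care

MAIN_CONTROL = {
    0:  Control(1, 0, 0, 1, 0, 0, 0, 0b10),  # R-type
    35: Control(0, 1, 1, 1, 1, 0, 0, 0b00),  # lw
    43: Control(X, 1, X, 0, 0, 1, 0, 0b00),  # sw
    4:  Control(X, 0, X, 0, 0, 0, 1, 0b01),  # beq
}


def decode(instruction):
    """Look up the control settings from the opcode in bits 31:26."""
    opcode = (instruction >> 26) & 0x3F
    return MAIN_CONTROL[opcode]


def pc_src(control, zero):
    # PCSrc = Branch AND Zero
    return int(bool(control.branch and zero))


lw_word = (35 << 26) | (10 << 21) | (9 << 16) | 4  # lw $t1, 4($t2)
print(decode(lw_word))
print(pc_src(decode(4 << 26), zero=True))  # 1: beq with equal operands
```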
Implementing jump instruction: The jump instruction is similar to the branch instruction, but it computes the target address for the PC differently and is not conditional. As with a branch, the low-order 2 bits of the jump address are always 00 (binary), because the address is a word address multiplied by 4. The next 26 bits of this 32-bit address come from the 26-bit immediate field of the instruction, as shown below:
The upper 4 bits of the jump address come from PC+4. The jump instruction can therefore be implemented by storing into the PC the concatenation of:
1) The upper 4 bits of the current PC+4 (bits 31:28 of the address of the sequentially following instruction).
2) The 26-bit immediate field of the jump instruction.
3) The bits 00 (binary).
The addition of the control for the jump instruction, and of a multiplexer that selects among the jump address, PC+4 and the branch target, is shown below:
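The concatenation of the three parts of the jump address can be sketched with bit operations. The function name is hypothetical, and the opcode value 2 for the jump instruction in the example is taken from the standard MIPS encoding rather than from these notes.

```python
def jump_target(pc, instruction):
    """Form the 32-bit jump address from PC + 4 and the 26-bit immediate field."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    upper4 = pc_plus_4 & 0xF0000000     # bits 31:28 of PC + 4
    imm26 = instruction & 0x03FFFFFF    # bits 25:0 of the instruction
    return upper4 | (imm26 << 2)        # the shift supplies the two low-order 0 bits


j_word = (2 << 26) | 0x100              # assumed encoding: opcode 2 is j
print(hex(jump_target(0x40000008, j_word)))  # 0x40000400
```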
Advantages and disadvantages of single cycle implementation: The only advantage of the single-cycle implementation is its simplicity. Its disadvantage is inefficiency and low speed. The CPI is 1, but the clock cycle must be long enough for the slowest instruction, so every faster instruction wastes part of the cycle. Because of this inefficiency, this implementation is not used nowadays.
Performance of single cycle implementation: Assume that the operation times of the major functional units in the single-cycle implementation are:
a) Memory units: 200ps
b) ALU and adders: 100ps
c) Register file: 50ps
Assume that the multiplexers, control unit, PC, sign-extension unit and wires have no delay. Consider two implementations: in one, every instruction executes in 1 clock cycle of fixed length; in the other, every instruction executes in 1 clock cycle whose length varies according to the requirement of the instruction. To compare the performance, assume an instruction mix of 25% loads, 10% stores, 45% ALU instructions, 15% branches and 5% jumps.
We know that CPU execution time = IC x CPI x Clock cycle time. With CPI = 1, CPU execution time = IC x Clock cycle time. Since IC and CPI are the same in both cases, only the clock cycle time needs to be found for each. The critical paths for the different instruction classes are shown below:
Using these critical paths, the required length for each instruction class is:
With a single fixed clock for all instructions, the clock cycle time is determined by the longest instruction, which is load word at 600ps. A machine with a variable clock has a clock cycle that varies between 200ps and 600ps. Its average clock cycle is
CPU clock cycle = 400 x 0.45 + 600 x 0.25 + 550 x 0.1 + 350 x 0.15 + 200 x 0.05 = 447.5ps
The relative CPU performance can be found from the ratio of the two clock cycle times:
CPU performance (variable clock) / CPU performance (fixed clock) = 600 / 447.5 = 1.34
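The arithmetic of this comparison can be checked with a few lines of Python. The per-class times and the instruction mix are the values used in the average clock cycle calculation above; the variable names are illustrative.

```python
# (critical-path length in ps, fraction of the instruction mix) per class.
classes = {
    "alu":    (400, 0.45),
    "load":   (600, 0.25),
    "store":  (550, 0.10),
    "branch": (350, 0.15),
    "jump":   (200, 0.05),
}

# A fixed clock must fit the slowest class; a variable clock averages over the mix.
fixed_clock = max(time for time, _ in classes.values())
variable_clock = sum(time * fraction for time, fraction in classes.values())
speedup = fixed_clock / variable_clock

print(fixed_clock, round(variable_clock, 1), round(speedup, 2))  # 600 447.5 1.34
```

Because IC and CPI are the same for both machines, the ratio of clock cycle times is also the ratio of execution times.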
This shows that the variable-clock implementation is 1.34 times faster than the single-clock implementation. However, a variable-clock machine is very difficult to implement and adds overhead during execution. A single-cycle implementation with a fixed clock length is more suitable for a small instruction set. In a single-cycle implementation, each functional unit can be used only once per clock cycle, so some units must be duplicated, which raises the cost. The single-cycle design is therefore inefficient in both performance and hardware cost.
Multi cycle implementation: The drawbacks of the single-cycle implementation can be overcome by this method. A functional unit is shared instead of duplicated, as long as it is used on different clock cycles, and this sharing reduces the amount of hardware required. The major advantages of this method are that instructions can take different numbers of clock cycles and that functional units can be shared within the execution of a single instruction. The high-level view of the multicycle datapath is shown below:
The main differences between this implementation and the single-cycle implementation are:
1. A single memory unit is used for both instructions and data.
2. There is a single ALU, rather than an ALU and two adders.
3. One or more registers are added after every functional unit to hold the output of that unit until the value is used in a subsequent clock cycle.
At the end of a clock cycle, all data that is used in a subsequent clock cycle must be stored in a state element. Data used by later instructions is stored in the programmer-visible state elements: the register file, the PC or the memory.
Data used by the same instruction in a later cycle must be stored in one of the additional registers. In this design, a clock cycle can accommodate at most one of the following operations: a memory access, a register file access or an ALU operation. The data produced by any of these functional units must therefore be saved in a temporary register for use in a later cycle. The temporary registers and their purposes are:
1. Instruction Register (IR) and Memory Data Register (MDR): save the output of the memory for an instruction read and a data read respectively.
2. A and B registers: hold the register operand values read from the register file.
3. ALUOut register: holds the output of the ALU.
All of these registers except the IR hold data only between a pair of adjacent clock cycles and thus need no write control signal. The IR must hold the instruction until the end of its execution and thus requires a write control signal.
To share functional units for different purposes, more multiplexers are needed and existing ones must be expanded. Because one memory is used for both instructions and data, a MUX is required to select the memory address from two sources: the PC (for instruction access) and ALUOut (for data access). The ALU and two adders of the single-cycle implementation are replaced by a single ALU, so additional multiplexers are required at the two ALU inputs. The MUX on the first ALU input chooses between the A register and the PC. The MUX on the second ALU input is a 4-way MUX that chooses among the B register, the constant 4 (to increment the PC), the sign-extended offset (for memory address computation), and the sign-extended offset shifted left by 2 (for branch address computation). The details of the datapath with the additional multiplexers are shown below:
The datapath takes multiple clock cycles per instruction, so it requires a different set of control signals. The programmer-visible state units (the PC, the memory and the registers) require write control signals, and so does the IR. The memory also needs a read control signal. The ALU needs control signals similar to those of the single-cycle implementation.
Each multiplexer also needs control lines. The datapath with control lines is shown below:
For jump and branch instructions, there are three possible sources for the value to be written into the PC:
1. The output of the ALU, which is PC+4 during instruction fetch.
2. The register ALUOut, which holds the branch target address after it is computed.
3. The lower 26 bits of the IR shifted left by 2 and concatenated with the upper 4 bits of PC+4, which is the source when the instruction is a jump.
The PC is updated both unconditionally and conditionally. During a normal increment and for jumps, the PC is written unconditionally. If the instruction is a conditional branch, the PC is replaced by ALUOut only if the two designated registers are equal. Two separate control signals are therefore required for the PC: PCWrite, which causes an unconditional write of the PC, and PCWriteCond, which causes a write of the PC only if the branch condition is true.
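The PC update rule can be sketched as a write-enable expression plus a three-way selection. The PCSource encodings (00, 01, 10) are the ones used in the step descriptions of these notes; the function names and the dictionary of signals are hypothetical.

```python
def pc_write_enable(pc_write, pc_write_cond, zero):
    # The PC is written if PCWrite is asserted, or if PCWriteCond is asserted
    # and the ALU reports equal operands (Zero = 1).
    return bool(pc_write or (pc_write_cond and zero))


def next_pc_value(pc_source, alu_result, alu_out, jump_address):
    # PCSource: 00 = ALU output (PC + 4), 01 = ALUOut (branch target),
    # 10 = jump address.
    return {0b00: alu_result, 0b01: alu_out, 0b10: jump_address}[pc_source]


def update_pc(pc, signals, zero, alu_result, alu_out, jump_address):
    if pc_write_enable(signals["PCWrite"], signals["PCWriteCond"], zero):
        return next_pc_value(signals["PCSource"], alu_result, alu_out, jump_address)
    return pc  # no write: the PC keeps its value


fetch = {"PCWrite": 1, "PCWriteCond": 0, "PCSource": 0b00}
branch = {"PCWrite": 0, "PCWriteCond": 1, "PCSource": 0b01}
print(update_pc(100, fetch, False, 104, 200, 300))   # 104
print(update_pc(104, branch, False, 0, 200, 300))    # 104 (branch not taken)
print(update_pc(104, branch, True, 0, 200, 300))     # 200 (branch taken)
```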
The multicycle datapath and control unit including additional control signals and multiplexer for implementing PC updating is shown below:
The functions of 1-bit control lines are:
The functions of 2-bit control lines are:
Fetch, Decode, Execute and Memory Access Cycles: Breaking the execution of an instruction into multiple clock cycles should improve the performance of the system. Instruction execution is broken into a series of steps, each taking one clock cycle. For example, each step is restricted to contain at most one ALU operation, one register file access, or one memory access. With this restriction, the clock cycle can be as short as the longest of these operations. There are three to five steps in the execution of a MIPS instruction in the multicycle implementation. They are:
1. Instruction fetch step: Fetch the instruction from the memory and compute the address of the next instruction:
Operation: Send the PC to the memory as the address, perform a memory read, and write the instruction into the IR, where it is stored. Also increment the PC by 4. To implement this step, the control signals are set as follows: assert MemRead and IRWrite (set to 1); set IorD to 0 to select the PC as the source of the address; set ALUSrcA to 0 to send the PC to the ALU; set ALUSrcB to 01 to send the constant 4 to the ALU; and set ALUOp to 00 to make the ALU add. To store the incremented instruction address back into the PC, set PCSource to 00 and PCWrite to 1. The PC increment and the instruction memory access occur in parallel, and the new value of the PC is not visible until the next clock cycle.
2. Instruction decode and register fetch step: The instruction is decoded and the operands are fetched in this step. The branch target address is also computed with the ALU in this step, and the potential branch target is saved in ALUOut.
If the instruction has two register inputs, they are always in the rs and rt fields, and if the instruction is a branch, the offset is always in the low-order 16 bits. This is shown below:
Operation: Access the register file to read rs and rt and store the results in the A and B registers. Since A and B are overwritten on every cycle, the register file can be read on every cycle and the values stored into A and B. The same step also computes the branch target address and stores the result in ALUOut, where it will be used on the next clock cycle if the instruction is a branch. The required control signals for this step are: ALUSrcA set to 0 to send the PC to the ALU, ALUSrcB set to 11 to send the sign-extended and shifted offset to the ALU, and ALUOp set to 00 to make the ALU add. The register file access and the computation of the branch target address occur in parallel. After this clock cycle, the action to take depends on the instruction.
3. Execution, memory address computation, or branch completion: This is the first cycle in which the datapath operation is determined by the instruction class. For memory reference:
Operation: The ALU adds the operands to form the memory address. Set ALUSrcA to 1 to send A to the first ALU input and set ALUSrcB to 10 to send the sign-extended offset to the second ALU input. The ALUOp signals are set to 00 to make the ALU add. For R-type instruction:
Operation: The ALU performs the operation specified by the funct field on the two values read from the register file in the previous cycle. For this, ALUSrcA is set to 1 to send A to the first ALU input and ALUSrcB is set to 00 to send B to the second ALU input. The ALUOp signals are set to 10, so that the ALU control unit uses the funct field to generate the signals for the ALU operation. For branch:
Operation: The ALU is used to compare the two registers read in the previous step. The Zero output of the ALU determines whether or not to branch. The required control signals are: ALUSrcA set to 1 and ALUSrcB set to 00 to select the A and B registers as the ALU inputs, and ALUOp set to 01 for equality testing (subtract). The PCWriteCond signal must be asserted to update the PC if the Zero output of the ALU is asserted. PCSource is set to 01 to send the value in ALUOut, which holds the target address, to the PC. For conditional branches that are taken, the PC is written twice: once with PC+4 from the output of the ALU in the instruction fetch step, and once from ALUOut during the branch completion step. The last value written to the PC is used to fetch the next instruction.
For jump:
Operation: The PC is replaced by the jump address. PCSource is set to 10 to direct the jump address to the PC, and PCWrite is asserted to write the jump address into the PC.
4. Memory access or R-type instruction completion step: During this step, a memory-reference instruction accesses memory and an R-type instruction writes its result. When a value is retrieved from memory, it is stored in the MDR and is used on the next clock cycle. For memory reference:
Operation: For a load instruction, a data word is retrieved from memory and written into the MDR. For a store instruction, the data word is written into memory. In both cases the address used is the one computed during the previous step and stored in ALUOut. For a store instruction, the source operand is in B. The signals are: MemRead (for a load) or MemWrite (for a store) is asserted, and IorD is set to 1 to force the memory address to come from ALUOut rather than the PC. For R-type instruction:
Operation: Place the contents of ALUOut into the result register. RegDst is set to 1 so that the rd field (bits 15:11) selects the register file entry to write. RegWrite is asserted, and MemtoReg is set to 0 to write the ALUOut data into the register file.
5. Memory read completion step: During this step, a load instruction completes by writing the data from memory back to the register file. For load:
Operation: Write the load data, which was stored in the MDR during the previous cycle, into the register file. MemtoReg is set to 1 to write the result from memory, RegWrite is asserted, and RegDst is set to 0 to choose the rt field (bits 20:16) as the register number.
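The five steps can be summarised as one step sequence per instruction class, from which the cycle count of each class follows. The sketch below is an illustration under assumptions: the step names are made up, and the instruction mix is borrowed from the single-cycle performance comparison earlier in these notes, which the notes do not apply to the multicycle design. The control settings for the two common steps are the ones listed in the step descriptions.

```python
# Control settings for the two steps shared by every instruction.
COMMON_SIGNALS = {
    "fetch": {"MemRead": 1, "IRWrite": 1, "IorD": 0, "ALUSrcA": 0,
              "ALUSrcB": 0b01, "ALUOp": 0b00, "PCSource": 0b00, "PCWrite": 1},
    "decode": {"ALUSrcA": 0, "ALUSrcB": 0b11, "ALUOp": 0b00},
}

# Steps each instruction class passes through (step names are illustrative).
STEPS = {
    "load":   ["fetch", "decode", "mem_addr", "mem_read", "write_back"],
    "store":  ["fetch", "decode", "mem_addr", "mem_write"],
    "r_type": ["fetch", "decode", "execute", "r_complete"],
    "branch": ["fetch", "decode", "branch_complete"],
    "jump":   ["fetch", "decode", "jump_complete"],
}

# Assumption: reuse the instruction mix from the single-cycle comparison.
MIX = {"load": 0.25, "store": 0.10, "r_type": 0.45, "branch": 0.15, "jump": 0.05}

cycles = {cls: len(steps) for cls, steps in STEPS.items()}
average_cpi = sum(cycles[cls] * MIX[cls] for cls in MIX)

print(cycles)                 # load 5, store 4, r_type 4, branch 3, jump 3
print(round(average_cpi, 2))  # 4.05
```

Under this assumed mix the average is about 4 cycles per instruction, but each cycle only has to be as long as the slowest single step rather than the slowest whole instruction.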
Design of the Control Unit: To design the control unit for the single-cycle implementation, a truth table that specifies the setting of the control signals for each instruction class is used. For the multicycle datapath, the control unit is more complex, because an instruction executes as a series of steps. Two different techniques are used to design the control unit of a multicycle implementation: one is based on finite state machines (hardwired) and the other uses microprogramming. Both lead to an implementation of the control using gates, ROMs, or PLAs. The high-level view of the finite state machine control for the five steps of the multicycle implementation is shown below:
The first two states of the machine, in graphical representation, are shown below:
State 0 is instruction fetch; after it, the FSM switches to state 1 for instruction decode/register fetch. After state 1, the FSM switches to one of four states, depending on the instruction class. For memory-reference instructions:
For R-type instructions:
For branch instructions:
For jump instructions:
All these states can be implemented by a control unit shown below:
This FSM can be implemented with a temporary register that holds the current state and a block of combinational logic that determines both the datapath signals to be asserted and the next state. The combinational control logic for this FSM can be implemented with either a ROM or a PLA.
Microprogramming Control Design: A technique for designing complex control units. It uses simple hardware that can be programmed to implement a more complex instruction set.
Enhancing Performance with Pipelining: Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Consider a laundry system with non-pipelined and pipelined approaches. In the non-pipelined approach:
In the pipelined approach, as soon as the washer finishes the first load and it is placed in the dryer, the washer is loaded with the second load. When the first load is dry, it moves to the folder, the wet second load moves to the dryer, and the third load goes into the washer. Next, the first load is put away, the second load is folded, the third load moves to the dryer, and the fourth load goes into the washer. These two approaches are shown below:
Pipelining is faster, and it can be applied to the execution of MIPS instructions, because classically an instruction takes five steps:
Comparison between single cycle and pipelining implementation: Let us consider the total time required by the different units to execute each instruction, as shown below:
Execution of instruction in single cycle non-pipelining implementation is shown below:
Execution of instruction in pipelining implementation is shown below:
The time required to execute the first four instructions in the non-pipelined implementation is 800 x 4 = 3200ps, but in the pipelined implementation it is around 1500ps. The speedup of pipelined execution depends on the number of pipeline stages. Then
Pipeline hazards: A hazard is a situation in which the next instruction cannot execute in the following clock cycle. There are three types of pipeline hazards:
1. Structural hazards: Occur when the hardware cannot support the combination of instructions that are set to execute in the same clock cycle.
2. Data hazards: Occur when one instruction must wait for another to complete because it depends on data that is not yet available.
3. Control hazards: Arise from the need to make a decision based on the result of one instruction while others are executing.
Branch prediction, forwarding and stalls are used to resolve these hazards.
Pipelining increases the number of simultaneously executing instructions and the rate at which instructions are started and completed. Pipelining does not reduce the time it takes to complete an individual instruction, which is called latency. Thus pipelining improves instruction throughput rather than individual instruction execution time.
Pipelined Datapath: Consider the single-cycle datapath and divide it into five stages, one per execution step, to form a five-stage pipeline. This means that up to five instructions will be in execution during any single clock cycle.
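The difference between throughput and latency can be illustrated with a small timing model. It is a sketch under assumptions: an ideal five-stage pipeline with no hazards, an 800ps non-pipelined instruction time as in the comparison above, and a 200ps stage time, which is assumed here to equal the slowest functional unit (the 200ps memory units). The function names are hypothetical.

```python
def nonpipelined_time(n_instructions, instruction_time_ps):
    # Each instruction runs to completion before the next one starts.
    return n_instructions * instruction_time_ps


def pipelined_time(n_instructions, stage_time_ps, n_stages=5):
    # The first instruction needs n_stages cycles; after that, one instruction
    # completes every cycle (ideal pipeline, no hazards).
    return (n_stages + n_instructions - 1) * stage_time_ps


def single_instruction_latency(stage_time_ps, n_stages=5):
    return n_stages * stage_time_ps


print(nonpipelined_time(3, 800), pipelined_time(3, 200))  # 2400 1400
print(single_instruction_latency(200))                    # 1000

n = 1_000_000
print(round(nonpipelined_time(n, 800) / pipelined_time(n, 200), 2))  # 4.0
```

Under these assumptions the latency of one instruction grows from 800ps to 1000ps, while the throughput for a long instruction stream improves by a factor approaching 800 / 200 = 4.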
The names of the stages are:
1) IF (Instruction Fetch)
2) ID (Instruction Decode/Register Fetch)
3) EX (Execute/Address Calculation)
4) MEM (Memory Access)
5) WB (Write Back)
Instructions and data generally move from left to right through the five stages as an instruction completes. There are two exceptions:
1) The WB stage places the result back into the register file in the ID stage, which is in the middle of the datapath.
2) The new value of the PC is chosen between PC + 4 and the branch address from the MEM stage.
These exceptions may cause a data hazard (the first) and a control hazard (the second). The execution of some instructions and their datapaths on a common timeline is shown below:
Here the three instructions are drawn with three datapaths, but the functional units are in fact shared among the instructions. For example, the instruction memory (IM) is used in only one of the five stages of an instruction, so it is available to other instructions during the other four stages. Consider the pipelined datapath with the pipeline registers highlighted.
The pipeline registers are: the IF/ID register between the IF and ID stages, the ID/EX register between the ID and EX stages, the EX/MEM register between the EX and MEM stages and the MEM/WB register between the MEM and WB stages. There is no pipeline register at the end of the WB stage. Every instruction must update some state in the register file, memory or PC, so a separate pipeline register would be redundant with the state that is updated.

Pipelined datapath of load instruction
The active portions of the datapath as a load instruction goes through the first stage of pipelined execution are highlighted below:
Instruction fetch: The instruction is read from memory using the address in the PC and placed in the IF/ID pipeline register. The IF/ID register is similar to the instruction register (IR). The PC address is incremented by 4 and written back to the PC, ready for the next instruction. The incremented address is also saved in the IF/ID register in case an instruction such as beq needs it later. The active portions of the datapath as a load instruction goes through the second stage of pipelined execution are highlighted below:
Instruction decode and register file read: The IF/ID pipeline register supplies the 16-bit offset field and the register numbers used to read the two registers. The sign-extended 32-bit offset, the two register values and the incremented PC value are stored in the ID/EX pipeline register. The active portions of the datapath as a load instruction goes through the third stage of pipelined execution are highlighted below:
Execute and address calculation: The load instruction reads the contents of register 1 and the sign-extended 32-bit offset from the ID/EX pipeline register and adds them using the ALU. The result is placed in the EX/MEM pipeline register. The active portions of the datapath as a load instruction goes through the fourth stage of pipelined execution are highlighted below:
Memory access: The load instruction reads the data memory using the address from the EX/MEM pipeline register and loads the data into the MEM/WB pipeline register. The active portions of the datapath as a load instruction goes through the fifth stage of pipelined execution are highlighted below:
Write back: This is the final step. The data is read from the MEM/WB pipeline register and written into the register file in the middle of the datapath. The above steps show that any information needed in a later stage must be passed to it via a pipeline register.

Pipelined datapath of store instruction
The first two steps are the same as for the load instruction. The other stages are shown below:
Here also information is passed to the next stage via pipeline registers.
Corrected datapath of load instruction: In the final stage of the load instruction, the write register number is required. Because the earlier stages are shared with the instructions that follow, the IF/ID register holds a later instruction by that time, so the write register number it supplies may have changed. The write register number of the load must therefore be preserved until the last stage using the pipeline registers. The corrected datapath, which passes the write register number first to ID/EX, then to EX/MEM and finally to MEM/WB, is shown below:
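The five load steps and the correction described here can be sketched in Python as one load walking through the four pipeline registers, with the write register number carried along to the WB stage. This is a simplified model under stated assumptions: the dictionary field names, the decoded-instruction format and the word-addressed data memory are all hypothetical and chosen only for illustration.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_load(pc, instr, regs, data_mem):
    """Walk one lw through IF, ID, EX, MEM and WB.

    instr is a hypothetical decoded form: {"rs": base, "rt": dest, "imm": offset}.
    Returns snapshots of the four pipeline registers.
    """
    # IF: fetch the instruction and save PC + 4
    if_id = {"instr": instr, "pc_plus_4": pc + 4}

    # ID: read registers, sign-extend the offset, keep the write register number
    id_ex = {
        "pc_plus_4": if_id["pc_plus_4"],
        "read_data_1": regs[if_id["instr"]["rs"]],
        "read_data_2": regs[if_id["instr"]["rt"]],
        "offset": sign_extend16(if_id["instr"]["imm"]),
        "write_reg": if_id["instr"]["rt"],
    }

    # EX: the ALU adds the base register and the offset
    ex_mem = {
        "alu_result": id_ex["read_data_1"] + id_ex["offset"],
        "write_reg": id_ex["write_reg"],
    }

    # MEM: read data memory at the computed address
    mem_wb = {
        "read_data": data_mem[ex_mem["alu_result"]],
        "write_reg": ex_mem["write_reg"],
    }

    # WB: the register number comes from MEM/WB, not from IF/ID
    regs[mem_wb["write_reg"]] = mem_wb["read_data"]
    return if_id, id_ex, ex_mem, mem_wb


if __name__ == "__main__":
    regs = [0] * 32
    regs[9] = 100
    data_mem = {108: 42}
    run_load(0, {"rs": 9, "rt": 8, "imm": 8}, regs, data_mem)  # lw $8, 8($9)
    print(regs[8])
```

Because the write register number travels with the instruction, the WB stage no longer depends on whatever instruction currently occupies IF/ID.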
Graphical representation of pipelines: Pipelines can be drawn either as multiple clock cycle pipeline diagrams or as single clock cycle pipeline diagrams. The multiple clock cycle diagram of the following five-instruction sequence is shown below:
In this diagram, time advances from left to right and instructions advance from top to bottom. Another version of the multiple clock cycle pipeline diagram is shown below:
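A text version of a multiple clock cycle diagram can also be produced with a short script. The Python sketch below assumes an ideal pipeline with no stalls; the five-instruction sequence is only a placeholder, since the sequence used in the figure is not reproduced in these notes. It also extracts one column of the diagram, which corresponds to a single clock cycle view.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i in range(len(instructions)):
        row = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters IF in cycle i + 1
        rows.append(row)
    return rows


def single_cycle_slice(rows, cycle):
    """Map instruction index to its stage in the given cycle (1-based)."""
    return {i: row[cycle - 1] for i, row in enumerate(rows) if row[cycle - 1]}


def format_diagram(instructions, rows):
    """Render the diagram as fixed-width text."""
    header = " " * 18 + "".join(f"CC{c:<4}" for c in range(1, len(rows[0]) + 1))
    lines = [header]
    for text, row in zip(instructions, rows):
        lines.append(f"{text:<18}" + "".join(f"{stage:<6}" for stage in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder sequence, assumed for illustration only
    program = [
        "lw  $10, 20($1)",
        "sub $11, $2, $3",
        "add $12, $3, $4",
        "lw  $13, 24($1)",
        "add $14, $5, $6",
    ]
    rows = pipeline_diagram(program)
    print(format_diagram(program, rows))
    print(single_cycle_slice(rows, 5))
```

In cycle 5 every stage is busy: the first instruction is in WB while the fifth is in IF.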
The single clock cycle diagram is shown below:
This is a vertical slice through the multiple clock cycle diagram, showing the state of the pipeline during one clock cycle.

Pipelined Control: The control lines introduced in the pipelined datapath are shown below:
All control lines are the same as in the single cycle implementation without pipelining. There are no separate write control signals for the pipeline registers, because they are written during every clock cycle. The control lines are divided into five groups according to the pipeline stages.
1. IF Control signals:- The control signals to read the instruction memory and write the PC are always asserted, so there is nothing special to control in this pipeline stage.
2. ID Control signals:- As in the previous stage, the same thing happens in every clock cycle, so there are no optional control signals for this stage.
3. EX Control signals:- The signals are: RegDst, which selects the result register; ALUOp, which selects the ALU operation; and ALUSrc, which selects the second ALU input from either Read data 2 or the sign-extended offset.
4. MEM Control signals:- The signals are: Branch for the branch target address, MemRead for the load instruction and MemWrite for the store instruction. The PCSrc signal is asserted when the Branch control signal and the Zero output of the ALU are both asserted.
5. WB Control signals:- The signals are: MemtoReg, which selects either the ALU result or the memory read data as the value sent to the register file, and RegWrite, which writes that value to the register.
Pipelining does not change the functions of the control lines; they are only grouped by stage. The full datapath with pipeline registers and control lines is shown below:
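The grouping of control lines by stage can be sketched as a small Python table. The signal values follow the usual textbook settings for R-type, lw, sw and beq; they are stated here as an assumption, since the notes do not list them in this section. None marks a don't-care value, and the function names are hypothetical.

```python
# Control values per instruction class, grouped by the stage that uses them.
CONTROL = {
    "R-type": {
        "EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
        "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
    "beq": {
        "EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
        "MEM": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
}


def control_in_pipeline_registers(instr_class):
    """Show which control groups each pipeline register still carries.

    Each stage consumes its own group and passes the rest along.
    """
    c = CONTROL[instr_class]
    return {
        "ID/EX": {"EX": c["EX"], "MEM": c["MEM"], "WB": c["WB"]},
        "EX/MEM": {"MEM": c["MEM"], "WB": c["WB"]},
        "MEM/WB": {"WB": c["WB"]},
    }


def pc_src(branch, zero):
    """PCSrc selects the branch target only when Branch and Zero are both 1."""
    return branch & zero


if __name__ == "__main__":
    for reg, groups in control_in_pipeline_registers("lw").items():
        print(reg, groups)
```

The control values are generated once in ID and then simply shrink as they move right: nine lines enter ID/EX, five remain in EX/MEM and two reach MEM/WB (counting ALUOp as two lines).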
Data Hazards and Forwarding: Consider the following instruction sequence, which contains dependencies.
The last four instructions are all dependent on the result in register $2 of the first instruction. The execution of these instructions in the pipeline is shown below:
This shows that the value of register $2 changes from 10 to -20 in the middle of clock cycle CC5, when the first instruction writes its result. So the add and sw instructions get the correct value -20, but the and and or instructions get the wrong value 10. Looking carefully at the execution of the first instruction, its result is available at the end of the EX stage, i.e. at the end of CC3. The and and or instructions need the data at the beginning of their EX stages, i.e. in CC4 and CC5 respectively. Data forwarding resolves this hazard: the data is forwarded, as soon as it is available, to any unit that needs it before it can be read from the register file. The case considered here is forwarding to an operation in the EX stage, which is either an ALU operation or an effective address calculation.
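The timing argument above can be checked with a short Python sketch that finds which later instructions read a register before the producing instruction has written it. The instruction sequence is an assumption: it is the classic textbook sequence that matches the description (sub, and, or, add, sw, all using $2), since the listing itself is not reproduced in these notes. The sketch also assumes that the register file is written in the first half of a cycle and read in the second half.

```python
# (mnemonic, destination register or None, source registers); assumed sequence
PROGRAM = [
    ("sub", 2, (1, 3)),     # sub $2, $1, $3
    ("and", 12, (2, 5)),    # and $12, $2, $5
    ("or", 13, (6, 2)),     # or  $13, $6, $2
    ("add", 14, (2, 2)),    # add $14, $2, $2
    ("sw", None, (15, 2)),  # sw  $15, 100($2)
]


def stale_reads(program, producer_index=0):
    """Return the mnemonics of instructions that read the producer's result too early.

    Instruction i is fetched in cycle i + 1, reads registers (ID) in cycle i + 2
    and writes its result (WB) in cycle i + 5. A read in the same cycle as the
    write is fine because the write happens in the first half of the cycle.
    """
    _, dest, _ = program[producer_index]
    write_cycle = producer_index + 5
    stale = []
    for i in range(producer_index + 1, len(program)):
        mnemonic, _, sources = program[i]
        read_cycle = i + 2
        if dest in sources and read_cycle < write_cycle:
            stale.append(mnemonic)
    return stale


if __name__ == "__main__":
    print(stale_reads(PROGRAM))
```

For this sequence the function reports the and and or instructions, which are exactly the two that need forwarding.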
The following figure shows data being forwarded from the EX stage through the pipeline registers:
Here the required data is available in time for the later instructions from the pipeline registers EX/MEM and MEM/WB. If the ALU inputs can be taken from any pipeline register rather than only from ID/EX, then the data can be forwarded. This requires multiplexers, with proper control lines, at the inputs of the ALU. With this arrangement the pipeline runs at full speed even in the presence of these data dependencies. A close-up of the ALU and pipeline registers before and after adding forwarding is shown below:
The forwarding control is in the EX stage, because the ALU forwarding multiplexers are in this stage. The operand register numbers are passed from the ID stage via the ID/EX pipeline register so that the forwarding unit can determine whether to forward values. The control values of the multiplexers and their functions are:
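As a sketch of how a forwarding unit could derive these multiplexer controls, the Python code below applies the usual textbook conditions. The encoding is an assumption (00 = operand from ID/EX, 10 = forwarded from EX/MEM, 01 = forwarded from MEM/WB), as are the dictionary field names Rs, Rt, Rd and RegWrite.

```python
def forward_select(operand_reg, ex_mem, mem_wb):
    """Return the 2-bit mux control for one ALU operand.

    The EX/MEM result is checked first because it is the more recent value.
    Register $0 is never forwarded because it always reads as zero.
    """
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == operand_reg:
        return 0b10  # take the prior ALU result from EX/MEM
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == operand_reg:
        return 0b01  # take the value about to be written back from MEM/WB
    return 0b00      # no hazard: use the register file value held in ID/EX


def forwarding_unit(id_ex, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    forward_a = forward_select(id_ex["Rs"], ex_mem, mem_wb)
    forward_b = forward_select(id_ex["Rt"], ex_mem, mem_wb)
    return forward_a, forward_b


if __name__ == "__main__":
    # and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM
    print(forwarding_unit({"Rs": 2, "Rt": 5},
                          {"RegWrite": 1, "Rd": 2},
                          {"RegWrite": 0, "Rd": 0}))
    # or $13, $6, $2 in EX while and is in MEM and sub is in WB
    print(forwarding_unit({"Rs": 6, "Rt": 2},
                          {"RegWrite": 1, "Rd": 12},
                          {"RegWrite": 1, "Rd": 2}))
```

In the first case the first ALU operand is taken from EX/MEM; in the second case the second operand is taken from MEM/WB.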
Data hazards and stalls
Consider the following illustration:
Data forwarding cannot solve the data hazard introduced by a load instruction, as shown. Here the data is still being read from memory in CC4, while the ALU is already performing the operation for the following instruction. This problem is solved by stalling the pipeline whenever a load is followed by an instruction that reads its result. In addition to the forwarding unit, a hazard detection unit is required. It operates during the ID stage so that it can insert a stall between the load and its use. If the instruction in the ID stage is stalled, then the instruction in the IF stage must also be stalled. This is accomplished by preventing the PC and the IF/ID pipeline register from changing. The stall is created by inserting a nop, an instruction that has no effect on execution. The following figure shows how a nop is inserted to stall the pipeline.
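The check made by the hazard detection unit can be sketched in Python as follows. This assumes the usual textbook condition: stall when the instruction in EX is a load (MemRead asserted) whose destination register Rt matches either source register of the instruction in ID. The field names and the output signal names PCWrite, IFIDWrite and ZeroControl are hypothetical.

```python
def load_use_hazard(id_ex, if_id):
    """True when the instruction in ID reads the register a load in EX will write."""
    return bool(id_ex["MemRead"]) and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])


def hazard_detection_unit(id_ex, if_id):
    """Return the three actions the unit takes for the current cycle."""
    stall = load_use_hazard(id_ex, if_id)
    return {
        "PCWrite": not stall,      # hold the PC so the same instruction is re-fetched
        "IFIDWrite": not stall,    # hold IF/ID so the same instruction is decoded again
        "ZeroControl": stall,      # send all-0 control values down the pipe (a nop)
    }


if __name__ == "__main__":
    # lw $2, 20($1) in EX, and $4, $2, $5 in ID
    print(hazard_detection_unit({"MemRead": 1, "Rt": 2}, {"Rs": 2, "Rt": 5}))
    # sub $2, $1, $3 in EX (not a load), and $4, $2, $5 in ID: forwarding suffices
    print(hazard_detection_unit({"MemRead": 0, "Rt": 3}, {"Rs": 2, "Rt": 5}))
```

After the one-cycle stall, the loaded value can be forwarded from MEM/WB to the dependent instruction in the normal way.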
Here the pipeline execution slot for the and instruction is turned into a nop, and all instructions beginning with the and instruction are delayed by one cycle. The hazard forces the and and or instructions to repeat in CC4 what they did in CC3: the and instruction is decoded and reads its registers again, and the or instruction is re-fetched from instruction memory. The pipeline connections for both the hazard detection unit and the forwarding unit are shown below:
The forwarding unit controls the ALU multiplexers so that the value from a general purpose register is replaced with the value from the proper pipeline register. The hazard detection unit controls the writing of the PC and the IF/ID register, and the multiplexer that chooses between the real control values and all 0s. If a load-use hazard is detected, the hazard detection unit stalls the pipeline and de-asserts the control fields.
Branch hazards or Control hazards
Consider the following illustration:
With pipelining, an instruction is fetched in every clock cycle. For a branch instruction, however, the decision about whether to branch is not made until the MEM pipeline stage. This delay in determining the proper instruction to fetch is called a control hazard or branch hazard. Control hazards are relatively simple to understand and occur less frequently than data hazards.

Branch stalling
One approach is to stall until the branch is complete, but this is too slow. An improvement is to assume that the branch will not be taken and to continue execution down the sequential instruction stream. If the branch is taken, the instructions that have been fetched and decoded must be discarded. If branches are not taken half the time, and if it costs little to discard the instructions, this optimization halves the cost of control hazards. To discard instructions (also called flushing instructions), the control values are changed to 0s, which is similar to the stall for a load-use data hazard. The difference is that here the instructions in the IF, ID and EX stages must all be flushed when the branch reaches the MEM stage.

Branch prediction
Another way to reduce the cost of control hazards is to reduce the delay of branches. The branch is normally resolved in the MEM stage, but if it is resolved in an earlier stage, then fewer instructions need to be flushed.
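The cost argument can be made concrete with a small Python sketch. It assumes that a branch resolved in the MEM stage costs three cycles (the instructions in IF, ID and EX are lost) and that flushing has no extra cost; the function and parameter names are hypothetical.

```python
def average_branch_penalty(taken_fraction, flush_cycles=3, policy="predict_not_taken"):
    """Average number of lost cycles per branch.

    policy "stall": always wait flush_cycles until the branch is resolved.
    policy "predict_not_taken": lose flush_cycles only when the branch is taken.
    """
    if not 0.0 <= taken_fraction <= 1.0:
        raise ValueError("taken_fraction must be between 0 and 1")
    if policy == "stall":
        return float(flush_cycles)
    if policy == "predict_not_taken":
        return taken_fraction * flush_cycles
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    print(average_branch_penalty(0.5, policy="stall"))
    print(average_branch_penalty(0.5, policy="predict_not_taken"))
    # Branch resolved in an earlier stage, so only one instruction is flushed
    print(average_branch_penalty(0.5, flush_cycles=1))
```

With half of the branches taken, assuming not taken cuts the average penalty from 3 cycles to 1.5 cycles, and resolving the branch one instruction after the fetch cuts it further to 0.5 cycles.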
In the MIPS architecture, branch instructions need only a simple test and do not require a full ALU operation. A branch instruction requires two actions: computing the branch target address and evaluating the branch decision. The address calculation is the easy part and can be moved from the EX stage to the ID stage, because the immediate offset field is available in the IF/ID pipeline register. The computed target is used only if the branch is taken. For the branch decision, the two register values are compared by XORing their corresponding bits and then ORing all the results. Moving the branch test to the ID stage requires additional forwarding and hazard detection hardware. Two factors complicate the implementation:
1. In the ID stage, the instruction must be decoded, the need for a bypass to the equality unit must be decided, and the equality comparison must be completed so that the PC can be set to the branch target address. Operands are forwarded in the same way as for data hazards, but the equality test unit in ID requires new forwarding logic.
2. The values for the branch comparison are needed in ID but may be produced later in time, which causes a data hazard. This requires stalling.
Executing the branch in the ID stage improves speed by reducing the penalty of a branch to only one instruction if the branch is taken. Consider the following code:
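Separately from the assembly listing referred to above, the two branch actions described in this section can be sketched in Python. The equality test XORs the two register values and ORs the resulting bits; the target calculation assumes the standard MIPS rule that the sign-extended 16-bit offset is shifted left by 2 and added to PC + 4. Function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def registers_equal(a, b):
    """Equality test as the hardware does it: XOR bitwise, then OR all bits."""
    diff = (a ^ b) & MASK32
    any_difference = 0
    for bit in range(32):
        any_difference |= (diff >> bit) & 1
    return any_difference == 0


def branch_target(pc_plus_4, imm16):
    """PC + 4 plus the sign-extended word offset converted to bytes."""
    return (pc_plus_4 + (sign_extend16(imm16) << 2)) & MASK32


def next_pc_for_beq(pc, rs_value, rt_value, imm16):
    """Next PC for a beq resolved in the ID stage."""
    pc_plus_4 = (pc + 4) & MASK32
    if registers_equal(rs_value, rt_value):
        return branch_target(pc_plus_4, imm16)
    return pc_plus_4


if __name__ == "__main__":
    print(next_pc_for_beq(40, 1, 1, 7))  # taken branch
    print(next_pc_for_beq(40, 1, 2, 7))  # branch not taken
```

A beq at address 40 with offset 7 and equal operands continues at 40 + 4 + 7 x 4 = 72; with unequal operands it continues at 44.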
The implementation is:
Super Scalar Processor: Superscalar processors are dynamic multiple-issue processors: instructions are fetched in order, but the processor decides whether zero, one or more instructions can issue in a given clock cycle. This improves the instruction execution rate. The basic framework for dynamic issue decisions is dynamic pipeline scheduling, which chooses the instructions to execute in a given clock cycle while trying to avoid hazards and stalls. Consider the following code:
In this case the sub instruction is ready to execute, but it has to wait for the first two instructions to complete. Dynamic pipeline scheduling avoids this type of hazard either partially or fully.

Dynamic pipeline scheduling
Dynamic pipeline scheduling chooses which instructions to execute next, possibly reordering them to avoid stalls. A processor with this facility has three major units: an instruction fetch and issue unit, multiple functional units and a commit unit. A typical model is shown below:
The first unit fetches instructions, decodes them and sends each instruction to the corresponding functional unit for execution. Each functional unit has buffers, called reservation stations, that hold the operands and the operation. As soon as a buffer contains all its operands and the functional unit is ready to execute, the result is calculated. The result is sent to any reservation stations waiting for it as well as to the commit unit. The commit unit also has a buffer, called the reorder buffer, which is also used to supply operands in a manner similar to forwarding.
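The behaviour described above, out-of-order execution with in-order commit, can be illustrated with a very small Python model. It is a sketch under strong assumptions: every instruction is already waiting in a reservation station, there are enough functional units for all of them, only true data dependences matter, and one instruction commits per cycle. The four-instruction program and the three-cycle load latency are also assumed for illustration.

```python
def schedule(program):
    """Simulate execution start, finish and commit cycles.

    program: list of (name, dest, sources, latency) in program order.
    Returns (start, finish, committed), where committed is a list of
    (cycle, name) pairs in commit order.
    """
    n = len(program)
    # For each instruction, find the earlier instructions that produce its operands.
    producers, last_writer = [], {}
    for i, (_, dest, sources, _) in enumerate(program):
        producers.append([last_writer[s] for s in sources if s in last_writer])
        last_writer[dest] = i

    start, finish, committed = [None] * n, [None] * n, []
    cycle = 0
    while len(committed) < n:
        cycle += 1
        # Execute: any waiting instruction whose operands were produced earlier may start.
        for i in range(n):
            ready = all(finish[p] is not None and finish[p] < cycle for p in producers[i])
            if start[i] is None and ready:
                start[i] = cycle
                finish[i] = cycle + program[i][3] - 1
        # Commit: only the oldest uncommitted instruction, and only once it has finished.
        oldest = len(committed)
        if finish[oldest] is not None and finish[oldest] < cycle:
            committed.append((cycle, program[oldest][0]))
    return start, finish, committed


if __name__ == "__main__":
    # Assumed example: sub does not depend on lw or addu
    program = [
        ("lw", "$t0", ["$s2"], 3),
        ("addu", "$t1", ["$t0", "$t2"], 1),
        ("sub", "$s4", ["$s4", "$t3"], 1),
        ("slti", "$t5", ["$s4"], 1),
    ]
    start, finish, committed = schedule(program)
    print(start)
    print(committed)
```

In this model sub and slti start executing in cycles 1 and 2, before addu starts in cycle 4, yet the commit order is still lw, addu, sub, slti, so the architectural state is updated in program order.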
