
M.Tech. (CSE) I Sem.

COMPUTER SYSTEM DESIGN

UNIT II

PROCESSING UNIT

Pipeline Hazard
There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called pipeline hazards. Any condition that causes the pipeline to stall is called a hazard.


In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operation:
1. Resource conflicts (Structural Hazard) - caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.

Pipeline Hazards

2. Data dependency conflicts (Data Hazard) - arise when an instruction depends on the result of a previous instruction, but this result is not yet available.
3. Branch difficulties (Control or Instruction Hazard) - arise from branch and other instructions that change the value of the PC.

Instruction Pipeline Hazards


Three major causes for pipeline breakdown or hesitation of an instruction pipeline:
- Structural Hazard due to resource conflicts
- Control Hazard due to procedural dependencies (caused notably by branch instructions)
- Data Hazard due to data dependencies


Data Hazards


Data Hazard
A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result, some operation has to be delayed and the pipeline stalls.


Data dependency conflicts


A difficulty that may cause a degradation of performance in an instruction pipeline is a possible collision of data or addresses. A collision occurs when an instruction cannot proceed because previous instructions did not complete certain operations.


Data dependency conflicts


Data dependency
A data dependency occurs when an instruction needs data that are not yet available. For example,
an instruction in the F stage may need to fetch an operand that is being generated at the same time by the previous instruction in segment E. Therefore, the second instruction must wait for the first instruction to make the data available.


Data dependency conflicts

Address dependency

An address dependency may occur when an operand address cannot be calculated because the information needed by the addressing mode is not available. For example,

An instruction with register indirect mode cannot proceed to fetch the operand if the previous instruction is loading the address into the register.
Therefore, the operand access to memory must be delayed until the required address is available.


Data Hazards: Data dependencies

A data dependency describes the normal situation that the data an instruction uses depend upon the data created by other instructions. Instructions depend upon each other in the program and must be executed in a specific order. There are three types of data dependency between instructions:
- True data dependency
- Anti-dependency
- Output dependency

True Data Dependency


Consider the code sequence:

1  add R3, R2, R1    ; R3 = R2 + R1
2  sub R4, R3, R5    ; R4 = R3 - R5

It would be incorrect to begin reading R3 in instruction 2 before instruction 1 has produced its new value for R3.
Hence there is a data dependency between instructions 1 and 2. This dependency is called a true data dependency. A true data dependency occurs when the value produced by an instruction is required by a subsequent instruction. True data dependencies are also called Read-After-Write (RAW) hazards because we are reading a value after writing to it. True dependencies are also sometimes known as flow dependencies because the dependency is due to the flow of data in the program.

True Data Dependency


[Space-time diagram for add R3, R2, R1 followed by sub R4, R3, R5: sub stalls in the decode stage for two cycles and reads R3 only after add has written it in its WB stage.]

Anti-dependency

Consider the code sequence:

1  add R4, R2, R1    ; R4 = R2 + R1
2  sub R2, R3, R5    ; R2 = R3 - R5

Instruction 2 must not produce its result in R2 before instruction 1 reads R2; otherwise instruction 1 would use the value produced by instruction 2 rather than the previous value of R2. This type of dependency is called an anti-dependency (also called a Write-After-Read (WAR) hazard) and occurs when an instruction writes to a location which has been read by a previous instruction.

Anti-dependency
In most pipelines, reading occurs in a stage before writing, so normally if the instructions remain in program order in the pipeline, an antidependency would not be a problem. It becomes a problem if the pipeline structure is such that writing can occur before reading in the pipeline, or the instructions are not processed in program order within the pipeline. The latter situation could occur for example if instructions are allowed to overtake stalled instructions in the pipeline. For example, suppose instruction 1 was stalled because of a resource conflict. Instruction 2 could be allowed to overtake instruction 1 in the pipeline, but then we could have an anti-dependency (write-after-read hazard).


Output Dependency
Consider the code sequence:

1  add R3, R2, R1    ; R3 = R2 + R1
2  sub R4, R3, R5    ; R4 = R3 - R5
3  add R3, R2, R5    ; R3 = R2 + R5

Instruction 1 must produce its result in R3 before instruction 3 produces its result in R3; otherwise instruction 2 might use the wrong value of R3. This type of dependency is called an output dependency (also called a Write-After-Write (WAW) hazard) and occurs when a location is written to by two instructions. Again, the dependency would not be significant if all instructions write at the same time in the pipeline and instructions are processed in program order. Output dependencies are really a form of resource conflict, because the register in question, R3, is accessed by two instructions. The register is being reused; consequently, the use of another register in the instruction would eliminate the potential problem.


Control Hazards (Instruction Hazards)


Instruction Hazard
The pipeline may also be stalled because of a delay in the availability of an instruction. For example, this may be a result of a miss in the cache, requiring the instruction to be fetched from the main memory, or of a branch instruction. Such hazards are often called control hazards or instruction hazards.

[Figure: Pipeline stall caused by a cache miss in F2]

Instruction Hazard

Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally.
However, the fetch operation for instruction I2, which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5.

The pipeline resumes its normal operation at that point.



Control Hazards
Control hazards are caused by procedural dependencies. Procedural dependencies occur when the execution order is unknown prior to execution of the instructions. Procedural dependencies are caused notably by branch instructions.

Two types of branch instructions:
- Unconditional branch instructions
- Conditional branch instructions


Control Hazards
Conditional Branch Instructions
Example: JNZ L1 ; Branch to L1 if Z=0

The fetch unit keeps fetching instructions irrespective of the execution of the instructions.
When the branch instruction is completely executed, or at least when the condition can be tested, it becomes known which instruction to process next. If this instruction is not the next instruction in the pipeline, all instructions in the pipeline are abandoned and the pipeline is cleared. Hopefully, the abandoned instructions have not altered the state of the processor, i.e., not written to any register; otherwise the changes must be reversed, an expensive operation. Then the required instruction is fetched from the target address and must be processed through all units of the pipeline in the same way as when the pipeline is first started.

Control Hazards Unconditional Branch Instructions


Example: JMP L2 ; Branch to L2 unconditionally

Unconditional branch instructions always cause a change in the sequence. The change is predictable and fixed, but can also affect the pipeline.


Structural Hazards


Structural Hazard

A third type of hazard that may be encountered in pipelined operation is known as a structural hazard.
This is the situation when two instructions require the use of a given hardware resource at the same time. The most common case in which this hazard may arise is in access to memory. One instruction may need to access memory as part of the execute or write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other instruction is delayed. Many processors use separate instruction and data caches to avoid this delay.

Structural Hazards
Structural hazards are caused by resource conflicts.
Resource conflicts occur when a resource such as memory or a functional unit is required by more than one instruction at the same time in the pipeline. A notable resource conflict is the main memory of the system. For memories with only one read/write port, only one access can be made at any instant.


Strategies to deal with Data Hazards


Data Hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason.
Consider a program that contains two instructions, I1 followed by I2, where I2 requires the results of I1. When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 is completed. This means that the results generated by I1 may not be available for use by I2. We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially.

Data Hazards
Consider the 4-stage pipeline. The data dependency arises when the destination of one instruction is used as a source in the next instruction.
For example, the two instructions

Mul R2, R3, R4    ; R4 ← R2 * R3
Add R5, R4, R6    ; R6 ← R5 + R4

give rise to a data dependency.

The result of the multiply instruction is placed into register R4, which in turn is one of the two source operands of the Add instruction. Assume that the multiply operation takes one clock cycle to complete. As the decode unit decodes the Add instruction in cycle 3, it realizes that R4 is used as a source operand. Hence the D step of that instruction cannot be completed until the W step of the multiply instruction has been completed. Completion of step D2 must be delayed to clock cycle 5, and is shown as step D2A in the figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed because step D3 cannot precede D2. Hence pipelined execution is stalled for two cycles.
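The two-cycle stall can be checked with simple cycle arithmetic. A minimal sketch (assumed, not from the slides), with stages F, D, E, W numbered 0 to 3 and one stage per cycle when there are no stalls:

# A minimal sketch (assumed): cycle arithmetic for the Mul/Add dependency
# above. An instruction fetched in cycle c reaches stage s in cycle c + s.

def stage_cycle(fetch_cycle, stage):
    """Cycle in which `stage` executes for an instruction fetched in fetch_cycle."""
    return fetch_cycle + stage

F, D, E, W = range(4)
mul_writes_r4 = stage_cycle(1, W)         # Mul fetched in cycle 1 -> W in cycle 4
add_decode_ok = mul_writes_r4 + 1         # D2 can only complete after W1, in cycle 5
add_decode_normal = stage_cycle(2, D)     # without the hazard, D2 would be cycle 3
print(add_decode_ok - add_decode_normal)  # -> 2 stall cycles, as described above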

[Figure: Pipeline stalled by data dependency between D2 and W1]

Data Hazard
Pipelined computers deal with such data dependency conflicts in a variety of ways:
1. Hardware interlocks
2. Operand forwarding
3. Delayed load


Data Hazard

Hardware Interlocks

The most straightforward method is to insert hardware interlocks.

An interlock is a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline.
Detection of this situation causes the instruction whose source is not available to be delayed by enough clock cycles to resolve the conflict.

This approach maintains the program sequence by using hardware to insert the required delays.
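The interlock check itself is simple. A minimal sketch (assumed, not from the slides), with register names as strings:

# A minimal sketch (assumed) of an interlock check: stall the instruction in
# decode while any instruction farther up the pipeline will still write one
# of its source registers. Each instruction is a (dest, src1, src2) tuple.

def needs_stall(decoding, in_flight):
    _, src1, src2 = decoding
    return any(dest in (src1, src2) for dest, _, _ in in_flight)

# sub R4, R3, R5 is in decode while add R3, R2, R1 is still in EX:
print(needs_stall(("R4", "R3", "R5"), [("R3", "R2", "R1")]))  # -> True: insert a bubble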

Forwarding
The term forwarding refers to the technique of passing the result of one instruction directly to another instruction to eliminate the use of intermediate storage locations.
It can be applied at the compiler level to eliminate unnecessary references to memory locations by forwarding values through registers rather than through memory locations. Such forwarding would generally increase the speed of operation, as accesses to processor registers are normally faster than accesses to memory locations. Forwarding can also be applied at the hardware level to eliminate pipeline cycles for reading registers that were updated in a previous pipeline stage. In that case, forwarding eliminates register accesses, by using faster data paths.


Compiler Directed Forwarding


Three types of forwarding at the compiler level can be identified:
- Store-fetch forwarding
- Fetch-fetch forwarding
- Store-store overwriting

Store refers to writing operands into memory; fetch refers to reading operands from memory.


Compiler Directed Forwarding


Store-fetch forwarding
In store-fetch forwarding, fetching an operand which has been stored (and hence is also held in a processor operand register) is eliminated by taking the operand directly from the processor operand register. For example, the code

sw R2, label
lw R3, label

could be reduced to

sw R2, label
move R3, R2

which eliminates one memory reference.


Compiler Directed Forwarding


Fetch-fetch forwarding
In fetch-fetch forwarding, multiple accesses to the same memory location are eliminated by making all accesses to the operand in a processor operand register once it has been read into the register. For example, the code

lw R2, addr
lw R3, addr

could be reduced to

lw R2, addr
move R3, R2

which eliminates one memory reference.


Compiler Directed Forwarding


Store-store overwriting
In store-store overwriting, one or more write operations without intermediate operations on the stored information can be eliminated. For example, the code

sw R2, addr
sw R3, addr

could be reduced to

sw R3, addr

which eliminates one memory reference.

Data Hazard
Operand Forwarding
Another technique, called operand forwarding, uses special hardware to detect a conflict and then avoid it by routing the data through special paths between pipeline segments. For example, instead of transferring an ALU result into a destination register, the hardware checks the destination operand, and if it is needed as a source in the next instruction, it passes the result directly into the ALU input, bypassing the register file.

This method requires additional hardware paths through multiplexers as well as the circuit that detects the conflict.
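The multiplexer decision can be sketched in a few lines. A minimal sketch (assumed, not the slides' hardware description), with register names as strings:

# A minimal sketch (assumed) of the forwarding mux: if the previous
# instruction's destination matches this source register, take the value
# from the result bus instead of the register file.

def alu_input(src_reg, reg_file, prev_dest, prev_result):
    if src_reg == prev_dest:      # conflict detected by the forwarding logic
        return prev_result        # bypass path from the ALU result bus
    return reg_file[src_reg]      # normal path through the register file

regs = {"R2": 10, "R4": -1}                # R4 not yet written back
print(alu_input("R4", regs, "R4", 70))     # -> 70, forwarded; no stall needed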

Internal Forwarding
Internal forwarding is hardware forwarding implemented by processor registers or data paths not visible to the programmer.
It can be applied in an instruction pipeline to eliminate references to general-purpose registers. For example, in the code sequence:

sub R2, R1, R3    ; R2 ← R1 - R3
and R12, R2, R5   ; R12 ← R2 ∧ R5

the and instruction requires the contents of R2, which is generated by the sub instruction. The and instruction will be stalled in the ID unit waiting for the value of R2 to become valid. Internal forwarding forwards the value being stored in R2 by the WB stage directly to the execute unit. Without forwarding, the and instruction would have to wait for R2 to be updated and then would need to read R2.

Data Hazard
Operand Forwarding
The data hazard arises because one instruction, instruction I2, is waiting for data to be written in the register file. However, these data are available at the output of the ALU once the execute stage completes step E1. Hence the delay can be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in step E2.

Operand forwarding in a pipelined processor


Data Hazard
Operand Forwarding
A part of the processor data path involving the ALU and the register file is shown in the figure. This arrangement is similar to the three-bus structure except that registers SRC1, SRC2, and RSLT have been added. These registers constitute the inter-stage buffers needed for pipelined operation. Registers SRC1 and SRC2 are part of buffer B2 and RSLT is part of B3. The data forwarding mechanism is provided by the blue connection lines. The two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of either the SRC1 or SRC2 register.

When the instructions are executed in the data path the operations performed in each clock cycle are as follows.
After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding. The operand not involved in the dependency, register R2, is read and loaded in register SRC1 in clock cycle 3. In the next clock cycle, the product produced by instruction I1 is available in register RSLT, and because of the forwarding connection, it can be used in step E2. Hence execution of I2 proceeds without interruption.


Detecting Data Hazards


When a data dependency occurs, there are two possible strategies:
- Detect the data dependencies and then hold up the pipeline completely until the dependencies have been resolved.
- Allow all instructions to be fetched into the pipeline but only allow independent instructions to proceed to their completion, and delay instructions which are dependent upon other, not yet executed, instructions until those instructions are executed.


Detecting Data Hazards


Data dependencies can be detected by considering read and write operations on specific locations accessible by the instructions.

Read-After-Write (RAW) Hazard
Exists if a read operation occurs before a previous write operation has been completed, and hence the read operation would obtain an incorrect value (a value not yet updated).

Write-After-Read (WAR) Hazard
Exists when a write operation occurs before a previous read operation has had time to complete, and again the read operation would obtain an incorrect value (a prematurely updated value).

Write-After-Write (WAW) Hazard
Exists if there are two write operations upon a location such that the second write operation in the pipeline completes before the first. Hence the written value will be altered when the first write operation completes.


Detecting Data Hazards


Bernstein's conditions for detecting data hazards
A potential hazard exists between instruction i and instruction j when at least one of the following conditions fails:

For no Read-After-Write:  O(i) ∩ I(j) = ∅
For no Write-After-Read:  I(i) ∩ O(j) = ∅
For no Write-After-Write: O(i) ∩ O(j) = ∅

O(i) denotes the set of output locations altered by instruction i, I(i) the set of input locations read by instruction i, and ∅ the empty set.

Detecting Data Hazards


Bernstein's conditions for detecting data hazards
Example: consider the following code sequence:

1. add R3, R2, R1   ; R3 = R2 + R1
2. sub R5, R4, R3   ; R5 = R4 - R3

O(1) = {R3}    I(1) = {R1, R2}
O(2) = {R5}    I(2) = {R4, R3}

Read-After-Write hazard, since O(1) ∩ I(2) = {R3} ≠ ∅
No Write-After-Read hazard, since I(1) ∩ O(2) = ∅
No Write-After-Write hazard, since O(1) ∩ O(2) = ∅

Since the first condition fails, a RAW hazard exists between instructions 1 and 2.
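These set intersections translate directly into code. A minimal sketch (assumed, not from the slides), with registers named by strings:

# A minimal sketch (assumed) of Bernstein's conditions using Python sets;
# a non-empty intersection means the corresponding condition fails.

def hazards(O_i, I_i, O_j, I_j):
    return {"RAW": O_i & I_j,    # empty iff no read-after-write hazard
            "WAR": I_i & O_j,    # empty iff no write-after-read hazard
            "WAW": O_i & O_j}    # empty iff no write-after-write hazard

# 1. add R3, R2, R1    2. sub R5, R4, R3
print(hazards({"R3"}, {"R1", "R2"}, {"R5"}, {"R4", "R3"}))
# -> {'RAW': {'R3'}, 'WAR': set(), 'WAW': set()}: a RAW hazard on R3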

Data Hazard
Delayed load
A procedure employed in some computers is to give the responsibility for solving data conflict problems to the compiler that translates the high-level programming language into a machine language program. The compiler for such computers is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data by inserting no-operation instructions. This method is referred to as delayed load.

Data Hazard
Delayed Load (Handling Data Hazards in Software)

An alternative approach is to leave the task of detecting data dependencies and dealing with them to the software. In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (no-operation) instructions, as follows:

I1: Mul R2, R3, R4    ; R4 ← R2 * R3
    NOP
    NOP
I2: Add R5, R4, R6    ; R6 ← R5 + R4

If the responsibility for detecting such dependencies is left entirely to the software, the compiler must insert the NOP instructions to obtain a correct result. This possibility illustrates the close link between the compiler and the hardware: a particular feature can either be implemented in hardware or left to the compiler, and leaving it to the compiler leads to simpler hardware.

Being aware of the need for delay, the compiler can attempt to reorder instructions to perform useful tasks in the NOP slots, and thus achieve better performance. On the other hand, insertion of NOP instructions leads to larger code size. Also, it is often the case that a given processor architecture has several hardware implementations offering different features. NOP instructions inserted to satisfy the requirements of one implementation may not be needed and hence would lead to reduced performance on a different implementation.

Data Hazards
Example: using the compiler to detect and resolve data hazards

sub R2, R1, R3
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)

In this case, the compiler needs to insert two independent instructions between the sub and the and instructions, thereby making the hazard disappear. When no such independent instructions can be found, the compiler inserts instructions guaranteed to be independent, i.e., nop instructions:

sub R2, R1, R3
nop
nop
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)

Although this code works properly for this pipeline, these two nops occupy two clock cycles that do no useful work.
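The nop-insertion pass itself is easy to sketch. A hedged illustration (assumed, not the slides' compiler), treating each instruction as a (destination, sources) pair and requiring a two-slot gap behind the producer of any source:

# A minimal sketch (assumed) of compiler nop insertion for a pipeline that
# needs a dependent instruction to sit at least two slots behind the
# instruction that produces one of its source registers.

def insert_nops(code, distance=2):
    out = []
    for inst in code:
        dest, srcs = inst
        # pad with nops while a recent instruction still writes one of our sources
        while any(prev and prev[0] in srcs for prev in out[-distance:]):
            out.append(None)                 # None stands for a nop
        out.append(inst)
    return out

prog = [("R2", ("R1", "R3")),                # sub R2, R1, R3
        ("R12", ("R2", "R5"))]               # and R12, R2, R5
print(insert_nops(prog))                     # -> sub, nop, nop, and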

Strategies to reduce Control Hazards


INSTRUCTION HAZARDS
The purpose of the instruction fetch unit is to supply the execution units with a steady stream of instructions. Whenever this stream is interrupted, the pipeline stalls, as in the case of a cache miss.

A branch instruction may also cause the pipeline to stall.


Handling of Branch Instructions


One of the major problems in operating an instruction pipeline is the occurrence of branch instructions.

A branch instruction can be conditional or unconditional.


An unconditional branch always alters the sequential program flow by loading the program counter with the target address. In a conditional branch, the control selects the target instruction if the condition is satisfied or the next sequential instruction if the condition is not satisfied.


Unconditional Branches


Consider an unconditional branch instruction being executed in a two-stage pipeline. The pipeline is stalled for one clock cycle. The time lost as a result of a branch instruction is often referred to as the branch penalty. For a longer pipeline, the branch penalty may be higher.

Unconditional Branches

[Figure: An idle cycle caused by a branch instruction]

Reducing Branch Penalty


Reducing the branch penalty requires the branch address to be computed earlier in the pipeline.

Typically the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched.
With this additional hardware, both of these tasks can be performed in step Decode. In this case, the branch penalty is only one clock cycle.

Branch Address Computation- Reducing Branch Penalty


Instruction Queue and Prefetching


Either a cache miss or a branch instruction stalls the pipeline for one or more clock cycles. To reduce the effect of these interruptions, many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in a queue. Typically, the instruction queue can store several instructions. A separate unit, called the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit. The dispatch unit also performs the decoding function.


Instruction Queue and Prefetching


To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. It attempts to keep the instruction queue filled at all times to reduce the impact of occasional delays when fetching instructions. When the pipeline stalls because of a data hazard, for example, the dispatch unit is not able to issue instructions from the instruction queue. However, the fetch unit continues to fetch instructions and add them to the queue. Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue.

Prefetching & Instruction Queue in the Pipeline


Branch timing in the presence of an instruction Queue.


[Figure: Branch target address computed in the D stage]

Instruction Queue and Prefetching


Branch Folding
Because of the use of the instruction queue, the branch instruction does not increase the overall execution time. This is because the instruction fetch unit has executed the branch instruction (by computing the branch address) concurrently with the execution of other instructions. This technique is referred to as branch folding.

Note that branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction is available in the queue other than the branch instruction.

Instruction Queue and Prefetching


It is desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing. This can be achieved by increasing the rate at which the fetch unit reads instructions from the cache. In many processors, the width of the connection between the fetch unit and the instruction cache allows reading more than one instruction in each clock cycle. If the fetch unit replenishes the instruction queue quickly after a branch has occurred, the probability that branch folding will occur increases. Having an instruction queue is also beneficial in dealing with cache misses.


Instruction Queue and Prefetching

When a cache miss occurs, the dispatch unit continues to send instructions for execution as long as the instruction queue is not empty.
Meanwhile, the desired cache block is read from the main memory or from a secondary cache. When fetch operations are resumed, provided the instruction queue has not become empty, the cache miss will have had no effect on the rate of instruction execution.
In summary, the instruction queue mitigates the impact of branch instructions on performance through the process of branch folding.
It has a similar effect on stalls caused by cache misses.

The effectiveness of this technique is enhanced when the instruction fetch unit is able to read more than one instruction at a time from the instruction cache.

Conditional Branches


A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
The decision to branch cannot be made until the execution of that instruction has been completed. Branch instructions represent about 20 percent of the dynamic instruction count of most programs.
(The dynamic count is the number of instruction executions, taking into account the fact that some program instructions are executed many times because of loops.) Because of the branch penalty, this large percentage would reduce the gain in performance expected from pipelining. Fortunately, branch instructions can be handled in several ways to reduce their negative impact on the rate of execution of instructions.
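To get a feel for the scale of the effect, assume (an illustrative figure, not from the slides) an average branch penalty of two cycles: with branches at 20 percent of the dynamic instruction count, the effective CPI of an otherwise ideal pipeline becomes 1 + 0.20 × 2 = 1.4, a 40 percent slowdown.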


Conditional Branches
The branch instruction breaks the normal sequence of the instruction stream, causing difficulties in the operation of the instruction pipeline.
Pipelined computers employ various hardware techniques to minimize the performance degradation caused by instruction branching:
1. Prefetch target instruction
2. Branch target buffer
3. Branch prediction
4. Delayed branch

Reducing Pipeline Conditional Branch Hazards - Dealing with Branch Stalls


Approach 1: Stall until branch direction is clear
Simplest scheme to handle branches: freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known.
Clock cycle number       1   2   3   4    5    6    7    8
Branch instruction       IF  ID  EX  MEM  WB
Branch successor             IF  IF  ID   EX   MEM  WB
Branch successor + 1                 IF   ID   EX   MEM  WB
Branch successor + 2                      IF   ID   EX   MEM

A branch causes a 1-cycle stall in the five-stage pipeline.


Branch Prediction


Branch Prediction
Another technique for reducing the branch penalty associated with conditional branches is to attempt to predict whether or not a particular branch will be taken.
The simplest form of branch prediction is to assume that the branch will not take place and to continue to fetch instructions in sequential address order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.

Branch Prediction
Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence. Hence, care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed. If the branch decision indicates otherwise, the instructions and all their associated data in the execution units must be purged, and the correct instructions fetched and executed.

Static Branch Prediction


Reducing Pipeline Branch Hazards


Dealing with Branch Stalls
Approach 2: Predict Branch Not Taken
- Execute successor instructions in sequence; PC+4 is already calculated, so use it to get the next instruction (chances are the branch is not taken)
- Squash instructions in the pipeline if the branch is actually taken
- This is possible because the CPU state is not updated until late in the pipeline
Untaken branch:

Clock cycle number          1   2   3   4    5    6    7    8
Untaken branch instr. i     IF  ID  EX  MEM  WB
Instruction i + 1               IF  ID  EX   MEM  WB
Instruction i + 2                   IF  ID   EX   MEM  WB
Instruction i + 3                       IF   ID   EX   MEM  WB

Taken branch:

Clock cycle number          1   2   3    4    5    6    7    8
Taken branch instr. i       IF  ID  EX   MEM  WB
Instruction i + 1               IF  idle idle idle idle
Branch target                       IF   ID   EX   MEM  WB
Branch target + 1                        IF   ID   EX   MEM  WB

Pipeline Branch Hazards


Dealing with Branch Stalls
Approach 3: Predict Branch Taken
- Most branches are taken
- But the target address has not yet been calculated in a 5-stage RISC pipeline, so a 1-cycle latency is still incurred
- Makes sense on machines where the branch target is known before the outcome (branch target buffers)


Static Branch Prediction

A decision on which way to predict the result of the branch may be made in hardware by observing whether the target address of the branch is lower than or higher than the address of the branch instruction. A more flexible approach is to have the compiler decide whether a given branch instruction should be predicted taken or not taken. The branch instructions of some processors, such as SPARC, include a branch prediction bit, which is set to 0 or 1 by the compiler to indicate the desired behavior. The instruction fetch unit checks this bit to predict whether the branch will be taken or not taken. With either of these schemes, the branch prediction decision is always the same every time a given instruction is executed. Any approach that has this characteristic is called static branch prediction.
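The hardware rule described above (backward branches usually close loops and are therefore taken) fits in a few lines. A minimal sketch (assumed, with illustrative addresses):

# A minimal sketch (assumed) of the static backward-taken /
# forward-not-taken rule described above.

def static_predict(branch_pc, target_pc):
    return target_pc < branch_pc          # backward branch: predict taken

print(static_predict(0x1008, 0x1000))     # loop-closing branch -> True (taken)
print(static_predict(0x1008, 0x1040))     # forward branch -> False (not taken)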

Delayed Branch Technique


Pipeline Branch Hazards


Dealing with Branch Stalls
Approach 4: Delayed Branch
Define the branch to take place AFTER n following instructions:

    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n      (n branch delay slots)
    branch target if taken

Untaken branch:

Clock cycle number            1   2   3   4    5    6    7    8
Untaken branch instr. i       IF  ID  EX  MEM  WB
Branch delay instr. i + 1         IF  ID  EX   MEM  WB
Instruction i + 2                     IF  ID   EX   MEM  WB
Instruction i + 3                         IF   ID   EX   MEM  WB

Taken branch:

Clock cycle number            1   2   3   4    5    6    7    8
Taken branch instr. i         IF  ID  EX  MEM  WB
Branch delay instr. i + 1         IF  ID  EX   MEM  WB
Branch target                         IF  ID   EX   MEM  WB
Branch target + 1                         IF   ID   EX   MEM  WB

Delayed Branch Instructions


In the delayed branch instruction scheme, branch instructions operate such that the sequence of execution is not altered immediately after the branch instruction is executed, but only after one or more subsequent non-branch instructions, depending upon the design. The subsequent instructions are executed irrespective of the branch outcome. That is, the compiler or programmer has to find enough suitable independent instructions in the program to place after a branch instruction.


Delayed Branch Instructions


Delayed branching technique can minimize the penalty incurred as a result of conditional branch instructions. The instructions in the delay slots are always fetched and at least partially executed before the branch decision is made and the branch target address is computed. Therefore we would like to arrange for them to be fully executed whether or not the branch is taken.

The objective is to be able to place useful instructions in these slots.


If no useful instructions can be placed in the delay slots, these slots must be filled with NOP instructions.

This situation is exactly the same as in the case of data dependency (Delayed Load).


Delayed Branch Instructions


Branch Delay Slots
Instructions in the branch delay slot(s) get executed whether or not the branch is taken.
- Heavily used in early RISC machines
- 1 delay slot suffices for a 5-stage pipeline (target available at end of ID)
- Machines with deep pipelines require additional delay slots to avoid branch penalties
- Benefits are unclear

Delayed Branch Instructions


Consider the following code sequence:

    add R3, R4, R5
    sub R6, R6, 1
    jz  L1
    ...
L1: ...

The instruction add R3, R4, R5 has no influence on whether the branch instruction jz L1 is taken. Using the one-instruction delayed version of the branch instruction, the add instruction can be moved to after the branch, i.e.:

    sub R6, R6, 1
    jz  L1
    add R3, R4, R5    ; executes in the branch delay slot
    ...
L1: ...

Use of NOP to Delay Branch Instructions

A NOP (no operation) instruction can be used to fill the branch delay slot when the compiler cannot find independent instructions to insert after branch instructions in the code sequence.

Consider the following code sequence:

    sub R2, R2, 1
    jz  L1
    ...
L1: ...

With one delay slot, this becomes:

    sub R2, R2, 1
    jz  L1
    nop
    ...
L1: ...

A compiler can easily insert these NOPs.

Delayed Branch Instructions


The effectiveness of the delayed branch approach depends on how often it is possible to reorder instructions.
Experimental data collected from many programs indicate that sophisticated compilation techniques can use one branch delay slot in as many as 85 percent of the cases.

For a processor with two branch delay slots, the compiler attempts to find two instructions preceding the branch instruction that it can move into the delay slots without introducing a logical error. The chances of finding two such instructions are considerably less than the chances of finding one. Thus if increasing the number of pipeline stages involves an increase in the number of branch delay slots, the potential gain in performance may not be fully realized.


Dynamic Branch Prediction


Branch Prediction Motivation


- High branch penalties in pipelined processors
- The situation is even worse for multiple-issue processors, because an instruction stream of n instructions per cycle must be provided
- Idea: predict the outcome of branches based on their history and execute instructions speculatively
An approach in which the prediction decision may change depending on execution history is called dynamic branch prediction.


Dynamic Branch Prediction


There are various methods of predicting the next address, mainly based upon:
- expected repetitive instruction usage of the branch
- expected non-repetitive instruction usage of the branch


Dynamic Branch Prediction Logic


The objective of branch prediction algorithms is to reduce the probability of making a wrong decision, to avoid fetching instructions that eventually have to be discarded. In dynamic branch prediction schemes, the processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that instruction is executed. In its simplest form, the execution history used in predicting the outcome of a given branch instruction is the result of the most recent execution of that instruction. The processor assumes that the next time the instruction is executed, the result is likely to be the same.


Dynamic Branch Prediction Logic


A simple dynamic prediction algorithm may be described by a two-state machine. The two states are:

LT : Branch is likely to be taken
LNT : Branch is likely not to be taken

Suppose that the algorithm is started in state LNT.

When the branch instruction is executed and if the branch is taken, the machine moves to state LT. Otherwise it remains in state LNT. The next time the same instruction is encountered, the branch is predicted as taken if the corresponding state machine is in state LT. Otherwise it is predicted as not taken.

1-bit Branch Prediction Buffer


The 1-bit branch prediction buffer, or branch history table (BHT):

[Figure: a table of single prediction bits indexed by the low-order bits of the branch PC]

The buffer is like a cache without tags.

1-bit Branch Prediction Scheme


The 1-bit branch prediction scheme, which requires one bit of history information for each branch instruction, works well inside program loops.
Once a loop is entered, the branch instruction that controls looping will always yield the same result until the last pass through the loop is reached. In the last pass, the branch prediction will turn out to be incorrect, and the branch history state machine will be changed to the opposite state. Unfortunately, this means that the next time this same loop is entered (assuming there will be more than one pass through the loop), the machine will make a wrong prediction on the first pass as well. Better performance can be achieved by keeping more information about execution history.

Branch Prediction Buffer or Branch Target Buffer


Simplest dynamic branch-prediction scheme:
- A branch-prediction buffer is a small memory indexed by the low-order portion of the branch instruction address
- The memory contains a bit that says whether the branch was recently taken or not
- Stores a single bit of information: T or NT
- Starts off as T; flips whenever a branch behaves opposite to the prediction
- If the hint turns out to be wrong, the prediction bit is inverted and stored back

Problems with this simple 1-bit prediction scheme:
- The prediction value may not correspond to the branch being considered. This cannot be avoided: the branch prediction buffer serves as a cache without tags
- It does not do a good job of predicting mostly-taken branches: if a branch is almost always taken, it will be predicted incorrectly twice, rather than once, when it is not taken

Branch Prediction Buffer or Branch Target Buffer


Example: consider a loop branch

for (i = 0; i < 10; i++) { }

whose behavior is taken nine times in a row, then not taken once. What is the prediction accuracy for this branch, assuming the prediction bit for this branch remains in the prediction buffer?

- Repeated executions of the loop will result in 2 incorrect predictions
- The steady-state prediction behavior will mispredict on the first and last loop iterations
- The last iteration of the previous loop execution flips T to NT (the branch has been taken nine times in a row at that point)
- The first iteration of the next loop execution flips NT to T, since the branch is taken
- So the prediction accuracy for this branch, which is taken 90% of the time, is only 80% (two incorrect predictions and eight correct ones)

Can we do better?
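The 80 percent figure can be checked with a short simulation (a sketch under the stated assumptions, not from the slides):

# A minimal sketch (assumed): simulate the 1-bit predictor on one steady-state
# execution of the loop. The bit enters the loop as NT, because the last
# iteration of the previous execution flipped it.

def run_1bit(outcomes, state):
    correct = 0
    for taken in outcomes:
        correct += (state == taken)   # predict whatever happened last time
        state = taken                 # 1 bit of history: remember the outcome
    return correct, state

loop = [True] * 9 + [False]           # taken nine times, then falls through
hits, state = run_1bit(loop, state=False)
print(hits / len(loop))               # -> 0.8: first and last iterations miss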


Branch Prediction Buffer: 1-bit prediction


[Figure: a table of 2^K one-bit entries, indexed by the low-order K bits of the branch address]

Problems:
- Aliasing: the lower K bits of different branch instructions could be the same. Solution: use tags (the buffer becomes a cache); however, this is very expensive
- Loops are predicted wrong twice. Solution: use an n-bit saturating counter for prediction
  * taken if counter >= 2^(n-1)
  * not taken if counter < 2^(n-1)

A 2-bit saturating counter predicts a loop wrong only once.

2-bit Branch Prediction Buffer


A 2-bit scheme changes the prediction only if it is mispredicted twice. It can be implemented as a saturating counter:

State 11 (Predict Taken):      T -> 11,  NT -> 10
State 10 (Predict Taken):      T -> 11,  NT -> 01
State 01 (Predict Not Taken):  T -> 10,  NT -> 00
State 00 (Predict Not Taken):  T -> 01,  NT -> 00
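A short simulation of this counter (a sketch, not from the slides) confirms the single misprediction per loop execution once the counter is warm:

# A minimal sketch (assumed): 2-bit saturating counter, values 0..3,
# predict taken when the counter is 2 or 3; increment on taken,
# decrement on not taken, saturating at both ends.

def run_2bit(outcomes, counter):
    correct = 0
    for taken in outcomes:
        correct += ((counter >= 2) == taken)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct, counter

loop = [True] * 9 + [False]
counter, hits = 3, 0                      # start in state 11 (strongly taken)
for _ in range(2):                        # two back-to-back loop executions
    h, counter = run_2bit(loop, counter)
    hits += h
print(hits / 20)                          # -> 0.9: one miss per loop execution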

2-bit Branch Prediction Scheme


The 2-bit branch prediction algorithm uses 4 states, and thus requires two bits of history information for each branch instruction:

ST : Strongly likely to be taken
SNT : Strongly likely not to be taken
LT : Likely to be taken
LNT : Likely not to be taken


[Figure: State-machine representation of branch-prediction algorithms]

2-bit Branch Prediction Scheme


Assume that the state of the algorithm is initially set to LNT. After the branch instruction has been executed, if the branch is actually taken, the state is changed to ST; otherwise it is changed to SNT. As program execution progresses and the same instruction is encountered again, the state of the branch prediction algorithm continues to change. When a branch instruction is encountered, the instruction fetch unit predicts that the branch will be taken if the state is either LT or ST, and not taken otherwise.

2-bit Branch Prediction Scheme


Consider what happens when executing a program loop. Assume that the branch instruction is at the end of the loop and that the processor sets the initial state of the algorithm to LNT. During the first pass, the prediction will be wrong (not taken), and hence the state will be changed to ST. In all subsequent passes the prediction will be correct except for the last pass. At that time, the state will change to LT. When the loop is entered a second time, the prediction will be correct (branch taken).

Dynamic Branch Prediction - 2-bit Prediction Scheme

Store 2 bits of information in branch history buffer

How does this scheme do on our loop example?
- 1 misprediction per loop execution if we start off in the (11) state
- 1 misprediction per loop execution (plus 2 mispredictions the first time through) if we start in the (00) state

Generalization: n-bit saturating counters
- Increment if taken, decrement if not
- Predict T if the value is more than half the maximum, else NT
- Not much of a win over 2-bit counters

2-bit Branch Prediction Scheme


The state information used in dynamic branch prediction algorithm may be kept by the processor in a variety of ways.

It may be recorded in a look-up table, which is accessed using the low-order part of the branch instruction address.
In this case, it is possible for two branch instructions to share the same table entry. This may lead to a branch being mispredicted but it does not cause an error in execution. Misprediction only introduces a small delay in execution time. An alternative approach is to store the history bits as a tag associated with branch instructions in the instruction cache.

Dynamic Prediction Logic


To make a prediction based upon repetitive historical usage, an initial prediction is made when the branch instruction is first encountered.
When the true branch instruction target address has been discovered, it is stored in a high speed look-up table, and used if the same instruction at the same address is encountered again. Subsequently, the stored target address will always be the address used on the last occasion. A stored bit might be included with each table entry to indicate that a previous prediction has been made. The look-up table can hold various information, the branch tag (the branch instruction address), the branch target address, prediction information, and even the instruction from the branch target address. If the branch instruction is held in the look-up table, this is fetched if the same branch instruction address is encountered again.

Branch target Buffer


Another possibility is the use of a branch target buffer or BTB. The BTB is an associative memory included in the fetch segment of the pipeline. Each entry in the BTB consists of the address of a previously executed branch instruction and the target instruction for that branch. It also stores the next few instructions after the branch target instruction. When the pipeline decodes a branch instruction, it searches the associative memory BTB for the address of the instruction. If it is in the BTB, the instruction is available directly and prefetch continues from the new path. If the instruction is not in the BTB, the pipeline shifts to a new instruction stream and stores the target instruction in the BTB. The advantage of this scheme is that branch instructions that have occurred previously are readily available in the pipeline without interruption.
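As a rough software analogue (assumed, not the slides' hardware), the BTB behaves like a map from branch address to target:

# A minimal sketch (assumed) of a branch target buffer as a lookup keyed by
# the branch instruction's address. A real BTB is an associative memory and
# also caches the instructions at the target address.

btb = {}                                  # branch PC -> predicted target

def fetch_predict(pc):
    """Predicted next PC for the instruction fetched at pc."""
    return btb.get(pc, pc + 4)            # hit: follow the stored path; miss: fall through

def resolve(pc, taken, target):
    if taken:
        btb[pc] = target                  # store the branch and its target for next time
    else:
        btb.pop(pc, None)                 # simple illustrative policy: drop not-taken branches

resolve(0x1000, True, 0x2000)
print(hex(fetch_predict(0x1000)))         # -> 0x2000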


Dynamic Prediction Logic

The prediction look-up table is a branch history table, also called a branch target buffer (BTB) which is implemented in a similar fashion to a cache. A direct mapped, a fully-associative or a set-associative cache-type table could be employed.


Instruction pipeline with Branch History Table


Dynamic Prediction Logic

Variations in the prediction strategy:

Strategy 1: Rather than update the predicted address when it is found to be wrong, allow it to remain until the next occasion, and change it then if it is still found to be wrong. This algorithm requires an additional bit stored with each entry to indicate that the previous prediction was correct, but might produce better results.

Strategy 2: Run-time execution history based upon the last n executions of the branch is used to predict the branch.

Strategy 3: Prediction is based upon the direction of the branch (forward or backward) or the branch opcode.

Influence on Instruction Sets


(Relationship between pipelined execution and machine instruction features)


Influence on Instruction Sets

1. Addressing Modes


Addressing Modes
Addressing modes should provide the means for accessing a variety of data structures simply and efficiently. Useful addressing modes include index, indirect, auto increment, and auto decrement.
Many processors provide various combinations of these modes to increase the flexibility of their instruction sets.

Complex addressing modes, such as those involving double indexing, are often encountered.

Addressing Modes
In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline.
Two important considerations in this regard are the side effects of modes such as autoincrement and autodecrement, and the extent to which complex addressing modes cause the pipeline to stall. Another important factor is whether a given mode is likely to be used by compilers.


Addressing Modes
To compare various approaches, assume a simple model for accessing operands in the memory.
The load instruction

Load X(R1), R2

takes five cycles to complete execution. However, the instruction

Load (R1), R2

can be organized to fit a four-stage pipeline because no address computation is required. Access to memory can take place in stage E. A more complex addressing mode may require several accesses to the memory to reach the named operand.

Addressing Modes
For example, the instruction
Load (X(R1)), R2

may be executed assuming that the index offset, X, is given in the instruction word.
After computing the address in cycle 3, the processor needs to access memory twice: first to read location X + [R1] in clock cycle 4, and then to read location [X + [R1]] in cycle 5. If R2 is a source operand in the next instruction, that instruction would be stalled for three cycles, which can be reduced to two cycles with operand forwarding.

Equivalent Operations using Complex and Simple addressing modes


Addressing Modes
To implement the same Load operation using only simple addressing modes requires several instructions. For example, on a computer that allows three operand addresses, we can use:

Add  #X, R1, R2
Load (R2), R2
Load (R2), R2

The Add instruction performs the operation R2 ← X + [R1]. The two Load instructions fetch the address and then the operand from the memory. This sequence of instructions takes exactly the same number of clock cycles as the original, single Load instruction.
This example indicates that, in a pipelined processor, complex addressing modes that involve several accesses to the memory do not necessarily lead to faster execution.

Addressing Modes
The main advantage of such modes is that they reduce the number of instructions needed to perform a given task and thereby reduce the program space needed in the main memory. Their main disadvantage is that their long execution times cause the pipeline to stall, thus reducing its effectiveness.
They require more complex hardware to decode and execute them. Also, they are not convenient for compilers to work with. The instruction sets of modern processors are designed to take maximum advantage of pipelined hardware.

Because complex addressing modes are not suitable for pipelined execution, they should be avoided.


Addressing Modes
The addressing modes used in modern processors often have the following features:
- Access to an operand does not require more than one access to the memory.
- Only load and store instructions access memory operands.
- The addressing modes used do not have side effects.

Three basic addressing modes that have these features are register, register indirect, and index (RISC concepts; example: the SPARC processor architecture). The first two require no address computation. In the index mode, the address can be computed in one cycle, whether the index value is given in the instruction or in a register. Memory is accessed in the following cycle. None of these modes has any side effects, with one possible exception.

Influence on Instruction Sets

2. Condition Codes


Condition Codes
In many processors the condition code flags are stored in the processor status register.
They are either set or cleared by many instructions, so that they can be tested by subsequent conditional branch instructions to change the flow of program execution.

An optimizing compiler for a pipelined processor attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur. In doing so, the compiler must ensure that reordering does not change the outcome of a computation. The dependency introduced by the condition code flags reduces the flexibility available for the compiler to reorder instructions.


Instruction Reordering
a) A program fragment

   Add       R1, R2
   Compare   R3, R4
   Branch=0  ...

b) Instructions reordered

   Compare   R3, R4
   Add       R1, R2
   Branch=0  ...


Condition Codes
Consider the sequence of instructions above, and assume that the execution of the Compare and Branch=0 instructions proceeds as in the figure.
The branch decision takes place in step E2 rather than D2 because it must await the result of the Compare instruction. The execution time of the branch instruction can be reduced by interchanging the Add and Compare instructions. This delays the Branch instruction by one cycle relative to the Compare instruction.

As a result, at the time the Branch instruction is being decoded, the result of the Compare instruction will be available and a correct branch decision will be made.
There would be no need for branch prediction.

However, interchanging the Add and Compare instructions can be done only if the Add instruction does not affect the condition codes.
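The following Python sketch illustrates this reordering rule (a hypothetical compiler-style check, not an actual compiler pass; the "AddS" flag-setting variant in the second example is an assumption in the style of ARM opcodes):

```python
# A minimal sketch, not from the slides: an instruction is modeled as a pair
# (text, affects_condition_codes); both example instructions are made up.
def safe_to_move_after_compare(instr):
    """The Add may be placed between Compare and Branch only if it leaves the
    condition code flags untouched, so Branch still tests Compare's result."""
    text, affects_cc = instr
    return not affects_cc

print(safe_to_move_after_compare(("Add R1, R2", False)))   # True: reordering is safe
print(safe_to_move_after_compare(("AddS R1, R2", True)))   # False: would clobber the flags
```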


Condition Codes
These observations lead to two important conclusions about the way condition codes should be handled.
First, to provide flexibility in reordering instructions, the condition code flags should be affected by as few instructions as possible. Second, the compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.

An instruction set designed with pipelining in mind usually provides the desired flexibility. The figure shows the instructions reordered, assuming that the condition code flags are affected only when this is explicitly stated as part of the instruction OP code. The SPARC and ARM architectures provide this flexibility.


DATAPATH

& CONTROL
CONSIDERATIONS

DataPath & Control


Several important changes to the processor organization are needed to make it suitable for pipelined execution:

1. There are separate instruction and data caches that use separate address and data connections to the processor. This requires two versions of the MAR register: IMAR for accessing the instruction cache and DMAR for accessing the data cache.

2. The PC is connected directly to the IMAR, so that the contents of the PC can be transferred to the IMAR at the same time that an independent ALU operation is taking place.


Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU

DataPath & Control


3. The data address in DMAR can be obtained directly from the register file or from the ALU, to support the register indirect and indexed addressing modes.

4. Separate MDR registers are provided for read and write operations. Data can be transferred directly between these registers and the register file during load and store operations, without the need to pass through the ALU.

5. Buffer registers have been introduced at the inputs and output of the ALU. These are registers SRC1, SRC2, and RSLT. Forwarding connections may be added if desired.

DataPath & Control


6. The instruction register has been replaced with an instruction queue, which is loaded from the instruction cache.

7. The output of the instruction decoder is connected to the control signal pipeline. There is a need for buffering control signals and passing them from one stage to the next along with the instruction. This pipeline holds the control signals in buffers B2 and B3.

DataPath & Control


The following operations can be performed independently in the processor:

- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers from the register file
- Writing into one register in the register file
- Performing an ALU operation

Because these operations do not use any shared resources, they can be performed simultaneously in any combination.

DataPath and Control


The processor structure provides the flexibility required to implement the four-stage pipeline. For example, let I1, I2, I3, and I4 be a sequence of four instructions. The following actions all happen during clock cycle 4:

- Write the result of instruction I1 into the register file
- Read the operands of instruction I2 from the register file
- Decode instruction I3
- Fetch instruction I4 and increment the PC
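A minimal Python sketch of this overlap (an idealized model with no hazards; the stage labels follow the four-stage pipeline described above) prints which stage each instruction occupies in every cycle, so that cycle 4 shows the four concurrent actions just listed:

```python
# A minimal sketch, not from the slides: each instruction enters the pipeline
# one cycle after its predecessor, so in cycle 4 all four stages are busy.
stages = ["F: fetch instruction",
          "D: decode",
          "E: read operands, perform ALU operation",
          "W: write result into register file"]
instructions = ["I1", "I2", "I3", "I4"]

for cycle in range(1, 8):
    active = []
    for i, instr in enumerate(instructions):
        stage_index = cycle - 1 - i        # each instruction starts one cycle later
        if 0 <= stage_index < len(stages):
            active.append(f"{instr} in {stages[stage_index]}")
    print(f"cycle {cycle}: " + "; ".join(active))
```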


SUPERSCALAR OPERATION


Superscalar Operation
A more aggressive approach is to equip the processor with multiple processing units to handle several instructions in parallel in each processing stage.

With this arrangement, several instructions start execution in the same clock cycle, and the processor is said to use multiple issue. Such processors are capable of achieving an instruction execution throughput of more than one instruction per cycle.
They are known as superscalar processors. Many modern high performance processors use this approach.

Superscalar Operation
For superscalar operation, the arrangement of an instruction queue is essential. To keep the instruction queue filled, a processor should be able to fetch more than one instruction at a time from the cache. Multiple issue operation requires a wider path to the cache and multiple execution units.

Separate execution units are provided for integer and floating point instructions.
Figure shows an example of a processor with two execution units, one for integer and one for floating point operations. The instruction fetch unit is capable of reading two instructions at a time and storing them in the instruction queue.

Superscalar Operation

A Processor with two execution units



Superscalar Operation

An example of instruction execution flow in the superscalar processor, assuming no hazards are encountered

Superscalar Operation
In each clock cycle, the dispatch unit retrieves and decodes up to two instructions from the front of the queue. If there are one integer and one floating-point instruction and no hazards, both instructions are dispatched in the same clock cycle. In a superscalar processor, the detrimental effect of various hazards on performance becomes even more pronounced.
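The dispatch rule can be sketched in Python as follows (a simplified model: hazards are ignored, and the instruction names and unit tags are invented for illustration):

```python
from collections import deque

# A minimal sketch, not from the slides: dual-issue, in-order dispatch for a
# processor with one integer and one floating-point unit. Up to two
# instructions leave the front of the queue per cycle, at most one per unit.
queue = deque([("I1", "int"), ("I2", "fp"), ("I3", "int"),
               ("I4", "int"), ("I5", "fp")])

cycle = 0
while queue:
    cycle += 1
    issued, busy = [], set()
    for _ in range(2):                       # examine up to two front instructions
        if queue and queue[0][1] not in busy:
            name, unit = queue.popleft()
            busy.add(unit)
            issued.append(f"{name} -> {unit} unit")
        else:
            break                            # in-order: stop at the first blocked one
    print(f"cycle {cycle}: dispatched {issued}")
```

Running it shows that I3 and I4 (two consecutive integer instructions) cannot be dispatched together, which is why interleaving integer and floating-point instructions helps.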

The compiler can avoid many hazards through judicious selection and ordering of instructions.
For example, for the processor in the figure, the compiler should strive to interleave floating-point and integer instructions. This would enable the dispatch unit to keep both the integer and floating-point units busy most of the time. In general, high performance is achieved if the compiler is able to arrange program instructions to take maximum advantage of the available hardware units.


OUT-OF-ORDER EXECUTION


Out-of-order Execution
In the figure, instructions are dispatched in the same order as they appear in the program.
However, their execution is completed out of order. Does this lead to any problems? We have already discussed the issues arising from dependencies among instructions. For example, if instruction I2 depends on the result of I1, the execution of I2 will be delayed.

As long as such dependencies are handled correctly, there is no reason to delay the execution of an instruction. However, a new complication arises when we consider the possibility of an instruction causing an exception.


Out-of-order Execution
Exceptions may be caused by a bus error during an operand fetch or by an illegal operation, such as an attempt to divide by zero.
The results of I2 are written back into the register file in cycle 4. If instruction I1 causes an exception, program execution is in an inconsistent state.

The program counter points to the instruction in which the exception occurred.
However, one or more of the succeeding instructions have been executed to completion. If such a situation is permitted, the processor is said to have imprecise exceptions.

Instruction Completion in Program Order


Out-of-order Execution
To guarantee a consistent state when exceptions occur, the results of the execution of instructions must be written into the destination locations strictly in program order. This means we must delay step W2 in the figure until cycle 6. In turn, the integer execution unit must retain the result of instruction I2, and hence it cannot accept instruction I4 until cycle 6.

If an exception occurs during an instruction, all subsequent instructions that have been partially executed are discarded. This is called a precise exception.


Out-of-order Execution
It is easier to provide precise exceptions in the case of external interrupts. When an external interrupt is received, the Dispatch unit stops reading new instructions from the instruction queue, and the instructions remaining in the queue are discarded.

All instructions whose execution is pending continue to completion.


At this point, the processor and all its registers are in a consistent state, and interrupt processing can begin.

EXECUTION COMPLETION


It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible.
At the same time, instructions must be completed in program order to allow precise exceptions. These seemingly conflicting requirements are readily resolved if execution is allowed to proceed as shown in figure, but the results are written into temporary registers. The contents of these registers are later transferred to the permanent registers in correct program order.

Execution Completion
This approach is illustrated in the figure. Step TW is a write into a temporary register. Step W is the final step, in which the contents of the temporary register are transferred into the appropriate permanent register. This step is often called the commitment step because the effect of the instruction cannot be reversed after that point.

If an instruction causes an exception, the results of any subsequent instruction that has been executed would still be in temporary registers and can be safely discarded.
A temporary register assumes the role of the permanent register whose data it is holding and is given the same name.


Execution Completion
For example, if the destination register of I2 is R5, the temporary register used in step TW2 is treated as R5 during clock cycles 6 and 7.

Its contents would be forwarded to any subsequent instruction that refers to R5 during that period.
Because of this feature, this technique is called register renaming. Note that the temporary register is used only for instructions that follow I2 in program order.
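A minimal Python sketch of this renaming behavior follows (the structures arch_regs, temp_regs, and rename_map are hypothetical names invented for illustration, not any real design):

```python
# A minimal sketch, not from the slides: the temporary register T0 stands in
# for R5 between steps TW2 and W2, exactly as described above.
arch_regs = {"R5": 100}          # committed (permanent) register values
temp_regs = {"T0": None}         # temporary registers used by step TW
rename_map = {}                  # architectural name -> temporary register

# I2 produces a new value for R5: write it to a temporary and rename R5.
temp_regs["T0"] = 42
rename_map["R5"] = "T0"

def read_operand(reg, follows_I2):
    """Instructions after I2 in program order see the renamed value;
    earlier, not-yet-retired instructions still read the real R5."""
    if follows_I2 and reg in rename_map:
        return temp_regs[rename_map[reg]]
    return arch_regs[reg]

print(read_operand("R5", follows_I2=True))   # 42: forwarded from the temporary
print(read_operand("R5", follows_I2=False))  # 100: unmodified architectural R5

# Commitment step W: transfer the temporary into the permanent register.
arch_regs["R5"] = temp_regs[rename_map.pop("R5")]
```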


Execution Completion
If an instruction that precedes I2 needs to read R5 in cycle 6 or 7, it would access the actual register R5, which still contains data that have not been modified by instruction I2.

When out-of-order execution is allowed, a special control unit is needed to guarantee in-order commitment.
This is called the commitment unit.

It uses a queue called the reorder buffer to determine which instructions should be committed next.
Instructions are entered in the queue strictly in program order as they are dispatched for execution.


Execution Completion
When an instruction reaches the head of that queue and the execution of that instruction has been completed, the corresponding results are committed and the instruction is removed from the queue. All resources that were assigned to the instruction, including the temporary registers, are released. The instruction is said to have been retired at this point.

Because an instruction is retired only when it is at the head of the queue, all instructions that were dispatched before it must also have been retired.
Hence, instructions may complete execution out of order, but they are retired in program order.
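The head-of-queue retirement rule can be sketched as follows (a simplified reorder-buffer model; the entry fields are assumptions):

```python
from collections import deque

# A minimal sketch, not from the slides: entries are appended at dispatch in
# strict program order; "done" flips when execution finishes, possibly out of
# order, but retirement only ever removes completed entries from the head.
rob = deque()

def dispatch(name):
    entry = {"name": name, "done": False}
    rob.append(entry)
    return entry

i1, i2, i3 = dispatch("I1"), dispatch("I2"), dispatch("I3")
i2["done"] = True                 # I2 finishes first (out of order) ...

def retire():
    while rob and rob[0]["done"]: # ... but cannot retire past the head
        print("retired", rob.popleft()["name"])

retire()                          # nothing retires: I1 is still executing
i1["done"] = True
retire()                          # now I1 and then I2 retire, in program order
```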


DISPATCH OPERATION


Dispatch Operation
When dispatch decisions are made, the dispatch unit must ensure that all the resources needed for the execution of an instruction are available.

For example, since the results of an instruction may have to be written in a temporary register, the required register must be free, and it is reserved for use by that instruction as a part of the dispatch operation. A location in the reorder buffer must also be available for the instruction.
When all the resources needed are assigned, including an appropriate execution unit, the instruction is dispatched.


Dispatch Operation
Should instructions be dispatched out of order? For example, if instruction I2 in the figure is delayed because of a cache miss for a source operand, the integer unit will be busy in cycle 4, and I4 cannot be dispatched. Should I5 be dispatched instead? In principle this is possible, provided that a place is reserved in the reorder buffer for instruction I4, to ensure that all instructions are retired in the correct order. Dispatching instructions out of order requires considerable care.

If I5 is dispatched while I4 is still waiting for some resource, we must ensure that there is no possibility of a deadlock occurring.

Dispatch Operation
A deadlock is a situation that can arise when two units, A and B, use a shared resource. Suppose that unit B cannot complete its task until unit A completes its task. At the same time, unit B has been assigned a resource that unit A needs. If this happens, neither unit can complete its task.

Unit A is waiting for the resource it needs, which is being held by unit B.
At the same time unit B is waiting for unit A to finish before it can release that resource.


Dispatch Operation
If instructions are dispatched out of order, a deadlock can arise as follows. Suppose that the processor has only one temporary register, and that when I5 is dispatched, that register is reserved for it. Instruction I4 cannot be dispatched because it is waiting for the temporary register, which in turn will not become free until instruction I5 is retired.

Since instruction I5 cannot be retired before I4, we have a deadlock.


To prevent deadlocks, the dispatcher must take many factors into account. Hence, issuing instructions out of order is likely to increase the complexity of the dispatch unit significantly.

Dispatch Operation
It may also mean that more time is required to make dispatching decisions. For these reasons, most processors use only in-order dispatching. Thus, the program order of instructions is enforced at the time instructions are dispatched and again at the time instructions are retired. Between these two events, the execution of several instructions can proceed at their own speed, subject only to any interdependencies that may exist among instructions. The UltraSPARC II is a commercially successful, superscalar, highly pipelined processor that uses these techniques.


Dynamic Pipeline Scheduling


Dynamic pipeline scheduling looks past a stalled instruction to find later instructions that can execute while the stall is being resolved.
Typically, the pipeline is divided into three major units :

- An instruction fetch and issue unit
- Execute units
- A commit unit


The instruction fetch and issue unit fetches instructions, decodes them, and sends each instruction to a corresponding functional unit of the execute stage. There might be 5 to 10 functional units. Each functional unit has buffers, called reservation stations, that hold the operands and the operation. As soon as the buffer contains all its operands and the functional unit is ready to execute, the result is calculated. It is then up to the commit unit to decide when it is safe to put the result into the register file or into memory.
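A minimal Python sketch of a single reservation-station entry follows (the field names are assumptions; real stations also track register tags and result forwarding):

```python
# A minimal sketch, not from the slides: the operation fires only once both
# operands have arrived and the functional unit is free.
station = {"op": "add", "operand1": None, "operand2": None}
unit_busy = False

def try_execute():
    if unit_busy or station["operand1"] is None or station["operand2"] is None:
        return None                       # keep waiting in the reservation station
    return station["operand1"] + station["operand2"]

print(try_execute())          # None: operands not yet delivered
station["operand1"] = 3
station["operand2"] = 4
print(try_execute())          # 7: computed; the commit unit decides when the
                              # result becomes architecturally visible
```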


Dynamic Pipeline Scheduling


Dynamic Pipeline Scheduling


A commit unit controls updates to the register file and memory.
Some dynamically scheduled machines update the register file immediately during execution. Other machines have a copy of the register file, and the actual update to the register file occurs later as part of the commit.

For memory, there is normally a store buffer, also called a write buffer.
The commit unit allows the store to write to memory from the buffer when the buffer holds valid data and when the store is no longer dependent on predicted branches.
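This gating rule can be sketched as follows (a simplified store-buffer model; the entry fields are invented for illustration):

```python
# A minimal sketch, not from the slides: a store reaches memory only when its
# data is valid and it no longer depends on an unresolved predicted branch.
memory = {}
store_buffer = [{"addr": 0x100, "data": None, "speculative": True}]

def commit_stores():
    remaining = []
    for s in store_buffer:
        if s["data"] is not None and not s["speculative"]:
            memory[s["addr"]] = s["data"]   # safe: release the store to memory
        else:
            remaining.append(s)             # hold it in the buffer
    store_buffer[:] = remaining

commit_stores(); print(memory)              # {}: still speculative
store_buffer[0].update(data=7, speculative=False)
commit_stores(); print(memory)              # {256: 7}
```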


Dynamic Pipeline Scheduling

Dynamic scheduling is normally combined with branch prediction, so the commit unit must be able to discard all the results in the execution units that were due to instructions executed after a mispredicted branch. Combining dynamic scheduling with branch prediction is called speculative execution. A second reason for complexity is that dynamic scheduling is also typically combined with superscalar execution, so each unit may be issuing or committing four to six instructions each clock cycle.

Dynamic Pipeline Scheduling


Dynamic machines predict program flow, look at instructions across multiple segments to see which to execute next, and then speculatively execute instructions based on the prediction and the instruction dependencies.

The motivations for dynamic pipelining are threefold:

- Hide memory latency, a major issue for computers.
- Avoid stalls that the compiler could not schedule, often due to potential dependencies between store and load instructions.
- Speculatively execute instructions while waiting for hazards to be resolved.


Dynamic Pipeline Scheduling

A microprocessor that follows all three trends (deep pipelines, superscalar execution, and dynamic pipelining) is the DEC Alpha 21264. This superscalar machine fetches four instructions per clock cycle but can issue up to six, and it uses out-of-order execution with in-order completion.
The pipeline takes nine stages for simple integer and floating-point operations, yielding a clock rate of 600 MHz in 1997. Dynamically scheduled pipelines are used in both the PowerPC 604 and the Pentium Pro.

Generic Dynamic Pipeline organization in Intel Pentium Pro & PowerPC 604


Dynamic Pipeline Scheduling


The instruction cache fetches 16 bytes of instructions and sends them to an instruction queue: four instructions for the PowerPC and a variable number of instructions for the Pentium Pro.
Next, several instructions are fetched and decoded.

Both processors use a 512-entry branch history table to predict branches and speculatively execute instructions after the predicted branch.
The dispatcher unit sends each instruction and its operands to the reservation station of one of the six functional units. The dispatcher also places an entry for the instruction in the reorder buffer of the commit unit. Thus, an instruction cannot issue unless there is space available in both an appropriate reservation station and the reorder buffer.
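This issue condition can be sketched as follows (the unit names and free-slot counts are assumptions for illustration):

```python
# A minimal sketch, not from the slides: an instruction issues only if its
# functional unit has a free reservation station AND the reorder buffer has a
# free entry.
station_free = {"int": 2, "fp": 0}   # free reservation-station slots per unit
rob_free = 1                         # free reorder-buffer entries

def can_issue(unit):
    return station_free[unit] > 0 and rob_free > 0

print(can_issue("int"))   # True: both a station slot and a ROB entry are free
print(can_issue("fp"))    # False: no free reservation station for the FP unit
```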


Dynamic Pipeline Scheduling


With so many instructions executing at the same time, internal registers are required to keep results.
Both processors have extra internal registers, called rename buffers or rename registers, that are used to hold results while waiting for the commit unit to commit the result to one of the real registers. The decode unit is where rename buffers get assigned, thereby reducing hazards on register numbers. Whenever an entry in the reservation station has all its operands and the associated functional unit is available, the operation is performed.


Dynamic Pipeline Scheduling


The commit unit keeps track of all the pending instructions in its reorder buffer.
Because both machines use branch prediction, an instruction is not finished until the commit unit says it is. When the branch functional unit determines whether or not a branch was taken, it informs both the branch prediction unit, so that it can update its state machine, and the commit unit, so that it can decide the fate of pending instructions.

If the prediction was accurate, the results of the instructions after the branch are marked valid and thus can be placed in the programmer-visible registers and memory. If a misprediction occurred, then all the instructions after the branch are marked invalid and discarded from the reservation stations and reorder buffer.

The commit unit can commit several instructions per clock cycle. To provide precise behavior during exceptions, the commit unit makes sure the instructions commit in the order they were issued.


Pentium 4 Processor Block diagram


Pentium 4 Processor Architecture


Superscalar Processors

Some 64-bit Processors



Block Diagram


Pipeline Stages
Stage   Athlon                  Hammer
1       Fetch                   Fetch1
2       Scan                    Fetch2
3       Align1                  Pick
4       Align2                  Decode1
5       Decode1 (EDEC)          Decode2
6       Decode2 (IDEC/Rename)   Pack
7       Schedule                Pack/Decode
8       AGU/ALU                 Dispatch
9       L1 Address Generation   Schedule
10      Data Cache              AGU/ALU
11      -                       Data Cache 1
12      -                       Data Cache 2


Block Diagram

CISC Characteristics
The essential goal of a CISC architecture is to attempt to provide a single machine instruction for each statement that is written in a high-level language.

Examples of CISC architectures are DEC VAX, IBM 370, Intel


CISC Characteristics
Major characteristics of CISC architecture are:

1. A large number of instructions - typically from 100 to 250 instructions
2. Some instructions that perform specialized tasks and are used infrequently
3. A large variety of addressing modes - typically from 5 to 20 different modes
4. Variable-length instruction formats
5. Instructions that manipulate operands in memory

RISC Characteristics
The concept of RISC architecture involves an attempt to reduce execution time by simplifying the instruction set of the computer. Examples: Berkeley RISC I, MIPS.

The major characteristics of a RISC processor are:

1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format

RISC Characteristics
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
8. A relatively large number of registers in the processor unit
9. Use of overlapped register windows to speed up procedure call and return
10. Efficient instruction pipeline
11. Compiler support for efficient translation of high-level language programs into machine language programs

RISC Characteristics

Overlapped Register Windows


A characteristic of some RISC processors is their use of overlapped register windows to provide the passing of parameters and avoid the need for saving and restoring register values. Each procedure call results in the allocation of a new window, consisting of a set of registers from the register file, for use by the new procedure. Windows for adjacent procedures have overlapping registers that are shared to provide the passing of parameters and results.


Overlapped Register Windows


Overlapped Register Windows


In general, the organization of register windows will have the following relationships:
Number of global registers = G
Number of local registers in each window = L
Number of registers common to two windows = C
Number of windows = W

The number of registers available for each window is calculated as follows:

Window size = L + 2C + G
The total number of registers needed in the processor is

Register file = (L + C) x W + G

For example, if G = 10, L = 10, C = 6, and W = 4, then the window size is 10 + 12 + 10 = 32 registers, and the register file consists of (10 + 6) x 4 + 10 = 74 registers.
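These relationships can be checked with a short Python sketch (a direct transcription of the formulas above, verifying the worked example):

```python
# A minimal sketch, not from the slides: window size and register-file size
# from the overlapped-register-window relationships above.
def window_size(L, C, G):
    # each window sees: L local + 2C shared (C with each neighbor) + G global
    return L + 2 * C + G

def register_file_size(L, C, W, G):
    # the file holds L + C unique registers per window, plus the globals
    return (L + C) * W + G

assert window_size(L=10, C=6, G=10) == 32
assert register_file_size(L=10, C=6, W=4, G=10) == 74
print(window_size(10, 6, 10), register_file_size(10, 6, 4, 10))
```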

Berkeley RISC I

