UNIT II
PROCESSING UNIT
Lecture Slides on Computer System Design MTech(CSE) @ Dr A R Naseer 1
Pipeline Hazard
There are situations in pipelining when the next instruction cannot execute in the following clock cycle. These events are called pipeline hazards: any condition that causes the pipeline to stall is a hazard.
In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operation:
1. Resource conflicts (structural hazards) - caused, for example, by two segments accessing memory at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
Data Hazards
A data hazard is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. As a result, some operation has to be delayed and the pipeline stalls.
Address dependency
An address dependency may occur when an operand address cannot be calculated because the information needed by the addressing mode is not available. For example,
An instruction with register indirect mode cannot proceed to fetch the operand if the previous instruction is loading the address into the register.
Therefore, the operand access to memory must be delayed until the required address is available.
Consider the code sequence:
1 add R3, R2, R1   ; R3 = R2 + R1
2 sub R4, R3, R5   ; R4 = R3 - R5
It would be incorrect to begin reading R3 in instruction 2 before instruction 1 has produced its new value for R3.
Hence there is a data dependency between instructions 1 and 2. This dependency is called a true data dependency. A true data dependency occurs when the value produced by an instruction is required by a subsequent instruction. True data dependencies are also called Read-After-Write (RAW) hazards because we are reading a value after writing to it. True dependencies are also sometimes known as flow dependencies because the dependency is due to the flow of data in the program.
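The read/write sets involved in a RAW dependency can be checked mechanically. The following sketch is not from the slides; the three-operand, destination-first assembly format is an assumption made for this example:

```python
# Illustrative sketch: detect a RAW (true data) dependency between two
# instructions by checking whether the second reads a register that the
# first writes. Instruction format "op dest, src1, src2" is assumed.

def regs(instr):
    """Split 'add R3, R2, R1' into (dest, set_of_sources)."""
    op, operands = instr.split(None, 1)
    parts = [p.strip() for p in operands.split(",")]
    return parts[0], set(parts[1:])

def has_raw(i1, i2):
    """True if i2 reads a register that i1 writes."""
    dest1, _ = regs(i1)
    _, srcs2 = regs(i2)
    return dest1 in srcs2

print(has_raw("add R3, R2, R1", "sub R4, R3, R5"))  # True: R3 read after write
```

With an independent second instruction (one that does not read R3) the same check returns False.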
[Figure: pipeline diagram showing the sub instruction stalled until the add instruction's result in R3 can be read]
Anti-dependency

Consider the code sequence:
1 add R4, R2, R1   ; R4 = R2 + R1
2 sub R2, R3, R5   ; R2 = R3 - R5

Instruction 2 must not produce its result in R2 before instruction 1 reads R2; otherwise instruction 1 would use the value produced by instruction 2 rather than the previous value of R2. This type of dependency is called an anti-dependency (also called a Write-After-Read (WAR) hazard) and occurs when an instruction writes to a location that has been read by a previous instruction.
Anti-dependency
In most pipelines, reading occurs in a stage before writing, so normally if the instructions remain in program order in the pipeline, an antidependency would not be a problem. It becomes a problem if the pipeline structure is such that writing can occur before reading in the pipeline, or the instructions are not processed in program order within the pipeline. The latter situation could occur for example if instructions are allowed to overtake stalled instructions in the pipeline. For example, suppose instruction 1 was stalled because of a resource conflict. Instruction 2 could be allowed to overtake instruction 1 in the pipeline, but then we could have an anti-dependency (write-after-read hazard).
Output Dependency
Consider a code sequence of three instructions in which instruction 1 writes its result to R3, instruction 2 reads R3, and instruction 3 also writes to R3.
Instruction 1 must produce its result in R3 before instruction 3 produces its result in R3 otherwise instruction 2 might use the wrong value of R3. This type of dependency is called an Output dependency (also called a Write After Write (WAW) hazard) and occurs when a location is written to by two instructions. Again the dependency would not be significant if all instructions write at the same time in the pipeline and instructions are processed in program order. Actually output dependencies are a form of resource conflict, because the register in question, R3, is accessed by two instructions. The register is being reused and consequently, the use of another register in the instruction would eliminate the potential problem.
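All three dependency types can be expressed as intersections of each instruction's input set I(k) and output set O(k). The sketch below is illustrative; the register sets used in the demonstration are hypothetical examples, not taken from the slides:

```python
# Classify dependencies between two instructions from their input sets
# I(k) and output sets O(k). An empty intersection means no hazard of
# that kind.
def classify(o1, i1, o2, i2):
    hazards = []
    if o1 & i2:
        hazards.append("RAW")   # true dependency: read after write
    if i1 & o2:
        hazards.append("WAR")   # anti-dependency: write after read
    if o1 & o2:
        hazards.append("WAW")   # output dependency: write after write
    return hazards

# Two instructions that both write R3 but share no other registers
# (hypothetical sets) exhibit only an output dependency:
print(classify(o1={"R3"}, i1={"R2", "R1"}, o2={"R3"}, i2={"R6", "R7"}))  # ['WAW']
```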
Instruction Hazard
The pipeline may also be stalled because of a delay in the availability of an instruction. For example, this may be the result of a miss in the cache, requiring the instruction to be fetched from the main memory, or of a branch instruction. Such hazards are often called control hazards or instruction hazards.
Figure: Pipeline stall caused by a cache miss in F2
Instruction Hazard
Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally.
However, the fetch operation for instruction I2, which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5.
Control Hazards
Control Hazards are caused by Procedural dependencies Procedural dependencies occur when the execution order is unknown prior to execution of instructions. Procedural dependencies are caused notably by branch instructions
Conditional Branch Instructions
Example: JNZ L1 ; Branch to L1 if Z=0
The fetch unit keeps fetching instructions irrespective of the execution of the instructions.
When the branch instruction is completely executed, or at least when the condition can be tested, it would be known which instruction to process next. If this instruction is not the next instruction in the pipeline, all instructions in the pipeline are abandoned and the pipeline cleared. Hopefully, the abandoned instructions have not altered the state of the processor, i.e., not written to any register, otherwise the changes must be reversed, an expensive operation. Then the required instruction is fetched from the target address and must be processed through all units of the pipeline in the same way as when the pipeline is first started.
Unconditional branch instructions always cause a change in the sequence. The change is predictable and fixed, but can also affect the pipeline.
22
Structural Hazards
A third type of hazard that may be encountered in pipelined operation is known as a structural hazard.
This is the situation when two instructions require the use of a given hardware resource at the same time. The most common case in which this hazard may arise is access to memory. One instruction may need to access memory as part of the execute or write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed. Many processors use separate instruction and data caches to avoid this delay.
Structural Hazards
Structural hazards are caused by resource conflicts.
Resource conflicts occur when a resource such as memory or a functional unit is required by more than one instruction at the same time in the pipeline. A notable resource conflict is the main memory of the system. For memories with only one read/write port, only one access can be made at any instant.
Data Hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason.
Consider a program that contains two instructions, I1 followed by I2 which requires the results of I1. When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 is completed. This means that the results generated by I1, may not be available for use by I2. We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially.
Data Hazards
Consider the 4-stage pipeline. A data dependency arises when the destination of one instruction is used as a source in the next instruction.
For example, the two instructions
Mul R2, R3, R4   ; R4 = R2 * R3
Add R5, R4, R6   ; R6 = R5 + R4
The result of the multiply instruction is placed into register R4, which in turn is one of the two source operands of the Add instruction. Assume that the multiply operation takes one clock cycle to complete. As the decode unit decodes the Add instruction in cycle 3, it realizes that R4 is used as a source operand. Hence the D step of that instruction cannot be completed until the W step of the multiply instruction has been completed. Completion of step D2 must be delayed to clock cycle 5, and is shown as step D2A in the figure. Instruction I3 is fetched in cycle 3, but its decoding must be delayed because step D3 cannot precede D2. Hence pipelined execution is stalled for two cycles.
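The two-cycle stall described above can be reproduced with a small cycle-accounting sketch. This is a simplified model assuming one instruction is fetched per cycle and a dependent instruction's Decode is held until the producer's Write completes; it is not a full pipeline simulator:

```python
# Simplified F-D-E-W pipeline schedule for a pair of instructions, where
# deps[j] = i means instruction j's D stage must wait for i's W stage.
def schedule(instrs, deps):
    stages = {}
    for j, name in enumerate(instrs):
        f = j + 1                       # one fetch per cycle
        d = f + 1
        if j in deps:                   # stall D until producer wrote back
            d = max(d, stages[deps[j]][3] + 1)
        e, w = d + 1, d + 2
        stages[j] = (f, d, e, w)
        print(f"{name}: F={f} D={d} E={e} W={w}")
    return stages

s = schedule(["Mul R2,R3,R4", "Add R5,R4,R6"], {1: 0})
# The Add's D step is pushed from cycle 3 to cycle 5: a two-cycle stall,
# matching the D2A behaviour described above.
```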
Figure: Pipeline stalled by a data dependency between D2 and W1
Data Hazard
Pipelined computers deal with such data-dependency conflicts in a variety of ways:
1. Hardware interlocks
2. Operand Forwarding
3. Delayed load
Hardware Interlocks
An interlock is a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline.
Detection of this situation causes the instruction whose source is not available to be delayed by enough clock cycles to resolve the conflict.
This approach maintains the program sequence by using hardware to insert the required delays.
Forwarding
The term forwarding refers to the technique of passing the result of one instruction directly to another instruction to eliminate the use of intermediate storage locations.
It can be applied at the compiler level to eliminate unnecessary references to memory locations by forwarding values through registers rather than through memory locations. Such forwarding would generally increase the speed of operation, as accesses to processor registers are normally faster than accesses to memory locations. Forwarding can also be applied at the hardware level to eliminate pipeline cycles for reading registers that were updated in a previous pipeline stage. In that case, forwarding eliminates register accesses, by using faster data paths.
For example, the sequence
sw R2, label
lw R3, label
could be reduced to
sw R2, label
move R3, R2
so that the value in R2 is forwarded through a register instead of being re-read from memory.
Data Hazard
Operand Forwarding
Another technique called operand forwarding uses special hardware to detect a conflict and then avoid it by routing the data through special paths between pipeline segments. For example, instead of transferring an ALU result into a destination register, the hardware checks the destination operand and, if it is needed as a source in the next instruction, passes the result directly to the ALU input, bypassing the register file.
This method requires additional hardware paths through multiplexers as well as the circuit that detects the conflict.
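The selection logic can be sketched as follows: each ALU input is taken either from the register file or, when the previous instruction's destination matches the needed source, from the result register directly. The function, register names, and values here are illustrative assumptions, not the actual hardware design:

```python
# Illustrative forwarding multiplexer: bypass the register file when the
# needed value is still only in the ALU result register (rslt).
def alu_input(src_reg, prev_dest, regfile, rslt):
    # If the previous instruction writes the register we need, take its
    # result directly; otherwise read the register file as usual.
    return rslt if src_reg == prev_dest else regfile[src_reg]

regfile = {"R2": 10, "R4": 0}   # R4 not yet written back (stale value 0)
rslt = 30                        # result of the previous instruction, dest R4
a = alu_input("R4", "R4", regfile, rslt)   # forwarded: 30, not the stale 0
b = alu_input("R2", "R4", regfile, rslt)   # normal register-file read: 10
print(a + b)  # 40
```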
Internal Forwarding
Internal forwarding is hardware forwarding implemented by processor registers or data paths not visible to the programmer.
It can be applied in an instruction pipeline to eliminate references to general-purpose registers.
Data Hazard
Operand Forwarding
The data hazard arises because one instruction, I2, is waiting for data to be written into the register file. However, these data are available at the output of the ALU once the execute stage completes step E1. Hence the delay can be reduced, or possibly eliminated, if we arrange for the result of instruction I1 to be forwarded directly for use in step E2.
Data Hazard
Operand Forwarding
A part of the processor data path involving the ALU and the register file is shown in the figure. This arrangement is similar to the three-bus structure except that registers SRC1, SRC2, and RSLT have been added. These registers constitute the inter-stage buffers needed for pipelined operation. Registers SRC1 and SRC2 are part of buffer B2, and RSLT is part of B3. The data forwarding mechanism is provided by the blue connection lines. The two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of either the SRC1 or SRC2 register.
When the instructions are executed in the data path the operations performed in each clock cycle are as follows.
After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding. The operand not involved in the dependency, register R2, is read and loaded into register SRC1 in clock cycle 3. In the next clock cycle, the product produced by instruction I1 is available in register RSLT, and because of the forwarding connection, it can be used in step E2. Hence execution of I2 proceeds without interruption.
There are two possible approaches:
1. Detect the data dependencies and hold up the pipeline completely until they have been resolved.
2. Allow all instructions to be fetched into the pipeline, but allow only independent instructions to proceed to completion; delay instructions that depend on other, not-yet-executed instructions until those instructions have executed.
There is a Read-After-Write hazard, since O(1) ∩ I(2) = {R3}.
There is no Write-After-Read hazard, since I(1) ∩ O(2) = ∅.
There is no Write-After-Write hazard, since O(1) ∩ O(2) = ∅.
Data Hazard
Delayed load
A procedure employed in some computers is to give the responsibility for solving data conflict problems to the compiler that translates the high-level programming language into a machine language program. The compiler for such computers is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data by inserting no-operation instructions. This method is referred to as delayed load.
Handling Data Hazards in Software
An alternative approach is to leave the task of detecting data dependencies and dealing with them to the software. In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (no-operation) instructions, as follows:

I1: Mul R2, R3, R4   ; R4 = R2 * R3
    NOP
    NOP
I2: Add R5, R4, R6   ; R6 = R5 + R4
If the responsibility for detecting such dependencies is left entirely to the software, the compiler must insert the NOP instructions to obtain a correct result. This possibility illustrates the close link between the compiler and the hardware: a particular feature can be either implemented in hardware or left to the compiler, and leaving such tasks to the compiler leads to simpler hardware.
Being aware of the need for delay, the compiler can attempt to reorder instructions to perform useful tasks in the NOP slots, and thus achieve better performance. On the other hand, insertion of NOP instructions leads to larger code size. Also, it is often the case that a given processor architecture has several hardware implementations offering different features. NOP instructions inserted to satisfy the requirements of one implementation may not be needed and, hence, would lead to reduced performance on a different implementation.
Data Hazards
Example : Using Compiler to detect & resolve Data Hazards
sub R2, R1, R3
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)
In this case, the compiler needs to insert two independent instructions between the sub and the and instructions, thereby making the hazard disappear. When no such independent instructions can be found, the compiler inserts instructions guaranteed to be independent i.e., nop instructions
sub R2, R1, R3
nop
nop
and R12, R2, R5
or  R13, R6, R2
add R14, R2, R2
sw  R15, 100(R2)

Although this code works properly for this pipeline, the two nops occupy two clock cycles in which no useful work is done.
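A compiler pass of the kind described can be sketched as follows. This is a simplified illustration (destination-first parsing, register operands only, a fixed two-slot hazard window), not a real compiler:

```python
# Insert 'nop's so that no instruction reads a register within `distance`
# slots of the instruction that writes it. Instructions are assumed to be
# "op dest, src1, src2"-style strings with register operands.
def insert_nops(code, distance=2):
    out = []
    for instr in code:
        _op, *ops = instr.replace(",", " ").split()
        srcs = ops[1:]                       # everything after the destination
        # look back over the last `distance` emitted instructions
        for back in range(1, distance + 1):
            if len(out) >= back:
                prev = out[-back]
                if prev != "nop" and prev.replace(",", " ").split()[1] in srcs:
                    # pad so the producer's write completes before the read
                    out.extend(["nop"] * (distance - back + 1))
                    break
        out.append(instr)
    return out

prog = ["sub R2, R1, R3", "and R12, R2, R5"]
print(insert_nops(prog))
# ['sub R2, R1, R3', 'nop', 'nop', 'and R12, R2, R5']
```

On the sub/and pair above the pass inserts exactly the two nops shown in the slide's example.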
INSTRUCTION HAZARDS
The purpose of the instruction fetch unit is to supply the execution units with a steady stream of instructions. Whenever this stream is interrupted, as in the case of a cache miss, the pipeline stalls.
Unconditional Branches
Consider an unconditional branch instruction being executed in a two-stage pipeline. The pipeline is stalled for one clock cycle. The time lost as a result of a branch instruction is often referred to as the branch penalty. For a longer pipeline, the branch penalty may be higher.
Figure: An idle cycle caused by a branch instruction
Typically the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched.
With this additional hardware, both of these tasks can be performed in step Decode. In this case, the branch penalty is only one clock cycle.
Lecture Slides on Computer System Design - MTech(CSE) @ Dr A R Naseer 54
Note that branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch is available in the queue.
When a cache miss occurs, the dispatch unit continues to send instructions for execution as long as the instruction queue is not empty.
Meanwhile, the desired cache block is read from the main memory or from a secondary cache. If the instruction queue does not become empty by the time fetch operations resume, the cache miss will have no effect on the rate of instruction execution.
In summary, the instruction queue mitigates the impact of branch instructions on performance through the process of branch folding.
It has a similar effect on stalls caused by cache misses.
The effectiveness of this technique is enhanced when the instruction fetch unit is able to read more than one instruction at a time from the instruction cache.
Conditional Branches
A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
The decision to branch cannot be made until the execution of that instruction has been completed. Branch instructions represent about 20 percent of the dynamic instruction count of most programs.
(The dynamic count is the number of instruction executions, taking into account that some program instructions are executed many times because of loops.) Because of the branch penalty, this large percentage would reduce the gain in performance expected from pipelining. Fortunately, branch instructions can be handled in several ways to reduce their negative impact on the rate of execution of instructions.
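A rough calculation shows why this matters. The 20 percent branch fraction is from the text; the taken fraction and the penalty below are illustrative assumptions for the example, not figures from the slides:

```python
# Back-of-the-envelope effect of branch penalties on average cycles per
# instruction (CPI), assuming an otherwise ideal pipeline with CPI = 1.
def effective_cpi(base_cpi, branch_frac, penalty, taken_frac):
    # each taken branch adds `penalty` stall cycles on average
    return base_cpi + branch_frac * taken_frac * penalty

# 20% branches (from the text), assumed 60% taken, assumed 2-cycle penalty:
print(effective_cpi(1.0, 0.20, 2, 0.60))  # ~1.24, a 24% slowdown
```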
The branch instruction breaks the normal sequence of the instruction stream, causing difficulties in the operation of the instruction pipeline.
Pipelined computers employ various hardware techniques to minimize the performance degradation caused by instruction branching:
1. Prefetch target instruction
2. Branch target buffer
3. Branch prediction
4. Delayed branch
Branch Prediction
Another technique for reducing the branch penalty associated with conditional branches is to attempt to predict whether or not a particular branch will be taken.
The simplest form of branch prediction is to assume that the branch will not take place and to continue to fetch instructions in sequential address order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.
Branch Prediction
Speculative execution means that instructions are executed before the processor is certain that they are in the correct execution sequence. Hence, care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed. If the branch decision indicates otherwise, the instructions and all their associated data in the execution units must be purged, and the correct instructions fetched and executed.
[Figure: pipeline diagram showing idle stages while the branch outcome is resolved]
A decision on which way to predict the result of the branch may be made in hardware by observing whether the target address of the branch is lower than or higher than the address of the branch instruction. A more flexible approach is to have the compiler decide whether a given branch instruction should be predicted taken or not taken. The branch instructions of some processors, such as SPARC, include a branch prediction bit, which is set to 0 or 1 by the compiler to indicate the desired behavior. The instruction fetch unit checks this bit to predict whether the branch will be taken or not taken. With either of these schemes, the branch prediction decision is always the same every time a given instruction is executed.
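The backward/forward hardware heuristic mentioned above can be sketched directly. The intuition: a backward branch (target below the branch address) is likely a loop-closing branch and is predicted taken. The addresses used in the demonstration are illustrative:

```python
# Static prediction from address comparison: backward branches are
# predicted taken (likely loop branches), forward branches not taken.
def static_predict(branch_addr, target_addr):
    return "taken" if target_addr < branch_addr else "not taken"

print(static_predict(0x1040, 0x1000))  # backward branch -> 'taken'
print(static_predict(0x1040, 0x1080))  # forward branch  -> 'not taken'
```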
This situation is exactly the same as in the case of data dependency (Delayed Load).
Using the one-instruction delayed version of the branch instruction, the add instruction can be moved to after the branch:

sub R6, R6, 1
jz  L1
add R3, R4, R5
NOP (No Operation) instruction can be used to delay the branch instruction when the compiler cannot find independent instructions to insert after branch instructions in the code sequence
Consider the following code sequence :
sub R2, R2, 1
jz  L1
nop
L1:
For a processor with two branch delay slots, the compiler attempts to find two instructions preceding the branch instruction that it can move into the delay slots without introducing a logical error. The chances of finding two such instructions are considerably less than the chances of finding one. Thus if increasing the number of pipeline stages involves an increase in the number of branch delay slots, the potential gain in performance may not be fully realized.
When the branch instruction is executed and if the branch is taken, the machine moves to state LT. Otherwise it remains in state LNT. The next time the same instruction is encountered, the branch is predicted as taken if the corresponding state machine is in state LT. Otherwise it is predicted as not taken.
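The LT/LNT scheme above is a one-bit predictor: the stored state is simply the branch's outcome the last time it executed. A minimal sketch follows; the loop-outcome sequence is an illustrative example:

```python
# Two-state (LT / LNT) branch predictor: remember only the last outcome.
class OneBitPredictor:
    def __init__(self):
        self.state = "LNT"            # initial state: likely not taken
    def predict(self):
        return self.state == "LT"     # predict taken only in state LT
    def update(self, taken):
        self.state = "LT" if taken else "LNT"

p = OneBitPredictor()
outcomes = [True, True, True, False]  # e.g. a loop branch taken 3 times
hits = 0
for actual in outcomes:
    hits += (p.predict() == actual)
    p.update(actual)
print(hits, "correct out of", len(outcomes))  # 2 correct out of 4
```

Note the two mispredictions: one on the first (cold) prediction and one on loop exit, which motivates the two-bit scheme below.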
Problems :
- Aliasing: the lower k bits of different branch instruction addresses could be the same.
  Solution: use tags (the buffer becomes a cache); however, this is very expensive.
- Loop branches are predicted wrong twice with a single prediction bit.
  Solution: use an n-bit saturating counter; predict taken if the counter ≥ 2^(n-1).
[Figure: 2-bit prediction state machine — states 11 and 10 predict Taken, states 01 and 00 predict Not Taken; a taken branch (T) moves the state toward 11 and a not-taken branch (NT) moves it toward 00]
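The four-state scheme can be sketched as a saturating counter. The 0-3 state encoding follows the two-bit states above; the demonstration values are illustrative:

```python
# 2-bit saturating-counter branch predictor:
#   3 = strongly taken, 2 = likely taken,
#   1 = likely not taken, 0 = strongly not taken.
class TwoBitPredictor:
    def __init__(self, state=0):
        self.state = state
    def predict(self):
        return self.state >= 2              # states 2 and 3 predict taken
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 3
        else:
            self.state = max(0, self.state - 1)   # saturate at 0

p = TwoBitPredictor(state=3)   # strongly taken (e.g. inside a loop)
p.update(False)                # a single not-taken outcome (loop exit)...
print(p.predict())             # True: still predicts taken (state is now 2)
```

This is why a loop branch is mispredicted only once per loop execution with two bits, rather than twice as with one bit.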
ST : Strongly likely to be taken SNT: Strongly likely not to be taken LT : Likely to be taken LNT: Likely not to be taken
Figure: State-machine representation of branch-prediction algorithms
It may be recorded in a look-up table, which is accessed using the low-order part of the branch instruction address.
In this case, it is possible for two branch instructions to share the same table entry. This may lead to a branch being mispredicted but it does not cause an error in execution. Misprediction only introduces a small delay in execution time. An alternative approach is to store the history bits as a tag associated with branch instructions in the instruction cache.
The prediction look-up table is a branch history table, also called a branch target buffer (BTB), which is implemented in a similar fashion to a cache. A direct-mapped, fully-associative, or set-associative cache-type table could be employed.
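A direct-mapped version of such a table can be sketched as follows. The table size, address split, and stored fields are assumptions made for illustration, not a specific processor's design:

```python
# Direct-mapped branch target buffer: the low-order address bits index
# the table; a tag check detects aliasing between branches that share
# an entry (in which case we fall back to a not-taken default).
class DirectMappedBTB:
    def __init__(self, bits=4):
        self.bits = bits
        self.table = [None] * (1 << bits)   # entries: (tag, target, taken_bit)
    def lookup(self, addr):
        index = addr & ((1 << self.bits) - 1)
        entry = self.table[index]
        if entry and entry[0] == addr >> self.bits:
            return entry[1], entry[2]       # (predicted target, taken?)
        return None                          # miss: no prediction available
    def update(self, addr, target, taken):
        index = addr & ((1 << self.bits) - 1)
        self.table[index] = (addr >> self.bits, target, taken)

btb = DirectMappedBTB()
btb.update(0x1040, 0x1000, True)
print(btb.lookup(0x1040))  # (4096, True)
```

A different branch whose address shares the same low-order bits would hit the same entry but fail the tag check, so aliasing is detected rather than silently mispredicting the target.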
Strategy 2 :
Run-time execution history based upon the last n executions of the branch is used to predict a branch
Strategy 3:
Prediction based upon direction of branch (forward or backward) or branch opcode.
1. Addressing Modes
Addressing Modes
Addressing modes should provide the means for accessing a variety of data structures simply and efficiently. Useful addressing modes include index, indirect, auto increment, and auto decrement.
Many processors provide various combinations of these modes to increase the flexibility of their instruction sets.
Complex addressing modes, such as those involving double indexing, are often encountered.
Addressing Modes
In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline.
Two important considerations in this regard are the side effects of the modes such as auto increment and auto decrement and the extent to which the complex addressing modes cause the pipeline to stall. Another important factor is whether a given mode is likely to be used by compilers.
Addressing Modes
To compare various approaches, assume a simple model for accessing operands in the memory.
The instruction Load X(R1), R2 takes five cycles to complete execution. However, the instruction Load (R1), R2 can be organized to fit a four-stage pipeline because no address computation is required; access to memory can take place in stage E. A more complex addressing mode may require several accesses to the memory to reach the named operand.
Addressing Modes
For example, the instruction
Load (X(R1)), R2
may be executed assuming that the index offset, X, is given in the instruction word.
After computing the address in cycle 3, the processor needs to access memory twice: first to read location X + [R1] in clock cycle 4, and then to read location [X + [R1]] in cycle 5. If R2 is a source operand in the next instruction, that instruction would be stalled for three cycles, which can be reduced to two cycles with operand forwarding.
Addressing Modes
To implement the same Load operation using only simple addressing modes requires several instructions. For example, on a computer that allows three operand addresses, we can use:

Add  #X, R1, R2
Load (R2), R2
Load (R2), R2

The Add instruction performs the operation R2 ← X + [R1]. The two Load instructions fetch the address and then the operand from memory. This sequence of instructions takes exactly the same number of clock cycles as the original, single Load instruction.
This example indicates that, in a pipelined processor, complex addressing modes that involve several accesses to the memory do not necessarily lead to faster execution.
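The equal-cycle claim can be checked with trivial accounting. The one-cycle-per-instruction and one-cycle-per-extra-memory-access assumptions below are simplifications made for this example, not a precise pipeline model:

```python
# Rough cycle accounting: once the pipeline is full, each instruction
# costs one cycle, plus one cycle for every additional memory access
# beyond the normal pattern.
def cycles(instructions, extra_mem_accesses):
    return instructions + extra_mem_accesses

complex_load = cycles(instructions=1, extra_mem_accesses=2)  # Load (X(R1)), R2
simple_seq = cycles(instructions=3, extra_mem_accesses=0)    # Add + Load + Load
print(complex_load == simple_seq)  # True: 3 cycles either way
```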
Addressing Modes
The main advantage of such modes is that they reduce the number of instructions needed to perform a given task and thereby reduce the program space needed in the main memory. Their main disadvantage is that their long execution times cause the pipeline to stall, thus reducing its effectiveness.
They require more complex hardware to decode and execute them. Also, they are not convenient for compilers to work with. The instruction sets of modern processors are designed to take maximum advantage of pipelined hardware.
Because complex addressing modes are not suitable for pipelined execution, they should be avoided.
Addressing Modes
The addressing modes used in modern processors often have the following features:
- Access to an operand does not require more than one access to the memory.
- Only load and store instructions access memory operands.
- The addressing modes used do not have side effects.
Three basic addressing modes that have these features are register, register indirect, and index.
(These are RISC concepts; the SPARC processor architecture is an example.) The first two require no address computation. In the index mode, the address can be computed in one cycle, whether the index value is given in the instruction or in a register. Memory is accessed in the following cycle. None of these modes has any side effects, with one possible exception.
2. Condition Codes
Condition Codes
In many processors the condition code flags are stored in the processor status register.
They are either set or cleared by many instructions, so that they can be tested by subsequent conditional branch instructions to change the flow of program execution. An optimizing compiler for a pipelined processor attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur. In doing so, the compiler must ensure that reordering does not change the outcome of a computation. The dependency introduced by the condition code flags reduces the flexibility available to the compiler to reorder instructions.
Instruction Reordering
a) A program fragment:

Add      R1, R2
Compare  R3, R4
Branch=0 ...

b) Instructions reordered:

Compare  R3, R4
Add      R1, R2
Branch=0 ...
Condition Codes
Consider the sequence of instructions, and assume that the execution of the Compare and Branch=0 instructions proceeds as in the figure.
The branch decision takes place in step E2 rather than D2 because it must await the result of the Compare instruction. The execution time of the branch instruction can be reduced by interchanging the Add and Compare instructions. This delays the Branch instruction by one cycle relative to the Compare instruction.
As a result, at the time the Branch instruction is being decoded, the result of the Compare instruction will be available and a correct branch decision will be made.
There would be no need for branch prediction.
However, interchanging the Add and Compare instructions can be done only if the Add instruction does not affect the condition codes.
Condition Codes
These observations lead to two important conclusions about the way condition codes should be handled.
First, to provide flexibility in reordering instructions, the condition code flags should be affected by as few instructions as possible. Second, the compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.
An instruction set designed with pipelining in mind usually provides the desired flexibility. The figure shows the instructions reordered assuming that the condition code flags are affected only when this is explicitly stated as part of the instruction OP code. The SPARC and ARM architectures provide this flexibility.
DATAPATH & CONTROL CONSIDERATIONS
1. There are separate instruction and data caches that use separate address and data connections to the processor.
This requires two versions of the MAR register: IMAR for accessing the instruction cache, and DMAR for accessing the data cache.
2. The PC is connected directly to the IMAR, so that the contents of the PC can be transferred to the IMAR at the same time that an independent ALU operation is taking place.
Datapath modified for pipelined execution, with interstage buffers at the input and output of the ALU
7. The output of the instruction decoder is connected to the control signal pipeline. There is a need for buffering control signals and passing them from one stage to the next along with the instruction. This pipeline holds the control signals in buffers B2 and B3.
Reading an instruction from the instruction cache
Incrementing the PC
Decoding an instruction
Reading from or writing into the data cache
Reading the contents of up to two registers from the register file
Writing into one register in the register file
Performing an ALU operation
Because these operations do not use any shared resources, they can be performed simultaneously in any combination.
In a given clock cycle, the processor can simultaneously:
Write the result of instruction I1 into the register file
Read the operands of instruction I2 from the register file
Decode instruction I3
Fetch instruction I4 and increment the PC
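The overlap described here can be sketched as a small model. The four-stage breakdown (Fetch, Decode, Execute, Write) and the mapping of operand read to the Execute stage are simplifying assumptions for illustration:

```python
STAGES = ["Fetch", "Decode", "Execute", "Write"]

def stage_occupancy(cycle, n_instructions):
    """Map each stage to the instruction occupying it in a given cycle.

    Instruction Ik (1-based) enters Fetch in cycle k, so in a given
    cycle it sits in stage index (cycle - k), if that index is valid.
    """
    occupancy = {}
    for stage_idx, stage in enumerate(STAGES):
        k = cycle - stage_idx           # which instruction is in this stage
        if 1 <= k <= n_instructions:
            occupancy[stage] = f"I{k}"
    return occupancy

print(stage_occupancy(4, 8))
# {'Fetch': 'I4', 'Decode': 'I3', 'Execute': 'I2', 'Write': 'I1'}
```

In cycle 4 the model shows exactly the combination listed above: I1 writing back, I2 reading operands, I3 decoding, and I4 being fetched.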
SUPERSCALAR OPERATION
Superscalar Operation
A more aggressive approach is to equip the processor with multiple processing units to handle several instructions in parallel in each processing stage.
With this arrangement, several instructions start execution in the same clock cycle, and the processor is said to use multiple issue. Such processors are capable of achieving an instruction execution throughput of more than one instruction per cycle.
They are known as superscalar processors. Many modern high performance processors use this approach.
For superscalar operation, the arrangement of an instruction queue is essential. To keep the instruction queue filled, a processor should be able to fetch more than one instruction at a time from the cache. Multiple issue operation requires a wider path to the cache and multiple execution units.
Separate execution units are provided for integer and floating point instructions.
Figure shows an example of a processor with two execution units, one for integer and one for floating point operations. The instruction fetch unit is capable of reading two instructions at a time and storing them in the instruction queue.
An example of instruction execution flow in the superscalar processor, assuming no hazards are encountered.
In each clock cycle, the dispatch unit retrieves and decodes up to two instructions from the front of the queue. If there are one integer instruction, one floating-point instruction, and no hazards, both instructions are dispatched in the same clock cycle. In a superscalar processor, the detrimental effect of the various hazards on performance becomes even more pronounced.
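A minimal sketch of this dispatch rule follows. The `Instr` type and `dispatch` function are illustrative names, not from any real processor; the model assumes one integer and one floating-point unit, each able to accept one instruction per cycle, with strictly in-order dispatch.

```python
from collections import deque, namedtuple

# unit is "int" or "fp", naming which execution unit the instruction needs
Instr = namedtuple("Instr", ["name", "unit"])

def dispatch(queue):
    """Dispatch up to two instructions from the head of the queue.

    Instructions are taken strictly in program order; dispatch stops as
    soon as the next instruction needs a unit already claimed this cycle.
    """
    issued = []
    free = {"int": True, "fp": True}
    while queue and len(issued) < 2 and free[queue[0].unit]:
        instr = queue.popleft()
        free[instr.unit] = False
        issued.append(instr)
    return issued

q = deque([Instr("I1", "fp"), Instr("I2", "int"), Instr("I3", "int")])
print([i.name for i in dispatch(q)])   # ['I1', 'I2']: one fp, one int together
print([i.name for i in dispatch(q)])   # ['I3']: a lone integer instruction
```

Note how two consecutive integer instructions cannot issue together: the second waits a cycle, which is why interleaving integer and floating-point instructions helps.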
The compiler can avoid many hazards through judicious selection and ordering of instructions.
For example, for the processor in the figure, the compiler should strive to interleave floating-point and integer instructions. This would enable the dispatch unit to keep both the integer and floating-point units busy most of the time. In general, high performance is achieved if the compiler is able to arrange program instructions to take maximum advantage of the available hardware units.
OUT-OF-ORDER EXECUTION
Out-of-order Execution
In the figure, instructions are dispatched in the same order as they appear in the program.
However, their execution is completed out of order. Does this lead to any problems? We have already discussed the issues arising from dependencies among instructions. For example, if instruction I2 depends on the result of I1, the execution of I2 will be delayed.
As long as such dependencies are handled correctly, there is no reason to delay the execution of an instruction. However, a new complication arises when we consider the possibility of an instruction causing an exception.
Exceptions may be caused by a bus error during an operand fetch or by an illegal operation, such as an attempt to divide by zero.
The results of I2 are written back into the register file in cycle 4. If instruction I1 causes an exception, program execution is in an inconsistent state.
The program counter points to the instruction in which the exception occurred.
However, one or more of the succeeding instructions have been executed to completion. If such a situation is permitted, the processor is said to have imprecise exceptions.
To guarantee a consistent state when exceptions occur, the results of the execution of instructions must be written into the destination locations strictly in program order. This means we must delay step W2 in the figure until cycle 6. In turn, the integer execution unit must retain the result of instruction I2, and hence it cannot accept instruction I4 until cycle 6.
If an exception occurs during an instruction all subsequent instructions that have been partially executed are discarded. This is called a precise exception.
It is easier to provide precise exceptions in the case of external interrupts. When an external interrupt is received, the Dispatch unit stops reading new instructions from the instruction queue, and the instructions remaining in the queue are discarded.
EXECUTION COMPLETION
It is desirable to use out-of-order execution, so that an execution unit is freed to execute other instructions as soon as possible.
At the same time, instructions must be completed in program order to allow precise exceptions. These seemingly conflicting requirements are readily resolved if execution is allowed to proceed as shown in the figure, but the results are written into temporary registers. The contents of these registers are later transferred to the permanent registers in the correct program order.
Execution Completion
This approach is illustrated in figure. Step TW is a write into a temporary register. Step W is the final step in which the contents of the temporary register are transferred into the appropriate permanent register. This step is often called the commitment step because the effect of the instruction cannot be reversed after that point.
If an instruction causes an exception, the results of any subsequent instruction that has been executed would still be in temporary registers and can be safely discarded.
A temporary register assumes the role of the permanent register whose data it is holding and is given the same name.
For example, if the destination register of I2 is R5, the temporary register used in step TW2 is treated as R5 during clock cycles 6 and 7.
Its contents would be forwarded to any subsequent instruction that refers to R5 during that period.
Because of this feature, this technique is called register renaming. Note that the temporary register is used only for instructions that follow I2 in program order.
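A minimal sketch of this renaming behavior follows. The register contents, the temporary register name `T0`, and the helper functions are all illustrative assumptions, not a real rename implementation:

```python
arch_regs = {"R5": 10}          # architectural (permanent) register file
temp_regs = {"T0": 99}          # temporary registers holding uncommitted results
rename_map = {"R5": "T0"}       # installed when I2 wrote its result to T0

def read_reg(name, follows_producer):
    """Read a register.

    Instructions that follow the producing instruction (I2) in program
    order see the renamed, temporary value; earlier instructions still
    see the unmodified architectural register.
    """
    if follows_producer and name in rename_map:
        return temp_regs[rename_map[name]]
    return arch_regs[name]

def commit(name):
    """Commitment step: copy the temporary value into the permanent register."""
    temp = rename_map.pop(name)
    arch_regs[name] = temp_regs.pop(temp)

print(read_reg("R5", follows_producer=True))    # 99: forwarded from T0
print(read_reg("R5", follows_producer=False))   # 10: old architectural value
commit("R5")
print(arch_regs["R5"])                          # 99 after commitment
```

The two reads show the key property: during cycles 6 and 7 both "versions" of R5 coexist, and which one an instruction sees depends on its position in program order.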
If an instruction that precedes I2 needs to read R5 in cycle 6 or 7, it would access the actual register R5, which still contains data that have not been modified by instruction I2.
When out of order execution is allowed, a special control unit is needed to guarantee in order commitment.
This is called the commitment unit.
It uses a queue called the reorder buffer to determine which instructions should be committed next.
Instructions are entered in the queue strictly in program order as they are dispatched for execution.
When an instruction reaches the head of that queue and the execution of that instruction has been completed, the corresponding results are transferred from the temporary registers to the permanent registers, and the instruction is removed from the queue. All resources that were assigned to the instruction, including the temporary registers, are released. The instruction is said to have been retired at this point.
Because an instruction is retired only when it is at the head of the queue, all instructions that were dispatched before it must also have been retired.
Hence, instructions may complete execution out of order, but they are retired in program order.
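This retirement discipline can be sketched with a few lines of code. The reorder buffer is modeled simply as a queue of instruction names, and `retire_ready` is an illustrative helper, not a real commitment-unit interface:

```python
from collections import deque

def retire_ready(rob, completed):
    """Retire instructions from the head of the reorder buffer.

    An instruction is retired only when it is at the head of the queue
    AND its execution has completed, so retirement is in program order
    even though completion may be out of order.
    """
    retired = []
    while rob and rob[0] in completed:
        retired.append(rob.popleft())
    return retired

rob = deque(["I1", "I2", "I3", "I4"])      # entered in program order
completed = {"I2", "I3"}                   # finished out of order
print(retire_ready(rob, completed))        # []: I1 blocks the head
completed.add("I1")
print(retire_ready(rob, completed))        # ['I1', 'I2', 'I3'], in order
```

Even though I2 and I3 completed first, nothing retires until I1 does; once it does, all completed instructions behind it retire in program order.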
DISPATCH OPERATION
Dispatch Operation
When dispatch decisions are made, the dispatch unit must ensure that all the resources needed for the execution of an instruction are available.
For example, since the results of an instruction may have to be written in a temporary register, the required register must be free, and it is reserved for use by that instruction as part of the dispatch operation. A location in the reorder buffer must also be available for the instruction.
When all the resources needed are assigned, including an appropriate execution unit, the instruction is dispatched.
Should instructions be dispatched out of order? For example, if instruction I2 in the figure is delayed because of a cache miss for a source operand, the integer unit will be busy in cycle 4, and I4 cannot be dispatched. Should I5 be dispatched instead? In principle this is possible, provided that a place is reserved in the reorder buffer for instruction I4 to ensure that all instructions are retired in the correct order. Dispatching instructions out of order requires considerable care.
If I5 is dispatched while I4 is still waiting for some resource, we must ensure that there is no possibility of a deadlock occurring.
A deadlock is a situation that can arise when two units, A and B, use a shared resource. Suppose that unit B cannot complete its task until unit A completes its task. At the same time, unit B has been assigned a resource that unit A needs. If this happens, neither unit can complete its task.
Unit A is waiting for the resource it needs, which is being held by unit B.
At the same time unit B is waiting for unit A to finish before it can release that resource.
If instructions are dispatched out of order, a deadlock can arise as follows. Suppose that the processor has only one temporary register, and that when I5 is dispatched, that register is reserved for it. Instruction I4 cannot be dispatched because it is waiting for the temporary register, which in turn will not become free until instruction I5 is retired.
It may also mean that more time is required to make dispatching decisions. For these reasons, most processors use only in-order dispatching. Thus, the program order of instructions is enforced at the time instructions are dispatched, and again at the time instructions are retired. Between these two events, the execution of several instructions can proceed at its own speed, subject only to any interdependencies that may exist among instructions. The UltraSPARC II is a commercially successful superscalar, highly pipelined processor that uses these techniques.
For memory, there is normally a store buffer, also called a write buffer.
The commit unit allows the store to write to memory from the buffer when the buffer holds valid data and when the store is no longer dependent on predicted branches.
Dynamic Pipeline Scheduling
Dynamic scheduling is normally combined with branch prediction, so the commit unit must be able to discard all the results in the execution unit that were due to instructions executed after a mispredicted branch. Dynamic scheduling combined with branch prediction is called speculative execution. A second reason for complexity is that dynamic scheduling is also typically combined with superscalar execution, so each unit may be issuing or committing four to six instructions each clock cycle.
Dynamic scheduling allows the processor to:
- Avoid stalls that the compiler could not schedule, often due to potential dependencies between stores and loads
- Speculatively execute instructions while waiting for hazards to be resolved
A microprocessor that follows all three trends (deep pipelines, superscalar execution, and dynamic pipelining) is the DEC Alpha 21264. This superscalar machine fetches four instructions per clock cycle but can issue up to six, and uses out-of-order execution with in-order completion.
The pipeline takes nine stages for simple integer and floating-point operations, yielding a clock rate of 600 MHz in 1997. Dynamically scheduled pipelines are used in both the PowerPC 604 and the Pentium Pro.
Generic Dynamic Pipeline organization in Intel Pentium Pro & PowerPC 604
Both processors use a 512-entry branch history table to predict branches and speculatively execute instructions after the predicted branch.
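Such a branch history table typically holds 2-bit saturating counters. The following is a minimal sketch of that predictor structure; the direct PC-modulo indexing and the initial counter state are simplifying assumptions, not details of either processor:

```python
TABLE_SIZE = 512                # 512 entries, as in the table described above
bht = [2] * TABLE_SIZE          # 2-bit counters; 2 = "weakly taken" initially

def predict(pc):
    """Predict taken when the counter for this PC is 2 or 3."""
    return bht[pc % TABLE_SIZE] >= 2

def update(pc, taken):
    """Saturating update: move the counter toward the actual outcome."""
    i = pc % TABLE_SIZE
    if taken:
        bht[i] = min(3, bht[i] + 1)
    else:
        bht[i] = max(0, bht[i] - 1)

pc = 0x40_0010
print(predict(pc))        # True: initial weakly-taken state
update(pc, taken=False)
update(pc, taken=False)
print(predict(pc))        # False: two not-taken outcomes flip the prediction
```

The 2-bit hysteresis means a single atypical outcome (e.g. a loop exit) does not immediately flip a strongly established prediction.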
The dispatcher unit sends each instruction and its operands to the reservation station of one of the six functional units. The dispatcher also places an entry for the instruction in the reorder buffer of the commit unit. Thus, an instruction cannot issue unless there is space available in both an appropriate reservation station and the reorder buffer.
Superscalar Processors
Block Diagram
Pipeline Stages

Stage   Athlon                   Hammer
1       Fetch                    Fetch1
2       Scan                     Fetch2
3       Align1                   Pick
4       Align2                   Decode1
5       Decode1 (EDEC)           Decode2
6       Decode2 (IDEC/Rename)    Pack
7       Schedule                 Pack/Decode
8       AGU/ALU                  Dispatch
9       L1 Address Generation    Schedule
10      Data Cache               AGU/ALU
11      -                        Data Cache 1
12      -                        Data Cache 2
Block Diagram
CISC Characteristics
The essential goal of a CISC architecture is to attempt to provide a single machine instruction for each statement that is written in a high level language.
Major characteristics of CISC architecture are:
1. A large number of instructions - typically from 100 to 250
2. Some instructions that perform specialized tasks and are used infrequently
3. A large variety of addressing modes - typically from 5 to 20 different modes
4. Variable-length instruction formats
5. Instructions that manipulate operands in memory
RISC Characteristics
The concept of RISC architecture involves an attempt to reduce execution time by simplifying the instruction set of the computer. Examples: Berkeley RISC I, MIPS.
The major characteristics of a RISC processor are:
1. Relatively few instructions
2. Relatively few addressing modes
3. Memory access limited to load and store instructions
4. All operations done within the registers of the CPU
5. Fixed-length, easily decoded instruction format
6. Single-cycle instruction execution
7. Hardwired rather than microprogrammed control
8. A relatively large number of registers in the processor unit
9. Use of overlapped register windows to speed up procedure call and return
10. Efficient instruction pipeline
11. Compiler support for efficient translation of high-level language programs into machine language programs
Window size = L + 2C + G
The total number of registers needed in the processor is
Register file = (L + C) W + G
For example, if G = 10, L = 10, C = 6, and W = 4, then the window size is 10 + 12 + 10 = 32 registers, and the register file consists of (10 + 6) x 4 + 10 = 74 registers.
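These formulas can be checked directly in code. Here G is taken to be the number of global registers, L the local registers per window, C the registers common to adjacent windows, and W the number of windows (the usual reading of these symbols, assumed from context):

```python
def window_size(G, L, C):
    """Registers visible in one window: window size = L + 2C + G."""
    return L + 2 * C + G

def register_file(G, L, C, W):
    """Total registers in the processor: register file = (L + C)W + G."""
    return (L + C) * W + G

print(window_size(G=10, L=10, C=6))          # 32 registers per window
print(register_file(G=10, L=10, C=6, W=4))   # 74 registers in total
```

Note why the totals differ: each window counts its two overlap groups of C registers and the G globals, but across the whole file each overlap group and the globals are stored only once.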
Berkeley RISC I