
CHAPTER 1

Introduction:
What is Pipelining? Definition: In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).

The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe).

Most modern CPUs are driven by a clock. The CPU consists internally of logic and registers (flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into four stages with a set of flip-flops between each stage:
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access & register write-back

When a programmer (or compiler) writes assembly code, they make the assumption that each instruction is executed before execution of the subsequent instruction begins. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards, such as forwarding and stalling, exist.

A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely
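To make the "insert flip-flops between pieces of logic" idea concrete, here is a minimal Verilog sketch (our illustration, with invented names, not part of this project's design): a register placed mid-path splits one long combinational delay into two shorter ones, allowing a higher clock frequency at the cost of one cycle of latency.

module split_datapath (
    input  wire       clk,
    input  wire [7:0] in_a, in_b, in_c,
    output reg  [7:0] result
);
    reg [7:0] stage1;                // flip-flops inserted mid-path
    always @(posedge clk) begin
        stage1 <= in_a + in_b;       // first half of the logic
        result <= stage1 & in_c;     // second half, one cycle later
    end
endmodule

The critical path is now max(adder delay, AND delay) instead of their sum, which is exactly why deeper pipelines permit shorter clock periods.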

cancel out idle time in a CPU, but by making those modules work in parallel it improves program execution significantly.

Processors with pipelining are organized internally into stages which can work semi-independently on separate jobs. Each stage is organized and linked into a 'chain', so each stage's output is fed to another stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.

A deeper pipeline means that there are more stages in the pipeline, and therefore fewer logic gates in each stage. This generally means that the processor's frequency can be increased, as the cycle time is lowered: there are fewer components in each stage of the pipeline, so the propagation delay of the overall stage is decreased.

Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 4 stages. To operate at full performance, this pipeline will need to run 3 subsequent independent instructions while the first is completing. If 3 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert a stall, or wasted clock cycle, into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality most code does not allow for ideal execution.

Pipelining incorporates the concept of time-overlapped handling of several identical tasks by several non-identical stages. Each stage is made to handle a distinct section / sub-task of each of the tasks. Ideally, at any point of time each of the stages is busy processing its own sub-part belonging to a different task; hence if there are N stages then ideally N tasks are being processed concurrently, i.e. the Nth task has started without any of the earlier N-1 tasks being complete. Each of the stages can go on independently of all the other stages provided it has some job to do / some input to handle. Each of the stages, except the very first, gets its input from the previous stage and feeds the next stage [except the very last one].

Project Objective: The main objective of this project is to design a new pipelined RISC processor and develop code for it in Verilog to verify it through simulation. We designed a pipelined RISC architecture processor from the ground up to implement some simple functions like AND, OR, MOVE, STORE, ADD and SUBTRACT, simulated it using the code we developed, and found it satisfactory.

Usage of Pipelining:
1. Used in many everyday applications without our notice, e.g.:
a) Concrete casting involving a number of people passing on the concrete mix among different levels.
b) Fire-fighting (a bucket brigade).
2. It has proved to be a very popular and successful way to exploit Instruction Level Parallelism [explained in the next section]. Instruction pipes are used in almost all modern processors.

Consider a sufficiently large number of identical tasks [dumping concrete mix, throwing a bucket of water, executing instructions in a computer]. Break up each task into several smaller sub-tasks. Design and employ one sub-unit for carrying out each of these sub-tasks. Each sub-unit takes input from its previous stage / unit and delivers output to its next stage / unit. Keep each of these sub-units busy ALL the time, i.e. operate them in a time-overlapped fashion. If there are N sub-units and the slowest among them takes K units of time, then, once full, our assembly line will complete a task every K units of time.

Classic example: consider the way in which any typical undergraduate engineering college works:
1. It offers a 4-year curriculum.
2. It has facilities [sub-units] to train / teach students of a particular year.
3. Starting from a particular year onwards, it admits M students every year. After the first 4 years, the number of students graduating per year = M in each of the subsequent years, assuming no failures / an ideal scenario. [Figure: pipelining student admissions]

Salient features of this pipeline:
1. Fixed number of stages: 4.
2. Identical stages: each training stage is of one year's duration, and each handles the same set of students as was admitted in the first year.
3. No stage is starved of inputs: each gets an adequate number of students.
4. Synchronized stages: through the common exam schedule.

Time-Overlapped Processing [Temporal Parallelism: Another Real-Life Example]
A. Task: WASH, DRY & IRON ten (10) dirty clothes.
B. Units available [capacity and time taken]:
- WASHER [can wash 5 clothes in one go] takes 40 minutes. [Hence total time required to WASH 10 dirty clothes = 80 minutes.]
- DRIER [can dry 8 clothes at a time] takes 20 minutes. [Therefore total time required to DRY 10 clothes = 40 minutes.]
- IRON [manual ironing of 1 cloth at a time] takes 4 minutes. [Total time required to IRON 10 clothes = 40 minutes.]
TOTAL time needed if operated in strict, time-non-overlapped sequence = 160 minutes.

C. Time-overlapped operation sequence:
1. Put 5 clothes in the WASHER [DRIER and IRON idle].
2a. After 40 minutes [WASHER finishes washing the 1st lot], put the 5 washed clothes in the DRIER to dry.
2b. Load the WASHER with the remaining 5 clothes, so that for the subsequent period both WASHER and DRIER work in a time-overlapped fashion. The IRON is still idle.
3a. After 20 more minutes [total 60 minutes] the DRIER finishes, and one can take the clothes for IRONING (provided there is space to keep those clothes). Meanwhile the WASHER is still washing; the DRIER is idle.
3b. After 20 more minutes [total 80 minutes] IRONING of the first 5 clothes is finished, while the WASHER has also finished washing ALL 10 clothes. The DRIER remains idle.
4. Engage the DRIER to dry the remaining 5 clothes; this takes 20 more minutes [total time taken = 100 minutes]. The IRONING activity is idle due to lack of available clothes. The WASHER could be kept busy if there were more clothes.
5. IRONING these clothes takes 20 more minutes [total time taken = 120 minutes].

Key observations:
1. Net time saved due to time-overlapped processing = 160 - 120 = 40 minutes.
2. Slowest stage in the pipeline = IRONING.
3. After all 3 stages (WASHER, DRIER, IRON) have been made busy (after step 3a), one cloth becomes ready every 4 minutes.

Time-Overlapped Usage of Different Processing Stations in an Assembly Line (General Observations):
1. Motivation: to decrease the processing time of a number of identical jobs. The trick is to sub-divide the entire processing of a single job into a number of sub-tasks.
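The arithmetic behind these observations, restated compactly from the numbers above:

$$T_{\text{sequential}} = \underbrace{2 \times 40}_{\text{wash}} + \underbrace{2 \times 20}_{\text{dry}} + \underbrace{10 \times 4}_{\text{iron}} = 160\ \text{min}, \qquad T_{\text{overlapped}} = 120\ \text{min},$$

so overlapping the three stations saves $160 - 120 = 40$ minutes on this 10-cloth batch.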

Each sub-task is to be handled by a separate processing station / stage.
4. Each of the processing stages should have some input to work on, in order to keep that unit busy as often as possible.
5. Each of the processing stages except the very last one generates output to be consumed by the next processing stage only.
6. The processing stages may not all take the same time and also need not be synchronized. Hence there will have to be some intermediate store / buffer to temporarily hold the inputs to any particular processing station.
7. Since each processing stage depends on its predecessor stage only, and feeds its successor stage only, one cannot reduce the processing time for any particular task / job below the slowest processing stage's processing time.
8. Each task normally passes through each of the processing stages regardless of its requirements.
9. Hence the time taken to process a single task may increase compared to the case where the given task is processed based on its specific requirements, since a task may have to go through some unnecessary stages.
10. System throughput, i.e. the number of tasks completed over a specific period of time, will increase because of the time-overlapped operation of the various processing stages.
11. However, if during the course of processing any of the processing stages fails / stalls, then the entire assembly line will either crash or get stalled.

Using a Pipeline Inside a Computer (Salient Queries):
1. How is this assembly-line concept applicable to instruction processing in a typical computer?
Ans. a) The CPU of any computer essentially fetches, decodes and then executes instructions belonging to a program. b) The processing of each instruction is composed of an almost identical set of stages / machine cycles. Hence one can view the CPU as an assembly line for instruction processing.
2. Is the improved throughput, i.e. the number of tasks completed over a period of time, dependent on / proportional to the number of processing / pipeline stages? To be answered later in the context of instruction processing in a computer.

CHAPTER 2
The Typical Instruction Handling Sequence in a CPU

Typical Instruction Processing Stages Inside a CPU:
1. Fetch the instruction op-code [CISC] / the entire instruction [RISC] from the instruction cache / memory into the Instruction Register, using the Instruction Pointer / PC appended to the Code Segment Register, and update the Instruction Pointer / PC to point to the next instruction. [IF]
2. Decode the instruction op-code inside the CPU and select some register operands [RISC] (in this case the Instruction Pointer / PC can be used to fetch the next instruction), or decide on future operand-address reads as well as the next instruction location, as in CISC. Update the PC accordingly. [ID]
3. Read operand addresses into the Instruction Register from the I-cache using the Instruction Memory Address Register [CISC only]. [ROA] May have to be carried out a number of times, once for each of the operand addresses. (Optional; not required for RISC.)
4. Execute the instruction's processing op-code / calculate the linear operand address offset using the ALU. [EX] In the former case (processing) the operation may vary in time depending on the type of operation being carried out.
5. Read operand values from the data cache / memory using the computed linear offset obtained in the previous step, appended to the appropriate segment register (DATA / STACK / EXTRA). [MEM]
N.B.: For CISC, steps 4 & 5 above may need to be executed a number of times, once for reading each of the operand addresses and at least once for performing the computation. This computation time need not be fixed.
6. Write back the result [into the designated destination]. [WB] In case of memory being the destination, the processor needs to compute the linear address offset using step 4.

7. Interrupt handling: here the main issues are two-fold, namely preserving the current context on the system stack, followed by computing / locating the target and loading it into the Instruction Pointer.

One can time-overlap these operations provided:
A. There is no resource conflict among the various stages [no structural hazards].
B. Each instruction, once in the pipeline, in no way affects the execution pattern of any of its successor instructions in the pipeline [there exists no inter-instruction dependency in the form of either data hazards or control hazards].

A representative RISC processor [MIPS / DLX], salient features:
1. 32-bit processor, i.e. it can handle 32-bit operands in one go.
2. Fixed instruction length (32 bits); hence an instruction can be fetched in one machine cycle.
3. Load-store architecture, i.e. all source operands need to be brought into some CPU register before processing, and all results are computed into some CPU register before being stored to a memory location.
4. Restricted addressing modes [register direct, indexed, relative, implied].
5. Large GPR file set.

MIPS, a RISC processor, uses the following 5-stage pipeline:
1. IF: instruction fetch from instruction memory.
2. ID: decode and select CPU register operands.
3. EX: ALU operation or memory-data-operand linear address generation.
4. MEM: data memory reference to read operand values.
5. WB: write back into the CPU register file.

Resource needs of the 5 MIPS pipeline stages:
1. IF stage: needs access to the program memory to fetch the whole instruction, and a dedicated adder to update the PC.
2. ID stage: needs access to the register file.
3. EX stage: needs an ALU and associated registers.
4. MEM stage: needs access to the data memory.
5. WB stage: needs access to the register file for writing the result.

Pipeline registers: pipeline registers are an essential part of pipelines, serving as inter-stage buffers / latches. There are N-1 groups of pipeline registers in an N-stage pipeline, one group lying between each pair of successive pipeline stages. Each stage, after completing its processing, saves ALL the relevant outputs it generated into the intermediate register lying at its output. In the MIPS pipeline these registers are:

1. IF/ID (the Instruction Register writes the fetched instruction into this; condition code flags are also written into it).
2. ID/EX (the ALU operand registers are written from it; hence this stores the contents of ALL the input register operands).
3. EX/MEM (the ALU result register + flags write into it).
4. MEM/WB (the memory data / buffer register writes into it).
This way, each time something is computed (an effective address, an immediate value, a register content, etc.) it is saved and can be made available in the context of the instruction that needs it. [Figure: pipeline register depiction]
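As a hedged illustration (the module and signal names below are ours, not taken from MIPS documentation), an IF/ID pipeline register in Verilog simply latches everything the IF stage produced so the ID stage can consume it one cycle later:

module if_id_reg (
    input  wire        clk,
    input  wire        rst,
    input  wire [31:0] instr_in,   // instruction fetched by the IF stage
    input  wire [31:0] pc_in,      // PC value associated with that instruction
    output reg  [31:0] instr_out,  // seen by the ID stage one cycle later
    output reg  [31:0] pc_out
);
    always @(posedge clk) begin
        if (rst) begin
            instr_out <= 32'b0;    // clear on reset
            pc_out    <= 32'b0;
        end else begin
            instr_out <= instr_in;
            pc_out    <= pc_in;
        end
    end
endmodule

The ID/EX, EX/MEM and MEM/WB registers follow the same pattern, differing only in which fields (operands, ALU result, memory data) they carry forward.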

Historically, there are two different types of pipelines:
1. Instruction pipelines.
2. Data / arithmetic pipelines [the SIMD case].
Arithmetic pipelines (e.g. floating-point processing) are mostly found within special-purpose processors / co-processors, since they are employed only occasionally and such data pipelines need a continuous stream of arithmetic operations, e.g. vector processors operating on an array. On the other hand, instruction pipelines are used in almost every modern processor to increase instruction execution throughput, and are assumed as the default in what follows.

Instruction Level Parallelism [ILP]: a measure of how many of the instructions in a computer program can be executed simultaneously [in a time-overlapped fashion] without violating the various inter-instruction dependencies that may exist. Consider the following program:
I#1. e = a + b
I#2. f = c + d
I#3. g = e * f
Instruction I#3 depends on the results of instruction I#1 as well as of instruction I#2 [a true (data) [RAW] dependency]. However, instructions I#1 and I#2 do not depend on any other instruction, so they can be executed simultaneously. If we assume that each instruction can be completed in one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.

Goal & motivation for achieving speed-up: ordinary programs are typically written under a sequential execution model, where instructions execute one after the other and in the order specified by the programmer. ILP allows the compiler and the processor to overlap the execution of multiple instructions, or even to change the order in which instructions are executed. A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible in a specified sequential code. How much ILP exists in programs is very application-specific. In certain fields, such as graphics [manipulation of individual pixels in a group] and scientific computing [matrix multiplication], the amount can be very large. However, workloads such as cryptography exhibit much less parallelism, because of the inherent RAW data dependency among the constituent operations.

Micro-architectural techniques used to exploit ILP include:
- Instruction pipelining, where the execution of multiple instructions can be partially overlapped.
- Superscalar execution, in which multiple execution units are used to execute multiple instructions in parallel. In typical superscalar processors, the instructions executing simultaneously are adjacent in the original program order.
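In symbols, the ILP figure quoted above is the instruction count divided by the minimum number of time steps the dependencies allow:

$$\text{ILP} = \frac{n_{\text{instructions}}}{n_{\text{time steps}}} = \frac{3}{2} \quad \text{in the example above.}$$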

- Out-of-order execution, where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar execution.
- Register renaming, a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations, used to enable out-of-order execution.
- Speculative execution, which allows the execution of complete instructions or parts of instructions before it is certain whether this execution should take place. A commonly used form of speculative execution is control-flow speculation, where instructions past a control-flow instruction (e.g., a branch) are executed before the target of the control-flow instruction is determined [branch prediction, used to avoid stalling while control dependencies are resolved]. Several other forms of speculative execution have been proposed and are in use, including speculative execution driven by value prediction, memory dependence prediction and cache latency prediction.

Factors affecting ILP implementation are the inter-instruction dependencies: data dependency and control dependency.

Various types of data dependencies: a data dependency, in computer science, is a situation in which a program statement (instruction) refers to the data / operand of a preceding statement / instruction in some way or other. In compiler theory, the technique used to discover data dependencies among statements (or instructions) is called dependence analysis.

Data dependency definition: consider two statements S1 and S2 in a computer program, where S1 precedes S2. The statement S2 is said to be data dependent on the statement S1 if any one of the following 3 cases exists.

Bernstein conditions: assuming statements S1 and S2, S2 depends on S1 if

$$[I(S_1) \cap O(S_2)] \,\cup\, [O(S_1) \cap I(S_2)] \,\cup\, [O(S_1) \cap O(S_2)] \;\neq\; \varnothing$$

where $I(S_i)$ is the set of memory locations read by $S_i$, $O(S_j)$ is the set of memory locations written by $S_j$, and there is a feasible run-time execution path from S1 to S2. These conditions are called the Bernstein conditions, named after A. J. Bernstein.

Cases of data dependency:
- True (data) dependence: $O(S_1) \cap I(S_2) \neq \varnothing$. Statement S1 precedes statement S2, and S1 writes into some place (memory / register) that will be READ by the successor statement S2. [Read After Write (RAW)]
- Anti (name) dependence: $I(S_1) \cap O(S_2) \neq \varnothing$, the mirror relationship of true dependence: here the predecessor instruction S1 reads from some memory location or register which is later modified / written by the successor instruction S2. [Write After Read (WAR)]
- Output dependence: $O(S_1) \cap O(S_2) \neq \varnothing$, S1 -> S2, and both instructions S1 & S2 write to the same memory location or register. [Write After Write (WAW)]

True Data [RAW] Dependency: statement S1 precedes statement S2, and S1 writes into some place (memory / register) that will be READ by the successor statement S2. [Read After Write (RAW)]
Example: a true dependency, also known as a data dependency, occurs when an instruction depends on the result of a previous instruction:
I#1. A = 3
I#2. B = A
I#3. C = B
Here instruction I#3 is truly dependent on instruction I#2, as the final value of C depends on the instruction updating B. Instruction I#2 is truly dependent on instruction I#1, as the final value of B depends on the instruction updating A. Since instruction I#3 is truly dependent upon instruction I#2, and instruction I#2 is truly dependent on instruction I#1, instruction I#3 is also truly dependent on instruction I#1. Instruction-level parallelism is therefore not an option in this example.

Anti (Name) [WAR] Dependency: an anti-dependency occurs when an instruction requires a value that is later updated. In the following example, instruction I#3 anti-depends on instruction I#2: the ordering of these instructions cannot be changed, nor can they be executed in parallel (possibly changing the instruction ordering), as this would affect the final value of A.
I#1. B = 3
I#2. A = B + 1
I#3. B = 7
An anti-dependency is an example of a name dependency. That is, renaming of variables could remove the dependency, as shown below:
I#1. B = 3
I#N. B2 = B
I#2. A = B2 + 1
I#3. B = 7
Here a new variable, B2, has been declared as a copy of B in a new instruction, instruction I#N. The anti-dependency between instruction I#2 and instruction I#3 has been removed, meaning that these instructions may now be executed in parallel. However, the modification has introduced new RAW dependencies: instruction I#2 is now truly dependent on instruction I#N, which is in turn truly dependent upon instruction I#1. Being true dependencies, these new dependencies are impossible to safely remove.

Output [WAW] Dependency: an output dependency occurs when the ordering of instructions affects the final output value of a variable. In the example below, there is an output dependency between instructions I#3 and


I#1: changing the ordering of instructions in this example would change the final value of A, so these instructions cannot be executed in parallel.
I#1. A = 2 * X
I#2. B = A / 3
I#3. A = 9 * Y
As with anti-dependencies, output dependencies are name dependencies; that is, they may be removed through renaming of variables, as in the modification below:
I#1. A2 = 2 * X
I#2. B = A2 / 3
I#3. A = 9 * Y
Here a new variable, A2, has been declared as a copy of A in instruction I#1. The output dependency between instruction I#1 and instruction I#3 has been removed, meaning that these instructions may now be executed in parallel. However, the modification has introduced a new RAW dependency: instruction I#2 is now truly dependent on instruction I#1, and this new RAW dependency cannot be safely removed.

Control Dependency: an instruction S2 is control dependent on a preceding instruction S1 if S1 determines whether S2 should execute or not. In the following example, instruction I#2 is control dependent on instruction I#1:
I#1. if a == b goto AFTER
I#2. A = 2 * X
I#3. AFTER:
Intuitively, there is a control dependence between two statements S1 and S2 if S1 could possibly be executed before S2, and the outcome of S1's execution determines whether S2 will be executed. A typical example is the control dependence between a conditional branch statement's condition part and the statements in the corresponding true / false bodies.
Definition: a statement S2 is said to be control dependent on another statement S1 iff:
a) there exists a path P from S1 to S2 such that every statement Si (Si != S1) within P will be followed by S2 in each possible path to the end of the program, and
b) S1 will not necessarily be followed by S2, i.e. there is an execution path from S1 to the end of the program that does not go through S2.
Expressed with the help of (post-)dominance, the two conditions are equivalent to:
a) S2 post-dominates all Si;
b) S2 does not post-dominate S1.

Implications of dependencies on ILP: conventional programs are written assuming the sequential execution model. Under this model, instructions execute one after the other, atomically (i.e., at any given point of time only one instruction is executed) and in the order specified by the program.


However, dependencies among statements or instructions may hinder the parallel execution of multiple instructions, whether by a parallelizing compiler or by a processor exploiting instruction-level parallelism [ILP]. Recklessly executing multiple instructions without considering the related dependencies risks producing wrong results; such situations are called hazards.

Non-pipelined floating-point processing: [figure]

Pipelined floating-point processing: [figure]


Basic Pipelining Terminologies
Pipeline cycle: the time required to move an instruction one step further in the pipeline. Not to be confused with the clock cycle. It is determined by the time required by the slowest stage.
Pipeline designers try to balance the length (i.e. the processing time) of each pipeline stage. For a perfectly balanced N-stage pipeline, the execution time per instruction is t/N, where t is the execution time per instruction on the non-pipelined machine and N is the number of pipeline stages. However, it is very difficult to make the different pipeline stages perfectly balanced, so different pipeline stages may have different processing times. Besides, pipelining itself involves some overhead, arising from the registers / latches used between two successive pipeline stages.
Some Important Pipeline Issues: Timing Factors in a Typical Pipeline

Pipeline cycle: if each stage m takes time $\tau_m$ and the inter-stage latch / register delay is $d$, then the pipeline cycle is
$$\tau = \max_m\{\tau_m\} + d.$$
Pipeline frequency: $f = 1/\tau$.
Ideal pipeline speedup: a k-stage pipeline processes n tasks in $k + (n-1)$ clock cycles: k cycles for the first task and n-1 cycles for the remaining n-1 tasks. Total time to process n tasks: $T_k = [\,k + (n-1)\,]\,\tau$. For the non-pipelined processor, $T_1 = n\,k\,\tau$ [n tasks pass through k stages, each having delay $\tau$].
Pipeline speedup expression:
$$S_k = \frac{T_1}{T_k} = \frac{n\,k}{k + (n-1)}.$$
Observe that the memory bandwidth must increase by a factor of $S_k$; otherwise, the processor would stall waiting for data to arrive from memory.
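A corollary worth making explicit: for a fixed stage count k, the speedup tends to k as the number of tasks grows, which is why long runs of independent instructions approach the ideal k-fold gain:

$$\lim_{n \to \infty} S_k \;=\; \lim_{n \to \infty} \frac{n\,k}{k + (n-1)} \;=\; k.$$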


Exercise 1: consider an unpipelined processor that takes 4 cycles for ALU and other operations and 5 cycles for memory operations. Assume the relative frequencies: ALU and other = 60%, memory operations = 40%; cycle time = 1 ns. Compute the speedup due to pipelining, ignoring the effects of branching and assuming a pipeline overhead of 0.2 ns.
Solution: average instruction execution time for a large number of instructions, unpipelined = 1 ns x (60% x 4 + 40% x 5) = 4.4 ns. Pipelined, one instruction completes every cycle, and each cycle is stretched by the overhead: 1 ns + 0.2 ns = 1.2 ns. Speedup = 4.4 / 1.2 = 3.7 times.

Pipeline types:
Synchronous pipeline: either the pipeline cycle is constant, or the pipeline cycle through any pipeline stage is an integer multiple of a clock period known a priori to each of the pipeline stages, so each stage knows when its input will be available. N.B.: assumed as the default. Transfers between stages are simultaneous, and one task or operation enters the pipeline per cycle.
Asynchronous pipeline: the time for moving from stage to stage varies. Individual stages need not be aware of the timing of any other stage; there is handshaking communication between stages. A stage may have to WAIT for input availability, thereby requiring interlocking of stages.
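Written out with the definitions from the previous section (a restatement of the exercise's numbers, isolating the overhead term):

$$T_1 = 1\,\text{ns} \times (0.6 \times 4 + 0.4 \times 5) = 4.4\,\text{ns}, \qquad \tau = 1\,\text{ns} + 0.2\,\text{ns} = 1.2\,\text{ns}, \qquad S = \frac{4.4}{1.2} \approx 3.7.$$

With zero latch overhead ($d = 0$) the same pipeline would give $4.4/1.0 = 4.4$; about 17% of the ideal gain is lost to the 0.2 ns of register overhead alone.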


Number of pipeline stages vs. performance: [figure]

Various Pipelined Processing Stages 1: the 8086. [figure]

In the 8086, the Bus Interface Unit and the Execution Unit work independently, enabling two-stage pipelined processing: instruction fetch overlaps with execution. It is only 2-stage pipelining, with F = fetch the instruction and decode it, and E = execute the instruction and write into memory; while one instruction is in E, the next is already in F.



Various Pipelined Processing Stages 2:

Pipelined CPU Memory Interface:


Pipelined CPU GPR Interface:

Speedup Factors with Instruction Pipelining


Performance Evaluation Method: Amdahl's Law
Amdahl's Law quantifies the overall performance gain due to an improvement in a part of a computation: the performance improvement gained from using some faster mode of execution is limited by the amount of time the enhancement is actually used.
Amdahl's Law: Speedup = (execution time for the task without the enhancement) / (execution time for the task using the enhancement).
Speedup tells us how much faster a machine will run due to an enhancement. For using Amdahl's Law, two things should be considered:
1. The fraction of the computation time in the original machine that can use the enhancement: if a program executes in 30 seconds and 15 seconds of the execution can use the enhancement, Fraction(enhanced) = 15/30 = 0.5. This value is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode, i.e. how much faster the task would run if the enhanced mode were used for the entire program: if the enhanced task takes 3.5 seconds where the original took 7 seconds, we say the speedup of the enhanced mode is 2.

CISC processors are not suitable for pipelining because of:
- Variable instruction format.
- Variable execution time.
- Complex addressing modes.
RISC processors are suitable for pipelining because of:
- Fixed instruction format.
- Fixed execution time.
- Limited addressing modes.

Advantages and disadvantages: pipelining does not help in all cases, and there are several possible disadvantages. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.
Advantages of pipelining:
1. An n-stage pipeline can improve performance up to n times.
2. Not much investment in hardware: no replication of hardware resources is necessary; the principle deployed is to keep the units as busy as possible.
3. Transparent to the programmers: easy to use.
4. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.
5. Some combinational circuits such as adders or multipliers can be made faster by adding more circuitry; if pipelining is used instead, it can save circuitry compared to a more complex combinational circuit.
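Putting the two example figures above into the standard form of Amdahl's formula (a worked illustration using the report's own numbers):

$$S_{\text{overall}} = \frac{1}{(1 - F_{\text{enh}}) + \dfrac{F_{\text{enh}}}{S_{\text{enh}}}} = \frac{1}{(1 - 0.5) + \dfrac{0.5}{2}} = \frac{1}{0.75} \approx 1.33,$$

i.e. even though the enhanced mode is twice as fast, it applies to only half the run, so the whole task speeds up by only about 33%.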


Pipelines: a few key observations. A pipeline increases instruction throughput, but it does not decrease the execution time of an individual instruction; in fact, it slightly increases the execution time of each instruction due to pipeline overheads, since each instruction passes through identical pipeline stages.
Disadvantages of pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is because extra flip-flops must be added to the data path of a pipelined processor.
3. A non-pipelined processor has a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
Pipeline overheads:
- Pipeline register delay: caused by the set-up time of the registers.
- Clock skew: the maximum delay between clock arrival at any two registers.
Once the clock cycle is as small as the pipeline overhead, no further pipelining is useful; very deep pipelines may therefore not be useful.
EXAMPLES. Four stages of an instruction:
- Instruction Fetch (F): fetch the instruction from the instruction memory.
- Operand Fetch and Instruction Decode (D): fetch the operand data from memory or registers and decode the instruction.
- Execute (E): calculate the memory address and/or execute the function.
- Memory & Write-back (M): read the data from the data memory and write back to the register.
Instructions waiting to execute: D, C, B, A. Their flow through the pipeline is shown below.

Cycle | FETCH | DECODE | EXECUTE | MEMORY | Waiting    | Completed
0     | -     | -      | -       | -      | A, B, C, D | -
1     | A     | -      | -       | -      | B, C, D    | -
2     | B     | A      | -       | -      | C, D       | -
3     | C     | B      | A       | -      | D          | -
4     | D     | C      | B       | A      | -          | -
5     | -     | D      | C       | B      | -          | A
6     | -     | -      | D       | C      | -          | A, B
7     | -     | -      | -       | D      | -          | A, B, C
8     | -     | -      | -       | -      | -          | A, B, C, D
This is a 4-stage pipeline; the letters represent instructions independent of each other. The "Waiting" column is the list of instructions waiting to be executed, the "Completed" column is the list of instructions that have finished, and the four stage columns form the pipeline itself. Execution proceeds as follows:

Time 0: four instructions are awaiting execution.
Time 1: the A instruction is fetched from memory.
Time 2: the A instruction is decoded; the B instruction is fetched from memory.
Time 3: the A instruction is executed (the actual operation is performed); the B instruction is decoded; the C instruction is fetched.
Time 4: the A instruction's results are written back to the register file or memory; the B instruction is executed; the C instruction is decoded; the D instruction is fetched.
Time 5: the A instruction is completed; the B instruction is written back; the C instruction is executed; the D instruction is decoded.
Time 6: the B instruction is completed; the C instruction is written back; the D instruction is executed.
Time 7: the C instruction is completed; the D instruction is written back.
Time 8: the D instruction is completed. All instructions have been executed.

Bubble:

When a "hiccup" occurs (see the explanation below), the flow looks like this, with OO marking the bubble:

Cycle | FETCH | DECODE | EXECUTE | MEMORY | Waiting    | Completed
0     | -     | -      | -       | -      | A, B, C, D | -
1     | A     | -      | -       | -      | B, C, D    | -
2     | B*    | A      | -       | -      | C, D       | -
3     | B     | OO     | A       | -      | C, D       | -
4     | C     | B      | OO      | A      | D          | -
5     | D     | C      | B       | OO     | -          | A
6     | -     | D      | C       | B      | -          | A, B
7     | -     | -      | D       | C      | -          | A, B, C
8     | -     | -      | -       | D      | -          | A, B, C
9     | -     | -      | -       | -      | -          | A, B, C, D

(* the fetch of B is delayed)

Bubble (computing): when a "hiccup" in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2 above, the fetching of the B instruction is delayed, and the decoding stage in cycle 3 therefore contains a bubble. Everything "behind" the B instruction is delayed as well, but everything "ahead" of the B instruction continues with execution. Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7. Bubbles are like stalls, in which nothing useful happens in the fetch, decode, execute and write-back stages. They can be implemented with a NOP (no operation) op-code.

Example 2: pipelined execution of six instructions. [Figure: timing diagram with stage times F = 2, D = 3, E = 4 and M = 3 units; the shaded regions mark the time an instruction spends waiting for a processing unit.]

The shaded region is the time an instruction waits for a processing unit (F, D, E or M). Total time taken for six instructions = 2 + 3 + 4 + 3 + (6-1) x 4 = 32 units ... (b) (this can be derived easily from the figure above). In general, if there are N instructions, then the total time required = (total time required for a single instruction) + (N-1) x (time of the slowest stage). Similarly, in an 8086-style two-stage split with F+D (2+3 = 5 units) and E+M (4+3 = 7 units) as the two stages, total time = 12 + (6-1) x 7 = 47 units ... (c).
Throughput: sequential processing = 72/6 = 12 units per instruction; 8086 (2-stage) = 47/6, about 8 units per instruction; 4-stage pipelined processing = 32/6, about 5.3 units per instruction. From the above we can conclude that with the 4-stage pipelined architecture the average time per instruction is reduced, so the number of instructions processed by the processor in a given time increases.

Example 3: a typical instruction to add two numbers might be ADD A, B, C, which adds the values found in memory locations A and B, and then puts the result in memory location C. In a pipelined processor the pipeline controller would break this into a series of tasks similar to:
LOAD R1, A
LOAD R2, B
ADD R3, R1, R2
STORE C, R3
LOAD next instruction


The locations 'R1', 'R2' and 'R3' are registers in the CPU. The values stored in the memory locations labelled 'A' and 'B' are loaded (copied) into the R1 and R2 registers, then added, and the result (which is in register R3) is stored in the memory location labelled 'C'. In this example the pipeline is three stages long: load, execute, and store. Each of the steps is called a pipeline stage.
On a non-pipelined processor, only one stage can be working at a time, so the entire instruction has to complete before the next instruction can begin. On a pipelined processor, all of the stages can be working at once on different instructions: when one instruction is at the execute stage, a second instruction is at the decode stage and a third instruction is at the fetch stage.
Pipelining doesn't reduce the time it takes to complete an instruction; it increases the number of instructions that can be processed at once and reduces the delay between completed instructions. The more pipeline stages a processor has, the more instructions it can be working on at once and the less of a delay there is between completed instructions. Every microprocessor manufactured today uses at least 2 stages of pipeline (the Atmel AVR and the PIC microcontroller each have a 2-stage pipeline); Intel Pentium 4 processors have 20-stage pipelines.

Example 4: to better visualize the concept, we can look at a theoretical 3-stage pipeline:

Stage   | Description
Load    | Read instruction from memory
Execute | Execute instruction
Store   | Store result in memory and/or registers

and a pseudo-code assembly listing to be executed:
LOAD A, #40    ; load 40 in A
MOVE B, A      ; copy A in B
ADD B, #20     ; add 20 to B
STORE 0x300, B ; store B into memory cell 0x300


This is how it would be executed:

Clock 1: Load = LOAD; Execute = (empty); Store = (empty).
The LOAD instruction is fetched from memory.

Clock 2: Load = MOVE; Execute = LOAD; Store = (empty).
The LOAD instruction is executed, while the MOVE instruction is fetched from memory.

Clock 3: Load = ADD; Execute = MOVE; Store = LOAD.
The LOAD instruction is in the Store stage, where its result (the number 40) will be stored in register A. In the meantime, the MOVE instruction is being executed. Since it must move the contents of A into B, it must wait for the ending of the LOAD instruction.

Clock 4: Load = STORE; Execute = ADD; Store = MOVE.

The STORE instruction is loaded, while the MOVE instruction is finishing off and the ADD is calculating; and so on. Note that, sometimes, an instruction will depend on the result of another one (like our MOVE example). When more than one instruction references a particular location for an operand, either reading it (as an input) or writing it (as an output), executing those instructions in an order different from the original program order can lead to hazards (mentioned above). There are several established techniques for either preventing hazards from occurring, or working around them if they do.

Complications: many designs include pipelines as long as 7, 10 and even 20 stages (as in the Intel Pentium 4). The later "Prescott" and "Cedar Mill" Pentium 4 cores (and their Pentium D derivatives) had a 31-stage pipeline, the longest in mainstream consumer computing. The Xelerator X10q has a pipeline more than a thousand stages long. The downside of a long pipeline is that when a program branches, the processor cannot know where to fetch the next instruction from and must wait until the branch instruction finishes, leaving the pipeline behind it empty. In the extreme case, the performance of a pipelined processor could theoretically approach that of an un-pipelined processor, or even be slightly worse, if all but one of the pipeline stages are idle and a small overhead is present between stages.

Branch prediction attempts to alleviate this problem by guessing whether the branch will be taken or not and speculatively executing the code path that it predicts will be taken. When its predictions are correct, branch prediction avoids the penalty associated with branching. However, branch prediction itself can end up exacerbating the problem if branches are predicted poorly, as the incorrect code path which has begun execution must be flushed from the pipeline before execution resumes at the correct location. In certain applications, such as supercomputing, programs are specially written to branch rarely, and so very long pipelines can speed up computation by reducing cycle time. If branching happens constantly, re-ordering branches such that the instructions more likely to be needed are placed into the pipeline can significantly reduce the speed losses associated with having to flush failed branches.

Self-modifying programs: because of the instruction pipeline, code that the processor loads will not immediately execute. Due to this, updates to the code very near the current location of execution may not take effect, because they are already loaded into the prefetch input queue. Instruction caches make this phenomenon even worse. This is only relevant to self-modifying programs.

Mathematical pipelines: mathematical or arithmetic pipelines are different from instruction pipelines in that, when mathematically processing large arrays or vectors, a particular mathematical operation, such as a multiply, is repeated many thousands of times. In this environment, an instruction need only kick off an event whereby the arithmetic logic unit (which is pipelined) takes over and begins its series of calculations. Most of these circuits can be found today in math processors and in the math processing sections of CPUs like the Intel Pentium line.

History: math processing (super-computing) began in earnest in the late 1970s with vector processors and array processors: usually very large, bulky super-computing machines that needed special environments and super-cooling of the cores. One of the early supercomputers was the Cyber series built by Control Data Corporation. Its main architect was Seymour Cray, who later resigned from CDC to head up Cray Research. Cray developed the X-MP line of supercomputers, using pipelining for both multiply and add/subtract functions. Later, Star Technologies took pipelining to another level by adding parallelism (several pipelined functions working in parallel), developed by their engineer Roger Chen. In 1984, Star Technologies made another breakthrough with the pipelined divide circuit, developed by James Bradley. By the mid-1980s, super-computing had taken off with offerings from many different companies around the world. Today, most of these circuits can be found embedded inside most microprocessors.
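Since branch prediction is central to the long-pipeline discussion above, here is a minimal Verilog sketch of the classic 2-bit saturating-counter predictor (a standard textbook scheme; this module and its port names are our illustration, not part of this project's processor):

module branch_predictor (
    input  wire clk,
    input  wire rst,
    input  wire update,        // assert when a branch actually resolves
    input  wire taken,         // actual outcome of the resolved branch
    output wire predict_taken  // prediction used by the fetch stage
);
    reg [1:0] state;  // 00,01 = predict not taken; 10,11 = predict taken
    assign predict_taken = state[1];
    always @(posedge clk) begin
        if (rst)
            state <= 2'b01;                        // weakly not taken
        else if (update) begin
            if (taken  && state != 2'b11) state <= state + 2'b01;
            if (!taken && state != 2'b00) state <= state - 2'b01;
        end
    end
endmodule

The top state bit doubles as the prediction, so the predicted direction only flips after two consecutive mispredictions, which tolerates one-off anomalies such as loop exits.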


CHAPTER 3
ARCHITECTURE:
[Figure: block diagram of the four-stage datapath, FETCH (1), DECODE (2), EXECUTE (3), MEMORY (4). FETCH: PC (1) with a +1 incrementer, PROGRAM MEMORY, INST REG (2). DECODE: DECODER on (IR) 15-12, micro-program memory (MPME), CONTROL, REGISTER ARRAY addressed by (IR) 8-11 = R3, (IR) 4-7 = R2 and (IR) 0-3 = R1, operand registers OA and OB, RSE (3), MPAE (3). EXECUTE: ALU, ACCUMULATOR (4), STR (4), RSM (4), MPAM (4), MPMM. MEMORY: DATA MEMORY with RD / WR, LDR, and a 2-to-1 MUX (inputs 0 and 1, select S) feeding the register write-back; control signals SAF, S, LRG.]

INSTRUCTION FORMAT

OPCODE | R3      | R2          | R1
AND    | SOURCE  | SOURCE      | DESTINATION
OR     | SOURCE  | SOURCE      | DESTINATION
ADD    | SOURCE  | SOURCE      | DESTINATION
SUB    | SOURCE  | SOURCE      | DESTINATION
MOVE   | XXXXXXX | SOURCE      | DESTINATION
LOAD   | XXXXXXX | SOURCE      | DESTINATION
STORE  | SOURCE  | DESTINATION | XXXXXXX
NOT    | XXXXXXX | SOURCE      | XXXXXXX

EXAMPLE PROGRAM

NUMBER | INSTRUCTION  | OPERATION         | BINARY CODE
I1     | ADD R5 R4 R1 | [R1]<-[R5]+[R4]   | 16'h0541
I2     | SUB R6 R4 R7 | [R7]<-[R4]-[R6]   | 16'h1647
I3     | MOVE R4 R3   | [R3]<-[R4]        | 16'h4043
I4     | OR R3 R7 R0  | [R0]<-[R3]||[R7]  | 16'h3370
I5     | LOAD R0 R3   | [R3]<-[[R0]]      | 16'h5503
I6     | AND R7 R0 R2 | [R2]<-[R7]&&[R0]  | 16'h2702
I7     | STORE R1 R6  | [[R6]]<-[R1]      | 16'h6160
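A hedged sketch of how these 16-bit encodings split into fields (the field positions follow the decoder in Chapter 4: opcode in bits 15-12, R3 in 11-8, R2 in 7-4, R1 in 3-0; the module name is ours, for illustration only):

module instr_fields (
    input  wire [15:0] ir,      // e.g. 16'h0541 = ADD R5 R4 R1
    output wire [3:0]  opcode,  // IR[15:12], selects the micro-code word
    output wire [3:0]  r3,      // IR[11:8], source register feeding OB
    output wire [3:0]  r2,      // IR[7:4],  source register feeding OA
    output wire [3:0]  r1       // IR[3:0],  destination, carried via RSE
);
    assign opcode = ir[15:12];
    assign r3     = ir[11:8];
    assign r2     = ir[7:4];
    assign r1     = ir[3:0];
endmodule

For 16'h0541 this yields opcode = 0 (ADD), r3 = 5, r2 = 4, r1 = 1, matching row I1 of the example program.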

MICRO PROGRAM

MNEMONIC | SAF (4-bit) | S (1-bit) | RGW (1-bit) | MW (1-bit) | MR (1-bit) | MICRO PROGRAM MEMORY CONTENT (CODE)
ADD      | 4'h0        | 1         | 1           | 0          | 0          | 8'h0C
SUB      | 4'h1        | 1         | 1           | 0          | 0          | 8'h1C
AND      | 4'h2        | 1         | 1           | 0          | 0          | 8'h2C
OR       | 4'h3        | 1         | 1           | 0          | 0          | 8'h3C
MOVE     | 4'h4        | 1         | 1           | 0          | 0          | 8'h4C
LOAD     | 4'h5        | 0         | 1           | 0          | 1          | 8'h55
STORE    | 4'h6        | 1         | 0           | 1          | 0          | 8'h6A
NOT      | 4'h7        | 1         | 1           | 0          | 0          | 8'h7C
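Each row packs into the 8-bit micro-instruction as the concatenation {SAF[3:0], S, RGW, MW, MR}. The Verilog constants below (a hedged reconstruction from the table, with constant names of our own choosing) reproduce the listed codes exactly:

module ucode_table;  // illustrative constants only, not part of the datapath
    //                         SAF   S     RGW   MW    MR
    localparam [7:0] U_ADD   = {4'h0, 1'b1, 1'b1, 1'b0, 1'b0};  // = 8'h0C
    localparam [7:0] U_SUB   = {4'h1, 1'b1, 1'b1, 1'b0, 1'b0};  // = 8'h1C
    localparam [7:0] U_LOAD  = {4'h5, 1'b0, 1'b1, 1'b0, 1'b1};  // = 8'h55
    localparam [7:0] U_STORE = {4'h6, 1'b1, 1'b0, 1'b1, 1'b0};  // = 8'h6A
endmodule

For example, U_STORE sets MW (memory write) and clears RGW (register write), which is exactly what the MEMORY stage needs for a store.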

SIGNALS and their full forms:
SAF: SELECT ALU FUNCTION
S: MUX SELECT LINE
RGW: REGISTER WRITE
MW: MEMORY WRITE
MR: MEMORY READ

Register transfers per stage:
FETCH: [IR]<-[[PC]]; [PC]<-[PC]+1
DECODE: DECODER=[IR]15-12; [MPAE]<-DECODER; [OB]<-[[IR]11-8]; [OA]<-[[IR]7-4]; [RSE]<-[IR]3-0
EXECUTE: [STR]<-[OA]; [AR]<-ALU_OUT; [RSM]<-[RSE]; [MPAM]<-[MPAE]
MEMORY: consumes [STR], [AR], [RSM], [MPAM]


Pipelined execution trace of the example program (clk1 through clk10):

clk1:
I1-FETCH: [IR]<-[[0000]]; [PC]<-0000+1.

clk2:
I1-DECODE: [IR]=16'h0541; [PC]=0001; [MPAE]<-DECODER; [OB]<-[R5]; [OA]<-[R4]; [RSE]<-1. (Note: initially [R5]=4, [R4]=7.)
I2-FETCH: [IR]<-[[0001]]; [PC]<-0001+1.

clk3:
I1-EXECUTE: [OB]=4; [OA]=7; [RSE]=1; [MPAE]=8'h0C; [STR]<-7; [AR]<-11; [RSM]<-1; [MPAM]<-8'h0C.
I2-DECODE: [IR]=16'h1647; [PC]=0002; [MPAE]<-DECODER; [OB]<-[R6]; [OA]<-[R4]; [RSE]<-7. (Note: initially [R6]=4.)
I3-FETCH: [IR]<-[[0002]]; [PC]<-0002+1.

clk4:
I1-MEMORY: [STR]=7; [AR]=11; [RSM]=1; [MPAM]=8'h0C; [R1]=11.
I2-EXECUTE: [OB]=4; [OA]=7; [RSE]=7; [MPAE]=8'h1C; [STR]<-7; [AR]<-3; [RSM]<-7; [MPAM]<-8'h1C.
I3-DECODE: [IR]=16'h4043; [PC]=0003; [MPAE]<-DECODER; [OB]<-[R0]; [OA]<-[R4]; [RSE]<-3.
I4-FETCH: [IR]<-[[0003]]; [PC]<-0003+1.

clk5:
I2-MEMORY: [STR]=7; [AR]=3; [RSM]=7; [MPAM]=8'h1C; [R7]=3.
I3-EXECUTE: [OB]=X; [OA]=7; [RSE]=3; [MPAE]=8'h4C; [STR]<-7; [AR]<-7; [RSM]<-3; [MPAM]<-8'h4C.
I4-DECODE: [IR]=16'h3370; [PC]=0004; [MPAE]<-DECODER; [OB]<-[R3]; [OA]<-[R7]; [RSE]<-0.
I5-FETCH: [IR]<-[[0004]]; [PC]<-0004+1.

clk6:
I3-MEMORY: [STR]=7; [AR]=7; [RSM]=3; [MPAM]=8'h4C; [R3]=7.
I4-EXECUTE: [OB]=7; [OA]=3; [RSE]=0; [MPAE]=8'h3C; [STR]<-3; [AR]<-7; [RSM]<-0; [MPAM]<-8'h3C.
I5-DECODE: [IR]=16'h5503; [PC]=0005; [MPAE]<-DECODER; [OB]<-[R5]; [OA]<-[R0]; [RSE]<-3.
I6-FETCH: [IR]<-[[0005]]; [PC]<-0005+1.

clk7:
I4-MEMORY: [STR]=3; [AR]=7; [RSM]=0; [MPAM]=8'h3C; [R0]=7.
I5-EXECUTE: [OB]=4; [OA]=7; [RSE]=3; [MPAE]=8'h55; [STR]<-7; [AR]<-7; [RSM]<-3; [MPAM]<-8'h55.
I6-DECODE: [IR]=16'h2702; [PC]=0006; [MPAE]<-DECODER; [OB]<-[R7]; [OA]<-[R0]; [RSE]<-2.
I7-FETCH: [IR]<-[[0006]]; [PC]<-0006+1.

clk8:
I5-MEMORY: [STR]=7; [AR]=7; [RSM]=3; [MPAM]=8'h55; [R3]=MEM[7] (the contents of memory location 7).
I6-EXECUTE: [OB]=3; [OA]=7; [RSE]=2; [MPAE]=8'h2C; [STR]<-7; [AR]<-MEM[7] && R3; [RSM]<-2; [MPAM]<-8'h2C.
I7-DECODE: [IR]=16'h6160; [PC]=0007; [MPAE]<-DECODER; [OB]<-[R1]; [OA]<-[R6]; [RSE]<-0.

clk9:
I6-MEMORY: [STR]=7; [AR]=MEM[7] && 3; [RSM]=2; [MPAM]=8'h2C; [R2]=MEM[7] && R3.
I7-EXECUTE: [OB]=3; [OA]=4; [RSE]=0; [MPAE]=8'h6A; [STR]<-4; [AR]<-3; [RSM]<-0; [MPAM]<-8'h6A.

clk10:
I7-MEMORY: [STR]=4; [AR]=3; [RSM]=0; [MPAM]=8'h6A; MEM[3]=4.

(1) FETCH and (2) DECODE:
R2=[IR](7-4); [OA]<--[R2]; R3=[IR](11-8); [OB]<--[R3]; [RSE]=[IR](3-0).
CONTROL SIGNAL: SAF.
(3) EXECUTE:
[STR]<--[OA]; [AR]<--ALU_OUT; [RSM]<--[RSE].
CONTROL SIGNALS: S4, RD, WR, LRG.
(4) MEMORY:
R1=[RSM]; DATA.MEMORY ADRS<--[AR];
[DATA.MEMORY ADRS]<--[STR]; // FOR STORE INSTRUCTIONS ONLY
[R1]<--[DATA.MEMORY ADRS]; // FOR LOAD INSTRUCTIONS ONLY
[R1]<--[AR]; // FOR ARITHMETIC AND LOGIC INSTRUCTIONS ONLY
Signal legend: SAF = SELECT ALU FUNCTION; S4 = MUX SELECT [1: AR; 0: LDR]; RGW = REGISTER WRITE; MW = MEMORY WRITE; MR = MEMORY READ.


CHAPTER 4
VERILOG CODE:

module data_memory();
    parameter data_address = 65536;
    parameter word_size    = 16;
    // memory with 16-bit word size and 65536 memory locations
    reg [word_size-1:0] datamemory [0:data_address-1];
    initial begin
        $readmemb("init.data", datamemory);  // preload data memory from file
    end
endmodule

module program_memory(memory_out, address, data_in_memory, write_memory, clk, rst);
    parameter wordsize   = 16;
    parameter memorysize = 256;
    parameter addrsize   = 8;
    output [wordsize-1:0] memory_out;
    input  [addrsize-1:0] address;
    input  [wordsize-1:0] data_in_memory;
    input                 write_memory;  // not needed for a read-only program store
    input                 clk;
    input                 rst;
    // NOTE: the body of this module is missing from the original listing;
    // reconstructed here to mirror program_memory1 further on.
    reg [wordsize-1:0] memory [memorysize-1:0];
    assign memory_out = memory[address];  // asynchronous read
    always @(posedge clk)
        if (write_memory) memory[address] <= data_in_memory;
endmodule

module ir(ir_out, data_in_ir, clk, rst);
    parameter wordsize = 16;
    input [wordsize-1:0] data_in_ir;  // INSTRUCTION FROM PROGRAM MEMORY

    input clk;
    input rst;
    output [wordsize-1:0] ir_out;
    reg [wordsize-1:0] ir_out;  // width fixed: original declared a 1-bit reg
    always @(posedge clk) begin
        if (rst)
            ir_out <= 0;
        else
            ir_out <= data_in_ir;
    end
endmodule

module micro_memory(memory_out, address, data_in_memory, write_memory, clk, rst);
    parameter uwordsize   = 8;
    parameter umemorysize = 16;
    parameter uaddrsize   = 4;
    output [uwordsize-1:0] memory_out;
    input  [uaddrsize-1:0] address;
    input  [uwordsize-1:0] data_in_memory;  // unused: the micro-code is fixed
    input                  write_memory;    // unused
    input                  clk;
    input                  rst;
    // NOTE: the body of this module is missing from the original listing;
    // reconstructed here from the MICRO PROGRAM table (codes 8'h0C..8'h7C).
    reg [uwordsize-1:0] memory [umemorysize-1:0];
    initial begin
        memory[4'h0] = 8'h0c;  // ADD
        memory[4'h1] = 8'h1c;  // SUB
        memory[4'h2] = 8'h2c;  // AND
        memory[4'h3] = 8'h3c;  // OR
        memory[4'h4] = 8'h4c;  // MOVE
        memory[4'h5] = 8'h55;  // LOAD
        memory[4'h6] = 8'h6a;  // STORE
        memory[4'h7] = 8'h7c;  // NOT
    end
    assign memory_out = memory[address];  // asynchronous read
endmodule

module register8(register_out, register_in, clk, rst);
    parameter r8wordsize = 8;
    output [r8wordsize-1:0] register_out;
    input  [r8wordsize-1:0] register_in;
    input clk;
    input rst;
    reg [r8wordsize-1:0] register_out;
    initial register_out = 0;
    always @(posedge clk) begin
        if (rst)
            register_out <= 0;
        else
            register_out <= register_in;
    end
endmodule


module register_array(register_out1, register_out2, address1, address2, address3, data_in_register, write_register, clk, rst);
    parameter regwordsize   = 16;
    parameter regmemorysize = 16;
    parameter regaddrsize   = 4;
    output [regwordsize-1:0] register_out1;
    output [regwordsize-1:0] register_out2;
    input  [regaddrsize-1:0] address1;
    input  [regaddrsize-1:0] address2;
    input  [regaddrsize-1:0] address3;
    input  [regwordsize-1:0] data_in_register;
    input                    write_register;
    input clk;
    input rst;
    reg [regwordsize-1:0] memory [regmemorysize-1:0];
    initial begin  // initial register-file contents
        memory[4'h0] <= 16'h0001; memory[4'h1] <= 16'h0002;
        memory[4'h2] <= 16'h0003; memory[4'h3] <= 16'h0013;
        memory[4'h4] <= 16'h0023; memory[4'h5] <= 16'h0001;
        memory[4'h6] <= 16'h0002; memory[4'h7] <= 16'h0003;
        memory[4'h8] <= 16'h0013; memory[4'h9] <= 16'h0023;
        memory[4'ha] <= 16'h0001; memory[4'hb] <= 16'h0002;
        memory[4'hc] <= 16'h0003; memory[4'hd] <= 16'h0013;
        memory[4'he] <= 16'h0023; memory[4'hf] <= 16'h0001;
    end
    // asynchronous read operation for data output

    assign register_out1 = memory[address1];
    assign register_out2 = memory[address2];
    // data write operation
    always @(posedge clk) begin
        if (write_register)
            memory[address3] <= data_in_register;
    end
endmodule

// Used for the OA and OB operand registers:
module register16(register_out, register_in, clk, rst);
    parameter r16wordsize = 16;
    output [r16wordsize-1:0] register_out;
    input  [r16wordsize-1:0] register_in;
    input clk;
    input rst;
    reg [r16wordsize-1:0] register_out;
    initial register_out = 0;
    always @(posedge clk) begin
        if (rst)
            register_out <= 0;
        else
            register_out <= register_in;
    end
endmodule

// Used for RSE:
module register4(register_out, register_in, clk, rst);
    parameter r4wordsize = 4;
    output [r4wordsize-1:0] register_out;
    input  [r4wordsize-1:0] register_in;
    input clk;
    input rst;
    reg [r4wordsize-1:0] register_out;
    initial register_out = 0;
    always @(posedge clk) begin
        if (rst)
            register_out <= 0;

        else
            register_out <= register_in;
    end
endmodule

module alu(alu_out, OB, OA, SAF);
    parameter wordsize = 16;
    parameter N = 4;
    output [wordsize-1:0] alu_out;
    input  [wordsize-1:0] OB;
    input  [wordsize-1:0] OA;
    input  [N-1:0]        SAF;
    reg [wordsize-1:0] alu_out;  // width fixed: original declared a 1-bit reg
    always @(SAF or OA or OB) begin
        case (SAF)
            4'd0 : alu_out = OA + OB;  // ADDITION
            4'd1 : alu_out = OA - OB;  // SUBTRACTION
            4'd2 : alu_out = OA & OB;  // AND OF OA AND OB
            4'd3 : alu_out = OA | OB;  // OR (SAF=3 per the micro-program table;
                                       // the original listing had ~OA here)
            4'd4 : alu_out = OA;       // MOVE INSTRUCTION
            4'd5 : alu_out = OA;       // LOAD INSTRUCTION
            4'd6 : alu_out = OB;       // STORE INSTRUCTION
            4'd7 : alu_out = ~OA;      // NOT (SAF=7 per the micro-program table)
            default : alu_out = 0;     // INVALID ALU_CONTROL SIGNAL
        endcase
    end
endmodule

module program_memory1(memory_out, address, data_in_memory, write_memory, clk, rst);
    parameter dwordsize = 16;

    parameter dmemorysize = 1024;
    parameter daddrsize   = 16;
    output [dwordsize-1:0] memory_out;
    input  [daddrsize-1:0] address;
    input  [dwordsize-1:0] data_in_memory;
    input                  write_memory;
    input clk;
    input rst;
    reg [dwordsize-1:0] memory [dmemorysize-1:0];
    initial begin
        memory[16'h0000] <= 8'h0c;
        memory[16'h0001] <= 8'h1c;
    end
    // asynchronous read operation for data output
    assign memory_out = memory[address];
    // data write operation
    always @(posedge clk) begin
        if (write_memory)
            memory[address] <= data_in_memory;
    end
endmodule

module memory(alu_out, ar_out, rsm_out, mpam_out, str_out, rse, oa_out, ob_out, register_out1, register_out2, mpae_out, umemory_out, ir_out, mem_out, pc_out, clk, rst);
    parameter wordsize    = 16;
    parameter memorysize  = 256;
    parameter addrsize    = 8;
    parameter uwordsize   = 8;
    parameter r8wordsize  = 8;
    parameter r16wordsize = 16;
    parameter regwordsize = 16;
    parameter r4wordsize  = 4;
    parameter dwordsize   = 16;
    parameter dmemorysize = 65536;
    parameter daddrsize   = 16;
    input clk;

    input rst;
    output [wordsize-1:0] mem_out;
    output [addrsize-1:0] pc_out;
    output [wordsize-1:0] ir_out;
    // decode outputs
    output [uwordsize-1:0]   umemory_out;
    output [r8wordsize-1:0]  mpae_out;
    output [regwordsize-1:0] register_out1;
    output [regwordsize-1:0] register_out2;
    output [r16wordsize-1:0] oa_out;
    output [r16wordsize-1:0] ob_out;
    output [r4wordsize-1:0]  rse;
    // execute outputs
    output [r16wordsize-1:0] alu_out;
    output [r16wordsize-1:0] ar_out;
    output [r16wordsize-1:0] str_out;
    output [r4wordsize-1:0]  rsm_out;
    output [r8wordsize-1:0]  mpam_out;
    // memory-stage internals
    wire [dwordsize-1:0] ldr;  // declaration fixed: original wrote "wire ldr[...]"
    execute execute1(alu_out, ar_out, rsm_out, mpam_out, str_out, rse, oa_out, ob_out, register_out1, register_out2, mpae_out, umemory_out, ir_out, mem_out, pc_out, clk, rst);
    // module program_memory1(memory_out,address,data_in_memory,write_memory,clk,rst);
    program_memory1 data_memory(.memory_out(ldr), .address(ar_out), .data_in_memory(str_out), .write_memory(mpam_out[1:1]), .clk(clk), .rst(rst));
    wire [dwordsize-1:0] register_wire;
    assign register_wire = mpam_out[3:3] ? ar_out : ldr;  // syntax fixed: was "(mpam_out[3:3])?(ar_out:ldr)"
endmodule
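Two modules that this listing instantiates (pc, used in the fetch logic, and decode1, used inside execute) do not appear in this section of the report. The sketches below are hedged reconstructions of ours, inferred from the instantiations and Chapter 3's register-transfer tables; the port widths and behaviour are assumptions, not the report's code:

module pc(pc_out, clk, rst);
    parameter addrsize = 8;
    output [addrsize-1:0] pc_out;
    input clk;
    input rst;
    reg [addrsize-1:0] pc_out;
    initial pc_out = 0;
    // [PC] <- [PC] + 1 every clock, as Chapter 3's FETCH stage requires
    always @(posedge clk) begin
        if (rst)
            pc_out <= 0;
        else
            pc_out <= pc_out + 1;
    end
endmodule

module decode1(rse, oa_out, ob_out, register_out1, register_out2, mpae_out, umemory_out, ir_out, mem_out, pc_out, clk, rst);
    parameter wordsize = 16, addrsize = 8, uwordsize = 8, r8wordsize = 8,
              r16wordsize = 16, regwordsize = 16, r4wordsize = 4;
    output [r4wordsize-1:0]  rse;
    output [r16wordsize-1:0] oa_out, ob_out;
    output [regwordsize-1:0] register_out1, register_out2;
    output [r8wordsize-1:0]  mpae_out;
    output [uwordsize-1:0]   umemory_out;
    output [wordsize-1:0]    ir_out, mem_out;
    output [addrsize-1:0]    pc_out;
    input clk, rst;
    // fetch + decode, wired as in the main module
    fetch fetch1(ir_out, mem_out, pc_out, clk, rst);
    micro_memory umemory(.memory_out(umemory_out), .address(ir_out[15:12]), .clk(clk), .rst(rst));
    register8 mpae(.register_out(mpae_out), .register_in(umemory_out), .clk(clk), .rst(rst));
    register_array regfile(.register_out1(register_out1), .register_out2(register_out2),
                           .address1(ir_out[7:4]), .address2(ir_out[11:8]),
                           .address3(4'h0), .data_in_register(16'h0000),
                           .write_register(1'b0), .clk(clk), .rst(rst));
    // NOTE: the write-back wiring is omitted in this sketch; in the full design
    // the MEMORY stage drives address3 / data_in_register / write_register.
    register16 oa(.register_out(oa_out), .register_in(register_out1), .clk(clk), .rst(rst));
    register16 ob(.register_out(ob_out), .register_in(register_out2), .clk(clk), .rst(rst));
    register4 rse1(.register_out(rse), .register_in(ir_out[3:0]), .clk(clk), .rst(rst));
endmodule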

parameter uwordsize=8; parameter r8wordsize=8; parameter r16wordsize=16; parameter regwordsize=16; parameter r4wordsize=4; input clk; input rst; output [wordsize -1 :0] mem_out; output [addrsize -1 :0] pc_out; output [wordsize -1 :0] ir_out; //decode inputs output [uwordsize -1 :0] umemory_out; output [r8wordsize -1 :0] mpae_out; output[regwordsize -1 :0] register_out1; output[regwordsize -1 :0] register_out2; output [r16wordsize -1 :0] oa_out; output [r16wordsize -1 :0] ob_out; output [r4wordsize -1 :0] rse; //execute inputs output [r16wordsize -1 :0] alu_out; output [r16wordsize -1 :0] ar_out; output [r16wordsize -1 :0] str_out; output [r4wordsize -1 :0] rsm_out; output [r8wordsize -1 :0] mpam_out; //module decode1(rse,oa_out,ob_out,register_out1,register_out2,mpae_out,umemory_out,ir_out,mem_out, pc_out,clk,rst); decode1 decode(rse,oa_out,ob_out,register_out1,register_out2,mpae_out,umemory_out,ir_out,mem_out,p c_out,clk,rst); //module alu(alu_out,OB,OA,SAF); alu alu1(.alu_out(alu_out),.OB(ob_out),.OA(oa_out),.SAF(mpae_out[7:4])); //module register16(register_out,register_in,clk,rst); register16 ar(.register_out(ar_out),.register_in(alu_out),.clk(clk),.rst(rst)); register16 str(.register_out(str_out),.register_in(oa_out),.clk(clk),.rst(rst)); register4 rsm(.register_out(rsm_out),.register_in(rse),.clk(clk),.rst(rst));
register8 mpam(.register_out(mpam_out), .register_in(mpae_out), .clk(clk), .rst(rst));

endmodule

module fetch(ir_out, mem_out, pc_out, clk, rst);
parameter wordsize=16;
parameter memorysize=256;
parameter addrsize=8;
output [wordsize-1:0] mem_out;
output [addrsize-1:0] pc_out;
output [wordsize-1:0] ir_out;
input clk;
input rst;

//module pc(pc_out,load_pc,data_in_pc,clk,rst);
pc pc1(.pc_out(pc_out), .clk(clk), .rst(rst));

//module program_memory(memory_out,address,data_in_memory,write_memory,clk,rst);
program_memory pm(.memory_out(mem_out), .address(pc_out), .clk(clk), .rst(rst));

//module ir(ir_out,data_in_ir,clk,rst);
ir ir1(.ir_out(ir_out), .data_in_ir(mem_out), .clk(clk), .rst(rst));

endmodule

module main(ldr, register_wire, alu_out, ar_out, rsm_out, mpam_out, str_out, rse,
            oa_out, ob_out, register_out1, register_out2, mpae_out, umemory_out,
            ir_out, mem_out, pc_out, clk, rst);
parameter wordsize=16;
parameter memorysize=256;
parameter addrsize=8;
parameter uwordsize=8;
parameter r8wordsize=8;
parameter r16wordsize=16;
parameter regwordsize=16;
parameter r4wordsize=4;
parameter dwordsize=16;
parameter dmemorysize=65536;
parameter daddrsize=16;
input clk;
input rst;
//fetch inputs
output [wordsize-1:0] mem_out;
output [addrsize-1:0] pc_out;
output [wordsize-1:0] ir_out;
//decode inputs
output [uwordsize-1:0] umemory_out;
output [r8wordsize-1:0] mpae_out;
output [regwordsize-1:0] register_out1;
output [regwordsize-1:0] register_out2;
output [r16wordsize-1:0] oa_out;
output [r16wordsize-1:0] ob_out;
output [r4wordsize-1:0] rse;
//execute inputs
output [r16wordsize-1:0] alu_out;
output [r16wordsize-1:0] ar_out;
output [r16wordsize-1:0] str_out;
output [r4wordsize-1:0] rsm_out;
output [r8wordsize-1:0] mpam_out;
output [dwordsize-1:0] ldr;
output [dwordsize-1:0] register_wire;

//FETCH PHASE
// [pc]<----[pc]+1
pc pc1(.pc_out(pc_out), .clk(clk), .rst(rst));   //pc is incremented at the positive edge of clock

// [mem_out]<----[[pc]]
program_memory pm(.memory_out(mem_out), .address(pc_out), .clk(clk), .rst(rst));   //asynchronous memory read

// [ir_out]<----[[pc]]
ir ir1(.ir_out(ir_out), .data_in_ir(mem_out), .clk(clk), .rst(rst));

//DECODE PHASE
// [umemory_out]<----[[ir_out[15:12]]]
micro_memory umemory(.memory_out(umemory_out), .address(ir_out[15:12]), .clk(clk), .rst(rst));   //decode opcode

// [mpae_out]<----[umemory_out]
register8 mpae(.register_out(mpae_out), .register_in(umemory_out), .clk(clk), .rst(rst));

// register_out1<----[ir_out[7:4]]
// register_out2<----[ir_out[11:8]]
// [[rsm_out]]<----register_wire
register_array register(.register_out1(register_out1), .register_out2(register_out2),
                        .address3(rsm_out), .data_in_register(register_wire),
                        .write_register(mpam_out[2:2]),
                        .address1(ir_out[7:4]), .address2(ir_out[11:8]),
                        .clk(clk), .rst(rst));

// [oa]<----register_out1
register16 oa(.register_out(oa_out), .register_in(register_out1), .clk(clk), .rst(rst));
// [ob]<----register_out2
register16 ob(.register_out(ob_out), .register_in(register_out2), .clk(clk), .rst(rst));
// [rse]<----[ir_out[3:0]]
register4 rse1(.register_out(rse), .register_in(ir_out[3:0]), .clk(clk), .rst(rst));

//EXECUTE PHASE
// [alu_out]<----[oa] SAF [ob]
alu alu1(.alu_out(alu_out), .OB(ob_out), .OA(oa_out), .SAF(mpae_out[7:4]));
// [ar_out]<----[alu_out]
register16 ar(.register_out(ar_out), .register_in(alu_out), .clk(clk), .rst(rst));
// [str_out]<----[oa]
register16 str(.register_out(str_out), .register_in(oa_out), .clk(clk), .rst(rst));
// [rsm_out]<----[rse]
register4 rsm(.register_out(rsm_out), .register_in(rse), .clk(clk), .rst(rst));
// [mpam_out]<----[mpae_out]
register8 mpam(.register_out(mpam_out), .register_in(mpae_out), .clk(clk), .rst(rst));

//MEMORY PHASE
//DATA MEMORY
// [[ar_out]]<----[str_out] .....IF STORE INSTRUCTION
// ldr<----[[ar_out]] ...........IF LOAD INSTRUCTION
program_memory1 data_memory(.memory_out(ldr), .address(ar_out),
                            .data_in_memory(str_out),
                            .write_memory(mpam_out[1:1]),
                            .clk(clk), .rst(rst));

//MUX
// register_wire<----[ar_out]......IF 1 IS SELECTED
// register_wire<----ldr ..........IF 0 IS SELECTED
assign register_wire = mpam_out[3:3] ? ar_out : ldr;

endmodule
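The complete pipeline was verified by driving the main module from a testbench. The project's original testbench is not reproduced in this report; the sketch below is a minimal, hypothetical version that assumes only the port list and widths of the main module above. The clock period, reset duration and run length chosen here are illustrative.

module main_tb;
  reg clk, rst;

  wire [15:0] ldr, register_wire, alu_out, ar_out, str_out;
  wire [15:0] oa_out, ob_out, register_out1, register_out2;
  wire [15:0] ir_out, mem_out;
  wire [7:0]  mpam_out, mpae_out, umemory_out, pc_out;
  wire [3:0]  rsm_out, rse;

  //device under test: the flat pipelined processor
  main uut(ldr, register_wire, alu_out, ar_out, rsm_out, mpam_out, str_out,
           rse, oa_out, ob_out, register_out1, register_out2, mpae_out,
           umemory_out, ir_out, mem_out, pc_out, clk, rst);

  //free-running clock, 10 time units per cycle (illustrative)
  initial clk = 1'b0;
  always #5 clk = ~clk;

  //apply reset, then let the pipeline run for a number of cycles
  initial begin
    rst = 1'b1;
    #12 rst = 1'b0;
    #300 $finish;
  end

  //trace the key pipeline signals whenever any of them changes
  initial
    $monitor("t=%0t pc=%h ir=%h oa=%h ob=%h alu=%h ar=%h",
             $time, pc_out, ir_out, oa_out, ob_out, alu_out, ar_out);
endmodule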


SIMULATION RESULTS FOR SOME ENTITIES:

[Simulation waveforms were captured for the following entities: ALU, PC, FETCH, EXECUTE, INSTRUCTION REGISTER and MAIN. The waveform figures are not reproduced here.]
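As an indication of how the per-entity waveforms can be regenerated, the following is a minimal, hypothetical testbench for the ALU. It assumes only the port widths visible in the alu1 instantiation above (16-bit OA and OB, 4-bit SAF); the operand values and the sweep of function-select codes are placeholders, since the actual SAF encodings are defined by the micro-memory contents.

module alu_tb;
  reg  [15:0] OA, OB;
  reg  [3:0]  SAF;
  wire [15:0] alu_out;

  //device under test; port names follow the alu1 instantiation
  alu uut(.alu_out(alu_out), .OB(OB), .OA(OA), .SAF(SAF));

  integer i;
  initial begin
    OA = 16'h00F0;   //arbitrary test operands
    OB = 16'h0F0F;
    //sweep all 16 function codes and log the result of each
    for (i = 0; i < 16; i = i + 1) begin
      SAF = i;       //truncated to 4 bits
      #10 $display("SAF=%h OA=%h OB=%h -> alu_out=%h", SAF, OA, OB, alu_out);
    end
    $finish;
  end
endmodule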

CHAPTER 5

Bibliography:

1. Robert Heath and Sreenivas Durbha, "Pipelined Architecture Processors from Behavioural-Level," IEEE, 2001. Dept. of Electrical Engineering, 453 Anderson Hall, University of Kentucky, Lexington, KY 40506.
2. Nguyen Minh Huu, Bruno Robisson and Michel Agoyan, "Low-Cost Fault Tolerance on the ALU in Simple Pipelined Processors," IEEE, 2010. CEA-Leti, Centre Microélectronique de Provence, 880 route de Mimet, France.
3. J.A.P. Reyes, L.P. Alarcon and L. Alarilla, "A Study of Floating-Point Architectures for Pipelined RISC Processors," Proceedings of the 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), 2006, p. 2716.
4. C.C. Arandilla, J.B.A. Constantino, A.O.M. Glova, A.P. Ballesil-Alvarez and J.A.P. Reyes, "High-Level Implementation of the 5-Stage Pipelined ARM9TDMI Core," TENCON 2010 - 2010 IEEE Region 10 Conference, 2010.
5. Shofiqul Islam, Debanjan Chattopadhyay, Manoja Kumar Das, V. Neelima and Rahul Sarkar, "Design of High-Speed-Pipelined Execution Unit of 32-bit RISC Processor," India Conference, 2006 Annual IEEE, 2006.
6. T.R. Padmanabhan and B. Bala Tripura Sundari, Design through Verilog, WSE, 2009.
7. M. Morris Mano, Computer System Architecture, 3rd Edition, Pearson Education.
8. A.K. Ray, Advanced Microprocessors and Peripherals, Tata McGraw-Hill, 2006.
9. www.isi.edu/~youngcho/csem


CONCLUSION & FUTURE WORK:

CONCLUSION:
In this project we have designed a pipelined RISC processor. The processor was designed from the ground up to implement some simple functions such as AND, OR, MOVE, STORE, ADD and SUBTRACT, and Verilog code was developed for it. Through simulation we evaluated and studied the performance of our architecture for the different instructions and found it satisfactory. Pipelining is a very useful concept for improving the speed of current-generation systems, and we are glad to have worked on such a relevant topic and brought this project to a conclusion.

FUTURE WORK:
A great deal of work can still be done on the miniaturisation and speed improvement of pipelined processors, and on overcoming the disadvantages of pipelined architectures. In future we intend to modify the above pipelined architecture to make it as simple as possible, since hardware complexity is one of the drawbacks of pipelined architectures.

THANK YOU

