Outline
• Performance Evolution
• The Task of a Computer Designer
• Technology and Computer Usage Trends
• Cost and Trends in Cost
• Measuring and Reporting Performance
• Quantitative Principles of Computer Design
Computer Architecture Is
• The attributes of a [computing] system as seen by the programmer, i.e., the conceptual
structure and functional behavior, as distinct from the organization of the data flows
and controls, the logic design, and the physical implementation. (Amdahl, Blaauw, and
Brooks, 1964)
Performance Evolution
• $1K today buys a gizmo better than $1M could buy in 1965.
• 1970s
– Mainframes dominated – performance improved 25-30%/yr
– Mostly due to improved architecture + some technology aids
• 1980s
– VLSI + microprocessor became the foundation
– Technology improves at 35%/yr
– Machine language death = opportunity
– Mostly with UNIX and C in mid-80’s
• Even most system programmers gave up assembly language
• With this came the need for efficient compilers
– Compiler focus brought on the great CISC vs. RISC debate
• With the exception of Intel – RISC won the argument
• RISC performance improved by 50%/year initially
• Of course RISC is not as simple anymore and the compiler is a key part of the
game
– It does not matter how fast your computer is if the compiler wastes most of it
through an inability to generate efficient code
– With the exploitation of instruction-level parallelism (pipeline + super-scalar) and
the use of caches, performance is further enhanced
Growth in Performance (Figure 1.1)
Functional Requirements
(layered diagram: User Application → Language Subsystems, Utilities, Compiler,
Operating System → Instruction Set Architecture → Hardware Organization: CPU,
Memory, I/O, Coprocessor → Implementation: VLSI, Logic, Power, Packaging, …
Our focus: the ISA and hardware-organization layers.)
The Task of a Computer Designer
(design-space diagram, top to bottom:
• Emerging technologies; DRAM interleaving; bus protocols
• Memory hierarchy - L2 cache, L1 cache, VLSI - with coherence, bandwidth, and latency issues
• Instruction Set Architecture - addressing, protection, exception handling
• Pipelining, hazard resolution, super-scalar, reordering, prediction, speculation, vector, DSP - i.e., pipelining and instruction-level parallelism
• Processor-memory-switch (P-M) topologies for multiprocessors; networks and interconnections - routing, bandwidth, latency, reliability)
Optimizing the Design
• Memory usage
– Average program needs grow by 50% to 100%/year
– Impact - add an address bit each year (Instruction set)
Technology Trends
• Integrated Circuits
– Density increases at 35%/yr.
– Die size increases 10%-20%/yr
– Combination is a chip complexity growth rate of 55%/yr
– Transistor speed increases similarly, but signal propagation does not track this
curve, so clock rates don't go up as fast
• Semiconductor DRAM
– Density quadruples every 3 years (approx. 60%/yr) [4x steps]
– Cycle time decreases slowly - 33% in 10 years
– Interface changes have improved bandwidth
• Magnetic Disk
– Currently density improves at 100%/yr
– Access time has improved by 33% in 10 years
• Network Technology
– Depends both on the performance of switches and transmission system
– 1 Gb Ethernet became available about 5 years after 100 Mb Ethernet
– Bandwidth doubles roughly every year
• Implications
– Pipelined design efforts using multiple design teams
– Have to design for a complexity target that can’t be implemented until the end of
the cycle (Design for the next technology)
– Can’t afford to miss the best technology so you have to chase the trends
Cost, Price, and Their Trends
Cost
• Clearly a marketplace issue -- profit as a function of volume
• Let’s focus on hardware costs
• Factors impacting cost
– Learning curve – manufacturing costs decrease over time
– Yield – the percentage of manufactured devices that survives the testing procedure
– Volume is also a key factor in determining cost
– Commodities are products that are sold by multiple vendors in large volumes and
are essentially identical.
Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side.

Dies per wafer = [π × (30/2)²] / 0.49 − [π × 30] / (2 × 0.49)^0.5 ≈ 1347
Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a
defect density of 0.6 per cm².

Ans: The total die areas are 1 cm² and 0.49 cm². Using the yield model
Die yield = (1 + (Defects per unit area × Die area) / α)^(−α) with α = 4.0:
– For the larger die: (1 + (0.6 × 1)/4.0)^(−4) ≈ 0.57
– For the smaller die: (1 + (0.6 × 0.49)/4.0)^(−4) ≈ 0.75
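Both calculations above can be checked with a short script (the edge-loss term and the α = 4.0 yield model follow the formulas in the two examples):

```python
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # Gross dies from wafer area, minus an edge-loss term proportional to the
    # wafer circumference divided by the die diagonal (formula used above).
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0):
    # Yield model from the example above, with alpha = 4.0.
    return (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

print(round(dies_per_wafer(30, 0.49)))   # ≈ 1347
print(round(die_yield(0.6, 1.0), 2))     # ≈ 0.57
print(round(die_yield(0.6, 0.49), 2))    # ≈ 0.75
```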
• The computer designer affects die size, and hence cost, both by what functions are
included on or excluded from the die and by the number of I/O pins
Cost/Price
• Component Costs
• Direct Costs (add 10% to 30%): costs directly related to making a product
– Labor, purchasing, scrap, warranty
• Gross Margin (add 10% to 45%): the company’s overhead that cannot be billed
directly to one project
– R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax
profits, taxes
Cost/Price Illustration
Measuring and Reporting Performance
Performance
OS Time
• BEWARE
– OS’s have a way of under-measuring themselves
• Synthetic benchmarks
– Try to match the average frequency of operations and operands of a large set of
programs
– No user really runs them -- not even pieces of real programs
– They typically reside in cache & don’t test memory performance
– At the very least you must understand what the benchmark code is in order to
understand what it might be measuring
– Companies thrive or bust on benchmark performance
• Hence they optimize for the benchmark
– BEWARE ALWAYS!!
Benchmark Suites
• Desktop benchmarks
– CPU-intensive: SPEC CPU2000
– Graphic-intensive: SPECviewperf
• Server benchmarks
– CPU throughput-oriented: SPECrate
– I/O activity: SPECSFS (NFS), SPECWeb
– Transaction processing: TPC (Transaction Processing Council)
• Embedded benchmarks
– EEMBC (EDN Embedded Microprocessor Benchmark Consortium)
Some PC Benchmarks
Benchmark Reporting
Other Problems
Weighted arithmetic mean = Σ_{i=1..n} (Weight_i × Time_i)

Geometric mean = ( Π_{i=1..n} ExecutionTimeRatio_i )^(1/n)
Amdahl’s Law
• Defines speedup gained from a particular feature
• Depends on 2 factors
– Fraction of original computation time that can take advantage of the enhancement -
e.g. the commonality of the feature
– Level of improvement gained by the feature
• Amdahl's law:
Speedup_overall = 1 / [ (1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
Simple Example
• Important Application:
– FPSQRT 20%
– FP instructions account for 50%
– Other 30%
• Designers say same cost to speedup:
– FPSQRT by 40x
– FP by 2x
– Other by 8x
• Which one should you invest in?
• Straightforward: plug in the numbers and compare, but what's your guess?
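Plugging the three options into Amdahl's law settles it (a quick sketch; the fractions and speedups are the ones listed above):

```python
def amdahl(fraction, speedup):
    # Overall speedup when `fraction` of the time is sped up by `speedup`.
    return 1 / ((1 - fraction) + fraction / speedup)

options = {
    "FPSQRT x40": amdahl(0.20, 40),   # ≈ 1.24
    "FP x2":      amdahl(0.50, 2),    # ≈ 1.33
    "Other x8":   amdahl(0.30, 8),    # ≈ 1.36
}
for name, s in options.items():
    print(f"{name}: {s:.2f}")
```

Perhaps surprisingly, speeding up "Other" by 8x wins: a modest speedup of a large fraction beats a huge speedup of a small one.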
CPU_Time = IC × CPI × Clock_cycle_time = (IC × CPI) / Clock_rate
• 3 Focus Factors -- Cycle Time, CPI, IC
– Sadly - they are interdependent and making one better often makes another worse
(but small or predictable impacts)
• Cycle time depends on HW technology and organization
• CPI depends on organization (pipeline, caching...) and ISA
• IC depends on ISA and compiler technology
• Often CPI’s are easier to deal with on a per instruction basis
CPU_clock_cycles = Σ_{i=1..n} (CPI_i × IC_i)

Overall_CPI = [ Σ_{i=1..n} (CPI_i × IC_i) ] / Instruction_count
            = Σ_{i=1..n} CPI_i × (IC_i / Instruction_count)
Simple Example
• Suppose we have made the following measurements:
– Frequency of FP operations (other than FPSQR) =25%
– Average CPI of FP operations=4.0
– Average CPI of other instructions=1.33
– Frequency of FPSQR=2%
– CPI of FPSQR=20
• Two design alternatives
– Reduce the CPI of FPSQR to 2
– Reduce the average CPI of all FP operations to 2
CPI_original = Σ_{i=1..n} CPI_i × (IC_i / Instruction_count)
             = (4 × 25%) + (1.33 × 75%) = 2.0

CPI_with_new_FPSQR = CPI_original − 2% × (CPI_old_FPSQR − CPI_new_FPSQR_only)
                   = 2.0 − 2% × (20 − 2) = 1.64
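The two design alternatives can be compared directly (a sketch using the measurements above; as in the original slide, the 2% FPSQR fraction is assumed to be included in the 25% FP average):

```python
# Baseline: FP ops are 25% at average CPI 4.0, everything else 75% at CPI 1.33;
# FPSQR is 2% of all instructions at CPI 20 (folded into the FP average).
cpi_original = 0.25 * 4.0 + 0.75 * 1.33          # ≈ 2.0

# Alternative 1: reduce the CPI of FPSQR from 20 to 2.
cpi_new_fpsqr = cpi_original - 0.02 * (20 - 2)   # ≈ 1.64

# Alternative 2: reduce the average CPI of all FP ops to 2.
cpi_new_fp = 0.75 * 1.33 + 0.25 * 2.0            # ≈ 1.50

print(cpi_original, cpi_new_fpsqr, cpi_new_fp)
```

Alternative 2 gives the lower overall CPI (≈1.50 vs. 1.64) and hence the larger speedup.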
• Add the contents of memory location 940 to the contents of memory location 941
and store the result at 941
A Note on Measurements
Instruction Characteristics
Immediate Operands
Distribution of Immediate Values
• Because DSPs deal with infinite, continuous streams of data, they routinely rely on
circular buffers
– Modulo or circular addressing mode
• Support data shuffling in Fast Fourier Transform (FFT)
– Bit-reverse addressing (e.g. 011 in binary maps to 110)
• However, the two fancy addressing modes are not used heavily
– Mismatch between what programmers and compilers actually use versus what
architects expect
• Arithmetic + Logical
– Integer arithmetic: ADD, SUB, MULT, DIV, SHIFT
– Logical operation: AND, OR, XOR, NOT
• Data Transfer - copy, load, store
• Control - branch, jump, call, return, trap
• System - OS and memory management
– We’ll ignore these for now - but remember they are needed
• Floating Point
– Same as arithmetic but usually take bigger operands
• Decimal - if you go for it what else do you need?
– legacy from COBOL and the commercial application domain
• String - move, compare, search
• Graphics – pixel and vertex, compression/decompression operations
Top 10 Instructions for 80x86
• The most widely executed instructions are the simple operations of an instruction set
• The top-10 instructions for 80x86 account for 96% of instructions executed
• Make them fast, as they are the common case
• Call/Returns
– Integer: 19% FP: 8%
• Jump
– Integer: 6% FP: 10%
• Conditional Branch
– Integer: 75% FP: 82%
• Branch target addresses are known at compile time for unconditional and conditional
branches - hence they can be specified in the instruction
– As a register containing the target address
– As a PC-relative offset
• Consider word length addresses, registers, and instructions
– Full address desired? Then pick the register option.
• BUT - setup and effective address will take longer.
– If you can deal with smaller offset then PC relative works
• PC relative is also position independent - so simple linker duty
Branch Distances
Condition Testing Options
• Branch addressing should be able to reach about 100+ instructions either above or
below the branch
– This implies a PC-relative branch displacement of at least 8 bits
• Register-indirect and PC-relative addressing for jump instructions to support returns as
well as many other features of current systems
• addl3 r1, 737(r2), (r3): a 32-bit integer add instruction with 3 operands needs 6
bytes to represent it
– Opcode for addl3: 1 byte
– A VAX address specifier is 1 byte (4 bits: addressing mode, 4 bits: register)
• r1: 1 byte (register addressing mode + r1)
• 737(r2)
– 1 byte for address specifier (displacement addressing + r2)
– 2 bytes for displacement 737
• (r3): 1 byte for address specifier (register indirect + r3)
• Length of VAX instructions: 1-53 bytes
Optimization Types
• Important questions
– How are variables allocated and addressed?
– How many registers will be needed?
• We must look at 3 areas to allocate data
• Stack
– Local variable access in activation records, almost no push/pop
– Addressing is relative to the stack pointer
– Grown or shrunk on calls and returns
• Global data area - the easy one
– Constants and global static structures
– For arrays addressing may be indexed off head
• Heap
– Used for dynamic objects
– Access usually by pointers
– Data is typically not scalar
Register Allocation & Data
• Reasonably simple for stack objects
• Hard for global data due to aliasing opportunity
– Must be conservative
• Heap objects & pointers in general are even harder
– Computed pointers make it impossible to allocate the pointed-to data to a register
– Any structured data - string, array, etc. - is too big to keep in a register
• Since register allocation is a major optimization source
– The effect is clearly important
• The ISA should have at least 16 GPRs (not counting FP registers) to simplify register
allocation via graph coloring
• Orthogonality suggests all supported addressing modes apply to all instructions that
transfer data
• Simplicity – understand that less is more in ISA design
– Provide primitives instead of solutions
– Simplify trade-offs between alternatives
– Don’t bind constants at runtime
• Counterexample – Lack of compiler support for multimedia instructions
Instruction-Level Parallelism and Its Dynamic Exploitation
Outline
• Instruction-Level Parallelism: Concepts and Challenges
• Overcoming Data Hazards with Dynamic Scheduling
• Dynamic Scheduling: Examples and the Algorithm
• Reducing Branch Penalties with Dynamic Hardware Prediction
• High-Performance Instruction Delivery
• Taking Advantage of More ILP with Multiple Issue
• Hardware-Based Speculation
• Studies of the Limitations of ILP
• Limitations on ILP for Reliable Processors
Introduction
• Instruction-Level Parallelism (ILP): potential execution overlap among instructions
– Instructions are executed in parallel
– Pipeline supports a limited sense of ILP
• This chapter introduces techniques to increase the amount of parallelism exploited
among instructions
– How to reduce the impact of data and control hazards
– How to increase the ability of the processor to exploit parallelism
• Pipelined CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls +
WAW stalls + Control stalls
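This decomposition can be illustrated with a trivial model (the stall rates below are made-up numbers for a hypothetical machine, not measurements):

```python
# Pipelined CPI = ideal CPI plus per-instruction stall contributions.
def pipelined_cpi(ideal_cpi, stalls_per_instruction):
    # stalls_per_instruction maps each hazard class to stalls per instruction.
    return ideal_cpi + sum(stalls_per_instruction.values())

# Hypothetical machine: ideal CPI 1.0, with assumed stall rates.
stalls = {"structural": 0.05, "raw": 0.20, "war": 0.0, "waw": 0.0, "control": 0.15}
print(pipelined_cpi(1.0, stalls))  # 1.4
```

The chapter's techniques attack the individual terms: dynamic scheduling reduces the RAW/WAR/WAW terms, branch prediction the control term.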
ILP Methods
ILP within a Basic Block
Introduction
• If two instructions are independent, then
– They can execute (parallel) simultaneously in a pipeline without stall
• Assume no structural hazards
– Their execution orders can be swapped
• Dependent instructions must be executed in order, or partially overlapped in pipeline
• Why check dependences?
– Determine how much parallelism exists, and how that parallelism can be exploited
• Types of dependences -- Data, Name, Control dependence
ADD.D F4,F0,F2
L.D F0,-8(R1)
ADD.D F4,F0,F2
Control Dependence
• Since branches are conditional
– Some instructions will be executed and others will not
– Instructions before the branch don’t matter
– Only possibility is between a branch and instructions which follow it
• 2 obvious constraints to maintain control dependence
– Instructions controlled by the branch cannot be moved before the branch (since it
would then be uncontrolled)
– An instruction not controlled by the branch cannot be moved after the branch (since
it would then be controlled)
• Note
– Transitive control dependence is also a factor
– In simple pipelines - order is preserved anyway so no big deal
• What’s the big deal
– No data dependence so move something before the branch
– Trash the result if the branch goes the wrong way
– Note only works when result goes to a register which becomes dead (result never
used) if the wrong way is taken
• However 2 important side-effects affect correctness issues
– Exception behavior remains intact
• Sometimes this is relaxed but it probably should not be
– Branches effectively set up conditional data flow
• Data flow is definitely real so if we do the move then we better make sure it
does not change the data flow
• So it can be done but care must be taken
– Enter HW and SW speculation & conditional instructions
• IF R4 were unused (dead) after skipnext and DSUBU could not generate an exception,
we could move DSUBU before the branch, since the data flow cannot be affected
(pipeline diagram: IM → Issue → Reg → ALU → DM → Reg)
WAR & WAW hazards may arise with dynamic scheduling
Tomasulo’s Approach
Key Idea
• Pipelined or multiple function units (FU)
• Each FU has multiple reservation stations (RS)
• Issue to reservation stations is in-order (in-order issue)
• An RS starts execution whenever it has collected its source operands from the real
registers (RR) - hence out-of-order execution
• Reservation stations contain virtual registers (VR) that remove WAW and WAR
induced stalls
– RS fetches operands from RR and stores them into VR
– Since there can be more virtual registers than real registers, the technique can
even eliminate hazards arising from name dependences that could not be
eliminated by a compiler
• Register renaming is provided by reservation stations (RS) and instruction issue logic
– Each function unit has several reservation stations
– A RS fetches and buffers an operand as soon as it is available
• Eliminate the need to get the operand from a register
– Pending instructions designate the RS that will provide their input
– When successive writes to a register overlap in execution, only the last one is
actually used to update the register
• Avoid WAW
Instruction Steps
• Issue (note in-order due to queue structure)
– Get the next instruction from the instruction queue
– Issue if there is an empty RS or available buffer (loads, stores)
– If the operands are in registers send them to the reservation station
– Stall otherwise due to the structural hazard
• Execute (may be out of order)
– When all operands are available then execute
– If not, then monitor CDB to grab desired operand when it is produced
– Effectively deals with RAW hazards
• Write Result (also may be out of order)
– When result available write it to the CDB
– From CDB it will go to a waiting RS and to the registers and store buffer
– Note renaming model prevents WAW and WAR hazards as a side effect
Virtual Registers
• Tag field associated with data
• Tag field is a virtual register ID
• Corresponds to
– Reservation station and load buffer names
• Motivation due to the 360’s register weakness
– Had only 6 FP registers
– The 9 renamed virtual registers were a significant bonus
Tomasulo Structure
• Each Reservation Station
– Op - the operation
– Qj, Qk - RS that will produce the operand
• A value of 0 means the operand is already available or is not needed
– Vj, Vk - the value of the operands
• Only one of V or Q is valid for each operand
– Busy - RS and its corresponding functional unit are occupied
– A - information for the memory address calculation of a load or store
• Initially the immediate field; after address calculation, the effective address
• Register file and store buffers
– Qi – RS that produces the value to be stored in this register
• Load and store buffers each require a busy field
• Processor attempts to resolve the outcome of a branch early, thus preventing control
dependences from causing stalls
• BP Performance = f (accuracy, cost of misprediction)
• Branch History Table (BHT)
– Lower bits of PC address index table of 1-bit values
• No “precise” address check – just match the lower bits
– Says whether or not branch taken last time
BHT Prediction
N-bit Predictors
• Hypothesis: recently executed branches are correlated; that is, behavior of recently
executed branches affects prediction of current branch
• Idea: record m most recently executed branches as taken or not taken, and use that
pattern to select the proper branch history table
• In general, an (m,n) predictor records the last m branches to select between 2^m
history tables, each with n-bit counters
– The old 2-bit BHT is then a (0,2) predictor
In general, an (m,n) BHT:
• uses p bits of the branch address to index the buffer
• uses the last m branches as the global branch history
• uses an n-bit predictor in each entry

Example: branch b1 tests "d==0?" and branch b2 tests "d==1?"

initial value of d | d==0? | b1 action | d before b2 | d==1? | b2 action
0                  | YES   | not taken | 1           | YES   | not taken
1                  | NO    | taken     | 1           | YES   | not taken
2                  | NO    | taken     | 2           | NO    | taken

With 1-bit predictors and d alternating 2, 0, 2, 0, …, every branch is mispredicted:

d | b1 prediction | b1 action | new b1 prediction | b2 prediction | b2 action | new b2 prediction
2 | NT            | T         | T                 | NT            | T         | T
0 | T             | NT        | NT                | T             | NT        | NT
2 | NT            | T         | T                 | NT            | T         | T
Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | NT                                  | NT
NT/T            | NT                                  | T
T/NT            | T                                   | NT
T/T             | T                                   | T
– 2m banks of memory selected by the global branch history (which is just a shift
register) - e.g. a column address
– Use p bits of the branch address to select row
– Get the n predictor bits in the entry to make the decision
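A minimal sketch of an (m,n) correlating predictor along these lines (the table size, indexing, and class shape are simplifying assumptions, not a real design):

```python
class CorrelatingPredictor:
    """(m, n) predictor: the last m branch outcomes select one of 2**m
    banks of n-bit saturating counters, indexed by low PC bits."""

    def __init__(self, m, n, index_bits):
        self.m = m
        self.max_count = (1 << n) - 1
        self.history = 0                      # global history shift register
        self.tables = [[0] * (1 << index_bits) for _ in range(1 << m)]
        self.mask = (1 << index_bits) - 1

    def predict(self, pc):
        counter = self.tables[self.history][pc & self.mask]
        return counter > self.max_count // 2  # taken if counter in upper half

    def update(self, pc, taken):
        row = self.tables[self.history]
        i = pc & self.mask
        row[i] = min(row[i] + 1, self.max_count) if taken else max(row[i] - 1, 0)
        # shift the actual outcome into the global history
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

# A (2,2) predictor quickly learns an always-taken branch:
p = CorrelatingPredictor(m=2, n=2, index_bits=4)
for _ in range(8):
    p.update(0x40, True)      # branch at PC 0x40 is always taken
print(p.predict(0x40))        # True
```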
(2,2) Predictor Implementation
Accuracy of Different Schemes
Tournament Predictors
• Adaptively combine local and global predictors
– Multiple predictors
• One based on global information: Results of recently executed m branches
• One based on local information: Results of past executions of the current branch
instruction
– Selector to choose which predictors to use
• 2-bit saturating counter, incremented whenever the “predicted” predictor is
correct and the other predictor is incorrect, and it is decremented in the reverse
situation
• Advantage
– Ability to select the right predictor for the right branch
• Alpha 21264 Branch Predictor
Unscheduled Loop
Clock Cycle Issued
Loop: L.D F0,0( R1 ) 1
stall 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4, 0(R1) 6
DADDUI R1,R1,#-8 7
stall 8
BNE R1, R2, Loop 9
stall 10
Problems So Far
• Look at the opcodes
– See if the pair is an appropriate issue pair
• Some integer operations are a problem
– FP register loads/stores - since the other instruction may be dependent
• A stall will result - options?
– Force FP loads, stores or moves to issue by themselves
• Safe but suboptimal since the other instruction may still be independent
– OR add more ports to the FP register file
• Such as separate read and write ports
• Still must stall the 2nd instruction if it is dependent
Other Issues
• Hazard detection
– Similar to the normal pipeline model, but need large set of bypass path (twice as
many instructions in the pipeline)
• Load use delay
– The assumed 1-cycle load-use delay now covers 3 instruction slots
• Branch delay
– Have branches to be issued by themselves?
– The 1 instruction branch delay now holds 3 instructions as well
• Instruction scheduling by compiler
– Mandatory for issuing independent operations in SS
– Increasingly important as issue width goes up
Example
• Can issue two arbitrary operations per clock
• One integer FU for ALU operation and EA-calculation
• A separate pipelined FP FU
• One memory unit, 2CDB
• no delayed branch with perfect branch prediction
– Fetch and issue as if the branch predictions are always correct
• Latency between a source instruction and an instruction consuming the result –
presence of Write Result stage
– 1 CC for integer ALU operations
– 2 CC for loads
– 3 CC for FP add
Note
• The WR stage does not apply to either stores or branches
• For L.D and S.D, the execution cycle is EA calculation
• For branches, the execution cycle shows when the branch condition can be evaluated
and the prediction checked
• Any instruction following a branch cannot start execution until after the branch
condition has been evaluated
• If two instructions could use the same FU at the same point (structural hazard), priority
is given to the older instruction
Execution Timing
Example Result
• Result
– IPC issued = 5/3 = 1.67; Instruction execution rate = 15/16 = 0.94
• Only one load, store, and Integer ALU operation can execute
– Load of the next iteration performs its memory address before the store of the
current iteration
– A single CDB is actually required
– Integer operations become the bottleneck
• Many integer operations, but only one integer ALU
– One stall cycle each loop iteration due to a branch hazard
Note
• Result
– IPC issued = 5/3 = 1.67; Instruction execution rate = 15/11 = 1.36
– A second CDB is needed
– This example has a higher instruction execution rate but lower efficiency as
measured by the utilization of FU
• Increased HW cost
– Increased ports for register files
– Cost of scoreboarding (e.g. Tomasulo data structure) and forwarding paths
– Memory bandwidth requirement goes up
• Most have gone with separate I and D ports already
• Newest approaches are to go for multiple D ports as well - big time expense!!
(PA- 8000)
– Branch prediction by HW is an absolute must – HW Speculation (Sect. 3.7)
3.7 Hardware-Based Speculation
Overview
• Overcome control dependence by speculating on the outcome of branches and
executing the program as if our guesses were correct
– Fetch, issue, and execute instructions
– Need mechanisms to handle the situation when the speculation is incorrect
• Dynamic scheduling: only fetch and issue such instructions
Key Ideas
• Dynamic branch prediction to choose which instructions to execute
• Speculation to allow the speculated blocks to execute before the control
dependences are resolved
– And undo the effects of an incorrectly speculated sequence
• Dynamic scheduling to deal with the scheduling of different combinations of basic
blocks (Tomasulo style approach)
HW Speculation Approach
• Issue → execute → write result → commit
– Commit is the point where the operation is no longer speculative
• Allow out of order execution
– Require in-order commit
– Prevent speculative instructions from performing destructive state changes (e.g.
memory write or register write)
• Collect pre-commit instructions in a reorder buffer (ROB)
– Holds completed but not committed instructions
– Effectively contains a set of virtual registers to store the result of speculative
instructions until they are no longer speculative
• Similar to a reservation station, and becomes a bypass source
The Speculative MIPS
ROB Fields
• Instruction type – branch, store, register operations
• Destination field
– Unused for branches
– Memory address for stores
– Register number for load and ALU operations (register operations)
• Value – hold the value of the instruction result until commit
• Ready – indicate if the instruction has completed execution
Example Result
• Tomasulo without speculation
– SUB.D and ADD.D have completed (clock cycle 16, slide 58)
• Tomasulo with speculation
– No instruction after the earliest uncompleted instruction (MUL.D) is allowed to
complete
– In-order commit
Loop Example
Loop: L.D F0, 0(R1)
MUL.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1,R1, #-8
BNE R1, R2, Loop
• Assume we have issued all the instructions in the loop twice
• Assume L.D and MUL.D from the first iteration have committed and all others have
completed execution
Other Issues
• Performance is more sensitive to branch-prediction
– Impact of a mis-prediction will be higher
– Prediction accuracy, mis-prediction detection, and mis-prediction recovery increase
in importance
• Precise exception
– Handled by not recognizing the exception until it is ready to commit
– If a speculation instruction raises an exception, the exception is recorded in ROB
• Exceptions on mispredicted branch paths are flushed as well
• If the instruction reaches the ROB head take the exception
Example Result
• Without speculation
– L.D following BNE cannot start execution early; it must wait until the branch
outcome is determined
– Completion rate is falling behind the issue rate rapidly, stall when a few more
iterations are issued
• With speculation
– L.D following BNE can start execution early because it is speculative
ILP Studies
• Perfect Hardware model - in the ideal infinite cost case
– Rename as much as you need
• Implies infinite virtual registers
• Hence - complete WAW or WAR insensitivity
– Branch prediction is perfect
• This will never happen in reality of course
– Jump prediction (even computed such as return) are also perfect
• Similarly unreal
– Perfect memory disambiguation
• Almost perfect is not too hard in practice
– Can issue an unlimited # of instructions at once, with no restriction on the types
of instructions issued or the # of FUs
– One-cycle latency
How to Measure
• A set of programs were compiled and optimized with the standard MIPS optimizing
compilers
• Execute and produce a trace of the instruction and data references
– Perfect branch prediction and perfect alias analysis are easy to do
• Every instruction in the trace is then scheduled as early as possible, limited only by the
data dependence
– Including moving across branches
What Must a Perfect Processor Do?
• Look arbitrarily far ahead to find a set of instructions to issue, predicting all
branches perfectly
• Rename all register uses to avoid WAW and WAR hazards
• Determine whether there are any dependences among the instructions in the issue
packet; if so, rename accordingly
• Determine if any memory dependences exist among the issuing instructions and
handle them appropriately
• Provide enough replicated FUs to allow all the ready instructions to issue
Note
• We can’t possibly cover everything in class
– Without skipping important fundamental concepts
• So read the Intel P6 (3.10) and IA-64 (4.7) section
– Every machine has some goals and a lot of balance points
– Instructive to follow the list of possibilities we have discussed with a real case
study
– If there’s another machine you are interested in then go read about it
• Analyze as much as you can
– Try to understand why a particular decision was made
– From normal literature the real reasons are impossible to find
– But you know a lot now, and you'll mostly get it right if you do battle with the
problem in a serious way
4. Exploiting ILP with Software Approaches
Unscheduled Loop
Clock Cycle Issued
Loop: L.D F0,0( R1 ) 1
stall 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4, 0(R1) 6
DADDUI R1,R1,#-8 7
stall 8
BNE R1, R2, Loop 9
stall 10
Scheduled Loop (unrolled four times, 14 instructions)
Loop: L.D F0, 0(R1)
L.D F6, -8(R1)
L.D F10, -16(R1)
L.D F14, -24(R1)
ADD.D F4, F0, F2
ADD.D F8, F6, F2
ADD.D F12, F10, F2
ADD.D F16, F14, F2
S.D F4, 0(R1)
S.D F8, -8(R1)
DADDUI R1, R1, #-32
S.D F12, 16(R1)
BNE R1, R2, LOOP
S.D F16, 8(R1)
Limitation of Gains of Loop Unrolling
• Amount of loop overhead amortized with each unroll
– Unrolled 4 times: 2 out of 14 CC are overhead, i.e. 0.5 CC per iteration
– Unrolled 8 times: 0.25 CC per iteration
– What’s the theoretically optimal number of unrolling using the latencies shown in
slide 3?
• Growth in code size
– Large code size is not good for embedded computers
– Large code size may increase cache miss rate
• Potential shortfall in registers that is created by aggressive unrolling and scheduling
– Register pressure
Basic VLIW
Example 1
for (i=1; i <= 100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
Example 2
for (i=1; i <= 100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses a value computed by S2 in an earlier iteration (B[i+1])
– Loop-carried dependence
• Dependence is not circular
– Neither statement depends on itself, and although S1 depends on S2, S2 does not
depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– Absence of a cycle gives a partial ordering on the statements
• Transform the code in the previous slide to conform to the partial ordering and expose
the parallelism
• No longer loop-carried. Iterations of the loop may be overlapped, provided the
statements in each iteration are kept in order
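The standard transformation of Example 2 peels the first A-update and the last B-update out of the loop so no dependence crosses iterations (rendered here in Python for checkability; the function names are illustrative):

```python
def original(A, B, C, D):
    # Example 2 as written: S2's B[i+1] feeds S1 in the NEXT iteration
    # (a loop-carried, but non-circular, dependence).
    for i in range(1, 101):
        A[i] = A[i] + B[i]          # S1
        B[i + 1] = C[i] + D[i]      # S2
    return A, B

def transformed(A, B, C, D):
    # Peel the boundary statements; each iteration is now independent.
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]
    return A, B

n = 102
A0, B0, C0, D0 = (list(range(n)) for _ in range(4))
print(original(list(A0), list(B0), C0, D0) ==
      transformed(list(A0), list(B0), C0, D0))  # True
```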
• Most compilers require that optimizations that rely on associativity (e.g. tree-height
reduction) be explicitly enabled
– Integer/FP arithmetic (range and precision) may introduce rounding error
• Optimization related to recurrence
– Recurrence: an expression whose value on one iteration is given by a function that
depends on the previous iteration
– When a loop with a recurrence is unrolled, we may be able to algebraically
optimize the unrolled loop so that the recurrence need only be evaluated once per
unrolled iteration
• sum = sum + x
– Unrolled 5 times: sum = sum + x1 + x2 + x3 + x4 + x5 (5 dependent operations)
– Tree-height reduced: sum = ((sum + x1) + (x2 + x3)) + (x4 + x5) (3 dependent operations)
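The regrouping relies only on associativity, which holds exactly for integers (a sketch; for floating point the two forms may differ by rounding error, which is why compilers require the optimization to be explicitly enabled):

```python
def serial_sum(sum0, xs):
    # 5 dependent adds: each add must wait for the previous result.
    s = sum0
    for x in xs:
        s = s + x
    return s

def tree_sum(sum0, xs):
    # Same value, regrouped so independent adds can execute in parallel:
    # ((sum + x1) + (x2 + x3)) + (x4 + x5) -> only 3 levels of dependent adds.
    x1, x2, x3, x4, x5 = xs
    return ((sum0 + x1) + (x2 + x3)) + (x4 + x5)

xs = [3, 1, 4, 1, 5]
print(serial_sum(10, xs), tree_sum(10, xs))  # 24 24
```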
Factors in Moving B
• The compiler will consider the following factors
– Relative execution frequencies of THEN and ELSE
– Cost of executing the computing and assignment to B above branch
• Any empty instruction issue slots and stalls above branch?
– How will the movement of B change the execution time for THEN
– Is B the best code fragment that can be moved? How about C or others?
– The cost of the compensation code that may be necessary for ELSE
Trace Scheduling
• Useful for processors with a large number of issues per clock, where conditional or
predicted execution (Section 4.5) is inappropriate or unsupported, and where loop
unrolling may not be sufficient by itself to uncover enough ILP
• A way to organize the global code motion process, so as to simplify the code
scheduling by incurring the costs of possible code motion on the less frequent paths
• Best used where profile information indicates significant differences in frequency
between different paths and where the profile information is highly indicative of
program behavior independent of the input
• Parallelism across conditional branches other than loop branches
• Looking for the critical path across conditional branches
• Two steps:
– Trace Selection: Find likely sequence of multiple basic blocks (trace)
of long sequence of straight-line code
– Trace Compaction
• Squeeze trace into few VLIW instructions
• Move trace before the branch decision
• Need bookkeeping code in case prediction is wrong
• Compiler undoes bad guess (discards values in registers)
• Subtle compiler bugs mean wrong answer vs. poor performance; no hardware
interlocks
Superblocks
• Drawback of trace scheduling: the entries and exits into the middle of the trace cause
significant complications
– Compensation code is needed, and its cost is hard to assess
• Superblocks – similar to trace, but
– Single entry point but allow multiple exits
• In a loop that has a single loop exit based on a count, the resulting superblocks have
only one exit
• Use tail duplication to create a separate block corresponding to the portion of the trace
after the entry
Some Things to Notice
• Useful when the branch behavior is fairly predictable at compile time
• Not totally independent techniques
– All try to avoid dependence induced stalls
• Primary focus
– Unrolling: reduce loop overhead of index modification and branch
– SW pipelining: reduce single body dependence stalls
– Trace scheduling/superblocks: reduce impact of branch walls
• Most advanced compilers attempt all
– Result is a hybrid which blurs the differences
– Lots of special case analysis changes the hybrid mix
• All tend to fail if branch prediction is unreliable
4.5 Hardware Support for Exposing More Parallelism at Compile Time
Conditional or Predicated Instructions
• Most common form is move
• Other variants
– Conditional loads and stores
– ALPHA, MIPS, SPARC, PowerPC, and P6 all have simple conditional moves
– IA-64 supports full predication for all instructions
• The effect is to eliminate simple branches
– Moves the dependence resolution point from early in the pipe (branch resolution)
to late in the pipe (register write), so forwarding is more possible
– Also changes a control dependence into a data dependence
• Net win since in global scheduling the control dependence fence is the key limiting
complexity
Conditional Instruction in SuperScalar
• Without a conditional load (wastes a memory-operation slot in the 2nd cycle;
data-dependence stall if the branch is not taken):
First slot (Mem)        Second slot (ALU)
LW R1, 40(R2)           ADD R3, R4, R5
                        ADD R6, R3, R7
BEQZ R10, L
LW R8, 20(R10)
LW R9, 0(R8)
• With a conditional load (LWC):
First slot (Mem)        Second slot (ALU)
LW R1, 40(R2)           ADD R3, R4, R5
LWC R8, 20(R10), R10    ADD R6, R3, R7
BEQZ R10, L
LW R9, 0(R8)
• Note:
– Requires maintaining a basic block for the THEN case
– Checking for a possible exception requires extra code
Poison Bits
• Track exceptions as they occur but postpone any terminating exception until a value
is actually used.
• Incorrect programs that caused termination without speculation will still cause
exceptions when instructions are speculated.
• Poison bit for every register, plus a bit on each instruction to mark it as a
speculative instruction (SI)
• The poison bit of a destination register is set when SI results in a terminating exception.
– All other exceptions are handled immediately
• If a SI uses a poisoned register, its destination register is also poisoned
• Fault if a regular instruction tries to use a poisoned register
Poison-Bit Example
LD R1, 0(R3) ; load A
sLD R14, 0(R2) ; speculative load B; if it excepts, R14 is poisoned
BEQZ R1, L3 ; other branch
DADDI R14, R1, #4 ; else
L3: SD R14, 0(R3) ; store A; if R14 is poisoned, SD faults
• If sLD generates a terminating exception, the poison bit of R14 will be turned on.
When SD occurs, it will raise an exception if the poison bit for R14 is on.
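The mechanism can be mimicked with a tiny register-file model (the names and structure are illustrative, not any real ISA):

```python
class PoisonError(Exception):
    """Raised when a regular instruction reads a poisoned register."""

class RegFile:
    # Each register carries a value and a poison bit.
    def __init__(self):
        self.value = {}
        self.poison = {}

    def spec_load(self, reg, value, faulted):
        # Speculative load: on a terminating exception, poison the
        # destination instead of trapping immediately.
        self.value[reg] = value
        self.poison[reg] = faulted

    def use(self, reg):
        # A regular instruction (e.g. the SD) faults on a poisoned register.
        if self.poison.get(reg, False):
            raise PoisonError(f"use of poisoned register {reg}")
        return self.value[reg]

rf = RegFile()
rf.spec_load("R14", 0, faulted=True)   # the sLD that took an exception
try:
    rf.use("R14")                      # the SD finally raises it
except PoisonError as e:
    print("fault:", e)
```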
Boosting
• Boosting
– How to deal with exception? Similar to Poison Bits?
– Reduce # of registers used
– Provide separate shadow resources for boosted instruction results
– If condition resolves selecting the boosted path
• Then these results are committed to the real registers
LD R1, 0(R3) ; load A
LD+ R1, 0(R2) ; Boosted load B. Result is never written to
; R1 if branch is not taken
BEQZ R1, L3 ; other bran.
DADDI R1, R1, #4 ; else
L3: SD 0(R3), R1 ; store A.