
Fundamentals of Computer Design

Outline

• Performance Evolution
• The Task of a Computer Designer
• Technology and Computer Usage Trends
• Cost and Trends in Cost
• Measuring and Reporting Performance
• Quantitative Principles of Computer Design

Computer Architecture Is

• The attributes of a [computing] system as seen by the programmer, i.e., the conceptual
structure and functional behavior, as distinct from the organization of the data flows
and controls, the logic design, and the physical implementation. (Amdahl, Blaauw, and
Brooks, 1964)

Computer Architecture’s Changing Definition

• 1950s to 1960s: Computer Architecture Course


– Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course
– Instruction Set Design, especially ISA appropriate for compilers
• 1990s to 2000s: Computer Architecture Course
– Design of CPU, memory system, I/O system, Multiprocessors

Performance Evolution

• $1K today buys a gizmo better than $1M could buy in 1965.
• 1970s
– Mainframes dominated – performance improved 25-30%/yr
– Mostly due to improved architecture + some technology aids

• 1980s
– VLSI + microprocessor became the foundation
– Technology improves at 35%/yr
– Machine language death = opportunity
– Mostly with UNIX and C in mid-80’s
• Even most system programmers gave up assembly language
• With this came the need for efficient compilers
– Compiler focus brought on the great CISC vs. RISC debate
• With the exception of Intel – RISC won the argument
• RISC performance improved by 50%/year initially
• Of course RISC is not as simple anymore and the compiler is a key part of the
game
– It does not matter how fast your computer is if the compiler wastes that speed by
generating inefficient code
– With the exploitation of instruction-level parallelism (pipeline + super-scalar) and
the use of caches, performance is further enhanced
Growth in Performance (Figure 1.1)

The Task of A Computer Designer

Aspects of Computer Design

• Changing face of computing – different system design issues


– Desktop computing
– Servers
– Embedded computers
• Bottom line is that it is a complex game
– Determine important attributes (perhaps a market issue)
• Functional Requirement
– THEN maximize performance
– WHILE staying within the cost and power constraints
• Classic conflicting constraint problem
A Summary of the Three Computing Classes and Their System Characteristics

Functional Requirements

(Layered view from the figure:)
– Functional requirements: user applications, language subsystems and utilities, compiler, operating system
– Instruction Set Architecture (our focus)
– Hardware organization (our focus): CPU, memory, I/O, coprocessor architecture
– Implementation: VLSI, logic, power, packaging, …
Task of a Computer Designer

(The design space from the figure:)
– Input/output and storage: disks, WORM, tape, RAID; emerging technologies; DRAM interleaving; bus protocols
– Memory hierarchy: L1 cache, L2 cache, VLSI; coherence, bandwidth, latency, addressing, protection, exception handling
– Instruction set architecture: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP (pipelining and instruction-level parallelism)
– Multiprocessors, networks, and interconnections: processor-memory-switch topologies, interconnection networks, network interfaces, routing, bandwidth, latency, reliability
Optimizing the Design

• Usually the functional requirements are set by the company/marketplace


• Which design is optimal depends on the choice of metric
– Cost minimized → simple design
– Performance maximized → complex design or better technology
– Time to market minimized → also favors simplicity
• Oh – and you only get one shot
– Requires heaps of simulation and must quantify everything
– Inherent requirements for deep infrastructure and support
• Plus you must predict the trends…

Key trends that must always be tracked

• Usage patterns and the market


• Technology
• Cost and performance

Technology and Computer Usage Trends


Usage Trends

• Memory usage
– Average program memory needs grow by 50% to 100%/year
– Impact: add an address bit each year (an instruction-set issue)

• Assembly language replaced by HLL


– Increasingly important role of compilers
– Compiler and architecture types MUST now work together

• Whacked on pictures - even TV


– Graphics and multimedia capability
• Whacked on communications
– I/O subsystems become a higher priority

Technology Trends

• Integrated Circuits
– Density increases at 35%/yr.
– Die size increases 10%-20%/yr
– Combination is a chip complexity growth rate of 55%/yr
– Transistor speed increase is similar but signal propagation does not track this curve
- so clock rates don’t go up as fast

• Semiconductor DRAM
– Density quadruples every 3 years (approx. 60%/yr) [4x steps]
– Cycle time decreases slowly - 33% in 10 years
– Interface changes have improved bandwidth

• Magnetic Disk
– Currently density improves at 100%/yr
– Access time has improved by 33% in 10 years

• Network Technology
– Depends both on the performance of switches and transmission system
– 1 Gbit Ethernet became available about 5 years after 100 Mbit Ethernet
– Bandwidth doubles every year

• Scaling of transistor performance, wires, and power in ICs

Effects of Rapid Technology Trends

• Consider today’s product cycle


– concept to production = 2 years

• AND market requirement


– of something new every 6 - 12 months

• Implications
– Pipelined design efforts using multiple design teams
– Have to design for a complexity target that can’t be implemented until the end of
the cycle (Design for the next technology)
– Can’t afford to miss the best technology so you have to chase the trends
Cost, Price, and Their Trends

Cost
• Clearly a market place issue -- profit as a function of volume
• Let’s focus on hardware costs
• Factors impacting cost
– Learning curve – manufacturing costs decrease over time
– Yield – the percentage of manufactured devices that survive the testing procedure
– Volume is also a key factor in determining cost
– Commodities are products sold by multiple vendors in large volumes that are
essentially identical.

Integrated Circuits Costs

• The cost of a packaged integrated circuit is

  Cost of IC = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield

  Cost of die = Cost of wafer / (Dies per wafer × Die yield)

  Dies per wafer = [π × (Wafer diameter / 2)^2] / Die area − [π × Wafer diameter] / sqrt(2 × Die area)

Cost of an Integrated Circuit
• The fraction or percentage of good dies on a wafer (die yield):

  Die yield = Wafer yield × {1 + (Defects per unit area × Die area) / α}^(−α)

where α is a parameter that corresponds roughly to the number of masking levels, a
measure of manufacturing complexity, critical to die yield (α = 4.0 is a good
estimate).

Example: Finding the number of dies

Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side.

Ans: The die area is 0.49 cm². Thus

  Dies per wafer = [π × (30/2)^2] / 0.49 − [π × 30] / sqrt(2 × 0.49) ≈ 1347

Example: Finding the die yield

Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a
defect density of 0.6 per cm2.

Ans: The total die areas are 1 cm² and 0.49 cm².

For the larger die, the yield is

  Die yield = {1 + (0.6 × 1) / 4}^(−4) = 0.57

For the smaller die, it is

  Die yield = {1 + (0.6 × 0.49) / 4}^(−4) = 0.75
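
These formulas are easy to check in code; a minimal Python sketch (function and variable names are mine, not from the text) reproduces both examples:

  import math

  def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
      # pi * (d/2)^2 / area  -  pi * d / sqrt(2 * area)
      whole = math.pi * (wafer_diameter_cm / 2) ** 2 / die_area_cm2
      edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
      return int(whole - edge_loss)

  def die_yield(defects_per_cm2, die_area_cm2, alpha=4.0, wafer_yield=1.0):
      return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

  print(dies_per_wafer(30, 0.49))          # 1347
  print(round(die_yield(0.6, 1.0), 2))     # 0.57
  print(round(die_yield(0.6, 0.49), 2))    # 0.75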

Computer Designers and Chip Costs

• The computer designer affects die size, and hence cost, both by what functions are
included on or excluded from the die and by the number of I/O pins

Cost/Price
• Component Costs
• Direct Costs (add 10% to 30%): costs directly related to making a product
– Labor, purchasing, scrap, warranty
• Gross Margin (add 10% to 45%): the company’s overhead that cannot be billed
directly to one product
– R&D, marketing, sales, equipment maintenance, rental, financing cost, pretax
profits, taxes

• Average Discount to get List Price (add 33% to 66%)


– Volume discounts and/or retailer markup

Cost/Price Illustration

Cost/Price for Different Kinds of Systems

(Stacked-bar chart from 0% to 100% showing, for minicomputer, workstation, and PC:
component costs, direct costs, gross margin, and average discount.)

Measuring and Reporting Performance

Performance

• 2 key aspects – making 1 faster may slow the other


– Execution time (single task)
– Throughput (multiple tasks)
• Comparing performance

– Key measurement is execution time of real programs
• MIPS? MFLOPS?
– Performance = 1 / execution time
– If X is N times faster than Y: N = ExecutionTime_Y / ExecutionTime_X
– Similar for throughput comparisons
– Improved performance → decreasing execution time

• Several kinds of time

– Wall-clock time: response time, or elapsed time


• Load, I/O delays, OS overhead
– CPU time - time spent computing your program
• Factors out time spent waiting for I/O delays
• But includes the OS + your program
• Hence system CPU time, and user CPU time

OS Time

• Unix time command reports
– User CPU time
– System CPU time
– Total elapsed time
– % of elapsed time that is user + system CPU time
• Tells you how much time you spent waiting, as a %
– Example output: 90.7u 12.9s 2:39 65%, i.e. (90.7 + 12.9) / 159 s ≈ 65%

• BEWARE
– OS’s have a way of under-measuring themselves

Choosing Programs to Evaluate Performance

• Real applications – clearly the right choice


– Porting and eliminating system-dependent activities
– User burden -- to know which of your programs you really care about

• Modified (or scripted) applications


– Enhance portability or focus on particular aspects of system performance
• Kernels – small, key pieces of real programs
– Best used to isolate performance of individual features and to explain the reasons
for differences in performance of real programs
– Livermore Loops and Linpack are examples
– Not real programs however -- no user really uses them

• Toy benchmarks – quicksort, puzzle


– Beginning programming assignment

• Synthetic benchmarks
– Try to match the average frequency of operations and operands of a large set of
programs
– No user really runs them -- not even pieces of real programs
– They typically reside in cache & don’t test memory performance
– At the very least you must understand what the benchmark code is in order to
understand what it might be measuring
– Companies thrive or bust on benchmark performance
• Hence they optimize for the benchmark
– BEWARE ALWAYS!!

Benchmark Suites

• SPEC (Standard Performance Evaluation Corporation)


– http://www.spec.org

• Desktop benchmarks
– CPU-intensive: SPEC CPU2000
– Graphic-intensive: SPECviewperf

• Server benchmarks
– CPU throughput-oriented: SPECrate
– I/O activity: SPECSFS (NFS), SPECWeb
– Transaction processing: TPC (Transaction Processing Council)

• Embedded benchmarks
– EEMBC (EDN Embedded Microprocessor Benchmark Consortium)
Some PC Benchmarks

SPEC CPU2000 Benchmark Suites – Integer


SPEC CPU2000 Benchmark Suites – Floating Point

Reporting Performance Results

• Claim Spice takes X seconds on machine Y


• Missing:
– Spice version? What was the input circuit?
– Operational parameters - time step, duration
– Compiler and version & optimization settings
– Machine configuration - disk, memory, etc.
– Source code modification or hand-generated assembly language
• Reproducibility is a must
– List everything another experimenter would need to duplicate the results

Benchmark Reporting

Other Problems

• Let’s assume we can get the test jig specified properly


• See the following example
– Which is better?
– By how much?
– Are the programs equally important?

Some Aggregate Job Mix Options


• Arithmetic Mean - provides a simple average
– Does not account for weight - all programs treated equal

  (1/n) × Σ (i = 1 to n) Time_i

• Weighted arithmetic mean
– Weight is the frequency % of use
– Better, but beware the dominant program time

  Σ (i = 1 to n) Weight_i × Time_i

– Depends on the reference machine

Normalized Time Metrics


• Geometric Mean

  ( Π (i = 1 to n) ExecutionTimeRatio_i )^(1/n)

• Has the nice property that:


– Ratio of the means = Mean of the ratios
– Consistent no matter which machine is the reference
• Better than arithmetic means but
– Don’t form accurate prediction models – don’t predict execution time
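
A small Python sketch of the three aggregates (the times, weights, and reference
times below are made-up illustrative numbers, not from the text):

  import math

  times = [10.0, 50.0, 200.0]        # hypothetical execution times (s)
  weights = [0.5, 0.3, 0.2]          # hypothetical frequency of use
  ref_times = [20.0, 40.0, 100.0]    # hypothetical reference-machine times

  arith_mean = sum(times) / len(times)
  weighted_mean = sum(w * t for w, t in zip(weights, times))
  ratios = [t / r for t, r in zip(times, ref_times)]   # normalized times
  geo_mean = math.prod(ratios) ** (1 / len(ratios))    # nth root of the product

  print(arith_mean, weighted_mean, round(geo_mean, 3))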
Quantitative Principles of Computer Design

Make the Common Case Fast


• Most pervasive principle of design
• Need to validate that it is common or uncommon
• Often
– Common cases are simpler than uncommon cases
– e.g. exceptions like overflow, interrupts, ...
– Truly simple is usually both cheap and fast - best of both worlds
• Trick is to quantify the advantage of a proposed enhancement

Amdahl’s Law
• Defines speedup gained from a particular feature

• Depends on 2 factors
– Fraction of original computation time that can take advantage of the enhancement -
e.g. the commonality of the feature
– Level of improvement gained by the feature
• Amdahl’s law

  Speedup_overall = Execution_time_old / Execution_time_new
                  = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Simple Example

• Important Application:
– FPSQRT 20%
– FP instructions account for 50%
– Other 30%
• Designers say same cost to speedup:
– FPSQRT by 40x
– FP by 2x
– Other by 8x
• Which one should you invest in?
• Straightforward plug in the numbers & compare BUT what’s your guess??

And the Winner Is…?
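
Plugging the slide’s numbers into Amdahl’s law (a straightforward check, so you can
verify your guess):

  Speedup_FPSQRT = 1 / (0.8 + 0.2/40) = 1 / 0.805 ≈ 1.24
  Speedup_FP     = 1 / (0.5 + 0.5/2)  = 1 / 0.75  ≈ 1.33
  Speedup_other  = 1 / (0.7 + 0.3/8)  = 1 / 0.7375 ≈ 1.36

So speeding up “Other” wins despite its far smaller per-feature speedup: the fraction
of time affected matters more than the size of the improvement.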


Calculating CPU Performance
• All commercial machines are synchronous
– Implies there is a clock which ticks once per cycle
– Hence there are 2 useful basic metrics
• Clock Rate - today in MHz
• Clock cycle time
• Clock cycle time = 1/clock rate
• E.g. 250 MHz rate corresponds to a 4 ns. cycle time
  CPU_Time = CPU_clock_cycles_for_a_program × Clock_cycle_time

OR

  CPU_Time = CPU_clock_cycles_for_a_program / Clock_rate

• We tend to count instructions executed = IC


– Note looking at the object code is just a start
– What we care about is the dynamic count - e.g. don’t forget loops, recursion,
branches, etc.
• CPI (Clock cycles Per Instruction) is a figure of merit

  CPI = CPU_clock_cycles_for_a_program / IC

  CPU_Time = IC × CPI × Clock_cycle_time = (IC × CPI) / Clock_rate
• 3 Focus Factors -- Cycle Time, CPI, IC
– Sadly - they are interdependent and making one better often makes another worse
(but small or predictable impacts)
• Cycle time depends on HW technology and organization
• CPI depends on organization (pipeline, caching...) and ISA
• IC depends on ISA and compiler technology
• Often CPI’s are easier to deal with on a per instruction basis

  CPU_clock_cycles = Σ (i = 1 to n) CPI_i × IC_i

  Overall_CPI = [Σ (i = 1 to n) CPI_i × IC_i] / Instruction_count
              = Σ (i = 1 to n) CPI_i × (IC_i / Instruction_count)
Simple Example
• Suppose we have made the following measurements:
– Frequency of FP operations (other than FPSQR) =25%
– Average CPI of FP operations=4.0
– Average CPI of other instructions=1.33
– Frequency of FPSQR=2%
– CPI of FPSQR=20
• Two design alternatives
– Reduce the CPI of FPSQR to 2
– Reduce the average CPI of all FP operations to 2

  CPI_original = Σ (i = 1 to n) CPI_i × (IC_i / Instruction_count)
               = (4 × 25%) + (1.33 × 75%) = 2.0

  CPI_with_new_FPSQR = CPI_original − 2% × (CPI_old_FPSQR − CPI_new_FPSQR_only)
                     = 2.0 − 2% × (20 − 2) = 1.64

  CPI_new_FP = (75% × 1.33) + (25% × 2.0) = 1.5
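
A quick Python check of these numbers (variable names are mine) that also gives the
resulting speedups:

  freq_fp, cpi_fp = 0.25, 4.0          # FP operations other than FPSQR
  freq_other, cpi_other = 0.75, 1.33   # all other instructions
  freq_fpsqr, cpi_fpsqr = 0.02, 20.0

  cpi_original = freq_fp * cpi_fp + freq_other * cpi_other       # ~2.0
  cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)  # 1.64
  cpi_new_fp = freq_other * cpi_other + freq_fp * 2.0            # ~1.5

  # Speedup = CPI_original / CPI_new (same IC and clock rate)
  print(cpi_original / cpi_new_fpsqr)  # ~1.22
  print(cpi_original / cpi_new_fp)     # ~1.33: faster FP overall wins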


Instruction Set Principles and Examples
Outline
• Introduction
• Classifying instruction set architectures
• Instruction set measurements
– Memory addressing
– Addressing modes for signal processing
– Type and size of operands
– Operations in the instruction set
– Operations for media and signal processing
– Instructions for control flow
– Encoding an instruction set
• Role of compilers
• MIPS architecture

Brief Introduction to ISA

• Instruction Set Architecture: a set of instructions


– Each instruction is directly executed by the CPU’s hardware
• How is it represented?
– By a binary format since the hardware understands only bits
• Concatenate together binary encoding for instructions, registers, constants,
memories
• Typical physical blobs are bits, bytes, words, n-words
• Word size is typically 16, 32, 64 bits today
• Options - fixed or variable length formats
– Fixed - each instruction encoded in same size field (typically 1 word)
– Variable – half-word, whole-word, multiple word instructions are possible

Example of Program Execution


• Command
– 1: Load AC from Memory
– 2: Store AC to memory
– 5: Add to AC from memory

• Add the contents of memory location 940 to the contents of memory location 941 and
store the result at 941

A Note on Measurements

• We’re taking the quantitative approach


• BUT measurements will vary:
– Due to application selection or application mix
– Due to the particular compiler being used
– Also dependent on compiler optimization selection
– And the target ISA
• Hence the measurements we’ll talk about
– Are useful to understand the method
– Are a typical yet small sample derived from benchmark codes
• To do it for real
– You would want lots of real applications
– Plus - your compiler and ISA

Classifying Instruction Set Architecture

Instruction Set Design

Instruction Characteristics

• Usually a simple operation


– Which operation is identified by the op-code field
• But operations require operands - 0, 1, or 2
– To identify where they are, they must be addressed
• Address is to some piece of storage
• Typical storage possibilities are main memory, registers, or a stack
• 2 options: explicit or implicit addressing
– Implicit - the op-code implies the address of the operands
• ADD on a stack machine - pops the top 2 elements of the stack, then pushes the
result
• HP calculators work this way
– Explicit - the address is specified in some field of the instruction
• Note the potential for 3 addresses - 2 operands + the destination

Classifying Instruction Set Architectures

Operand Locations for Four ISA Classes


C = A + B

• Stack
– Push A
– Push B
– Add
• Pop the top 2 values of the stack (A, B) and push the result onto the stack
– Pop C
• Accumulator (AC)
– Load A
– Add B
• Add AC (holding A) to B and store the result into AC
– Store C
• Register (register-memory)
– Load R1, A
– Add R3, R1, B
– Store R3, C
• Register (load-store)
– Load R1, A
– Load R2, B
– Add R3, R1, R2
– Store R3, C

Pro’s and Con’s of Stack, Accumulator, Register Machine

Modern Choice – Load-store Register (GPR) Architecture

• Reasons for choosing GPR (general-purpose registers) architecture


– Registers (stacks and accumulators…) are faster than memory
– Registers are easier and more effective for a compiler to use
• (A+B) – (C*D) – (E*F)
– May be evaluated in any order (for pipelining concerns or …)
» But on a stack machine it must be evaluated left to right
– Registers can be used to hold variables
• Reduce memory traffic
• Speed up programs
• Improve code density (fewer bits are used to name a register)
• Compiler writers prefer that all registers be equivalent and unreserved
– The number of GPR: at least 16
Characteristics Divide GPR Architectures
• # of operands
– Three-operand: 1 result and 2 source operands
– Two-operand: 1 combined source/result and 1 source
• How many operands are memory addresses: 0 to 3 (2 sources + 1 result)

Pro’s and Con’s of Three Most Common GPR Computers

Short Summary – Classifying Instruction Set Architectures


• Expect the use of general-purpose registers
• Figure 2.4 + pipelining (Appendix A)
– Expect the use of Register-Register (load-store) GPR architecture
Memory Addressing

Memory Addressing Basics


• What is accessed - byte, word, multiple words?
– Today’s machine are byte addressable
– Main memory is organized in 32 - 64 byte lines
– Big-Endian or Little-Endian addressing
• Hence there is a natural alignment problem
– An object of size s bytes at byte address A is aligned if A mod s = 0
(e.g., a 4-byte word at address 8 is aligned; at address 6 it is not)
– A misaligned access takes multiple aligned memory references
• Memory addressing mode influences instruction counts (IC) and clock cycles per
instruction (CPI)

Typical Address Modes (I)

Typical Address Modes (II)


Use of Memory Addressing Mode

Displacement Field Size

Immediate Operands
Distribution of Immediate Values

Addressing Modes for Signal Processing

• Because DSPs deal with infinite, continuous streams of data, they routinely rely on circular
buffers
– Modulo or circular addressing mode
• Support data shuffling in the Fast Fourier Transform (FFT)
– Bit-reverse addressing (see the sketch below)
– 011₂ → 110₂
• However, the two fancy addressing modes are not used heavily
– Mismatch between what programmers and compilers actually use and what
architects expect
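
A minimal Python sketch of bit-reversed indexing (my own helper, not DSP code): with
3 address bits, index 011₂ maps to 110₂, which is the reordering an FFT needs.

  def bit_reverse(index, bits):
      # Reverse the low `bits` bits of index, e.g. 011 -> 110 for bits=3.
      result = 0
      for _ in range(bits):
          result = (result << 1) | (index & 1)
          index >>= 1
      return result

  print(bin(bit_reverse(0b011, 3)))  # 0b110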

Frequency of Addressing Modes for TI TMS320C54x DSP


Short Summary – Memory Addressing

• Need to support at least three addressing modes


– Displacement, immediate, and register deferred (+ REGISTER)
– They represent 75% to 99% of the addressing modes in benchmarks
• The size of the address for displacement mode should be at least 12-16 bits (75% to 99%)
• The size of the immediate field should be at least 8-16 bits (50% to 80%)

Operand Type & Size

• Specified by instruction (opcode) or by hardware tag


– Tagged machines are extinct
• Typical types: assume word= 32 bits
– Character - byte - ASCII or EBCDIC (IBM) - 4 per word
– Short integer - 2- bytes, 2’s complement
– Integer - one word - 2’s complement
– Float - one word - usually IEEE 754 these days
– Double precision float - 2 words - IEEE 754
– BCD or packed decimal - 4-bit values packed 8 per word
• Instructions will be needed for common conversions -- software can do the rare ones

Data Access Patterns

Operands for Media and Signal Processing

• Graphics applications – vertex


– (x, y, z) + w to help with color or hidden surfaces (R, G, B, A)
– 32-bit floating-point values
• DSPs
– Fixed point – a binary point just to the right of the sign bit
• Represent fractions between –1 and +1
• Have a separate exponent variable
• Blocked floating point – a block of variables has a common exponent
• Need some registers that are wider to guard against round-off error

Operand Type and Size in DSP

Short Summary – Type and Size of Operand


• The future - as we go to 64 bit machines
• Decimal’s future is unclear
• Larger offsets, immediate, etc. is likely
• Usage of 64 and 128 bit values will increase
• DSPs need wider accumulating registers than the size in memory to aid accuracy in
fixed-point arithmetic

What Operations are Needed

• Arithmetic + Logical
– Integer arithmetic: ADD, SUB, MULT, DIV, SHIFT
– Logical operation: AND, OR, XOR, NOT
• Data Transfer - copy, load, store
• Control - branch, jump, call, return, trap
• System - OS and memory management
– We’ll ignore these for now - but remember they are needed
• Floating Point
– Same as arithmetic but usually take bigger operands
• Decimal - if you go for it what else do you need?
– legacy from COBOL and the commercial application domain
• String - move, compare, search
• Graphics – pixel and vertex, compression/decompression operations
Top 10 Instructions for 80x86

• load: 22%
• conditional branch: 20%
• compare: 16%
• store: 12%
• add: 8%
• and: 6%
• sub: 5%
• move register-register: 4%
• call: 1%
• return: 1%

• The most widely executed instructions are the simple operations of an instruction set
• The top-10 instructions for 80x86 account for 96% of instructions executed
• Make them fast, as they are the common case

Control Instructions are a Big Deal

• Jumps - unconditional transfer


• Conditional Branches
– How is condition code set? – by flag or part of the instruction
– How is target specified? How far away is it?
• Calls
– How is target specified? How far away is it?
– Where is return address kept?
– How are the arguments passed? Callee vs. Caller save!
• Returns
– Where is the return address? How far away is it?
– How are the results passed?

Breakdown of Control Flows

• Call/Returns
– Integer: 19% FP: 8%
• Jump
– Integer: 6% FP: 10%
• Conditional Branch
– Integer: 75% FP: 82%

Branch Address Specification

• Known at compile time for unconditional and conditional branches - hence specified
in the instruction
– As a register containing the target address
– As a PC-relative offset
• Consider word length addresses, registers, and instructions
– Full address desired? Then pick the register option.
• BUT - setup and effective address will take longer.
– If you can deal with smaller offset then PC relative works
• PC relative is also position independent - so simple linker duty

Returns and Indirect Jumps

• Branch target is not known at compile time


• Need a way to specify the target dynamically
– Use a register
– Permit any addressing mode
– Regs[R4] ← Regs[R4] + Mem[Regs[R1]]
• Also useful for
– case or switch
– Dynamically shared libraries
– High-order functions or function pointers
– Virtual functions in OO

Branch Stats - 90% are PC Relative


• Call/Return
– TeX = 16%, Spice = 13%, GCC = 10%
• Jump
– TeX = 18%, Spice = 12%, GCC = 12%
• Conditional
– TeX = 66%, Spice = 75%, GCC = 78%

Branch Distances
Condition Testing Options

What kinds of compares do Branches Use?

Direction, Frequency, and Real Change

Key points – 75% are forward branch


• Most backward branches are loops - taken about 90%
• Branch statistics are both compiler and application dependent
• Any loop optimizations may have large effect

Short Summary – Operations in the Instruction Set

• Branch addressing to be able to jump to about 100+ instructions either above or below
the branch
– Imply a PC-relative branch displacement of at least 8 bits
• Register-indirect and PC-relative addressing for jump instructions to support returns as
well as many other features of current systems

Encoding an Instruction Set

Encoding the ISA


• Encode instructions into a binary representation for execution by CPU
• Can pick anything but:
– Affects the size of code - so it should be tight
– Affects the CPU design - in particular the instruction decode
• So it may have a big influence on the CPI or cycle-time
• Must balance several competing forces
– Desire for lots of addressing modes and registers
– Desire to make average program size compact
– Desire to have instructions encoded into lengths that will be easy to handle in a
pipelined implementation (multiple of bytes)

3 Popular Encoding Choices

• Variable (compact code but difficult to encode)


– Primary opcode is fixed in size, but opcode modifiers may exist
– Opcode specifies number of arguments - each used as address fields
– Best when there are many addressing modes and operations
– Use as few bits as possible, but individual instructions can vary widely in length
– e. g. VAX - integer ADD versions vary between 3 and 19 bytes
• Fixed (easy to encode, but lengthy code)
– Every instruction looks the same - some field may be interpreted differently
– Combine the operation and the addressing mode into the opcode
– e. g. all modern RISC machines
• Hybrid
– Set of fixed formats
– e. g. IBM 360 and Intel 80x86

An Example of Variable Encoding – VAX

• addl3 r1, 737(r2), (r3): 32-bit integer add instruction with 3 operands → needs 6 bytes
to represent it
– Opcode for addl3: 1 byte
– A VAX address specifier is 1 byte (4 bits: addressing mode, 4 bits: register)
• r1: 1 byte (register addressing mode + r1)
• 737(r2)
– 1 byte for the address specifier (displacement addressing + r2)
– 2 bytes for the displacement 737
• (r3): 1 byte for the address specifier (register indirect + r3)
• Length of VAX instructions: 1-53 bytes
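
Tallying the slide’s byte counts in a tiny Python sketch (the labels are mine):

  addl3_bytes = {
      "opcode (addl3)": 1,
      "specifier for r1 (register mode)": 1,
      "specifier for 737(r2) (displacement mode)": 1,
      "16-bit displacement 737": 2,
      "specifier for (r3) (register indirect)": 1,
  }
  print(sum(addl3_bytes.values()))  # 6 bytes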

Short Summary – Encoding the Instruction Set

• Choice between variable and fixed instruction encoding


– Code size matters more than performance → variable encoding
– Performance matters more than code size → fixed encoding
Role of Compilers
• Critical goals in ISA from the compiler viewpoint
– What features will lead to high-quality code
– What makes it easy to write efficient compilers for an architecture

Compiler and ISA

• ISA decisions are no longer about making assembly language programming easy


• Due to HLL, ISA is a compiler target today
• Performance of a computer will be significantly affected by compiler
• Understanding compiler technology today is critical to designing and efficiently
implementing an instruction set
• Architecture choice affects the code quality and the complexity of building a compiler
for it

Goal of the Compiler

• Primary goal is correctness


• Second goal is speed of the object code
• Others:
– Speed of the compilation
– Ease of providing debug support
– Inter-operability among languages
– Flexibility of the implementation - languages may not change much but they do
evolve - e. g. Fortran 66 ===> HPF
Typical Modern Compiler Structure

• Multi-pass structure → easier to write bug-free compilers


– Transform high-level, more abstract representations into progressively lower-level
representations, eventually reaching the instruction set
• Compilers must make assumptions about the ability of later steps to deal with certain
problems
– Ex. 1 choose which procedure calls to expand inline before they know the exact
size of the procedure being called
– Ex. 2 Global common sub-expression elimination
• Find two instances of an expression that compute the same value and saves the
result of the first one in a temporary
– Temporary must be register, not memory (Performance)
– Assume register allocator will allocate temporary into register

Optimization Types

• High level - done at source code level


– Procedure called only once - so put it in-line and save CALL
• Local - done on basic sequential block (straight-line code)
– Common sub-expressions produce same value
– Constant propagation - replace constant valued variable with the constant - saves
multiple variable accesses with same value
• Global - same as local but done across branches
– Code motion - remove code from loops that compute same value on each pass and
put it before the loop
– Simplify or eliminate array addressing calculations in loop
Optimization Types (Cont.)
• Register allocation
– Use graph coloring (graph theory) to allocate registers
• NP-complete
• Heuristic algorithm works best when there are at least 16 (and preferably more)
registers
• Processor-dependent optimization
– Strength reduction: replace multiply with shift and add sequence
– Pipeline scheduling: reorder instructions to minimize pipeline stalls
– Branch offset optimization: Reorder code to minimize branch offsets
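
For instance, strength reduction turns a constant multiply into shifts and adds; a
Python sketch of the transformed arithmetic (a real compiler does this on its
intermediate code, not on source):

  def times_ten(x):
      # x * 10 == x*8 + x*2 == (x << 3) + (x << 1): two shifts and an add
      return (x << 3) + (x << 1)

  assert times_ten(7) == 7 * 10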

Major Types of Optimizations and Example in Each Class

Change in IC Due to Optimization

• Level 1: local optimizations, code scheduling, and local register allocation
• Level 2: + global optimization, loop transformation (software pipelining), global
register allocation
• Level 3: + procedure integration
Optimization Observations

• Hard to reduce branches


• Biggest reduction is often memory references
• Some ALU operation reduction happens but it is usually a few %
• Implication:
– Branch, Call, and Return become a larger relative % of the instruction mix
– Control instructions among the hardest to speed up

Impact of Compiler Technology on Architect’s Decisions

• Important questions
– How are variables allocated and addressed?
– How many registers will be needed?
• We must look at 3 areas to allocate data

Where to allocate data?

• Stack
– Local variable access in activation records, almost no push/pop
– Addressing is relative to the stack pointer
– Grown or shrunk on calls and returns
• Global data area - the easy one
– Constants and global static structures
– For arrays addressing may be indexed off head
• Heap
– Used for dynamic objects
– Access usually by pointers
– Data is typically not scalar
Register Allocation & Data
• Reasonably simple for stack objects
• Hard for global data due to aliasing opportunity
– Must be conservative
• Heap objects & pointers in general are even harder
– Computed pointers make it impossible to register-allocate the target data
– Any structured data - string, array, etc. is too big to save
• Since register allocation is a major optimization source
– The effect is clearly important

How can Architects Help Compiler Writers


• Provide Regularity
– Address modes, operations, and data types should be orthogonal (independent) of
each other
• Simplify code generation especially multi-pass
• Counterexample: restrict what registers can be used for a certain classes of
instructions
• Provide primitives - not solutions
– Special features that match a HLL construct are often un-usable
– What works in one language may be detrimental to others
• Simplify trade-offs among alternatives
– How to write good code? What is a good code?
• Metric used to be IC or code size (no longer true with caches and pipelines…)
– Anything that makes code sequence performance obvious is a definite win!
• How many times a variable should be referenced before it is cheaper to load it
into a register
• Provide instructions that bind the quantities known at compile time as constants
– Don’t hide compile time constants
• Instructions which work off of something that the compiler thinks could be a
run-time determined value hand-cuffs the optimizer

Short Summary – Compilers

• ISA has at least 16 GPR (not counting FP registers) to simplify allocation of registers
using graph coloring
• Orthogonality suggests all supported addressing modes apply to all instructions that
transfer data
• Simplicity – understand that less is more in ISA design
– Provide primitives instead of solutions
– Simplify trade-offs between alternatives
– Don’t bind constants at runtime
• Counterexample – Lack of compiler support for multimedia instructions
Instruction-Level Parallelism and Its Dynamic Exploitation
Outline
• Instruction-Level Parallelism: Concepts and Challenges
• Overcoming Data Hazards with Dynamic Scheduling
• Dynamic Scheduling: Examples and the Algorithm
• Reducing Branch Penalties with Dynamic Hardware Prediction
• High-Performance Instruction Delivery
• Taking Advantage of More ILP with Multiple Issue
• Hardware-Based Speculation
• Studies of the Limitations of ILP
• Limitations on ILP for Realizable Processors

Instruction-Level Parallelism: Concepts and Challenges

Introduction
• Instruction-Level Parallelism (ILP): potential execution overlap among instructions
– Instructions are executed in parallel
– Pipeline supports a limited sense of ILP
• This chapter introduces techniques to increase the amount of parallelism exploited
among instructions
– How to reduce the impact of data and control hazards
– How to increase the ability of the processor to exploit parallelism
• Pipelined CPI=Ideal pipeline CPI+Structural stalls+RAW stalls+WAR stalls+WAW
stalls+Control stalls

Approaches To Exploiting ILP


• Hardware approach: focus of this chapter
– Dynamic – running time
– Dominate desktop and server markets
– Pentium III and IV; Athlon; MIPS R10000/12000; Sun UltraSPARC III; PowerPC
603, G3, and G4; Alpha 21264
• Software approach: focus of next chapter
– Static – compile time
– Rely on compilers
– Broader adoption in the embedded market
– But include IA-64 and Intel’s Itanium

ILP Methods
ILP within a Basic Block

• Basic Block – Instructions between branch instructions


– Instructions in a basic block are executed in sequence
– Real code is a bunch of basic blocks connected by branch
• Notice: dynamic branch frequency – between 15% and 25%
– Basic block size between 6 and 7 instructions
– May depend on each other (data dependence)
– Therefore, probably little in the way of parallelism
• To obtain substantial performance enhancement: ILP across multiple basic blocks
– Easiest target is the loop
– Exploit parallelism among iterations of a loop (loop-level parallelism)

Loop Level Parallelism (LLP)

• Consider adding two 1000 element arrays


– There is no dependence between data values produced in any iteration j and those
needed in j+n for any j and n
– Truly independent iterations
– Independence means no stalls due to data hazards
• Basic idea to convert LLP into ILP
– Unroll the loop either statically by the compiler (next chapter) or dynamically by
the hardware (this
chapter)
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];

unrolls (conceptually) into:

x[1] = x[1] + y[1]
x[2] = x[2] + y[2]
…
x[1000] = x[1000] + y[1000]
Data Dependences and Hazards

Introduction
• If two instructions are independent, then
– They can execute (parallel) simultaneously in a pipeline without stall
• Assume no structural hazards
– Their execution orders can be swapped
• Dependent instructions must be executed in order, or partially overlapped in pipeline
• Why to check dependence?
– Determine how much parallelism exists, and how that parallelism can be exploited
• Types of dependences -- Data, Name, Control dependence

Data Dependence Analysis


• i is data dependent on j if i uses a result produced by j
– OR i uses a result produced by k and k depends on j (chain)
• Dependence indicates a potential RAW hazard
– Induce a hazard and stall? - depends on the pipeline organization
– The possibility limits the performance
• Order in which instructions must be executed
• Sets a bound on how much parallelism can be exploited
• Overcome data dependence
– Maintain dependence but avoid a hazard – scheduling the code (HW,SW)
– Eliminate a dependence by transforming the code (by compiler)

Data Dependence Example


Loop: L.D    F0, 0(R1)     ; F0 loaded here …

      ADD.D  F4, F0, F2    ; … and read here: data dependence through F0

      S.D    F4, 0(R1)     ; data dependence on ADD.D through F4

      DADDUI R1, R1, #-8   ; R1 written here …

      BNE    R1, R2, Loop  ; … and read here (and by the next iteration’s L.D)

Data Dependence through Memory Location


• Dependences that flow through memory locations are more difficult to detect
• Addresses may refer to the same location but look different
– 100(R4) and 20(R6) may be identical
• The effective address of a load or store may change from one execution of the
instruction to another
– Two execution of the same instruction L.D F0, 20(R4) may refer to different
memory location
• Because the value of R4 may change between two executions
Name Dependence
• Occurs when 2 instructions use the same register name or memory location without
data dependence
• Let i precede j in program order
– i is antidependent on j when j writes a register that i reads
• Indicates a potential WAR hazard
– i is output dependent on j if they both write to the same register
• indicates a potential WAW hazard
• Not true data dependences – no value being transmitted between instructions
– Can execute simultaneously or be reordered if the name used in the instructions is
changed so the instructions do not conflict

Name Dependence Example


L.D   F0, 0(R1)

ADD.D F4, F0, F2

S.D   F4, 0(R1)

L.D   F0, -8(R1)    ; antidependence (WAR) with ADD.D above, and output dependence (WAW) with the first L.D, through F0

ADD.D F4, F0, F2    ; output dependence (WAW) with the first ADD.D through F4

Register Renaming and WAW/WAR

Before renaming:            After renaming (S, T are new names):
DIV.D F0, F2, F4            DIV.D F0, F2, F4
ADD.D F6, F0, F8            ADD.D S, F0, F8
S.D   F6, 0(R1)             S.D   S, 0(R1)
SUB.D F8, F10, F14          SUB.D T, F10, F14
MUL.D F6, F10, F8           MUL.D F6, F10, T

Control Dependence
• Since branches are conditional
– Some instructions will be executed and others will not
– Instructions before the branch don’t matter
– Only possibility is between a branch and instructions which follow it
• 2 obvious constraints to maintain control dependence
– Instructions controlled by the branch cannot be moved before the branch (since it
would then be uncontrolled)
– An instruction not controlled by the branch cannot be moved after the branch (since
it would then be controlled)
• Note
– Transitive control dependence is also a factor
– In simple pipelines - order is preserved anyway so no big deal
• What’s the big deal
– No data dependence so move something before the branch
– Trash the result if the branch goes the wrong way
– Note only works when result goes to a register which becomes dead (result never
used) if the wrong way is taken
• However 2 important side-effects affect correctness issues
– Exception behavior remains intact
• Sometimes this is relaxed but it probably should not be
– Branches effectively set up conditional data flow
• Data flow is definitely real so if we do the move then we better make sure it
does not change the data flow
• So it can be done but care must be taken
– Enter HW and SW speculation & conditional instructions

• Control dependence is not the critical property that must be preserved

– We may execute instructions that should not have been executed, thereby violating the
control dependences, as long as the outcome stays correct
• Wrong guess in delayed branch (from target/fall-through)
• Maintain control and data dependences can prevent raising new exceptions
– DADDU R2, R3, R4
– BEQZ R2, L1
– LW R1, 0(R2)
– L1:
– No data dependence prevents us from interchanging BEQZ and LW; it is only the
control dependence

• By preserving the control dependence of the OR on the branch, we prevent an illegal


change to the data flow
– DADDU R1, R2, R3
– BEQZ R4, L1
– DSUBU R1, R5, R6
– L1:….
– OR R7, R1, R8

• IF R4 were unused (dead) after skipnext and DSUBU could not generate an exception,
we could move DSUBU before the branch, since the data flow cannot be affected

• If branch is taken, DSUBU will execute and will be useless


– DADDU R1, R2, R3
– BEQZ R12, skipnext
– DSUBU R4, R5, R6
– DADDU R5, R4, R9
– skipnext: OR R7, R8, R9
Overcoming Data Hazards with Dynamic Scheduling
Introduction
• Approaches used to avoid data hazard in Appendix A and Chapter 4
– Forwarding or bypassing – let dependence not result in hazards
– Stall – Stall the instruction that uses the result and successive instructions
– Compiler (Pipeline) scheduling – static scheduling
– In-order instruction issue and execution
• Instructions are issued in program order, and if an instruction is stalled in the
pipeline, no later instructions can proceed
• If there is a dependence between two closely spaced instructions in the pipeline,
this will lead to a hazard and a stall will result

Dynamic Scheduling VS. Static Scheduling

• Dynamic Scheduling – Avoid stalling when dependences are present


• Static Scheduling – Minimize stalls by separating dependent instructions so that they
will not lead to hazards

Dynamic Scheduling Idea


• Dynamic scheduling – HW rearranges the instruction execution to avoid stalling when
dependences, which could generate hazards, are present
– Advantages
• Enable handling some dependences unknown at compile time
• Simplify the compiler
• Code for one machine runs well on another
– Approaches
• Scoreboard (Appendix A)
• Tomasulo Approach (focus of this part)
• Assume multiple instructions can be in execution at the same time (requires multiple
FUs, pipelined FUs, or both)
Dynamic Scheduling
• Dynamic instruction reordering
– In-order issue
– But allow out-of-order execution (and thus out-of-order completion)
• Consider
DIV.D F0, F2, F4
ADD.D F10, F0, F8
SUB.D F12, F8, F14
– DIV.D has a long latency (20+ cycles)
– ADD.D has a data dependence on F0, SUB.D does not
• Stalling ADD.D will stall SUB.D too
• So swap them - compiler might have done this but so could HW
• Problems – raise new exceptions?
– For now lets ignore precise exceptions (Section 3.7 and Appendix A)

• Key Idea – allow instructions behind stall to proceed


– SUB.D can proceed even when ADD.D is stalled
• Out-of-order execution divides ID stage:
– Issue – decode instructions, check for structural hazards
– Read operands – wait until no data hazards, then read operands
• All instructions pass through the issue stage in order
• But, instructions can be stalled or bypass each other in the read-operand stage, and
thus enter execution out of order.

(Pipeline sketch from the figure: IM → Issue → Reg → ALU/DM → Reg)

WAR & WAW hazards may arise with dynamic scheduling

• More Interesting Code Fragment


– DIV.D F0, F2, F4
– ADD.D F6, F0, F8
– SUB.D F8, F10, F14
– MUL.D F6, F10, F8
• Note following
– ADD.D can’t start until DIV.D completes
– SUB.D does not need to wait but can’t post its result to F8 until ADD.D reads F8;
otherwise, a WAR hazard results
– MUL.D does not need to wait but can’t post its result to F6 until ADD.D writes F6;
otherwise, a WAW hazard results

Tomasulo’s Approach

• The original idea was developed for the IBM 360/91, to overcome


– Limited compiler scheduling (only 4 double-precision FP registers)
– Reduce memory accesses and FP delays
• Goal: High Performance without special compilers
• Why study? It led to the Alpha 21264, HP 8000, MIPS R10000, Pentium II, PowerPC 604, …
• Key ideas
– Track data dependences to allow execution as soon as operands are available 
minimize RAW hazards
– Rename registers to avoid WAR and WAW hazards

Key Idea
• Pipelined or multiple function units (FU)
• Each FU has multiple reservation stations (RS)
• Issue to reservation stations was in-order (in-order issue)
• RS starts whenever they had collected source operands from real registers (RR) -
hence out-of-order execution
• Reservation stations contain virtual registers (VR) that remove WAW and WAR
induced stalls
– RS fetches operands from RR and stores them into VR
– Since virtual registers can be more than real registers, the technique can even
eliminate hazards arising from name dependences that could not be eliminated by a
compiler

Basic Structure of A Tomasulo-Based MIPS Processor

Reservation Station Duties


• Each RS holds an instruction that has been issued and is awaiting execution at a FU,
and either the operand values or the RS names that will provide the operand values
• RS fetches operands from CDB when they appear
• When all operands are present, enable the associated functional unit to execute
• Since values are not really written to registers
– No WAW or WAR hazards are possible

Register Renaming in Tomasulo’s Approach

• Register renaming is provided by reservation stations (RS) and instruction issue logic
– Each function unit has several reservation stations
– A RS fetches and buffers an operand as soon as it is available
• Eliminate the need to get the operand from a register
– Pending instructions designate the RS that will provide their input
– When successive writes to a register overlap in execution, only the last one is
actually used to update the register
• Avoid WAW

RS and Tomasulo’s Approach

• Hazard detection and execution control are distributed


– Information held in RS at each functional unit determine when an instruction can
begin execution at that unit
• Results are passed directly to functional units rather than through the registers
– Essentially similar to bypass logic
– Broadcast capability since they pass on CDB (common data bus)

Instruction Steps
• Issue (note in-order due to queue structure)
– Get instruction from instruction Queue
– Issue if there is an empty RS or available buffer (loads, stores)
– If the operands are in registers send them to the reservation station
– Stall otherwise due to the structural hazard
• Execute (may be out of order)
– When all operands are available then execute
– If not, then monitor CDB to grab desired operand when it is produced
– Effectively deals with RAW hazards
• Write Result (also may be out of order)
– When result available write it to the CDB
– From CDB it will go to a waiting RS and to the registers and store buffer
– Note renaming model prevents WAW and WAR hazards as a side effect

Basic Structure of A Tomasulo-Based MIPS Processor.


Hazards Handling
• Structural hazards checked at 2 points
– At dispatch - a free RS of the appropriate type must be available
– When operands are ready - multiple RS may compete for issue to the shared
execution unit
• Program order used as basis for the arbitration
• RAW, WAR, WAW
• To preserve exception behavior, instructions should not be allowed to execute if a
branch that is earlier in program has not yet completed
– Implemented by preventing any instruction from leaving the issue step, if there is a
pending branch already in the pipeline

Virtual Registers
• Tag field associated with data
• Tag field is a virtual register ID
• Corresponds to
– Reservation station and load buffer names
• Motivation due to the 360’s register weakness
– Had only 4 FP registers
– The 9 renamed virtual registers were a significant bonus

Tomasulo Structure
• Each Reservation Station
– Op - the operation
– Qj, Qk - the RS that will produce the corresponding operand
• A value of 0 → the operand value is already available or no operand is necessary
– Vj, Vk - the value of the operands
• Only one of V or Q is valid for each operand
– Busy - RS and its corresponding functional unit are occupied
– A – information for the memory address calculation of a load or store
• Holds the immediate field initially, then the computed effective address
• Register file and store buffers
– Qi – RS that produces the value to be stored in this register
• Load and store buffers each require a busy field

Detailed Tomasulo Algorithm Control


Advantages of Tomasulo
• Distribution of the hazard detection logic
– Distributed RS and CDB
– If multiple instructions are waiting on a single result, and each already has its other
operand, then the instructions can be released simultaneously by the broadcast on the
CDB
• No waiting for the register bus in a centralized register file
• Elimination of stalls for WAW and WAR
– Rename register using RS
– Store operands into RS as soon as they are available
– For WAW-hazard, the last write will win
• Issue stage: RegisterStat[rd].Qi ← r (the last write wins)
Tomasulo Drawbacks
• Complexity
– delays of 360/91, MIPS 10000, IBM 620?
• Many associative stores (CDB) at high speed
• Performance limited by Common Data Bus
– Multiple CDBs → more FU logic for parallel associative stores

Tomasulo Loop Example


Loop: LD F0 0 R1
MULTD F4 F0 F2
SD F4 0 R1
SUBI R1 R1 #8
BNEZ R1 Loop

• Assume Multiply takes 4 clocks


• Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit)
• To be clear, will show clocks for SUBI, BNEZ
• In reality, the integer instructions run ahead

Reducing Branch Penalties with Dynamic Hardware Prediction


Dynamic Control Hazard Avoidance
• Consider Effects of Increasing the ILP
– Control dependencies rapidly become the limiting factor
– They tend to not get optimized by the compiler
• Higher branch frequencies result
• Plus multiple issue (more than one instruction per clock) → more control
instructions per second
– Control stall penalties will go up as machines go faster
• Amdahl’s Law in action - again
• Branch Prediction: helps if can be done for reasonable cost
– Static by compiler: appendix A
– Dynamic by HW: this section

Dynamic Branch Prediction

• Processor attempts to resolve the outcome of a branch early, thus preventing control
dependences from causing stalls
• BP Performance = f (accuracy, cost of misprediction)
• Branch History Table (BHT)
– Lower bits of PC address index table of 1-bit values
• No “precise” address check – just match the lower bits
– Says whether or not branch taken last time
BHT Prediction

Problem with the Simple BHT


• Aliasing
– All branches with the same index (lower) bits reference same BHT entry
• Hence they mutually predict each other
• No guarantee that a prediction is right. But it may not matter anyway
– Avoidance
• Make the table bigger - OK since it’s only a single bit-vector
• This is a common cache improvement strategy as well
– Other cache strategies may also apply
• Consider how this works for loops
– Always mispredict twice for every loop
• Once is unavoidable since the exit is always a surprise
• However previous exit will always cause a mis-predict on the first try of every
new loop entry

N-bit Predictors

• Use an n-bit saturating counter


– 2-bit counter implies 4 states
• Statistically 2 bits gets most of the advantage
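
A minimal sketch of one 2-bit saturating-counter BHT entry in Python (states 0-3,
predicting taken for 2 and 3; an illustration, not any particular hardware):

  class TwoBitPredictor:
      """One BHT entry: a saturating counter in 0..3."""
      def __init__(self):
          self.counter = 0              # 0,1 -> predict not taken; 2,3 -> predict taken

      def predict(self):
          return self.counter >= 2      # True = predict taken

      def update(self, taken):
          # Saturating move toward the observed outcome.
          if taken:
              self.counter = min(3, self.counter + 1)
          else:
              self.counter = max(0, self.counter - 1)

  p = TwoBitPredictor()
  for outcome in [True, True, True, False, True]:  # loop-like branch
      print(p.predict(), outcome)
      p.update(outcome)

Note how a single not-taken outcome (the loop exit) does not flip the prediction,
which is exactly the loop behavior the slide describes.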
BHT Accuracy
• Mispredict because either:
– Wrong guess for that branch
– Got branch history of wrong branch when index the table

Improve Prediction Strategy By Correlating Branches

• Consider the worst case for the 2-bit predictor


if (aa==2) then aa=0;
if (bb==2) then bb=0;
if (aa != bb) then whatever
– single level predictors can never get this case
• Correlating or 2-level predictors
– Correlation = what happened on the last branch
• Note that the last correlator branch may not always be the same
– Predictor = which way to go
• 4 possibilities: which way the last one went chooses the prediction
– (Last-taken, last-not-taken) X (predict-taken, predict-not-taken)

The worst case for the 2-bit predictor


if (aa==2)
    aa=0;
if (bb==2)
    bb=0;
if (aa != bb) {…

With aa in R1 and bb in R2:

      DSUBUI R3, R1, #2
      BNEZ   R3, L1        ; branch b1 (aa != 2)
      DADD   R1, R0, R0    ; aa = 0
L1:   DSUBUI R3, R2, #2
      BNEZ   R3, L2        ; branch b2 (bb != 2)
      DADD   R2, R0, R0    ; bb = 0
L2:   DSUBU  R3, R1, R2    ; R3 = aa - bb
      BEQZ   R3, L3        ; branch b3 (aa == bb)
Correlating Branches

• Hypothesis: recently executed branches are correlated; that is, behavior of recently
executed branches affects prediction of current branch
• Idea: record m most recently executed branches as taken or not taken, and use that
pattern to select the proper branch history table
• In general, an (m,n) predictor means recording the last m branches to select between 2^m
history tables, each with n-bit counters
– Old 2-bit BHT is then a (0,2) predictor

Example of Correlating Branch Predictors


if (d==0)
    d = 1;
if (d==1)
    …

      BNEZ   R1, L1        ; branch b1 (d != 0)
      DADDIU R1, R0, #1    ; d == 0, so d = 1
L1:   DADDIU R3, R1, #-1
      BNEZ   R3, L2        ; branch b2 (d != 1)
      …
L2:

Initial value of d | d==0? | b1        | Value of d before b2 | d==1? | b2
0                  | Yes   | Not taken | 1                    | Yes   | Not taken
1                  | No    | Taken     | 1                    | Yes   | Not taken
2                  | No    | Taken     | 2                    | No    | Taken

In general, for an (m,n) BHT prediction buffer:
• p bits of buffer index → 2^p BHT entries
• Use the last m branches = global branch history
• Use an n-bit predictor per entry

If d alternates between 2 and 0, and both branches use 1-bit predictors initialized
to not taken (NT):

d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2   | NT            | T         | T                 | NT            | T         | T
0   | T             | NT        | NT                | T             | NT        | NT
2   | NT            | T         | T                 | NT            | T         | T
0   | T             | NT        | NT                | T             | NT        | NT

Every branch is mispredicted.
Prediction bits | Prediction if last branch not taken | Prediction if last branch taken
NT/NT           | NT                                  | NT
NT/T            | NT                                  | T
T/NT            | T                                   | NT
T/T             | T                                   | T

With a (1,1) correlating predictor, all entries initialized to NT/NT:

d=? | b1 prediction | b1 action | New b1 prediction | b2 prediction | b2 action | New b2 prediction
2   | NT/NT         | T         | T/NT              | NT/NT         | T         | NT/T
0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T
2   | T/NT          | T         | T/NT              | NT/T          | T         | NT/T
0   | T/NT          | NT        | T/NT              | NT/T          | NT        | NT/T

The only mispredictions are on the first iteration.

• Total bits for the (m, n) BHT prediction buffer:

  Total_memory_bits = 2^m × n × 2^p

– 2^m banks of memory selected by the global branch history (which is just a shift
register), e.g., a column address
– Use p bits of the branch address to select the row
– Get the n predictor bits in the entry to make the decision
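
For example (an illustrative calculation with numbers of my choosing): a (2,2)
predictor indexed by p = 10 bits of branch address needs 2^2 × 2 × 2^10 = 8K bits of
prediction storage.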
(2,2) Predictor Implementation
Accuracy of Different Schemes

Tournament Predictors
• Adaptively combine local and global predictors
– Multiple predictors
• One based on global information: Results of recently executed m branches
• One based on local information: Results of past executions of the current branch
instruction
– Selector to choose which predictors to use
• 2-bit saturating counter, incremented whenever the “predicted” predictor is
correct and the other predictor is incorrect, and it is decremented in the reverse
situation
• Advantage
– Ability to select the right predictor for the right branch
• Alpha 21264 Branch Predictor

State Transition Diagram for A Tournament Predictor

Fraction of Predictions Coming from the Local Predictor (SPEC89)

Misprediction Rate Comparison


Branch Target Buffer/Cache
• To reduce the branch penalty to 0
– Need to know what the address is by the end of IF
– But the instruction is not even decoded yet
– So use the instruction address rather than wait for decode
• If prediction works then penalty goes to 0!
• BTB Idea -- cache to store taken branches (no need to store untaken ones; see the sketch after this list)
– Match tag is the instruction address → compare with the current PC
– Data field is the predicted PC
• May want to add predictor field
– To avoid the mispredict twice on every loop phenomenon
– Adds complexity since we now have to track untaken branches as well
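
A dictionary-based sketch of the BTB idea in Python (assuming 4-byte instructions;
tags, indexing, and finite capacity are omitted for clarity):

  btb = {}  # branch PC -> predicted target PC (taken branches only)

  def fetch_predict(pc):
      # During IF: a BTB hit means "predict taken, fetch from the stored target".
      return btb.get(pc, pc + 4)       # miss: predict fall-through

  def resolve(pc, taken, target):
      # After the branch resolves: enter taken branches, evict not-taken ones.
      if taken:
          btb[pc] = target
      else:
          btb.pop(pc, None)

  resolve(0x4000, True, 0x4800)
  print(hex(fetch_predict(0x4000)))    # 0x4800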

Branch Target Buffer/Cache -- Illustration


Changes in DLX to incorporate BTB

Penalties Using this Approach for MIPS/DLX


Instruction in buffer | Prediction | Actual branch | Penalty cycles
Yes                   | Taken      | Taken         | 0
Yes                   | Taken      | Not taken     | 2
No                    | -          | Taken         | 2
No                    | -          | Not taken     | 0
Note:
• Predict_wrong = 1 CC to update BTB + 1 CC to restart fetching
• Not found and taken = 2CC to update BTB
Note:
• For complex pipeline design, the penalties may be higher

Branch Penalty & CPI


• Prediction accuracy is 90%
• Hit rate in the buffer is 90%
• Taken branch frequency is 60%
• Branch_penalty = buffer_hit_rate × incorrect_prediction_rate × 2
  + (1 − buffer_hit_rate) × taken_branch_frequency × 2
  = (0.9 × 0.1 × 2) + (0.1 × 0.6 × 2) = 0.18 + 0.12 = 0.30 cycles
• Branch penalty for delayed branches is about 0.5

Return Address Predictor


• Indirect jump – jumps whose destination address varies at run time
– indirect procedure call, select or case, procedure return
– SPEC89 benchmarks: 85% of indirect jumps are procedure returns
• Accuracy of BTB for procedure returns are low
– if procedure is called from many places, and the calls from one place are not
clustered in time
• Use a small buffer of return addresses operating as a stack
– Cache the most recent return addresses
– Push a return address at a call, and pop one off at a return
– If the cache is sufficiently large (≥ maximum call depth) → perfect prediction (see the sketch below)
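
A small Python sketch of the return-address stack (fixed depth; overflow drops the
oldest entry, which is where mispredictions come from):

  class ReturnAddressStack:
      def __init__(self, depth=8):
          self.depth = depth
          self.stack = []

      def push(self, return_address):    # on a call
          if len(self.stack) == self.depth:
              self.stack.pop(0)          # overflow: lose the oldest entry
          self.stack.append(return_address)

      def predict_return(self):          # on a return
          return self.stack.pop() if self.stack else None

  ras = ReturnAddressStack(depth=2)
  ras.push(0x400100)
  ras.push(0x400200)
  print(hex(ras.predict_return()))       # 0x400200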

Dynamic Branch Prediction Summary


• Branch History Table: 2 bits for loop accuracy
• Correlation: Recently executed branches correlated with next branch
• Branch Target Buffer: include branch address & prediction
• Reduce penalty further by fetching instructions from both the predicted and
unpredicted direction
– Require dual-ported memory, interleaved cache  HW cost
– Caching addresses or instructions from multiple path in BTB
Taking Advantage of More ILP with Multiple Issue
Pipelined CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW
stalls + Control stalls
Getting CPI < 1: Issuing
Multiple Instructions/Cycle
• Superscalar
– Issue varying numbers of instructions per clock
• Constrained by hazard style issues
– Scheduling
• Static - by the compiler
• Dynamic - hardware support for some form of Tomasulo
• VLIW (very long instruction word)
– Issue a fixed number of instructions formatted as…
• One large instruction – or –
• A fixed instruction packet with the parallelism among instructions explicitly
indicated by the instruction
– Also known as EPIC – explicitly parallel instruction computers
– Scheduling: mostly static
Five Approaches in use for Multiple-Issue Processors
Statically Scheduled Superscalar Processors
• HW might issue 0 to 8 instructions in a clock cycle
• Instructions issue in program order
• Pipeline hazards are checked for at issue time
– Among instructions being issued in a given clock cycle
– Among the issuing instructions and all those still in execution
– If a data or structural hazard occurs, only the instructions preceding that one in
program order will be issued (hence the number issued varies dynamically)
• Complex issue stage
– Split and pipelined → but this results in higher branch penalties
– Instruction issue is likely to be one limitation on the clock rate of superscalar
processors
Superscalar 2-issue MIPS
• Very similar to the HP 7100
• Require fetching and decoding 64 bits of instructions
• Which instructions
– 1 integer: load, store, branch, or integer ALU operation
– 1 float: FP operation
• Why issue one integer and one FP operation?
– Eliminates most hazard possibilities → simplifies the logic
• Integer and FP register sets are different
• Integer and FP FUs are different
– Only difficulty: when the integer instruction is an FP load, store, or move
• Needs an additional read/write port on the FP registers
• May create a RAW hazard
Superscalar 2-issue MIPS (Cont.)
Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
• Instruction placement is not restricted in modern processors
• The 1-cycle load delay expands to 3 instructions in the superscalar
– the instruction in the right half cannot use the loaded value, nor can the
instructions in the next issue slot
• Must have pipelined FP FUs or multiple independent FP FUs
Consider adding a scalar s to a vector
• for (i=1000; i > 0; i=i-1)
x[i] = x[i] + s;
Loop: L.D F0, 0(R1)    ;F0=vector element
ADD.D F4, F0, F2       ;add scalar in F2
S.D F4, 0(R1)          ;store result
DADDUI R1, R1, #-8     ;decrement pointer 8B (DW)
BNE R1, R2, Loop       ;branch if R1!=R2
Unscheduled Loop
Clock Cycle Issued
Loop: L.D F0,0( R1 ) 1
stall 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4, 0(R1) 6
DADDUI R1,R1,#-8 7
stall 8
BNE R1, R2, Loop 9
stall 10
Unrolled Loop that Minimizes Stalls for Scalar
1 Loop: L.D F0, 0(R1)
2 L.D F6, -8(R1)
3 L.D F10, -16(R1)
4 L.D F14, -24(R1)
5 ADD.D F4, F0, F2
6 ADD.D F8, F6, F2
7 ADD.D F12, F10, F2
8 ADD.D F16, F14, F2
9 S.D F4, 0(R1)
10 S.D F8, -8(R1)
11 DADDUI R1, R1, #-32
12 S.D F12, 16(R1)
13 BNE R1, R2, LOOP
14 S.D F16, 8(R1) ;8-32=-24 (fills the branch delay slot)
Unrolled Loop for SuperScalar (5 times)
1 Loop: L.D F0,0( R1)
2 L.D F6,-8(R1)
3 L.D F10,-16(R1)
4 ADD.D F4,F0,F2
5 L.D F14,-24(R1)
6 ADD.D F8,F6,F2
7 L.D F18, -32(R1)
...
Loop Unrolling in Superscalar
Integer instruction FP instruction Clock cycle
Loop: L.D F0,0(R1) 1
L.D F6,-8(R1) 2
L.D F10,-16(R1) ADD.D F4,F0,F2 3
L.D F14,-24(R1) ADD.D F8,F6,F2 4
L.D F18,-32(R1) ADD.D F12,F10,F2 5
S.D F4, 0(R1) ADD.D F16,F14,F2 6
S.D F8, -8(R1) ADD.D F20,F18,F2 7
S.D F12, -16(R1) 8
S.D F16, -24(R1) 9
DADDUI R1,R1,#-40 10
BNE R1, R2, LOOP 11
S.D F20, -32(R1) 12
Seem Simple?
• Registers
– Each pipe has its own set
• Due to the separation of FP and GP registers
• Also inherently separates data dependences into 2 classes
– Exception: FP loads/stores (LDD or LDF)
• The effective-address calculation is an integer operation
• The destination register, however, is an FP register
• FP pipe has longer latency
– Exacerbated by operation latency differences
• mult = 6 cycles, divide = 24 cycles, for example
– Result is that completion is out of order
• Complicates hazard control within the FP execution pipe
– Pipeline the FP ALU or use multiple FP ALUs
Problems So Far
• Look at the opcodes
– See if the pair is an appropriate issue pair
• Some integer operations are a problem
– FP register loads/stores – since the other instruction may be dependent
• A stall will result – options?
– Force FP loads, stores, or moves to issue by themselves
• Safe but suboptimal, since the other instruction may still be independent
– OR add more ports to the FP register file
• Such as separate read and write ports
• Still must stall the 2nd instruction if it is dependent
Other Issues
• Hazard detection
– Similar to the normal pipeline model, but needs a larger set of bypass paths (twice
as many instructions in the pipeline)
• Load-use delay
– Assume 1 cycle → now covers 3 instruction slots
• Branch delay
– Should branches be issued by themselves?
– The 1-instruction branch delay now holds 3 instructions as well
• Instruction scheduling by the compiler
– Mandatory for issuing independent operations in a superscalar
– Increasingly important as issue width goes up
Dynamic Scheduling In SuperScalar
• Use Tomasulo Algorithm
– Two arbitrary instructions per clock: issue and let RS sort it out
– But still can’t issue a dependent pair
– Two examples: pp. 221—224
• How to issue multiple arbitrary instructions per clock?
– Run the issue step in half a clock cycle (i.e., pipeline it)
– Build the logic necessary to handle two instructions at once, including any possible
dependences between the instructions
– Modern SS processors that issue four or more instructions per clock often include
both approaches
• Only FP loads might cause dependency between integer and FP issue:
– Replace load reservation station with a load queue
• Operands must be read in the order they are fetched
– Load checks addresses in Store Queue to avoid RAW violation
– Store checks addresses in Load Queue to avoid WAR, WAW
– Called decoupled architecture
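A C sketch of the address checks just described (queue layout and names are
assumptions for illustration):

    #include <stdint.h>
    #define QLEN 16

    typedef struct { uint32_t addr; int valid; } mem_entry;
    static mem_entry load_q[QLEN], store_q[QLEN];

    /* A load must not bypass an older store to the same address (RAW). */
    int load_conflicts(uint32_t addr) {
        for (int i = 0; i < QLEN; i++)
            if (store_q[i].valid && store_q[i].addr == addr) return 1;
        return 0;
    }

    /* A store must not bypass an older load (WAR) or older store (WAW)
     * to the same address. */
    int store_conflicts(uint32_t addr) {
        for (int i = 0; i < QLEN; i++) {
            if (load_q[i].valid && load_q[i].addr == addr) return 1;
            if (store_q[i].valid && store_q[i].addr == addr) return 1;
        }
        return 0;
    }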
Example
• Can issue two arbitrary operations per clock
• One integer FU for ALU operation and EA-calculation
• A separate pipelined FP FU
• One memory unit, 2 CDBs
• no delayed branch with perfect branch prediction
– Fetch and issue as if the branch predictions are always correct
• Latency between a source instruction and an instruction consuming the result –
presence of Write Result stage
– 1 CC for integer ALU operations
– 2 CC for loads
– 3 CC for FP add
Note
• The WR stage does not apply to either stores or branches
• For L.D and S.D, the execution cycle is EA calculation
• For branches, the execution cycle shows when the branch condition can be evaluated
and the prediction checked
• Any instruction following a branch cannot start execution until after the branch
condition has been evaluated
• If two instructions could use the same FU at the same point (structural hazard), priority
is given to the older instruction
Consider adding a scalar s to a vector
• for (i=1000; i > 0; i=i-1)
x[i] = x[i] + s;

Loop: L.D F0, 0(R1)    ;F0=vector element
ADD.D F4, F0, F2       ;add scalar in F2
S.D F4, 0(R1)          ;store result
DADDUI R1, R1, #-8     ;decrement pointer 8B (DW)
BNE R1, R2, Loop       ;branch if R1!=R2
Execution Timing
Example Result
• Result
– IPC issued = 5/3 = 1.67; instruction execution rate = 15/16 = 0.94
• Only one load, one store, and one integer ALU operation can execute at a time
– The load of the next iteration performs its memory address calculation before the
store of the current iteration
– Only a single CDB is actually needed
– Integer operations become the bottleneck
• Many integer operations, but only one integer ALU
– One stall cycle in each loop iteration due to a branch hazard
Note
• Result
– IPC issued = 5/3 = 1.67; instruction execution rate = 15/11 = 1.36
– A second CDB is needed
– This example has a higher instruction execution rate but lower efficiency as
measured by FU utilization
Limitations on Multiple Issue
• How much ILP can be found in the application – the fundamental problem
– Requires deep unrolling – hence the big focus on loops
• Compiler complexity goes way up
• Deep unrolling needs lots of registers

• Increased HW cost
– More ports for the register files
– Cost of scoreboarding (e.g., the Tomasulo data structures) and forwarding paths
– Memory bandwidth requirement goes up
• Most have gone with separate I and D ports already
• Newest approaches go for multiple D ports as well – a big expense!
(PA-8000)
– Branch prediction by HW is an absolute must → HW speculation (Sect. 3.7)
3.7 Hardware-Based Speculation
Overview
• Overcome control dependence by speculating on the outcome of branches and
executing the program as if our guesses were correct
– Fetch, issue, and execute instructions
– Need mechanisms to handle the situation when the speculation is incorrect
• Dynamic scheduling: only fetch and issue such instructions
Key Ideas
• Dynamic branch prediction to choose which instructions to execute
• Speculation to allow the speculated blocks to execute before the control
dependences are resolved
– And undo the effects of an incorrectly speculated sequence
• Dynamic scheduling to deal with the scheduling of different combinations of basic
blocks (Tomasulo style approach)
HW Speculation Approach
• Issue → execute → write result → commit
– Commit is the point where the operation is no longer speculative
• Allow out of order execution
– Require in-order commit
– Prevent speculative instructions from performing destructive state changes (e.g.
memory write or register write)
• Collect pre-commit instructions in a reorder buffer (ROB)
– Holds completed but not committed instructions
– Effectively contains a set of virtual registers to store the result of speculative
instructions until they are no longer speculative
• Similar to a reservation station → and becomes a bypass source
The Speculative MIPS
• Need a HW buffer for results of uncommitted instructions: the reorder buffer (ROB)
– 4 fields: instruction type, destination field, value field, ready field
– The ROB is a source of operands → more registers, like the RS
• The ROB supplies operands in the interval between completion of instruction
execution and instruction commit
• Use the ROB number instead of the RS number to tag the result when it is
placed on the CDB after execution completes (but before commit)
– Once an instruction commits, its result is put into the register file
– As a result, it is easy to undo speculated instructions on mispredicted branches or on
exceptions
ROB Fields
• Instruction type – branch, store, register operations
• Destination field
– Unused for branches
– Memory address for stores
– Register number for load and ALU operations (register operations)
• Value – holds the value of the instruction result until commit
• Ready – indicates whether the instruction has completed execution
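An illustrative C rendering of an ROB entry with these four fields (type and
field names are assumptions):

    #include <stdint.h>

    typedef enum { BRANCH, STORE, REG_OP } instr_type;

    typedef struct {
        instr_type type;   /* branch, store, or register operation */
        uint32_t   dest;   /* register number or memory address; unused for branches */
        uint64_t   value;  /* result held here until commit */
        int        ready;  /* has execution completed? */
    } rob_entry;

    /* The ROB itself is a circular buffer, committed in order from the head. */
    typedef struct {
        rob_entry entries[64];
        int head, tail, count;
    } reorder_buffer;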
Steps in Speculative Execution
• Issue (or dispatch)
– Get an instruction from the instruction queue
– Issue in order if an RS AND an ROB slot are available; otherwise, stall
– Send operands to the RS if they are in the registers or the ROB
– Update the Tomasulo data structures and the ROB
– The ROB number allocated for the result is sent to the RS, so that the number can
be used to tag the result when it is placed on the CDB
• Execute
– The RS waits and grabs results off the CDB as needed
– When all operands are available, execution happens
• Write Result
– Result posted to ROB via the CDB
– Waiting reservation stations can grab it as well
• Commit (or graduate) – instruction reaches the ROB head
– Normal commit – when instruction reaches the ROB head and its result is present
in the buffer
• Update the register and remove the instruction from ROB
– Store – Update memory and remove the instruction from ROB
– Branch with incorrect prediction – wrong speculation
• Flush ROB and the related FP OP queue (RS)
• Restart at the correct successor of the branch
• Remove the instruction from ROB
– Branch with correct prediction – finish the branch
• Remove the instruction from ROB
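A sketch of the in-order commit step, continuing the illustrative rob_entry /
reorder_buffer types above (an assumption-laden sketch, not a real design):

    /* Commit at most one instruction from the ROB head; architectural
     * state (registers, memory) changes only here. */
    void commit_one(reorder_buffer *rob, uint64_t regs[], uint8_t mem[]) {
        rob_entry *e = &rob->entries[rob->head];
        if (rob->count == 0 || !e->ready)
            return;                            /* head not finished: wait */
        if (e->type == REG_OP)
            regs[e->dest] = e->value;          /* register updated at commit */
        else if (e->type == STORE)
            mem[e->dest] = (uint8_t)e->value;  /* memory updated only at commit */
        /* A mispredicted BRANCH at the head would instead flush the ROB and RSs. */
        rob->head = (rob->head + 1) % 64;
        rob->count--;
    }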
Example
• The same example as Tomasulo without speculation. Show the status tables when
MUL.D is ready to go to commit
– L.D F6, 34(R2)
– L.D F2, 45(R3)
– MUL.D F0, F2, F4
– SUB.D F8, F6, F2
– DIV.D F10, F0, F6
– ADD.D F6, F8, F2
• Modified status tables
– Qj and Qk fields, and register status fields use ROB (instead of RS)
– Add Dest field to RS (ROB to put the operation result)
Example Result
• Tomasulo without speculation
– SUB.D and ADD.D have completed (clock cycle 16, slide 58)
• Tomasulo with speculation
– No instruction after the earliest uncompleted instruction (MUL.D) is allowed to
complete
– In-order commit
• Implication – an ROB with in-order instruction commit provides precise exceptions
– Precise exceptions – exceptions are handled in instruction order
Loop Example
Loop: L.D F0, 0(R1)
MUL.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1,R1, #-8
BNE R1, R2, Loop
• Assume we have issued all the instructions in the loop twice
• Assume L.D and MUL.D from the first iteration have committed and all others have
completed execution
Loop Example Observation
• Suppose the first BNE is not taken → flush the ROB and begin fetching instructions
from the other path
Other Issues
• Performance is more sensitive to branch-prediction
– Impact of a mis-prediction will be higher
– Prediction accuracy, mis-prediction detection, and mis-prediction recovery increase
in importance
• Precise exceptions
– Handled by not recognizing the exception until the instruction is ready to commit
– If a speculative instruction raises an exception, the exception is recorded in the ROB
• Mispredicted branch → its recorded exceptions are flushed as well
• If the instruction reaches the ROB head → take the exception
Multiple Issue with Speculation
• Process multiple instructions per clock, assigning RSs and ROB entries to the
instructions
• To maintain a throughput of greater than one instruction per cycle, must handle
multiple instruction commits per clock
• Speculation helps significantly when a branch is a key potential performance
limitation
• Speculation can be advantageous when there are data-dependent branches, which
otherwise would limit performance
– Depends on accurate branch prediction → incorrect speculation will typically harm
performance
Example
• Assume separate integer FUs for ALU operations, effective address calculation, and
branch condition evaluation
• Assume up to 2 instructions of any type can commit per clock
• Loop: LD R2, 0(R1)
DADDIU R2, R2, #1
SD R2, 0(R1)
DADDIU R1, R1, #4
BNE R2, R3, LOOP
Example Result
• Without speculation
– The LD following BNE cannot start execution earlier → it waits until the branch
outcome is determined
– The completion rate falls behind the issue rate rapidly; the pipeline stalls after a
few more iterations are issued
• With speculation
– The LD following BNE can start execution early because it is speculative
ILP Studies
• Perfect hardware model – the ideal, infinite-cost case
– Rename as much as you need
• Implies infinite virtual registers
• Hence – complete WAW and WAR insensitivity
– Branch prediction is perfect
• This will never happen in reality, of course
– Jump prediction (even of computed jumps such as returns) is also perfect
• Similarly unreal
– Perfect memory disambiguation
• Almost perfect is not too hard in practice
– Can issue an unlimited number of instructions at once, with no restriction on the
types of instructions issued → effectively unlimited FUs
– One-cycle latency
Let’s Look at A Real Machine
• Alpha 21264 – one of the most advanced superscalar processors announced to date
– Issues up to four instructions per clock, and initiates execution on up to six
• At most 2 memory references, among other restrictions
– Support a large set of renaming registers (41 integer and 41 FP)
– Allow up to 80 instructions in execution
– Multicycle latencies
– Tournament-style branch predictor
How to Measure
• A set of programs were compiled and optimized with the standard MIPS optimizing
compilers
• Execute and produce a trace of the instruction and data references
– Perfect branch prediction and perfect alias analysis are easy to do
• Every instruction in the trace is then scheduled as early as possible, limited only by the
data dependence
– Including moving across branches
What A Perfect Processor Must Do?
• Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches
perfectly
• Rename all register uses to avoid WAW and WAR hazards
• Determine whether there are any dependences among the instructions in the issue
packet; if so, rename accordingly
• Determine if any memory dependences exist among the issuing instructions and
handle them appropriately
• Provide enough replicated FUs to allow all the ready instructions to issue
ILP at the Limit
• How many instructions would issue on the perfect machine every cycle?
– gcc - 54.8
– espresso - 62.6
– li - 17.9
– fpppp - 75.2
– doduc - 118.7
– tomcatv - 150.1
• Limited only by the ILP inherent in the benchmarks
• Note:
– Benchmarks are small codes
– More ILP tends to surface as the codes get bigger
Window Size
• The set of instructions that is examined for simultaneous execution is called the
window
• The window size is determined by the cost of deciding whether n issuing
instructions have any register dependences among them
– In theory, this cost is O(n^2)
– But the many restrictions of real processors (in-order issue, …) reduce the cost
• Each instruction in the window must be kept in processor
• Window size is limited by the required storage, the comparisons, and a limited issue
rate
Effects of limiting the Issue Window Size
Short Summary on Window Size
• Integer programs do not contain nearly as much parallelism as FP programs
• Parallelism in the FP cases comes from loop-level parallelism
• When the window size is small, the processor simply cannot see the instructions in the
next iteration that could be issued in parallel with instructions from the current
iteration
Effects of Realistic Branch Prediction
• Perfect
• Tournament-based (97% accurate, with 48K bits)
– Uses a correlating 2-bit and a non-correlating 2-bit predictor plus a selector to
choose between the two
– Prediction buffer has 8K entries (13 address bits from the branch)
– 3 entries per slot – non-correlating, correlating, select
• Standard 2-bit
– 512 entries (9 address bits)
– Plus a 16-entry buffer to predict returns
• Static
– Based on profile – predict either T or NT, but the prediction stays fixed
• None
Short Summary on Branch Prediction
• fpppp and tomcatv (FP programs) have far fewer branches, and the few branches
that exist are more predictable
• To achieve significant amounts of parallelism in integer programs, the processor must
select and execute instructions that are widely separated. When branch prediction is
not highly accurate, the mispredicted branches become a barrier to finding the
parallelism
Effects of Limiting the Renaming Registers
Models for Memory Alias Analysis
• Perfect
– No mistakes - the unrealistic limit
• Global/Stack Perfect
– Similar to best compiler methods to date
– Perfect job on global and stack areas
– Assume heap addresses conflict (improvement here is likely)
• Inspection
– If pointer is to different allocation areas then no conflict
– Also no conflict using same register with different offsets
• None
– All memory references are assumed to conflict
Effects of Memory Aliasing
Consider a Realizable Processor
• Something we can conceive might be possible in 5 years
• 64 issue with no issue restrictions (may be unlikely)
• Tournament predictor - 1K entries
• 16-entry return predictor
• Dynamic perfect memory disambiguation (may be unlikely)
• Register Renaming with 64 additional FP registers and 64 additional integer registers
Effect of Window Size
Note
• We can’t possibly cover everything in class
– Without skipping important fundamental concepts
• So read the Intel P6 (3.10) and IA-64 (4.7) section
– Every machine has some goals and a lot of balance points
– Instructive to follow the list of possibilities we have discussed with a real case
study
– If there’s another machine you are interested in then go read about it
• Analyze as much as you can
– Try to understand why a particular decision was made
– From the normal literature the real reasons are impossible to find
– But you know a lot now, and you’ll mostly get it right if you do battle with the
problem in a serious way
4. Exploiting ILP with Software Approaches
4.1 Basic Compiler Techniques for Exposing ILP
Basic Pipeline Scheduling and Loop Unrolling
• Idea – find sequences of unrelated instructions (no hazards) that can be overlapped in
the pipeline to exploit ILP
– A dependent instruction must be separated from the source instruction by a distance
in clock cycles equal to the latency of the source instruction to avoid a stall
Consider adding a scalar s to a vector
• for (i=1000; i > 0; i=i-1)
x[i] = x[i] + s;
Loop: L.D F0, 0(R1)    ;F0=vector element
ADD.D F4, F0, F2       ;add scalar in F2
S.D F4, 0(R1)          ;store result
DADDUI R1, R1, #-8     ;decrement pointer 8B (DW)
BNE R1, R2, Loop       ;branch if R1!=R2
Unscheduled Loop
Clock Cycle Issued
Loop: L.D F0,0( R1 ) 1
stall 2
ADD.D F4,F0,F2 3
stall 4
stall 5
S.D F4, 0(R1) 6
DADDUI R1,R1,#-8 7
stall 8
BNE R1, R2, Loop 9
stall 10
Scheduled Loop
Loop Unrolling – Make Body Fat
• Three of the six clock cycles in the scheduled loop are overhead (DADDUI, BNE,
and a stall)
– Get more operations within the loop relative to the number of overhead instructions
• Loop unrolling
– Replicate the loop body multiple times and adjust the loop termination code
• Basic Idea
– Take n loop bodies and concatenate them into 1 basic block
– Adjust the new termination code
• Let’s say n was 4
• Then modify the R1 pointer in the example by 4× what it was before
– Savings – 4 BNEs + 4 DADDUIs → just one of each
• Hence a 75% improvement
– Problem – still have 4 load stalls per loop
Loop Unrolling -- Primitive
Loop: L.D F0,0( R1)
ADD.D F4,F0,F2
S.D F4, 0(R1)
DADDUI R1,R1,#-8
BEQ R1, R2, exit
L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4, 0(R1)
DADDUI R1,R1,#-8
BEQ R1, R2, exit
….
BNE R1, R2, Loop
Exit:
Demonstration of the Basic Idea
1 Loop: L.D F0,0( R1)
2 ADD.D F4,F0,F2
3 S.D F4, 0(R1) ;drop DADDUI & BNE
4 L.D F6,-8(R1)
5 ADD.D F8,F6,F2
6 S.D F8, -8(R1) ;drop DADDUI & BNE
7 L.D F10,-16(R1)
8 ADD.D F12,F10,F2
9 S.D F12, -16(R1) ;drop DADDUI & BNE
10 L.D F14,-24(R1)
11 ADD.D F16,F14,F2
12 S.D F16, -24(R1)
13 DADDUI R1,R1,#-32 ;alter to 4*8
14 BNE R1, R2, LOOP
Loop Unrolling – Make Body Fat (Cont.)
• Next idea
– Don’t just concatenate the unrolled segments - shuffle them instead
– 4 L.D’s then 4 ADD.D’s then 3 S.D’s, DADDUI, BNE, then fill the branch delay
slot with the final S.D
– No more stalls since L.D to dependent ADD.D path now has 3 instructions in it.
Demonstration of the Next Idea
1 Loop: L.D F0, 0(R1)
2 L.D F6, -8(R1)
3 L.D F10, -16(R1)
4 L.D F14, -24(R1)
5 ADD.D F4, F0, F2
6 ADD.D F8, F6, F2
7 ADD.D F12, F10, F2
8 ADD.D F16, F14, F2
9 S.D F4, 0(R1)
10 S.D F8, -8(R1)
11 DADDUI R1, R1, #-32
12 S.D F12, 16(R1)
13 BNE R1, R2, LOOP
14 S.D F16, 8(R1) ;8-32=-24 (fills the branch delay slot)
Summary of the Loop Unrolling
• Determine whether it is legal to move the instructions, and adjust the offsets.
• Determine that unrolling the loop will be useful by finding whether the loop
iterations are independent.
• Use different registers to avoid unnecessary constraints.
• Eliminate the extra tests and branches.
• Determine that the loads and stores in the unrolled loop can be interchanged.
• Schedule the code, preserving any dependences needed.
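The same transformation expressed in C (a sketch; assumes n is a multiple of 4,
with distinct temporaries playing the role of the renamed registers):

    void add_scalar_unrolled(double *x, double s, int n) {
        for (int i = n - 1; i >= 3; i -= 4) {
            /* four loads first, mirroring the scheduled unrolled loop */
            double t0 = x[i],     t1 = x[i - 1];
            double t2 = x[i - 2], t3 = x[i - 3];
            x[i]     = t0 + s;    /* then the adds and stores */
            x[i - 1] = t1 + s;
            x[i - 2] = t2 + s;
            x[i - 3] = t3 + s;
        }   /* one index decrement and one branch per four elements */
    }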
Data Dependence for the Loop Unrolling Example
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4, 0(R1)
DADDUI R1,R1,#-8
L.D F6,0(R1)
ADD.D F8,F6,F2
S.D F8, 0(R1)
DADDUI R1,R1,#-8
L.D F10,0(R1)
ADD.D F12,F10,F2
S.D F12, 0(R1)
DADDUI R1,R1,#-8
L.D F14,0(R1)
ADD.D F16,F14,F2
S.D F16, 0(R1)
DADDUI R1,R1,#-8
BNE R1, R2, LOOP
Name Dependence for the Loop Unrolling Example
Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F0, -8(R1)
ADD.D F4, F0, F2
S.D F4, -8(R1)
L.D F0, -16(R1)
ADD.D F4, F0, F2
S.D F4, -16(R1)
L.D F0, -24(R1)
ADD.D F4, F0, F2
S.D F4, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, LOOP
Limitation of Gains of Loop Unrolling
• Amount of loop overhead amortized with each unroll
– Unrolled 4 times – 2 out of 14 CC are overhead → 0.5 CC of overhead per iteration
– Unrolled 8 times → 0.25 CC of overhead per iteration
– What’s the theoretically optimal amount of unrolling using the latencies shown in
slide 3?
• Growth in code size
– Large code size is not good for embedded computers
– Large code size may increase the cache miss rate
• Potential shortfall in registers created by aggressive unrolling and scheduling
– Register pressure
4.2 Static Branch Prediction
Static Branch Prediction: Using Compiler Technology
• How to statically predict branches?
– Examination of program behavior
• Always predict taken (on average, 67% are taken)
– Mis-prediction rate varies widely (9%–59%)
• Predict backward branches taken, forward branches untaken (mis-prediction
rate can reach 60%–70%)
– Profile-based predictor: use profile information collected from earlier runs
• Simplest is a single prediction bit per branch, set from the profile
• Easily extends to use more bits
• A definite win for some regular applications
Mis-prediction Rate for a Profile-based Predictor
Prediction-taken vs. Profile-based Predictor
Static Branch Prediction: Using Compiler Technology
• Useful for
– Scheduling instructions when the branch delays are exposed by the architecture
(either delayed or canceling branches)
– Assisting dynamic predictors (IA-64 architecture in Section 4.7)
– Determining which code paths are more frequent, a key step in code scheduling
4.3 Static Multiple Issue: VLIW
Overview
• VLIW (very long instruction word)
• Issue a fixed number of instructions formatted as…
– One large instruction comprising independent MIPS instructions
– – or – a fixed instruction packet with the parallelism among instructions explicitly
indicated by the instruction
• Also known as EPIC – explicitly parallel instruction computers
• Rely on compiler to…
– Minimize the potential hazard stall
– Actually format the instructions in a potential issue packet so that HW need not
check explicitly for dependencies. Compiler ensures…
• Dependences within the issue packet cannot be present
• – or – Indicate when a dependence may occur
Basic VLIW

• A VLIW uses multiple, independent functional units
• A VLIW packages multiple independent operations into one very long instruction
– The burden of choosing and packaging independent operations falls on the compiler
– The HW in a superscalar that makes these issue decisions is unneeded
• This advantage increases as the maximum issue rate grows
• Here we consider a VLIW processor whose instructions contain 5 operations:
1 integer (or branch), 2 FP, and 2 memory references
– Depends on the available FUs and the frequency of each operation
• VLIW depends on enough parallelism to keep the FUs busy
– Loop unrolling and then code scheduling
– The compiler may need to do local scheduling and global scheduling
• Techniques to enhance LS and GS will be mentioned later
– For now, assume we have a technique to generate long, straight-line code
sequences
Loop Unrolling in VLIW

• Unrolled 7 times to avoid delays
• 7 iterations in 9 clocks → 1.29 clocks per iteration
• 23 ops in 9 clocks → an average of 2.5 ops per clock
• ~60% of the FU slots are used
• Note: need more registers in VLIW
VLIW Problems – Technical
• Increase in code size
– Ambitious loop unrolling
– Whenever instructions are not full, the unused FU slots translate into wasted bits in
the instruction encoding
• An instruction may need to be left completely empty if no operation can be
scheduled
• Use clever encoding or compress/decompress
VLIW Problems – Logistical
• Synchronous vs. independent FUs
– Early VLIW – all FUs must be kept synchronized
• A stall in any FU pipeline may cause the entire processor to stall
– Recent VLIW – FUs operate more independently
• The compiler is used to avoid hazards at issue time
• Hardware checks allow for unsynchronized execution once instructions are
issued
• Binary code compatibility
– The code sequence makes use of both the instruction set definition and the detailed
pipeline structure (FUs and latencies)
– Migration between successive implementations, or between implementations with
different pipeline structures → recompilation
– Solution
• Object-code translation or emulation
• Temper the strictness of the approach so that binary compatibility is still
feasible (IA-64 in Section 4.7)
Advantages of Superscalar over VLIW
• Old code still runs
– Like those tools you have that came as binaries
– HW detects whether the instruction pair is a legal dual-issue pair
• If not, they are run sequentially
• Little impact on code density
– Don’t need to fill all of the “can’t issue here” slots with NOPs
• Compiler issues are very similar
– Still need to do instruction scheduling anyway
– Dynamic issue hardware is there, so the compiler does not have to be too
conservative
4.4 Advanced Compiler Support for Exposing and Exploiting ILP
Overview
• Discuss compiler technology for increasing the amount of parallelism that we can
exploit in a program
– Detecting and Enhancing loop-level parallelism
– Finding and eliminating dependent computations
– Software pipelining: symbolic loop unrolling
– Global code scheduling
Detect and Enhance LLP

• Loop-level parallelism: analyzed at or near the source level
– Most ILP analysis: analyzed once instructions have been generated
• Loop-level analysis
– Determine what dependences exist among the operands in a loop across the
iterations of that loop
– Determine whether data accesses in later iterations are dependent on data values
produced in earlier iterations
• Loop-carried dependence (LCD) vs. loop-level parallelism
• An LCD forces successive loop iterations to execute in series
– Finding loop-level parallelism involves recognizing structures such as loops, array
references, and induction variable computations
• The compiler can do this analysis more easily at or near the source level
Example 1
for (i=1; i <= 100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
Assume A, B, and C are distinct, non-overlapping arrays
• S1 uses a value computed by S1 in an earlier iteration (A[i])
– Loop-carried dependence
• S2 uses a value computed by S2 in an earlier iteration (B[i])
– Loop-carried dependence
• S2 uses a value computed by S1 in the same iteration (A[i+1])
– Not loop-carried
– Multiple iterations can execute in parallel, as long as the dependent statements in
an iteration are kept in order
Example 2
for (i=1; i <= 100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses a value computed by S2 in an earlier iteration (B[i+1])
– Loop-carried dependence
• Dependence is not circular
– Neither statement depends on itself, and although S1 depends on S2, S2 does not
depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– Absence of a cycle gives a partial ordering on the statements

A[1] = A[1] + B[1];
for (i=1; i <= 99; i=i+1) {
B[i+1] = C[i] + D[i];
A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
• Transform the code in the previous slide to conform to the partial ordering and expose
the parallelism
• No longer loop-carried. Iterations of the loop may be overlapped, provided the
statements in each iteration are kept in order
Detect and Eliminate Dependences
• Trick is to find and remove dependences
– Simple for single-variable accesses
– More complex for pointers, array references, etc.
• Things get easier if
– Dependences are non-cyclic – no loop-carried dependences
– The recurrence dependence distance is larger (more ILP can be exploited)
• Recurrence: a variable is defined based on the value of that variable in an
earlier iteration
• e.g., Y[i] = Y[i-5] + Y[i] (dependence distance = 5)
– Array index calculation is consistent
• Affine array indices: a×i + b, where a and b are constants
• GCD test (pp. 324) to determine whether two affine functions can take the
same value for different indices within the loop bounds (see the sketch below)
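A C sketch of the GCD test (function names are illustrative): for a store to
x[a*i+b] and a load from x[c*i+d], a dependence is possible only if gcd(a, c)
divides (d − b).

    #include <stdlib.h>

    static int gcd(int x, int y) { return y == 0 ? abs(x) : gcd(y, x % y); }

    /* Returns 1 if a dependence is possible, 0 if provably absent. */
    int gcd_test(int a, int b, int c, int d) {
        return (d - b) % gcd(a, c) == 0;
    }

    /* Example: x[2*i+3] = x[2*i] + 5.0 gives a=2, b=3, c=2, d=0;
     * gcd(2,2) = 2 does not divide -3, so no dependence is possible. */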
Detect and Eliminate Dependences – Difficulties
• Barriers to analysis
– References via pointers rather than predictable array indices
– Array indexing that is indirect through another array: x[y[i]] (non-affine)
• Common in sparse array accesses
– False dependences: for some input values a dependence may exist
• Run-time checks must be used to determine the dependent case
• In general: an NP-hard problem
– Specific cases can be done precisely
– Current problem: a lot of special cases that don’t apply often
– A good general heuristic is the holy grail
– Points-to analysis: analyzing programs with pointers (pp. 326–327)
Eliminating Dependent Computations – Techniques
• Eliminate or reduce a dependent computation by back substitution – within a basic
block and within a loop
• Algebraic simplification + copy propagation (eliminate operations that copy values
within a basic block)
– Reduces the multiple increments of array indices during loop unrolling and moves
increments across memory addresses, as in Section 4.1:

DADDUI R1, R2, #4
DADDUI R1, R1, #4      →      DADDUI R1, R2, #8

• Tree-height reduction – increase the code parallelism:

DADDU R1, R2, R3              DADDU R1, R2, R3
DADDU R4, R1, R6       →      DADDU R4, R6, R7
DADDU R8, R4, R7              DADDU R8, R1, R4
• Most compilers require that optimizations relying on associativity (e.g., tree-height
reduction) be explicitly enabled
– Integer/FP arithmetic (range and precision) may lead to rounding errors
• Optimization related to recurrences
– Recurrence: an expression whose value in one iteration is given by a function that
depends on the previous iteration
– When a loop with a recurrence is unrolled, we may be able to algebraically
optimize the unrolled loop, so that the recurrence need only be evaluated once per
unrolled iteration
• sum = sum + x → sum = sum + x1 + x2 + x3 + x4 + x5 (5 dependent
operations) → sum = ((sum + x1) + (x2 + x3)) + (x4 + x5) (3 dependent
operations)
An Example to Eliminate False Dependences
for (i=1; i <= 100; i=i+1) {
Y[i] = X[i] / C; /* S1 */
X[i] = X[i] + C; /* S2 */
Z[i] = Y[i] + C; /* S3 */
Y[i] = C – Y[i]; /* S4 */
}
After renaming (T and X1 are new arrays):
for (i=1; i <= 100; i=i+1) {
T[i] = X[i] / C; /* S1 */
X1[i] = X[i] + C; /* S2 */
Z[i] = T[i] + C; /* S3 */
Y[i] = C – T[i]; /* S4 */
}
Software Pipelining
• Observation: if iterations from loops are independent, then can get more ILP by taking
instructions from different iterations
• Software pipelining (Symbolic loop unrolling): reorganizes loops so that each iteration
is made from instructions chosen from different iterations of the original loop
• Idea is to separate dependencies in original loop body
– Register management can be tricky but idea is to turn the code into a single loop
body
– In practice both unrolling and software pipelining will be necessary due to the
register limitations
Before: Unrolled 3 times
1 L.D F0,0(R1)
2 ADD.D F4,F0,F2
3 S.D F4,0(R1)
4 L.D F6,-8(R1)
5 ADD.D F8,F6,F2
6 S.D F8,-8(R1)
7 L.D F10,-16(R1)
8 ADD.D F12,F10,F2
9 S.D F12,-16(R1)
10 DADDUI R1,R1,#-24
11 BNE R1,R2,LOOP
• Symbolic Loop Unrolling
• Maximize result-use distance
• Less code space than unrolling
• Fill & drain pipe only once per loop vs. once per each unrolled iteration in loop
unrolling
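A C rendering of the idea (a sketch, assuming n >= 2): each pass of the new loop
mixes the store, add, and load stages of three different original iterations, so no
instruction waits on a result produced in the same pass.

    void add_scalar_swp(double *x, double s, int n) {
        double loaded = x[n - 1];      /* prologue: fill the pipe once */
        double summed = loaded + s;
        loaded = x[n - 2];
        for (int i = n - 1; i >= 2; i--) {
            x[i] = summed;             /* store for iteration i   */
            summed = loaded + s;       /* add   for iteration i-1 */
            loaded = x[i - 2];         /* load  for iteration i-2 */
        }
        x[1] = summed;                 /* epilogue: drain the pipe once */
        x[0] = loaded + s;
    }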
Global Code Scheduling
• Aim to compact a code fragment with internal control structure into the shortest
possible sequence that preserves the data and control dependences
– Reduce the effect of control dependences arising from conditional non-loop
branches by moving code
• Loop branches can be solved by unrolling
• Require estimates of the relative frequency of different paths (Then…Else)
• Global code scheduling example (next slide)
– Suppose the shaded part (Then) is more frequently executed than the ELSE part
• In general, extremely complex
LD R4, 0(R1)        ; load A
LD R5, 0(R2)        ; load B
DADDU R4, R4, R5    ; add to A
SD R4, 0(R1)        ; store A
BNEZ R4, elsepart   ; test A
SD …, 0(R2)         ; store B
J join              ; jump over else
elsepart: …
X                   ; code for X
join: …             ; after if
SD …, 0(R3)         ; store C[i]
• Try to move the assignments of B and C before the test of A

• Move B
– If an instruction in X depends on the original value of B, moving B before BNEZ
will violate the data dependence
– Make a shadow copy of B before the IF and use that shadow copy in X?
• Complex to implement
• Slows down the program if the trace selected is not optimal
– Requires additional instructions to execute
• Move C
– To the THEN part → a copy of C must be put in the ELSE part
– Before the BNEZ → if it can be moved there, the copy of C in the ELSE part can
be eliminated
Move Assignment of B Before BNEZ
Before (B stored in the THEN part):
LD R4, 0(R1)       ; load A
LD R5, 0(R2)       ; load B
DADDU R4, R4, R5   ; add to A
SD R4, 0(R1)       ; store A
BNEZ R4, elsepart  ; test A
SD …, 0(R2)        ; store B
J join             ; jump over else
elsepart: …
X                  ; code for X (reloads B)
LD R5, 0(R2)
DADDU R7, R5, R6
join: …            ; after if
SW …, 0(R3)        ; store C[i]

After (B moved above the branch):
LD R4, 0(R1)       ; load A
LD R5, 0(R2)       ; load B
DADDU R4, R4, R5   ; add to A
SD R4, 0(R1)       ; store A
SD …, 0(R2)        ; store B (moved)
BNEZ R4, elsepart  ; test A
J join             ; jump over else
elsepart: …
X                  ; code for X (reloads B)
LD R5, 0(R2)
DADDU R7, R5, R6
join: …            ; after if
SW …, 0(R3)        ; store C[i]
Factors in Moving B
• The compiler will consider the following factors
– Relative execution frequencies of THEN and ELSE
– Cost of executing the computation and assignment of B above the branch
• Are there any empty instruction issue slots or stalls above the branch?
– How will the movement of B change the execution time of THEN?
– Is B the best code fragment that can be moved? How about C or others?
– The cost of the compensation code that may be necessary for ELSE
Trace Scheduling
• Useful for processors with a large number of issues per clock, where conditional or
predicated execution (Section 4.5) is inappropriate or unsupported, and where loop
unrolling may not be sufficient by itself to uncover enough ILP
• A way to organize the global code motion process, so as to simplify the code
scheduling by incurring the costs of possible code motion on the less frequent paths
• Best used where profile information indicates significant differences in frequency
between different paths and where the profile information is highly indicative of
program behavior independent of the input
• Parallelism across conditional branches other than loop branches
• Looking for the critical path across conditional branches
• Two steps:
– Trace selection: find a likely sequence of multiple basic blocks (a trace)
forming a long sequence of straight-line code
– Trace compaction
• Squeeze trace into few VLIW instructions
• Move trace before the branch decision
• Need bookkeeping code in case prediction is wrong
• Compiler undoes bad guess (discards values in registers)
• Subtle compiler bugs mean wrong answer vs. poor performance; no hardware
interlocks
• Simplify the decisions concerning global code motion
• Branches are viewed as jumps into or out of the selected trace (the most probable path)
• Additional bookkeeping code will often be needed at the entry and exit points
– The trace should be so much more probable than the alternatives that the cost of
the bookkeeping code need not be a deciding factor
• Trace scheduling is good for scientific code
– Intensive loops and accurate profile data
• But it is unclear whether this approach is suitable for programs that are less simply
characterized and less loop-intensive
Superblocks
• Drawback of trace scheduling: the entries and exits into the middle of the trace cause
significant complications
– Compensation code, and hard to assess their cost
• Superblocks – similar to trace, but
– Single entry point but allow multiple exits
• In a loop that has a single loop exit based on a count, the resulting superblocks have
only one exit
• Use tail duplication to create a separate block corresponding to the portion of the trace
after the entry
Some Things to Notice
• Useful when the branch behavior is fairly predictable at compile time
• Not totally independent techniques
– All try to avoid dependence induced stalls
• Primary focus
– Unrolling: reduce loop overhead of index modification and branch
– SW pipelining: reduce single body dependence stalls
– Trace scheduling/superblocks: reduce impact of branch walls
• Most advanced compilers attempt all
– Result is a hybrid which blurs the differences
– Lots of special case analysis changes the hybrid mix
• All tend to fail if branch prediction is unreliable
4.5 Hardware Support for Exposing More Parallelism at Compiler Time
Conditional or Predicated Instructions
• Most common form is move
• Other variants
– Conditional loads and stores
– ALPHA, MIPS, SPARC, PowerPC, and the P6 all have simple conditional moves
– IA-64 supports full predication for all instructions
• Effect is to eliminate simple branches
– Moves the dependence resolution point from early in the pipe (branch resolution) to
late in the pipe (register write) → forwarding is more possible
– Also changes a control dependence into a data dependence
• Net win since in global scheduling the control dependence fence is the key limiting
complexity
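A small C illustration of the control-to-data conversion (illustrative only; whether
a conditional move is emitted depends on the compiler and target):

    /* Branch form: compiled as a test plus a branch around a move. */
    int with_branch(int a, int s, int t) {
        if (a == 0) s = t;
        return s;
    }

    /* Predicated form: a conditional select, typically emitted as a
     * CMOVZ/CSEL-style instruction with no branch at all. */
    int with_cmov(int a, int s, int t) {
        s = (a == 0) ? t : s;
        return s;
    }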
Conditional Instruction in SuperScalar
First slot (Mem)        Second slot (ALU)
LW R1, 40(R2)           ADD R3, R4, R5
                        ADD R6, R3, R7
BEQZ R10, L
LW R8, 20(R10)
LW R9, 0(R8)

• Wastes a memory operation slot in the 2nd cycle
• Data dependence stall if not taken

First slot (Mem)        Second slot (ALU)
LW R1, 40(R2)           ADD R3, R4, R5
LWC R8, 20(R10), R10    ADD R6, R3, R7
BEQZ R10, L
LW R9, 0(R8)
Conditional Instruction Limitations
• Precise Exceptions
– If an exception happens prior to conditional evaluation, it must be carried through
the pipe
– Simple for register accesses but consider a memory protection violation or a page
fault
• Long conditional sequences – an if-then with a big then body
– If the task to be done is complex, it is better to evaluate the condition once and
then do the big block
• Conditional instructions are most useful when the condition can be evaluated early
– If there is a data dependence in determining the condition → they help less
• Wasted resources
– Conditional instructions consume real resources
– This tends to work out well in the superscalar case
• In our simple 2-way model → even with no conditional instruction, the other
resource is wasted anyway
• Cycle-time or CPI issues
– Conditional instructions are more complex
– The danger is that they may consume more cycles or a longer cycle time
– Note that their utility is mainly to fix short control flows
• Hence their use may not be for the common case
• Things had better not slow down the real common case to support the
uncommon case
Compiler Speculation with HW Support
• Ideal view
– Do conditional things in advance of the branch (and before the condition
evaluation)
– Nullify them if the branch goes the wrong way
– Also implies the need to nullify exception behavior as well
• Limits
– Speculated values can’t clobber any real results
– Exceptions can not cause any destructive activity
To Speculate Ambitiously…
• Ability of the compiler to find instructions that can be speculatively moved and not
affect the program data flow
• Ability of HW to ignore exceptions in speculated instructions, until we know that such
exceptions should really occur
• Ability of HW to speculatively interchange loads and stores, or stores and stores,
which may have address conflicts
HW Support for Preserving Exception Behavior
• How to make sure that a mis-predicted speculated instruction (SI) can not cause an
exception
• Four methods
– HW and OS cooperatively ignore exceptions for SI
– SI that never raise exceptions are used, and checks are introduced to determine
when an exception should occur
– Poison bits are attached to the result registers written by SI when SI cause
exceptions. The poison bits cause a fault when a normal instruction attempts to use
the register
– A mechanism to indicate that an instruction is speculative, and HW buffers the
instruction result until it is certain that the instruction is no longer speculative
Exception Types
• Indicate a program error and normally cause termination
– Memory protection violation…
– Should not be handled for SI when misprediction
– Exceptions cannot be taken until we know the instruction is no longer speculative
• Handled and normally resumed
– Page fault…
– Can be handled for SI just if they are normal instructions
• Only have negative performance effect when misprediction
HW-SW Cooperation for Speculation
• Return an undefined value for any terminating exception
– The program is allowed to continue, but may generate incorrect results
– If the excepting instruction is not speculative → the program is in error
– If the excepting instruction is speculative → the program is correct, and the
speculative result will simply be unused (no harm)
– Never causes a correct program to fail, no matter how much speculation
– An incorrect program, which formerly might have received a terminating exception,
will get an incorrect result
• Acceptable if the compiler can also generate a normal version of the program
(which does not speculate, and receives a terminating exception)
• Example: if (A==0) A=B; else A=A+4;  (A is at 0(R3), B is at 0(R2))

Straightforward code:
      LD R1, 0(R3)      ; load A
      BNEZ R1, L1       ; test A
      LD R1, 0(R2)      ; then clause: A=B
      J L2              ; skip else
L1:   DADDI R1, R1, #4  ; else clause: A=A+4
L2:   SD R1, 0(R3)      ; store A

• Compiler-based speculation:
      LD R1, 0(R3)      ; load A
      LD R14, 0(R2)     ; speculatively load B
      BEQZ R1, L3       ; other branch sense
      DADDI R14, R1, #4 ; else clause
L3:   SD R14, 0(R3)     ; store A
– R14 is used to avoid destroying R1 when B is loaded
– No need to know which instruction is speculative

• Non-terminating speculative instructions + exception checking:
      LD R1, 0(R3)      ; load A
      sLD R14, 0(R2)    ; speculative, non-terminating load of B
      BNEZ R1, L1       ; test A
      SPECCK 0(R2)      ; speculation check
      J L2
L1:   DADDI R14, R1, #4 ; else clause
L2:   SD R14, 0(R3)     ; store A

• sLD: speculative load without termination
• SPECCK: speculation checking
• Note:
– Requires maintaining a basic block for the THEN case
– Checking for a possible exception requires extra code
Poison Bits
• Track exceptions as they occur, but postpone any terminating exception until a value
is actually used
• Incorrect programs that caused termination without speculation will still cause
exceptions when instructions are speculated
• A poison bit for every register, plus a bit to mark each instruction as speculative
• The poison bit of the destination register is set when an SI results in a terminating
exception
– All other exceptions are handled immediately
• An SI that uses a poisoned register → its destination register is poisoned
• Fault if a regular instruction tries to use a poisoned register
Poison-Bit Example
LD R1, 0(R3)      ; load A
sLD R14, 0(R2)    ; spec-load B; if it excepts → R14 poisoned
BEQZ R1, L3       ; other branch sense
DADDI R14, R1, #4 ; else
L3: SD R14, 0(R3) ; store A; R14 poisoned → SD faults

• If the sLD generates a terminating exception, the poison bit of R14 is turned on.
When the SD occurs, it raises an exception if the poison bit for R14 is on.
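A C sketch of the poison-bit bookkeeping (register-file view; names and sizes are
assumptions):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint64_t reg[32];
    static int poison[32];

    /* Speculative write: defer any terminating fault by poisoning the dest. */
    void spec_write(int dest, uint64_t val, int faulted, int src_poisoned) {
        reg[dest] = val;
        poison[dest] = faulted || src_poisoned;   /* poison propagates */
    }

    /* Regular read: a poisoned source raises the deferred fault now. */
    uint64_t normal_read(int src) {
        if (poison[src]) {
            fprintf(stderr, "deferred terminating exception\n");
            exit(1);
        }
        return reg[src];
    }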
Boosting
• Boosting
– How to deal with exception? Similar to Poison Bits?
– Reduce # of registers used
– Provide separate shadow resources for boosted instruction results
– If condition resolves selecting the boosted path
• Then these results are committed to the real registers
LD R1, 0(R3) ; load A
LD+ R1, 0(R2) ; Boosted load B. Result is never written to
; R1 if branch is not taken
BEQZ R1, L3 ; other bran.
DADDI R1, R1, #4 ; else
L3: SD R1, 0(R3) ; store A
HW (and OS) Ignores Exceptions Until the Instruction Commits

• Rely on a hardware mechanism that operates like a ROB
• Instructions are marked by the compiler as speculative and include an indicator of how
many branches the instruction was speculatively moved across and what branch action
(taken/not taken) the compiler assumed
• The original location of SI is marked by a sentinel, which tells HW that earlier SI is no
longer speculative and values may be committed
• All instructions are placed in ROB when issued and are forced to commit in order, as
in HW speculation
– Notice: no actual dynamic speculative branch prediction or dynamic scheduling
occurs
• ROB tracks when instructions are ready to commit and delays the write-back portion
of any SI
• SI are not allowed to commit until the branches that have been speculatively moved
over are also ready to commit, or, alternatively, until the corresponding sentinel is
reached
– We know whether SI should have been executed or not
• If a ready-to-commit SI should have been executed and it generated a terminating
exception, then we know that the program should be terminated; otherwise, exceptions
are ignored
HW Support for Memory Reference Speculation
• Try to move loads across stores → must check for address conflicts
• The HW uses a special instruction to check for address conflicts
• The special instruction is left at the original location of the load (acting like a
guardian), and the load is moved up across one or more stores
• When the speculated load is executed, the HW saves the address of the accessed
memory location
• If a subsequent store changes that location before the check instruction, the
speculation has failed
• Speculation failure handling
– If only the load instruction was speculated, redo the load at the point of the check
instruction
– If additional instructions that depended on the load were also speculated, then a
fix-up sequence that re-executes all the SI starting with the load is needed
• Penalties!!
4.6 HW Versus SW Speculation Mechanisms

• To speculate extensively, we must be able to disambiguate memory references →
easier in HW (Tomasulo)
• HW speculation works better when control flow is unpredictable, and when HW
branch prediction is superior to SW branch prediction done at compile time
– Misprediction rates of 16%/10% (SW/HW) for 4 major integer SPEC92 programs
• HW speculation maintains a completely precise exception model for SI
• HW speculation does not require compensation or bookkeeping code, needed by
ambitious SW speculation
• HW speculation with dynamic scheduling does not require different code sequences to
achieve good performance for different implementation of an architecture
• HW speculation requires complex, additional HW resources
• Some designers have tried to combine the dynamic and compiler-based approaches to
achieve the best of each