Micro Review

CHAPTER 13
60 MHz & 66 MHz Pentiums 273-pin PGA; 1st to be
introduced to the computing market; 5 V
296-pin PGA allows hardware control over the ratio of bus
speed to processor speed, to support motherboard designs that
may not be capable of operating at the clock speed of the
processor; 3.3 V, which reduces power consumption
CISC (complex instruction set computer) - most instructions
require multiple clock cycles
RISC (reduced instruction set computer) solution to the
performance bottleneck; smaller and fewer instructions, simpler
addressing modes; large set of registers; easier to implement in
silicon; faster performance
Reduce access to main memory
Simple instructions & addressing modes
Make good use of registers
Pipeline, utilize the compiler extensively
Cache high-speed RAM (data & address of the data are stored);
10x faster than main memory; 10 ns
Hit one of the addresses stored in the cache matches the
address being used for the memory read
Miss required instruction or data is not found in the internal
cache
Instruction cache used to store frequently used instructions
Data cache used to store frequently used data
Fewer instruction & add. modes reduces the complexity of
the instruction decoder, addressing logic & execution unit;
allows the machine to be clocked at a faster speed, less work
needs to be done each clock period
T2P continues the bus cycle started in T12

TD used to insert a dead state between 2 consecutive cycles
(read then write or vice versa), in order to give the system bus
time to change states
Single-transfer cycle used to transfer up to 8 bytes of noncacheable data between the processor & memory
Wait states used to give the memory and I/O systems
additional time to complete the read or write request
Burst cycle used to read and write 32 bytes for line loads &
WB
8 number of floating-point registers contained in the Pentium;
number of bytes that can be transferred every clock cycle
Locked operations used to lock the bus
Atomic operation cannot be broken down into smaller
suboperations, as in a semaphore update.
Semaphore special type of counter variable that must be read,
updated & stored in 1 single, uninterruptible operation; read
then write; no other device may take over control while this is
taking place.
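As a software analogy for the read-then-write update of a semaphore, here is a minimal Python sketch. The Counter class, the run helper and the thread counts are made up for illustration; on the Pentium the guarantee comes from locking the bus during the read-modify-write, not from a library lock.

```python
import threading


class Counter:
    """Hypothetical shared counter standing in for a semaphore variable."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()  # plays the role of the bus lock

    def increment(self):
        # Read, update and store as one step that no other thread can split.
        with self._lock:
            current = self.value
            self.value = current + 1


def run(threads=4, per_thread=1000):
    """Increment one shared counter from several threads and return its value."""
    counter = Counter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value
```

Because every update happens under the lock, run() returns threads * per_thread regardless of how the threads are scheduled.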
BOFF input provides a way for another processor in a
multiprocessor system to instantly take over the Pentium's
buses.
Bus hold (HOLD) input provides a 2nd way for a different bus master to take
control of the Pentium's buses; the Pentium completes the current bus cycle
& then tri-states its buses.
HLDA output indicates when the Pentium is in the HOLD state.
External circuitry responsible for supplying the 8-bit interrupt
vector number on D0 through D7.
Pipelining technique used to enable 1 instruction to complete
w/ each clock cycle; parallel
Cache flush the caches are flushed, performing WB for any
lines that have been changed.
Branch prediction helps to identify possible interruptions to
the normal flow of instructions through the U and V pipelines
Flush acknowledge cycle run when all WB complete to
inform external circuitry that the caches are flushed.
Pentium bus logic indicates the type of bus cycle currently
starting with the use of its cycle definition signals
Shutdown cycle - run if the Pentium detects an internal parity
error; the processor stays shut down until it receives NMI, INIT, or RESET.
Special cycle bus cycle that requires additional decoding & uses the byte
enable outputs for selection
HALT similar to shutdown, but the INTR signal may also be used to
resume execution.
Ti idle state, no bus cycle is currently running
WB & locked bus cycles not pipelined.
T1 valid address is output on the address lines and ADS is
made active (low)
Inquire cycle - used to maintain cache coherency in a
multiprocessor system.
T2 data is read or written, BRDY input is examined
Bus snooping watches the system bus (address, data, control) in a
multiprocessor system; bidirectional; used to help maintain
consistent data in a multiprocessor system where each
processor has a separate cache.
T12 entered when a 2nd bus cycle is started before the 1st one
completes
Superscalar machine capable of parallel execution of
multiple instructions.
Data dependency exists if the 2nd instruction reads an operand
written by the 1st instruction (RAW, read after write); the other
cases are write after write (WAW) and write after read (WAR)
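The three dependency types can be checked mechanically. The sketch below is only an illustration: representing an instruction as a dict of 'reads' and 'writes' register sets is an assumption made here, not something from the notes.

```python
def classify_dependency(first, second):
    """Return the set of dependencies between two instructions in program order.

    Each instruction is a dict with 'reads' and 'writes' sets of register
    names (a made-up representation for this sketch).
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")  # 2nd reads what the 1st writes
    if first["writes"] & second["writes"]:
        hazards.add("WAW")  # both write the same register
    if first["reads"] & second["writes"]:
        hazards.add("WAR")  # 2nd overwrites what the 1st reads
    return hazards


# mov eax, 5   : writes eax
mov = {"reads": set(), "writes": {"eax"}}
# add ebx, eax : reads eax and ebx, writes ebx
add = {"reads": {"eax", "ebx"}, "writes": {"ebx"}}
```

With these two instructions, mov followed by add is a RAW dependency on eax, while add followed by mov is a WAR dependency on eax.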
Writethrough write to cache & main memory / RAM
Compiler helps keep the Pentium's superscalar architecture
running at full speed.
LRU the least recently used cache entry is not likely to be used in the
future, so it may be replaced.
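A minimal sketch of LRU replacement, assuming a tiny fully associative cache tracked in software. The LRUCache class and its access method are hypothetical names for illustration; real hardware keeps LRU bits per set rather than an ordered list.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first

    def access(self, tag):
        """Return True on a hit, False on a miss (the entry is then loaded)."""
        if tag in self.entries:
            self.entries.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the LRU entry
        self.entries[tag] = None
        return False
```

With a capacity of 2, the access sequence A, B, A, C evicts B (not A), because A was touched more recently.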
U pipeline executes any processor instruction
Direct-mapped cache uses a portion of the incoming
physical address to select an entry.
Out-of-date main memory updated with the correct data
from cache.
V pipeline only executes simple instructions

***128 entries in cache, which has a 32B line size
PF stage instructions are fed into it after being prefetched from the
instruction cache or memory.
Index bits used to select 1 of 128 entries in cache
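To make the index bits concrete, here is a small sketch that splits an address into tag, index and byte offset. It uses the 128 entries and 32B line size from these notes; the 32-bit example address and the name split_address are assumptions for illustration.

```python
LINE_SIZE = 32   # bytes per cache line (from the notes)
NUM_SETS = 128   # entries selected by the index bits (from the notes)

OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 5 bits pick a byte within the line
INDEX_BITS = NUM_SETS.bit_length() - 1    # 7 bits pick 1 of 128 entries


def split_address(addr):
    """Split a physical address into (tag, index, offset)."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)  # remaining upper bits
    return tag, index, offset
```

For a 32-bit address this leaves 20 tag bits. With two lines per index (two-way set associative), 128 x 2 x 32 bytes gives the 8KB cache size.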
D1 determines if the current pair of instructions can be executed
together.
Fully associative cache uses large tags & does not select an
entry based on index bits
Prefix byte only executes in the U pipeline; cannot be paired.
***fully associative: the upper address bits are compared with every tag
D2 - addresses for operands that reside in memory are calculated
EX stage operands are read from the data cache or memory,
ALU operations are performed & branch predictions for
instructions are verified.
ACRONYMS
CHAPTER 13
PGA Pin Grid Array
CISC Complex Instruction Set Computer
RISC Reduced Instruction Set Computer
RAM Random Access Memory
WB Writeback
ALU Arithmetic Logic Unit
BTB Branch Target Buffer
LRU Least Recently Used
TLB Translation Lookaside Buffer
MESI Modified/Exclusive/Shared/Invalid
INVD Invalidate Cache
INVLPG Invalidate TLB Entry
WBINVD Writeback and Invalidate Cache

WB stage writes the results of the completed instructions & verifies
conditional branch instruction predictions; cache only.
***if U stalls, so does V; if V stalls, U may continue
Program transfer instructions change the instruction sequence, causing
all instructions that entered the pipeline after them to become invalid
Bubbles no work is done as the pipeline stages are reloaded
Dynamic branch prediction used to avoid bubbles
BTB special cache that stores the instruction & target address of
any branch instructions that have been encountered in the
instruction stream; stores 2 history bits
History bits initially set to 11 when a new target address is placed
into the BTB
00 repeated NT (not taken) / failures
Two 32B prefetch buffers work w/ the BTB & D1 to keep a steady stream of
instructions flowing into the pipeline
Conditional jumps used to perform loops
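The 2 history bits kept in the BTB can be pictured as a small counter. The sketch below is a textbook 2-bit saturating counter, not a confirmed model of the Pentium's exact state machine: the starting value 11 and the meaning of 00 come from the notes, while the rule "predict taken when the value is 10 or 11" is an assumption.

```python
class HistoryBits:
    """Textbook 2-bit saturating counter for one BTB entry (illustrative)."""

    def __init__(self):
        self.bits = 0b11  # new BTB entries start at 11 (from the notes)

    def predict_taken(self):
        # Assumption: the upper half of the range predicts "taken".
        return self.bits >= 0b10

    def update(self, taken):
        """Move toward 11 when the branch is taken, toward 00 when it is not."""
        if taken:
            self.bits = min(self.bits + 1, 0b11)
        else:
            self.bits = max(self.bits - 1, 0b00)
```

A loop branch that is taken many times and falls through once only drops from 11 to 10, so it is still predicted taken the next time the loop runs.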
2 Types of Pipeline
***U & V
***Pipeline that performs special types of bus cycles
On-chip cache - used to feed instructions and data to the
CPU's pipeline
External cache 2nd level cache
Hit ratio percentage of hits to total cache accesses
5 Types of Bus Cycle

M/IO
D/C
W/R
CACHE
KEN
Miss ratio percentage of misses to total cache accesses

Line of data group of locations from the main memory
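A quick worked version of the hit ratio and miss ratio definitions. The function name and the sample counts are made up for illustration.

```python
def hit_and_miss_ratio(hits, misses):
    """Return (hit ratio, miss ratio) as percentages of total cache accesses."""
    total = hits + misses
    if total == 0:
        raise ValueError("no cache accesses recorded")
    return 100.0 * hits / total, 100.0 * misses / total
```

For example, 90 hits and 10 misses out of 100 accesses give a 90% hit ratio and a 10% miss ratio; the two always add up to 100%.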
3 Main Design of Cache Organization

***direct mapped
***fully associative
***set associative
6 Bus Cycle State
Ti
T1
T2
T12
T2P
TD
***3rd port is used for bus snooping

***instruction cache is write protected to prevent
self-modifying code
Split-line accesses reading the upper half of 1 line & the
lower half of the next line at the same time.
Parity bits used in each cache to help maintain data integrity
The Five-stage U and V Pipeline

PF
D1
D2
EX
WB
***one parity bit for every 8B
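How a parity bit helps maintain data integrity can be shown in a few lines. This sketch assumes even parity computed over a chunk of bytes; the notes do not say which parity sense the Pentium caches use, and the function names are hypothetical.

```python
def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    ones = sum(bin(b).count("1") for b in data)
    return ones & 1


def parity_ok(data: bytes, stored_parity: int) -> bool:
    """Check a chunk read back from the cache against its stored parity bit."""
    return even_parity_bit(data) == stored_parity
```

A single flipped bit changes the count of 1 bits from even to odd (or the reverse), so the check fails. Two flipped bits cancel out, which is why parity only detects errors and cannot correct them.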

Translation lookaside buffers used to translate virtual
address into physical address
Virtual address logical or linear address
2 TLB
***four-way set associative with 64 entries
***four-way set associative with 8 entries
3 Cache Instructions
***INVD
***INVLPG
***WBINVD
Physical address used to access the cache because the same
address is used to access main memory
Four-way set associative with 64 entries translates addresses for
4KB pages
Four-way set associative with 8 entries used to handle 4MB
pages.
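The translation a TLB performs can be sketched as a page-number lookup. The dict used as the TLB, the function names and the example frame number below are all assumptions for illustration; a real TLB is set-associative hardware, and a miss triggers a page-table walk that is not modeled here.

```python
PAGE_4KB = 4 * 1024
PAGE_4MB = 4 * 1024 * 1024


def split_linear(addr, page_size):
    """Split a linear address into (page number, offset within the page)."""
    return addr // page_size, addr % page_size


def translate(addr, page_size, tlb):
    """Return the physical address, or None on a TLB miss.

    tlb is a plain dict mapping page number -> frame number (hypothetical).
    """
    page, offset = split_linear(addr, page_size)
    frame = tlb.get(page)
    if frame is None:
        return None  # miss: the page tables would have to be consulted
    return frame * page_size + offset
```

Only the page number is translated; the offset passes through unchanged. That offset is 12 bits for 4KB pages and 22 bits for 4MB pages.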
Three Features of Pentium 4

HYPER-PIPELINED TECHNOLOGY
RAPID EXECUTION ENGINE
EXECUTION TRACE CACHE
***both TLB are parity protected and dual ported
3 Divisions of SSE Instructions

SIMD-FP
NEW MEDIA
STREAMING MEMORY
Modified current line has been modified (does not match
main memory) & is only available in a single cache. Writing to
this line leaves its state as modified.
4 MESI States
MODIFIED
EXCLUSIVE
SHARED
INVALID
MESI cache-consistency protocol that uses 2 bits stored with
each line of data to keep track of the state of the cache line
Exclusive current line has not been modified (matches main
memory) & is only available in a single cache. Writing to this
line changes its state to modified.
Shared copies of the current line may exist in more than one
cache. A write to this line causes a writethrough to main memory &
may invalidate the copies in the other caches.
The Eight-stage FPU Pipeline

PF
D1
D2
EX
X1
X2
WF
ER
Invalid current line is empty; any access is a miss. A write to this line causes a
writethrough to main memory.
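The effect of a write on each of the 4 MESI states, as described in these notes, can be summarized in a small function. This is a sketch only: leaving a Shared or Invalid line in its current state after the writethrough is an assumption made here, and snooping by other processors is not modeled.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def local_write(state):
    """Return (next_state, writethrough_needed) for a write by the owning CPU."""
    if state in (State.MODIFIED, State.EXCLUSIVE):
        # The line lives in this cache only, so the write stays in the cache.
        return State.MODIFIED, False
    # Shared or Invalid: the write goes through to main memory.
    # Keeping the state unchanged is an assumption of this sketch.
    return state, True
```

Only lines held by a single cache (Modified or Exclusive) can absorb a write without bus traffic; that is the benefit of tracking the state with 2 bits per line.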
Set-associative cache compromise between direct-mapped &
fully associative designs; 2, 4, 8 or more ways; combines the simple
tag comparator hardware of the direct-mapped cache w/ the flexible
data caching capability of the fully associative cache
WBINVD 1st writes back any updated cache entries, then invalidates them
Two-way set associative cache 2 entries per set
PF prefetch
***256 entries per cache, 64 bytes per set, 8KB of storage per
cache
D1 instruction decode
INVD erases all info in the data cache
INVLPG invalidates the TLB entry associated w/ a supplied memory operand;
necessary when implementing a paged memory system.
***FPU-8 stages
D2 address generation
Triple ported tags in the data cache can be accessed from
three different places at the same time.
EX execute, cache & ALU access; memory & reg read (floating point)
X1 floating-point execute stage 1; memory data
converted into internal floating-point format, operand written to the
floating-point reg file, bypass 1 (sends data back to EX)
X2 floating-point execute stage 2
WF round floating-point result & write to floating-point reg
file, bypass 2 (sends data back to EX stage)
ER error reporting, update status word
Forwarding prevents the pipeline from stalling while waiting for the reg
file to provide a copy of ST
***floating-point reg file: eight 80-bit floating-point registers,
ST(0) - ST(7)
CHAPTER 15
Pentium Pro 14-stage superpipeline, internal level-2 cache
(operates at the speed of the processor core), 4 additional address
lines (36-bit address bus; 64GB physical address space), supports up to 4 Pentium Pro
CPUs in one system
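The 64GB figure follows directly from the number of address lines. A short check, using a hypothetical helper name:

```python
GB = 2 ** 30  # bytes in one gigabyte (binary)


def address_space_bytes(address_lines):
    """Each extra address line doubles the number of addressable bytes."""
    return 2 ** address_lines
```

With 32 address lines the physical space is 4GB; the 4 additional lines multiply that by 2^4 = 16, giving 64GB.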
DIB (dual independent bus) allows the Pentium Pro to access data in external
memory (over the system bus) while simultaneously accessing
the internal level-2 data cache.
Register renaming used to map temporary result registers to
the actual processor registers
Dispatch execution unit a reservation station controls the
flow of data through the integer, floating point, MMX &
load/store units
Reservation station contains multiple entries, w/ each entry
Cache management instructions used in prefetching &
maintaining cache data
128-bit XMM register capable of storing a 128b integer or
four 32-bit floating-point numbers
SIMD-FP allows the programmer to convert between int &
floating-point numbers, as well as perform floating-point reciprocal
& sq. root
New media these new instructions give the programmer ways
to organize data for SIMD computation
Streaming memory instructions allow the programmer to
specify the cacheability of a data object for improved
performance
*** Pentium 4 uses a 20-stage pipeline
ECC detects & corrects single-bit errors on the data bus &
detects bit errors on the address bus & control signals
Hyper-pipelined technology allows the use of minimal logic
in each stage and a higher clock speed for the pipeline
MMX Intel's name for a set of new instructions designed to improve
the performance of applications utilizing graphics &
communication algorithms.
***rapid execution engine ALUs are clocked at twice
***Pentium II 242-pin cartridge called SECC
Execution trace cache part of the processor's level-1 cache;
holds copies of decoded instructions once they have been
fetched & decoded into micro-ops
SECC contains the processor core & level-2 cache mounted
on an aluminum substrate
Fetch/decode unit fetches instructions in program order
from the instruction cache & breaks the IA-32 instructions into
smaller micro-ops
Micro-ops typically 3 to an instruction (2 for source operands, 1 for
the destination)
***complex IA-32 instructions require more than 3 micro-ops
Micro-____ instruction sequencer emits pre-designed microops sequence
Reorder buffer inst pool; stores the micro ops & their
execution properties
Hyperthreading ability of the Pentium 4 to run two threads of
ops simultaneously
Thread small portion of a program
EM64T allows 64-bit computing on integers, pointers, and
registers; extends the virtual address space to 1 TB (1024GB) w/ a flat 64b model
***OS support for EM64T: Windows 64b client OS release, Red
Flag ver. 4.1, Novell Linux Desktop 9, Red Hat DT 3.0
Execute Disable Bit new security feature of the Pentium 4 that
helps prevent malicious buffer overflow attacks, a common
computer security violation technique
Celeron offered as a lower-cost alternative to the Pentium II
Celeron D contains support for EM64T & the Execute
Disable Bit, as well as SSE3; similar in architecture to the
Pentium 4
Xeon Intel's server & workstation powerhouse
Intel's GMA 900 newest generation of accelerated graphics
technology, containing support for DirectX 9, OpenGL 1.4 &
3D graphics w/ 4 pixel pipelines for superfast 1.3 Gpixel/sec
rendering speed
Hotspots - public access Wi-Fi networks
WEP - the original security standard for wireless transmission
WPA successor of WEP
ICE specially designed interface inserted between the CPU &
the motherboard socket in order to tap into every pin of the
processor.
ICE control module connects the interface to a PC running
emulation software
PC running emulation software able to monitor & control
the test CPU & system
***Xeon dual-processor, 0.13 micron process, 32b processing
***Xeon & Xeon MP 4.8GB/sec or higher I/O BW
***Xeon, Xeon MP, Itanium 2 (dual core) allow true
simultaneous execution of 2 threads at the same time.
Concurrent execution all threads complete execution over a
period of time (taking turns)
Simultaneous execution threads execute in hardware at
the same time w/o sharing resources.
EPIC utilizes the compiler extensively to rearrange instructions from
the original program to extract & exploit parallelism
Loop unrolling duplicates the inst found w/in a loop in such

a way that 1 pass through the new loop produces 2 or more
results.
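A minimal sketch of loop unrolling by a factor of 2. On EPIC this rewriting is done by the compiler on machine instructions; the Python functions below are hypothetical and only show the shape of the transformation.

```python
def sum_rolled(values):
    """Original loop: 1 result (one addition) per pass."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled(values):
    """Unrolled loop: the body is duplicated, so 1 pass produces 2 results."""
    total = 0
    n = len(values)
    i = 0
    while i + 1 < n:
        total += values[i]      # 1st copy of the loop body
        total += values[i + 1]  # 2nd copy of the loop body
        i += 2
    if i < n:                   # leftover element when the length is odd
        total += values[i]
    return total
```

The unrolled version does the same work with half as many loop tests and branches, and the two copies of the body give the hardware independent instructions to run in parallel.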
Predication true & false sections of code are both executed, along
w/ the condition testing code, in order to keep the pipeline busy
Speculation or speculative loading data is preloaded before
it is needed, while the other instructions are also being executed
Preloading used to hide the latency encountered when
reading data from memory
***Itanium 128 64b gen purpose reg & 128 floating point
reg
***Important factors that determine power consumption:
clock speed and operating voltage
