
CHAPTER 13

60 MHz & 66 MHz Pentiums 273-pin PGA; 1st to be
introduced to the computing market; 5 V
296-pin PGA allows hardware control over the ratio of bus
speed to processor speed, to support motherboard designs that
may not be capable of operating at the clock speed of the
processor; 3.3 V, reduced power consumption
CISC (Complex instruction set computer) - requires multiple
clock cycles
RISC (Reduced instruction set computer) solution to the
performance bottleneck; smaller and fewer instructions, simpler
addressing modes; large set of registers; easier to implement in
silicon; faster performance
Reduce access to main memory
Simple instructions & addressing modes
Make good use of registers
Pipeline, utilize the compiler extensively
Cache high-speed RAM (data & add. of the data are stored);
10x faster than main memory; 10ns
Hit one of the addresses stored in the cache matches the
address being used for the memory read
Miss required instruction or data is not found in the internal
cache
Instruction cache used to store frequently used instructions
Data cache used to store frequently used data
Fewer instructions & add. modes reduce the complexity of
the instruction decoder, addressing logic & execution unit;
allows the machine to be clocked at a faster speed, since less work
needs to be done each clock period

T2P continues the bus cycle started in T12


TD used to insert a dead state between 2 consecutive cycles
(read then write or vice versa), in order to give the system bus
time to change states
Single-transfer cycle used to transfer up to 8 bytes of noncacheable data between the processor & memory
Wait states used to give the memory and I/O systems
additional time to complete the read or write request
Burst cycle used to read and write 32 bytes for line loads &
WB
8 number of floating-point registers contained in the Pentium;
number of bytes that can be transferred in one chunk every clock cycle
Locked operations used to lock the bus
Atomic operation cannot be broken down into smaller
suboperations, as with a semaphore.
Semaphore special type of counter var that must be read,
updated & stored in 1 single, uninterruptible operation; read
then write; no other device may take over control while it is
taking place.
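
A minimal sketch of why the read, update & store steps must form one atomic operation. Python's `threading.Lock` stands in for the locked bus here; `counter`, `bus_lock` and `locked_increment` are illustrative names, not from the notes.

```python
import threading

counter = 0                    # the semaphore-style counter variable
bus_lock = threading.Lock()    # stand-in for the locked bus (assumption)

def locked_increment(times):
    """Read, update and store the counter as one uninterruptible step."""
    global counter
    for _ in range(times):
        with bus_lock:         # no other thread may take over in here
            value = counter    # read
            value += 1         # update
            counter = value    # store

# Four threads compete for the same counter.
threads = [threading.Thread(target=locked_increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every read-update-store sequence runs under the lock, no increment is lost, whichever way the threads interleave.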
BOFF input provides a way for another processor in a
multiprocessor system to instantly take over the Pentium's
buses.
Bus hold input (HOLD) provides a 2nd way for a diff bus master to take
control of the Pentium's buses; the Pentium completes the current bus cycle
& then tri-states its buses.
HLDA output indicates when the Pentium is in the HOLD state.
External circuitry responsible for supplying the 8-bit interrupt
vector # on D0 through D7.

Pipelining technique used to enable 1 instruction to complete
w/ each clock cycle; parallel
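
A rough sketch of the cycle counts behind "1 instruction completes w/ each clock cycle". It assumes an ideal pipeline (every stage takes 1 clock, no stalls or bubbles); the function names are illustrative.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through every stage before the next starts."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    """The first instruction fills the pipeline; after that,
    one instruction completes on every clock (ideal case, no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

# 100 instructions through a 5-stage pipeline
print(cycles_unpipelined(100, 5))  # 500
print(cycles_pipelined(100, 5))    # 104
```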

Cache flush causes the caches to be flushed, performing WB for any
lines that have been changed.

Branch prediction helps to identify possible interruptions to
the normal flow of instructions through the U and V pipelines

Flush acknowledge cycle run when all WB are complete, to
inform external circuitry that the caches are flushed.

Pentium bus logic indicates the type of bus cycle currently
starting with the use of its cycle definition signals

Shutdown cycle - run if the Pentium detects an internal parity
error; the processor waits until it receives NMI, INIT, or RESET to resume execution.

Special cycle bus cycle that requires additional decoding & uses the byte
enable outputs for selection

HALT similar to shutdown, but the INTR signal may be used to
resume execution.

Ti idle state, no bus cycle is currently running

WB & locked bus cycles not pipelined.

T1 valid address is output on the address lines and ADS is
made active (low)

Inquire cycle - used to maintain cache coherency in a
multiprocessor system.

T2 data is read or written, BRDY input is examined

Bus snooping watches the system bus (add, data, control) in a
multiprocessor system; bidirectional; used to help maintain
consistent data in a multiprocessor system where each
processor has a separate cache.

T12 entered when a 2nd bus cycle is started before the 1st one
completes

Superscalar machine capable of parallel execution of
multiple instructions.
Data dependency exists if the 2nd instruction reads an operand
written by the 1st instruction (RAW); also WAW or WAR
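
A small sketch of how the three dependency types can be told apart for a pair of instructions. Each instruction is modelled as the set of registers it reads and the set it writes; this representation and the name `dependencies` are assumptions for illustration.

```python
def dependencies(first, second):
    """Classify data dependencies between two instructions.

    Each instruction is a dict with 'reads' and 'writes' sets of
    register names (an illustrative model, not a real decoder).
    """
    found = set()
    if first["writes"] & second["reads"]:
        found.add("RAW")   # 2nd reads what the 1st writes
    if first["writes"] & second["writes"]:
        found.add("WAW")   # both write the same register
    if first["reads"] & second["writes"]:
        found.add("WAR")   # 2nd overwrites what the 1st reads
    return found

# ADD EAX, EBX  followed by  MOV ECX, EAX
add_eax_ebx = {"reads": {"EAX", "EBX"}, "writes": {"EAX"}}
mov_ecx_eax = {"reads": {"EAX"}, "writes": {"ECX"}}
print(dependencies(add_eax_ebx, mov_ecx_eax))  # {'RAW'}
```

A pair with an empty result has no register dependency in this model; whether it can actually be paired also depends on the other pairing rules.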

Writethrough write to cache & main memory / RAM

Compiler helps keep the Pentium's superscalar architecture
running at full speed.

LRU the least recently used cache entry is not likely to be used in the
future, so it may be replaced.

U pipeline executes any processor instruction

Direct-mapped cache uses a portion of the incoming
physical add to select an entry.

Out-of-date main memory updated with the correct data
from cache.

V pipeline only executes simple instructions

***128 entries in cache, w/c has 32B line size
PF stage instructions are fed into it after being prefetched from the
instruction cache or memory.

Index bits used to select 1 of 128 entries in cache
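
A minimal sketch of how a direct-mapped cache could split a physical address, using the 128 entries & 32B line size noted above. A 32-bit address is assumed, and `split_address` is an illustrative name.

```python
LINE_SIZE = 32       # bytes per cache line
NUM_ENTRIES = 128    # entries in the direct-mapped cache

OFFSET_BITS = LINE_SIZE.bit_length() - 1     # 5 bits select a byte in the line
INDEX_BITS = NUM_ENTRIES.bit_length() - 1    # 7 bits select 1 of 128 entries

def split_address(addr):
    """Return (tag, index, offset) for a physical address."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_ENTRIES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # upper bits, stored as the tag
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), index, offset)  # 0x12345 51 24
```

A hit occurs when the tag stored in the selected entry equals the tag of the incoming address.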

D1 determines if the current pair of instructions can be executed
together.

Fully associative cache uses large tags & does not select an
entry based on index bits

Prefix byte only executes in the U pipeline; cannot be paired.

***fully associative: the upper add bits are compared with every tag

D2 - add for operands that reside in memory are calculated


EX stage operands are read from the data cache or memory,
ALU operations are performed & branch predictions for
instructions are verified.

ACRONYMS
CHAPTER 13
PGA Pin Grid Array
CISC Complex Instruction Set Computer

WB writes the results of completed instructions & verifies
conditional branch instruction predictions; cache only.

RISC Reduced Instruction Set Computer

***if U stalls, so does V; if V stalls, U may continue

RAM Random Access Memory

Program transfer instructions change the sequence, causing
all instructions that entered the pipeline after them to become invalid

WB Writeback
ALU Arithmetic Logic Unit

Bubbles no work is done as the pipeline stages are reloaded


BTB Branch Target Buffer
Dynamic branch prediction used to avoid bubbles
LRU Least Recently Used
BTB special cache that stores the instruction & target add of
any branch instructions that have been encountered in the
instruction stream; stores 2 History bits

TLB Translation Lookaside Buffers


MESI Modified/Exclusive/Shared/Invalid

History bits initially set to 11 when a new target add is placed
into the BTB

INVD Invalidate Cache

00 history bits after repeated NT (not taken) / failures
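
A sketch of how 2 history bits can drive dynamic branch prediction. The notes only give the initial value (11) and the meaning of 00; the rest is an assumption here: the common 2-bit saturating counter scheme, where a taken branch increments the bits, a not-taken branch decrements them, and the branch is predicted taken for 10 or 11. `BTBEntry` is an illustrative name.

```python
class BTBEntry:
    """One branch target buffer entry with 2 history bits (sketch)."""

    def __init__(self, target):
        self.target = target
        self.history = 0b11          # new entries start at 11

    def predict_taken(self):
        # Assumption: predict taken when the upper history bit is set.
        return self.history >= 0b10

    def update(self, taken):
        # Assumption: saturating count, up on taken, down on not taken.
        if taken:
            self.history = min(self.history + 1, 0b11)
        else:
            self.history = max(self.history - 1, 0b00)

entry = BTBEntry(target=0x1000)
for _ in range(3):                   # three not-taken results in a row
    entry.update(taken=False)
print(entry.history, entry.predict_taken())  # 0 False
```

With this scheme a single mispredicted iteration (such as a loop exit) does not flip a strongly taken prediction straight away.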

INVLPG Invalidate TLB entry

Two 32B prefetch buffers work w/ the BTB & D1 to keep a steady stream of
instructions flowing into the pipeline

WBINVD WriteBack and Invalidate Cache

Conditional jumps used to perform loops

2 Types of Pipeline
***U & V
***Pipeline that performs special types of bus cycles

On-chip cache - used to feed instructions and data to the
CPU's pipelines
External cache 2nd level cache
Hit ratio percentage of hits to total cache accesses

5 Types of Bus Cycle


M/IO
D/C
W/R
CACHE
KEN

Miss ratio percentage of misses to total cache accesses
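
A minimal sketch of the hit ratio & miss ratio definitions, both as percentages of total cache accesses. The function names are illustrative.

```python
def hit_ratio(hits, misses):
    """Percentage of hits to total cache accesses."""
    total = hits + misses
    if total == 0:
        raise ValueError("no cache accesses recorded")
    return 100.0 * hits / total

def miss_ratio(hits, misses):
    """Percentage of misses to total cache accesses."""
    return 100.0 - hit_ratio(hits, misses)

print(hit_ratio(90, 10))   # 90.0
print(miss_ratio(90, 10))  # 10.0
```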


Line of data group of locations from the main memory

3 Main Designs of Cache Organization
***direct mapped
***fully associative
***set associative
6 Bus Cycle States
Ti
T1
T2
T12
T2P
TD

***3rd port is used for bus snooping


***instruction cache is write protected to prevent self-modifying
code from changing it
Split-line accesses reading the upper half of 1 line & the
lower half of the next line at the same time.
Parity bits used in each cache to help maintain data integrity

The Five-stage U and V Pipeline


PF
D1
D2
EX
WB

***one parity bit per 8B
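
A short sketch of how a parity bit can protect a chunk of cached data. Even parity is assumed (the notes do not say which kind is used), and the function names are illustrative.

```python
def even_parity_bit(data: bytes) -> int:
    """Parity bit that makes the total number of 1 bits even (assumption)."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2

def parity_ok(data: bytes, stored_parity: int) -> bool:
    """True if the data still matches the parity bit stored with it."""
    return even_parity_bit(data) == stored_parity

chunk = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])  # 8B of data
parity = even_parity_bit(chunk)

corrupted = bytes([chunk[0] ^ 0x01]) + chunk[1:]   # flip a single bit
print(parity_ok(chunk, parity))      # True
print(parity_ok(corrupted, parity))  # False
```

A single parity bit detects any odd number of flipped bits but cannot correct them or detect an even number of flips.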


Translation lookaside buffers used to translate virtual
address into physical address
Virtual address logical or linear address

2 TLB
***four-way set associative with 64 entries
***four-way set associative with 8 entries
3 Cache Instructions
***INVD
***INVLPG
***WBINVD

Physical address used to access the cache because the same
add is used to access main memory
Four-way set associative with 64 entries translates add for
4KB pages
Four-way set associative with 8 entries used to handle 4MB
pages.
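
A minimal sketch of the translation a TLB performs for 4KB & 4MB pages. The TLB is modelled as a plain dict from page number to physical frame number, which ignores the four-way set-associative organization; `split_linear_address` and `translate` are illustrative names.

```python
def split_linear_address(addr, page_size):
    """Return (page_number, offset) for a linear address."""
    offset_bits = page_size.bit_length() - 1   # 12 for 4KB, 22 for 4MB
    return addr >> offset_bits, addr & (page_size - 1)

def translate(addr, tlb, page_size=4 * 1024):
    """Return the physical address, or None on a TLB miss."""
    page, offset = split_linear_address(addr, page_size)
    if page not in tlb:
        return None                  # miss: page tables must be consulted
    return tlb[page] * page_size + offset

tlb = {0x12345: 0x00ABC}             # hypothetical page -> frame entry
print(hex(translate(0x12345678, tlb)))   # 0xabc678
print(translate(0x00001000, tlb))        # None
```

The offset passes through unchanged; only the page number is looked up and replaced.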

Three Features of Pentium 4


HYPER-PIPELINED TECHNOLOGY
RAPID EXECUTION ENGINE
EXECUTION TRACE CACHE

***both TLB are parity protected and dual ported

3 Divisions of SSE Instructions


SIMD-FP
NEW MEDIA
STREAMING MEMORY

Modified current line has been modified (does not match
main memory) & is only available in a single cache. Writing to
this line leaves its state as modified.

4 MESI States
MODIFIED
EXCLUSIVE
SHARED
INVALID

MESI cache-consistency protocol that uses 2 bits stored with
each line of data to keep track of the state of the cache line

Exclusive current line has not been modified (matches main
memory) & is only available in a single cache. Writing to this
line changes its state to modified.
Shared copies of the current line may exist in more than one
cache. Writing to this line causes a writethrough to main memory &
may invalidate the copies in the other caches.

The Eight-stage FPU Pipeline
PF
D1
D2
EX
X1
X2
WF
ER

Invalid current line is empty; miss; writing to this line causes a
writethrough to main memory.
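
A sketch of what a processor write does to a line in each MESI state, following the descriptions in these notes. Three points are assumptions, not from the notes: the 2-bit encoding of the states, that a Shared line stays Shared after its writethrough, and that an Invalid line stays Invalid. Real hardware may differ. `local_write` is an illustrative name.

```python
from enum import IntEnum

class MESI(IntEnum):
    """Four states fit in 2 bits; this particular encoding is arbitrary."""
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3

def local_write(state):
    """Return (next_state, writethrough) for a processor write to the line."""
    if state is MESI.MODIFIED:
        return MESI.MODIFIED, False   # already differs from main memory
    if state is MESI.EXCLUSIVE:
        return MESI.MODIFIED, False   # only copy: update the cache only
    if state is MESI.SHARED:
        return MESI.SHARED, True      # assumption: stays Shared
    return MESI.INVALID, True         # assumption: line is not allocated

for state in MESI:
    next_state, writethrough = local_write(state)
    print(state.name, "->", next_state.name, "writethrough:", writethrough)
```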

Set-associative cache compromise between direct mapped &
fully associative designs; 2, 4, 8 or more ways; combines the simple
tag comparator hardware of the direct mapped cache w/ the flexible
data caching capability of the fully associative cache

WBINVD 1st writes back any updated cache entries, then invalidates them

Two-way set associative cache 2 entries per set

PF prefetch

***256 entries per cache, 64 bytes per set, 8KB of storage per
cache
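
A small sketch of a two-way set-associative cache with LRU replacement. It assumes 128 sets of two 32B lines, which gives the 256 entries & 8KB of storage noted above. Only tags are tracked (no data), and `TwoWayCache` is an illustrative name.

```python
LINE_SIZE = 32    # bytes per line
WAYS = 2          # entries per set
NUM_SETS = 128    # assumption: 256 entries / 2 ways

class TwoWayCache:
    def __init__(self):
        # Each set lists its tags, least recently used first.
        self.sets = [[] for _ in range(NUM_SETS)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up an address; return True on a hit, False on a miss."""
        line = addr // LINE_SIZE
        index = line % NUM_SETS
        tag = line // NUM_SETS
        entries = self.sets[index]
        if tag in entries:
            entries.remove(tag)
            entries.append(tag)       # now the most recently used
            self.hits += 1
            return True
        if len(entries) == WAYS:
            entries.pop(0)            # replace the LRU entry
        entries.append(tag)
        self.misses += 1
        return False

print(NUM_SETS * WAYS * LINE_SIZE)    # 8192 bytes = 8KB

cache = TwoWayCache()
a, b, c = 0, 4096, 8192               # all three map to set 0
results = [cache.access(x) for x in (a, b, a, c, a, b)]
print(results)  # [False, False, True, False, True, False]
```

In the example, `c` replaces `b` (the least recently used entry of the set), so `a` still hits afterwards while `b` misses.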

D1 instruction decode

INVD erases all info in the data cache


INVLPG invalidates the TLB entry associated w/ a supplied memory operand;
necessary when implementing a paged memory system.

***FPU-8 stages

D2 add generation
Triple ported tags in the data cache can be accessed from
three different places at the same time.

EX execute, cache & ALU access; memory & reg read;
floating-point data converted into memory format, memory write
X1 floating-point execute stage 1; memory data
converted into floating-point format, operand written to
floating point reg file, bypass 1 (sends data back to EX)
X2 floating-point execute stage 2
WF round floating point result & write to floating point reg
file, bypass 2 (sends data back to EX stage)
ER error reporting, update status word
Forwarding prevents the pipeline from stalling while waiting for the reg
file to provide a copy of ST
***floating point reg file eight 80-bit floating point reg,
ST(0) - ST(7)

CHAPTER 15
Pentium Pro 14-stage superpipeline, internal level-2 cache
(operates at the speed of the processor core), 4 additional add
lines (36b bus width; 64GB physical add space), supports up to 4 Pentium Pro
CPUs
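
The arithmetic behind the 36b address bus, as a quick sketch: each extra address line doubles the physical address space. `address_space_bytes` is an illustrative name.

```python
GB = 2 ** 30   # 1GB as used in these notes (2^30 bytes)

def address_space_bytes(address_lines):
    """Number of bytes addressable with the given number of address lines."""
    return 2 ** address_lines

print(address_space_bytes(32) // GB)  # 4   (32 address lines)
print(address_space_bytes(36) // GB)  # 64  (4 additional lines)
```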
DIB allows the Pentium Pro to access data in external
memory (over the system bus) while simultaneously accessing
the internal level-2 data cache.
Register renaming used to map temporary result registers to
the actual processor registers

Dispatch execution unit a reservation station controls the
flow of data through the integer, floating point, MMX &
load/store units
Reservation station contains multiple entries, w/ each entry
Cache Management Instructions used in prefetching &
maintaining cache data
128-bit XMM register capable of storing a 128b integer or
four 32-bit floating point numbers
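
A sketch of how four 32-bit floating point numbers fill one 128-bit register, using Python's `struct` module to pack the bytes. Little-endian byte order is assumed, and the element-wise add only imitates what one SIMD instruction does to all four lanes; the function names are illustrative.

```python
import struct

def pack_xmm(a, b, c, d):
    """Pack four 32-bit floats into 16 bytes (128 bits), little-endian."""
    return struct.pack("<4f", a, b, c, d)

def unpack_xmm(raw):
    """Recover the four 32-bit floats from a 16-byte value."""
    return struct.unpack("<4f", raw)

def simd_add(x, y):
    """Add two packed values lane by lane, as one SIMD add would."""
    lanes = [p + q for p, q in zip(unpack_xmm(x), unpack_xmm(y))]
    return pack_xmm(*lanes)

x = pack_xmm(1.0, 2.0, 3.0, 4.0)
y = pack_xmm(0.5, 0.5, 0.5, 0.5)
print(len(x) * 8)               # 128
print(unpack_xmm(simd_add(x, y)))  # (1.5, 2.5, 3.5, 4.5)
```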
SIMD-FP allows the programmer to convert between int &
floating point num, as well as perform floating point reciprocal
& sq. root
New media these new instructions give the programmer ways
to organize data for SIMD computation
Streaming memory instruction allows the programmer to
specify the cacheability of a data object for improved
performance
*** Pentium 4 uses 20 stage pipeline

ECC detects & corrects single-bit errors on the data bus &
detects bit errors on the add bus & control signals

Hyper-pipelined technology allows the use of minimal logic
in each stage and a higher clock speed for the pipeline

MMX Intel's name for a set of new inst designed to improve
the performance of applications utilizing graphics &
communication algorithms.

***rapid execution engine ALUs are clocked at twice the core frequency

***Pentium II 242-pin cartridge called SECC

Execution trace cache part of the processor's level-1 cache
that holds copies of decoded inst, placed there after inst are
fetched & decoded into micro-ops

SECC contains the processor core & level-2 cache mounted
on an aluminum substrate
Fetch decode unit fetches instructions in program order
from the inst cache & breaks the IA-32 instructions into
smaller micro-ops
Micro-ops typically 3 to an inst (2 for source operands, 1 for
the destination)
***complex IA-32 inst require more than 3 micro-ops
Micro-____ instruction sequencer emits pre-designed micro-op sequences
Reorder buffer inst pool; stores the micro-ops & their
execution properties

Hyperthreading ability of Pentium 4 to run two threads of
ops simultaneously
Thread small portion of a program
EM64T allows 64-bit computing on integers, pointers, and
registers; 1 TB (1024GB); virtual add space extended to a flat 64b model
***OS support for EM64T: Windows 64b client OS release, Red
Flag ver. 4.1, Novell Linux Desktop 9, Red Hat DT 3.0
Execute Disable Bit new security feature of Pentium 4 that
helps prevent malicious buffer overflow attacks, a common
computer security violation technique
Celeron offered as a lower-cost alternative to the Pentium II

Celeron D contains support for EM64T & the Execute
Disable Bit, as well as SSE3; similar in architecture to the
Pentium 4
Xeon Intel's server & workstation powerhouse

Intel's GMA 900 newest generation of accelerated graphics
technology, containing support for DirectX 9, OpenGL 1.4 &
3D graphics w/ 4 pixel pipelines for superfast 1.3 Gpixel/sec
rendering speed

***Xeon dual-processor, 0.13 micron process 32b processing

Hotspots - public access Wi-Fi networks

***Xeon & Xeon MP 4.8GB /sec or higher I/O BW

WPA successor of WEP

***Xeon, Xeon MP, Itanium 2 (dual core) allow true
simultaneous execution of 2 threads at the same time.

WEP - the original security standard for wireless transmission

Concurrent execution all threads complete execution over a
period of time (taking turns)

ICE specially designed interface inserted between the CPU &
the motherboard socket in order to tap into every pin of the
processor.

Simultaneous execution threads execute in hardware at
the same time w/o sharing resources.

ICE control module connects the interface to a PC running
emulation software

EPIC utilizes the compiler extensively to rearrange inst from
the original program to extract & exploit parallelism

PC running emulation software able to monitor & control
the test CPU & system

Loop unrolling duplicates the inst found w/in a loop in such
a way that 1 pass through the new loop produces 2 or more
results.
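
A minimal sketch of loop unrolling by a factor of 2: the loop body is duplicated so that 1 pass handles 2 elements. In practice the compiler does this on machine instructions; the Python functions here are only illustrative.

```python
def sum_rolled(values):
    """Original loop: 1 result per pass."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total

def sum_unrolled_by_2(values):
    """Unrolled loop: 2 results per pass, using independent accumulators."""
    total_even = 0
    total_odd = 0
    n = len(values)
    i = 0
    while i + 1 < n:
        total_even += values[i]       # first copy of the loop body
        total_odd += values[i + 1]    # second copy, independent of the first
        i += 2
    if i < n:                         # leftover element when n is odd
        total_even += values[i]
    return total_even + total_odd

data = [3, 1, 4, 1, 5]
print(sum_rolled(data), sum_unrolled_by_2(data))  # 14 14
```

The two accumulators do not depend on each other, which is what lets parallel execution units work on both copies of the body at once.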
Predication true & false sections of code are both executed, along
w/ the condition-testing code, in order to keep the pipeline busy
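
A sketch of executing both the true & false sections and keeping only one result, compared with ordinary branching. The predicate is modelled as 0 or 1 and used to select the result arithmetically; this only imitates the idea, and the function names are illustrative.

```python
def abs_diff_branching(a, b):
    """Ordinary version: only one section runs, chosen by a branch."""
    if a > b:
        return a - b
    return b - a

def abs_diff_both_sections(a, b):
    """Both sections run; the predicate decides which result is kept."""
    p = int(a > b)            # condition-testing code sets the predicate
    true_result = a - b       # true section, always executed
    false_result = b - a      # false section, always executed
    return p * true_result + (1 - p) * false_result

print(abs_diff_branching(3, 8), abs_diff_both_sections(3, 8))  # 5 5
```

No branch is needed to pick the result, so there is nothing to mispredict and no bubble to refill.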
Speculation or speculative loading data is preloaded before
it is needed, while the other instructions are also being executed
Preloading used to hide the latency encountered when
reading data from memory
***Itanium 128 64b gen purpose reg & 128 floating point
reg
***Important factors that determine power consumption:
clock speed and operating voltage
