
POWER-PC

Where we can find POWER-PC


What do the world's fastest supercomputers, network and communications equipment such as Internet routers and switches, the Mars Rover, consumer electronics such as set-top boxes, and game consoles all have in common?

They are powered by microprocessors based on IBM's Power Instruction Set Architecture (POWER-ISA).

POWER is a RISC instruction set architecture designed by IBM. The name is an acronym for Performance Optimization With Enhanced RISC.

Topical Outline

Introduction to POWER-PC
History
Current Status
PowerPC Architecture
How Instruction execution differs from other Microprocessors?
Design principles
Registers
Data Types
Instruction Types
MPC8640D Dual core PowerPC Processor
e600 PowerPC Core Features
PowerPC e600 Core Pipeline Stages
AltiVec Vector Engine in e600
What is vector Processing?
SIMD Intra element Instructions
The Four Vector Engines in e600
AltiVec Characteristics
AltiVec Software Enablement for Vector Signal and Image Processing
Key Areas of Bandwidth, Performance and Computation Abilities
Quad Power-PC Processor Card

Introduction to POWER-PC

History

POWER-PC stands for Performance Optimization With Enhanced RISC - Performance Computing.
IBM introduced POWER-ISA in 1990 with the RS/6000.
In 1991, a group from IBM, Motorola and Apple decided to design a new architecture based on POWER-ISA, which led to the development of POWER-PC.
The aim was to form the basis of a new generation of high-performance, low-cost superscalar products ranging from embedded controllers to massively parallel supercomputers.
The first products were delivered near the end of 1993.
Early implementations include the PowerPC 601, 603 and 604.

Current Status

PowerPC e200 - 32-bit Power Architecture microprocessor with speeds up to 600 MHz; ideal for embedded applications.
PowerPC e300 - similar to the e200, with speeds up to 667 MHz.
PowerPC e600 - speeds up to 2 GHz; ideal for high-performance routing and telecommunications applications.
POWER5 - IBM dual-core microprocessor.
POWER6 - IBM dual-core microprocessor. A notable difference from POWER5 is that the POWER6 executes instructions in order instead of out of order.
PowerPC G3 - used in Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks and several desktops, including both the Beige and the Blue and White Power Macintosh G3s.
PowerPC G4 - a designation used by Apple Computer for a fourth generation of 32-bit PowerPC microprocessors.
PowerPC G5 - 64-bit Power Architecture processors.
Xenon - based on IBM's PowerPC ISA; used in the Xbox 360 game console.
Broadway - based on IBM's PowerPC ISA; used in the Nintendo Wii game console.
Blue Gene/L - dual-core PowerPC 440, 700 MHz, 2004.
Blue Gene/P - quad-core PowerPC 450, 850 MHz, 2007.

PowerPC Architecture

POWER-PC is a high-performance superscalar design supporting multiple independent execution units, including an Integer Unit, a Floating Point Unit and a Branch Processing Unit.
Standard, fixed instruction format.
Single-cycle execution of most instructions.
Memory access is available only to load and store instructions. All other instructions are register-to-register operations, so the execution units can run faster.
A small number of machine instructions and instruction formats.
A large number of general-purpose registers.
A small number of addressing modes.

How Instruction execution differs from other Microprocessors?


General Processor based on CISC Architecture

To multiply two operands in memory, as shown in the figure:
Assembly Instructions
MUL 2:3, 5:2
Here the execution units access memory directly.

POWER-PC based on RISC Architecture

To multiply two operands in memory, as shown in the figure:
Assembly Instructions
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
The Load and Store unit fetches the operand at 2:3 from memory into Reg A.
The Load and Store unit fetches the operand at 5:2 from memory into Reg B.
The Integer unit computes the product of A and B and stores the result in A.
The Load and Store unit stores the result in Reg A at 2:3 in memory.
Here the execution units operate only on processor registers, with single-cycle throughput.
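The difference between the two styles can be sketched in a few lines of Python. This is only a toy model: the dictionary standing in for memory, the operand values 6 and 7, and the function names are illustrative assumptions, and the code is not real PowerPC or CISC assembly.

```python
# Toy model of the two styles; the memory layout and the register names
# A and B are illustrative assumptions, not real PowerPC syntax.

def cisc_mul(memory, dst, src):
    """Memory-to-memory multiply: the execution unit touches memory directly."""
    memory[dst] = memory[dst] * memory[src]


def risc_mul(memory, dst, src):
    """Load/store sequence: only LOAD and STORE access memory."""
    regs = {}
    regs["A"] = memory[dst]             # LOAD A, dst
    regs["B"] = memory[src]             # LOAD B, src
    regs["A"] = regs["A"] * regs["B"]   # PROD A, B (register to register)
    memory[dst] = regs["A"]             # STORE dst, A


if __name__ == "__main__":
    mem_cisc = {"2:3": 6, "5:2": 7}
    mem_risc = dict(mem_cisc)
    cisc_mul(mem_cisc, "2:3", "5:2")
    risc_mul(mem_risc, "2:3", "5:2")
    # Both styles leave the same product at location 2:3.
    print(mem_cisc["2:3"], mem_risc["2:3"])
```

Both functions produce the same result; what differs is which step is allowed to touch memory.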

Design principles

Simplicity favors regularity

Standard 32-bit instruction format for all instructions: fixed-length instructions, register-to-register architecture, three-operand instruction format.

Smaller is faster

Three categories of registers, each handling specific instructions, so access time is presumably faster.
Integer and floating-point instructions.
To align with RISC principles, many instructions that required three source operands were eliminated.
Many complex instructions were curtailed to conform to RISC principles, but this was compensated by a large number of mnemonics that increase the number of instructions.

Make the common case fast


Good design demands good compromises

General

All PowerPC processors run the same core PowerPC instruction set.

It is independent of implementation aspects.


It allows anyone to design and fabricate compatible PowerPC processors independent of implementation differences as the technology advances. They differ primarily in the degree of dedicated hardware support for multiple execution units, cache size and capability, length of pipeline, and interface busses. These differences result in different tradeoffs in processing performance, die area, and power dissipation.

Initialization

When the processor is first initialized, it is in supervisor (also called privileged) mode. In this mode, all processor resources, including registers and instructions are accessible. The processor can limit access to certain privileged registers and instructions by placing itself in user mode. This protection limits application code from being able to modify global and sensitive resources, such as the caches, memory management system, and timers.

Registers
The architecture defines five types of registers:

Special Purpose Registers (SPRs)
General Purpose Registers (GPRs)
Floating Point Registers (FPRs)
Device Control Registers (DCRs)
Machine State Register (MSR)


SPRs give status and control of resources within the processor core.

Five important user-mode SPRs are:

The Fixed-Point Exception Register (XER) indicates conditions for integer operations, such as carries and overflows.
The Floating-Point Status and Control Register (FPSCR) is a 32-bit register that stores the status and control of floating-point operations.
The Count Register (CTR) holds a loop count that can be decremented during the execution of branch instructions.
The Condition Register (CR) is a 32-bit register grouped into eight 4-bit fields. The bits of each field signify the result of an instruction's operation: Equal (EQ), Greater Than (GT), Less Than (LT), and Summary Overflow (SO).
The Link Register (LR) contains the address to return to at the end of a function call.
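As an illustration of the Condition Register layout, the sketch below pulls one 4-bit field out of a 32-bit CR value and names its bits. It assumes the usual PowerPC convention that CR0 occupies the most significant four bits and that the bits within a field are ordered LT, GT, EQ, SO; that ordering is not stated in these slides, and the function names are hypothetical.

```python
# Assumed bit order inside each 4-bit field, most significant bit first.
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")


def cr_field(cr, n):
    """Return the 4-bit field CRn (n = 0..7) from a 32-bit CR value.

    CR0 is taken to be the most significant four bits (an assumption
    based on the usual PowerPC big-endian bit numbering).
    """
    if not 0 <= n <= 7:
        raise ValueError("CR field index must be in 0..7")
    shift = 28 - 4 * n
    return (cr >> shift) & 0xF


def cr_flags(cr, n):
    """Decode field CRn into a dict of named boolean flags."""
    bits = cr_field(cr, n)
    return {name: bool(bits & (8 >> i)) for i, name in enumerate(CR_BIT_NAMES)}


if __name__ == "__main__":
    cr = 0x20000000          # only the EQ bit of CR0 is set
    print(cr_flags(cr, 0))
```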

General Purpose Registers :

The architecture specifies that all implementations have 32 GPRs (GPR0 - GPR31). GPRs are the source and destination of all fixed-point operations and load/store operations. They also provide access to SPRs and DCRs. They are all available for use in every instruction, with one exception: in certain instructions, GPR0 simply means 0 and no lookup is done for GPR0's contents.

Floating Point Registers :

The PowerPC architecture provides thirty-two 64-bit floating-point registers.

Device Control Registers :

DCRs are similar to SPRs in that they give status and control information, but DCRs are for resources outside the processor core. DCRs allow for memory-mapped I/O control without using up portions of the memory address space.

Machine State Register :

The MSR represents the state of the machine.

It is accessible only in supervisor mode and contains settings such as memory translation, cache settings, interrupt enables, user/privileged state, and floating-point availability. The exact control bits vary by implementation.
The MSR does not readily fit into the SPR/GPR classification, as it has its own pair of instructions (mfmsr and mtmsr) to move its contents to and from a GPR.

Data Types

PowerPC can deal with data types of 8 bits (byte), 16 bits (half word), 32 bits (word) and 64 bits (double word) in length.
It can use either little-endian or big-endian style; that is, the least significant byte is stored at the lowest address (little-endian) or at the highest address (big-endian).
Fixed-point data types include:
* Unsigned byte
* Unsigned half word
* Signed half word
* Unsigned word
* Signed word
* Unsigned double word
* Byte strings: from 0 to 128 bytes in length
Floating-point data types include IEEE-754 single- and double-precision types.
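The two byte orders can be seen by laying out the same 32-bit word both ways. The sketch below uses Python's standard struct module; the value 0x12345678 and the function name word_bytes are arbitrary choices for illustration.

```python
import struct


def word_bytes(value, big_endian=True):
    """Return the 4 bytes of a 32-bit word in ascending address order."""
    fmt = ">I" if big_endian else "<I"
    return struct.pack(fmt, value)


if __name__ == "__main__":
    value = 0x12345678
    # Big-endian: most significant byte (0x12) at the lowest address.
    print(word_bytes(value, big_endian=True).hex())
    # Little-endian: least significant byte (0x78) at the lowest address.
    print(word_bytes(value, big_endian=False).hex())
```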

Instruction Format

The architecture encodes all instructions in 32 bits and aligns them on word address boundaries in memory. Instructions are first decoded by the upper 6 bits, in a field called the primary opcode. The remaining 26 bits contain operands and/or reserved fields. The instruction types defined are: ALU, Floating Point, Load/Store, Branch, Condition and Synchronization instructions.
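The first decode step described above is a shift and a mask. The sketch below splits a 32-bit instruction word into its 6-bit primary opcode and the remaining 26 bits. The sample word 0x38600001 is an assumption chosen for illustration (in the standard encoding it should correspond to an add-immediate with primary opcode 14), and the function names are hypothetical.

```python
def primary_opcode(word):
    """Return the upper 6 bits of a 32-bit instruction word."""
    return (word >> 26) & 0x3F


def split_instruction(word):
    """Return (primary opcode, remaining 26 bits of operands/reserved fields)."""
    return primary_opcode(word), word & 0x03FFFFFF


if __name__ == "__main__":
    word = 0x38600001  # sample word, assumed for illustration
    opcode, rest = split_instruction(word)
    print(opcode, hex(rest))
```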

Instruction Types

Addressing Modes
Three types of operand addressing :

Memory operand addressing:


Indirect addressing :
* Base address in a GPR + a 16-bit sign-extended literal
Indirect-indexed addressing :
* Base address in a GPR + displacement from another GPR
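The two memory addressing modes can be sketched as effective-address calculations. This minimal model assumes 32-bit addresses and applies the GPR0 rule mentioned under General Purpose Registers (a base register number of 0 is read as the value 0); the function names and register contents are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumes a 32-bit implementation


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def ea_indirect(gprs, ra, literal):
    """Base address in GPR ra plus a 16-bit sign-extended literal."""
    base = 0 if ra == 0 else gprs[ra]   # GPR0 as a base means the value 0
    return (base + sign_extend16(literal)) & MASK32


def ea_indexed(gprs, ra, rb):
    """Base address in GPR ra plus a displacement held in GPR rb."""
    base = 0 if ra == 0 else gprs[ra]
    return (base + gprs[rb]) & MASK32


if __name__ == "__main__":
    gprs = [0] * 32
    gprs[1], gprs[2] = 0x1000, 0x20
    print(hex(ea_indirect(gprs, 1, 0xFFFC)))  # literal 0xFFFC is -4
    print(hex(ea_indexed(gprs, 1, 2)))
```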

ALU and Floating-point instruction operand addressing:

Three-register Format

Branch Operand Addressing :


Absolute : Use the literal as the absolute address.
Relative : Use the literal as the displacement from the branch instruction address.
Indirect : Take the target address from the LR or CTR registers.
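The three branch addressing modes reduce to three small target calculations. The sketch below assumes 32-bit addresses and ignores encoding details such as the width of the literal field; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumes a 32-bit implementation


def branch_absolute(literal):
    """Absolute: the literal is the target address."""
    return literal & MASK32


def branch_relative(branch_address, literal):
    """Relative: the literal is a signed displacement from the branch itself."""
    return (branch_address + literal) & MASK32


def branch_indirect(lr, ctr, use_ctr=False):
    """Indirect: the target comes from the LR (default) or the CTR."""
    return (ctr if use_ctr else lr) & MASK32


if __name__ == "__main__":
    print(hex(branch_absolute(0x2000)))
    print(hex(branch_relative(0x1000, -8)))
    print(hex(branch_indirect(lr=0x3000, ctr=0x4000)))
```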

Power PC
MPC8640D Dual core PowerPC Processor


Fig: Block Diagram of MPC8641D Dual core PowerPC Processor

MPC8640D PowerPC Processor Specifications

e600 PowerPC Core Features


High-performance, 32-bit superscalar microprocessor that implements the PowerPC architecture
Eleven independent execution units and three register files:
* Branch processing unit (BPU)
* Four integer units (IUs) that share 32 GPRs for integer operands
* 64-bit floating-point unit (FPU)
* Four vector units and a 32-entry vector register file (VRs)
* Three-stage load/store unit (LSU)
Three issue queues, FIQ, VIQ, and GIQ, can accept as many as one, two, and three instructions, respectively, in a cycle
Dispatch unit
Completion unit
Two separate 32-Kbyte instruction and data level 1 (L1) caches
Integrated 1-Mbyte, eight-way set-associative unified instruction and data level 2 (L2) cache with ECC
36-bit real addressing
Separate memory management units (MMUs) for instructions and data
Multiprocessing support features
Power and thermal management
Performance monitor
In-system testability and debugging features

MPC8640D PowerPC

Superscalar Microprocessor with Instruction Level Parallelism and Seven Stage Pipeline Execution allows multiple instructions to be executed in parallel

High-performance superscalar e600 core
* As many as 4 instructions can be fetched from the instruction cache at a time.
* As many as 3 instructions can be dispatched to the issue queues at a time.
* As many as 12 instructions can be in the instruction queue (IQ).
* As many as 16 instructions can be at some stage of execution simultaneously.

Single-cycle execution for most instructions


One-instruction throughput per clock cycle for most instructions

Seven-stage pipeline control

Execution Units
BPU : Branch Processing Unit
VPU : Vector Permute Unit
VIU : Vector Integer Unit
VFPU : Vector Floating Point Unit
FPU : Floating Point Unit
IU : Integer Unit
LSU : Load/Store Unit

MPC 8640D With e600 PPC core Micro architecture with emphasis on pipeline stages of the front end and the functional units.

Fig: e600 POWER-PC Core

PowerPC e600 Core Pipeline Stages

Stages 1 and 2 - Instruction Fetch:

These two stages are both dedicated primarily to grabbing an instruction from the L1 cache. The e600 can fetch four instructions per clock cycle from the L1 cache and send them on to the next stage.

Stage 3 - Decode/Dispatch:

Once an instruction has been fetched, it goes into a 12-entry instruction queue to be decoded. The e600's decoder can dispatch up to three instructions per clock cycle to the next stage.
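The interplay of the fetch width (4), the instruction queue size (12) and the dispatch width (3) can be explored with a deliberately crude cycle-counting model. It is only a sketch: it assumes no branches, no cache misses and no stalls downstream of dispatch, and it assumes an instruction fetched in one cycle can be dispatched no earlier than the next. The function name is hypothetical.

```python
def cycles_to_dispatch(n_instructions, fetch_width=4, dispatch_width=3,
                       iq_capacity=12):
    """Count cycles until n_instructions have left the instruction queue.

    Each cycle first dispatches up to dispatch_width instructions that were
    queued in earlier cycles, then fetches up to fetch_width new ones into
    whatever queue slots are free.
    """
    remaining = n_instructions   # not yet fetched
    queued = 0                   # currently in the instruction queue
    dispatched = 0
    cycles = 0
    while dispatched < n_instructions:
        cycles += 1
        out = min(dispatch_width, queued)
        queued -= out
        dispatched += out
        fetched = min(fetch_width, remaining, iq_capacity - queued)
        remaining -= fetched
        queued += fetched
    return cycles


if __name__ == "__main__":
    for n in (3, 12, 300):
        print(n, cycles_to_dispatch(n))
```

In this model the dispatch width, not the fetch width, bounds long instruction streams: the queue fills up and fetch is throttled to three instructions per cycle.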


Stage 4 - Issue:

Dispatched instructions wait in one of three issue queues. The first is the Floating-Point Issue Queue (FIQ), which holds floating-point (FP) instructions that are waiting to be executed. The second is the Vector Issue Queue (VIQ), which holds vector operations. The third is the General Instruction Queue (GIQ), which holds everything else. Once an instruction leaves its issue queue, it goes to the execution engine to be executed.


Stage 5 - Execute:

The instructions can pass out-of-order from their issue queues into their respective functional units and be executed.

Stage 6 and 7 - Complete and Write-Back :

In these two stages, the instructions are put back into the order in which they came into the processor, and their results are written back to the architected registers.

AltiVec Vector Engine in e600


AltiVec is a floating-point and integer SIMD (Single Instruction, Multiple Data) instruction set designed and owned by Apple, IBM and Freescale Semiconductor (formerly the Semiconductor Products Sector of Motorola), together the AIM alliance, and implemented on versions of the PowerPC.
The vector processing unit is branded with several names:
IBM - Vector Multimedia Extension (VMX)
Apple - Velocity Engine
Freescale - AltiVec

What is vector Processing?


A vector architecture allows the simultaneous processing of multiple data items in parallel.
Operations are performed on multiple data elements by a single instruction.
This is referred to as Single Instruction Multiple Data (SIMD) parallel processing.
For example, the vector addition instruction VT = (VA + VB) is computed with single-cycle latency and single-cycle throughput.
The multiply-and-accumulate instruction VT = (VA * VB) + VC is computed with 5-cycle latency and single-cycle throughput.
Here each vector is 128 bits in size and can hold:
an array of 16 characters
an array of 8 short integers
an array of 4 long integers
an array of 4 SP floating-point numbers
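The lane counts and the two example operations can be mimicked with plain Python lists. This is a sketch of the semantics only: the loops stand in for what the hardware does across all lanes at once, and the function names are hypothetical.

```python
VECTOR_BITS = 128


def lane_count(element_bits):
    """Number of elements of the given width that fit in one 128-bit vector."""
    return VECTOR_BITS // element_bits


def vector_add(va, vb, element_bits):
    """Element-wise VT = VA + VB with wrap-around at the element width."""
    lanes = lane_count(element_bits)
    if len(va) != lanes or len(vb) != lanes:
        raise ValueError("operands must fill exactly one 128-bit vector")
    mask = (1 << element_bits) - 1
    return [(a + b) & mask for a, b in zip(va, vb)]


def vector_madd(va, vb, vc):
    """Element-wise VT = (VA * VB) + VC on four floating-point lanes."""
    if not len(va) == len(vb) == len(vc) == lane_count(32):
        raise ValueError("expected four 32-bit lanes per operand")
    return [a * b + c for a, b, c in zip(va, vb, vc)]


if __name__ == "__main__":
    print(lane_count(8), lane_count(16), lane_count(32))
    print(vector_add([1, 2, 3, 4], [10, 20, 30, 40], 32))
    print(vector_madd([1.0, 2.0, 3.0, 4.0], [2.0] * 4, [0.5] * 4))
```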

SIMD Intra element Instructions

The Four Vector Engines in e600

AltiVec Vector Permute Unit (VPU)

The VPU executes permutation instructions such as pack, unpack, merge, splat, and permute on vector operands.

AltiVec Vector Integer Unit 1 (VIU1)

The VIU1 executes simple vector integer computational instructions, such as addition, subtraction, maximum and minimum comparisons, averaging, rotation, shifting, comparisons, and Boolean operations.

AltiVec Vector Integer Unit 2 (VIU2)

The VIU2 executes longer-latency vector integer instructions, such as multiplication, multiplication/addition, and sum-across with saturation.

AltiVec Vector Floating-Point Unit (VFPU)

The VFPU executes all vector floating-point instructions.

AltiVec computational instructions are executed in four independent, pipelined AltiVec execution units. A maximum of two AltiVec instructions can be issued out-of-order to any combination of AltiVec execution units per clock cycle from the bottom two VIQ entries (VIQ1-VIQ0). This means an instruction in VIQ1 does not have to wait for an instruction in VIQ0 that is waiting for operand availability. Moreover, the VIU2, VFPU, and VPU are pipelined, so they can operate on multiple instructions. The VPU has a two-stage pipeline; the VIU2 and VFPU each have four-stage pipelines. As many as ten AltiVec instructions can be executing concurrently.

AltiVec Characteristics

128b vector size
* 4, 8 or 16 data elements
Separate register file with 32 register namespace
Vector-element data type of 8-, 16-, 32-bit signed / unsigned int, and IEEE SP float
162 instructions
* Intra- and inter-element arithmetic instructions
* Intra- and inter-element conditional instructions
* Powerful Permute, Shift and Rotate, Splat, Pack/Unpack and Merge instructions
Saturation or modulo arithmetic
Four-operand, nondestructive instruction format
* Three sources, one destination
Modeless operation for zero-overhead use of AltiVec instructions
Simultaneous dispatch of one ALU-class vector and one permute-class vector, or either paired with a vector load/store
* Peak throughput of 2 instructions per cycle
* Simple ops: 1 cycle latency
* Compound ops: 3-4 cycle latency
* No restriction on issue with scalar instructions
All instructions fully pipelined with single-cycle throughput
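The difference between saturation and modulo arithmetic shows up as soon as a result overflows its element. The sketch below works on unsigned 8-bit lanes; the function names are hypothetical and the loop stands in for a single vector instruction.

```python
def add_modulo_u8(a, b):
    """Unsigned 8-bit add that wraps around on overflow."""
    return (a + b) & 0xFF


def add_saturate_u8(a, b):
    """Unsigned 8-bit add that clamps to 255 on overflow."""
    return min(a + b, 0xFF)


def vector_op(va, vb, op):
    """Apply a lane-wise operation to two equal-length vectors."""
    if len(va) != len(vb):
        raise ValueError("vectors must have the same number of lanes")
    return [op(a, b) for a, b in zip(va, vb)]


if __name__ == "__main__":
    va = [200, 10, 255, 0]
    vb = [100, 20, 1, 0]
    print(vector_op(va, vb, add_modulo_u8))
    print(vector_op(va, vb, add_saturate_u8))
```

Saturation is often the better fit for pixel and audio samples, where a clamped value is less harmful than a wrapped one.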


AltiVec Software Enablement for Vector Signal and Image Processing


Tool and product support
* Compilers: GCC, GHS, WR
* Conversion tools: SSE to AltiVec
* Vectorization: linear to parallel code

AltiVec libraries
* Intrinsic optimized libraries from Freescale
* Ecosystem libraries: OpenSAL, VSIPL, OpenCV, OpenGL ES, VSIPL++, multicore SAL

Multi-core/multi-threading

Key Areas of Bandwidth, Performance and Computation Abilities


Summary of floating point calculations for e600 @ 1 GHz:
0.8 SP or DP GFLOPS (one core only, regular FP instructions).
1.6 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
4.8 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
9.6 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).

Summary of floating point calculations for e600 @ 1.5 GHz:
1.2 SP or DP GFLOPS (one core only, regular FP instructions).
2.4 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
7.2 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
14.4 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).

In addition, each core has four Integer Units (IU1, IU2, IU3, IU4), one Vector Permute Unit (VPU) and two Vector Integer Units (VIU1, VIU2) to meet computation requirements.
The computation time required to perform a 1024-point complex-to-complex FFT is 10 microseconds.


Summary of floating point calculations for MPC8640D @ 1 GHz:
1.6 SP or DP GFLOPS (dual core, regular FP instructions).
3.2 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
9.6 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
19.2 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).

Summary of floating point calculations for MPC8640D @ 1.5 GHz:
2.4 SP or DP GFLOPS (dual core, regular FP instructions).
4.8 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
14.4 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
28.8 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).

In addition, each core has four Integer Units (IU1, IU2, IU3, IU4), one Vector Permute Unit (VPU) and two Vector Integer Units (VIU1, VIU2) to meet computation requirements.
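The figures quoted for the e600 and the MPC8640D scale linearly with clock frequency and core count, which the sketch below makes explicit. The per-core values at 1 GHz are taken directly from the summaries; the dictionary keys and the function name are hypothetical, and the result is a peak figure, not a measured one.

```python
# Per-core peak GFLOPS at 1 GHz, as quoted in the e600 summary.
BASE_GFLOPS_PER_CORE_AT_1GHZ = {
    "fp": 0.8,                 # regular FP instructions
    "fp_madd": 1.6,            # only multiply-add FP instructions
    "fp_altivec": 4.8,         # core + AltiVec VFPU, regular FP
    "fp_altivec_madd": 9.6,    # core + AltiVec VFPU, only multiply-add
}


def peak_gflops(kind, clock_ghz, cores=1):
    """Scale the quoted 1 GHz single-core figure by clock and core count."""
    return BASE_GFLOPS_PER_CORE_AT_1GHZ[kind] * clock_ghz * cores


if __name__ == "__main__":
    for kind in BASE_GFLOPS_PER_CORE_AT_1GHZ:
        single = peak_gflops(kind, 1.5)
        dual = peak_gflops(kind, 1.5, cores=2)
        print(f"{kind}: {single:.1f} (one core), {dual:.1f} (dual core) @ 1.5 GHz")
```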

COTS VPX Card with Quad Power-PC Processors (MPC8640D)

Quad Power-PC Processor Card

SMP OS running on a dual-core processor

Quad processor card, each processor running an SMP OS


Fig: Quad processor card with Node A, Node B, Node C, Node D and an Embedded Controller

Node A, Node B and Node C will be used for signal processing.

Conclusion

Overall, the PowerPC is a strong architecture: it can handle more instructions at once, it supports more operations for branching and floating-point work, and it handles complexity in data and memory efficiently.
