They are powered by microprocessors based on IBM's Power Instruction Set Architecture (POWER-ISA).
POWER is a RISC instruction set architecture designed by IBM. The name is an acronym for Performance Optimization With Enhanced RISC.
Topical Outline
* Introduction to POWER-PC
* History
* Current Status
* PowerPC Architecture
* How Instruction execution differs from other Microprocessors?
* Design principles
* Registers
* Data Types
* Instruction Types
* MPC8640D Dual core PowerPC Processor
* e600 PowerPC Core Features
* PowerPC e600 Core Pipeline Stages
* AltiVec Vector Engine in e600
* What is vector Processing?
* SIMD Intra element Instructions
* The Four Vector Engines in e600
* AltiVec Characteristics
* AltiVec Software Enablement for Vector Signal and Image Processing
* Key Areas of Bandwidth, Performance and Computation Abilities
* Quad Power-PC Processor Card
Introduction to POWER-PC
History
PowerPC stands for Performance Optimization With Enhanced RISC - Performance Computing.
IBM introduced POWER-ISA in 1990 with the RS/6000. In 1991, a group from IBM, Motorola and Apple decided to design a new architecture based on POWER-ISA, which led to the development of PowerPC. The aim was to form the basis of a new generation of high-performance, superscalar, low-cost products ranging from embedded controllers to massively parallel supercomputers. The first products were delivered near the end of 1993. Early implementations include the PowerPC 601, 603 and 604.
Current Status
* PowerPC e200: 32-bit Power Architecture microprocessor with speeds up to 600 MHz; ideal for embedded applications.
* PowerPC e300: similar to the e200, with speeds up to 667 MHz.
* PowerPC e600: speeds up to 2 GHz; ideal for high-performance routing and telecommunications applications.
* POWER5: IBM dual-core microprocessor.
* POWER6: IBM dual-core microprocessor. A notable difference from POWER5 is that the POWER6 executes instructions in order instead of out of order.
* PowerPC G3: used in Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks and several desktops, including both the Beige and the Blue and White Power Macintosh G3s.
* PowerPC G4: a designation used by Apple Computer to describe a fourth generation of 32-bit PowerPC microprocessors.
* PowerPC G5: 64-bit Power Architecture processors.
* Xenon: based on IBM's PowerPC ISA; used in the Xbox 360 game console.
* Broadway: based on IBM's PowerPC ISA; used in the Nintendo Wii gaming console.
* Blue Gene/L: dual-core PowerPC 440, 700 MHz, 2004.
* Blue Gene/P: quad-core PowerPC 450, 850 MHz, 2007.
PowerPC Architecture
PowerPC is a high-performance superscalar design supporting multiple independent execution units, including an Integer Unit, a Floating Point Unit and a Branch Processing Unit. Its main characteristics are:
* A standard, fixed instruction format.
* Single-cycle execution of most instructions.
* Memory access is available only through load and store instructions. All other instructions are register-to-register operations, which lets the execution units run faster.
* A small number of machine instructions and instruction formats.
* A large number of general-purpose registers.
* A small number of addressing modes.
To multiply two operands in memory with a memory-to-memory instruction, as shown in the figure, a single assembly instruction is used:
MUL 2:3, 5:2
To perform the same multiplication with load/store (RISC) instructions, as shown in the figure, the assembly instructions are:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
* The load/store unit fetches the operand at 2:3 from memory into register A.
* The load/store unit fetches the operand at 5:2 from memory into register B.
* The integer unit computes the product of A and B and stores the result in A.
* The load/store unit stores the result held in register A to 2:3 in memory.
Here the execution units operate only on processor registers, with single-cycle throughput.
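The four-instruction sequence can be traced with a short Python sketch. The memory layout (a dictionary keyed by the slide's row:column locations), the operand values 6 and 7, and the helper names are illustrative assumptions; this is a toy model, not PowerPC assembly.

```python
# Toy model of a load/store machine: only LOAD and STORE touch memory,
# while PROD works purely on registers. Values and names are illustrative.
memory = {"2:3": 6, "5:2": 7}   # assumed operand values
regs = {"A": 0, "B": 0}

def load(reg, addr):
    """LOAD reg, addr: copy a memory operand into a register."""
    regs[reg] = memory[addr]

def prod(dst, src):
    """PROD dst, src: register-to-register multiply, result kept in dst."""
    regs[dst] = regs[dst] * regs[src]

def store(addr, reg):
    """STORE addr, reg: write a register back to memory."""
    memory[addr] = regs[reg]

load("A", "2:3")
load("B", "5:2")
prod("A", "B")
store("2:3", "A")
print(memory["2:3"])  # 42
```

Only `load` and `store` read or write `memory`; `prod` never does, which is the property the load/store design relies on.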
Design principles
Simplicity favors regularity
* Standard 32-bit instruction format for all instructions.
Smaller is faster
* Three categories of registers, each handling specific instructions (for example, integer and floating-point instructions), so access time is presumably faster.
* To align with RISC principles, many instructions that required three source operands were eliminated.
* Many complex instructions were curtailed to conform with RISC principles, but this is compensated by a large number of mnemonics that increase the number of instructions.
General
All PowerPC processors run the same core PowerPC instruction set.
Initialization
When the processor is first initialized, it is in supervisor (also called privileged) mode. In this mode, all processor resources, including registers and instructions, are accessible. The processor can limit access to certain privileged registers and instructions by placing itself in user mode. This protection prevents application code from modifying global and sensitive resources, such as the caches, memory management system, and timers.
Registers
The architecture defines five types of registers:
* Special Purpose Registers (SPRs)
* General Purpose Registers (GPRs)
* Floating Point Registers (FPRs)
* Device Control Registers (DCRs)
* Machine State Register (MSR)
SPRs give status and control of resources within the processor core.
Five important user mode SPRs are:
The Fixed-Point Exception Register (XER) is used for indicating conditions for integer operations, such as carries and overflows.
The Floating-Point Status and Control Register (FPSCR) is a 32-bit register used to store the status and control of the floating-point operations.
The Count Register (CTR) is used to hold a loop count that can be decremented during the execution of branch instructions.
The Condition Register (CR) is a 32-bit register grouped into eight fields, where each field is 4 bits that signify the result of an instruction's operation: Equal (EQ), Greater Than (GT), Less Than (LT), and Summary Overflow (SO).
The Link Register (LR) contains the address to return to at the end of a function call.
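As a rough illustration of how one 4-bit CR field is formed, the Python sketch below encodes the result of a compare and places it into a 32-bit CR value. It assumes the architecture's usual bit order within a field (LT, GT, EQ, SO from most to least significant) and that field 0 occupies the top four bits; the function names are invented for this example.

```python
# Bit masks inside one 4-bit CR field (most significant bit first).
LT, GT, EQ, SO = 0b1000, 0b0100, 0b0010, 0b0001

def compare_field(a, b, summary_overflow=False):
    """Return the 4-bit field describing how a compares with b."""
    if a < b:
        field = LT
    elif a > b:
        field = GT
    else:
        field = EQ
    if summary_overflow:
        field |= SO
    return field

def insert_field(cr, n, field):
    """Write a 4-bit field into CR field n (0..7) of a 32-bit CR value."""
    shift = 4 * (7 - n)               # field 0 sits in the top nibble
    cleared = cr & ~(0xF << shift) & 0xFFFFFFFF
    return cleared | (field << shift)

cr = insert_field(0, 0, compare_field(3, 5))   # 3 < 5 sets LT in field 0
print(hex(cr))  # 0x80000000
```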
General Purpose Registers :
The architecture specifies that all implementations have 32 GPRs (GPR0 - GPR31). GPRs are the source and destination of all fixed-point operations and load/store operations. They also provide access to SPRs and DCRs. They are all available for use in every instruction, with one exception: in certain instructions, GPR0 simply means 0 and no lookup is done for GPR0's contents.
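Floating Point Registers :
The architecture defines 32 FPRs (FPR0 - FPR31), each 64 bits wide, which are the source and destination of floating-point operations.
Device Control Registers :
DCRs are similar to SPRs in that they give status and control information, but DCRs are for resources outside the processor core. DCRs allow for I/O control without using up portions of the memory address space.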
Machine State Register :
It is accessed only in supervisor mode, and contains the settings for things such as memory translation, cache settings, interrupt enables, user/privileged state, and floating point availability. Exact control bits vary by implementation.
The MSR does not readily fit into the SPR/GPR classification, as it has its own pair of instructions (mfmsr and mtmsr) to move its contents to and from a GPR.
Data Types
PowerPC can deal with data types of 8 bits (byte), 16 bits (half word), 32 bits (word) and 64 bits (double word) in length. It can use either little-endian or big-endian style; that is, the least significant byte is stored at the lowest or the highest address, respectively.
Fixed-point data types include:
* Unsigned byte
* Unsigned half word
* Signed half word
* Unsigned word
* Signed word
* Unsigned double word
* Byte strings: from 0 to 128 bytes in length
Floating-point data types include IEEE-754 single- and double-precision types.
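The difference between the two byte orders can be seen with Python's standard `struct` module. This is a host-side illustration of the byte layouts only; the value 0x12345678 is an arbitrary example.

```python
import struct

value = 0x12345678                    # an arbitrary 32-bit word

big = struct.pack(">I", value)        # big-endian: most significant byte first
little = struct.pack("<I", value)     # little-endian: least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# A signed half word (16 bits) in big-endian order.
half = struct.pack(">h", -2)
print(half.hex())    # fffe
```

In the big-endian layout the byte at the lowest address is 0x12 (the most significant byte); in the little-endian layout it is 0x78 (the least significant byte).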
Instruction Format
The architecture encodes all instructions in 32 bits and aligns them on word address boundaries in memory. Instructions are first decoded by the upper 6 bits, in a field called the primary opcode. The remaining 26 bits contain operands and/or reserved fields. The instruction types defined are: ALU, Floating Point, Load/Store, Branch, Condition and Synchronization instructions.
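A minimal Python sketch of the first decode step: it splits a 32-bit instruction word into the 6-bit primary opcode and the remaining 26 bits, and checks word alignment. The helper names are invented for illustration; the sample word 0x38600001 is the encoding of `addi r3, 0, 1`, whose primary opcode is 14.

```python
def primary_opcode(word):
    """Upper 6 bits of a 32-bit instruction word."""
    return (word >> 26) & 0x3F

def remaining_bits(word):
    """Lower 26 bits: operands and/or reserved fields."""
    return word & 0x03FFFFFF

def is_word_aligned(address):
    """Instructions sit on 4-byte (word) boundaries."""
    return address % 4 == 0

word = 0x38600001                     # addi r3, 0, 1
print(primary_opcode(word))           # 14
print(hex(remaining_bits(word)))      # 0x600001
print(is_word_aligned(0x1004))        # True
```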
Instruction Types
Addressing Modes
Types of operand addressing for load/store instructions:
* Indirect addressing: base address in a GPR + a 16-bit sign-extended literal.
* Indirect-indexed addressing: base address in a GPR + displacement from another GPR (three-register format).
Types of branch addressing:
* Absolute: use the literal as the absolute address.
* Relative: use the literal as the displacement from the branch instruction address.
* Indirect: take the target address from the LR or CTR registers.
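The address calculations behind these modes can be sketched in Python as follows. This is a simplified 32-bit model: the function names and register values are illustrative assumptions, the branch literal is assumed to be already sign-extended, and details such as ignored low-order address bits are left out. A base register number of 0 is treated as the constant 0, in line with the GPR0 exception described earlier.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def ea_indirect(gpr, ra, d16):
    """Base GPR plus a 16-bit sign-extended literal."""
    base = 0 if ra == 0 else gpr[ra]      # register number 0 means the value 0
    return (base + sign_extend(d16, 16)) & MASK32

def ea_indexed(gpr, ra, rb):
    """Base GPR plus a displacement held in another GPR."""
    base = 0 if ra == 0 else gpr[ra]
    return (base + gpr[rb]) & MASK32

def branch_target(mode, branch_addr=0, literal=0, reg=0):
    """Target address for absolute, relative, or indirect (LR/CTR) branches."""
    if mode == "absolute":
        return literal & MASK32
    if mode == "relative":
        return (branch_addr + literal) & MASK32
    if mode == "indirect":
        return reg & MASK32               # value taken from LR or CTR
    raise ValueError("unknown mode: " + mode)

gpr = [0] * 32
gpr[1], gpr[2] = 0x1000, 0x20
print(hex(ea_indirect(gpr, 1, 0xFFFC)))   # 0xffc  (0x1000 - 4)
print(hex(ea_indexed(gpr, 1, 2)))         # 0x1020
print(hex(branch_target("relative", branch_addr=0x2000, literal=-8)))  # 0x1ff8
```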
Power PC
MPC8640D Dual core PowerPC Processor
* High-performance, 32-bit superscalar microprocessor that implements the PowerPC architecture
* Eleven independent execution units and three register files:
  * Branch processing unit (BPU)
  * Four integer units (IUs) that share 32 GPRs for integer operands
  * 64-bit floating-point unit (FPU)
  * Four vector units and a 32-entry vector register file (VRs)
  * Three-stage load/store unit (LSU)
* Three issue queues, FIQ, VIQ, and GIQ, which can accept as many as one, two, and three instructions, respectively, in a cycle
* Dispatch unit
* Completion unit
* Two separate 32-Kbyte instruction and data level 1 (L1) caches
* Integrated 1-Mbyte, eight-way set-associative unified instruction and data level 2 (L2) cache with ECC
* 36-bit real addressing
* Separate memory management units (MMUs) for instructions and data
* Multiprocessing support features
* Power and thermal management
* Performance monitor
* In-system testability and debugging features
MPC8640D PowerPC
The MPC8640D is a superscalar microprocessor: instruction-level parallelism and seven-stage pipelined execution allow multiple instructions to be executed in parallel.
High-performance superscalar e600 core:
* As many as 4 instructions can be fetched from the instruction cache at a time.
* As many as 3 instructions can be dispatched to the issue queues at a time.
* As many as 12 instructions can be in the instruction queue (IQ).
* As many as 16 instructions can be at some stage of execution simultaneously.
* Seven-stage pipeline control.
Execution units:
* BPU: Branch Processing Unit
* VPU: Vector Permute Unit
* VIU: Vector Integer Unit
* VFPU: Vector Floating Point Unit
* FPU: Floating Point Unit
* IU: Integer Unit
* LSU: Load/Store Unit
MPC8640D with e600 PowerPC core: microarchitecture, with emphasis on the pipeline stages of the front end and the functional units.
Stages 1 and 2 - Fetch:
These two stages are both dedicated primarily to grabbing an instruction from the L1 cache. The e600 can fetch four instructions per clock cycle from the L1 cache and send them on to the next stage.
Stage 3 - Decode/Dispatch:
Once an instruction has been fetched, it goes into a 12-entry instruction queue to be decoded. The e600's decoder can dispatch up to three instructions per clock cycle to the next stage.
Stage 4 - Issue:
Instructions wait in one of three issue queues. The first is the Floating-Point Issue Queue (FIQ), which holds floating-point (FP) instructions that are waiting to be executed. The second is the Vector Issue Queue (VIQ), which holds vector operations. The third is the General Instruction Queue (GIQ), which holds everything else. Once an instruction leaves its issue queue, it goes to the execution engine to be executed.
Stage 5 - Execute:
The instructions can pass out-of-order from their issue queues into their respective functional units and be executed.
Stages 6 and 7 - Complete and Write-Back:
In these two stages, the instructions are put back into the order in which they came into the processor, and their results are written back.
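To get a feel for the front-end widths quoted above (four instructions fetched, three dispatched, twelve instruction-queue entries), the following Python sketch counts cycles in a deliberately simplified toy model. It assumes no cache misses, branches, or stalls, treats fetch as a single step, and lets an instruction dispatch no earlier than the cycle after it is fetched; it is not a model of the real e600 timing.

```python
def cycles_to_dispatch(n, fetch_width=4, dispatch_width=3, queue_size=12):
    """Cycles needed to fetch and dispatch n instructions in the toy model."""
    fetched = dispatched = queued = cycles = 0
    while dispatched < n:
        cycles += 1
        # Dispatch from instructions that were queued in earlier cycles.
        out = min(dispatch_width, queued)
        queued -= out
        dispatched += out
        # Fetch new instructions, limited by the free queue entries.
        new = min(fetch_width, n - fetched, queue_size - queued)
        queued += new
        fetched += new
    return cycles

print(cycles_to_dispatch(12))   # 5
print(cycles_to_dispatch(300))  # 101
```

In this model the sustained rate is set by the dispatch width of three per cycle, because the fetch stage can always keep the queue supplied.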
AltiVec is a floating-point and integer SIMD (Single Instruction, Multiple Data) instruction set designed and owned by Apple, IBM and Freescale Semiconductor, formerly the Semiconductor Products Sector of Motorola (the AIM alliance), and implemented on versions of the PowerPC. The vector processing unit is branded with several names:
* IBM: Vector Multimedia Extension (VMX)
* Apple: Velocity Engine
* Freescale: AltiVec
A vector architecture allows the simultaneous processing of multiple data items in parallel. Operations are performed on multiple data elements by a single instruction; this is referred to as Single Instruction, Multiple Data (SIMD) parallel processing.
For example:
* The addition of two vectors, VT = (VA + VB), is computed with single-cycle latency and single-cycle throughput.
* The multiply-and-accumulate instruction, VT = (VA * VB) + VC, is computed with 5-cycle latency and single-cycle throughput.
Each vector is 128 bits in size and can hold:
* an array of 16 characters
* an array of 8 short integers
* an array of 4 long integers
* an array of 4 SP floating-point numbers
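The element counts and the two example operations can be mimicked in plain Python. This sketch only models the element-wise semantics on lists; it does not use AltiVec hardware, it ignores integer wraparound and saturation, and the function names are invented for illustration.

```python
VECTOR_BITS = 128

def lanes(element_bits):
    """Number of elements of the given width that fit in one vector."""
    return VECTOR_BITS // element_bits

def vadd(va, vb):
    """Element-wise VT = VA + VB, conceptually one instruction."""
    return [a + b for a, b in zip(va, vb)]

def vmadd(va, vb, vc):
    """Element-wise VT = (VA * VB) + VC, conceptually one instruction."""
    return [a * b + c for a, b, c in zip(va, vb, vc)]

print(lanes(8), lanes(16), lanes(32))   # 16 8 4

va = [1.0, 2.0, 3.0, 4.0]               # four SP floats in one vector
vb = [10.0, 20.0, 30.0, 40.0]
vc = [0.5, 0.5, 0.5, 0.5]
print(vadd(va, vb))                     # [11.0, 22.0, 33.0, 44.0]
print(vmadd(va, vb, vc))                # [10.5, 40.5, 90.5, 160.5]
```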
AltiVec computational instructions are executed in four independent, pipelined AltiVec execution units:
* The VPU executes permutation instructions such as pack, unpack, merge, splat, and permute on vector operands.
* The VIU1 executes simple vector integer computational instructions, such as addition, subtraction, maximum and minimum comparisons, averaging, rotation, shifting, comparisons, and Boolean operations.
* The VIU2 executes longer-latency vector integer instructions, such as multiplication, multiplication/addition, and sum-across with saturation.
* The VFPU executes all vector floating-point instructions.
A maximum of two AltiVec instructions can be issued out of order to any combination of AltiVec execution units per clock cycle from the bottom two VIQ entries (VIQ1-VIQ0). This means an instruction in VIQ1 does not have to wait for an instruction in VIQ0 that is waiting for operand availability. Moreover, the VIU2, VFPU, and VPU are pipelined, so they can operate on multiple instructions. The VPU has a two-stage pipeline; the VIU2 and VFPU each have four-stage pipelines. As many as ten AltiVec instructions can be executing concurrently.
AltiVec Characteristics
* 4, 8 or 16 data elements per vector
* Separate register file with a 32-register namespace
* Vector-element data types of 8-, 16-, 32-bit signed/unsigned int, and IEEE SP float
* 162 instructions
* Intra- and inter-element arithmetic instructions
* Intra- and inter-element conditional instructions
* Powerful Permute, Shift and Rotate, Splat, Pack/Unpack and Merge instructions
* Modeless operation for zero-overhead use of AltiVec instructions
* Simultaneous dispatch of one ALU-class vector and one permute-class vector instruction, or either paired with a vector load/store
* Peak throughput of 2 instructions per cycle
* Simple ops: 1-cycle latency
* Compound ops: 3-4 cycle latency
* No restriction on issue with scalar instructions
AltiVec libraries
Ecosystem libraries
Multi-core/multi-threading
Summary of floating-point calculations for e600 @ 1 GHz:
* 0.8 SP or DP GFLOPS (one core only, regular FP instructions).
* 1.6 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
* 4.8 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
* 9.6 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).
Summary of floating-point calculations for e600 @ 1.5 GHz:
* 1.2 SP or DP GFLOPS (one core only, regular FP instructions).
* 2.4 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
* 7.2 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
* 14.4 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).
In addition, each core has four integer units (IU1, IU2, IU3, IU4), one vector permute unit (VPU), and two vector integer units (VIU1, VIU2) to meet computation requirements. The computation time required for a 1024-point complex-to-complex FFT is 10 microseconds.
Summary of floating-point calculations for MPC8640D @ 1 GHz:
* 1.6 SP or DP GFLOPS (dual core, regular FP instructions).
* 3.2 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
* 9.6 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
* 19.2 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).
Summary of floating-point calculations for MPC8640D @ 1.5 GHz:
* 2.4 SP or DP GFLOPS (dual core, regular FP instructions).
* 4.8 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
* 14.4 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
* 28.8 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).
In addition, each core has four integer units (IU1, IU2, IU3, IU4), one vector permute unit (VPU), and two vector integer units (VIU1, VIU2) to meet computation requirements.
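The quoted peak figures follow a simple pattern that can be reproduced with a few lines of Python. The sketch below is an inferred model, not a vendor formula: it assumes the scalar FPU contributes 0.8 floating-point operations per cycle (the value implied by the 0.8 GFLOPS at 1 GHz figure), the AltiVec VFPU adds four single-precision lanes per cycle, multiply-add counts as two operations, and dual-core numbers are twice the single-core ones.

```python
SCALAR_FLOPS_PER_CYCLE = 0.8   # assumption inferred from the quoted figures
VECTOR_LANES = 4               # SP floats in one 128-bit AltiVec register

def peak_gflops(clock_ghz, cores=1, altivec=False, multiply_add=False):
    """Peak GFLOPS under the inferred model, rounded to one decimal place."""
    per_cycle = SCALAR_FLOPS_PER_CYCLE
    if altivec:
        per_cycle += VECTOR_LANES      # scalar FPU plus vector FPU lanes
    if multiply_add:
        per_cycle *= 2                 # a multiply-add counts as two operations
    return round(per_cycle * clock_ghz * cores, 1)

print(peak_gflops(1.0))                                             # 0.8
print(peak_gflops(1.5, altivec=True))                               # 7.2
print(peak_gflops(1.5, cores=2, altivec=True, multiply_add=True))   # 28.8
```

Under these assumptions the function reproduces all sixteen figures listed in the summaries.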
[Figure: Quad PowerPC processor card. Recoverable labels: Embedded Controller, Node D, Node C.]
Node A, Node B and Node C will be used for signal processing.
Conclusion
Overall, the PowerPC is a strong architecture: it can handle more instructions at a time, it offers more operations for branching and floating-point work, and it handles the various complexities in data and memory more efficiently.