
Basic Computer Architectures

How does a CPU work?

CPU design characteristics - CISC, RISC, EPIC
Memory considerations
Cache
Main memory and paging
Compilers
Networking and communication
Beowulf Clusters

Architecture References

http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_
http://www.intel.com/products/processor/pentium_D/
http://www-03.ibm.com/chips/power/powerpc/

The NAND Gate

All computers can be built from NAND gates.
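As a quick illustration, the standard Boolean gates can all be composed from NAND alone. A minimal sketch in Python (the helper names are just for illustration):

```python
def nand(a: int, b: int) -> int:
    """The one primitive gate: 0 only when both inputs are 1."""
    return 0 if (a and b) else 1

# Every other gate built purely from NAND:
def not_(a):    return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b):  return nand(not_(a), not_(b))

def xor(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))
```

The same constructions carry over directly to hardware, which is why NAND is called a universal gate.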



Basic Microprocessor Instructions
Microprocessor CPUs can only execute a limited number of functions:
Load - load data from memory into the CPU
Store - store data from the CPU into memory
Branch or Jump - alter the order of instruction execution
Math and Logical Operations - internal operations within or between different words (shifts, adds, XOR, etc)
The specific operations are often more complex, such as a load from memory indirectly through a pointer in register X into register Y. However, they all fall into these simple categories.
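The four categories can be illustrated with a toy fetch-execute loop. The mini instruction set below is invented purely for illustration:

```python
# A toy interpreter showing the four instruction categories:
# LOAD (memory -> register), STORE (register -> memory),
# ADD (internal math), JNZ (branch).
def run(program, memory):
    regs = {"X": 0, "Y": 0}
    pc = 0                               # program counter
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "LOAD":                 # load data from memory into the CPU
            regs[args[0]] = memory[args[1]]
        elif op == "STORE":              # store data from the CPU into memory
            memory[args[1]] = regs[args[0]]
        elif op == "ADD":                # math operation inside the CPU
            regs[args[0]] += regs[args[1]]
        elif op == "JNZ":                # branch: alter instruction order
            if regs[args[0]] != 0:
                pc = args[1]
    return memory

mem = [5, 7, 0]
prog = [("LOAD", "X", 0), ("LOAD", "Y", 1),
        ("ADD", "X", "Y"), ("STORE", "X", 2)]
run(prog, mem)     # mem[2] now holds 5 + 7 = 12
```

Real instruction sets add addressing modes and many operation variants, but every instruction still lands in one of these categories.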

Processors
CPUs have greatly improved over the last 15 years. The changes in CPU design led to much higher system performance. As outlined by Dowd and Severance, the basic phases of this evolution are:
Complex Instruction Set Computers
Reduced Instruction Set Computers
Super-scalar and super-pipelined processors
Post-RISC computers
An excellent review of optimization and high performance computing is High Performance Computing by Kevin Dowd and Charles Severance (O'Reilly, 1998). Many of these notes are based on this book.

CISC - Complex Instruction Set Computers
The first few generations of microprocessors all had CISC designs. The idea was simple - lots of instructions minimized memory use and made the CPUs easier to program. A rich set of instructions makes it easier to write complex algorithms; complicated ideas could be concisely expressed. Coding these instructions into the chip's hardware made sense, since programmers could more easily work within the system constraints. However, most compilers did not take full advantage of the extra machine instructions, so they did not fully optimize the performance of high-level code. Improving performance through clever compiling obviated the need for special instruction sets. Only one instruction could be acted on at any given time in these systems as well.

RISC - Reduced Instruction Sets
RISC machines have small, highly optimized instruction sets. However, the main reasons for the high performance of RISC machines are more complicated. The common characteristics of RISC machines are:
instruction pipelining
uniform instruction length
simple addressing modes
load/store architecture
delayed branching
pipelined floating point operations

Uniform Instruction Length The simplest of these is the uniform instruction length. In CISC processors, the instruction length could vary from 1 byte to 4 (or more) bytes. All RISC instructions have the same length, making it faster to fetch and process the instructions.

Instruction Pipelining
Every instruction goes through a similar set of stages when it is processed. For example, in a given processor, the stages might be:
fetching the instruction
decoding the instruction
loading the operands
processing the instruction
saving the results to memory

Instruction Pipelining

Fetch -> Decode -> Load -> Process -> Save

Simultaneous Execution Since each of these steps is more or less independent from the other steps, it is possible to execute multiple instructions at the same time.

          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  ...
Instr 1   Fetch    Decode   Load     Process  Save
Instr 2            Fetch    Decode   Load     Process  Save
Instr 3                     Fetch    Decode   Load     Process  ...
Instr 4                              Fetch    Decode   Load     ...
Instr 5                                       Fetch    Decode   ...

At any intermediate time slice, effectively five instructions are simultaneously executing.


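The staggered overlap can be sketched as a small scheduling function. The five stage names follow the slides; everything else is illustrative:

```python
# Build the cycle-by-cycle schedule of a 5-stage pipeline: instruction i
# occupies stage s during cycle i + s (counting from 0).
STAGES = ["Fetch", "Decode", "Load", "Process", "Save"]

def schedule(n_instructions):
    """Return {cycle: [(instruction, stage), ...]} for the filled pipeline."""
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s, []).append((i, stage))
    return table

# By cycle 4 the pipeline is full: five instructions in five stages.
print(schedule(5)[4])
```

Once the pipeline is full, one instruction completes every cycle even though each individual instruction still takes five cycles from fetch to save.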

Pipeline Efficiency
To make pipelines work effectively, three simple modifications are added to the internal architecture.
Uniform Instruction Length - all instructions have a uniform byte length. This means loading instructions always works the same way, and decoding the instructions is straightforward.
Simple Addressing Modes - only simple addressing modes are allowed. Complicated memory calculations are not allowed in any single program step.
Simple Load/Store Modes - only simple load and store commands are allowed. There are no complicated multi-cycle load or store commands in the processor.

Pipeline Efficiency
All these modifications are made for the sake of efficiency. This has no real effect on the types of programs allowed in a high-level language. It only impacts how the compiler translates the program into machine code.

Problems with Pipelines
Unfortunately, you don't always know what the next instruction is in real programs. If there is a branch which relies on the current system state, you can't predict which path to follow. There are several approaches to this problem in normal RISC processors:
treat the branch as a no-op and continue execution (assume it will fail)
begin to process the instructions after following the branch
guess the branch route based on recent behavior at this location
randomly pick a direction
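The "guess based on recent behavior" strategy is classically implemented with a 2-bit saturating counter per branch location. A minimal sketch, with the training pattern chosen just for illustration:

```python
# A 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
# taken. Two wrong guesses are needed to flip the prediction, so a single
# anomalous outcome (e.g. a loop exit) does not disturb it.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2                    # start weakly predicting "taken"

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken 8 times, not taken once (loop exit), taken 8 more.
p = TwoBitPredictor()
hits = 0
for outcome in [True] * 8 + [False] + [True] * 8:
    hits += (p.predict() == outcome)
    p.update(outcome)
```

On this pattern the predictor misses only the single loop exit: 16 correct guesses out of 17.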

When Speculative Execution Fails
All of these work moderately well. However, all can fail in some cases. If the guess is wrong, the processor simply dumps the incorrectly executed instructions and starts filling its pipeline again.

Super-RISC Systems
First generation RISC machines have been improved upon in two ways.
Super-scalar processors execute several instructions at the SAME TIME. This only works, of course, if the instructions are independent of each other. However, the compilers can figure this out. This is essentially a subset of parallel computing. Note that this overlaps instructions from a single instruction stream; it is not the same as running multiple threads at once.
Super-pipelined processors have enlarged pipelines. Instead of five stages, they might have ten or more.

Post-RISC
Modern processors are often super-scalar. In order to keep several instructions executing at the same time, they often have to resort to some strange-sounding tricks.
Out-of-order execution makes sense in some cases. Even if instructions need to be dumped, you win if you guess correctly some or most of the time.
Speculative execution is also used in modern processors. They literally do things they think you might want to have done. Again, guessing is effective if it is right some or most of the time.
These are winning strategies if the guesses are helped by a smart compiler.

Itanium - EPIC
The Intel Itanium uses Explicitly Parallel Instruction Computing, or the EPIC architecture. The idea is an instruction bundle that issues several different instructions at the same time. Generally, the instruction buffer contains 128 bits, with typically three instructions held in the buffer.
The AMD 64-bit machines have sets of instructions that can operate on 128-bit data structures, but generally support both CISC and RISC style instructions. Current memory bandwidths and bus sizes make instructions smaller than 64 bits generally unnecessary - 64 bits is one read operation.

Floating Point Pipelines
Floating point pipelines are also EXTREMELY important in scientific computing. The idea is the same as normal instruction pipelining: a set of floating point operations is applied through a pipeline. Filling the floating point pipeline can greatly increase the speed of the computation. Unpipelined floating point operations can be executed, but usually MUCH slower than in fully pipelined machines.

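The payoff of a filled pipeline shows up in a back-of-the-envelope cycle count. The 5-stage depth and operation count below are illustrative:

```python
# Cycle counts for n floating point operations through a pipeline of the
# given depth: pipelined, the unit fills once and then retires one result
# per cycle; unpipelined, every operation waits for the previous one.
def cycles(n_ops, depth, pipelined):
    if pipelined:
        return depth + (n_ops - 1)   # fill the pipe, then 1 result/cycle
    return depth * n_ops             # each op pays the full latency

# 1000 multiplies through a 5-stage FP unit:
print(cycles(1000, 5, True), "vs", cycles(1000, 5, False))  # 1004 vs 5000
```

For long streams of operations the speedup approaches the pipeline depth, which is why keeping the floating point pipeline full matters so much in scientific code.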

Memory
Memory architectures in modern workstations are also more complex than on older machines.

Early Computers
CPU -> Registers -> RAM -> Disk

Modern Computers
CPU -> Registers -> L1 Cache -> L2 Cache -> L3 Cache -> RAM -> Disk

Memory vs Cache
The use of very high speed caches and large internal register sets has significantly increased computer speeds. Modern computers also use virtual memory for large jobs. This means that some of the program's storage is actually on disk, rather than in RAM.

Cache Principles
Memory is cached to avoid the cost of accessing RAM through a slower bus. Caching exploits memory locality. There are two types of locality:
spatial - regions in main memory that are physically close together
temporal - regions in main memory that are accessed close to each other in time

Cache Management
When you need to access a block, a slot must be freed in the cache. The cleared block is chosen by one of several algorithms:
LRU - Least Recently Used
FIFO - First In, First Out
LFU - Least Frequently Used
Random
There are many algorithms used to find these blocks efficiently.
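LRU replacement can be sketched in a few lines with an ordered dictionary. The capacity and the load_block callback here are illustrative, not part of any real cache hardware:

```python
from collections import OrderedDict

# A sketch of LRU replacement: the dictionary's insertion order doubles
# as the recency order, so eviction is just "pop the oldest entry".
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()             # block address -> data

    def access(self, block, load_block):
        if block in self.slots:
            self.slots.move_to_end(block)      # hit: mark most recently used
            return self.slots[block]
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)     # evict least recently used
        self.slots[block] = load_block(block)  # miss: fill a slot
        return self.slots[block]

cache = LRUCache(2)
fetch = lambda addr: f"data@{addr}"
cache.access(0, fetch); cache.access(1, fetch)
cache.access(0, fetch)      # touch block 0, so block 1 is now least recent
cache.access(2, fetch)      # evicts block 1, not block 0
```

Hardware caches approximate LRU with a few status bits per slot rather than a full ordering, but the policy being approximated is the one above.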

Internal Cache Structure
Cache and memory are organized in 32 byte blocks. Cache memory is divided into slots, each of which holds:
a tag to associate the slot with a memory address
a valid bit marking whether the slot currently holds live data
a dirty bit marking blocks that have been modified in the cache
Main memory is organized in blocks of the same size.
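Assuming 32 byte blocks and an illustrative slot count, splitting a memory address into tag, slot index, and byte offset looks like this:

```python
# With 32-byte blocks the low 5 address bits select the byte within the
# block; the next bits select the cache slot; the remaining high bits
# form the tag stored alongside the slot. SLOTS = 256 is illustrative.
BLOCK = 32        # bytes per block -> 5 offset bits
SLOTS = 256       # slots in cache  -> 8 index bits

def split(addr):
    offset = addr % BLOCK
    index  = (addr // BLOCK) % SLOTS
    tag    = addr // (BLOCK * SLOTS)
    return tag, index, offset

print(split(0x12345))   # -> (9, 26, 5)
```

On a lookup, the index selects a slot, the stored tag is compared against the address's tag, and the valid bit confirms the slot holds live data; only if all of that matches is it a hit.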


Cache Models
Direct Mapped Cache
simple to implement
certain array strides cause repeated cache misses
Associative Mapped Cache
very flexible memory management
difficult to search for memory in the cache
difficult to implement efficiently
Set Associative Mapped Cache
combines characteristics of both models
used in nearly all modern computers
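The array-stride problem of a direct mapped cache can be shown directly: two addresses exactly one cache size apart always land in the same slot, so accessing them alternately evicts each one in turn. Sizes here are illustrative:

```python
# In a direct mapped cache each memory block has exactly one possible
# slot, chosen by (block number) mod (number of slots).
BLOCK, SLOTS = 32, 256
CACHE_SIZE = BLOCK * SLOTS              # 8 KB, illustrative

def slot(addr):
    return (addr // BLOCK) % SLOTS

a, b = 0x0000, CACHE_SIZE               # stride of one full cache size
assert slot(a) == slot(b)               # same slot: alternating a/b thrashes
```

A set associative cache avoids this by giving each index a small set of slots, so a handful of conflicting addresses can coexist.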

Registers, Counters and Flags
All CPUs have internal storage areas labeled registers, counters and flags:
registers are regions used to hold pointers, counters, flags, and other data needed for internal calculations
counters are registers that are updated as a result of program execution
flags are status holders for the processor and the program

Virtual Memory
Even with large main memories, some programs must spill to other large storage devices such as hard drives to solve complex problems. This generally works by way of virtual memory.


Fragmentation and Segmentation
Multiple programs execute within main memory.
Each program is held in a segment - a memory block of arbitrary size.
As some programs end, space is freed in the middle of the memory space.
Fragmentation occurs because these free blocks are non-contiguous.


Managing Virtual Memory
Most of the management of virtual memory, segments, and paging is done with the Translation Lookaside Buffer, or TLB. The TLB is usually on the CPU, and keeps track of the mapping between virtual and physical memory.
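The translation path can be sketched as follows. The 4 KB page size is a common value; the page table contents and the dictionary standing in for the TLB are purely illustrative:

```python
# A sketch of virtual-to-physical address translation: split the virtual
# address into (page, offset), look the page up in a tiny TLB, falling
# back to the page table on a TLB miss.
PAGE = 4096                          # 4 KB pages -> 12 offset bits
page_table = {0: 7, 1: 3, 2: 9}     # virtual page -> physical frame
tlb = {}                            # cache of recent translations

def translate(vaddr):
    vpage, offset = divmod(vaddr, PAGE)
    if vpage not in tlb:             # TLB miss: walk the page table
        tlb[vpage] = page_table[vpage]
    return tlb[vpage] * PAGE + offset

print(hex(translate(0x1234)))        # virtual page 1 -> frame 3: 0x3234
```

In hardware the page table walk is the expensive step (it touches main memory), which is why a high TLB hit rate is essential to virtual memory performance.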

Networking
Networking is a relatively modern idea in computing. According to http://www.pbs.org/nerds/timeline/network.html:
ARPANET went on-line in 1969
UUCP and USENET were established in 1974
TCP/IP was created in 1982
10,000 network nodes in 1987
Berners-Lee creates a web server in 1992, and one million nodes are on the network
WWW has 341,000% growth in 1993

TCP/IP
IP - responsible for routing a packet from node to node
TCP - responsible for ensuring a packet's content is correct

Layers
application - http, telnet, ssh, whois
presentation - data models / notation
session - SSH, RPC, BSD sockets
transport - TCP, UDP
network - IP
data link - Ethernet, frame relay, ISDN
physical - radio, wire, optical

Communications Overhead
Packets in TCP/IP have three parts:
Header - sender/receiver IP, protocol, packet number
Payload/body - data
Trailer - end of packet marker / error correction
In an example email packet, the breakdown is:
header - 11 bytes
data - 112 bytes
trailer - 4 bytes
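Using those example numbers, the per-packet overhead works out as:

```python
# Overhead fraction for the example packet: the header and trailer bytes
# are transmitted but carry no payload.
header, payload, trailer = 11, 112, 4
total = header + payload + trailer                     # 127 bytes on the wire
overhead = (header + trailer) / total
print(f"{overhead:.1%} of each packet is overhead")    # roughly 11.8%
```

The fixed header and trailer cost is why many small packets are far less efficient than a few large ones for the same amount of data.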

Compilers
Modern compilers are essential to high performance CPUs. Compilers pass through a number of stages in translating programs into machine code:
preprocessing - adding definitions and include files
lexical analysis - finding keywords, variables, constants, and operators
parsing - moving the code into an intermediate representation
optimization - simple code changes to improve efficiency
code generation - creation of machine code

Compiler Optimizations
Compilers are getting much better, but the optimization changes they normally make are fairly simple:
removal of inaccessible code
removal of code that produces unused results
simplification of constants
constant folding (of variables that are never redefined)
common subexpression elimination
mathematical simplifications
removal of loop invariant code
simplification of inductive loops
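Constant folding is easy to observe in practice. For example, CPython's own bytecode compiler folds constant expressions at compile time, so the multiplication below never survives to run time:

```python
# Compile a statement containing a constant expression and inspect the
# compiled code object: the folded value appears among its constants.
code = compile("seconds_per_day = 60 * 60 * 24", "<example>", "exec")
print(86400 in code.co_consts)   # True: 60 * 60 * 24 was folded to 86400
```

Optimizing C and Fortran compilers do the same thing far more aggressively, folding constants across expressions and even across function boundaries.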

Cluster Based Computing
The new direction for parallel computing is cluster based computing. The essential characteristics are:
multiple commercial off-the-shelf (COTS) machines (usually Intel-based)
no special operating system or compilers (usually Linux)
high speed but COTS networking cards and routers connecting separate boxes
a standard message passing library handling communications through RSH or SSH
The common name for these machines is Beowulf clusters.

Performance Characteristics of Beowulf Clusters
These are general characteristics of Beowulf clusters. Consider them rules of thumb rather than certainties:
individual nodes are moderate end PC boxes
communications are usually routed through a single router
most communications are fairly slow, with moderate to high latency

Message Passing Libraries With the creation of MIMD machines, message passing libraries had to be created to allow general communication between computational nodes. The original libraries used were proprietary, and sold only with parallel machines. Every company wanted to have new and better features than every other company. This competition led to completely machine dependent programming. Every new parallel machine required a complete rewrite of the message passing sections. It is not very surprising that parallel computing has been a commercial failure... at least until recently.


SCS's Beowulf Cluster - Maps
maps.scs.gmu.edu is a fairly standard Beowulf cluster:
maps has 64 nodes and a firewall root node
each node has dual Pentium III 600 MHz processors with 512 MB of memory and a 20 GB hard drive
communications are done with 100Base-T cards
the central router can handle 2 Gbit/s
