
GPU Model

Cedric Nugteren February 2, 2010

Contents
1 Introduction
2 Input data
  2.1 GPU characteristics
  2.2 Source code
  2.3 CUDA constants
3 Model
  3.1 Core model (cM)
  3.2 Roofline model (rM)
  3.3 Extensions model (eM)
  3.4 Memory model (mM)
  3.5 Null model (nM)
4 Evaluation


1 Introduction

GPUs have been of great interest lately, as they can be used as a general purpose computation platform. Because of the competition between different vendors, details of the hardware architecture are not disclosed. The last years have shown different models and simulators trying to predict performance. Their usage is threefold: to predict performance of yet to be developed applications, to identify bottlenecks in existing applications, or to predict performance of fictional GPU architectures. The last is of particular interest in the field of computer architecture design, in which "how do we improve the performance of GPUs?" is an important question. No hardware description (on paper or in HDL) is available, so only models and simulators can provide an answer to this question. This document introduces a GPU model. Other models exist, but are inaccurate, incomplete or badly explained. The aim of this document is to provide clear insight into a simple model, extending it as the document progresses.

2 Input data

The input data for the GPU model consists of three parts. First, characteristics of the GPU are needed. These consist of specifications and measured data and are different for each GPU. Secondly, the source code is analysed to extract various data. Finally, some CUDA constants are needed. They have not changed so far, but could change in another GPU architecture.

2.1 GPU characteristics

Table 1 shows all the characteristics required to run the model¹. Some are obtained from the GPU's specifications, while others are benchmarked and can thus be less precise. The table also shows some typical values.

     Name             Description                                        Typical value
  1  clock            Core clock frequency of the SM                     0.3GHz-2.0GHz
  2  multiprocessors  Number of SMs                                      1-32
  3  smem size        Size of the shared memory                          16KB
  4  reg size         Size of the register file                          8192
  5  warp size        Number of threads per warp                         32
  6  bandwidth        Bandwidth between the DRAM and the SMs             10GB/s-100GB/s
  7  overhead base    Kernel launch overhead base                        6000 cycles
  8  overhead arg     Kernel launch overhead per argument                400 cycles
  9  overhead block   Kernel launch overhead per thread block            10 cycles
 10  mem latency      DRAM access latency                                400-600 cycles
 11  mem rate uc      DRAM fire rate latency for uncoalesced accesses    300 cycles
 12  mem rate c       DRAM fire rate latency for coalesced accesses      40 cycles

Table 1: GPU characteristics

2.2 Source code

Table 2 shows all the required input from the source code. After compilation, the instruction counts can be obtained from the PTX source. Other details are obtained from the CUBIN and the host code. Shown are some example values for a case-study program.
¹ Some characteristics can be calculated from other characteristics; this should be changed later.

     Name                Description                                          Example kernel
  1  threads per block   Number of threads per block                          128
  2  computation instr   Number of instructions per thread                    23
  3  mem load instr      Number of (un)coalesced loads per thread             2
  4  mem store instr     Number of (un)coalesced stores per thread            1
  5  barrier instr       Number of synchronization instructions per thread    0
  6  arguments           Number of arguments passed to the kernel             2
  7  smem usage          Shared memory usage per thread block                 3KB
  8  reg usage           Register usage per thread                            7
  9  blocks              Number of thread blocks                              4048

Table 2: Source code

2.3 CUDA constants

Table 3 displays a number of values, defined as constants in the model. However, they could change in future architectures and might be interesting to vary for research purposes. The table shows the assigned values.

     Name               Description                                       Value
  1  processors         Number of processing elements (PEs) in one SM     8
  2  c mem per warp     Number of coalesced memory accesses per warp²     1
  3  uc mem per warp    Number of uncoalesced memory accesses per warp³   32
  4  data size          Number of bytes for one data element⁴             4
  5  max threadblocks   Maximum number of thread blocks on one SM         8
  6  max threads        Maximum number of threads on one SM               768
  7  pipeline latency   Latency of the GPU pipeline                       24

Table 3: CUDA constants
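To make the later formulas concrete, the three input groups can be collected in simple data structures. The Python sketch below is illustrative only: the dictionary layout is not part of the model, and the clock, multiprocessor count, bandwidth and latency values are arbitrary picks from the typical ranges in Table 1.

    # Illustrative input data for the model, following Tables 1, 2 and 3.
    gpu = {                        # GPU characteristics (Table 1)
        "clock": 1.3e9,            # core clock in Hz (assumed, within 0.3-2.0 GHz)
        "multiprocessors": 16,     # assumed, within 1-32
        "smem_size": 16 * 1024,    # bytes
        "reg_size": 8192,
        "warp_size": 32,
        "bandwidth": 80e9,         # bytes/s (assumed, within 10-100 GB/s)
        "overhead_base": 6000,     # cycles
        "overhead_arg": 400,       # cycles per argument
        "overhead_block": 10,      # cycles per thread block
        "mem_latency": 500,        # cycles (assumed, within 400-600)
        "mem_rate_uc": 300,        # cycles, uncoalesced fire rate latency
        "mem_rate_c": 40,          # cycles, coalesced fire rate latency
    }
    src = {                        # source code characteristics (Table 2, example kernel)
        "threads_per_block": 128, "computation_instr": 23, "mem_load_instr": 2,
        "mem_store_instr": 1, "barrier_instr": 0, "arguments": 2,
        "smem_usage": 3 * 1024, "reg_usage": 7, "blocks": 4048,
    }
    cst = {                        # CUDA constants (Table 3)
        "processors": 8, "c_mem_per_warp": 1, "uc_mem_per_warp": 32,
        "data_size": 4, "max_threadblocks": 8, "max_threads": 768,
        "pipeline_latency": 24,
    }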

3 Model

This section discusses the GPU model, divided into multiple parts. The first part - the core model (cM) - contains the basic model of the GPU. Then, a roofline model (rM) is added to provide some constraints to the core model. Additionally, several extensions for special cases are presented in the extensions model (eM). Then, different types of memory accesses are taken into account (among others, memory coalescing) in the memory model (mM). The last part - the null model (nM) - covers all the variables used that are not directly taken from the source code, the GPU specifications or the constant values.

3.1 Core model (cM)

To begin with, the number of clock cycles the GPU needs to perform any instruction is calculated. This is done for both computation instructions (any instruction, including memory accesses) and memory instructions (only loads from the off-chip DRAM). The number of computation cycles per warp is obtained by multiplying the number of instructions with the issue cycles for one instruction⁵. The number of memory cycles is computed by multiplying the number of accesses by the total latency for one warp (the memory latency plus the memory fire rate latency, see the roofline model for an explanation). See table 4 and equations 1 and 2.
² Currently not used.
³ Currently not used.
⁴ Should be a source-code value.
⁵ Some instructions take more issue cycles; this needs to be added in a future version.

comp_cycles = src.computation_instr · (gpu.warp_size / cst.processors)    (1)

mem_cycles = memory_instr · (gpu.mem_latency + mem_rate)    (2)

    Cycle   Instruction   Threads
      1          1         1-8
      2          2         1-8
      3          1         9-16
      4          2         9-16
      5          1         17-24
      6          2         17-24
      7          1         25-32
      8          2         25-32

Table 4: Computation cycles for 2 instructions, using a warp size of 32 and 8 processors per SM.

Then, the performance is calculated. First, the number of cycles one multiprocessor (SM) takes to finish all its warps is calculated by multiplying both the computation and memory cycles with a certain factor (equation 3). These factors are determined by the roofline model (see the next subsection). Also, the total number of SM executions is calculated: the total number of thread blocks is divided by the number of SMs (they compute in parallel) multiplied by the number of blocks that fit onto one SM (equation 4). Multiplying the number of cycles needed per SM with the number of SM executions gives the total cycle count (equation 5).

cycles_per_sm = mem_factor · mem_cycles + comp_factor · comp_cycles    (3)

sm_executions = src.blocks / (gpu.multiprocessors · blocks_per_sm)    (4)

cycles = cycles_per_sm · sm_executions    (5)

Finally, with the cycle count known, the total execution time can be computed using the GPU core clock frequency. The cycle count (which already takes parallel computation into account) is divided by the frequency, as seen in equation 6.

time = cycles / gpu.clock    (6)
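As an illustration, the core model can be written down in a few lines of Python. This is a sketch only: memory_instr, mem_rate, mem_factor, comp_factor and blocks_per_sm are produced by the roofline, memory and null models described in the following subsections, and are taken here as plain arguments.

    def core_model(gpu, src, cst, memory_instr, mem_rate,
                   mem_factor, comp_factor, blocks_per_sm):
        # Equation 1: cycles to issue all computation instructions of one warp
        comp_cycles = src["computation_instr"] * gpu["warp_size"] / cst["processors"]
        # Equation 2: cycles for all memory accesses of one warp
        mem_cycles = memory_instr * (gpu["mem_latency"] + mem_rate)
        # Equation 3: cycles for one SM to finish all its warps
        cycles_per_sm = mem_factor * mem_cycles + comp_factor * comp_cycles
        # Equation 4: number of times each SM is filled with thread blocks
        sm_executions = src["blocks"] / (gpu["multiprocessors"] * blocks_per_sm)
        # Equations 5 and 6: total cycle count and execution time in seconds
        cycles = cycles_per_sm * sm_executions
        return cycles / gpu["clock"]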

3.2 Roofline model (rM)

First, the bandwidth limit is considered. The GPU has a limited bandwidth, so the bandwidth used per warp is calculated. The number of bytes one warp reads is either zero (no memory accesses) or the warp size times the data size (in bytes), as seen in equation 7. When this number is multiplied with the clock rate and divided by the total memory access time, the bandwidth used per SM is obtained (equation 8). The fraction of the used bandwidth is then obtained by dividing the theoretical⁶ bandwidth by the used bandwidth (the bandwidth per SM multiplied by the number of SMs). The fraction of used bandwidth for one warp is equal to the total number of warps needed to use the complete bandwidth (equation 9).

bytes_per_warp = cst.data_size · gpu.warp_size · min[1, memory_instr]    (7)

bw_per_sm = (bytes_per_warp · gpu.clock) / (gpu.mem_latency + mem_rate)    (8)

bw_limit = gpu.bandwidth / (bw_per_sm · gpu.multiprocessors)    (9)

⁶ Overhead needs to be taken into account; the theoretical value is not the maximum obtainable value in benchmarks.
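A small sketch of equations 7 to 9, assuming memory_instr and mem_rate are already known from the memory model; the guard for a kernel without memory accesses is an addition, not part of the equations.

    def bandwidth_limit(gpu, cst, memory_instr, mem_rate):
        # Equation 7: bytes read by one warp (zero if it has no memory accesses)
        bytes_per_warp = cst["data_size"] * gpu["warp_size"] * min(1, memory_instr)
        # Equation 8: bandwidth used by one warp on one SM (bytes per second)
        bw_per_sm = bytes_per_warp * gpu["clock"] / (gpu["mem_latency"] + mem_rate)
        if bw_per_sm == 0:
            return float("inf")  # guard (not in the model): no memory traffic, no limit
        # Equation 9: number of warps needed to use the complete bandwidth
        return gpu["bandwidth"] / (bw_per_sm * gpu["multiprocessors"])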

A second considered limit is the fire rate limit of memory requests. When firing a memory request, several clock cycles of delay are introduced before a second memory request is fired. While the delay per memory request can be low, the delay for a complete warp to fire all its memory requests can be significant⁷. The total memory request delay time per warp is thus modeled as the actual delay plus the fire rate delay. However, one SM can wait for multiple actual memory delays, but can only wait for one fire rate delay. In other words, the fraction of the total memory request delay spent on the fire rate delay imposes a limit on the number of simultaneously schedulable memory requests (equation 10). This is illustrated in table 5, in which only 4 memory requests can be active in one SM.

Table 5: Fire rate limit due to a memory access delay (m) 3 times higher than the fire rate delay (f)

fire_rate_limit = (gpu.mem_latency + mem_rate) / mem_rate    (10)

Then, the roofline model is applied. The two limits above, combined with the number of warps per SM, are evaluated against each other. Each of the three imposes a limit on the number of memory accesses at one instant in time. The strongest limit is taken as the limit on the number of memory accesses that can be performed simultaneously, as seen in equation 11.

sim_mem_limit = min[fire_rate_limit, bw_limit, warps_per_sm]    (11)

Then, two cases can be distinguished. The number of cycles needed for the memory operations of one warp is evaluated against the number of computation cycles multiplied by the limit on the number of simultaneously performed memory accesses. If the first number is smaller, the program is not memory bound. Therefore, the total cost of execution is one memory cycle (all memory instructions for one warp) plus the number of warps per SM multiplied by the computation cycles (all computation instructions for all warps on one SM). See table 6 and equations 12, 13 and 14.

mem_cycles ≤ sim_mem_limit · comp_cycles    (12)

mem_factor = 1    (13)

comp_factor = warps_per_sm    (14)

If the first number is larger, the program is memory bound. The total cost for memory is now equal to the memory cycles times the number of warps per SM divided by the memory access limit. For computation, it is the number of computation cycles times the number of warps per SM minus the memory access limit. See table 7 and equations 15, 16 and 17.

mem_cycles > sim_mem_limit · comp_cycles    (15)

mem_factor = warps_per_sm / sim_mem_limit    (16)

comp_factor = warps_per_sm − sim_mem_limit    (17)
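The selection of the two factors can be sketched as follows, with bw_limit coming from equations 7 to 9 and warps_per_sm from the null model in section 3.5.

    def roofline_factors(gpu, mem_rate, mem_cycles, comp_cycles, bw_limit, warps_per_sm):
        # Equation 10: requests that fit in one memory latency at the given fire rate
        fire_rate_limit = (gpu["mem_latency"] + mem_rate) / mem_rate
        # Equation 11: strongest of the three limits on simultaneous memory accesses
        sim_mem_limit = min(fire_rate_limit, bw_limit, warps_per_sm)
        if mem_cycles <= sim_mem_limit * comp_cycles:
            # Equations 12-14: not memory bound
            return 1, warps_per_sm                      # mem_factor, comp_factor
        # Equations 15-17: memory bound
        return warps_per_sm / sim_mem_limit, warps_per_sm - sim_mem_limit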

⁷ Memory accesses within one warp can be coalesced, reducing the fire rate delay considerably. This is explained in a later section.

Warp 1:  c   fm  fm  fm  fm
Warp 2:      c   fm  fm  fm  fm
Warp 3:          c   fm  fm  fm  fm
Warp 4:              c   fm  fm  fm  fm
Warp 5:                  c   fm  fm  fm  fm
Warp 6:                      c   fm  fm  fm  fm

Table 6: Scheduling 6 warps, each with 10 computation cycles (c) and 40 memory cycles (fm), assuming a limit of 4 warps accessing memory simultaneously.

Table 7: Scheduling 6 warps, each with 10 computation cycles (c) and 60 memory cycles (fm), assuming a limit of 4 warps accessing memory simultaneously.

3.3 Extensions model (eM)

The core model and the roofline model combined provide a complete model. However, some special cases are not considered. The first issue the extensions model addresses is the kernel launch-time overhead. In most cases, this time is negligible, but for small kernels it becomes noticeable. It is divided into three parts. The first is a base overhead (equation 18), which is the same for any kernel. Then, for every argument passed to the kernel above 5, a second overhead is added (equation 19). Finally, for every launched thread block a small overhead is added (equation 20). This additional cost is added to the cycle count, as seen in equation 21.

overhead_per_kernel = gpu.overhead_base    (18)

overhead_per_kernel = overhead_per_kernel + gpu.overhead_arg · max[0, src.arguments − 5]    (19)

overhead_per_kernel = overhead_per_kernel + gpu.overhead_block · src.blocks    (20)

cycles = cycles + overhead_per_kernel    (21)
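A sketch of equations 18 to 21: the three overheads are simply accumulated and added to the cycle count from the core model.

    def launch_overhead(gpu, src):
        overhead = gpu["overhead_base"]                                  # equation 18
        overhead += gpu["overhead_arg"] * max(0, src["arguments"] - 5)   # equation 19
        overhead += gpu["overhead_block"] * src["blocks"]                # equation 20
        return overhead

    # Equation 21: cycles = cycles + launch_overhead(gpu, src)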

A second extension is due to the pipeline latency of the processing elements. Until now, the throughput (1 instruction per cycle per processor) is used to calculate performance. However, with an insufficient number of threads, this throughput may not be reached due to the pipeline latency (24 cycles in the current architecture). If the SM does not have enough instructions from different threads to process within these 24 cycles, the full throughput is not reached. In this case, equation 1 changes into equation 22.

comp_cycles = src.computation_instr · max[gpu.warp_size / cst.processors, cst.pipeline_latency / warps_per_sm]    (22)
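A sketch of equation 22: the cost per instruction is the larger of the throughput cost and the pipeline latency divided over the available warps.

    def comp_cycles_with_pipeline(gpu, src, cst, warps_per_sm):
        per_instruction = max(gpu["warp_size"] / cst["processors"],
                              cst["pipeline_latency"] / warps_per_sm)
        return src["computation_instr"] * per_instruction               # equation 22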

3.4 Memory model (mM)

Both memory loads and memory writes are considered equal⁸. The number of memory accesses is therefore the sum of both, as seen in equation 23.
⁸ This is not yet proven to be correct.

memory_instr = src.mem_load_instr + src.mem_store_instr    (23)

As shown in the roofline model, the number of simultaneous memory requests can be reduced dramatically by the fire rate delay. One warp can have a significant fire rate delay, since all threads need to perform their own memory access request. However, threads within a warp can be coalesced into a larger group, reducing the fire rate delay and enabling more simultaneous memory requests. Without coalescing, each thread fires its memory request one fire rate delay after the previous thread. With coalescing, each warp fires its memory request one fire rate delay after the previous warp. Due to the larger memory access when coalescing, the fire rate delay per warp is slightly larger than the fire rate delay per thread, but significantly smaller than the sum of all fire rate delays when not coalescing. This is seen in equation 24 and tables 8 (uncoalesced) and 9 (coalesced).

mem_rate = gpu.mem_rate_uc   if memory accesses are uncoalesced    (24)
mem_rate = gpu.mem_rate_c    if memory accesses are coalesced

Thread 1, Warp 1:  f   m   m   m   m   m   m   (f) (f)
Thread 2, Warp 1:  (f) f   m   m   m   m   m   m   (f)
Thread 3, Warp 1:  (f) (f) f   m   m   m   m   m   m
Thread 1, Warp 2:              f   m   m   m   m   m   m   (f) (f)
Thread 2, Warp 2:              (f) f   m   m   m   m   m   m   (f)
Thread 3, Warp 2:              (f) (f) f   m   m   m   m   m   m

Table 8: Scheduling uncoalesced memory accesses, assuming 3 threads per warp and a memory access latency of twice the fire rate delay per warp

Threads 1-3, Warp 1:  m   m   m   m   m   m
Threads 1-3, Warp 2:  f   m   m   m   m   m

Table 9: Scheduling coalesced memory accesses, assuming 3 threads per warp and a memory access latency of four times the fire rate delay per warp

Another item discussed in the memory model is partition camping. The available memory bandwidth can only be fully utilized when memory accesses are evenly distributed over the different memory partitions. Each partition has its share of the bandwidth. Certain applications might be limited to utilizing one or several partitions, but most applications will use all partitions. However, these partitions might not all be used at the same time. To detect this, the source code needs to be analyzed, combined with a complete schedule of threads on the GPU. This is too extensive to add to the model.
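The memory model's contribution to the other parts can be sketched as below; whether the accesses of a kernel are coalesced is assumed to be known from inspection of the source code.

    def memory_inputs(gpu, src, coalesced):
        # Equation 23: loads and stores are counted together
        memory_instr = src["mem_load_instr"] + src["mem_store_instr"]
        # Equation 24: the fire rate latency depends on coalescing
        mem_rate = gpu["mem_rate_c"] if coalesced else gpu["mem_rate_uc"]
        return memory_instr, mem_rate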

3.5 Null model (nM)

Starting with the input variables as seen in sections 2.1, 2.2 and 2.3, all variables used in the previous sections can be calculated. First, the number of blocks running on one SM is calculated. This is bound by several factors, including a limit to the number of blocks, the number of threads, the shared memory usage and the register usage. All these limits are imposed by the hardware design. Memories, registers and look-up tables have a fixed size, allowing a maximum number of blocks and/or threads to be scheduled. This is shown in equation 25.

blocks_per_sm = min[ src.blocks / gpu.multiprocessors,
                     cst.max_threadblocks,
                     cst.max_threads / src.threads_per_block,
                     gpu.smem_size / (src.smem_usage + 1),
                     gpu.reg_size / (max[src.reg_usage, 1] · src.threads_per_block) ]    (25)

With the number of blocks running on one SM known, the number of warps per block and the number of warps per SM are calculated, as shown in equations 26 and 27.

warps_per_block = src.threads_per_block / gpu.warp_size    (26)

warps_per_sm = blocks_per_sm · warps_per_block    (27)
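A sketch of equations 25 to 27. The floor divisions are an assumption: the equations themselves do not state how fractional results are rounded, but only whole blocks fit on an SM.

    def occupancy(gpu, src, cst):
        # Equation 25: blocks per SM, bound by the number of blocks, the hardware
        # limits, the shared memory usage and the register usage
        blocks_per_sm = min(
            src["blocks"] // gpu["multiprocessors"],
            cst["max_threadblocks"],
            cst["max_threads"] // src["threads_per_block"],
            gpu["smem_size"] // (src["smem_usage"] + 1),
            gpu["reg_size"] // (max(src["reg_usage"], 1) * src["threads_per_block"]),
        )
        # Equations 26 and 27: warps per block and warps per SM
        warps_per_block = src["threads_per_block"] / gpu["warp_size"]
        warps_per_sm = blocks_per_sm * warps_per_block
        return blocks_per_sm, warps_per_block, warps_per_sm

Feeding these values, together with the memory-model and roofline sketches above, into core_model reproduces the flow of this section for the example kernel of Table 2.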

4 Evaluation

The presented model can give an indication of performance on fictional and existing GPU architectures. However, many special cases are not considered. Also, although the performance estimation is fairly adequate, the model gives no insight into the actual instruction scheduling mechanism and therefore abstracts from the real GPU architecture. A thread/instruction scheduler and a memory emulator can be added to the model, moving it more in the direction of a simulator. This simulator could be semantically incorrect but cycle accurate, giving performance metrics. A simulator will improve the model on the following points:

- It gives more accuracy to instruction and thread scheduling, making room for analysis of the actual warp scheduler.
- It will include a full memory model, with coalescing and partition camping, showing bottlenecks in the architecture.
- It gives better insight into hardware design choices.
