Written by: Mordechai Butrashvily
Date: 17/08/2008
E-mail: moti@gass-ltd.co.il
Website: http://www.gass-ltd.co.il/products/cuda.net
Revision  Writer                 Date        Changes
1.1       Mordechai Butrashvily  17/08/2008  2nd revision, final version for CUDA.NET 1.1
1.0       Mordechai Butrashvily  10/08/2008  First revision
Page | 2 All rights reserved 2008. Company for Advanced Supercomputing Solutions
Notice ALL COMPANY'S DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, MATERIALS) ARE BEING PROVIDED AS IS. THE COMPANY MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, the company assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of the company. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. Company's products are not authorized for use as critical components in life support devices or systems without express written approval of the company.
Trademarks NVIDIA is a trademark or registered trademark of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright 2008 - Company for Advanced Supercomputing Solutions Ltd, All rights reserved. Bosmat 2a Street Shoham, 73142, Israel
http://www.gass-ltd.co.il
Contents
Introduction .......................................................... 4
CUDA.NET basic objects ................................................ 5
   Driver Objects ..................................................... 5
   Data Types ......................................................... 5
   Working with devices ............................................... 6
   Working with device memory ......................................... 6
Launching CUDA code ................................................... 7
   Working with modules ............................................... 7
   Working with functions ............................................. 8
   Setting function parameters ........................................ 8
   Setting execution configuration .................................... 9
Working with CUFFTDriver .............................................. 9
Higher Level Objects ................................................. 11
New Object Model ..................................................... 11
   CUDA Object ....................................................... 12
   Working with CUFFT ................................................ 13
   Working with CUBLAS ............................................... 13
Introduction
CUDA.NET is a library that provides the same functionality as the CUDA driver (exposed through its C interface) to .NET based applications.
This document relates to CUDA.NET 1.1, which supports the API of CUDA 1.1. The library has been tested and runs without problems on CUDA 2.0, but new features introduced in CUDA 2.0 are not yet supported by CUDA.NET.
As such, it wraps all the functionality of CUDA for .NET, practically speaking:
- Device enumeration
- Context management
- Memory allocation and transfer (including arrays management)
- Texture management
- Asynchronous data transfer and execution - through streams
In addition, it provides access to all other routines provided by CUDA:
- FFT routines (1D to 3D)
- BLAS routines
To simplify development of .NET based applications, the library includes data types that correspond to CUDA specifications, especially vector types:

CUDA.NET                              CUDA
Char1, Char2, Char3, Char4            char1, char2, char3, char4
UChar1, UChar2, UChar3, UChar4        uchar1, uchar2, uchar3, uchar4
Short1, Short2, Short3, Short4        short1, short2, short3, short4
UShort1, UShort2, UShort3, UShort4    ushort1, ushort2, ushort3, ushort4
Int1, Int2, Int3, Int4                int1, int2, int3, int4
UInt1, UInt2, UInt3, UInt4            uint1, uint2, uint3, uint4
Long1, Long2, Long3, Long4            long1, long2, long3, long4
ULong1, ULong2, ULong3, ULong4        ulong1, ulong2, ulong3, ulong4
Float1, Float2, Float3, Float4        float1, float2, float3, float4
The basic primitive types are supported as well (CUDA.NET syntax conforms to C#):

CUDA.NET        CUDA
sbyte, byte     char, unsigned char
short, ushort   short, unsigned short
int, uint       int, unsigned int
long, ulong     long, unsigned long
float           float
CUDA.NET basic objects
As stated in the previous section, CUDA.NET is a wrapper over the CUDA driver. To ease development and migration of existing CUDA applications written in C to .NET, the same API was preserved.
Accessing the driver API of CUDA from .NET is done through the CUDADriver object of CUDA.NET. All its methods are static, allowing direct access to the same functions. For example, let's consider the following CUDA application written in C:

#include <cuda.h>

int main()
{
    // Initialize the driver.
    cuInit(0);
}
The same code with CUDA.NET looks like this:

using GASS.CUDA;

namespace CUDATest
{
    class Test
    {
        static void Main(string[] args)
        {
            // Initialize the driver.
            CUDADriver.cuInit(0);
        }
    }
}
The same approach can be applied to all other functions of the CUDA driver API.
Driver Objects
The set of basic wrapper objects provided by CUDA.NET is:
- CUDADriver - provides access to the CUDA API
- CUFFTDriver - provides access to the CUFFT API
- CUBLASDriver - provides access to the CUBLAS API and routines
Data Types
Looking into the GASS.CUDA.Types namespace reveals some types that were created to support all features of CUDA from a .NET application:
- CUdevice - Represents a pointer to a device object
- CUdeviceptr - Represents a pointer to device memory
- CUcontext - Represents a pointer to a context object
- CUmodule - Represents a pointer to a loaded module object
- CUfunction - Represents a pointer to a function in a module
- CUarray - Represents a pointer to an allocated array in device memory
- CUtexref - Represents a pointer to a texture in device memory
- CUevent - Represents a pointer to an event
- CUstream - Represents a pointer to a stream that can be used for asynchronous operations

All these objects conform to the declarations in CUDA.
Working with devices
Before performing any CUDA operations, we must initialize the driver and select a device to work with. A device is selected by creating a context on it. An example might be:

static void Main(string[] args)
{
    // Initialize the driver - this call must come before any CUDA operation!
    CUDADriver.cuInit(0);

    // Get the first device from the driver.
    CUdevice dev = new CUdevice();
    CUDADriver.cuDeviceGet(ref dev, 0);

    // Create a new context with default flags.
    CUcontext ctx = new CUcontext();
    CUDADriver.cuCtxCreate(ref ctx, 0, dev);
}
By creating a context we tell the driver that this is the one to be used for all subsequent CUDA operations (the attach and detach functions can be used later to manage the current context). It should be noted that a context is always related to a single device.
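Since the context created above stays current for the calling thread, it should be released when the application finishes with CUDA. A minimal sketch, assuming CUDA.NET mirrors the C driver's cuCtxDetach the way it does the other functions shown in this document:

```csharp
// Detach the current context; when its usage count drops
// to zero, the driver destroys it.
CUDADriver.cuCtxDetach(ctx);
```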
Working with device memory
Using pointers from .NET code (with unsafe semantics) is discouraged, which is why all functions that accept pointers to device memory receive an object of type CUdeviceptr instead. This way .NET code stays clean while maintaining compatibility with the C API of CUDA, since all these objects are declared in the C environment as well.
An example for allocating device memory from .NET:
static void Main(string[] args)
{
    // Assuming the driver was initialized and a context was created.
    CUdeviceptr p1 = new CUdeviceptr();

    // Allocate 1K of data in device memory.
    CUDADriver.cuMemAlloc(ref p1, 1 << 10);
}
Now, to copy data to this pointer (in device memory):

byte[] b = new byte[1 << 10];
CUDADriver.cuMemcpyHtoD(p1, b, b.Length);

* In this case, a byte array with 1024 elements is exactly 1024 bytes long.
The same code can be used with any data type supported by CUDA.NET. In addition to the data types listed in the introduction section (.NET primitives and vector types), it is possible to use the CUFFT types, cufftReal and cufftComplex, for copying data into device memory.
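Copying in the other direction works symmetrically. A short sketch, assuming cuMemcpyDtoH and cuMemFree are wrapped like the other driver functions shown here:

```csharp
// Copy the data back from device memory into a host array.
byte[] result = new byte[1 << 10];
CUDADriver.cuMemcpyDtoH(result, p1, result.Length);

// Release the device allocation when no longer needed.
CUDADriver.cuMemFree(p1);
```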
Launching CUDA code
With CUDA it is possible to generate a binary file, called a module, which contains code that runs on the GPU. The binary file has a "cubin" extension, identifying it as the binary version of a *.cu file.
In the following example we will create a *.cu file and then call its global function from a CUDA.NET application.
Let's consider the following file, compute.cu:

extern "C" __global__ void compute(float4* data)
{
    // Some code.
}
Now we need to compile this file to create a binary version. For that purpose we issue the following command:

nvcc compute.cu --cubin
The result of this execution is a file named "compute.cubin" in the current folder.
*.cubin files act like modules in CUDA and are analogous to *.so or *.dll files under Linux or Windows respectively.
Working with modules
The driver API of CUDA allows us to load modules dynamically at run-time, to be consumed and then released when no longer in use.
Using CUDA.NET, loading a module can be done like this:
CUmodule mod = new CUmodule();
CUDADriver.cuModuleLoad(ref mod, "compute.cubin");

* It is highly encouraged to use a full path for the module file name.
After executing the code above we end up with a module that is loaded by the driver. The next step will be to get a function to execute from that module.
Working with functions
In the previous section we saw that the CUDA driver can load modules at run-time; the same holds for functions, although functions are hosted by modules.
Once we have a loaded module, and a reference to its object, we can get a reference to one of its global functions in the following way, using CUDA.NET:
CUfunction func = new CUfunction();
CUDADriver.cuModuleGetFunction(ref func, mod, "compute");
We used the module loaded previously to get a function named compute. At this point you can understand why the declaration in the compute.cu file used the extern "C" keyword: nvcc is a C++ compiler, so it emits symbols with C++ name mangling. extern "C" disables the mangling, letting us load the function by its plain name.
At this point we have a function in hand that is almost ready for execution. The next step will be to set the function's parameters dynamically.
Setting function parameters
After we have a function, and before it is executed on the GPU, we need to specify some parameters and configuration information.
Investigating the function signature we used ("compute"), we find that it accepts one parameter, which is a pointer to device memory:
extern "C" __global__ void compute(float4* data);
Before we set parameter information, it is necessary to allocate the memory on the device:

Float4[] data = new Float4[100];
CUdeviceptr ptr = new CUdeviceptr();
CUDADriver.cuMemAlloc(ref ptr, (uint)(Marshal.SizeOf(typeof(Float4)) * data.Length));
// Copy the data to the device
Setting parameters for this function is done with the cuParamSet* family of driver functions, mirroring the C API.
That is not enough, though. We still need to give the driver a hint indicating how much memory to reserve for function parameters:
CUDADriver.cuParamSetSize(func, 4);
* NOTE: When compiling CUDA code for 32-bit systems, pointers are always 4 bytes long; when compiling for 64-bit systems, they are 8 bytes. The last parameter of cuParamSetSize therefore varies with the platform. The pointer size can be obtained at run-time using IntPtr.Size in .NET.
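Putting these steps together for our compute function, the parameter setup might be sketched as follows. This assumes a cuParamSeti overload matching the C driver's (function, offset, value) signature; the way a CUdeviceptr is passed as the value is illustrative, not the library's exact API:

```csharp
// Pass the device pointer as the single parameter, at offset 0
// (passing ptr directly as the value is illustrative).
CUDADriver.cuParamSeti(func, 0, ptr);

// Reserve space for the parameters: one pointer, whose size
// depends on the platform.
CUDADriver.cuParamSetSize(func, (uint)IntPtr.Size);
```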
Setting execution configuration
One last step before executing code on the GPU is to set the execution configuration for the function. As is well known, CUDA execution is divided into grids, which are divided into blocks, which are in turn divided into threads (the basic execution element).
It is not the goal of this document to describe this approach as it is widely covered in the documentation provided by NVIDIA for CUDA.
The driver API provides functions to set each of these parameters:
- Grid size, by means of blocks
- Block size, by means of threads
To set the thread count for every block of execution:

CUDADriver.cuFuncSetBlockShape(func, 64, 8, 1);

This sets the block size to 64 threads on the X axis, 8 threads on the Y axis and 1 on the Z, for a total of 512 threads in each block. It is possible to set only one of the axes.
To launch the function in a grid:

CUDADriver.cuLaunchGrid(func, 512, 512);
The code above executes the function on the GPU with a configuration of 512 blocks on the X and Y axes respectively, for a total of 262,144 blocks and 134,217,728 threads.
Working with CUFFTDriver
The CUFFT routines provided by CUDA allow a programmer to perform FFT calculations on the GPU. The same API exposed by including cufft.h is used in CUDA.NET.
For example, let's consider the following code given in the official documentation of CUDA (written by NVIDIA): 1D Complex-to-Complex Transform
/* Use the CUFFT plan to transform the signal in place. */ cufftExecC2C(plan, data, data, CUFFT_FORWARD);
/* Inverse transform the signal in place. */ cufftExecC2C(plan, data, data, CUFFT_INVERSE);
/* Destroy the CUFFT plan. */ cufftDestroy(plan); cudaFree(data);
Performing the same operations with CUDA.NET looks like this:

using GASS.CUDA;
using GASS.CUDA.FFT;
using GASS.CUDA.FFT.Types;
using System.Runtime.InteropServices;
namespace CUFFTTest
{
    class Test
    {
        const int NX = 256;
        const int BATCH = 10;

        static void Main(string[] args)
        {
            // Assume the driver is initialized and a context was created.

            // Allocate data for the array.
            CUdeviceptr data = new CUdeviceptr();
            CUDADriver.cuMemAlloc(ref data, (uint)(Marshal.SizeOf(typeof(cufftComplex)) * NX * BATCH));

            /* Create a 1D plan. */
            cufftHandle plan = new cufftHandle();
            CUFFTDriver.cufftPlan1D(ref plan, NX, CUFFTType.ComplexToComplex, BATCH);

            /* Perform a forward FFT. */
            CUFFTDriver.cufftExecC2C(plan, data, data, CUFFTDirection.Forward);
            /* Perform an inverse FFT. */
            CUFFTDriver.cufftExecC2C(plan, data, data, CUFFTDirection.Inverse);

            /* Destroy the CUFFT plan and free the device memory. */
            CUFFTDriver.cufftDestroy(plan);
            CUDADriver.cuMemFree(data);
        }
    }
}
Higher Level Objects
In the final release of CUDA.NET, three objects were added to simplify development:
- CUDA - provides all CUDA functionality
- CUFFT - provides CUFFT functionality with simplified functions
- CUBLAS - simplifies working with CUBLAS routines
All new objects use the respective driver, so backward compatibility is maintained with previous versions.
To provide better feedback about what happened in the driver, each object throws a runtime exception specific to the class whenever the return value of the relevant driver function differs from CUResult.Success:
- CUDA - CUDAException
- CUFFT - CUFFTException
- CUBLAS - CUBLASException

This behavior can be controlled through the UseRuntimeExceptions property, which is true by default. To turn off runtime exceptions, simply set this property to false; it can be turned on again later.

New Object Model
The major change was in the CUDA object, to let programmers work easily with CUDA and devices.
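With runtime exceptions enabled (the default), driver errors surface as catchable .NET exceptions. A minimal sketch, assuming CUDAException derives from System.Exception as usual and that a Free method symmetric to Allocate exists:

```csharp
CUDA cuda = new CUDA(true);
try
{
    // Any call whose driver return value differs from
    // CUResult.Success raises a CUDAException.
    CUdeviceptr p = cuda.Allocate(1 << 30); // may exceed device memory
    cuda.Free(p);
}
catch (CUDAException ex)
{
    Console.WriteLine("CUDA error: {0}", ex.Message);
}
```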
A new object oriented approach was taken for this purpose. For example, it is possible to enumerate the devices recognized by CUDA simply by accessing the Devices property of the CUDA object:
CUDA cuda = new CUDA(true);
foreach (Device dev in cuda.Devices)
{
    Console.WriteLine("{0} -> {1}", dev.Ordinal, dev.Name);
}
The rationale behind the object model was to provide the same API with better syntax and function names, and to add some useful functions that will improve programming agility.
CUDA Object
This object was designed to provide simpler access to CUDA functions, without ref keywords or an overly low-level API.
Most of the functionality supported by CUDADriver is available through this object, although some functions did not find their way in; they will be added in future releases if necessary.
Let's consider the case of memory allocation. We can allocate memory simply through:

CUdeviceptr ptr = cuda.Allocate(128);
This fragment of code simply allocates 128 bytes of device memory and returns the appropriate pointer.
It should be noted at this point that all functions can still operate with low-level driver objects, to allow interoperability with the CUDADriver object.
Allocating memory for a .NET array can be done like this:

UInt3[] data = new UInt3[128];
CUdeviceptr ptr = cuda.Allocate<UInt3>(data);
The fragment above allocates enough memory for 128 elements of UInt3 vector type, for a total of 1536 bytes.
Using generic code and some explicit reflection, the functions compute the amount of memory to allocate, so there is no need to provide such details - only the array to allocate memory for.
To ease programming, some further functions were provided that allocate memory and copy data to device memory in a single call:

UInt3[] data = new UInt3[256];
CUdeviceptr ptr = cuda.CopyHostToDevice<UInt3>(data);
This code fragment allocates device memory for 256 elements of the UInt3 vector type (a total of 3072 bytes) and copies the array to device memory. This mechanism can of course be used with other array types and the primitives supported by CUDA.NET.
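A matching call in the opposite direction completes the round trip; the CopyDeviceToHost name and signature below are assumptions, symmetric to CopyHostToDevice:

```csharp
// Copy the results back into a host array (method name and
// signature assumed symmetric to CopyHostToDevice).
UInt3[] results = cuda.CopyDeviceToHost<UInt3>(ptr, data.Length);

// Release the device memory afterwards.
cuda.Free(ptr);
```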
Working with CUFFT
The CUFFT object supports the older functions with nicer usage, but also allows performing most FFT operations in a single call.
Creating a 1D plan can be done by:

CUFFT cufft = new CUFFT(new CUDA(true));
cufftHandle plan = cufft.Plan1D(nx, type, batch);
It is also possible to run any of the 1D FFT routines with a single call:

cufftReal[] realData = new cufftReal[256];
cufftComplex[] cmlxData = new cufftComplex[256];
cufft.Execute1D(realData, cmlxData, nx, batch);
The function handles memory management by itself and executes the appropriate FFT based on the provided parameters.
The same holds for all other types of FFT.
Working with CUBLAS
The CUBLAS object was designed to provide better usability when working with vector and matrix memory, while all other operations remain accessible from the CUBLASDriver object.
In future versions, all supported functions may enter CUBLAS with simpler signatures as well.
An example of initializing a vector:

CUDA cuda = new CUDA(true);
CUBLAS blas = new CUBLAS(cuda);
blas.Init();
float[] data = new float[] { 0.0f, 1.5f, 2.5f, 5.224f }; CUdeviceptr vector = blas.Allocate<float>(data); blas.SetVector<float>(data, vector);
blas.Free(vector); blas.Shutdown();
The example above demonstrates how to create a vector in device memory and copy data to be used by one of CUBLAS routines.