Written by: Mordechai Butrashvily
Date: 17/08/2008
E-mail: moti@gass-ltd.co.il
Website: http://www.gass-ltd.co.il/products/cuda.net
Revision  Writer                 Date        Changes
1.1       Mordechai Butrashvily  17/08/2008  2nd revision, final version for CUDA.NET 1.1
1.0       Mordechai Butrashvily  10/08/2008  First revision
Page | 2 All rights reserved 2008. Company for Advanced Supercomputing Solutions
Notice ALL COMPANY'S DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, MATERIALS) ARE BEING PROVIDED AS IS. THE COMPANY MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, the company assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of the company. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. Company's products are not authorized for use as critical components in life support devices or systems without express written approval of the company.
Trademarks NVIDIA is a trademark or registered trademark of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright 2008 - Company for Advanced Supercomputing Solutions Ltd, All rights reserved. Bosmat 2a Street Shoham, 73142, Israel
http://www.gass-ltd.co.il
Contents
Introduction .......................................................... 4
CUDA.NET basic objects ................................................ 5
   Driver Objects ..................................................... 5
   Data Types ......................................................... 5
   Working with devices ............................................... 6
   Working with device memory ......................................... 6
Launching CUDA code ................................................... 7
   Working with modules ............................................... 7
   Working with functions ............................................. 8
   Setting function parameters ........................................ 8
   Setting execution configuration .................................... 9
Working with CUFFTDriver .............................................. 9
Higher Level Objects ................................................. 11
New Object Model ..................................................... 11
   CUDA Object ....................................................... 12
   Working with CUFFT ................................................ 13
   Working with CUBLAS ............................................... 13
Introduction
CUDA.NET is a library that provides the same functionality as the CUDA driver (exposed through its C interface) to .NET based applications.
This document relates to CUDA.NET 1.1, which supports the API of CUDA 1.1. The library has been tested and runs without problems on CUDA 2.0, but new features introduced in CUDA 2.0 are not yet supported by CUDA.NET.
As such, it wraps all the functionality of CUDA for .NET, practically speaking:
- Device enumeration
- Context management
- Memory allocation and transfer (including arrays management)
- Texture management
- Asynchronous data transfer and execution - through streams
In addition, it provides access to all other routines provided by CUDA:
- FFT routines (1D to 3D)
- BLAS routines
To simplify development of .NET based applications, the library includes data types that correspond to CUDA specifications, especially vector types:

CUDA.NET                              CUDA
Char1, Char2, Char3, Char4            char1, char2, char3, char4
UChar1, UChar2, UChar3, UChar4        uchar1, uchar2, uchar3, uchar4
Short1, Short2, Short3, Short4        short1, short2, short3, short4
UShort1, UShort2, UShort3, UShort4    ushort1, ushort2, ushort3, ushort4
Int1, Int2, Int3, Int4                int1, int2, int3, int4
UInt1, UInt2, UInt3, UInt4            uint1, uint2, uint3, uint4
Long1, Long2, Long3, Long4            long1, long2, long3, long4
ULong1, ULong2, ULong3, ULong4        ulong1, ulong2, ulong3, ulong4
Float1, Float2, Float3, Float4        float1, float2, float3, float4
The basic primitive types are supported as well (CUDA.NET syntax conforms to C#):

CUDA.NET        CUDA
sbyte, byte     char, unsigned char
short, ushort   short, unsigned short
int, uint       int, unsigned int
long, ulong     long, unsigned long
float           float
CUDA.NET basic objects
As stated in the previous section, CUDA.NET is a wrapper over the CUDA driver. To ease development and migration of existing CUDA applications written in C to .NET, the same API was preserved.
Accessing the driver API of CUDA from .NET is done through the CUDADriver object of CUDA.NET. All its methods are static, allowing direct access to the same functions. For example, let's consider the following CUDA application written in C:

#include <cuda.h>

int main()
{
    // Initialize the driver.
    cuInit(0);
}
The same code with CUDA.NET looks like this:

using GASS.CUDA;

namespace CUDATest
{
    class Test
    {
        static void Main(string[] args)
        {
            // Initialize the driver.
            CUDADriver.cuInit(0);
        }
    }
}
The same approach can be applied to all other functions of the CUDA driver API.
Driver Objects
The set of basic wrapper objects provided by CUDA.NET is:
- CUDADriver - provides access to the CUDA API
- CUFFTDriver - provides access to the CUFFT API
- CUBLASDriver - provides access to the CUBLAS API and routines
Data Types
Looking into the GASS.CUDA.Types namespace reveals some types that were created to support all features of CUDA from a .NET application:
- CUdevice - Represents a pointer to a device object
- CUdeviceptr - Represents a pointer to device memory
- CUcontext - Represents a pointer to a context object
- CUmodule - Represents a pointer to a loaded module object
- CUfunction - Represents a pointer to a function in a module
- CUarray - Represents a pointer to an allocated array in device memory
- CUtexref - Represents a pointer to a texture in device memory
- CUevent - Represents a pointer to an event
- CUstream - Represents a pointer to a stream that can be used for asynchronous operations

All these objects conform to the declarations in CUDA.
Working with devices
Before performing any CUDA operations, we must initialize the driver and select a device to work with. A device is selected by creating a context on it. An example might be:

static void Main(string[] args)
{
    // Initialize the driver - this call must come before any CUDA operation!
    CUDADriver.cuInit(0);

    // Get the first device from the driver.
    CUdevice dev = new CUdevice();
    CUDADriver.cuDeviceGet(ref dev, 0);

    // Create a new context with default flags.
    CUcontext ctx = new CUcontext();
    CUDADriver.cuCtxCreate(ref ctx, 0, dev);
}
By creating a context we tell the driver that this is the one to be used for all subsequent CUDA operations (the attach and detach functions can be used later to manage the current context). It should be noted that a context is always related to a single device.
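Since the context created above stays current for the calling thread, it should be released when the application finishes with CUDA. A minimal sketch, assuming CUDA.NET mirrors the C driver's cuCtxDetach the way it does the other functions shown in this document:

```csharp
// Detach the current context; when its usage count drops
// to zero, the driver destroys it.
CUDADriver.cuCtxDetach(ctx);
```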
Working with device memory
Using pointers from .NET code (with unsafe semantics) is discouraged, which is why all functions that accept pointers to device memory receive an object of type CUdeviceptr instead. This way .NET code stays clean while maintaining compatibility with the C API of CUDA, since all these objects are declared in the C environment as well.
An example for allocating device memory from .NET:
static void Main(string[] args)
{
    // Assuming the driver was initialized and a context was created.
    CUdeviceptr p1 = new CUdeviceptr();

    // Allocate 1K of data in device memory.
    CUDADriver.cuMemAlloc(ref p1, 1 << 10);
}
Now, to copy data to this pointer (in device memory):

byte[] b = new byte[1 << 10];
CUDADriver.cuMemcpyHtoD(p1, b, b.Length);

* In this case, a byte array with 1024 elements is exactly 1024 bytes long.
The same code can be used with any data type supported by CUDA.NET. In addition to the data types listed in the introduction section (.NET primitives and vector types), it is possible to use the CUFFT types, cufftReal and cufftComplex, for copying data into device memory.
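Copying in the other direction works symmetrically. A short sketch, assuming cuMemcpyDtoH and cuMemFree are wrapped like the other driver functions shown here:

```csharp
// Copy the data back from device memory into a host array.
byte[] result = new byte[1 << 10];
CUDADriver.cuMemcpyDtoH(result, p1, result.Length);

// Release the device allocation when no longer needed.
CUDADriver.cuMemFree(p1);
```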
Launching CUDA code
With CUDA it is possible to generate a binary file, called a module, which contains code that runs on the GPU. The binary file has a "cubin" extension, identifying it as the binary version of a *.cu file.
In the following example we will create a *.cu file and then call its global function from a CUDA.NET application.
Let's consider the following file, compute.cu:

extern "C" __global__ void compute(float4* data)
{
    // Some code.
}
Now we need to compile this file to create a binary version. For that purpose we issue the following command:

nvcc compute.cu --cubin
The result of this execution is a file named "compute.cubin" in the current folder.
*.cubin files act like modules in CUDA and are analogous to *.so or *.dll files under Linux or Windows respectively.
Working with modules
The driver API of CUDA allows us to load modules dynamically at run-time, to be consumed and then released when no longer in use.
Using CUDA.NET, loading a module can be done like this:
CUmodule mod = new CUmodule();
CUDADriver.cuModuleLoad(ref mod, "compute.cubin");

* It is highly encouraged to use a full path for the module file name.
After executing the code above we end up with a module that is loaded by the driver. The next step will be to get a function to execute from that module.
Working with functions
In the previous section we saw that the CUDA driver can load modules at run-time; the same holds for functions, although functions are hosted by modules.
Once we have a loaded module, and a reference to its object, we can get a reference to one of its global functions in the following way, using CUDA.NET:
CUfunction func = new CUfunction();
CUDADriver.cuModuleGetFunction(ref func, mod, "compute");
We used the module loaded previously to get a function named compute. At this point you can understand why the declaration in the compute.cu file used the extern "C" keyword: nvcc is a C++ compiler, so it emits symbols with C++ name mangling. extern "C" disables the mangling, letting us load the function by its plain name.
At this point we have a function in hand that is almost ready for execution. The next step will be to set the function's parameters dynamically.
Setting function parameters
After we have a function, and before it is executed on the GPU, we need to specify some parameters and configuration information.
Investigating the function signature we used ("compute"), we find that it accepts one parameter, which is a pointer to device memory:
extern "C" __global__ void compute(float4* data);
Before we set parameter information, it is necessary to allocate the memory on the device:

Float4[] data = new Float4[100];
CUdeviceptr ptr = new CUdeviceptr();
CUDADriver.cuMemAlloc(ref ptr, (uint)(Marshal.SizeOf(typeof(Float4)) * data.Length));
// Copy the data to the device
Setting parameters for this function is done with the cuParamSet* family of driver functions, mirroring the C API.
That is not enough, though. We still need to give the driver a hint indicating how much memory to reserve for function parameters:
CUDADriver.cuParamSetSize(func, 4);
* NOTE: When compiling CUDA code for 32-bit systems, pointers are always 4 bytes long; when compiling for 64-bit systems, they are 8 bytes. The last parameter of cuParamSetSize therefore varies with the platform. The pointer size can be obtained at run-time using IntPtr.Size in .NET.
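Putting these steps together for our compute function, the parameter setup might be sketched as follows. This assumes a cuParamSeti overload matching the C driver's (function, offset, value) signature; the way a CUdeviceptr is passed as the value is illustrative, not the library's exact API:

```csharp
// Pass the device pointer as the single parameter, at offset 0
// (passing ptr directly as the value is illustrative).
CUDADriver.cuParamSeti(func, 0, ptr);

// Reserve space for the parameters: one pointer, whose size
// depends on the platform.
CUDADriver.cuParamSetSize(func, (uint)IntPtr.Size);
```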
Setting execution configuration
One last step before executing code on the GPU is to set the execution configuration for the function. As is well known, CUDA execution is divided into grids, which are divided into blocks, which are in turn divided into threads (the basic execution element).
It is not the goal of this document to describe this approach as it is widely covered in the documentation provided by NVIDIA for CUDA.
The driver API provides functions to set each of these parameters:
- Grid size, by means of blocks
- Block size, by means of threads
To set the thread count for every block of execution:

CUDADriver.cuFuncSetBlockShape(func, 64, 8, 1);

This sets the block size to 64 threads on the X axis, 8 threads on the Y axis and 1 on the Z, for a total of 512 threads in each block. It is possible to set only one of the axes.
To launch the function in a grid:

CUDADriver.cuLaunchGrid(func, 512, 512);
The code above executes the function on the GPU with a configuration of 512 blocks on the X and Y axes respectively, for a total of 262,144 blocks and 134,217,728 threads.
Working with CUFFTDriver
The CUFFT routines provided by CUDA allow a programmer to perform FFT calculations on the GPU. The same API exposed by including cufft.h is used in CUDA.NET.
For example, let's consider the following code given in the official documentation of CUDA (written by NVIDIA): 1D Complex-to-Complex Transform
/* Use the CUFFT plan to transform the signal in place. */ cufftExecC2C(plan, data, data, CUFFT_FORWARD);
/* Inverse transform the signal in place. */ cufftExecC2C(plan, data, data, CUFFT_INVERSE);
/* Destroy the CUFFT plan. */ cufftDestroy(plan); cudaFree(data);
Performing the same operations with CUDA.NET looks like this:

using GASS.CUDA;
using GASS.CUDA.FFT;
using GASS.CUDA.FFT.Types;
using System.Runtime.InteropServices;
namespace CUFFTTest
{
    class Test
    {
        const int NX = 256;
        const int BATCH = 10;

        static void Main(string[] args)
        {
            // Assume the driver is initialized and a context was created.

            // Allocate data for the array.
            CUdeviceptr data = new CUdeviceptr();
            CUDADriver.cuMemAlloc(ref data, (uint)(Marshal.SizeOf(typeof(cufftComplex)) * NX * BATCH));

            /* Create a 1D plan. */
            cufftHandle plan = new cufftHandle();
            CUFFTDriver.cufftPlan1D(ref plan, NX, CUFFTType.ComplexToComplex, BATCH);

            /* Perform a forward FFT. */
            CUFFTDriver.cufftExecC2C(plan, data, data, CUFFTDirection.Forward);
            /* Perform an inverse FFT. */
            CUFFTDriver.cufftExecC2C(plan, data, data, CUFFTDirection.Inverse);

            /* Destroy the CUFFT plan and free the device memory. */
            CUFFTDriver.cufftDestroy(plan);
            CUDADriver.cuMemFree(data);
        }
    }
}
Higher Level Objects
In the final release of CUDA.NET, three objects were added to simplify development:
- CUDA - provides all CUDA functionality
- CUFFT - provides CUFFT functionality with simplified functions
- CUBLAS - simplifies working with CUBLAS routines
All new objects use the respective driver, so backward compatibility is maintained with previous versions.
To provide better feedback about what happened in the driver, each object throws a runtime exception specific to the class whenever the return value of the relevant driver function differs from CUResult.Success:
- CUDA - CUDAException
- CUFFT - CUFFTException
- CUBLAS - CUBLASException

This behavior can be controlled through the UseRuntimeExceptions property, which is true by default. To turn off runtime exceptions, simply set this property to false; it can be turned on again later.

New Object Model
The major change was in the CUDA object, to let programmers work easily with CUDA and devices.
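With runtime exceptions enabled (the default), driver errors surface as catchable .NET exceptions. A minimal sketch, assuming CUDAException derives from System.Exception as usual and that a Free method symmetric to Allocate exists:

```csharp
CUDA cuda = new CUDA(true);
try
{
    // Any call whose driver return value differs from
    // CUResult.Success raises a CUDAException.
    CUdeviceptr p = cuda.Allocate(1 << 30); // may exceed device memory
    cuda.Free(p);
}
catch (CUDAException ex)
{
    Console.WriteLine("CUDA error: {0}", ex.Message);
}
```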
A new object oriented approach was taken for this purpose. For example, it is possible to enumerate the devices recognized by CUDA simply by accessing the Devices property of the CUDA object:
CUDA cuda = new CUDA(true);
foreach (Device dev in cuda.Devices)
{
    Console.WriteLine("{0} -> {1}", dev.Ordinal, dev.Name);
}
The rationale behind the object model was to provide the same API with better syntax and function names, and to add some useful functions that will improve programming agility.
CUDA Object
This object was designed to provide simpler access to CUDA functions, without ref keywords or an overly low-level API.
Most of the functionality supported by CUDADriver is available through this object, although some functions did not find their way in; they will be added in future releases if necessary.
Let's consider the case of memory allocation. We can allocate memory simply through:

CUdeviceptr ptr = cuda.Allocate(128);
This fragment of code simply allocates 128 bytes of device memory and returns the appropriate pointer.
It should be noted at this point that all functions can still operate with low-level driver objects, to allow interoperability with the CUDADriver object.
Allocating memory for a .NET array can be done like this:

UInt3[] data = new UInt3[128];
CUdeviceptr ptr = cuda.Allocate<UInt3>(data);
The fragment above allocates enough memory for 128 elements of UInt3 vector type, for a total of 1536 bytes.
Using generic code and some explicit reflection, the functions compute the amount of memory to allocate, so there is no need to provide such details - only the array to allocate memory for.
To ease programming, some further functions were provided that allocate memory and copy data to device memory in a single call:

UInt3[] data = new UInt3[256];
CUdeviceptr ptr = cuda.CopyHostToDevice<UInt3>(data);
This code fragment allocates device memory for 256 elements of the UInt3 vector type (a total of 3072 bytes) and copies the array to device memory. This mechanism can of course be used with other array types and the primitives supported by CUDA.NET.
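A matching call in the opposite direction completes the round trip; the CopyDeviceToHost name and signature below are assumptions, symmetric to CopyHostToDevice:

```csharp
// Copy the results back into a host array (method name and
// signature assumed symmetric to CopyHostToDevice).
UInt3[] results = cuda.CopyDeviceToHost<UInt3>(ptr, data.Length);

// Release the device memory afterwards.
cuda.Free(ptr);
```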
Working with CUFFT
The CUFFT object supports the older functions with nicer usage, but also allows performing most FFT operations in a single call.
Creating a 1D plan can be done by:

CUFFT cufft = new CUFFT(new CUDA(true));
cufftHandle plan = cufft.Plan1D(nx, type, batch);
It is also possible to run any of the 1D FFT routines with a single call:

cufftReal[] realData = new cufftReal[256];
cufftComplex[] cmlxData = new cufftComplex[256];
cufft.Execute1D(realData, cmlxData, nx, batch);
The function handles memory management by itself and executes the appropriate FFT based on the provided parameters.
The same holds for all other types of FFT.
Working with CUBLAS
The CUBLAS object was designed to provide better usability when working with vector and matrix memory, while all other operations remain accessible from the CUBLASDriver object.
In future versions, all supported functions may enter CUBLAS with simpler signatures as well.
An example of initializing a vector:

CUDA cuda = new CUDA(true);
CUBLAS blas = new CUBLAS(cuda);
blas.Init();
float[] data = new float[] { 0.0f, 1.5f, 2.5f, 5.224f }; CUdeviceptr vector = blas.Allocate<float>(data); blas.SetVector<float>(data, vector);
blas.Free(vector); blas.Shutdown();
The example above demonstrates how to create a vector in device memory and copy data to be used by one of CUBLAS routines.