Pi19404
January 28, 2013
OpenCL Parallel Programming for Image Convolution
Contents
0.1 Abstract
0.2 2D Convolution
0.3 Naive 2D convolution
0.4 Optimization method 1 2D convolution
0.4.1 Using Local Memory
0.4.2 Using Ternary Conditional Operator
0.4.3 Unrolling For Loops
0.4.4 Read Only Memory and Constant Variables
0.4.5 Performance Comparison
0.5 Code
References
0.2 2D Convolution
2D convolution is a neighborhood operation: the value of each pixel in the output matrix is a weighted linear sum of pixels in the input matrix. The weights used in the summation are defined by the convolution kernel. 2D convolution can be viewed as the output of a discrete-time LTI system whose impulse response is the convolution kernel. The value of a pixel at the system output is the linear sum of the pixels in the neighborhood of the corresponding input pixel, weighted by the convolution kernel. The convolution kernel determines the neighborhood size and the weight map, and different weight maps correspond to different filtering operations. A box kernel or a Gaussian kernel defines a low-pass LTI system, while a Sobel kernel defines a high-pass LTI system.
\[
O[i, j] = \sum_{k=-R}^{R} \sum_{l=-R}^{R} P[i + k,\, j + l]\, K[k, l] \tag{1}
\]
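To make the low-pass versus high-pass distinction concrete, the sketch below lists three common 3x3 kernels (R = 1) and sums their weights. The specific coefficients and the names `BOX_3X3`, `GAUSS_3X3`, `SOBEL_X_3X3` and `kernel_sum` are illustrative assumptions, not values taken from the author's code.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative 3x3 kernels (R = 1), stored row-major. Assumed coefficients. */
static const float BOX_3X3[9] = {
    1.0f / 9, 1.0f / 9, 1.0f / 9,
    1.0f / 9, 1.0f / 9, 1.0f / 9,
    1.0f / 9, 1.0f / 9, 1.0f / 9,
};

static const float GAUSS_3X3[9] = {
    1.0f / 16, 2.0f / 16, 1.0f / 16,
    2.0f / 16, 4.0f / 16, 2.0f / 16,
    1.0f / 16, 2.0f / 16, 1.0f / 16,
};

static const float SOBEL_X_3X3[9] = {
    -1.0f, 0.0f, 1.0f,
    -2.0f, 0.0f, 2.0f,
    -1.0f, 0.0f, 1.0f,
};

/* Sum of all weights: the filter's response to a constant image. */
static float kernel_sum(const float *k, size_t n) {
    assert(k != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        s += k[i];
    }
    return s;
}
```

The box and Gaussian weights sum to 1, so a flat region passes through unchanged, while the Sobel weights sum to 0, so a flat region is suppressed and only intensity changes remain.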
Pixels at the image borders require pixel indices that lie outside the image, so the values of those pixels must be extrapolated. Several methods can be used. One simple method is to set the outside pixels to zero or to a constant value; another is to replicate the border pixels beyond the image border. In the present approach the outside pixel values are set to 0.
0.3 Naive 2D convolution
The data for the input image, the output image and the kernel are stored in device global memory, so each thread reads its data from global memory. The same pixels in global memory are therefore accessed multiple times by different threads. Each thread evaluates the sum in (1) to compute the value of the output pixel O[i, j], expressed in terms of the work-group index, the local thread ID and the global thread ID. The naive parallel version is compared with the host CPU version of the code.
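The per-thread work of the naive version can be sketched in plain C as follows. This is a host-side stand-in under stated assumptions, not the author's kernel: the function names, the row-major layout and the kernel being stored with K[0, 0] at its centre are all hypothetical choices, and the double loop in `convolve_naive` takes the place of launching one thread per output pixel.

```c
#include <assert.h>
#include <stddef.h>

/* Input pixel at (row i, col j); pixels outside the image read as 0. */
static float pixel_or_zero(const float *img, int height, int width,
                           int i, int j) {
    if (i < 0 || i >= height || j < 0 || j >= width) {
        return 0.0f;
    }
    return img[(size_t)i * width + j];
}

/* Work of one thread: output pixel O[i, j] from equation (1).
   kernel holds (2R+1) x (2R+1) weights, row-major, K[0, 0] at the centre. */
static float convolve_pixel(const float *img, int height, int width,
                            const float *kernel, int radius, int i, int j) {
    assert(img != NULL && kernel != NULL && radius >= 0);
    const int ksize = 2 * radius + 1;
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        for (int l = -radius; l <= radius; ++l) {
            sum += pixel_or_zero(img, height, width, i + k, j + l) *
                   kernel[(k + radius) * ksize + (l + radius)];
        }
    }
    return sum;
}

/* Stand-in for the kernel launch: one "thread" per output pixel. */
static void convolve_naive(const float *img, float *out, int height,
                           int width, const float *kernel, int radius) {
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            out[(size_t)i * width + j] =
                convolve_pixel(img, height, width, kernel, radius, i, j);
        }
    }
}
```

Every call to `pixel_or_zero` corresponds to a global-memory read on the device, which is why neighbouring threads end up fetching the same pixels repeatedly.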
0.4 Optimization method 1 2D convolution
0.4.1 Using Local Memory
Once the data is loaded into local memory, all the threads in the thread block perform the convolution on the sub-image held in local memory.
Performance thus improves for two reasons: the number of loads from global memory is reduced, and the convolution itself reads from local memory, which is faster to access than global memory.
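The two phases described above can be simulated on the host as in the following sketch. It assumes a hypothetical work group of `TILE` x `TILE` threads, a kernel radius `RADIUS`, zero extrapolation at the borders and a row-major kernel; the array `tile` stands in for device local memory, and the function returns the number of global reads so the saving can be observed.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 4   /* hypothetical work-group size: TILE x TILE threads */
#define RADIUS 1 /* kernel radius R */
#define PADDED (TILE + 2 * RADIUS)

/* Simulates one work group producing the TILE x TILE output block whose
   top-left pixel is (row0, col0). Returns the number of global reads. */
static int convolve_tile(const float *img, float *out, int height, int width,
                         const float *kernel, int row0, int col0) {
    assert(img != NULL && out != NULL && kernel != NULL);
    float tile[PADDED][PADDED]; /* stands in for local memory */
    int global_reads = 0;

    /* Phase 1: load the tile plus its halo once; outside pixels become 0. */
    for (int y = 0; y < PADDED; ++y) {
        for (int x = 0; x < PADDED; ++x) {
            const int i = row0 + y - RADIUS;
            const int j = col0 + x - RADIUS;
            const int inside = (i >= 0 && i < height && j >= 0 && j < width);
            tile[y][x] = inside ? img[(size_t)i * width + j] : 0.0f;
            global_reads += inside;
        }
    }
    /* On a device, a local-memory barrier would separate the two phases. */

    /* Phase 2: each thread convolves using only the local tile. */
    for (int y = 0; y < TILE; ++y) {
        for (int x = 0; x < TILE; ++x) {
            const int i = row0 + y;
            const int j = col0 + x;
            if (i >= height || j >= width) {
                continue;
            }
            float sum = 0.0f;
            for (int k = 0; k <= 2 * RADIUS; ++k) {
                for (int l = 0; l <= 2 * RADIUS; ++l) {
                    sum += tile[y + k][x + l] *
                           kernel[k * (2 * RADIUS + 1) + l];
                }
            }
            out[(size_t)i * width + j] = sum;
        }
    }
    return global_reads;
}
```

For a T x T work group and radius R, the naive scheme issues up to T^2 (2R+1)^2 global reads while the tiled scheme issues at most (T+2R)^2; with T = 16 and R = 1 that is 2304 versus 324.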
0.4.2 Using Ternary Conditional Operator
Performance can also be improved by replacing an if-else block with the ternary conditional operator. An if-else block takes more than two instructions, whereas on some devices the ternary conditional operation executes in a single instruction cycle.
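As an illustration, the border test from the convolution can be written either way. The two hypothetical functions below return the same value; whether the ternary form actually compiles to fewer instructions depends on the device compiler, so this is a sketch of the rewrite rather than a guaranteed speed-up.

```c
#include <assert.h>
#include <stddef.h>

/* Border fetch written as an if-else block. */
static float fetch_if_else(const float *img, int height, int width,
                           int i, int j) {
    float value;
    if (i >= 0 && i < height && j >= 0 && j < width) {
        value = img[(size_t)i * width + j];
    } else {
        value = 0.0f;
    }
    return value;
}

/* Same fetch using the ternary conditional operator. Only the selected
   operand is evaluated, so no out-of-bounds read occurs. */
static float fetch_ternary(const float *img, int height, int width,
                           int i, int j) {
    assert(img != NULL);
    const int inside = (i >= 0 && i < height && j >= 0 && j < width);
    return inside ? img[(size_t)i * width + j] : 0.0f;
}
```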
0.4.3 Unrolling For Loops
For loops carry overhead. If the loop bounds are known at compile time, the loops can be unrolled. Some compilers unroll loops automatically when given a hint. If the loop bound is not known, the loop can still be partially unrolled. To take full advantage of unrolling, the parameters used in the for loops can be passed as preprocessor defines at compile time rather than as kernel arguments. However, not all devices support unrolling, in which case the for loop must be replaced manually with the equivalent sequence of statements; in some cases this gives a slight improvement over the for loop.
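A minimal sketch of both options, assuming a 3x3 kernel: `RADIUS` is fixed by a preprocessor define (for a device build it could be supplied as a `-D RADIUS=1` style build option), so the loop trip count is a compile-time constant that a compiler may unroll, and the second function shows the manual substitution. The function names and the flattened `window` layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#ifndef RADIUS
#define RADIUS 1 /* fixed at compile time instead of passed as an argument */
#endif
#define KSIZE (2 * RADIUS + 1)

/* Loop form: the trip count KSIZE * KSIZE is a compile-time constant,
   so the compiler is free to unroll it. window and k hold the flattened
   (2R+1) x (2R+1) neighbourhood and kernel weights. */
static float weighted_sum_loop(const float *window, const float *k) {
    assert(window != NULL && k != NULL);
    float sum = 0.0f;
    for (int n = 0; n < KSIZE * KSIZE; ++n) {
        sum += window[n] * k[n];
    }
    return sum;
}

/* Manual unrolling for RADIUS == 1: the loop is replaced by nine terms. */
static float weighted_sum_unrolled_3x3(const float *w, const float *k) {
    assert(w != NULL && k != NULL);
    return w[0] * k[0] + w[1] * k[1] + w[2] * k[2] +
           w[3] * k[3] + w[4] * k[4] + w[5] * k[5] +
           w[6] * k[6] + w[7] * k[7] + w[8] * k[8];
}
```

The manual form only matches the loop while `RADIUS` stays at 1, which is the maintenance cost of unrolling by hand.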
0.4.4 Read Only Memory and Constant Variables
Read-only memory is faster to access than read-write memory on some devices, so memory that never needs to be written is labelled as read-only. Likewise, variables that do not change during the execution of the code are declared as const. These changes may provide an improvement on some devices.
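The same idea can be shown in plain C with the `const` and `restrict` qualifiers. The sketch below is a hypothetical one-row filter, not the author's kernel; in a device kernel the roughly corresponding choices would be a const-qualified global pointer for the input and constant memory for the weights, which is stated here as an assumption about how the labelling might be done.

```c
#include <assert.h>
#include <stddef.h>

/* in and weights are only read, so they are const-qualified; restrict
   promises they do not alias out. width and radius never change during
   the call, so they are const as well. Outside pixels read as 0. */
static void filter_row(const float *restrict in,
                       const float *restrict weights,
                       float *restrict out,
                       const int width, const int radius) {
    assert(in != NULL && weights != NULL && out != NULL && radius >= 0);
    for (int j = 0; j < width; ++j) {
        float sum = 0.0f;
        for (int l = -radius; l <= radius; ++l) {
            const int c = j + l;
            sum += (c >= 0 && c < width) ? in[c] * weights[l + radius] : 0.0f;
        }
        out[j] = sum;
    }
}
```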
0.5 Code
The code consists of two parts: the host code and the device code. The host-side code uses OpenCV APIs to read images from a video file and demonstrates calling the kernel code for the box filter, Gaussian filter and Sobel filter with the naive parallel version, the optimized parallel version and the host CPU version. The code is available in the repository https://code.google.com/p/m19404/source/browse/OpenCL-Image-Processing/Convolution/
Bibliography
http://www.evl.uic.edu/kreda/gpu/image-convolution/