Cuda Image average filter

Average filter is windowed filter of linear class, that smooths signal (image). The filter works as low-pass one. The basic idea behind filter is for any element of the signal (image) take an average across its neighborhood.


If we've an mxn matrix and we want to apply average filter with size k on it,then for each point in the matrix p:(i,j) the value of the point would be the average of all points in the square

This figure is for Square kernel of filtering with size 2 , that the yellow box is the pixel to be averaged, and all the grid is the square of neighbor pixels, that the pixel's new value will be the average of them.

The problem is that this algorithm is very slow, specially on large images, so I thought about using GPGPU .

The question now is , How Can this be implemented in cuda, if it's possible ?


This is a classic case of embarrassingly parallel image processing problem that can be very easily mapped to CUDA framework. The averaging filter is knows as Box Filter in image processing domains.

The easiest approach would be to use CUDA textures for the filtering process as the boundary conditions can be handled very easily by textures.

Assuming you have source and destination pointers allocated on the host. The procedure would be something like this.

  • Allocate large enough memory to hold the source and destination images on device.
  • Copy source image from host to device.
  • Bind the source image device pointer to texture.
  • Specify an appropriate block size and a grid large enough to cover every pixel of the image.
  • Launch the filtering kernel using the specified grid and block size.
  • Copy the results back to host.
  • Unbind the texture
  • Free device pointers.
  • Sample Implementation Of Box Filter

    Kernel

    texture<unsigned char, cudaTextureType2D> tex8u;
    
    //Box Filter Kernel For Gray scale image with 8bit depth
    __global__ void box_filter_kernel_8u_c1(unsigned char* output,const int width, const int height, const size_t pitch, const int fWidth, const int fHeight)
    {
        int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
        int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    
        const int filter_offset_x = fWidth/2;
        const int filter_offset_y = fHeight/2;
    
        float output_value = 0.0f;
    
        //Make sure the current thread is inside the image bounds
        if(xIndex<width && yIndex<height)
        {
            //Sum the window pixels
            for(int i= -filter_offset_x; i<=filter_offset_x; i++)
            {
                for(int j=-filter_offset_y; j<=filter_offset_y; j++)
                {
                    //No need to worry about Out-Of-Range access. tex2D automatically handles it.
                    output_value += tex2D(tex8u,xIndex + i,yIndex + j);
                }
            }
    
            //Average the output value
            output_value /= (fWidth * fHeight);
    
            //Write the averaged value to the output.
            //Transform 2D index to 1D index, because image is actually in linear memory
            int index = yIndex * pitch + xIndex;
    
            output[index] = static_cast<unsigned char>(output_value);
        }
    }
    

    Wrapper Function:

    void box_filter_8u_c1(unsigned char* CPUinput, unsigned char* CPUoutput, const int width, const int height, const int widthStep, const int filterWidth, const int filterHeight)
    {
    
        /*
         * 2D memory is allocated as strided linear memory on GPU.
         * The terminologies "Pitch", "WidthStep", and "Stride" are exactly the same thing.
         * It is the size of a row in bytes.
         * It is not necessary that width = widthStep.
         * Total bytes occupied by the image = widthStep x height.
         */
    
        //Declare GPU pointer
        unsigned char *GPU_input, *GPU_output;
    
        //Allocate 2D memory on GPU. Also known as Pitch Linear Memory
        size_t gpu_image_pitch = 0;
        cudaMallocPitch<unsigned char>(&GPU_input,&gpu_image_pitch,width,height);
        cudaMallocPitch<unsigned char>(&GPU_output,&gpu_image_pitch,width,height);
    
        //Copy data from host to device.
        cudaMemcpy2D(GPU_input,gpu_image_pitch,CPUinput,widthStep,width,height,cudaMemcpyHostToDevice);
    
        //Bind the image to the texture. Now the kernel will read the input image through the texture cache.
        //Use tex2D function to read the image
        cudaBindTexture2D(NULL,tex8u,GPU_input,width,height,gpu_image_pitch);
    
        /*
         * Set the behavior of tex2D for out-of-range image reads.
         * cudaAddressModeBorder = Read Zero
         * cudaAddressModeClamp  = Read the nearest border pixel
         * We can skip this step. The default mode is Clamp.
         */
        tex8u.addressMode[0] = tex8u.addressMode[1] = cudaAddressModeBorder;
    
        /*
         * Specify a block size. 256 threads per block are sufficient.
         * It can be increased, but keep in mind the limitations of the GPU.
         * Older GPUs allow maximum 512 threads per block.
         * Current GPUs allow maximum 1024 threads per block
         */
    
        dim3 block_size(16,16);
    
        /*
         * Specify the grid size for the GPU.
         * Make it generalized, so that the size of grid changes according to the input image size
         */
    
        dim3 grid_size;
        grid_size.x = (width + block_size.x - 1)/block_size.x;  /*< Greater than or equal to image width */
        grid_size.y = (height + block_size.y - 1)/block_size.y; /*< Greater than or equal to image height */
    
        //Launch the kernel
        box_filter_kernel_8u_c1<<<grid_size,block_size>>>(GPU_output,width,height,gpu_image_pitch,filterWidth,filterHeight);
    
        //Copy the results back to CPU
        cudaMemcpy2D(CPUoutput,widthStep,GPU_output,gpu_image_pitch,width,height,cudaMemcpyDeviceToHost);
    
        //Release the texture
        cudaUnbindTexture(tex8u);
    
        //Free GPU memory
        cudaFree(GPU_input);
        cudaFree(GPU_output);
    }
    

    The good news is that you don't have to implement the filter yourself. The CUDA Toolkit comes with free signal and image processing library named NVIDIA Performance Primitives aka NPP, made by NVIDIA. NPP utilizes CUDA enabled GPUs to accelerate processing. The averaging filter is already implemented in NPP. The current version of NPP (5.0) has support for 8 bit, 1 channel and 4 channel images. The functions are:

  • nppiFilterBox_8u_C1R for 1 channel image.
  • nppiFilterBox_8u_C4R for 4 channel image.

  • Some Basic thoughts/steps:

  • Copy the Image data from the CPU to the GPU
  • Call a kernel to build the average for each line (horizontal) and store it in shared memory.
  • Call a kernel to build the average for each column (vertical) and store it in global memory.
  • Copy the data back to the CPU-Memory.
  • You should be able to scale this pretty easily with 2D-memory and multidimensional kernel-calls.


    If the filter's size is normal and not humongous, the average filter is a very good case for implementing with CUDA. You can set this up using square blocks and every thread of the block is responsible for the calculation of the value of one pixel, by summing and averaging its neighbors.

    If you store the image in Global Memory then it can be programmed easily but you will have a lot of bank conflicts. One possible optimization is that you load blocks of the image into the block's Shared Memory. Using phantom elements (so that you won't exceed the shared block's dimensions when looking for neighboring pixels) you can calculate the average of the pixels within a block.

    The only think that you have to be careful of is how the "stitching" will be done in the end, because the shared memory blocks will overlap (because of the extra "padding" pixels) and you don't want to calculate their values twice.

    链接地址: http://www.djcxy.com/p/25394.html

    上一篇: 带有cuda的盒式过滤器c

    下一篇: Cuda图像平均过滤器