在内核OpenCL中实现FIFO的最佳方法

2018-06-23 04:25:21

目标：在OpenCL中实现下面的图表。 OpenCl内核所需要做的主要事情是将系数数组和temp数组相乘，然后将所有这些值累加到最后。（这可能是最花时间的操作，并行性在这里真的很有用）。

我正在使用内核的辅助函数来进行乘法和加法运算（我希望这个函数也是平行的）。

图片描述：

一次一个 ，将值传递到与系数数组大小相同的数组（temp数组）。现在每次将单个值传递到此数组中时，temp数组将与系数数组并行相乘，然后将每个索引的值连接成一个单一元素。这将继续，直到输入数组到达它的最后一个元素。

在这里输入图像描述

我的代码会发生什么？

从输入的60个元素，它需要超过8000毫秒！而且我总共有120万份投入仍然需要通过。我知道有一个事实，有一种更好的解决方案来做我想做的事情。以下是我的代码如下。

以下是我知道的一些事情，他肯定是错误的。当我尝试将系数值与temp数组相乘时，它会崩溃。这是因为global_id。我希望这条线所做的只是并行地乘以两个数组。

我试图弄清楚为什么要花费这么长的时间来完成FIFO功能，所以我开始评论线路。我首先评论除FIFO函数的第一个循环以外的所有内容。结果，这花了50毫秒。然后当我取消注释下一个循环时，它跳到了8000ms。所以延迟将与数据传输有关。

是否有可以在OpenCl中使用的寄存器移位？也许使用整数数组的一些逻辑移位方法？（我知道有一个'>>'操作符）。

float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {

//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array thats the same size of coefficients and accumilate
//store into one output value of the output array
//repeat till input array has reached the end

int globalId = get_global_id(0); 

float output = 0.0f;

//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
    tempArrayForShift[i] = temp[i];
}

//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
    temp[i] = tempArrayForShift[i];
}

//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

//Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
for (int i = 0; i <  58; i ++) {
//  output = multipliedResult[i] + output;
}

//Returned summed value of temp array
return output;
}


__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) { 

//Initialize the temporary array values to 0
for (int i = 0; i <  58; i ++) {
    temp[i] = 0;
    tempArrayForShift[i] = 0;
    multipliedResult[i] = 0;
}

//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i <  60; i ++) {
    Output[i] = fifo(Array[i], coefficients, 58);
}

}

很长一段时间，我在OpenCl上遇到过这个问题。我不确定如何一起执行并行和顺序指令。

我正在考虑的另一种选择

在主要的cpp文件中，我正在考虑在那里实现fifo缓冲区，让内核执行乘法和加法操作。但这意味着我必须在循环中调用内核1000次以上。这会是更好的解决方案吗？或者它会完全没有效率。

为了从GPU中获得良好的性能，您需要将工作并行化到多个线程。在您的代码中，您只是使用单个线程，并且每个线程的GPU非常慢，但是如果许多线程同时运行，速度可能非常快。在这种情况下，您可以为每个输出值使用单个线程。实际上并不需要通过数组来移位值：对于每个输出值，都会考虑一个包含58个值的窗口，您可以从内存中获取这些值，将它们与系数相乘并将结果写回。

一个简单的实现是（使用与输出值一样多的线程来启动）：

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) 
{ 
    int globalId = get_global_id(0); 
    float sum=0.0f;
    for (int i=0; i< 58; i++)
    {
        float tmp=0;
        if (globalId+i > 56)
        {
            tmp=Array[i+globalId-57]*coefficient[57-i];
        }
        sum += tmp;
    }
    output[globalId]=sum;
}

这并不完美，因为它所产生的内存访问模式并不是GPU的最佳选择。缓存可能会有所帮助，但显然有很多优化的空间，因为这些值会重复使用多次。您尝试执行的操作称为卷积（1D）。 NVidia在其GPU计算SDK中有一个名为oclConvolutionSeparable的2D示例，它显示了一个优化版本。您可以使用他们的convolutionRows内核进行一维卷积。

这是另一个可以尝试的内核。有很多同步点（障碍），但这应该表现得相当好。 65个项目的工作组并不是非常优化的。

步骤：

初始化本地值为0

将系数复制到局部变量

遍历输出元素来计算：

移动现有元素（仅限工作项目> 0）

复制新元素（仅限工作项目0）

计算点积
5A。乘法 - 每个工作项目一个
5B。还原循环来计算总和

复制点产品输出（仅限WI 0）

最后的障碍

代码：

__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums){

    int globalId = get_global_id(0);
    int localId = get_local_id(0);  
    int localSize = get_local_size(0);  

    //1  init local values to 0
    localArray[localId] = 0.0f

    //2  copy coefficients to local
    //don't bother with this id __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp;
    for(int i = 0; i< outputSize; i++)
    {
        //3  shift elements (+barrier)
        if(localId > 0){
            tmp = localArray[localId -1]
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp

        //4  copy new element (work item 0 only, + barrier)
        if(localId == 0){
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5  compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);
        //5b reduction loop + barrier
        for(int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            if ((localId & mask) == 0) {
                localSums[local_index] += localSums[localId +j]
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if(localId == 0){
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop.
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }

}

那么更多的工作组呢？
这稍微简化为允许单个1x65工作组计算机的整个1.2M输出。要允许多个工作组，您可以使用/ get_num_groups（0）来计算每个组应该执行的工作量（workAmount），并调整i for循环：

for (i = workAmount * get_group_id(0); i< (workAmount * (get_group_id(0)+1) -1); i++)

步骤＃1也必须更改以初始化localArray的正确启动状态，而不是全部为0。

    //1  init local values
    if(groupId == 0){
        localArray[localId] = 0.0f
    }else{
        localArray[localSize - localId] = Array[workAmount - localId];
    }

这两项更改应允许您使用更多的工作组; 我建议在设备上计算单元数量的一些倍数。尽管如此，尽量保持每个组的工作量。充分利用这一点，有时在高级别上看起来最佳的内核在运行时会对内核造成不利影响。

优点
几乎在这个内核的每一点上，工作项目都有一定的作用。在步骤5b的减少循环期间，只有少于100％的项目正在工作。在这里阅读更多关于为什么这是一件好事。

缺点
这些障碍只会由于障碍的性质而放慢核心：暂停工作项目，直到其他人达到该点。也许有一种方法可以用较少的障碍来实现这一点，但我仍然觉得这是最佳的，因为你正试图解决这个问题。
每个组没有更多工作项目的空间，而65并不是一个非常理想的尺寸。理想情况下，您应该尝试使用2或2的倍数。但这并不是一个大问题，因为内核中存在很多障碍，使得它们都等待相当有规律。

链接地址: http://www.djcxy.com/p/65009.html

上一篇: Best approach to FIFO implementation in a kernel OpenCL

下一篇: Calculator C: inputting both operator signs and integers to perform calculations