Performance AVX/SSE assembly vs. intrinsics

I'm just trying to work out the best approach to optimizing some basic routines. In this case I tried a very simple example of multiplying two float vectors together:

void Mul(float *src1, float *src2, float *dst)
{
    // cnt is the element count, defined elsewhere
    for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
}

The plain C implementation is very slow. I wrote some external ASM using AVX and also tried intrinsics. These are the test results (time, smaller is better):

ASM: 0.110
IPP: 0.125
Intrinsics: 0.18
Plain C++: 4.0

(compiled using MSVC 2013 with SSE2; I also tried the Intel Compiler, and the results were pretty much the same)

As you can see, my ASM code beat even Intel Performance Primitives (probably because I added lots of branches to make sure I could use the aligned AVX instructions). But I'd personally like to use the intrinsics approach: it's simply easier to manage, and I assumed the compiler would do the best job of optimizing all the branching (my ASM code is weak in that respect, IMHO, yet it is faster). So here's the code using intrinsics:

    int i;

    // Advance with scalar code until dst is 32-byte aligned (or we run out of elements).
    for (i = 0; (MINTEGER)(dst + i) % 32 != 0 && i < cnt; i++) dst[i] = src1[i] * src2[i];

    if ((MINTEGER)(src1 + i) % 32 == 0)
    {
        if ((MINTEGER)(src2 + i) % 32 == 0)
        {
            // Both sources aligned: aligned loads for both.
            for (; i < cnt - 8; i += 8)
            {
                __m256 x = _mm256_load_ps(src1 + i);
                __m256 y = _mm256_load_ps(src2 + i);
                __m256 z = _mm256_mul_ps(x, y);
                _mm256_store_ps(dst + i, z);
            }
        }
        else
        {
            // Only src1 aligned.
            for (; i < cnt - 8; i += 8)
            {
                __m256 x = _mm256_load_ps(src1 + i);
                __m256 y = _mm256_loadu_ps(src2 + i);
                __m256 z = _mm256_mul_ps(x, y);
                _mm256_store_ps(dst + i, z);
            }
        }
    }
    else
    {
        // Neither source guaranteed aligned: unaligned loads for both.
        for (; i < cnt - 8; i += 8)
        {
            __m256 x = _mm256_loadu_ps(src1 + i);
            __m256 y = _mm256_loadu_ps(src2 + i);
            __m256 z = _mm256_mul_ps(x, y);
            _mm256_store_ps(dst + i, z);
        }
    }

    // Scalar tail for the remaining elements.
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];

The idea is simple: first advance element by element until dst is aligned to 32 bytes, then branch according to which sources are aligned.

One problem is that the C++ loops at the beginning and at the end don't use AVX unless I enable AVX in the compiler, which I do NOT want to do, because this is supposed to be just an AVX specialization and the software should still work on platforms where AVX is not available. And sadly there seem to be no intrinsics for instructions such as vmovss, so there's probably a penalty for mixing AVX code with the SSE code the compiler emits. However, even when I enabled AVX in the compiler, it still didn't get below 0.14.
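If the SSE/AVX transition penalty really is part of the problem (an assumption on my part, not something I've measured), one sketch would be to clear the upper YMM state explicitly before dropping back to scalar code; `_mm256_zeroupper()` from immintrin.h emits `vzeroupper`:

    // Sketch only (not measured): clear the upper halves of the YMM registers
    // after the last 256-bit operation, so the following non-VEX SSE code does
    // not pay a state-transition penalty on CPUs that have one.
    _mm256_zeroupper();

    // Scalar tail as before.
    for (; i < cnt; i++) dst[i] = src1[i] * src2[i];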

Any ideas how to optimize this so that the intrinsics reach the speed of the ASM code?


Your implementation with intrinsics is not the same function as your implementation in straight C: e.g., what if your function were called with the arguments Mul(p, p, p+1)? You'd get different results. The pure C version is slow because the compiler is ensuring that the code does exactly what you said.

If you want the compiler to make optimizations based on the assumption that the three arrays do not overlap, you need to make that explicit:

void Mul(float *src1, float *src2, float *__restrict__ dst)

or even better

void Mul(const float *src1, const float *src2, float *__restrict__ dst)

(I think it's enough to have __restrict__ just on the output pointer, although it wouldn't hurt to add it to the input pointers too)
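As a minimal sketch of what that would look like (assuming `cnt` is the same element count as in the question's code, and noting that MSVC spells the keyword `__restrict`, while `__restrict__` is the GCC/Clang spelling):

    void Mul(const float *src1, const float *src2, float * __restrict dst)
    {
        // With dst guaranteed not to alias src1/src2, the compiler is free to
        // auto-vectorize this loop for whatever ISA it is allowed to target.
        for (int i = 0; i < cnt; i++) dst[i] = src1[i] * src2[i];
    }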


On CPUs with AVX there is very little penalty for using misaligned loads. I would suggest trading that small penalty against all the extra logic you're using to check for alignment, and just having a single loop plus scalar code to handle any residual elements:

   // Process 8 floats per iteration with unaligned loads/stores.
   for (i = 0; i <= cnt - 8; i += 8)
   {
        __m256 x = _mm256_loadu_ps(src1 + i);
        __m256 y = _mm256_loadu_ps(src2 + i);
        __m256 z = _mm256_mul_ps(x, y);
        _mm256_storeu_ps(dst + i, z);
   }
   // Scalar code for the residual elements.
   for ( ; i < cnt; i++)
   {
       dst[i] = src1[i] * src2[i];
   }

Better still, make sure that your buffers are all 32-byte aligned in the first place and then just use aligned loads/stores.
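For example, a hedged sketch using `_mm_malloc`/`_mm_free` (declared alongside the SSE/AVX intrinsics); `AllocAligned32` and `n` are illustrative names, not anything from the code above:

    #include <immintrin.h>
    #include <stddef.h>

    // Hypothetical helper: allocate n floats on a 32-byte boundary so that
    // _mm256_load_ps / _mm256_store_ps can be used unconditionally.
    static float *AllocAligned32(size_t n)
    {
        return (float *)_mm_malloc(n * sizeof(float), 32);
    }

    // Buffers obtained this way must be released with _mm_free(), not free().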

Note that performing a single arithmetic operation in a loop like this is generally a poor fit for SIMD: execution time will be largely dominated by the loads and stores, so you should try to combine this multiplication with other SIMD operations to amortize the load/store cost.
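For instance, a sketch of folding an add into the same pass (with a hypothetical third input `src3`; the point is simply that the extra arithmetic is nearly free once the data has been loaded):

    // dst[i] = src1[i] * src2[i] + src3[i] in one pass over the data,
    // instead of a multiply pass followed by a separate add pass.
    for (i = 0; i <= cnt - 8; i += 8)
    {
        __m256 x = _mm256_loadu_ps(src1 + i);
        __m256 y = _mm256_loadu_ps(src2 + i);
        __m256 a = _mm256_loadu_ps(src3 + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(_mm256_mul_ps(x, y), a));
    }
    for ( ; i < cnt; i++)
        dst[i] = src1[i] * src2[i] + src3[i];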
