Assembly generated from NEON intrinsics [LLVM

2018-06-21 06:37:31

I'm trying to learn more about ARM assembly and understand what exactly is happening behind the scenes with NEON intrinsics. I'm using the latest Xcode LLVM compiler. I find that often, the assembly produced from intrinsics is actually slower than even plain naive C code.

For example this code:

void ArmTest::runTest()
{

    const float vector[4] = {1,2,3,4};
    float result[4];
    float32x4_t vA = vld1q_f32(vector);

    asm("#Begin Test");

    vA = vmulq_f32(vA, vA);

    asm("#End Test");

    vst1q_f32(result, vA);
}

Produces this output:

#Begin Test

ldr q0, [sp, #16]
stp q0, q0, [fp, #-48]
ldur    q1, [fp, #-32]
fmul.4s v0, v1, v0
str q0, [sp, #16]

#End Test

What I fail to understand is why all the loads/stores hitting the memory here? I must be missing something obvious, right? Also, how would one write this in inline assembly so that it is optimal? I would expect just a single instruction, but the output is way different.

Please help me understand.

Thanks!

You need a better test. Your test doesn't use the results of your calculation for anything so the compiler is just going through the motions to make you happy. It looks like you're compiling with -O0 which will produce a bunch of unnecessary loads and stores for debugging purposes. If you compiled with -O3, all of your code would be stripped out. I rewrote your test to preserve the results and compiled with -O3 and here are the results:

$ cat neon.c 
#include <arm_neon.h>

void runTest(const float vector[], float result[])
{
    float32x4_t vA = vld1q_f32(vector);
     vA = vmulq_f32(vA, vA);
     vst1q_f32(result, vA);
}

$ xcrun -sdk iphoneos clang -arch arm64 -S neon.c -O3
$ cat neon.s 
    .section    __TEXT,__text,regular,pure_instructions
    .globl  _runTest
    .align  2
_runTest:                               ; @runTest
; BB#0:
    ldr q0, [x0]
    fmul.4s v0, v0, v0
    str q0, [x1]
    ret lr

This code looks optimal

链接地址: http://www.djcxy.com/p/59752.html

上一篇: 似乎无法将％ES添加到clobberlist（内联汇编，GCC）

下一篇: 由NEON内在函数生成的程序集[LLVM