Assembly generated from NEON intrinsics [LLVM
I'm trying to learn more about ARM assembly and understand what exactly is happening behind the scenes with NEON intrinsics. I'm using the latest Xcode LLVM compiler. I find that often, the assembly produced from intrinsics is actually slower than even plain naive C code.
For example this code:
void ArmTest::runTest()
{
const float vector[4] = {1,2,3,4};
float result[4];
float32x4_t vA = vld1q_f32(vector);
asm("#Begin Test");
vA = vmulq_f32(vA, vA);
asm("#End Test");
vst1q_f32(result, vA);
}
Produces this output:
#Begin Test
ldr q0, [sp, #16]
stp q0, q0, [fp, #-48]
ldur q1, [fp, #-32]
fmul.4s v0, v1, v0
str q0, [sp, #16]
#End Test
What I fail to understand is why all the loads/stores hitting the memory here? I must be missing something obvious, right? Also, how would one write this in inline assembly so that it is optimal? I would expect just a single instruction, but the output is way different.
Please help me understand.
Thanks!
You need a better test. Your test doesn't use the results of your calculation for anything so the compiler is just going through the motions to make you happy. It looks like you're compiling with -O0 which will produce a bunch of unnecessary loads and stores for debugging purposes. If you compiled with -O3, all of your code would be stripped out. I rewrote your test to preserve the results and compiled with -O3 and here are the results:
$ cat neon.c
#include <arm_neon.h>
void runTest(const float vector[], float result[])
{
float32x4_t vA = vld1q_f32(vector);
vA = vmulq_f32(vA, vA);
vst1q_f32(result, vA);
}
$ xcrun -sdk iphoneos clang -arch arm64 -S neon.c -O3
$ cat neon.s
.section __TEXT,__text,regular,pure_instructions
.globl _runTest
.align 2
_runTest: ; @runTest
; BB#0:
ldr q0, [x0]
fmul.4s v0, v0, v0
str q0, [x1]
ret lr
This code looks optimal
链接地址: http://www.djcxy.com/p/59752.html上一篇: 似乎无法将%ES添加到clobberlist(内联汇编,GCC)
下一篇: 由NEON内在函数生成的程序集[LLVM