Multiplication using SSE (x*x*x)
I'm trying to optimize a cube function using SSE:
long cube(long n)
{
return n*n*n;
}
I have tried this:
return (long) _mm_mul_su32(_mm_mul_su32((__m64)n,(__m64)n),(__m64)n);
The performance was even worse (and yes, I have never done anything with SSE before).
Is there an SSE function that could increase the performance? Or something else?
Output from cat /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 3070 @ 2.66GHz
stepping : 6
cpu MHz : 2660.074
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 5320.14
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Xeon(R) CPU 3070 @ 2.66GHz
stepping : 6
cpu MHz : 2660.074
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 5320.35
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
I think you have misunderstood when SSE is useful. I have only used SSE with floating-point types, so my experience may not apply directly to this case, but I hope you can still learn something from what I have written.
SSE provides SIMD (Single Instruction, Multiple Data). It is useful when you have many values on which you want to perform the same calculation; it is a kind of small-scale parallelization. Instead of doing one multiplication, you can do four at the same time. But it only helps when the calculations are independent of each other, so that all the inputs are available up front.
In your case there is no room for parallelization: the second multiplication needs the result of the first. You could, however, write a function that calculates the cube of four floats at once, and that would be faster than calling a function that calculates the cube of one number four times.
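The four-floats-at-a-time idea could be sketched roughly as follows. This is only an illustration under assumptions: the name cube4_ps is made up, the inputs are floats rather than the long from the question, and unaligned loads and stores are used so the caller does not have to worry about alignment.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */

/* Cube four floats in one go: out[i] = in[i]^3 for i = 0..3.
   cube4_ps is a hypothetical helper name, not part of any library. */
void cube4_ps(const float *in, float *out)
{
    __m128 x  = _mm_loadu_ps(in);     /* load 4 floats (no alignment required) */
    __m128 x2 = _mm_mul_ps(x, x);     /* 4 squares with one instruction */
    __m128 x3 = _mm_mul_ps(x2, x);    /* 4 cubes with one more */
    _mm_storeu_ps(out, x3);           /* store 4 results */
}
```

Two multiply instructions here produce four cubes, whereas the scalar version needs two multiplies per value. Whether that turns into a real speedup still depends on how the data gets into and out of the SSE registers.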
Your code compiles to:
cube:
movl 4(%esp), %edx
movl %edx, %eax
imull %edx, %eax
imull %edx, %eax
ret
If inlined, the ret and the moves will be optimized out, leaving just two imul instructions. I doubt MMX or SSE could make this any faster (transferring the data into the MMX/SSE registers alone would probably be slower than the two imuls).
For one, you have to align your variables on 16 bytes (at least if you use the aligned load and store instructions). Also, in my own experience tinkering with SSE, you get significant gains only if you compute your function on a whole batch of values, say:
void cube(long* inArray, long* outArray, size_t size) {
...
}
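One way such a batch function might look is sketched below. It rests on several assumptions: the elements are 32-bit integers (matching the 32-bit imull in the compiled code above, hence int32_t instead of long), overflow wraps around, and only SSE2 is used because the CPU flags in the question list ssse3 but not SSE4.1, so _mm_mullo_epi32 is not available. The names cube_batch and mullo32_sse2 are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Low 32 bits of each 32x32 product, built from SSE2 pieces
   (hypothetical helper; SSE4.1 would offer _mm_mullo_epi32 instead). */
static __m128i mullo32_sse2(__m128i a, __m128i b)
{
    /* _mm_mul_epu32 multiplies lanes 0 and 2, giving two 64-bit products */
    __m128i even = _mm_mul_epu32(a, b);
    /* shift lanes 1 and 3 down into positions 0 and 2, multiply those too */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* gather the low halves of the products and interleave them back */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}

/* out[i] = in[i]^3 (wrapping on overflow) for i = 0..size-1 */
void cube_batch(const int32_t *in, int32_t *out, size_t size)
{
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {            /* four values per iteration */
        __m128i x  = _mm_loadu_si128((const __m128i *)(in + i));
        __m128i x3 = mullo32_sse2(mullo32_sse2(x, x), x);
        _mm_storeu_si128((__m128i *)(out + i), x3);
    }
    for (; i < size; ++i) {                    /* scalar tail for leftovers */
        uint32_t v = (uint32_t)in[i];          /* unsigned so overflow wraps */
        out[i] = (int32_t)(v * v * v);
    }
}
```

The unaligned load and store intrinsics are used so the sketch works on any pointer; with 16-byte aligned buffers, _mm_load_si128 and _mm_store_si128 could be used instead. Because the emulated multiply costs several instructions, it is worth measuring against the plain scalar loop before assuming this version is faster.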