Multiplication using SSE (x*x*x)

I'm trying to optimize a cube function using SSE:

long cube(long n)
{
    return n*n*n;
}

I have tried this:

return (long) _mm_mul_su32(_mm_mul_su32((__m64)n,(__m64)n),(__m64)n);

The performance was even worse (and yes, I have never done anything with SSE before).

Is there an SSE function that could improve the performance? Or something else?

Output from cat /proc/cpuinfo:


processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Xeon(R) CPU            3070  @ 2.66GHz
stepping    : 6
cpu MHz     : 2660.074
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips    : 5320.14
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 15
model name  : Intel(R) Xeon(R) CPU            3070  @ 2.66GHz
stepping    : 6
cpu MHz     : 2660.074
cache size  : 4096 KB
physical id : 0
siblings    : 2
core id     : 1
cpu cores   : 2
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 10
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips    : 5320.35
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:


I think you have misunderstood when SSE is useful. I have only used SSE with floating-point types, so my experience may not apply directly to this case, but I hope you can still take something from what I have written.

SSE provides SIMD, Single Instruction Multiple Data. It is useful when you have many values on which you want to perform the same calculation. It is a kind of small-scale parallelization: instead of doing one multiplication, you can do four at the same time. But it is only useful if all of the inputs are already available and independent of each other.

So in your case, there is no room for parallelization within a single call. What you could do is write a function that computes the cubes of four floats at once; that would be faster than calling a function that cubes one number four times. A small sketch of that idea follows below.
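For illustration only, here is a minimal sketch of cubing four floats in one go (the function name cube4 and the use of unaligned loads are my own assumptions, not something from the question):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Cube four floats at once: each SSE multiply does four lane-wise products. */
static void cube4(const float *in, float *out)
{
    __m128 x  = _mm_loadu_ps(in);                  /* load 4 floats (unaligned) */
    __m128 x3 = _mm_mul_ps(_mm_mul_ps(x, x), x);   /* x*x*x in each lane */
    _mm_storeu_ps(out, x3);                        /* store 4 results */
}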


Your code compiles to:

cube:
        movl    4(%esp), %edx
        movl    %edx, %eax
        imull   %edx, %eax
        imull   %edx, %eax
        ret

If inlined, the ret and the moves get optimized away, so you are left with two imul instructions. I doubt MMX or SSE could make this any faster (transferring the data into the MMX/SSE registers alone would probably cost more than the two imuls).


For one thing, you have to align your variables on 16 bytes. Also, in my own experience tinkering with SSE, you will only get significant gains if you compute your function on a whole batch of values... say

void cube(long* inArray, long* outArray, size_t size) {
  ...
}
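A fleshed-out sketch of that batch idea might look like the following. Note the assumptions on my part: it operates on floats rather than longs (SSE2/SSSE3 has no packed 64-bit integer multiply), and it assumes both arrays are 16-byte aligned so the aligned load/store intrinsics are safe.

#include <stddef.h>
#include <xmmintrin.h>

/* Cube a whole batch of values; assumes inArray and outArray are 16-byte aligned. */
void cube_batch(const float *inArray, float *outArray, size_t size)
{
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128 x  = _mm_load_ps(inArray + i);          /* 4 values per iteration */
        __m128 x3 = _mm_mul_ps(_mm_mul_ps(x, x), x);   /* x*x*x, four lanes at once */
        _mm_store_ps(outArray + i, x3);
    }
    for (; i < size; ++i)                              /* scalar tail for the leftovers */
        outArray[i] = inArray[i] * inArray[i] * inArray[i];
}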