Faster 64bit/32bit division algorithm for ARM / NEON?

2018-06-24 21:58:21

I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result stored in 32 bits. Together these two divisions account for more than 20% of the total run time, so removing the 64-bit division should speed the code up considerably. NEON provides some 64-bit instructions. Can anyone suggest a faster routine that removes this bottleneck?

Alternatively, expressing the 64-bit/32-bit division in terms of 32-bit/32-bit divisions in C would also be fine.

If anyone has an idea, could you please help me out?

I have done a lot of fixed-point arithmetic in the past and researched fast 64/32-bit divisions myself. If you search for 'ARM division' you will find many good links and discussions on this issue.

The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:

http://www.peter-teichmann.de/adiv2e.html

This assembly code is very old, and your assembler may not understand its syntax. It is, however, worth porting to your toolchain. It is the fastest division code for this special case that I have seen so far, and trust me: I've benchmarked them all :-)

Last time I did that (about 5 years ago, for the Cortex-A8) this code was about 10 times faster than what the compiler generated.

This code doesn't use NEON. A NEON port would be interesting, though I'm not sure it would improve performance much.

Edit:

I found the code ported to GAS (the GNU toolchain assembler). This code is working and tested:

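Divide.S

.section ".text"

.global udiv64

@ uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor)
@ in:  r0 = low word, r1 = high word of the dividend, r2 = divisor
@ out: r0 = 32 bit quotient (valid only if hi < divisor, so that it fits)
udiv64:
    adds      r0,r0,r0        @ shift the 64 bit dividend r1:r0 left by one
    adc       r1,r1,r1

    .rept 31
        cmp     r1,r2         @ carry set if partial remainder >= divisor
        subcs   r1,r1,r2      @ if so, subtract the divisor
        adcs    r0,r0,r0      @ shift quotient bit in, next dividend bit out
        adc     r1,r1,r1      @ shift that dividend bit into the remainder
    .endr

    cmp     r1,r2             @ last step: only the final quotient bit is needed
    subcs   r1,r1,r2
    adcs    r0,r0,r0

    bx      lr                @ return the quotient in r0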
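
For checking a port of the routine above, a plain C model of the same shift-and-subtract loop can be handy. This is only a sketch: the name udiv64_ref is made up here and is not part of the original code, and it assumes hi < d and d <= 2^31, so that the quotient fits in 32 bits and the shifted remainder never overflows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the udiv64 assembly routine (name is an assumption).
   Divides the 64 bit value hi:lo by d and returns the 32 bit quotient. */
uint32_t udiv64_ref(uint32_t lo, uint32_t hi, uint32_t d)
{
    /* assumed preconditions: quotient fits in 32 bits, remainder shift cannot overflow */
    assert(d != 0 && hi < d && d <= 0x80000000u);

    uint32_t rem = hi;
    uint32_t q = 0;
    for (int i = 0; i < 32; i++) {
        rem = (rem << 1) | (lo >> 31);  /* shift the next dividend bit into the remainder */
        lo <<= 1;
        q <<= 1;
        if (rem >= d) {                 /* same test as cmp r1,r2 / subcs r1,r1,r2 */
            rem -= d;
            q |= 1;                     /* quotient bit, as shifted in by adcs r0,r0,r0 */
        }
    }
    return q;
}
```

Comparing its result with the assembly version over random inputs that satisfy the preconditions is a cheap way to catch porting mistakes.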

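C-Code:

#include <stdint.h>

#ifdef __cplusplus
extern "C" /* the assembly routine uses C linkage */
#endif
uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor);

int32_t fixdiv24 (int32_t a, int32_t b)
/* calculate (a<<24)/b with 64 bit intermediate result */
{
  int32_t q;
  int sign = (a^b) < 0; /* different signs */
  uint32_t ua,ub,l,h;
  /* take the magnitudes in unsigned arithmetic so that INT32_MIN cannot overflow */
  ua = a<0 ? 0u - (uint32_t)a : (uint32_t)a;
  ub = b<0 ? 0u - (uint32_t)b : (uint32_t)b;
  l = (ua << 24); /* low 32 bits of ua * 2^24 */
  h = (ua >> 8);  /* high 32 bits of ua * 2^24 */
  q = (int32_t) udiv64 (l,h,ub); /* needs h < ub, i.e. the result must fit in 32 bits */
  if (sign) q = -q;
  return q;
}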
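
To validate fixdiv24 against something simple, one option is a reference that lets the compiler do the 64-bit division. This is a sketch under assumptions: the name fixdiv24_ref is invented here, b must be non-zero, and the quotient must fit in an int32_t. It will be slow on cores without a hardware divider, so it is meant for testing only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference for fixdiv24 (name is an assumption):
   computes (a * 2^24) / b with the compiler's 64 bit division. */
int32_t fixdiv24_ref(int32_t a, int32_t b)
{
    assert(b != 0);  /* assumed precondition */
    /* multiply instead of shifting: left-shifting a negative value is undefined in C */
    int64_t num = (int64_t)a * ((int64_t)1 << 24);
    return (int32_t)(num / b);  /* C division truncates toward zero, like fixdiv24 */
}
```

For example, with 24 fractional bits 1.5 is 0x01800000 and 0.5 is 0x00800000, and fixdiv24_ref(0x01800000, 0x00800000) gives 0x03000000, which is 3.0.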
