32-bit multiplication through 16-bit shifting

I am writing a software multiplication function using shifts and additions. The existing function looks like this:

unsigned long __mulsi3 (unsigned long a, unsigned long b) {

    unsigned long answer = 0;

    while(b)
    {
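        if(b & 1) {
            answer += a;   /* this bit of b is set: add the current partial product */
        }

        a <<= 1;           /* next partial product is a * 2 */
        b >>= 1;           /* move on to the next bit of b */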
    }
    return answer;
}

My hardware does not have a multiplier, but it does have a hardware shifter that can shift up to 16 bits at a time.

I want to make full use of my 16-bit shifter. Any suggestions on how I can adapt the code above to reflect my hardware's capabilities? The given code shifts only 1 bit per iteration.

The 16-bit shifter can shift 32-bit unsigned long values by up to 16 places at a time. unsigned long is 32 bits wide (sizeof(unsigned long) == 4).


The basic approach (assuming a shift by 1) is:

  • Shift the top 16 bits.
  • Set the bottom bit of the top 16 bits to the top bit of the bottom 16 bits.
  • Shift the bottom 16 bits.

The details depend a bit on your hardware, but you could try the following, assuming:

  • unsigned long is 32 bits
  • the target is big-endian

then:

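    /* Assumes 32-bit unsigned long, 16-bit unsigned short and a big-endian
       target, so s[0] holds the top 16 bits and s[1] the bottom 16 bits. */
    union Data32
    {
        unsigned long l;
        unsigned short s[2];
    };

    /* Valid for bitsToShift in 1..15. */
    unsigned long shiftleft32(unsigned long valueToShift, unsigned short bitsToShift)
    {
        union Data32 u;
        u.l = valueToShift;
        u.s[0] <<= bitsToShift;                                   /* shift the top 16 bits */
        u.s[0] |= (unsigned short)(u.s[1] >> (16 - bitsToShift)); /* carry in bits leaving the bottom half */
        u.s[1] <<= bitsToShift;                                   /* shift the bottom 16 bits */

        return u.l;
    }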
    

    Then do the same in reverse for shifting right.
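The union version above relies on a big-endian layout and on unsigned long being exactly 32 bits. As a sketch of the same three steps, plus the reverse for right shifts, the variant below keeps the value as two explicit 16-bit halves, so only 16-bit shifts are involved and byte order does not matter. It assumes `<stdint.h>` is available; `pair16`, `shl32_halves`, `shr32_halves`, `split32` and `join32` are hypothetical names, and the shift count must be in 1..15.

```c
#include <stdint.h>

/* Hypothetical representation: a 32-bit value held as two 16-bit halves. */
typedef struct { uint16_t hi, lo; } pair16;

/* Left shift by n (1..15): shift the top half, pull in the bits that leave
   the bottom half, then shift the bottom half. */
static pair16 shl32_halves(pair16 x, unsigned n)
{
    pair16 r;
    r.hi = (uint16_t)((x.hi << n) | (x.lo >> (16 - n)));
    r.lo = (uint16_t)(x.lo << n);
    return r;
}

/* Right shift by n (1..15): the mirror image, carrying bits from the top
   half down into the bottom half. */
static pair16 shr32_halves(pair16 x, unsigned n)
{
    pair16 r;
    r.lo = (uint16_t)((x.lo >> n) | (x.hi << (16 - n)));
    r.hi = (uint16_t)(x.hi >> n);
    return r;
}

/* Convenience conversions to and from a uint32_t. */
static pair16 split32(uint32_t v)
{
    pair16 p = { (uint16_t)(v >> 16), (uint16_t)(v & 0xFFFFu) };
    return p;
}

static uint32_t join32(pair16 p)
{
    return ((uint32_t)p.hi << 16) | p.lo;
}
```

For example, shifting 0x12345678 left by 4 gives 0x23456780, and shifting it right by 4 gives 0x01234567.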


    Having 16-bit shifts can give you a minor speedup with the following approach:

    (U1 * P + U0) * (V1 * P + V0) =
    = U1 * V1 * P * P + U1 * V0 * P + U0 * V1 * P + U0 * V0 =
    = U1 * V1 * (P*P+P) + (U1-U0) * (V0-V1) * P + U0 * V0 * (P+1)
    

    provided P is a convenient power of 2 (for example, 2**16 or 2**32), so multiplying by it is a fast shift. This reduces the work from 4 to 3 multiplications of smaller numbers and, applied recursively, gives O(N**log2(3)) ≈ O(N**1.585) instead of O(N**2) for very long numbers.

    This method is described, among other places, in Knuth's The Art of Computer Programming (TAOCP), together with more advanced variants.
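A minimal sketch of the identity above with P = 2**16, assuming `<stdint.h>` types: it forms a full 64-bit product from three 16x16 products, and every multiplication by P or P*P becomes a shift by 16 or 32. The 16x16 products reuse the question's shift-and-add loop; `mul16` and `mul32x32_karatsuba` are hypothetical names, and on the real target you would substitute whatever 16-bit routine is fastest.

```c
#include <stdint.h>

/* Shift-and-add product of two 16-bit magnitudes (result fits in 32 bits). */
static uint32_t mul16(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    while (b) {
        if (b & 1)
            r += a;
        a <<= 1;
        b >>= 1;
    }
    return r;
}

/* uv = U1*V1*(P*P + P) + (U1 - U0)*(V0 - V1)*P + U0*V0*(P + 1), P = 2^16 */
uint64_t mul32x32_karatsuba(uint32_t u, uint32_t v)
{
    uint32_t u1 = u >> 16, u0 = u & 0xFFFFu;
    uint32_t v1 = v >> 16, v0 = v & 0xFFFFu;

    uint64_t hi = mul16(u1, v1);            /* U1*V1 */
    uint64_t lo = mul16(u0, v0);            /* U0*V0 */

    /* Middle term (U1 - U0)*(V0 - V1) can be negative: multiply the
       magnitudes and track the sign separately. */
    int neg = 0;
    uint32_t du, dv;
    if (u1 >= u0) du = u1 - u0; else { du = u0 - u1; neg = !neg; }
    if (v0 >= v1) dv = v0 - v1; else { dv = v1 - v0; neg = !neg; }
    uint64_t mid = mul16(du, dv);

    /* Multiplications by P and P*P are shifts by 16 and 32. */
    uint64_t r = (hi << 32) + (hi << 16) + (lo << 16) + lo;
    if (neg)
        r -= mid << 16;
    else
        r += mid << 16;
    return r;   /* exact: the true product always fits in 64 bits */
}
```

The intermediate sums are taken modulo 2**64, which is safe because the final result, the true product, is always below 2**64.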

    For small numbers (e.g. 8 by 8 bits), the following method is fast if you have enough fast ROM:

    a * b = square(a+b)/4 - square(a-b)/4
    

    If you tabulate int(square(x)/4), you'll need 1022 bytes for unsigned multiplication and 510 bytes for the signed version.
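A minimal sketch of the unsigned quarter-square method, assuming `<stdint.h>` types. The table holds floor(x*x/4) for x = 0..510, i.e. 511 16-bit entries (1022 bytes). Here it is filled at run time by the hypothetical `init_qsq` only to keep the example self-contained; on a multiplier-less target it would be precomputed into ROM. `qsq` and `mul8_qsq` are hypothetical names.

```c
#include <stdint.h>

/* floor(x*x/4) for x = 0..510: 511 entries x 2 bytes = 1022 bytes. */
static uint16_t qsq[511];

/* Fills the table; on the target this would be done offline into ROM. */
static void init_qsq(void)
{
    for (uint32_t x = 0; x < 511; x++)
        qsq[x] = (uint16_t)((x * x) >> 2);
}

/* 8x8 -> 16-bit unsigned product:
   a*b = floor((a+b)^2/4) - floor((a-b)^2/4).
   The floors cancel exactly because a+b and a-b have the same parity. */
static uint16_t mul8_qsq(uint8_t a, uint8_t b)
{
    unsigned sum  = (unsigned)a + b;
    unsigned diff = (a >= b) ? (unsigned)(a - b) : (unsigned)(b - a);
    return (uint16_t)(qsq[sum] - qsq[diff]);
}
```

Each product then costs one addition, one subtraction, two table lookups and a final subtraction.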


    The code above multiplies the traditional way, the way we learnt in primary school:

    Example:

        0101
      * 0111
      -------
        0101
       0101.
      0101..
     --------
      100011
    

    Of course, you cannot approach it like that if you have neither a multiplication operator nor a 1-bit shifter. You can still do it in other ways, for example with a loop:

    unsigned long _mult(unsigned long a, unsigned long b)
    {
        unsigned long res = 0;
    
        /* add b to the result a times: a iterations in total */
        while (a > 0)
        {
            res += b;
            a--;
        }
    
        return res;
    }
    

    It is costly (the loop runs a times), but it serves your needs. You can consider other approaches if you have more constraints (such as computation time).
