Fastest way to count bits

2018-06-25 22:46:31

Possible Duplicate:
How to count the number of set bits in a 32-bit integer?

Give a unsigned char type value,count the total bits in it.What's the fastest way? I wrote three function as below,what's the best way,and can someone come up with a faster one?(I just want the extremely fast one)

const int tbl[] =
{
#define B2(n)   n, n+1, n+1, n+2
#define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
#define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
    B6(0), B6(1), B6(1), B6(2)
};

char naivecount (unsigned char val)
{
    char cnt = 0;
    while (val)
    {
        cnt += (val & 1);
        val = val >> 1;
    }
    return cnt;
}

inline tableLookUp(int val)
{
    assert(val >= 0 && val <= 255);
    return tbl[val];
}

int asmCount(int val)
{
    int res = 0;
    asm volatile("xor %0, %0nt"
            "begin:nt"
            "cmp $0x0, %1nt"
            "jle endnt"
            "movl %1, %%ecxnt"
            "and $0x1, %%ecxnt"
            "addl %%ecx, %0nt"
            "shrl %1nt"
            "jmp beginnt"
            "end:"
            : "=r"(res)
            : "r" (val));
    return res;
}

EDIT:

I have test all the method,the fastest one is to use the popcntl instruction.In platform without the instruction,I will use table look-up.

If you want to code it by hand, try this:

#include <stdint.h>

int popcnt8(uint8_t x) {

    x = (x & 0x55) + (x >> 1 & 0x55);
    x = (x & 0x33) + (x >> 2 & 0x33);
    x = (x & 0x0f) + (x >> 4 & 0x0f);

    return x;
}

on x86, this compiles to (AT&T-syntax):

popcnt8:
    movl    %edi, %eax
    shrb    %dil
    andl    $85, %eax
    andl    $85, %edi
    addl    %eax, %edi
    movl    %edi, %eax
    shrb    $2, %dil
    andl    $51, %eax
    andl    $51, %edi
    addl    %eax, %edi
    movl    %edi, %eax
    shrb    $4, %dil
    andl    $15, %eax
    addl    %edi, %eax
    movzbl  %al, %eax
    ret

Compare this to what gcc generates with the intrinsic:

#include <stdint.h>

int popcnt8_intrin(uint8_t x) { return __builtin_popcount(x); }

On x86 with SSE 4.2:

popcnt8_intrin:
movzbl  %dil, %eax
popcntl %eax, %eax
ret

which is not optimal; clang generates:

popcnt8_intrin:
    popcntl %edi,%eax
    ret

reducing the calculation to one (!) instruction.

On x86 without SSE 4.2:

popcnt8_intrin:
subq    $8, %rsp
movzbl  %dil, %edi
call    __popcountdi2
addq    $8, %rsp
ret

gcc essentially calls its library here. Not quite optimal. clang does a little better:

popcnt8_intrin:                         # @popcnt8_intrin
movl    %edi, %eax
shrl    %eax
andl    $85, %eax
subl    %eax, %edi
movl    %edi, %eax
andl    $858993459, %eax        # imm = 0x33333333
shrl    $2, %edi
andl    $858993459, %edi        # imm = 0x33333333
addl    %eax, %edi
movl    %edi, %eax
shrl    $4, %eax
addl    %edi, %eax
andl    $252645135, %eax        # imm = 0xF0F0F0F
imull   $16843009, %eax, %eax   # imm = 0x1010101
shrl    $24, %eax
ret

clang calculates popcnt for a whole 32 bit number. This is not optimal imho.

Your assembler code would be faster if you didn't do so many compares and branches that vary from taken and not taken.

But clearly, the fastest method is to do a byte lookup, particularly as you are only dealing with 256 values (you can use the naive method to write a list of the values, then just have a static const table[256] = { ... }; return table[value]; in your function.

Benchmark the different solutions.

I wouldn't be surprised if your assembler code is slower than the compiler generated code!

Edit: Your assembler code would be slight faster by doing:

int asmCount(int val)
{
    int res = 0;
    asm volatile("begin:nt"
            "movl %1, %%ecxnt"
            "and $0x1, %%ecxnt"
            "addl %%ecx, %0nt"
            "shrl %1nt"
            "jnz beginnt"
            "end:"
            : "=r"(res)
            : "r" (val)
            : "ecx");      // Important: clobbers ecx!
    return res;
}

I removed the xor (res = 0 should do that anyways), and compare (sure, if val is zero, we execute a few extra instructions, but for anything with high bits set, it is much worse, since it's two extra instructions for every iteration, meaning potentially 16 extra instructions - and one of them a branch!), and changed the jump to a jnz at the end of the loop. It probably is roughly what the compiler generates in your first case. Trying to beat the compiler on simple code isn't that easy!

链接地址: http://www.djcxy.com/p/72596.html

上一篇: C ++如何获得一个变量的位长度？

下一篇: 计数位数最快的方法