"Isolate" specific Row/Column/Diagonal from a 64

OK, let's consider a 64-bit number, with its bits forming an 8x8 table.

For example:

0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0

written as

a b c d e f g h
----------------
0 1 1 0 1 0 1 0
0 1 1 0 1 0 1 1 
0 1 1 1 1 0 1 0 
0 1 1 0 1 0 1 0 
1 1 1 0 1 0 1 0 
0 1 1 0 1 0 1 0 
0 1 1 0 1 1 1 0 
0 1 1 0 1 0 1 0

Now, what if we want to isolate just, e.g., column d ( 00100000 ) - or any row/diagonal, for that matter?

Can this be done? And if so, how?


HINTS :

  • (a) My main objective here - though not initially mentioned - is raw speed. I'm searching for the fastest algorithm around, since the "retrieval" function is being performed some millions of times per second.

  • (b) This is what comes closer to what I mean : http://chessprogramming.wikispaces.com/Kindergarten+Bitboards
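  • (c) Rows are not the problem here: a row's eight bits are contiguous, so a shift and a mask suffice. A minimal sketch, assuming row 0 occupies the least significant byte (adjust if your layout differs):

    #include <cstdint>

    // Row extraction is trivial: each row is one contiguous byte.
    // Assumes row 0 is stored in the least significant byte.
    inline uint8_t get_row(uint64_t board, int row) {
        return static_cast<uint8_t>(board >> (row * 8));
    }

    It's the columns and diagonals - whose bits are spread 8, 7 or 9 positions apart - that need a clever trick.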


  • Here's a solution with only 4 main steps:

    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic = 0x2040810204081ull;
    
    int get_col(uint64_t board, int col) {
        uint64_t column = (board << col) & column_mask;
        column *= magic;
        return (column >> 56) & 0xff;
    }
    

    It works like this:

  • the board is shifted to align the column with the left side
  • it's masked so that only the required column remains (col being 0..7)
  • it's multiplied by a magic number which results in all the original bits pushed to the left side
  • the left-most byte is shifted to the right
  • The magic number is chosen to copy only the needed bits and let the rest fall into unused places / overflow over the number. The process looks like this (digits are bit "IDs", rather than the number itself):

    original column: ...1.......2.......3.......4.......5.......6.......7.......8....
    aligned column:  1.......2.......3.......4.......5.......6.......7.......8.......
    multiplied:      123456782345678.345678..45678...5678....678.....78......8.......
    shifted to right:........................................................12345678
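
    A quick way to convince yourself that the discarded products really do fall into unused positions (and never carry into the top byte) is to check the magic version against a naive bit-by-bit loop. A minimal standalone sketch - the naive helper and the test driver are illustrative additions, not part of the solution itself:

    #include <cassert>
    #include <cstdint>
    #include <cstdlib>

    // get_col from above, repeated so this check compiles on its own.
    static int get_col(uint64_t board, int col) {
        const uint64_t column_mask = 0x8080808080808080ull;
        const uint64_t magic = 0x2040810204081ull;
        uint64_t column = (board << col) & column_mask;
        column *= magic;
        return (column >> 56) & 0xff;
    }

    // Naive reference: gather bit (7 - col) of each byte, row by row.
    static int get_col_naive(uint64_t board, int col) {
        int out = 0;
        for (int row = 0; row < 8; ++row)
            out |= (int)((board >> (row * 8 + 7 - col)) & 1) << row;
        return out;
    }

    int main() {
        for (int i = 0; i < 1000000; ++i) {
            // Not uniform over all 64 bits, but fine for a sanity check.
            uint64_t board = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
            for (int col = 0; col < 8; ++col)
                assert(get_col(board, col) == get_col_naive(board, col));
        }
        return 0;
    }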
    

    If you add the const keywords, the assembly actually becomes quite nice:

    get_col:
    .LFB7:
            .cfi_startproc
            movl    %esi, %ecx
            movabsq $-9187201950435737472, %rax
            salq    %cl, %rdi
            andq    %rax, %rdi
            movabsq $567382630219905, %rax
            imulq   %rax, %rdi
            shrq    $56, %rdi
            movl    %edi, %eax
            ret
    

    No branching, no external data, around 0.4ns per calculation.

    Edit: it takes around a 6th of the time of NPE's solution, the next-fastest one.
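
    The same mask, multiply and shift idea covers diagonals too - this is exactly the Kindergarten-bitboards technique from the link in the question. A sketch under the same layout assumptions as above; the masks are for the two long diagonals, and the function names are only illustrative:

    #include <cstdint>

    // Bits on one long diagonal sit 9 apart (positions 0, 9, ..., 63),
    // on the other 7 apart (positions 7, 14, ..., 56). After masking,
    // multiplying by 0x0101010101010101 copies each surviving bit into
    // the top byte; the partial products never collide, so no carries.
    inline unsigned char get_diag9(uint64_t board) {
        const uint64_t mask = 0x8040201008040201ull;  // bits 0, 9, ..., 63
        return (unsigned char)(((board & mask) * 0x0101010101010101ull) >> 56);
    }

    inline unsigned char get_diag7(uint64_t board) {
        // Bit order comes out reversed relative to get_diag9.
        const uint64_t mask = 0x0102040810204080ull;  // bits 7, 14, ..., 56
        return (unsigned char)(((board & mask) * 0x0101010101010101ull) >> 56);
    }

    Shorter (off-centre) diagonals work the same way with their own masks; which physical diagonal each mask picks out depends on how you map rows and columns to bits.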


  • Right, so to "settle" the debate as to which is faster/slower/etc., I've put all the code into one program [and I hope I've credited the right person for the right code snippet].

    The code can be found below, for inspection that I've interpreted the code correctly when I've made it into functions. I did run it with proper output and check that each function gives the same result [bearing in mind that the order is slightly different in some cases - so I made a variation to run the other way of my code, just to see that it gives the "right" result]. So without further ado, here are the results:

    mats1 time in clocks per iteration 10.3457
    mats2 time in clocks per iteration 10.4785
    mats3 time in clocks per iteration 10.5538
    viraptor time in clocks per iteration 6.24603
    lemees time in clocks per iteration 14.4818
    npe time in clocks per iteration 13.1455
    alex time in clocks per iteration 24.8272
    

    (viraptor's results from core i5, g++ 4.7)

    mats1 time in clocks per iteration 7.62338
    mats2 time in clocks per iteration 7.36226
    mats3 time in clocks per iteration 7.45361
    viraptor time in clocks per iteration 2.09582
    lemees time in clocks per iteration 9.43744
    npe time in clocks per iteration 7.51016
    alex time in clocks per iteration 19.3554
    

    (viraptor's results from core i5, clang++ 3.2)

    mats1 time in clocks per iteration 12.956
    mats2 time in clocks per iteration 13.4395
    mats3 time in clocks per iteration 13.3178
    viraptor time in clocks per iteration 2.12914
    lemees time in clocks per iteration 13.9267
    npe time in clocks per iteration 16.2102
    alex time in clocks per iteration 13.8705
    

    That's clock cycles on a 3.4GHz AMD Athlon2 - I don't have a modern Intel machine - if someone wishes to run the code on that, I'd be interested to see what it looks like. I'm fairly sure all of it runs well within the cache - perhaps aside from the initial fetches of the input values.

    So, the winner is clearly viraptor, by about 40% - "my" code is second. Alex's code doesn't have any jumps/branches, but it still appears to run slower than the other alternatives. I'm not sure why npe's results are that much slower than mine - it does almost the same thing (and the code looks very similar in the assembler output from g++).

    #include <iostream>
    #include <fstream>
    #include <cstdint>
    #include <cstdlib>
    
    using namespace std;
    
    const int SIZE = 1000000;
    
    uint64_t g_val[SIZE];
    
    ofstream nulloutput;
    
    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
    }
    
    // Extract bit a of x and place it at bit position b.
    #define BITA_TO_B(x, a, b) (((x) >> ((a) - (b))) & (1 << (b)))
    
    unsigned char get_col_mats1(uint64_t val, int col)
    {
        return BITA_TO_B(val, 56+col, 7) |
        BITA_TO_B(val, 48+col, 6) |
        BITA_TO_B(val, 40+col, 5) |
        BITA_TO_B(val, 32+col, 4) |
        BITA_TO_B(val, 24+col, 3) |
        BITA_TO_B(val, 16+col, 2) |
        BITA_TO_B(val, 8+col, 1) |
        BITA_TO_B(val, 0+col, 0);
    }
    
    unsigned char get_col_mats2(uint64_t val, int col)
    {
        return BITA_TO_B(val, 63-col, 7) |
        BITA_TO_B(val, 55-col, 6) |
        BITA_TO_B(val, 47-col, 5) |
        BITA_TO_B(val, 39-col, 4) |
        BITA_TO_B(val, 31-col, 3) |
        BITA_TO_B(val, 23-col, 2) |
        BITA_TO_B(val, 15-col, 1) |
        BITA_TO_B(val, 7-col, 0);
    }
    
    
    unsigned char get_col_viraptor(uint64_t board, int col) {
        const uint64_t column_mask = 0x8080808080808080ull;
        const uint64_t magic = 0x2040810204081ull ;
        uint64_t column = board & (column_mask >> col);
        column <<= col;
        column *= magic;
        return (column >> 56) & 0xff;
    }
    
    
    unsigned char get_col_alex(uint64_t bitboard, int col)
    {
        unsigned char result = 0;   // must be zero-initialized before OR-ing
        result |= (bitboard & (1ULL << (63-col))) ? 0x80 : 0;
        result |= (bitboard & (1ULL << (55-col))) ? 0x40 : 0;
        result |= (bitboard & (1ULL << (47-col))) ? 0x20 : 0;
        result |= (bitboard & (1ULL << (39-col))) ? 0x10 : 0;
        result |= (bitboard & (1ULL << (31-col))) ? 0x08 : 0;
        result |= (bitboard & (1ULL << (23-col))) ? 0x04 : 0;
        result |= (bitboard & (1ULL << (15-col))) ? 0x02 : 0;
        result |= (bitboard & (1ULL << (7-col)))  ? 0x01 : 0;
    
        return result;
    }
    
    unsigned char get_col_lemees(uint64_t val, int column)
    {
        int result = 0;
        int source_bitpos = 7 - column; // "point" to last entry in this column
        for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
        {
            bool bit = (val >> source_bitpos) & 1;  // "extract" bit
            result |= bit << target_bitpos;         // add bit if it was set
            source_bitpos += 8;                     // move one up in table
        }
        return result;
    }
    
    int get(uint64_t board, int row, int col) {
      return (board >> (row * 8 + col)) & 1;
    }
    
    uint8_t get_col_npe(uint64_t board, int col) {
      uint8_t ret = 0;
      for (int i = 0; i < 8; ++i) {
        ret = (ret << 1) + get(board, i, col);
      }
      return ret;
    }
    
    
    
    // Same as BITA_TO_B above.
    #define BITA_TO_B2(x, a, b) (((x) >> ((a) - (b))) & (1 << (b)))
    
    unsigned char get_col_mats3(uint64_t val, int col)
    {
        return BITA_TO_B2(val, 63-col, 7) |
        BITA_TO_B2(val, 55-col, 6) |
        BITA_TO_B2(val, 47-col, 5) |
        BITA_TO_B2(val, 39-col, 4) |
        BITA_TO_B2(val, 31-col, 3) |
        BITA_TO_B2(val, 23-col, 2) |
        BITA_TO_B2(val, 15-col, 1) |
        BITA_TO_B2(val, 7-col, 0);
    }
    
    template<unsigned char (*f)(uint64_t val, int col)>
    void runbench(const char *name)
    {
        unsigned char col[8] = {0};
        uint64_t t = rdtsc();
        for(int j = 0; j < SIZE; j++)
        {
            uint64_t val = g_val[j];
            for(int i = 0; i < 8; i++)
            {
                col[i] += f(val, i);
            }
    //      __asm__ __volatile__("":::"memory");
        }
        t = rdtsc() - t;
        for(int i = 0; i < 8; i++)
        {
            nulloutput << "col " << i << " has bits " << hex << (int)col[i] << endl;
        }
        cout << name << " time in clocks per iteration " << dec << t / (8.0 * SIZE) << endl;
    }
    
    #define BM(name) void bench_##name() { runbench<get_col_##name>(#name); }
    
    BM(mats1);
    BM(mats2);
    BM(mats3);
    BM(viraptor);
    BM(lemees);
    BM(npe);
    BM(alex);
    
    struct function
    {
        void (*func)(void);
        const char *name;
    };
    
    
    #define FUNC(f) { bench_##f, #f }
    
    function funcs[] = 
    {
        FUNC(mats1),
        FUNC(mats2),
        FUNC(mats3),
        FUNC(viraptor),
        FUNC(lemees),
        FUNC(npe),
        FUNC(alex),
    }; 
    
    
    int main()
    {
        nulloutput.open("/dev/null");
        for(int i = 0; i < SIZE; i++)
        {
            g_val[i] = rand() + ((uint64_t)rand() << 32);
        }

        for(unsigned i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
        {
            funcs[i].func();
        }
    }
    

  • Code it up with straightforward loops and let the optimizer do the inlining and loop unrolling for you.

    Compiled using GCC 4.7.2 with -O3, on my box the following can perform about 300 million get_col() calls per second.

    bitboard.cpp:

    #include <cinttypes>
    #include <iostream>
    
    int get(uint64_t board, int row, int col) {
      return (board >> (row * 8 + col)) & 1;
    }
    
    uint8_t get_col(uint64_t board, int col) {
      uint8_t ret = 0;
      for (int i = 0; i < 8; ++i) {
        ret = (ret << 1) + get(board, i, col);
      }
      return ret;
    }
    
    extern uint64_t board;
    extern int sum;
    
    extern void f();
    
    int main() {
      for (int i = 0; i < 40000000; ++i) {
        for (int j = 0; j < 8; ++j) {
          sum += get_col(board, j);
        }
        f();
      }
      std::cout << sum << std::endl;
    }
    

    bitboard_b.cpp:

    #include <cinttypes>
    
    uint64_t board = 0x1234567890ABCDEFull;
    int sum = 0;
    
    void f() {}
    

    If you look at the assembly code for get_col(), you'll see that it contains zero loops and is probably as efficient as anything you're likely to handcraft (board, sum and f() live in a separate translation unit purely so that the optimizer can't constant-fold the whole benchmark away):

    __Z7get_colyi:
    LFB1248:
            movl    %esi, %ecx
            movq    %rdi, %rax
            movq    %rdi, %rdx
            shrq    %cl, %rax
            leal    8(%rsi), %ecx
            andl    $1, %eax
            shrq    %cl, %rdx
            leal    16(%rsi), %ecx
            andl    $1, %edx
            leal    (%rdx,%rax,2), %eax
            movq    %rdi, %rdx
            shrq    %cl, %rdx
            leal    24(%rsi), %ecx
            andl    $1, %edx
            leal    (%rdx,%rax,2), %eax
            movq    %rdi, %rdx
            shrq    %cl, %rdx
            leal    32(%rsi), %ecx
            andl    $1, %edx
            leal    (%rdx,%rax,2), %eax
            movq    %rdi, %rdx
            shrq    %cl, %rdx
            leal    40(%rsi), %ecx
            andl    $1, %edx
            leal    (%rdx,%rax,2), %edx
            movq    %rdi, %rax
            shrq    %cl, %rax
            leal    48(%rsi), %ecx
            andl    $1, %eax
            leal    (%rax,%rdx,2), %edx
            movq    %rdi, %rax
            shrq    %cl, %rax
            leal    56(%rsi), %ecx
            andl    $1, %eax
            leal    (%rax,%rdx,2), %eax
            shrq    %cl, %rdi
            andl    $1, %edi
            leal    (%rdi,%rax,2), %eax
            ret
    

    This is not meant as a complete implementation, just a rough illustration of the idea. In particular, the ordering of bits may be the opposite of what you expect, etc.
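
    If the bit order does come out reversed for your layout, you can either index from the other end when building the byte, or reverse the extracted byte afterwards. One well-known way to reverse a byte (the multiply-and-modulus trick from Sean Eron Anderson's "Bit Twiddling Hacks") looks like this:

    #include <cstdint>

    // Reverse the bits of a byte with a 64-bit multiply and a modulus.
    inline uint8_t reverse_byte(uint8_t b) {
        return (uint8_t)(((b * 0x0202020202ULL) & 0x010884422010ULL) % 1023);
    }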
