如何计算32位中的设置位数

2018-05-29 15:28:55

代表数字7的8位看起来像这样：

00000111

三位被设置。

什么算法来确定一个32位整数的设置位数？

这就是所谓的“汉明重量”，“弹出”或“横向加入”。

'最好'的算法真的取决于你在哪个CPU上以及你的使用模式是什么。

一些CPU有一个内置的指令来完成它，而另外一些则有一些作用于位向量的并行指令。并行指令（像86的popcnt用在CPU在那里的支持），几乎可以肯定是最快的。其他一些体系结构可能使用微码循环执行慢指令，每循环测试一次（需要引用）。

如果您的CPU具有较大的缓存并且/或者您正在紧密循环中执行大量这些指令，则预填充的表查找方法可能非常快。然而，它可能会因为'高速缓存未命中'的花费而受到影响，在这种情况下，CPU必须从主内存中获取一些表。

如果你知道你的字节大部分是0或大多数是1，那么这些场景就有非常有效的算法。

我相信一个非常好的通用算法如下，称为“并行”或“可变精度SWAR算法”。我已经用类似C的伪语言表达了这一点，您可能需要调整它以适用于特定语言（例如，在C ++和Java中使用uint32_t）：

int numberOfSetBits(int i)
{
     // Java: use >>> instead of >>
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

这是所讨论算法中最好的最坏情况行为，因此将有效地处理任何使用模式或值。

这种按位SWAR算法可以并行执行多个向量元素，而不是在单个整数寄存器中，以便在SIMD但没有可用的弹出计数指令的CPU上进行加速。（例如必须在任何CPU上运行的x86-64代码，而不仅仅是Nehalem或更高版本。）

但是，对于popcount使用向量指令的最佳方法通常是通过使用可变混排来并行地在每个字节的每个时间对4位进行表查找。（4位索引一个保存在向量寄存器中的16条目表）。

在Intel CPU上，硬件64位popcnt指令可以比SSSE3 PSHUFB位并行实现性能提高2倍左右，但PSHUFB是编译器能够正确实现。否则上证所可能会大幅走高。较新的编译器版本知道英特尔的弹出错误依赖关系问题。

参考文献：

https://graphics.stanford.edu/~seander/bithacks.html

https://en.wikipedia.org/wiki/Hamming_weight

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count）

还要考虑编译器的内置功能。

例如，在GNU编译器中，您可以使用：

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

在最坏的情况下，编译器会生成一个函数调用。在最好的情况下，编译器会发出cpu指令来更快地完成相同的工作。

GCC内在函数甚至可以在多个平台上工作。 Popcount将成为x86架构的主流，所以现在开始使用内在的东西是有道理的。其他架构拥有多年的帐户。

在x86上，您可以告诉编译器它可以使用-mpopcnt或-msse4.2支持popcnt指令，以启用在同一代中添加的向量指令。请参阅GCC x86选项。 -march=nehalem （或者-march=你希望你的代码假设并调整的任何CPU）可能是一个不错的选择。在较旧的CPU上运行生成的二进制文件将导致非法指令错误。

要为您构建的计算机优化二进制文件，请使用-march=native （使用gcc，clang或ICC）。

MSVC为x86 popcnt指令提供了一种内在特性，但与gcc不同的是，它实际上是硬件指令的内在特征，并且需要硬件支持。

使用std::bitset<>::count()而不是内置的

理论上，任何知道如何高效计算目标CPU的编译器都应该通过ISO C ++ std::bitset<>公开该功能。在实践中，对于某些目标CPU，在某些情况下，您可能更适合使用bit-hack AND / shift / ADD。

对于硬件弹出窗口是可选扩展（如x86）的目标体系结构，并非所有编译器都有一个std::bitset ，在可用时利用它。例如，MSVC无法在编译时启用popcnt支持，并且始终使用表查找，即使使用/Ox /arch:AVX （这意味着SSE4.2，虽然技术上对于popcnt存在单独的特征位）。

但至少你得到了一些随处可用的可移植设备，并且通过gcc / clang具有正确的目标选项，您可以获得支持它的体系结构的硬件弹出窗口。

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

在Godbolt编译器浏览器中查看gcc，clang，icc和MSVC的asm。

x86-64 gcc -O3 -std=gnu++11 -mpopcnt发出这个：

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret
unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax
    ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11发出（对于int arg版本）：

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

这个源代码根本不是x86特定的或GNU特有的，但是只能编译适用于x86的gcc / clang / icc。

还要注意，没有单指令popcount的架构的gcc的回退是一次一个字节的表查找。例如，这对于ARM来说并不美妙。

在我看来，“最好”的解决方案是另一位程序员（或两年后的原程序员）可以阅读而没有大量的评论。你可能想要一些已经提供的最快速或最聪明的解决方案，但我更喜欢随时随地阅读可读性。

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

如果你想要更快的速度（并且假设你很好地记录下来以帮助你的后继者），你可以使用表格查找：

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

虽然这些依赖于特定的数据类型大小，所以它们不是那种可移植的。但是，由于许多性能优化无论如何都不是可移植的，所以这可能不是问题。如果你想要可移植性，我会坚持可读的解决方案。

链接地址: http://www.djcxy.com/p/1995.html

上一篇: How to count the number of set bits in a 32

下一篇: What's the purpose of the LEA instruction?