Seeking maximum bitmap (aka bit array) performance with C/Intel assembly
Following on from my two previous questions, "How to improve memory performance/data locality of 64-bit C/intel assembly program" and "Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?", I have further reduced the running time of the test program mentioned in those questions from 150 seconds down to 62 seconds, as described below.
The 64-bit program has five 4 GB lookup tables (bytevecM, bytevecD, bytevecC, bytevecL, bytevecX). To reduce the (huge) number of cache misses, analysed in my last question, I added five 4 MB bitmaps, one per lookup table.
Here is the original inner loop:
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
// ... rinse and repeat for bytevecD, bytevecC, bytevecL, bytevecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
The idea behind this simple "pre-check" was to avoid the expensive inner loop if all 128 bytes were zero. However, profiling showed that this pre-check was the primary bottleneck due to huge numbers of cache misses, as discussed last time. So I created a 4 MB bitmap to do the pre-check. (BTW, around 36% of 128-byte blocks are zero, not 98% as I mistakenly reported last time).
Here is the code I used to create a 4 MB bitmap from a 4 GB lookup table:
// Last chunk index (bitmap size=((LAST_CHUNK_IDX+1)>>3)=4,194,304 bytes)
#define LAST_CHUNK_IDX 33554431
void make_bitmap(
    const unsigned char* bytevec,  // in: byte vector
    unsigned char* bitvec          // out: bitmap (assumed zero-initialized by the caller)
)
{
    unsigned int uu;
    unsigned int ucnt = 0;
    unsigned int byte;
    unsigned int bit;
    const size_t* psz;
    for (uu = 0; uu <= LAST_CHUNK_IDX; ++uu)
    {
        psz = (const size_t*)&bytevec[uu << 7];   // next 128-byte chunk
        if (psz[0] == 0 && psz[1] == 0
         && psz[2] == 0 && psz[3] == 0
         && psz[4] == 0 && psz[5] == 0
         && psz[6] == 0 && psz[7] == 0
         && psz[8] == 0 && psz[9] == 0
         && psz[10] == 0 && psz[11] == 0
         && psz[12] == 0 && psz[13] == 0
         && psz[14] == 0 && psz[15] == 0) continue;
        ++ucnt;
        byte = uu >> 3;    // byte index within the bitmap
        bit = (uu & 7);    // bit index within that byte
        bitvec[byte] |= (1 << bit);
    }
    printf("ucnt=%u hits from %u\n", ucnt, LAST_CHUNK_IDX+1);
}
Suggestions for a better way to do this are welcome.
With the bitmaps created via the function above, I then changed the "pre-check" to use the 4 MB bitmaps instead of the 4 GB lookup tables, like so (m7 >> 7 is the 128-byte chunk index, so m7 >> 10 selects the bitmap byte and (m7 >> 7) & 7 the bit within it):
if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
// ... rinse and repeat for bitvecD, bitvecC, bitvecL, bitvecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
This was "successful" in that the running time was reduced from 150 seconds down to 62 seconds in the simple single-threaded case. However, VTune still reports some pretty big numbers, as shown below.
I profiled a more realistic test with eight simultaneous threads running across different ranges. The VTune output of the inner loop check for zero blocks is shown below:
> m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
> if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
0x1400025c7 Block 15:
mov eax, r15d 1.058s
mov edx, ebx 0.109s
xor eax, ecx 0.777s
imul eax, eax, 0xf4243 1.088s
mov r9d, eax 3.369s
shr eax, 0x7 0.123s
and eax, 0x7 1.306s
movzx ecx, al 1.319s
mov eax, r9d 0.156s
shr rax, 0xa 0.248s
shl edx, cl 1.321s
test byte ptr [rax+r10*1], dl 1.832s
jz 0x140007670 2.037s
> d7 = (unsigned int)( (s6.m128i_i32[0] ^ q7) * H_PRIME );
> if ( (bitvecD[d7 >> 10] & (1 << ((d7 >> 7) & 7))) == 0 ) continue;
0x1400025f3 Block 16:
mov eax, dword ptr [rsp+0x30] 104.983s
mov edx, ebx 1.663s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.513s
mov edi, eax 1.172s
shr eax, 0x7 0.140s
and eax, 0x7 0.062s
movzx ecx, al 0.575s
mov eax, edi 0.689s
shr rax, 0xa 0.016s
shl edx, cl 0.108s
test byte ptr [rax+r11*1], dl 1.591s
jz 0x140007670 1.087s
> c7 = (unsigned int)( (s6.m128i_i32[1] ^ q7) * H_PRIME );
> if ( (bitvecC[c7 >> 10] & (1 << ((c7 >> 7) & 7))) == 0 ) continue;
0x14000261f Block 17:
mov eax, dword ptr [rsp+0x34] 75.863s
mov edx, 0x1 1.097s
xor eax, r15d 0.031s
imul eax, eax, 0xf4243 0.265s
mov ebx, eax 0.512s
shr eax, 0x7 0.016s
and eax, 0x7 0.233s
movzx ecx, al 0.233s
mov eax, ebx 0.279s
shl edx, cl 0.109s
mov rcx, qword ptr [rsp+0x58] 0.652s
shr rax, 0xa 0.171s
movzx ecx, byte ptr [rax+rcx*1] 0.126s
test cl, dl 77.918s
jz 0x140007667
> l7 = (unsigned int)( (s6.m128i_i32[2] ^ q7) * H_PRIME );
> if ( (bitvecL[l7 >> 10] & (1 << ((l7 >> 7) & 7))) == 0 ) continue;
0x140002655 Block 18:
mov eax, dword ptr [rsp+0x38] 0.980s
mov edx, 0x1 0.794s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.187s
mov r11d, eax 0.278s
shr eax, 0x7 0.062s
and eax, 0x7 0.218s
movzx ecx, al 0.218s
mov eax, r11d 0.186s
shl edx, cl 0.031s
mov rcx, qword ptr [rsp+0x50] 0.373s
shr rax, 0xa 0.233s
movzx ecx, byte ptr [rax+rcx*1] 0.047s
test cl, dl 55.060s
jz 0x14000765e
In addition to that, large amounts of time were (confusingly to me) attributed to this line:
> for (q6 = 1; q6 < 128; ++q6) {
0x1400075a1 Block 779:
inc edx 0.124s
mov dword ptr [rsp+0x10], edx
cmp edx, 0x80 0.031s
jl 0x140002574
mov ecx, dword ptr [rsp+0x4]
mov ebx, dword ptr [rsp+0x48]
...
0x140007575 Block 772:
mov edx, dword ptr [rsp+0x10] 0.699s
...
0x14000765e Block 789 (note: jz in l7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 1.169s
jmp 0x14000757e 0.791s
0x140007667 Block 790 (note: jz in c7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 2.261s
jmp 0x140007583 1.461s
0x140007670 Block 791 (note: jz in m7/d7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 108.355s
jmp 0x140007588 6.922s
I don't fully understand the big numbers in the VTune output above. If anyone can shed more light on these numbers, I'm all ears.
It seems to me that my five 4 MB bitmaps (20 MB in total) are bigger than the 8 MB L3 cache of my Core i7 3770 processor, leading to many cache misses (though far fewer than before). If my CPU had a 30 MB L3 cache (as the upcoming Ivy Bridge-E has), I speculate that this program would run a lot faster because all five bitmaps would fit comfortably in the L3 cache. Is that right?
Further to that, since the code to test the bitmaps, namely:
m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
(bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0
now appears five times in the inner loop, any suggestions for speeding up this code are very welcome.
Within the core bits of the loop, using the _bittest() MSVC intrinsic for the bitmap check combines the shl/test combo the compiler creates into a single instruction with (on Sandy Bridge) no latency/throughput penalty, i.e. it should shave a few cycles off.
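As a rough, untested sketch, the m7 pre-check might then look like the following, assuming MSVC's <intrin.h> and that _bittest emits a bt with a memory operand so the bit offset m7 >> 7 (the 128-byte chunk index) can address the whole 4 MB bitmap; m6, q7, H_PRIME and bitvecM are the names from the question:
#include <intrin.h>

// sketch only: one bt instruction replaces the shr/and/shl/test sequence
m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
if ( !_bittest( (const long*)bitvecM, (long)(m7 >> 7) ) ) continue;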
Beyond that, I can only think of calculating the bitmaps by map-reducing bit sets via recursive POR, as a variation on your zero testing that might be worth benchmarking:
// SSE2 intrinsics (<emmintrin.h>); the ... placeholders are as in the original sketch
for (int i = 0; i < MAX_IDX; i++) {
    __m128i v[8];
    __m128i* ptr = ...[i << ...];   // one 128-byte (8 x 16-byte) chunk
    v[0] = _mm_load_si128(ptr + 0);
    v[1] = _mm_load_si128(ptr + 1);
    v[2] = _mm_load_si128(ptr + 2);
    v[3] = _mm_load_si128(ptr + 3);
    v[4] = _mm_load_si128(ptr + 4);
    v[5] = _mm_load_si128(ptr + 5);
    v[6] = _mm_load_si128(ptr + 6);
    v[7] = _mm_load_si128(ptr + 7);
    v[0] = _mm_or_si128(v[0], v[1]);
    v[2] = _mm_or_si128(v[2], v[3]);
    v[4] = _mm_or_si128(v[4], v[5]);
    v[6] = _mm_or_si128(v[6], v[7]);
    v[0] = _mm_or_si128(v[0], v[2]);
    v[2] = _mm_or_si128(v[4], v[6]);
    v[0] = _mm_or_si128(v[0], v[2]);
    // every byte of the OR-accumulated v[0] is zero iff all 128 bytes are zero
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_setzero_si128(), v[0])) != 0xFFFF) {
        // the contents aren't all zero
    }
    ...
}
At this point, the pure load / accumulate-OR / extract-mask sequence might be better than a tight loop of SSE4.1 PTEST because there's no flags dependency and no branches.
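For comparison, here is a rough sketch (my assumption, not from the original answer) of the tight PTEST loop being argued against, using the SSE4.1 _mm_testz_si128 intrinsic and the same hypothetical ptr as above; each iteration's branch has to wait on the flags PTEST sets:
int nonzero = 0;
for (int j = 0; j < 8; j++) {
    __m128i v = _mm_load_si128(ptr + j);   // one 16-byte slice of the 128-byte chunk
    if (!_mm_testz_si128(v, v)) {          // PTEST: returns 1 iff (v AND v) == 0
        nonzero = 1;                       // found a non-zero byte, stop early
        break;
    }
}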
For a 128-byte buffer, compare it using larger integers:
#include <stdbool.h>

// Treat the 128-byte buffer as 64-bit words and test those instead of individual bytes.
bool is_all_zero(const unsigned char cbuf[128])   // illustrative wrapper around the original fragment
{
    const unsigned long long *lbuf = (const unsigned long long *)cbuf;
    size_t i;
    for (i = 0; i < 128 / sizeof(unsigned long long); i++) {
        if (lbuf[i]) return false; // something is not zero
    }
    return true; // all zero
}
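A hypothetical call site, reusing the bytevecM/m7 names from the question to select one aligned 128-byte chunk:
// hypothetical usage against one 128-byte chunk of a 4 GB lookup table
if (is_all_zero(&bytevecM[(unsigned int)m7 & 0xffffff80])) continue;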