Seeking maximum bitmap (aka bit array) performance with C/Intel assembly
Following on from my two previous questions, "How to improve memory performance/data locality of 64-bit C/intel assembly program" and "Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?", I have further reduced the running time of the test program mentioned in those questions from 150 seconds down to 62 seconds, as described below.
The 64-bit program has five 4 GB lookup tables (bytevecM, bytevecD, bytevecC, bytevecL, bytevecX). To reduce the (huge) number of cache misses, analysed in my last question, I added five 4 MB bitmaps, one per lookup table.
Here is the original inner loop:
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
// ... rinse and repeat for bytevecD, bytevecC, bytevecL, bytevecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
The idea behind this simple "pre-check" was to avoid the expensive inner loop if all 128 bytes were zero. However, profiling showed that this pre-check was the primary bottleneck due to huge numbers of cache misses, as discussed last time. So I created a 4 MB bitmap to do the pre-check. (BTW, around 36% of 128-byte blocks are zero, not 98% as I mistakenly reported last time).
Here is the code I used to create a 4 MB bitmap from a 4 GB lookup table:
// Last chunk index (bitmap size=((LAST_CHUNK_IDX+1)>>3)=4,194,304 bytes)
#define LAST_CHUNK_IDX 33554431
void make_bitmap(
    const unsigned char* bytevec,  // in: byte vector
    unsigned char* bitvec          // out: bitmap (assumed zero-initialized by the caller)
)
{
    unsigned int uu;
    unsigned int ucnt = 0;
    unsigned int byte;
    unsigned int bit;
    const size_t* psz;
    for (uu = 0; uu <= LAST_CHUNK_IDX; ++uu)
    {
        psz = (const size_t*)&bytevec[uu << 7];   // next 128-byte chunk
        if (psz[0] == 0 && psz[1] == 0
         && psz[2] == 0 && psz[3] == 0
         && psz[4] == 0 && psz[5] == 0
         && psz[6] == 0 && psz[7] == 0
         && psz[8] == 0 && psz[9] == 0
         && psz[10] == 0 && psz[11] == 0
         && psz[12] == 0 && psz[13] == 0
         && psz[14] == 0 && psz[15] == 0) continue;
        ++ucnt;
        byte = uu >> 3;    // byte index within the bitmap
        bit = (uu & 7);    // bit index within that byte
        bitvec[byte] |= (1 << bit);
    }
    printf("ucnt=%u hits from %u\n", ucnt, LAST_CHUNK_IDX+1);
}
Suggestions for a better way to do this are welcome.
With the bitmaps created via the function above, I then changed the "pre-check" to use the 4 MB bitmaps instead of the 4 GB lookup tables, like so (m7 >> 7 is the 128-byte chunk index, so m7 >> 10 selects the bitmap byte and (m7 >> 7) & 7 the bit within it):
if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
// ... rinse and repeat for bitvecD, bitvecC, bitvecL, bitvecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
This was "successful" in that the running time was reduced from 150 seconds down to 62 seconds in the simple single-threaded case. However, VTune still reports some pretty big numbers, as shown below.
I profiled a more realistic test with eight simultaneous threads running across different ranges. The VTune output of the inner loop check for zero blocks is shown below:
> m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
> if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
0x1400025c7 Block 15:
mov eax, r15d 1.058s
mov edx, ebx 0.109s
xor eax, ecx 0.777s
imul eax, eax, 0xf4243 1.088s
mov r9d, eax 3.369s
shr eax, 0x7 0.123s
and eax, 0x7 1.306s
movzx ecx, al 1.319s
mov eax, r9d 0.156s
shr rax, 0xa 0.248s
shl edx, cl 1.321s
test byte ptr [rax+r10*1], dl 1.832s
jz 0x140007670 2.037s
> d7 = (unsigned int)( (s6.m128i_i32[0] ^ q7) * H_PRIME );
> if ( (bitvecD[d7 >> 10] & (1 << ((d7 >> 7) & 7))) == 0 ) continue;
0x1400025f3 Block 16:
mov eax, dword ptr [rsp+0x30] 104.983s
mov edx, ebx 1.663s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.513s
mov edi, eax 1.172s
shr eax, 0x7 0.140s
and eax, 0x7 0.062s
movzx ecx, al 0.575s
mov eax, edi 0.689s
shr rax, 0xa 0.016s
shl edx, cl 0.108s
test byte ptr [rax+r11*1], dl 1.591s
jz 0x140007670 1.087s
> c7 = (unsigned int)( (s6.m128i_i32[1] ^ q7) * H_PRIME );
> if ( (bitvecC[c7 >> 10] & (1 << ((c7 >> 7) & 7))) == 0 ) continue;
0x14000261f Block 17:
mov eax, dword ptr [rsp+0x34] 75.863s
mov edx, 0x1 1.097s
xor eax, r15d 0.031s
imul eax, eax, 0xf4243 0.265s
mov ebx, eax 0.512s
shr eax, 0x7 0.016s
and eax, 0x7 0.233s
movzx ecx, al 0.233s
mov eax, ebx 0.279s
shl edx, cl 0.109s
mov rcx, qword ptr [rsp+0x58] 0.652s
shr rax, 0xa 0.171s
movzx ecx, byte ptr [rax+rcx*1] 0.126s
test cl, dl 77.918s
jz 0x140007667
> l7 = (unsigned int)( (s6.m128i_i32[2] ^ q7) * H_PRIME );
> if ( (bitvecL[l7 >> 10] & (1 << ((l7 >> 7) & 7))) == 0 ) continue;
0x140002655 Block 18:
mov eax, dword ptr [rsp+0x38] 0.980s
mov edx, 0x1 0.794s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.187s
mov r11d, eax 0.278s
shr eax, 0x7 0.062s
and eax, 0x7 0.218s
movzx ecx, al 0.218s
mov eax, r11d 0.186s
shl edx, cl 0.031s
mov rcx, qword ptr [rsp+0x50] 0.373s
shr rax, 0xa 0.233s
movzx ecx, byte ptr [rax+rcx*1] 0.047s
test cl, dl 55.060s
jz 0x14000765e
In addition to that, large amounts of time were (confusingly to me) attributed to this line:
> for (q6 = 1; q6 < 128; ++q6) {
0x1400075a1 Block 779:
inc edx 0.124s
mov dword ptr [rsp+0x10], edx
cmp edx, 0x80 0.031s
jl 0x140002574
mov ecx, dword ptr [rsp+0x4]
mov ebx, dword ptr [rsp+0x48]
...
0x140007575 Block 772:
mov edx, dword ptr [rsp+0x10] 0.699s
...
0x14000765e Block 789 (note: jz in l7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 1.169s
jmp 0x14000757e 0.791s
0x140007667 Block 790 (note: jz in c7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 2.261s
jmp 0x140007583 1.461s
0x140007670 Block 791 (note: jz in m7/d7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 108.355s
jmp 0x140007588 6.922s
I don't fully understand the big numbers in the VTune output above. If anyone can shed more light on these numbers, I'm all ears.
It seems to me that my five 4 MB bitmaps (20 MB in total) are bigger than the 8 MB L3 cache of my Core i7 3770 processor, leading to many cache misses (though far fewer than before). If my CPU had a 30 MB L3 cache (as the upcoming Ivy Bridge-E has), I speculate that this program would run a lot faster because all five bitmaps would fit comfortably in the L3 cache. Is that right?
Further to that, since the code to test the bitmaps, namely:
m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
(bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0
now appears five times in the inner loop, any suggestions for speeding up this code are very welcome.
Within the core bits of the loop, using the _bittest() MSVC intrinsic for the bitmap check combines the shl/test combo the compiler creates into a single instruction with (on Sandy Bridge) no latency/throughput penalty, i.e. it should shave a few cycles off.
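As a rough, untested sketch, the m7 pre-check might then look like the following, assuming MSVC's <intrin.h> and that _bittest emits a bt with a memory operand so the bit offset m7 >> 7 (the 128-byte chunk index) can address the whole 4 MB bitmap; m6, q7, H_PRIME and bitvecM are the names from the question:
#include <intrin.h>

// sketch only: one bt instruction replaces the shr/and/shl/test sequence
m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
if ( !_bittest( (const long*)bitvecM, (long)(m7 >> 7) ) ) continue;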
Beyond that, I can only think of calculating the bitmaps by map-reducing bit sets via recursive POR, as a variation on your zero testing that might be worth benchmarking:
// SSE2 intrinsics (<emmintrin.h>); the ... placeholders are as in the original sketch
for (int i = 0; i < MAX_IDX; i++) {
    __m128i v[8];
    __m128i* ptr = ...[i << ...];   // one 128-byte (8 x 16-byte) chunk
    v[0] = _mm_load_si128(ptr + 0);
    v[1] = _mm_load_si128(ptr + 1);
    v[2] = _mm_load_si128(ptr + 2);
    v[3] = _mm_load_si128(ptr + 3);
    v[4] = _mm_load_si128(ptr + 4);
    v[5] = _mm_load_si128(ptr + 5);
    v[6] = _mm_load_si128(ptr + 6);
    v[7] = _mm_load_si128(ptr + 7);
    v[0] = _mm_or_si128(v[0], v[1]);
    v[2] = _mm_or_si128(v[2], v[3]);
    v[4] = _mm_or_si128(v[4], v[5]);
    v[6] = _mm_or_si128(v[6], v[7]);
    v[0] = _mm_or_si128(v[0], v[2]);
    v[2] = _mm_or_si128(v[4], v[6]);
    v[0] = _mm_or_si128(v[0], v[2]);
    // every byte of the OR-accumulated v[0] is zero iff all 128 bytes are zero
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_setzero_si128(), v[0])) != 0xFFFF) {
        // the contents aren't all zero
    }
    ...
}
At this point, the pure load / accumulate-OR / extract-mask sequence might be better than a tight loop of SSE4.1 PTEST because there's no flags dependency and no branches.
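For comparison, here is a rough sketch (my assumption, not from the original answer) of the tight PTEST loop being argued against, using the SSE4.1 _mm_testz_si128 intrinsic and the same hypothetical ptr as above; each iteration's branch has to wait on the flags PTEST sets:
int nonzero = 0;
for (int j = 0; j < 8; j++) {
    __m128i v = _mm_load_si128(ptr + j);   // one 16-byte slice of the 128-byte chunk
    if (!_mm_testz_si128(v, v)) {          // PTEST: returns 1 iff (v AND v) == 0
        nonzero = 1;                       // found a non-zero byte, stop early
        break;
    }
}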
For a 128-byte buffer, compare it using larger integers:
#include <stdbool.h>

// Treat the 128-byte buffer as 64-bit words and test those instead of individual bytes.
bool is_all_zero(const unsigned char cbuf[128])   // illustrative wrapper around the original fragment
{
    const unsigned long long *lbuf = (const unsigned long long *)cbuf;
    size_t i;
    for (i = 0; i < 128 / sizeof(unsigned long long); i++) {
        if (lbuf[i]) return false; // something is not zero
    }
    return true; // all zero
}
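A hypothetical call site, reusing the bytevecM/m7 names from the question to select one aligned 128-byte chunk:
// hypothetical usage against one 128-byte chunk of a 4 GB lookup table
if (is_all_zero(&bytevecM[(unsigned int)m7 & 0xffffff80])) continue;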