Why does my 8M L3 cache not provide any benefit for arrays larger than 1M?

I was inspired by this question to write a simple program to test my machine's memory bandwidth in each cache level:

Why vectorizing the loop does not have performance improvement

My code uses memset to write to a buffer (or buffers) over and over and measures the speed. It also saves the address of every buffer to print at the end. Here's the listing:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE_KB {8, 16, 24, 28, 32, 36, 40, 48, 64, 128, 256, 384, 512, 768, 1024, 1025, 2048, 4096, 8192, 16384, 200000}
#define TESTMEM 10000000000 // Approximate, in bytes
#define BUFFERS 1

double timer(void)
{
    struct timeval ts;
    double ans;

    gettimeofday(&ts, NULL);
    ans = ts.tv_sec + ts.tv_usec*1.0e-6;

    return ans;
}

int main(int argc, char **argv)
{
    double *x[BUFFERS];
    double t1, t2;
    int kbsizes[] = SIZE_KB;
    double bandwidth[sizeof(kbsizes)/sizeof(int)];
    int iterations[sizeof(kbsizes)/sizeof(int)];
    double *address[sizeof(kbsizes)/sizeof(int)][BUFFERS];
    int i, j, k;

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
        iterations[k] = TESTMEM/(kbsizes[k]*1024);

    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        // Allocate
        for (j = 0; j < BUFFERS; j++)
        {
            x[j] = (double *) malloc(kbsizes[k]*1024);
            address[k][j] = x[j];
            memset(x[j], 0, kbsizes[k]*1024);
        }

        // Measure
        t1 = timer();
        for (i = 0; i < iterations[k]; i++)
        {
            for (j = 0; j < BUFFERS; j++)
                memset(x[j], 0xff, kbsizes[k]*1024);
        }
        t2 = timer();
        bandwidth[k] = (BUFFERS*kbsizes[k]*iterations[k])/1024.0/1024.0/(t2-t1);

        // Free
        for (j = 0; j < BUFFERS; j++)
            free(x[j]);
    }

    printf("TESTMEM = %ld\n", (long) TESTMEM);
    printf("BUFFERS = %d\n", BUFFERS);
    printf("Size (kB)\tBandwidth (GB/s)\tIterations\tAddresses\n");
    for (k = 0; k < sizeof(kbsizes)/sizeof(int); k++)
    {
        printf("%7d\t\t%.2f\t\t\t%d\t\t%p", kbsizes[k], bandwidth[k], iterations[k], (void *) address[k][0]);
        for (j = 1; j < BUFFERS; j++)
            printf(", %p", (void *) address[k][j]);
        printf("\n");
    }
    }

    return 0;
}

And the results (with BUFFERS = 1):

TESTMEM = 10000000000
BUFFERS = 1
Size (kB)   Bandwidth (GB/s)    Iterations  Addresses
      8     52.79               1220703     90b010
     16     56.48               610351      90b010
     24     57.01               406901      90b010
     28     57.13               348772      90b010
     32     45.40               305175      90b010
     36     38.11               271267      90b010
     40     38.02               244140      90b010
     48     38.12               203450      90b010
     64     37.51               152587      90b010
    128     36.89               76293       90b010
    256     35.58               38146       d760f010
    384     31.01               25431       d75ef010
    512     26.79               19073       d75cf010
    768     26.20               12715       d758f010
   1024     26.20               9536        d754f010
   1025     18.30               9527        90b010
   2048     18.29               4768        d744f010
   4096     18.29               2384        d724f010
   8192     18.31               1192        d6e4f010
  16384     18.31               596         d664f010
 200000     18.32               48          cb2ff010

I can easily see the effect of the 32K L1 cache and 256K L2 cache. What I don't understand is why performance drops suddenly once the size of the memset buffer exceeds 1M. My L3 cache is supposed to be 8M. The drop is also abrupt, not tapered at all like when the L1 and L2 cache sizes were exceeded.

My processor is the Intel i7 3700. The details of the L3 cache from /sys/devices/system/cpu/cpu0/cache are:

level = 3
coherency_line_size = 64
number_of_sets = 8192
physical_line_partition = 1
shared_cpu_list = 0-7
shared_cpu_map = ff
size = 8192K
type = Unified
ways_of_associativity = 16

I thought I would try using multiple buffers - call memset on 2 buffers of 1M each and see if performance would drop. With BUFFERS = 2, I get:

TESTMEM = 10000000000
BUFFERS = 2
Size (kB)   Bandwidth (GB/s)    Iterations  Addresses
      8     54.15               1220703     e59010, e5b020
     16     51.52               610351      e59010, e5d020
     24     38.94               406901      e59010, e5f020
     28     38.53               348772      e59010, e60020
     32     38.31               305175      e59010, e61020
     36     38.29               271267      e59010, e62020
     40     38.29               244140      e59010, e63020
     48     37.46               203450      e59010, e65020
     64     36.93               152587      e59010, e69020
    128     35.67               76293       e59010, 63769010
    256     27.21               38146       63724010, 636e3010
    384     26.26               25431       63704010, 636a3010
    512     26.19               19073       636e4010, 63663010
    768     26.20               12715       636a4010, 635e3010
   1024     26.16               9536        63664010, 63563010
   1025     18.29               9527        e59010, f59420
   2048     18.23               4768        63564010, 63363010
   4096     18.27               2384        63364010, 62f63010
   8192     18.29               1192        62f64010, 62763010
  16384     18.31               596         62764010, 61763010
 200000     18.31               48          57414010, 4b0c3010

It appears that both 1M buffers stay in the L3 cache. But try to increase the size of either buffer ever so slightly and the performance drops.

I've been compiling with -O3. It doesn't make much difference (except possibly unrolling the loops over BUFFERS). I tried with -O0 and it's the same except for the L1 speeds. gcc version is 4.9.1.

To summarize, I have a 2-part question:

  • Why does my 8 MB L3 cache not provide any benefit on blocks of memory larger than 1M?
  • Why is the drop in performance so sudden?

EDIT:

    As suggested by Gabriel Southern, I ran my code with perf using BUFFERS=1 with only one buffer size at a time. This was the full command:

    perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -r 100 ./a.out 2> perfout.txt
    

    The -r 100 option means that perf runs a.out 100 times and reports the averaged statistics.

    The output of perf with #define SIZE_KB {1024}:

     Performance counter stats for './a.out' (100 runs):
    
             1,508,798 dTLB-loads                                                    ( +-  0.02% )
                     0 dTLB-load-misses          #    0.00% of all dTLB cache hits 
           625,967,550 dTLB-stores                                                   ( +-  0.00% )
                 1,503 dTLB-store-misses                                             ( +-  0.79% )
    
           0.360471583 seconds time elapsed                                          ( +-  0.79% )
    

    and with #define SIZE_KB {1025} :

     Performance counter stats for './a.out' (100 runs):
    
             1,670,402 dTLB-loads                                                    ( +-  0.09% )
                     0 dTLB-load-misses          #    0.00% of all dTLB cache hits 
           626,099,850 dTLB-stores                                                   ( +-  0.00% )
                 2,115 dTLB-store-misses                                             ( +-  2.19% )
    
           0.503913416 seconds time elapsed                                          ( +-  0.06% )
    

    So there do seem to be more TLB misses with the 1025K buffer. However, with a buffer of this size the program makes about 9500 calls to memset, so that is still less than 1 miss per memset call.


    Short Answer:

    Your version of memset starts using non-temporal stores when initializing a region of memory larger than 1 MB. As a result the CPU does not store these lines in its cache, even though your L3 cache is larger than 1 MB. Consequently the performance is limited by the available memory bandwidth in the system for buffer values larger than 1 MB.

    Details:

    Background:

    I tested the code you provided on several different systems and initially focused on investigating the TLB because I thought that there might be thrashing in the 2nd level TLB. However, none of the data I collected confirmed that hypothesis.

    Some of the systems that I tested used Arch Linux, which has the latest version of glibc, while others used Ubuntu 10.04, which uses an older version of eglibc. I was able to reproduce the behavior described in the question using a statically linked binary on multiple different CPU architectures. The behavior I focused on was the significant difference in runtime between SIZE_KB of 1024 and 1025. The performance difference is explained by a change in the code executed for the slow and fast versions.

    Assembly Code

    I used perf record and perf annotate to collect a trace of the executing assembly code to see what the hot code path was. The code is displayed below using the following format:

    percentage of time executing instruction | address | instruction

    I've copied the hot loop in a shortened form that omits most of the addresses and has a line connecting the loop back edge to the loop header.

    For the version compiled on Arch Linux the hot loop was (for both 1024 and 1025 sizes):

      2.35 │a0:┌─+movdqa %xmm8,(%rcx)
     54.90 │   │  movdqa %xmm8,0x10(%rcx)
     32.85 │   │  movdqa %xmm8,0x20(%rcx)
      1.73 │   │  movdqa %xmm8,0x30(%rcx)
      8.11 │   │  add    $0x40,%rcx      
      0.03 │   │  cmp    %rcx,%rdx       
           │   └──jne    a0
    

    For the Ubuntu 10.04 binary the hot loop when running with a size of 1024 was:

           │a00:┌─+lea    -0x80(%r8),%r8
      0.01 │    │  cmp    $0x80,%r8     
      5.33 │    │  movdqa %xmm0,(%rdi)  
      4.67 │    │  movdqa %xmm0,0x10(%rdi)
      6.69 │    │  movdqa %xmm0,0x20(%rdi)
     31.23 │    │  movdqa %xmm0,0x30(%rdi)
     18.35 │    │  movdqa %xmm0,0x40(%rdi)
      0.27 │    │  movdqa %xmm0,0x50(%rdi)
      3.24 │    │  movdqa %xmm0,0x60(%rdi)
     16.36 │    │  movdqa %xmm0,0x70(%rdi)
     13.76 │    │  lea    0x80(%rdi),%rdi 
           │    └──jge    a00    
    

    For the Ubuntu 10.04 version running with a buffer size of 1025 the hot loop was:

           │a60:┌─+lea    -0x80(%r8),%r8  
      0.15 │    │  cmp    $0x80,%r8       
      1.36 │    │  movntd %xmm0,(%rdi)    
      0.24 │    │  movntd %xmm0,0x10(%rdi)
      1.49 │    │  movntd %xmm0,0x20(%rdi)
     44.89 │    │  movntd %xmm0,0x30(%rdi)
      5.46 │    │  movntd %xmm0,0x40(%rdi)
      0.02 │    │  movntd %xmm0,0x50(%rdi)
      0.74 │    │  movntd %xmm0,0x60(%rdi)
     40.14 │    │  movntd %xmm0,0x70(%rdi)
      5.50 │    │  lea    0x80(%rdi),%rdi 
           │    └──jge    a60
    

    The key difference here is that the slower version was using movntd instructions while the faster versions used movdqa instructions. The Intel Software Developers manual says the following about non-temporal stores:

    For WC memory type in particular, the processor never appears to read the data into the cache hierarchy. Instead, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache.

    So this seems to explain the behavior where regions larger than 1 MB initialized with memset don't end up in the cache. The next question is why there is a difference between the Ubuntu 10.04 system and the Arch Linux system, and why 1 MB is selected as the cutoff point. To investigate that question I looked at the glibc source code:

    Source code for memset

    Looking at the glibc git repo at sysdeps/x86_64/memset.S the first commit I found interesting was b2b671b677d92429a3d41bf451668f476aa267ed

    The commit description is:

    Faster memset on x64

    This implementation speed up memset in several ways. First is avoiding expensive computed jump. Second is using fact that arguments of memset are most of time aligned to 8 bytes.

    Benchmark results on: kam.mff.cuni.cz/~ondra/benchmark_string/memset_profile_result27_04_13.tar.bz2

    And the website referenced has some interesting profiling data.

    The diff of the commit shows that the code for memset is simplified a lot and the non-temporal stores are removed. This matches what the profiled code from Arch Linux shows.

    Looking at the older code, I saw that the choice of whether to use non-temporal stores appeared to be based on a value described as "the largest cache size":

    L(byte32sse2_pre):
    
        mov    __x86_shared_cache_size(%rip),%r9d  # The largest cache size
        cmp    %r9,%r8
        ja     L(sse2_nt_move_pre)
    

    The code for calculating this is in: sysdeps/x86_64/cacheinfo.c

    Although it looks like there is code for calculating the actual shared cache size, the default value is also 1 MB:

    long int __x86_64_shared_cache_size attribute_hidden = 1024 * 1024;
    

    So I suspect that either the default value is being used, or there is some other reason that the code is selecting 1 MB as the cutoff point.

    In either case the overall answer to your question appears to be that the version of memset on your system is using non-temporal stores when setting a region of memory larger than 1 MB.


    Given Gabriel's disassembly of the generated assembly code, I think this is indeed the problem [Edit: his answer was edited, it now appears as the root cause so we're in agreement]:

    Note that movnt is a streaming store, which may have (depending on the exact micro-architectural implementation) several impacts:

  • Has weak ordering semantics (which allows it to be faster).
  • Has improved latency if it overwrites a full line (no need to fetch previous data and merge).
  • Has a non-temporal hint, making it uncacheable.

    #1 and #2 may improve the latency and bandwidth of these operations if they're memory bound, but #3 basically forces them to be memory bound even when they could fit in some cache level. This probably outweighs those benefits, since memory latency and bandwidth are significantly worse to begin with.

    So, your memset library implementation is probably using the wrong threshold for switching to the streaming-store version (I guess it doesn't bother checking your LLC size; assuming anything over 1M is memory resident is quite strange). I suggest trying alternative libraries, or disabling the compiler's ability to generate streaming stores (if that's supported).


    Your benchmark only writes to memory and never reads, using memset, which is probably cleverly designed not to read anything into the cache first. It may very well be that with this code, where you exercise only the write side of the cache, there is just no performance gain over raw memory. The fact that writing to raw memory is quite close to L2 speed may be a hint: if L2 runs at 26 GB/s and main memory at 18 GB/s, what can you really expect from the L3 cache?

    You are measuring throughput, not latency. I'd try a benchmark that actually uses the strength of the L3 cache: supplying data with lower latency than main memory.
