C cache optimization for a direct-mapped cache
I'm having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024-byte direct-mapped cache with a block size of 16 bytes, which makes 64 lines (sets, in this case). Assume the cache starts empty. Consider the following code:
struct pos {
    int x;
    int y;
};

struct pos grid[16][16];
int total_x = 0;
int total_y = 0;

void function1() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[j][i].x;
            total_y += grid[j][i].y;
        }
    }
}

void function2() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[i][j].x;
            total_y += grid[i][j].y;
        }
    }
}
I can tell from some basic rules (i.e., C arrays are row-major) that function2 should be better, but I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit in the cache at once. Also, is this concept easy to extend to k-way set-associative caches?
Thanks.
How data are stored in memory
Every struct pos has a size of 8 bytes, so the total size of grid[16][16] is 2048 bytes. Since C arrays are row-major, the order of the array elements is as follows:

grid[0][0], grid[0][1], grid[0][2], ......, grid[0][15],
grid[1][0], ......, grid[1][15],
......
grid[15][0], ......, grid[15][15]
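If you want to double-check the layout yourself, here is a tiny sketch of my own (it assumes 4-byte int and no struct padding, which is what this exercise assumes):

#include <stdio.h>

struct pos {
    int x;
    int y;
};

struct pos grid[16][16];

int main(void) {
    /* With 4-byte ints and no padding, each struct pos is 8 bytes,
       so the whole array is 16 * 16 * 8 = 2048 bytes. */
    printf("sizeof(struct pos) = %zu\n", sizeof(struct pos)); /* 8    */
    printf("sizeof(grid)       = %zu\n", sizeof(grid));       /* 2048 */

    /* Row-major layout: grid[0][1] is the very next element,
       while grid[1][0] is a whole row (128 bytes) later. */
    printf("grid[0][1] offset = %td bytes\n",
           (char *)&grid[0][1] - (char *)&grid[0][0]);        /* 8   */
    printf("grid[1][0] offset = %td bytes\n",
           (char *)&grid[1][0] - (char *)&grid[0][0]);        /* 128 */
    return 0;
}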
The cache organization compared to the data
For the cache, each block is 16 bytes, which is the same size as two elements of the array. The entire cache is 1024 bytes, which is half the size of the array. Since the cache is direct-mapped, if we label the cache blocks from 0 to 63, the mapping looks like this:
------------ memory ---------------------------- cache
grid[0][0],  grid[0][1]    -----------> block 0
grid[0][2],  grid[0][3]    -----------> block 1
grid[0][4],  grid[0][5]    -----------> block 2
.......
grid[0][14], grid[0][15]   -----------> block 7
grid[1][0],  grid[1][1]    -----------> block 8
grid[1][2],  grid[1][3]    -----------> block 9
.......
grid[7][14], grid[7][15]   -----------> block 63
grid[8][0],  grid[8][1]    -----------> block 0    (wraps around)
.......
grid[15][14], grid[15][15] -----------> block 63
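To see the wrap-around concretely, here is a small sketch of my own (not part of the original answer) that computes the cache block of the first element of each row; it assumes the array starts at a block-aligned address, with offsets taken relative to &grid[0][0]:

#include <stdio.h>

struct pos { int x; int y; };
struct pos grid[16][16];

int main(void) {
    int i;
    for (i = 0; i < 16; i++) {
        /* Block index = (byte offset / 16-byte block) mod 64 sets. */
        unsigned long off = (unsigned long)
            ((char *)&grid[i][0] - (char *)&grid[0][0]);
        printf("grid[%2d][0]: offset %4lu -> cache block %2lu\n",
               i, off, (off / 16) % 64);
    }
    return 0;
}

Rows 8 through 15 land on the same blocks as rows 0 through 7, and that aliasing is exactly what hurts function1 below.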
How function1 manipulates memory

The loop runs column-wise in the inner loop: the first iteration loads grid[0][0] and grid[0][1] into cache block 0, the second loads grid[1][0] and grid[1][1] into cache block 8, and so on. The cache is cold, so during the first column every .x access is a miss while every .y access is a hit. The second column's data should supposedly all be sitting in the cache after the first column's accesses, but this is NOT the case: the access to grid[8][0] has already evicted the block holding grid[0][0] (they both map to block 0!). The same aliasing throws out every block before the next column can reuse it, so every .x access misses and the miss rate is 50%.
How function2 manipulates memory

The second function has a nice stride-1 access pattern. For each group of accesses

grid[0][0].x
grid[0][0].y
grid[0][1].x
grid[0][1].y

only the first one is a miss, due to the cold cache; the other three hit the block that miss just loaded. Every following group behaves the same way, so the miss rate is only 25%.
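If you want to confirm both rates by brute force, here is a minimal direct-mapped cache simulator I sketched for this exercise (the touch/reset helpers are my own names; it assumes grid starts at a block-aligned address and that total_x, total_y, i and j live in registers, so only the grid accesses touch memory):

#include <stdio.h>
#include <string.h>

/* Geometry from the question: 1024-byte direct-mapped cache,
   16-byte blocks, 64 sets. */
#define BLOCK_SIZE 16
#define NUM_SETS   64

struct pos { int x; int y; };
static struct pos grid[16][16];

static int valid[NUM_SETS];
static unsigned long tag[NUM_SETS];
static long hits, misses;

/* Record one load at addr, tracking only hit/miss. */
static void touch(const void *addr) {
    unsigned long off = (unsigned long)
        ((const char *)addr - (const char *)&grid[0][0]);
    unsigned long set = (off / BLOCK_SIZE) % NUM_SETS;
    unsigned long t   = (off / BLOCK_SIZE) / NUM_SETS;
    if (valid[set] && tag[set] == t) {
        hits++;
    } else {
        misses++;
        valid[set] = 1;
        tag[set] = t;
    }
}

static void reset(void) {          /* back to a cold cache */
    memset(valid, 0, sizeof valid);
    hits = misses = 0;
}

int main(void) {
    int i, j;

    reset();
    for (i = 0; i < 16; i++)       /* function1's access order */
        for (j = 0; j < 16; j++) {
            touch(&grid[j][i].x);
            touch(&grid[j][i].y);
        }
    printf("function1: %ld misses / %ld accesses\n", misses, hits + misses);

    reset();
    for (i = 0; i < 16; i++)       /* function2's access order */
        for (j = 0; j < 16; j++) {
            touch(&grid[i][j].x);
            touch(&grid[i][j].y);
        }
    printf("function2: %ld misses / %ld accesses\n", misses, hits + misses);
    return 0;
}

It reports 256 misses out of 512 accesses (50%) for function1's access order and 128 out of 512 (25%) for function2's.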
A k-way set-associative cache follows the same kind of analysis, although it is more tedious (see the lookup sketch below). To get the most out of the cache system, try to set up a friendly access pattern, say stride-1, and use as much of the data as possible each time a block is loaded from memory. Real-world CPU microarchitectures employ other intelligent designs and algorithms to improve efficiency. The best method is always to measure the time on real hardware, isolate the core code, and do a thorough analysis.
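For reference, here is a sketch of how the lookup changes once each set holds k lines; everything here (the 2-way geometry, the line struct, the LRU timestamps) is an illustrative assumption of mine, not something from the question:

#include <stdio.h>

#define WAYS 2                    /* k = 2 as an example           */
#define SETS (64 / WAYS)          /* same 1024 B cache, 16 B blocks */

struct line {
    int valid;
    unsigned long tag;
    unsigned long last_used;      /* logical timestamp for LRU     */
};

static struct line cache[SETS][WAYS];
static unsigned long now;

/* Returns 1 on hit, 0 on miss; block = byte address / 16. */
static int lookup(unsigned long block) {
    unsigned long set = block % SETS;
    unsigned long tag = block / SETS;
    int w, victim = 0;

    now++;
    for (w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].last_used = now;   /* refresh on hit */
            return 1;
        }
    }
    /* Miss: evict the least recently used (or an empty) way. */
    for (w = 1; w < WAYS; w++)
        if (!cache[set][w].valid ||
            cache[set][w].last_used < cache[set][victim].last_used)
            victim = w;
    cache[set][victim].valid = 1;
    cache[set][victim].tag = tag;
    cache[set][victim].last_used = now;
    return 0;
}

int main(void) {
    /* Blocks 0 and 64 conflict in the direct-mapped cache above
       (same block index), but with 2 ways they can coexist: */
    printf("%d", lookup(0));    /* 0: cold miss            */
    printf("%d", lookup(64));   /* 0: cold miss, other way */
    printf("%d", lookup(0));    /* 1: hit, no eviction!    */
    printf("%d\n", lookup(64)); /* 1: hit                  */
    return 0;
}

Note that associativity alone does not rescue function1 here: with 16 rows folding onto 32 sets, four blocks still compete for each conflicting set, which is exactly the "more tedious" part of the analysis.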
OK, my computer science lectures are a bit far off, but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 × 4). Since your cache blocks are 16 bytes, a memory access to grid[i][j] will fetch exactly two struct entries (for even j, grid[i][j] and grid[i][j+1]). Therefore, if you loop over the second index, only every fourth access leads to a memory read. If you loop over the first index, you probably throw away the second entry that was fetched; whether you do depends on the number of fetches in the inner loop versus the overall cache size, though.
Now we have to think about the cache size as well: you say you have 64 directly mapped lines. In function1, an inner loop is 16 fetches, which means that on the 17th fetch you get to grid[j][i+1]. This should actually be a hit, since it should have been kept in the cache since the last walk of the inner loop. Every second inner loop should therefore consist only of hits.
Well, if my reasoning is correct, the answer you were given should be wrong: both functions should perform with 25% misses. Maybe someone will find a better answer, but if you understand my reasoning, I'd ask a TA about it.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimizations enabled should turn this into
struct pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."