Why does this recursive C++ function have such bad cache behavior?

Let T be a rooted binary tree in which every internal node has exactly two children. The nodes of the tree are stored in an array, call it TreeArray, following the preorder layout.

So, for example, if this is the tree that we have: [image: the example tree]

Then TreeArray will contain the following node objects:

7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13

A node in this tree is a struct of this kind:

struct tree_node{

    int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

Now suppose that we want a function that returns the sum of all the ids in the tree. This sounds trivial: all you have to do is iterate over TreeArray with a for loop and accumulate the ids (I sketch that baseline loop right after the function below). However, I am interested in understanding the cache behavior of the following implementation:

void testCache1(int cur){

     //find the positions of the left and right children
     int lpos = TreeArray[cur].lpos;
     int rpos = TreeArray[cur].rpos;

     //if there are no children we are at a leaf so update r and return

     if(TreeArray[cur].numChildren == 0){
        r += TreeArray[cur].id;
        return;
     }

     //otherwise we are in an internal node, so update r and recurse
     //first to the left subtree and then to the right subtree

     r += TreeArray[cur].id;

     testCache1(lpos);
     testCache1(rpos);

}
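
For reference, the trivial loop I mean would be something along these lines (just a sketch with a hypothetical helper, not one of the functions I benchmark):

void sumIdsWithLoop(int n){ //n = number of nodes stored in TreeArray

    for(int i = 0; i < n; i++)
        r += TreeArray[i].id;

}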

To test the cache behavior I have the following experiment:

int r = 0; //r is a global variable
int main(int argc, char* argv[]){

    for(int i=0;i<100;i++) {
        r = 0;
        testCache1(0);
    }

    cout<<r<<endl;
    return 0;
}

For a random tree with 5 million leaves, perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt prints the following:

 Performance counter stats for './run_tests 111.txt':

   469,511,047      cache-misses              #   89.379 % of all cache refs    
   525,301,814      cache-references                                            
20,715,360,185      instructions             

  11.214075268 seconds time elapsed

At first I thought it might be because of the way I generate the tree (which I have omitted from my question), but when I ran sudo perf record -e cache-misses ./run_tests 111.txt I got the following output:

[image: perf record report, with most cache-miss samples attributed to testCache1]

As we can see, most of the cache misses come from this function. However, I cannot understand why this is the case. The values of cur will be sequential: I first access position 0 of TreeArray, then position 1, 2, 3, etc.

To add to my confusion, I have the following function that computes the same sum:

void testCache4(int index){

     if(index == TreeArray.size) return;

     r += TreeArray[index].id;

     testCache4(index+1);

}

testCache4 accesses the elements of TreeArray in the same order, but its cache behavior is much better.

Output from perf stat -B -e cache-misses,cache-references,instructions ./run_tests 111.txt:

 Performance counter stats for './run_tests 111.txt':

   396,941,872      cache-misses              #   54.293 % of all cache refs    
   731,109,661      cache-references                                            
11,547,097,924      instructions             

   4.306576556 seconds time elapsed

In the output from sudo perf record -e cache-misses ./run_tests 111.txt the function does not even appear:

[image: perf record report, with testCache4 absent from the top cache-miss sources]

I apologize for the long post but I feel completely lost. Thank you in advance.

EDIT:

Here is the entire test file, together with the parser and everything that is required. It is assumed that the tree is available in a text file that is given as an argument. Compile by typing g++ -O3 -std=c++11 file.cpp and run by typing ./executable tree.txt. The tree I am using can be found here (don't open, click "save as").

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib> //for atoi
#include <ctime>   //for clock_gettime
#define BILLION  1000000000LL

using namespace std;

/*
 *
 * Timing functions
 *
 */

timespec startT, endT;

void startTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &startT);
}

double endTimer(){
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &endT);
    return endT.tv_sec * BILLION + endT.tv_nsec - (startT.tv_sec * BILLION + startT.tv_nsec);
}

/*
 *
 * tree node
 *
 */

//this struct is used for creating the first tree after reading it from the external file; for this we need left and right child pointers

struct tree_node_temp{

    int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node
    tree_node_temp *leftChild;
    tree_node_temp *rightChild;

    tree_node_temp(){
        id = -1;
        size = 1;
        leftChild = nullptr;
        rightChild = nullptr;
        numChildren = 0;
    }

};

struct tree_node{

    int id; //id of the node, randomly generated
    int numChildren; //number of children, it is 2 but for the leafs it's 0
    int size; //size of the subtree rooted at the current node

    int pos; //position in TreeArray where the node is stored
    int lpos; //position of the left child
    int rpos; //position of the right child

    tree_node(){
        id = -1;
        pos = lpos = rpos = -1;
        numChildren = 0;
    }

};

/*
 *
 * Tree parser. The input is a file containing the tree in the newick format.
 *
 */

string treeNewickStr; //string storing the newick format of a tree that we read from a file
int treeCurSTRindex; //index to the current position we are in while reading the newick string
int treeNumLeafs; //number of leafs in current tree
tree_node ** treeArrayReferences; //stack of references to free node objects
tree_node *treeArray; //array of node objects
int treeStackReferencesTop; //the top index to the references stack
int curpos; //used to find pos,lpos and rpos when creating the pre order layout tree


//helper function for readNewick
tree_node_temp* readNewickHelper() {

    int i;
    if(treeCurSTRindex == treeNewickStr.size())
        return nullptr;

    tree_node_temp * leftChild = nullptr;
    tree_node_temp * rightChild = nullptr;

    if(treeNewickStr[treeCurSTRindex] == '('){
        //create a left child
        treeCurSTRindex++;
        leftChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ','){
        //create a right child
        treeCurSTRindex++;
        rightChild = readNewickHelper();
    }

    if(treeNewickStr[treeCurSTRindex] == ')' || treeNewickStr[treeCurSTRindex] == ';'){
        treeCurSTRindex++;
        tree_node_temp * cur = new tree_node_temp();
        cur->numChildren = 2;
        cur->leftChild = leftChild;
        cur->rightChild = rightChild;
        cur->size = 1 + leftChild->size + rightChild->size;
        return cur;
    }

    //we are about to read a label, keep reading until we read a "," ")" or "(" (we assume that the newick string has the right format)
    i = 0;
    char treeLabel[20]; //buffer used for the label
    while(treeNewickStr[treeCurSTRindex]!=',' && treeNewickStr[treeCurSTRindex]!='(' && treeNewickStr[treeCurSTRindex]!=')'){
        treeLabel[i] = treeNewickStr[treeCurSTRindex];
        treeCurSTRindex++;
        i++;
    }

    treeLabel[i] = '\0';
    tree_node_temp * cur = new tree_node_temp();
    cur->numChildren = 0;
    cur->id = atoi(treeLabel)-1;
    treeNumLeafs++;

    return cur;
}

//create the pre order tree, curRoot in the first call points to the root of the first tree that was given to us by the parser
void treeInit(tree_node_temp * curRoot){

    tree_node * curFinalRoot = treeArrayReferences[curpos];

    curFinalRoot->pos = curpos;

    if(curRoot->numChildren == 0) {
        curFinalRoot->id = curRoot->id;
        return;
    }

    //add left child
    tree_node * cnode = treeArrayReferences[treeStackReferencesTop];
    curFinalRoot->lpos = curpos + 1;
    curpos = curpos + 1;
    treeStackReferencesTop++;
    cnode->id = curRoot->leftChild->id;
    treeInit(curRoot->leftChild);

    //add right child
    curFinalRoot->rpos = curpos + 1;
    curpos = curpos + 1;
    cnode = treeArrayReferences[treeStackReferencesTop];
    treeStackReferencesTop++;
    cnode->id = curRoot->rightChild->id;
    treeInit(curRoot->rightChild);

    curFinalRoot->id = curRoot->id;
    curFinalRoot->numChildren = 2;
    curFinalRoot->size = curRoot->size;

}

//the ids of the leaves are determined by the newick file; for the internal nodes we just assign ids incrementally in dfs-traversal order
void updateInternalNodeIDs(int cur){

    tree_node* curNode = treeArrayReferences[cur];

    if(curNode->numChildren == 0){
        return;
    }
    curNode->id = treeNumLeafs++;
    updateInternalNodeIDs(curNode->lpos);
    updateInternalNodeIDs(curNode->rpos);

}

//frees the memory of the first tree generated by the parser
void treeFreeMemory(tree_node_temp* cur){

    if(cur->numChildren == 0){
        delete cur;
        return;
    }
    treeFreeMemory(cur->leftChild);
    treeFreeMemory(cur->rightChild);

    delete cur;

}

//reads the tree stored in "file" in the newick format and creates it in main memory. The function returns a pointer to the root of the tree.
//this tree is scattered around in memory.

tree_node* readNewick(string& file){

    treeCurSTRindex = -1;
    treeNewickStr = "";
    treeNumLeafs = 0;

    ifstream treeFin;

    treeFin.open(file, ios_base::in);
    //read the newick format of the tree and store it in a string
    treeFin>>treeNewickStr;
    //initialize index for reading the string
    treeCurSTRindex = 0;
    //create the tree in main memory
    tree_node_temp* root = readNewickHelper();

    //store the tree in an array following the pre order layout
    treeArray = new tree_node[root->size];
    treeArrayReferences = new tree_node*[root->size];
    int i;
    for(i=0;i<root->size;i++)
        treeArrayReferences[i] = &treeArray[i];
    treeStackReferencesTop = 0;

    tree_node* finalRoot = treeArrayReferences[treeStackReferencesTop];
    curpos = treeStackReferencesTop;
    treeStackReferencesTop++;
    finalRoot->id = root->id;
    treeInit(root);

    //update the internal node ids (the leaf ids are defined by the ids stored in the newick string)
    updateInternalNodeIDs(0);
    //close the file
    treeFin.close();

    //free the memory of initial tree
    treeFreeMemory(root);
    //return the pre order tree
    return finalRoot;

}

/*
 * experiments
 *
 */

int r;
tree_node* T;

void testCache1(int cur){

    int lpos = treeArray[cur].lpos;
    int rpos = treeArray[cur].rpos;

    if(treeArray[cur].numChildren == 0){
        r += treeArray[cur].id;
        return;
    }

    r += treeArray[cur].id;

    testCache1(lpos);
    testCache1(rpos);

}


void testCache4(int index){

    if(index == T->size) return;

    r += treeArray[index].id;

    testCache4(index+1);

}


int main(int argc, char* argv[]){

    string Tnewick = argv[1];
    T = readNewick(Tnewick);
    double tt;

    startTimer();
    for(int i=0;i<100;i++) {
        r = 0;
        testCache4(0);
    }
    tt = endTimer();
    cout<<r<<endl;
    cout<<tt/BILLION<<endl;

    startTimer();
    for(int i=0;i<100;i++) {
        r = 0;
        testCache1(0);
    }
    tt = endTimer();
    cout<<r<<endl;
    cout<<tt/BILLION<<endl;

    delete[] treeArray;
    delete[] treeArrayReferences;

    return 0;
}

EDIT2:

I ran some profiling tests with valgrind. The instruction count might actually be the overhead here, but I don't understand why. For example, even in the experiments above with perf, one version executes around 20 billion instructions and the other 11 billion; that's a difference of 9 billion.

With -O3 enabled I get the following:

[image: profiler output with -O3 enabled]

So the function calls are expensive in testCache1 but cost nothing in testCache4? The number of function calls should be the same in both cases...


I guess the problem is a misunderstanding of what cache-references actually count.

As explained in this answer, cache-references on Intel CPUs actually counts the number of references to the last-level cache, so memory references that are served by the L1 cache are not counted. The Intel 64 and IA-32 Architectures Developer's Manual, however, states that loads from the L1 prefetcher are counted.

If you compare the absolute number of cache-misses, you will see that it is roughly equal for both functions. I used a fully balanced tree for the test and removed pos so that sizeof(tree_node) is 16, which divides a cache line evenly, and got the following numbers:

testCache4 :

843.628.131      L1-dcache-loads                                               (56,83%)
193.006.858      L1-dcache-load-misses     #   22,73% of all L1-dcache hits    (57,31%)
326.698.621      cache-references                                              (57,07%)
188.435.203      cache-misses              #   57,679 % of all cache refs      (56,76%)

testCache1 :

3.519.968.253    L1-dcache-loads                                               (57,17%)
193.664.806      L1-dcache-load-misses     #    5,50% of all L1-dcache hits    (57,24%)
256.638.490      cache-references                                              (57,12%)
188.007.927      cache-misses              #   73,258 % of all cache refs      (57,23%)

And if I manually disable all hardware prefetchers:

testCache4 :

846.124.474      L1-dcache-loads                                               (57,22%)
192.495.450      L1-dcache-load-misses     #   22,75% of all L1-dcache hits    (57,31%)
193.699.811      cache-references                                              (57,03%)
185.445.753      cache-misses              #   95,739 % of all cache refs      (57,17%)

testCache1 :

3.534.308.118    L1-dcache-loads                                               (57,16%)
193.595.962      L1-dcache-load-misses     #    5,48% of all L1-dcache hits    (57,18%)
193.639.498      cache-references                                              (57,12%)
185.120.733      cache-misses              #   95,601 % of all cache refs      (57,15%)

As you can see, the differences are gone now. There were simply additional cache-references events caused by the prefetcher and by the actual reference being counted twice.

Actually, if you count all memory references, testCache1 has a lower total cache miss ratio, because each tree_node is referenced 4 times instead of once, but all data members of a tree_node lie on the same cache line, so only one of those 4 accesses misses.

For testCache4 you can see that the L1d load miss ratio is actually close to 25%, which is what you would expect if sizeof(tree_node) == 16 and cache lines are 64 bytes.
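
A minimal sketch of such a 16-byte node (assuming the remaining members are id, numChildren, lpos and rpos; the exact set is not spelled out above):

//with pos removed the node is 4 ints = 16 bytes (assuming 4-byte int),
//so exactly 4 nodes share one 64-byte cache line
struct tree_node_small{
    int id;
    int numChildren;
    int lpos;
    int rpos;
};
static_assert(sizeof(tree_node_small) == 16, "four nodes per 64-byte cache line");
//a sequential scan therefore misses on roughly 1 out of every 4 node loads, i.e. ~25%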

Also, the compiler (at least gcc with -O2) applies tail-call optimization to both functions, eliminating the recursion of testCache4 entirely while making testCache1 only one-way recursive. Therefore testCache1 generates many additional cache references to the stack frames, which testCache4 does not have.
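
Roughly speaking, after this optimization the two functions behave like the following sketches (an illustration only, not the code gcc actually emits):

//testCache4: the tail call becomes a plain loop, no calls and no stack traffic remain
void testCache4_tco(int index){
    while(index != T->size){
        r += treeArray[index].id;
        index++;
    }
}

//testCache1: only the second (tail) call becomes a loop; the first call still
//recurses and therefore keeps touching stack frames
void testCache1_tco(int cur){
    while(true){
        int lpos = treeArray[cur].lpos;
        int rpos = treeArray[cur].rpos;
        r += treeArray[cur].id;
        if(treeArray[cur].numChildren == 0) return;
        testCache1_tco(lpos); //real call: pushes a stack frame
        cur = rpos;           //tail call replaced by a jump back to the loop
    }
}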

You can also get the no-prefetcher result by using valgrind, which is probably more reliable in its output, although it does not simulate all properties of CPU caches.

Regarding your edits: As I mentioned, gcc applies tail-call optimization, so there are no calls left in testCache4, and of course the recursion and the additional memory loads in testCache1 have significant instruction overhead compared to the simple load/add loop that remains in testCache4.
