Which one is more optimized for accessing arrays?

Solving the following exercise:

Write three different versions of a program to print the elements of ia. One version should use a range for to manage the iteration, the other two should use an ordinary for loop in one case using subscripts and in the other using pointers. In all three programs write all the types directly. That is, do not use a type alias, auto, or decltype to simplify the code.[C++ Primer]

a question came up: which of these methods of accessing an array is the fastest, and why?


My Solutions:

  • Foreach Loop:

    int ia[3][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12}};
    for (int (&i)[4] : ia)      // 1st method: range for
        for (int j : i)
            cout << j << " ";
    
  • Nested for loops:

    for (int i = 0; i < 3; i++)     // 2nd method: ordinary for loop with subscripts
        for (int j = 0; j < 4; j++)
            cout << ia[i][j] << " ";
    
  • Using pointers:

    int (*i)[4] = ia;
    for (int t = 0; t < 3; i++, t++) {   // 3rd method: using pointers
        for (int x = 0; x < 4; x++)
            cout << (*i)[x] << " ";
    }
    
  • Using auto :

    for (auto &i : ia)          // 4th method: using auto; I think it is equivalent to the 1st
        for (auto j : i)
            cout << j << " ";
    

  • Benchmark result using clock()

    1st: 3.6  (6,4,4,3,2,3) 
    2nd: 3.3  (6,3,4,2,3,2)
    3rd: 3.1  (4,2,4,2,3,4)
    4th: 3.6  (4,2,4,5,3,4)
    

    Simulating each method 1000 times:

    1st: 2.29375  2nd: 2.17592  3rd: 2.14383  4th: 2.33333
    Process returned 0 (0x0)   execution time : 13.568 s
    

    Compiler used: MinGW 3.2 with the C++11 flag enabled. IDE: Code::Blocks.


    I have some observations and points to make, and I hope you get your answer from them.

  • The fourth version, as you mention yourself, is basically the same as the first version. auto can be thought of as just a coding shortcut. (This is of course not strictly true, since auto can deduce a different type than you expected and therefore produce different runtime behavior, but most of the time it holds.)

  • Your solution using pointers is probably not what people mean when they say they are using pointers! One solution might be something like this:

    for (int i = 0, *p = &(ia[0][0]); i < 3 * 4; ++i, ++p)
        cout << *p << " ";
    

    or use two nested loops (which is probably pointless here):

    for (int i = 0, *p = &(ia[0][0]); i < 3; ++i)
        for (int j = 0; j < 4; ++j, ++p)
            cout << *p << " ";
    

    From now on, I'm assuming this is the pointer solution you've written.

  • In a case as trivial as this, the part that will absolutely dominate your running time is the cout. The time spent on loop bookkeeping and checks will be completely negligible compared to the I/O. Therefore, it won't matter which loop technique you use.

  • Modern compilers are great at optimizing such ubiquitous tasks and access patterns (iterating over an array). Chances are that all these methods will generate exactly the same code (with the possible exception of the pointer version, which I will talk about later).

  • The performance of most code like this depends more on the memory access pattern than on exactly how the compiler generates the branch instructions (and the rest of the operations). If a required memory block is not in the CPU cache, fetching those bytes from RAM takes roughly several hundred CPU cycles (just a ballpark number). Since all the examples access memory in exactly the same order, their behavior with respect to memory and the cache will be the same, and they will have roughly the same running time.

    As a side note, the way these examples access memory is the best possible pattern: linear, consecutive, and from start to finish. Again, the cout complicates matters; it can be a very involved operation that may even call into the OS on every invocation, which can result in, among other things, an almost complete eviction of everything useful from the CPU cache.

  • On 32-bit systems and programs, an int and a pointer are usually the same size (both 32 bits), so it doesn't matter much whether you pass around index values or pointers into arrays. On 64-bit systems and programs, however, a pointer is 64 bits while an int is usually still 32 bits, which suggests that indexes into arrays are usually a better choice than pointers (or even iterators).

    In this particular example, this is not significant at all though.

  • Your code is very specific and simple, but in the general case it is almost always better to give the compiler as much information about your code as possible. This means using the narrowest, most specific construct available for the job. A generic for loop (i.e. for (int i = 0; i < n; ++i) ) therefore gives the compiler less to work with than a range-based for loop (i.e. for (auto i : v) ): in the latter case the compiler knows you will iterate over the whole range, without going outside it or breaking out of the loop, while in the generic case, especially if the code is more complex, the compiler cannot be sure of this and has to insert extra checks and tests to make sure the code executes as the C++ standard says it should.

  • In many (most?) cases, although you might think performance matters, it does not. Most of the time, rewriting something to gain performance does not gain you much, and the gain is rarely worth the loss in readability and maintainability you sustain. So design your code and data structures right (and keep performance in mind), but avoid this kind of "micro-optimization": it is almost always not worth it and can even harm the quality of the code.

  • Generally, performance in terms of speed is very hard to reason about. Ideally, you have to measure the time with real data, on real hardware, under real working conditions, using sound measurement and statistical methods. Even measuring how long a piece of code takes to run is not at all trivial. Measuring performance is hard, and reasoning about it is harder, but these days it is the only reliable way of finding bottlenecks and optimizing code.

  • I hope I have answered your question.

    EDIT: I wrote a very simple benchmark for what you are trying to do. The code is here. It's written for Windows and should compile with Visual Studio 2012 (because of the range-based for loops). Here are the timing results:

    Simple iteration (nested loops): min:0.002140, avg:0.002160, max:0.002739
        Simple iteration (one loop): min:0.002140, avg:0.002160, max:0.002625
       Pointer iteration (one loop): min:0.002140, avg:0.002160, max:0.003149
     Range-based for (nested loops): min:0.002140, avg:0.002159, max:0.002862
     Range(const ref)(nested loops): min:0.002140, avg:0.002155, max:0.002906
    

    The relevant numbers are the "min" times (over 2000 runs of each test, on 1000x1000 arrays). As you can see, there is absolutely no difference between the tests. Note that you must turn on compiler optimizations, or test 2 will be a disaster and cases 4 and 5 will be a little worse than 1 and 3.

    And here is the code for the tests:

    // 1. Simple iteration (nested loops)
    unsigned sum = 0;
    for (unsigned i = 0; i < gc_Rows; ++i)
        for (unsigned j = 0; j < gc_Cols; ++j)
            sum += g_Data[i][j];
    
    // 2. Simple iteration (one loop)
    unsigned sum = 0;
    for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
        sum += g_Data[i / gc_Cols][i % gc_Cols];
    
    // 3. Pointer iteration (one loop)
    unsigned sum = 0;
    unsigned * p = &(g_Data[0][0]);
    for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
        sum += *p++;
    
    // 4. Range-based for (nested loops)
    unsigned sum = 0;
    for (auto & i : g_Data)
        for (auto j : i)
            sum += j;
    
    // 5. Range(const ref)(nested loops)
    unsigned sum = 0;
    for (auto const & i : g_Data)
        for (auto const & j : i)
            sum += j;
    

    The answer depends on many factors:

  • It depends on the compiler
  • It depends on the compiler flags used
  • It depends on the computer used
  • There is only one way to know the exact answer: measure the time when dealing with huge arrays (perhaps filled by a random number generator). This is the same method you have already used, except that the array size should be at least 1000x1000.
