Coding Practices which enable the compiler/optimizer to make a faster program

Many years ago, C compilers were not particularly smart. As a workaround, K&R invented the register keyword to hint to the compiler that it might be a good idea to keep a given variable in an internal register. They also created the ternary operator (?:) to help generate better code.

As time passed, the compilers matured. Their flow analysis became good enough that they could make better decisions about which values to hold in registers than you could possibly make yourself. The register keyword became unimportant.

FORTRAN can be faster than C for some kinds of operations because of aliasing rules: a FORTRAN compiler may assume that its arguments do not overlap in memory, while a C compiler generally cannot. In theory, careful coding can get around this restriction and let the optimizer generate faster code.
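For example, C99's restrict qualifier lets you make the FORTRAN-style no-overlap promise explicitly. A minimal sketch (the function and its names are made up for illustration):

    #include <stddef.h>
    
    /* restrict promises the compiler that, within this function, dst and src
       never point to overlapping memory, so it is free to vectorize the loop
       and keep loaded values in registers across the stores. */
    void scale(float *restrict dst, const float *restrict src, size_t n, float k)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = k * src[i];
    }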

What coding practices are available that may enable the compiler/optimizer to generate faster code?

  • Identifying the platform and compiler you use would be appreciated.
  • Why does the technique seem to work?
  • Sample code is encouraged.
  • Here is a related question

    [Edit] This question is not about the overall process of profiling and optimizing. Assume that the program has been written correctly, compiled with full optimization, tested, and put into production. There may still be constructs in your code that prevent the optimizer from doing the best job it can. What refactoring can you do to remove those prohibitions and allow the optimizer to generate even faster code?

    [Edit] Offset related link


    Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like

    void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
    {
        for (int i = 0; i < numFoo; i++)
        {
             barOut.munge(foo1, foo2[i]);
        }
    }
    

    the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut has finished. You could start messing around with restrict-qualified pointers (see the sketch after this answer), but it's just as effective, and much clearer, to do this:

    void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
    {
        Foo barTemp = barOut;
        for (int i = 0; i < numFoo; i++)
        {
             barTemp.munge(foo1, foo2[i]);
        }
        barOut = barTemp;
    }
    

    It sounds silly, but the compiler can be much smarter when dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).
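    If you do want to try the restrict-qualified alternative mentioned above, a minimal sketch might look like the following. Foo here is a stand-in type, since the original snippet leaves it undefined, and __restrict is a non-standard extension supported by GCC, Clang, and MSVC:

    struct Foo {
        int value;
        void munge(const Foo& a, const Foo& b) { value += a.value * b.value; }
    };
    
    // Declaring barOut __restrict promises the compiler that the object it
    // refers to is accessed only through barOut inside this function, so
    // neither foo1 nor foo2[i] can alias it and foo1 need not be reloaded.
    void DoSomethingRestrict(const Foo& foo1, const Foo* foo2, int numFoo,
                             Foo& __restrict barOut)
    {
        for (int i = 0; i < numFoo; i++)
        {
            barOut.munge(foo1, foo2[i]);
        }
    }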


    Here's a coding practice to help the compiler create fast code, in any language, on any platform, with any compiler, for any problem:

    Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.

    Next, profile your code.

    Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make one change at a time and measure its impact; even a simple wall-clock timer (sketched below) is enough for that.

    Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.
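    For the "measure its impact" step, a crude timing harness is enough to compare one change at a time. A minimal sketch, assuming a POSIX system for clock_gettime; work() is a made-up stand-in for the code being measured:

    #include <stdio.h>
    #include <time.h>
    
    /* Stand-in for the code whose performance you are measuring. */
    static long work(long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++)
            sum += i;
        return sum;
    }
    
    int main(void)
    {
        volatile long n = 100000000L;   /* volatile defeats constant folding */
        struct timespec start, end;
    
        clock_gettime(CLOCK_MONOTONIC, &start);
        long result = work(n);
        clock_gettime(CLOCK_MONOTONIC, &end);
    
        double seconds = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("result=%ld  elapsed=%.6f s\n", result, seconds);
        return 0;
    }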


    The order in which you traverse memory can have profound impacts on performance, and compilers aren't really good at figuring that out and fixing it. You have to be conscious of cache-locality concerns when you write code if you care about performance. For example, two-dimensional arrays in C are laid out in row-major order. Traversing an array in column-major order tends to cause more cache misses, making your program memory-bound rather than processor-bound:

    #define N 1000   /* no trailing semicolon; kept small enough for a 2-D int array */
    int matrix[N][N] = { ... };
    
    //awesomely fast: row-major traversal touches memory sequentially
    long sum = 0;
    for(int i = 0; i < N; i++){
      for(int j = 0; j < N; j++){
        sum += matrix[i][j];
      }
    }
    
    //painfully slow: column-major traversal jumps a whole row at every step
    long sum = 0;
    for(int i = 0; i < N; i++){
      for(int j = 0; j < N; j++){
        sum += matrix[j][i];
      }
    }
    