Is GNU gprof buggy?

I have a C program that calls a function pi_calcPiItem() 600000000 times through the function pi_calcPiBlock . So to analyze the time spent in the functions I used GNU gprof. The result seems to be erroneous since all calls are attributed to main() instead. Furthermore the call graph does not make any sense:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 61.29      9.28     9.28                             pi_calcPiItem
 15.85     11.68     2.40                             pi_calcPiBlock
 11.96     13.49     1.81                             _mcount_private
  9.45     14.92     1.43                             __fentry__
  1.45     15.14     0.22                             pow
  0.00     15.14     0.00 600000000     0.00     0.00  main

                        Call graph


granularity: each sample hit covers 4 byte(s) for 0.07% of 15.14 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     61.3    9.28    0.00                 pi_calcPiItem [1]
-----------------------------------------------
                                                 <spontaneous>
[2]     15.9    2.40    0.00                 pi_calcPiBlock [2]
                0.00    0.00 600000000/600000000     main [6]
-----------------------------------------------
                                                 <spontaneous>
[3]     12.0    1.81    0.00                 _mcount_private [3]
-----------------------------------------------
                                                 <spontaneous>
[4]      9.4    1.43    0.00                 __fentry__ [4]
-----------------------------------------------
                                                 <spontaneous>
[5]      1.5    0.22    0.00                 pow [5]
-----------------------------------------------
                                   6             main [6]
                0.00    0.00 600000000/600000000     pi_calcPiBlock [2]
[6]      0.0    0.00    0.00 600000000+6       main [6]
                                   6             main [6]
-----------------------------------------------

Is this a bug or do I have to configure the program somehow?

And what does <spontaneous> mean?

EDIT (more insight for you)

The code is all about the calculation of pi:

#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)

int32_t main(int32_t argc, char* argv[]) {
  double result;

  for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
    pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
  }

  printf("pi = %fn",result);
  return 0;
}

static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
  double piItem;

  for ( int32_t i = start; i < end; ++i ) {
    pi_calcPiItem(&piItem, i);
    *result += piItem;
  }  
}    

static void pi_calcPiItem(double* piItem, int32_t index) {
  *piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}

And this is how I got the results (executed on Windows with the help of Cygwin):

> gcc -std=c99 -o pi *.c -pg -fno-inline-small-functions
> ./pi.exe
> gprof.exe pi.exe

Try:

  • Using the noinline , noclone function attributes instead of -fno-inline-small-functions
  • By disassembling main I could see that -fno-inline-small-functions doesn't stop inlining
  • Linking your program statically ( -static )
  • You should also initialize result to 0.0 in main
  • This worked for me on Linux, x86-64:

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>
    
    #define PI_BLOCKSIZE (100000000)
    #define PI_BLOCKCOUNT (6)
    #define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)
    
    static void pi_calcPiItem(double* piItem, int32_t index);
    static void pi_calcPiBlock(double* result, int32_t start, int32_t end);
    
    int32_t main(int32_t argc, char* argv[]) {
      double result;
    
      result = 0.0;
      for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
        pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
      }
    
      printf("pi = %fn",result);
      return 0;
    }
    
    __attribute__((noinline, noclone))
    static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
      double piItem;
    
      for ( int32_t i = start; i < end; ++i ) {
        pi_calcPiItem(&piItem, i);
        *result += piItem;
      }  
    }    
    
    __attribute__((noinline, noclone))
    static void pi_calcPiItem(double* piItem, int32_t index) {
      *piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
    }
    

    Building the Code

    $ cc pi.c -o pi -Os -Wall -g3 -I. -std=c99 -pg -static -lm
    

    Output

    $ ./pi && gprof ./pi
    pi = 3.141593
    Flat profile:
    
    Each sample counts as 0.01 seconds.
      %   cumulative   self              self     total           
     time   seconds   seconds    calls  ns/call  ns/call  name    
     85.61     22.55    22.55                             __ieee754_pow_sse2
      4.75     23.80     1.25                             pow
      4.14     24.89     1.09 600000000     1.82     1.82  pi_calcPiItem
      2.54     25.56     0.67                             __exp1
      0.91     25.80     0.24                             pi_calcPiBlock
      0.53     25.94     0.14                             matherr
      0.47     26.07     0.13                             __lseek_nocancel
      0.38     26.17     0.10                             frame_dummy
      0.34     26.26     0.09                             __ieee754_exp_sse2
      0.32     26.34     0.09                             __profile_frequency
      0.00     26.34     0.00        1     0.00     0.00  main
    
    
                 Call graph (explanation follows)
    
    
    granularity: each sample hit covers 2 byte(s) for 0.04% of 26.34 seconds
    
    index % time    self  children    called     name
                                                     <spontaneous>
    [1]     85.6   22.55    0.00                 __ieee754_pow_sse2 [1]
    -----------------------------------------------
                                                     <spontaneous>
    [2]      5.0    0.24    1.09                 pi_calcPiBlock [2]
                    1.09    0.00 600000000/600000000     pi_calcPiItem [4]
    -----------------------------------------------
                                                     <spontaneous>
    [3]      4.7    1.25    0.00                 pow [3]
    -----------------------------------------------
                    1.09    0.00 600000000/600000000     pi_calcPiBlock [2]
    [4]      4.1    1.09    0.00 600000000         pi_calcPiItem [4]
    -----------------------------------------------
                                                     <spontaneous>
    [5]      2.5    0.67    0.00                 __exp1 [5]
    -----------------------------------------------
                                                     <spontaneous>
    [6]      0.5    0.14    0.00                 matherr [6]
    -----------------------------------------------
                                                     <spontaneous>
    [7]      0.5    0.13    0.00                 __lseek_nocancel [7]
    -----------------------------------------------
                                                     <spontaneous>
    [8]      0.4    0.10    0.00                 frame_dummy [8]
    -----------------------------------------------
                                                     <spontaneous>
    [9]      0.3    0.09    0.00                 __ieee754_exp_sse2 [9]
    -----------------------------------------------
                                                     <spontaneous>
    [10]     0.3    0.09    0.00                 __profile_frequency [10]
    -----------------------------------------------
                    0.00    0.00       1/1           __libc_start_main [827]
    [11]     0.0    0.00    0.00       1         main [11]
    -----------------------------------------------
    

    Comments

    As expected, pow() is the bottleneck. While pi is running, perf top (sampling based system profiler) also shows __ieee754_pow_sse2 taking 60%+ of CPU. Changing pow(-1.0,index) to ((i & 1) ? -1.0 : 1.0) as @Mike Dunlavey suggested makes the code roughly 4 times faster.


    In 'man gprof' page, here is explanation for "spontaneous":

    Parents that are not themselves profiled will have the time of their profiled children propagated to them, but they will appear to be spontaneously invoked in the call graph listing, and will not have their time propagated further. Similarly, signal catchers, even though profiled, will appear to be spontaneous (although for more obscure reasons). Any profiled children of signal catchers should have their times propagated properly, unless the signal catcher was invoked during the execution of the profiling routine, in which case all is lost.

    链接地址: http://www.djcxy.com/p/40352.html

    上一篇: gprof和(unix)时间之间的差异; gprof报告运行时间较短

    下一篇: GNU gprof越野车是?