Is GNU gprof buggy?
I have a C program that calls a function pi_calcPiItem()
600000000 times through the function pi_calcPiBlock
. So to analyze the time spent in the functions I used GNU gprof. The result seems to be erroneous since all calls are attributed to main()
instead. Furthermore the call graph does not make any sense:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
61.29 9.28 9.28 pi_calcPiItem
15.85 11.68 2.40 pi_calcPiBlock
11.96 13.49 1.81 _mcount_private
9.45 14.92 1.43 __fentry__
1.45 15.14 0.22 pow
0.00 15.14 0.00 600000000 0.00 0.00 main
Call graph
granularity: each sample hit covers 4 byte(s) for 0.07% of 15.14 seconds
index % time self children called name
<spontaneous>
[1] 61.3 9.28 0.00 pi_calcPiItem [1]
-----------------------------------------------
<spontaneous>
[2] 15.9 2.40 0.00 pi_calcPiBlock [2]
0.00 0.00 600000000/600000000 main [6]
-----------------------------------------------
<spontaneous>
[3] 12.0 1.81 0.00 _mcount_private [3]
-----------------------------------------------
<spontaneous>
[4] 9.4 1.43 0.00 __fentry__ [4]
-----------------------------------------------
<spontaneous>
[5] 1.5 0.22 0.00 pow [5]
-----------------------------------------------
6 main [6]
0.00 0.00 600000000/600000000 pi_calcPiBlock [2]
[6] 0.0 0.00 0.00 600000000+6 main [6]
6 main [6]
-----------------------------------------------
Is this a bug or do I have to configure the program somehow?
And what does <spontaneous>
mean?
EDIT (more insight for you)
The code is all about the calculation of pi:
#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)
int32_t main(int32_t argc, char* argv[]) {
double result;
for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
}
printf("pi = %fn",result);
return 0;
}
static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
double piItem;
for ( int32_t i = start; i < end; ++i ) {
pi_calcPiItem(&piItem, i);
*result += piItem;
}
}
static void pi_calcPiItem(double* piItem, int32_t index) {
*piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}
And this is how I got the results (executed on Windows with the help of Cygwin):
> gcc -std=c99 -o pi *.c -pg -fno-inline-small-functions
> ./pi.exe
> gprof.exe pi.exe
Try:
noinline
, noclone
function attributes instead of -fno-inline-small-functions
main
I could see that -fno-inline-small-functions
doesn't stop inlining -static
) result
to 0.0
in main
This worked for me on Linux, x86-64:
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)
static void pi_calcPiItem(double* piItem, int32_t index);
static void pi_calcPiBlock(double* result, int32_t start, int32_t end);
int32_t main(int32_t argc, char* argv[]) {
double result;
result = 0.0;
for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
}
printf("pi = %fn",result);
return 0;
}
__attribute__((noinline, noclone))
static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
double piItem;
for ( int32_t i = start; i < end; ++i ) {
pi_calcPiItem(&piItem, i);
*result += piItem;
}
}
__attribute__((noinline, noclone))
static void pi_calcPiItem(double* piItem, int32_t index) {
*piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}
Building the Code
$ cc pi.c -o pi -Os -Wall -g3 -I. -std=c99 -pg -static -lm
Output
$ ./pi && gprof ./pi
pi = 3.141593
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ns/call ns/call name
85.61 22.55 22.55 __ieee754_pow_sse2
4.75 23.80 1.25 pow
4.14 24.89 1.09 600000000 1.82 1.82 pi_calcPiItem
2.54 25.56 0.67 __exp1
0.91 25.80 0.24 pi_calcPiBlock
0.53 25.94 0.14 matherr
0.47 26.07 0.13 __lseek_nocancel
0.38 26.17 0.10 frame_dummy
0.34 26.26 0.09 __ieee754_exp_sse2
0.32 26.34 0.09 __profile_frequency
0.00 26.34 0.00 1 0.00 0.00 main
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.04% of 26.34 seconds
index % time self children called name
<spontaneous>
[1] 85.6 22.55 0.00 __ieee754_pow_sse2 [1]
-----------------------------------------------
<spontaneous>
[2] 5.0 0.24 1.09 pi_calcPiBlock [2]
1.09 0.00 600000000/600000000 pi_calcPiItem [4]
-----------------------------------------------
<spontaneous>
[3] 4.7 1.25 0.00 pow [3]
-----------------------------------------------
1.09 0.00 600000000/600000000 pi_calcPiBlock [2]
[4] 4.1 1.09 0.00 600000000 pi_calcPiItem [4]
-----------------------------------------------
<spontaneous>
[5] 2.5 0.67 0.00 __exp1 [5]
-----------------------------------------------
<spontaneous>
[6] 0.5 0.14 0.00 matherr [6]
-----------------------------------------------
<spontaneous>
[7] 0.5 0.13 0.00 __lseek_nocancel [7]
-----------------------------------------------
<spontaneous>
[8] 0.4 0.10 0.00 frame_dummy [8]
-----------------------------------------------
<spontaneous>
[9] 0.3 0.09 0.00 __ieee754_exp_sse2 [9]
-----------------------------------------------
<spontaneous>
[10] 0.3 0.09 0.00 __profile_frequency [10]
-----------------------------------------------
0.00 0.00 1/1 __libc_start_main [827]
[11] 0.0 0.00 0.00 1 main [11]
-----------------------------------------------
Comments
As expected, pow()
is the bottleneck. While pi
is running, perf top
(sampling based system profiler) also shows __ieee754_pow_sse2
taking 60%+ of CPU. Changing pow(-1.0,index)
to ((i & 1) ? -1.0 : 1.0)
as @Mike Dunlavey suggested makes the code roughly 4 times faster.
In 'man gprof' page, here is explanation for "spontaneous":
Parents that are not themselves profiled will have the time of their profiled children propagated to them, but they will appear to be spontaneously invoked in the call graph listing, and will not have their time propagated further. Similarly, signal catchers, even though profiled, will appear to be spontaneous (although for more obscure reasons). Any profiled children of signal catchers should have their times propagated properly, unless the signal catcher was invoked during the execution of the profiling routine, in which case all is lost.
链接地址: http://www.djcxy.com/p/40352.html上一篇: gprof和(unix)时间之间的差异; gprof报告运行时间较短
下一篇: GNU gprof越野车是?