Profiling C++ in the presence of aggressive inlining?
I am trying to figure out where my C++ program is spending its time, using gprof. Here's my dilemma: if I compile with the same optimization settings I use for my release build, pretty much everything gets inlined, and gprof tells me, unhelpfully, that 90% of my time is spent in a core routine, where everything was inlined. On the other hand, if I compile with inlining disabled, the program runs an order of magnitude slower.
I want to find out how much time procedures called from my core routine are taking, when my program is compiled with inlining enabled.
I am running 64-bit Ubuntu 9.04 on a quad-core Intel machine. I looked into google-perftools, but that doesn't seem to work well on x86_64. Running on a 32-bit machine is not an option.
Does anyone have suggestions as to how I can more effectively profile my application, when inlining is enabled?
Edit: Here is some clarification of my problem. I apologize if it was not clear initially.
I want to find where the time was being spent in my application. Profiling my optimized build resulted in gprof telling me that ~90% of the time is spent in main, where everything was inlined. I already knew that before profiling!
What I want to find out is how much time the inlined functions are taking, preferably without disabling optimization or inlining in my build options. The application is something like an order of magnitude slower when profiling with inlining disabled. The slowdown is partly a convenience issue, but I am also not confident that the performance profile of the program built with inlining disabled will strongly correspond to the performance profile of the program built with inlining enabled.
In short: is there a way to get useful profiling information on a C++ program without disabling optimization or inlining?
I assume what you want to do is find out which lines of code are costing you enough to be worth optimizing. That is very different from timing functions. You can do better than gprof.
Here's a fairly complete explanation of how to do it.
You can do it by hand, or use one of the profilers that can provide the same information, such as oprofile or RotateRight/Zoom.
BTW, inlining is of significant value only if the routines being inlined are small and don't call functions themselves, and if the call sites are active for a significant fraction of the time.
As for the order-of-magnitude performance ratio between the debug and release builds, it may be due to a number of things, which may or may not include the inlining. You can use the stackshot method mentioned above to find out for certain what's going on in either case. I've found that debug builds can be slow for other reasons, such as recursive data-structure validation.
You can use a more powerful profiler, such as Intel's VTune, which can give you assembly line level of performance detail.
http://software.intel.com/en-us/intel-vtune/
It's for Windows and Linux, but does cost money...
Develop a few macros using your CPU's high-performance timing mechanism (e.g., the x86 timestamp counter) -- routines that don't rely on system calls -- and bind the thread running your core loop to a specific CPU (set its affinity). You would need to implement macros along these lines:
PROF_INIT          // allocate any variables -- probably a const char*
PROF_START("name") // start a timer
PROF_STOP()        // end a timer and calculate the difference --
                   // which you write out using an async fd
I had something like this that I placed in every function I was interested in; I made sure the macros placed the timing calls into the context of the call tree. This is possibly the most accurate way to profile.
Note:
This method is driven by your code -- it does not rely on an external tool to snoop your code in any way. Snooping, sampling, and interrupt-driven profiling are inaccurate when it comes to small sections of code. Besides, you want to control where and when the timing data is collected -- at specific constructs in your code, such as loops, the start of a recursive call chain, or mass memory allocations.
-- edit --
You might be interested in the link from this answer to one of my questions.