Linaro compilation speed
I am working on software running on an embedded ARM platform. In the course of updating our platform, we are switching from an OpenEmbedded based system to Linaro.
On my machine, it currently takes about 9 minutes to cross-compile our software for ARM, using the 32 bit gcc 4.6.4 that OpenEmbedded built for us. For the new system we are now of course trying Linaro's gcc 4.7 binary - with the surprising result that compilation suddenly takes about twice as long (18 minutes). The Linaro gcc 4.6 binary has the same issue, so it is not gcc version specific.
Using Linaro's crosstool-ng to create an adjusted version of their compiler (i.e. trying to get the configure options as close as possible to those of our old compiler) did not speed it up.
The main difference between our old gcc compiler and the Linaro one is that the Linaro build has --with-arch=armv7-a and --with-tune=cortex-a9 explicitly set.
Changing configure options in gcc, such as enabling ssp, thumb/arm mode, multilib, or the target CPU (cortex-a8 vs. a9), does not yield an improvement.
Compilation speed already differs for a simple test.cpp that has just a main function with a vector<int>, so it's not related to the linking, and I doubt that the STL header files are causing that much difference.
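A comparison like this can be made repeatable with a small timing harness. The following is a minimal sketch in Python, not the measurement actually used in this question. The function names median_wall_time and slowdown are hypothetical, and the commented example commands assume that both cross compilers are on PATH and that test.cpp is in the current directory.

```python
import statistics
import subprocess
import time


def median_wall_time(cmd, repeats=5):
    """Run cmd `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a broken compile is never counted as a timing sample.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(cmd_slow, cmd_fast, repeats=5):
    """Return how many times longer cmd_slow takes than cmd_fast."""
    return median_wall_time(cmd_slow, repeats) / median_wall_time(cmd_fast, repeats)


# Example (assumed setup, adjust compiler names and paths to your tool chains):
# old = ["arm-none-linux-gnueabi-g++", "-c", "test.cpp", "-o", "/dev/null"]
# new = ["arm-linux-gnueabihf-g++", "-c", "test.cpp", "-o", "/dev/null"]
# print(slowdown(new, old))
```

Using the median of several runs keeps a single cold-cache run from dominating the result, which matters when one compile of test.cpp only takes a fraction of a second.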
I am running out of ideas what else to tweak. Does anybody have an idea?
EDIT4: I also tried the ARM cross compiler from Ubuntu 12.04 (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)) and its compilation times are comparable to those of my 4.6.4 version. So there seems to be something specific to Linaro's version that I either can't manage to turn off, or that comes from some special patch they applied.
EDIT3: -ftime-report output from Linaro gcc 4.7 for an actual source file from the project:
Execution times (seconds)
phase setup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 2076 kB ( 2%) ggc
phase parsing : 1.72 (74%) usr 0.34 (85%) sys 2.04 (75%) wall 66732 kB (79%) ggc
phase lang. deferred : 0.28 (12%) usr 0.04 (10%) sys 0.33 (12%) wall 10215 kB (12%) ggc
phase cgraph : 0.32 (14%) usr 0.02 ( 5%) sys 0.33 (12%) wall 5481 kB ( 6%) ggc
phase generate : 0.60 (26%) usr 0.06 (15%) sys 0.66 (24%) wall 15700 kB (19%) ggc
|name lookup : 0.28 (12%) usr 0.02 ( 5%) sys 0.24 ( 9%) wall 8058 kB (10%) ggc
|overload resolution : 0.32 (14%) usr 0.06 (15%) sys 0.36 (13%) wall 10042 kB (12%) ggc
callgraph construction : 0.06 ( 3%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 551 kB ( 1%) ggc
callgraph optimization : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 224 kB ( 0%) ggc
varpool construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 94 kB ( 0%) ggc
df scan insns : 0.06 ( 3%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 5 kB ( 0%) ggc
df reg dead/unused notes: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 22 kB ( 0%) ggc
alias analysis : 0.00 ( 0%) usr 0.02 ( 5%) sys 0.00 ( 0%) wall 11 kB ( 0%) ggc
preprocessing : 0.08 ( 3%) usr 0.10 (25%) sys 0.29 (11%) wall 1069 kB ( 1%) ggc
parser (global) : 0.58 (25%) usr 0.08 (20%) sys 0.43 (16%) wall 25145 kB (30%) ggc
parser struct body : 0.28 (12%) usr 0.02 ( 5%) sys 0.34 (12%) wall 12400 kB (15%) ggc
parser enumerator list : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 121 kB ( 0%) ggc
parser function body : 0.14 ( 6%) usr 0.00 ( 0%) sys 0.15 ( 6%) wall 2435 kB ( 3%) ggc
parser inl. func. body : 0.10 ( 4%) usr 0.02 ( 5%) sys 0.18 ( 7%) wall 3682 kB ( 4%) ggc
parser inl. meth. body : 0.24 (10%) usr 0.02 ( 5%) sys 0.20 ( 7%) wall 5298 kB ( 6%) ggc
template instantiation : 0.58 (25%) usr 0.14 (35%) sys 0.75 (28%) wall 26588 kB (31%) ggc
tree gimplify : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.03 ( 1%) wall 785 kB ( 1%) ggc
tree CFG construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 543 kB ( 1%) ggc
tree SSA other : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 32 kB ( 0%) ggc
out of ssa : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
expand : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 438 kB ( 1%) ggc
varconst : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 6 kB ( 0%) ggc
integrated RA : 0.04 ( 2%) usr 0.00 ( 0%) sys 0.09 ( 3%) wall 1313 kB ( 2%) ggc
reload : 0.06 ( 3%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 60 kB ( 0%) ggc
thread pro- & epilogue : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 92 kB ( 0%) ggc
final : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 4 kB ( 0%) ggc
rest of compilation : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 133 kB ( 0%) ggc
unaccounted todo : 0.02 ( 1%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
TOTAL : 2.32 0.40 2.72 84519 kB
and the same for my gcc-4.6:
Execution times (seconds)
callgraph construction: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 527 kB ( 1%) ggc
trivially dead code : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 0 kB ( 0%) ggc
df scan insns : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 5 kB ( 0%) ggc
df reg dead/unused notes: 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 22 kB ( 0%) ggc
preprocessing : 0.08 ( 7%) usr 0.10 (26%) sys 0.14 ( 9%) wall 1016 kB ( 1%) ggc
parser : 0.68 (58%) usr 0.24 (63%) sys 0.83 (52%) wall 52215 kB (76%) ggc
name lookup : 0.28 (24%) usr 0.02 ( 5%) sys 0.41 (26%) wall 10211 kB (15%) ggc
inline heuristics : 0.00 ( 0%) usr 0.02 ( 5%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
tree gimplify : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 637 kB ( 1%) ggc
tree CFG construction : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 463 kB ( 1%) ggc
expand : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 426 kB ( 1%) ggc
varconst : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 132 kB ( 0%) ggc
integrated RA : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.03 ( 2%) wall 304 kB ( 0%) ggc
reload : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 58 kB ( 0%) ggc
machine dep reorg : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 3 kB ( 0%) ggc
final : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 1%) wall 4 kB ( 0%) ggc
rest of compilation : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.01 ( 1%) wall 133 kB ( 0%) ggc
unaccounted todo : 0.02 ( 2%) usr 0.00 ( 0%) sys 0.03 ( 2%) wall 0 kB ( 0%) ggc
TOTAL : 1.18 0.38 1.59 68315 kB
EDIT2: Linaro gcc 4.6.3's -ftime-report output for a VERY simple test.cpp (including options -fno-graphite-identity -fno-graphite):
Execution times (seconds)
preprocessing : 0.00 ( 0%) usr 0.02 (50%) sys 0.02 (10%) wall 121 kB ( 2%) ggc
parser : 0.10 (62%) usr 0.02 (50%) sys 0.11 (55%) wall 4022 kB (65%) ggc
name lookup : 0.02 (12%) usr 0.00 ( 0%) sys 0.04 (20%) wall 879 kB (14%) ggc
tree gimplify : 0.02 (13%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 20 kB ( 0%) ggc
expand : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 5%) wall 34 kB ( 1%) ggc
integrated RA : 0.02 (12%) usr 0.00 ( 0%) sys 0.01 ( 5%) wall 59 kB ( 1%) ggc
TOTAL : 0.16 0.04 0.20 6207 kB
and for the same file with my old gcc 4.6.4:
Execution times (seconds)
preprocessing : 0.02 (25%) usr 0.00 ( 0%) sys 0.02 (14%) wall 119 kB ( 2%) ggc
parser : 0.00 ( 0%) usr 0.04 (100%) sys 0.06 (43%) wall 4021 kB (65%) ggc
name lookup : 0.04 (50%) usr 0.00 ( 0%) sys 0.03 (21%) wall 879 kB (14%) ggc
expand : 0.02 (25%) usr 0.00 ( 0%) sys 0.01 ( 7%) wall 34 kB ( 1%) ggc
unaccounted todo : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 7%) wall 0 kB ( 0%) ggc
TOTAL : 0.08 0.04 0.14 6204 kB
Generating the preprocessed file with both compilers yielded no significant difference (the output of Linaro's gcc was only 3 lines longer).
EDIT1: gcc -v for the old one (paths shortened or removed, e.g. --sbindir):
# arm-none-linux-gnueabi-g++ -v
Using built-in specs.
COLLECT_GCC=..sysroots/i686-linux/usr/bin/arm-none-linux-gnueabi-g++
COLLECT_LTO_WRAPPER=../libexec/gcc/arm-none-linux-gnueabi/4.6.4/lto-wrapper
Target: arm-none-linux-gnueabi
Configured with: ..tmp/work/armv7a-none-linux-gnueabi/gcc-cross-4.6.3+svnr184847-r27/gcc-4_6-branch/configure --build=i686-linux --host=i686-linux --target=arm-none-linux-gnueabi --with-gnu-ld --enable-shared --enable-languages=c,c++ --enable-threads=posix --disable-multilib --enable-c99 --enable-long-long --enable-symvers=gnu --enable-libstdcxx-pch --program-prefix=arm-none-linux-gnueabi- --without-local-prefix --enable-lto --enable-libssp --disable-bootstrap --disable-libgomp --disable-libmudflap --with-system-zlib --with-linker-hash-style=gnu --with-ppl=no --with-cloog=no --enable-cheaders=c_global --enable-languages=c,c++,fortran --disable-libunwind-exceptions --with-mpfr=..sysroots/i686-linux/usr --with-system-zlib --enable-__cxa_atexit
Thread model: posix
gcc version 4.6.4 20120303 (prerelease) (GCC)
and Linaro's gcc -v:
# arm-linux-gnueabihf-g++ -v
Using built-in specs.
COLLECT_GCC=..compiler/bin/arm-linux-gnueabihf-g++
COLLECT_LTO_WRAPPER=../libexec/gcc/arm-linux-gnueabihf/4.7.2/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: .build/src/gcc-linaro-4.7-2012.08/configure --build=i686-build_pc-linux-gnu --host=i686-build_pc-linux-gnu --target=arm-linux-gnueabihf --enable-languages=c,c++,fortran --enable-multilib --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard --with-pkgversion='crosstool-NG linaro-1.13.1-2012.08-20120827 - Linaro GCC 2012.08' --with-bugurl=https://bugs.launchpad.net/gcc-linaro --enable-__cxa_atexit --enable-libmudflap --enable-libgomp --enable-libssp --with-gmp=.. --with-mpfr=.. --with-mpc=.. --with-ppl=.. --with-cloog=.. --with-libelf=.. --with-host-libstdcxx='-L.. -lpwl' --enable-threads=posix --disable-libstdcxx-pch --enable-linker-build-id --enable-gold --with-local-prefix=.. --enable-c99 --enable-long-long --with-mode=thumb
Thread model: posix
gcc version 4.7.2 20120731 (prerelease) (crosstool-NG linaro-1.13.1-2012.08-20120827 - Linaro GCC 2012.08)
For the latter I also made adjustments to have --disable-multilib --disable-libmudflap --disable-libgomp.
And here's Ubuntu 12.04's arm compiler:
> arm-linux-gnueabihf-g++-4.6 -v
Using built-in specs.
COLLECT_GCC=arm-linux-gnueabihf-g++-4.6
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/4.6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.6.3-1ubuntu5' --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.6 --enable-shared --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/arm-linux-gnueabihf/include/c++/4.6.3 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --enable-objc-gc --enable-multilib --disable-sjlj-exceptions --with-arch=armv7-a --with-float=hard --with-fpu=vfpv3-d16 --with-mode=thumb --disable-werror --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabihf
You could pass -time or -ftime-report to the gcc compiler to find out why and where gcc is spending its compilation time.
But why does the compilation time matter so much to you?
You should care about the execution time of the produced executable binary.
Also, show us the output of the -v option passed to your gcc.
And you might pass the -j option to your make command to have it work in parallel (e.g. running several gcc processes in parallel). You could also lower the optimization level, e.g. from -O3 to -O2 or -O1.
OK - your -ftime-report tests clearly show "parser" is the culprit; I'm guessing templates (in general) and STL (in particular) are the root cause.
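To see which entries account for the difference, the two -ftime-report outputs can be compared line by line. Here is a rough sketch in Python, assuming the report layout shown in the question. parse_time_report and wall_deltas are hypothetical helper names, and the sample data is the pair of test.cpp reports from EDIT2.

```python
import re

# Matches report lines such as
#   "parser : 0.10 (62%) usr 0.02 (50%) sys 0.11 (55%) wall 4022 kB (65%) ggc"
# and the "TOTAL : 0.16 0.04 0.20 6207 kB" line, which carries no percentages.
LINE_RE = re.compile(
    r"^\s*\|?\s*(?P<name>.+?)\s*:\s*"
    r"(?P<usr>\d+\.\d+)\s*(?:\(\s*\d+%\)\s*usr)?\s*"
    r"(?P<sys>\d+\.\d+)\s*(?:\(\s*\d+%\)\s*sys)?\s*"
    r"(?P<wall>\d+\.\d+)"
)

LINARO_463 = """\
preprocessing : 0.00 ( 0%) usr 0.02 (50%) sys 0.02 (10%) wall 121 kB ( 2%) ggc
parser : 0.10 (62%) usr 0.02 (50%) sys 0.11 (55%) wall 4022 kB (65%) ggc
name lookup : 0.02 (12%) usr 0.00 ( 0%) sys 0.04 (20%) wall 879 kB (14%) ggc
tree gimplify : 0.02 (13%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 20 kB ( 0%) ggc
expand : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 5%) wall 34 kB ( 1%) ggc
integrated RA : 0.02 (12%) usr 0.00 ( 0%) sys 0.01 ( 5%) wall 59 kB ( 1%) ggc
TOTAL : 0.16 0.04 0.20 6207 kB
"""

OLD_464 = """\
preprocessing : 0.02 (25%) usr 0.00 ( 0%) sys 0.02 (14%) wall 119 kB ( 2%) ggc
parser : 0.00 ( 0%) usr 0.04 (100%) sys 0.06 (43%) wall 4021 kB (65%) ggc
name lookup : 0.04 (50%) usr 0.00 ( 0%) sys 0.03 (21%) wall 879 kB (14%) ggc
expand : 0.02 (25%) usr 0.00 ( 0%) sys 0.01 ( 7%) wall 34 kB ( 1%) ggc
unaccounted todo : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 7%) wall 0 kB ( 0%) ggc
TOTAL : 0.08 0.04 0.14 6204 kB
"""


def parse_time_report(text):
    """Map each report entry name to a (usr, sys, wall) tuple of seconds."""
    report = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            report[m.group("name")] = tuple(
                float(m.group(key)) for key in ("usr", "sys", "wall"))
    return report


def wall_deltas(slow, fast):
    """Return (name, wall-time difference) pairs, largest slowdown first.

    Entries missing from one report count as 0 s; TOTAL is left out.
    """
    zero = (0.0, 0.0, 0.0)
    names = (set(slow) | set(fast)) - {"TOTAL"}
    deltas = {n: slow.get(n, zero)[2] - fast.get(n, zero)[2] for n in names}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

Calling wall_deltas(parse_time_report(LINARO_463), parse_time_report(OLD_464)) ranks "parser" first, with a wall-time difference of 0.05 s. For test.cpp the individual times are close to the 0.01 s resolution of the report, so the ranking is only a hint. The gcc 4.7 report in EDIT3 also splits the parser into several entries, so entry names do not line up one to one across gcc versions.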
SUGGESTION:
See if there's any way you can use "precompiled headers" in your tool chain. If you can, that might eliminate the entire problem.
LINKS (unfortunately, I'm not sure which may or may not be applicable to you):
Why does C++ compilation take so long?
http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Precompiled-Headers.html
http://clang.llvm.org/docs/UsersManual.html#precompiledheaders
In GCC, can precompiled headers be included from other headers?
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0472c/CIHJFBHC.html