25.1 Design Overview
Link time optimization is implemented as a GCC front end for a bytecode representation of GIMPLE that is emitted in special sections of .o files. Currently, LTO support is enabled in most ELF-based systems, as well as darwin, cygwin and mingw systems.
By default, object files generated with LTO support contain only GIMPLE bytecode. Such objects are called “slim”, and they require that tools like ar and nm understand symbol tables of LTO sections. For most targets these tools have been extended to use the plugin infrastructure, so GCC can support “slim” objects consisting of the intermediate code alone.
GIMPLE bytecode can also be saved alongside final object code if the -ffat-lto-objects option is passed, or if no plugin support is detected for ar and nm when GCC is configured. This makes the object files generated with LTO support larger than regular object files. This “fat” object format allows shipping one set of fat objects which can be used both for development and the production of optimized builds. A perhaps surprising side effect of this feature is that any mistake in the toolchain leads to the LTO information not being used (e.g. an older libtool calling ld directly). This is both an advantage, as the system is more robust, and a disadvantage, as the user is not informed that the optimization has been disabled.
At the highest level, LTO splits the compiler in two. The first half (the “writer”) produces a streaming representation of all the internal data structures needed to optimize and generate code. This includes declarations, types, the callgraph and the GIMPLE representation of function bodies.
When -flto is given during compilation of a source file, the pass manager executes all the passes in all_lto_gen_passes. Currently, this phase is composed of two IPA passes:
pass_ipa_lto_gimple_out
This pass executes the function lto_output in lto-streamer-out.cc, which traverses the call graph encoding every reachable declaration, type and function. This generates a memory representation of all the file sections described below.

pass_ipa_lto_finish_out
This pass executes the function produce_asm_for_decls in lto-streamer-out.cc, which takes the memory image built in the previous pass and encodes it in the corresponding ELF file sections.
The second half of LTO support is the “reader”. This is implemented as the GCC front end lto1 in lto/lto.cc. When collect2 detects a link set of .o/.a files with LTO information and -flto is enabled, it invokes lto1, which reads the set of files and aggregates them into a single translation unit for optimization. The main entry point for the reader is lto/lto.cc:lto_main.
25.1.1 LTO modes of operation
One of the main goals of the GCC link-time infrastructure was to allow effective compilation of large programs. For this reason GCC implements two link-time compilation modes.
- LTO mode, in which the whole program is read into the compiler at link-time and optimized in a similar way as if it were a single source-level compilation unit.
- WHOPR or partitioned mode, designed to utilize multiple CPUs and/or a distributed compilation environment to quickly link large applications. WHOPR stands for WHOle Program optimizeR (not to be confused with the semantics of -fwhole-program). It partitions the aggregated callgraph from many different .o files and distributes the compilation of the sub-graphs to different CPUs. Note that distributed compilation is not implemented yet, but since the parallelism is facilitated via generating a Makefile, it would be easy to implement.
WHOPR splits LTO into three main stages:
- Local generation (LGEN): This stage executes in parallel. Every file in the program is compiled into the intermediate language and packaged together with the local call-graph and summary information. This stage is the same for both the LTO and WHOPR compilation modes.
- Whole Program Analysis (WPA): WPA is performed sequentially. The global call-graph is generated, and a global analysis procedure makes transformation decisions. The global call-graph is partitioned to facilitate parallel optimization during phase 3. The results of the WPA stage are stored into new object files which contain the partitions of the program expressed in the intermediate language and the optimization decisions.
- Local transformations (LTRANS): This stage executes in parallel. All the decisions made during phase 2 are implemented locally in each partitioned object file, and the final object code is generated. Optimizations which cannot be decided efficiently during phase 2 may be performed on the local call-graph partitions.
WHOPR can be seen as an extension of the usual LTO mode of compilation. In LTO, WPA and LTRANS are executed within a single execution of the compiler, after the whole program has been read into memory.
When compiling in WHOPR mode, the callgraph is partitioned during the WPA stage. The whole program is split into a given number of partitions of roughly the same size. The compiler tries to minimize the number of references which cross partition boundaries. The main advantage of WHOPR is to allow the parallel execution of LTRANS stages, which are the most time-consuming part of the compilation process. Additionally, it avoids the need to load the whole program into memory.