Procedure Positioning is a well-known compiler optimization aiming at the improvement of the instruction cache behavior. The basic idea behind procedure positioning is to improve I-cache behavior by reducing the number of cache conflict misses. Caches reduce the average memory access time by exploiting spatial and temporal locality. The former refers to the reference of contiguous memory locations. Temporal locality means that particular memory locations will be accessed within a short period of time. Due to an inappropriate memory layout, the temporal locality may, however, degrade cache performance. This situation arises when memory locations being accessed temporally close to each other are mapped to the same cache location. This overlapping results in an eviction of cache contents and results in repetitive cache refills.
Procedure positioning approaches are based on a call graph. This undirected graph consists of nodes which represent program procedures. Edges between the nodes denote calling relationships between procedures and are weighted with call frequencies which, for average-case execution time (ACET) optimization, are gained using profiling.
In contrast to standard ACET-focused optimizations, the approaches presented on this page do not rely on execution profile data but extract their input data for the call graph from a WCET analyzer. This fundamental difference makes our approach more reliable than the standard optimizations. Execution profile data is critical since it reflects the program execution for a particular set of input data, i.e., profiling the program under test with varying inputs may yield different results. For more complex programs consisting of numerous input-dependent execution paths, it is almost infeasible to find representative input values. This may result in a call graph which is annotated with profiling data that does not represent some particular program executions. The optimized code will possibly not improve cache behavior and may even suffer performance degradation.
Our approach does not rely on representative input data. The edge weights of the call graph are computed by a WCET analyzer and are invariant for all program executions. The reason is the inherent nature of a static WCET analysis. It computes the worst-case behavior for a given program that is valid for all inputs without running the program but by performing a static program analysis. This computation implies the determination of call frequencies that yield the longest program execution. Our WCET-centric call graph is based on this information and an edge with the heaviest weight potentially combines the most promising functions for optimization.
It is well-known that the influence of a memory layout modification on caches is hardly predictable. A promising optimization strategy is based on a greedy approach which evaluates the influence of a particular reallocation of procedures on the WCET. In case a WCET minimization was achieved, this altered memory layout is considered as a new starting point for the next optimization cycle, and the next most promising function for positioning is derived from the WCET-centric call graph. Hence, the approach successively reduces the WCET and guarantees that the modified code's WCET is never worse compared to the original code.
The greedy approach is an iterative algorithm processing a single edge of the WCET-centric call graph during each iteration. During each iteration, a WCET analysis takes place in order to update the WCET timing data to be used during the next iteration.
Due to the potentially many and time consuming WCET analyses performed by the greedy approach, a fast heuristic approach was developed just using the information of the WCET-centric call graph for the initial input program. In contrast to the greedy approach, the heuristic algorithm performs exactly one WCET analysis to construct the initial call graph.
The above figure shows the results for the greedy and heuristic procedure positioning approaches, with 100% corresponding to the WCET estimation of the original code without procedure positioning applied. First of all, it can be seen that a WCET reduction was achieved for most benchmarks. The greedy algorithm achieved on average a WCET minimization by 10%, while the heuristic approach reduced the WCET on average by 4%.
Note the difference in the achieved results between the greedy and the heuristic approach. For all benchmarks, the greedy positioning achieved better results since it does not allow a degradation of the WCET. This might result in a local optimum missing the global minimum as could be potentially achieved by the heuristic approach. However, for the considered benchmarks, this case did not arise. For the heuristic approach, it might also happen that the WCET becomes worse after the optimization as experienced for the GSM encoder. Hence, it can be concluded that for best results, it is worth to invest time for the optimization as done by our greedy approach.
In addition to procedure positioning, we described how we exploited the standard compiler optimization procedure cloning to improve WCET analysis. WCET-aware procedure cloning generates specialized versions of functions, making their calling context explicit and thus enabling the highly precise annotation of loop bound flow facts. Procedure cloning was performed for a system with disabled caches. Any newly created function clone was placed behind the last function in the code with no regard to cache effects. The following figure presents results for a combination of procedure cloning and WCET-aware procedure positioning particularly for cache-based systems.
In the above figure, 100% correspond to the WCETs of the benchmarks in their unoptimized versions. Combined procedure cloning and procedure positioning achieves WCET reductions of up to 64%. These results allow two conclusions.
First, it can be seen that procedure cloning and positioning is best suited in a cache based system. Although procedure cloning increases the code size by additional functions (the cloned functions), the resulting WCET estimates are still more precise than for the original code. The benefits from the improved WCET analysis exceed the disadvantages that may emerge from more cache conflict misses due to the increased working set.
Second, for most benchmarks, procedure cloning combined with positioning achieved better results. The goal of both optimizations in combination is to compensate the potential conflict misses caused by additional function clones. Obviously, the achieved benefits are smaller than for the positioning approaches shown Figure, because function reordering was exclusively restricted to the function clones. However, the additional positioning of the clones is negligible for the complexity of the algorithm and should be exploited for better results.