# Loop Nest Splitting for WCET Optimization

Especially loops and *if*-statements are an inherent source of unpredictability and loss of precision for WCET analysis. This is caused by the difficulty to obtain safe and tight worst-case estimates of an application's high-level control flow. In addition, assembly-level control flow redirections reduce predictability even more due to complex processor pipelines and branch prediction units.

*Loop Nest Splitting* on the one hand achieves a significantly more homogeneous control flow structure. On the other hand, the highly precise analyses of loop nest splitting enable to generate very accurate high-level flow facts for loops and *if*-statements, having a very beneficial impact on WCET analysis.

### Flow Fact Generation during Loop Nest Splitting

Many embedded real-time multimedia applications are data-dominated using large amounts of data memory. Typically, they consist of deeply nested for-loops. The main algorithm is usually located in the innermost loop. Often, such an algorithm treats particular parts of its data specifically, e.g., an image border requires other manipulations than its center. This boundary checking is implemented using if-statements in the innermost loop. A typical code fragment (an MPEG4 full search motion estimation routine) is depicted below:

for ( x = 0; x < 36; x++ ) { x1 = 4 * x;

for ( y = 0; y < 49; y++ ) { y1 = 4 * y; */* y-loop */*

for ( k = 0; k < 9; k++ ) { x2 = x1 + k - 4;

for ( l = 0; l < 9; l++ ) { y2 = y1 + l - 4;

for ( i = 0; i < 4; i++ ) { x3 = x1 + i; x4 = x2 + i;

for ( j = 0; j < 4; j++ ) { y3 = y1 + j; y4 = y2 + j;

if ( x3 < 0 || 35 < x3 || y3 < 0 || 48 < y3 )

*then_block_1*; else *else_block_1*;

if ( x4 < 0 || 35 < x4 || y4 < 0 || 48 < y4 )

*then_block_2*; else *else_block_2*; }}}}}}

This code has several properties making it sub-optimal w.r.t. worst- and average-case execution time (ACET). First, the *if*-statements lead to a very irregular control flow. Any jump instruction in a machine program causes a control hazard for pipelined processors. This means that the pipeline needs to be stalled for some instruction cycles, so as to prevent the execution of incorrectly prefetched instructions. WCET analysis needs to estimate whether a jump is taken or not. The worst-case influence of this decision on pipeline and branch prediction needs to be taken into account. Since it is difficult to predict these control flow modifications accurately, resulting WCETs tend to become imprecise the more irregular the control flow is.

In addition, the way how conditions of an *if*-statement are expressed also has a negative impact on WCET. If conditions are connected using the logical and / or operators of ANSI-C, they are evaluated lazily. For example, expression ` e2` of the condition

`is not evaluated if`

**e1 || e2**`already evaluates to true. Hence, each occurrence of`

**e1**`and`

**||**`implies hidden control flow modifications with a negative influence on WCET. This unpredictability due to`

**&&***if*-statements becomes even more severe if the

*if*-statements are located in deeply nested loops as depicted above. Here, WCET analysis multiplies the overestimated data of the

*if*-statements with the possibly also overestimated number of loop iterations, leading to even more imprecise WCET estimates.

Considering the code shown above, loop nest splitting is able to detect that

- the conditions
**x3 < 0**and**y3 < 0**are never true, and - both
*if*-statements are true for**x ≥ 10**or**y ≥ 14**.

Using this information, the original loop nest above can be rewritten to minimize the total number of executed *if*-statements as depicted below. Here, a new *if*-statement (the *splitting-if*) is inserted in the ` y`-loop testing the condition

`. The`

**x >= 10 || y >= 14***else*-part of this new

*if*-statement is an exact copy of the body of the original

`-loop. Since all`

**y***if*-statements are fulfilled when the splitting-if is true, the

*then*-part consists of the body of the

`-loop without any`

**y***if*-statements and associated

*else*-blocks.

for ( x = 0; x < 36; x++ ) { x1 = 4 * x;

for ( y = 0; y < 49; y++ )

if ( x >= 10 || y >= 14 ) */* Splitting-If */*

for ( ; y < 49; y++ )

for ( k = 0; k < 9; k++ )

... */* l- & i-loop omitted */*

for ( j = 0; j < 4; j++ ) {

*then_block_1*; *then_block_2*; }

else { y1 = 4 * y;

for ( k = 0; k < 9; k++ ) { x2 = x1 + k - 4;

... */* l- & i-loop omitted */*

for ( j = 0; j < 4; j++ ) { y3 = y1 + j; y4 = y2 + j;

if ( 0 || 35 < x3 || 0 || 48 < y3 )

*then_block_1*; else *else_block_1*;

if ( x4 < 0 || 35 < x4 || y4 < 0 || 48 < y4 )

*then_block_2*; else *else_block_2*; }}}}}}

This example shows that loop nest splitting is able to generate a very homogeneous control flow in the hot-spots on an application. During loop nest splitting, polytopes model the loop currently under analysis by defining affine constraints for the loops' lower and upper bounds and for all surrounding loops. In such a polytope, each integral point represents one execution of the loop body for one actual assignment of values to the loops' index variables. Counting the number of integral points of these polytopes leads to the exact number of executions of the loop body. For this purpose, so-called *Ehrhart polynomials* are applied to the polytopes.

The number of points returned by the Ehrhart polynomials is used to annotate the source code after loop nest splitting with flow facts for WCET analysis, which exactly specify the executions of a loop compared to code lying outside the outermost loop of a loop nest.

In addition, precise annotations for the splitting-if created by loop nest splitting are provided to the WCET analyzer. These flow facts also rely on the application of Ehrhart polynomials to the polytope modeling the conditions of the splitting-if.

### Results

The above figure shows the effect of loop nest splitting on the benchmark's WCETs and ACETs for both the ARM and THUMB ISAs. It shows the values for the optimized benchmarks as a percentage of the unoptimized versions denoted as 100%.

Loop nest splitting reduces both ACET and WCET significantly. Concerning ACET, improvements of 6.4% (QSDPCM) - 54.8% (ME) were measured for the ARM ISA. Similarly, ACET is reduced by 11.5% (QSDPCM) - 59.4% (ME) for the THUMB ISA. On average for all benchmarks, ACET is reduced by 25.0% (ARM) - 30.1% (THUMB). This clearly shows that generating homogeneous control flow in loop nests increases average-case performance since a large amount of code found in innermost loops before our optimization is eliminated.

The figure also shows that loop nest splitting reduces WCETs by a similar order of magnitude. Here, gains reach from 4.4% (QSDPCM) - 86.5% (ME) for the ARM ISA. For the THUMB ISA, WCET reductions by 9.6% (QSDPCM) - 89.0% (ME) were reported by aiT. On average over all benchmarks, the WCET reductions after loop nest splitting are significantly larger than the corresponding ACET reductions. In terms of WCET, average improvements by 34.0% (ARM) - 36.3% (THUMB) were returned by the WCET analyzer.