Compiler support for preventing stall cycles

Compiler support for preventing stall cycles

See also: Register Allocation (in compilers)

Assumption:

instructions stall cycles
FP-ALU -> Store 2 cycles
FP-ALU -> FP-ALU 3 cycles
Load -> FP-ALU 1 cycle
Load -> Store 0 cycles
Jumps 1 cycle

Rescheduling instructions #

for (int i = 1000; i > 0; --i) {
  x[i] = x[i]+s;
}
Code Snippet 1: sample C code
loop:
    load  f0, 0(r1)    ;; f0 = x[i]
    add   f4, f0, f2   ;; x[i] + s
    store f4, 0(r1)    ;; x[I] = _
    addi  r1, r1, -8   ;;
    bne   r1, r2, loop ;; branch r1!=r2
Code Snippet 2: trivial compilation
loop:                  ;; cycle
    load  f0, 0(r1)    ;; 1
    stall              ;; 2
    add   f4, f0, f2   ;; 3
    stall              ;; 4
    stall              ;; 5
    store f4, 0(r1)    ;; 6
    addi  r1, r1, -8   ;; 7
    stall              ;; 8
    stall              ;; 8
    bne   r1, r2, loop ;; 9
    stall              ;; 10
Code Snippet 3: sample execution

A better compilation using better instruction scheduling #

loop:
    load  f0, 0(r1)    ;; 1
    addi  r1, r1, -8   ;; 2
    add   f4, f0, f2   ;; 3
    stall              ;; 4
    bne   r1, r2, loop ;; 5
    store f4, 8(r1)    ;; 6

Loop unrolling #

for (int i = 1000; i > 0; --i) {
  x[i] = x[i]+s;
}
for (int i = 1000; i > 0; i-=4) {
  x[i-0] = x[i-0]+s;
  x[i-1] = x[i-1]+s;
  x[i-2] = x[i-2]+s;
  x[i-3] = x[i-3]+s;
}
Code Snippet 4: The same code but unrolled by a factor of 4

Colors are given by Register Allocation :

Figure 1: before reordering

Figure 1: before reordering

Figure 2: after reordering: no stall after loads: the first add can start after the load block

Figure 2: after reordering: no stall after loads: the first add can start after the load block

Software pipelining #

Figure 3: (a) loop unrolling (b) software pipelining

Figure 3: (a) loop unrolling (b) software pipelining

Loops are restuctured that in each loop iteration the different stages of different iterations are handled (less dependencies between instructions) -> only works if no loop dependencies

Figure 4: Horizontal yellow line: an interation of the software pipelined loop

Figure 4: Horizontal yellow line: an interation of the software pipelined loop

Figure 5: Software pipelined code transformation, startup and winddown clode have been omitted

Figure 5: Software pipelined code transformation, startup and winddown clode have been omitted

Software pipelining vs Loop unrolling #

Advantage of Loop unrolling:

  • Reduce loop overhead

Advantage of Software pipelining:

  • shorter code
  • Reduces area of low overlap

Advantage of both:

  • Use independent operations from different loop iterations

Loop fusion #

Combine loops to only have to iterate once. Not always possible. Only possible if data dependencies are the same before and after the program transformation.

Calendar October 22, 2023