[flang] 1.5x performance regression on gas_dyn2 #121599

vzakhari · 2025-01-03T20:18:26Z

This reproduces on zen4 after #121544

After InlineHLFIRAssign, we now have this in the chozdt_ routine:

  %23 = hlfir.minloc %19#0 {fastmath = #arith.fastmath<fast>} : (!fir.box<!fir.array<?xf32>>) -> !hlfir.expr<1xi32>
  fir.do_loop %arg6 = %c1 to %c1 step %c1 unordered {
    %55 = hlfir.apply %23, %arg6 : (!hlfir.expr<1xi32>, index) -> i32
    %56 = hlfir.designate %7#0 (%arg6)  : (!fir.ref<!fir.array<1xi32>>, index) -> !fir.ref<i32>
    hlfir.assign %55 to %56 : i32, !fir.ref<i32>
  }

The do-loop results from inlining this assignment:

  hlfir.assign %23 to %7#0 : !hlfir.expr<1xi32>, !fir.ref<!fir.array<1xi32>>

After LLVM inlining and other optimizations, we get the following minloc loop:

.lr.ph.i:                                         ; preds = %.lr.ph.i.preheader, %.lr.ph.i
  %42 = phi i32 [ %51, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
  %43 = phi float [ %52, %.lr.ph.i ], [ %41, %.lr.ph.i.preheader ]
  %44 = phi i64 [ %53, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
  %45 = shl nsw i64 %44, 2
  %46 = getelementptr i8, ptr %10, i64 %45
  %47 = load float, ptr %46, align 4, !tbaa !60
  %48 = fcmp fast uge float %47, %43
  %49 = trunc i64 %44 to i32
  %50 = add nuw i32 %49, 1
  %51 = select i1 %48, i32 %42, i32 %50
  %52 = select i1 %48, float %43, float %47
  %53 = add nuw nsw i64 %44, 1
  %exitcond.not.i = icmp eq i64 %53, %9
  br i1 %exitcond.not.i, label %_FortranAMinlocReal4x1_i32_fast_simplified.exit, label %.lr.ph.i, !llvm.loop !64

The select operations form a cyclic (loop-carried) dependency, and they are later lowered to cmov/minss instructions. Before my change, the SimplifyCFG pass could not produce the selects, presumably because the result of the integer select was stored into the temporary <1 x i32> array.

The cyclic dependency introduced by the integer cmov seems to limit the performance of the loop. The loop would be better off with a compare and jump.
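To make the two loop shapes concrete, here is a minimal C sketch of the same minloc reduction written both ways. It is only an illustration, not code taken from flang or from gas_dyn2: the function names `minloc_select` and `minloc_branch` are hypothetical, NaN inputs are assumed absent (in line with the fast-math flags in the IR), and a compiler is free to turn either form into the other.

```c
#include <stddef.h>

/* Select form (hypothetical name): every iteration recomputes both the
 * location and the running minimum from their previous values, so the
 * new loc/best always depend on the old ones (loop-carried dependency).
 * Assumes no NaN values in the input. */
int minloc_select(const float *a, size_t n) {
    if (n == 0) return 0;            /* MINLOC of a zero-sized array is 0 */
    float best = a[0];
    int loc = 1;                     /* Fortran-style 1-based location */
    for (size_t i = 1; i < n; ++i) {
        int keep = a[i] >= best;     /* mirrors the 'fcmp uge' in the IR */
        loc = keep ? loc : (int)(i + 1);
        best = keep ? best : a[i];
    }
    return loc;
}

/* Compare-and-jump form (hypothetical name): loc and best are written
 * only when a new minimum is found, which is expected to be rare, so a
 * well-predicted branch skips the update entirely. */
int minloc_branch(const float *a, size_t n) {
    if (n == 0) return 0;
    float best = a[0];
    int loc = 1;
    for (size_t i = 1; i < n; ++i) {
        if (a[i] < best) {
            best = a[i];
            loc = (int)(i + 1);
        }
    }
    return loc;
}
```

Both functions return the 1-based position of the first minimum. The difference that matters here is only in how the update is expressed: unconditionally through selects in the first, or behind a rarely taken branch in the second.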

Performance is restored with -disable-select-optimize=false; unfortunately, the select optimization is disabled by default on X86.

Possible solutions:

1. Try to enable the select optimization for X86, but I guess it is disabled for a reason.
2. We can try to trick SimplifyCFG into not converting the compare-jump into selects by attaching the following branch probability to the jump instruction:

  %35 = load float, ptr %34, align 4, !tbaa !1609
  %36 = fcmp fast olt float %35, %23
  %37 = fcmp fast une float %23, %23
  %38 = fcmp fast oeq float %35, %35
  %39 = and i1 %37, %38
  %40 = or i1 %36, %39
  %41 = trunc i32 %27 to i1
  %42 = xor i1 %41, true
  %43 = or i1 %40, %42
  br i1 %43, label %44, label %47, !prof !2000

44:                                               ; preds = %26
  store i32 1, ptr %5, align 4, !tbaa !1609
  %45 = trunc i64 %22 to i32
  %46 = add i32 %45, 1
  store i32 %46, ptr %14, align 4, !tbaa !1609
  br label %47

47:                                               ; preds = %44, %26
...

!2000 = !{!"branch_weights", !"expected",  i32 1, i32 99}

This is just a trick, though; the right solution would be to let the select optimization apply its heuristics.
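As a rough source-level analogy to the branch-weight trick, the C sketch below marks the "new minimum found" branch as unlikely using `__builtin_expect`, which GCC and Clang provide. To my understanding Clang turns such hints into branch-weight metadata on the conditional branch, but whether a given compiler version then keeps the branch instead of forming selects is not guaranteed. The function name `minloc_hinted` and the `UNLIKELY` macro are hypothetical, and NaN inputs are assumed absent.

```c
#include <stddef.h>

/* UNLIKELY is a hypothetical helper macro; it falls back to a plain
 * condition on compilers without __builtin_expect. */
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* minloc with the update branch annotated as rarely taken.
 * Returns the 1-based location of the first minimum, or 0 when n == 0.
 * Assumes no NaN values in the input. */
int minloc_hinted(const float *a, size_t n) {
    if (n == 0) return 0;
    float best = a[0];
    int loc = 1;
    for (size_t i = 1; i < n; ++i) {
        if (UNLIKELY(a[i] < best)) {   /* hint: a new minimum is rare */
            best = a[i];
            loc = (int)(i + 1);
        }
    }
    return loc;
}
```

The hint changes only the compiler's view of the branch probability, not the result, so it has the same limitation as the IR-level trick: it biases the heuristics rather than fixing them.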

vzakhari self-assigned this Jan 3, 2025

github-actions bot added the flang label (Flang issues not falling into any other category) Jan 3, 2025

vzakhari added the performance label Jan 3, 2025
