[flang] 1.5x performance regression on gas_dyn2 #121599

Open
vzakhari opened this issue Jan 3, 2025 · 0 comments
Labels: flang (Flang issues not falling into any other category), performance

vzakhari commented Jan 3, 2025

This reproduces on zen4 after #121544.

After InlineHLFIRAssign, we now have this in the chozdt_ routine:

  %23 = hlfir.minloc %19#0 {fastmath = #arith.fastmath<fast>} : (!fir.box<!fir.array<?xf32>>) -> !hlfir.expr<1xi32>
  fir.do_loop %arg6 = %c1 to %c1 step %c1 unordered {
    %55 = hlfir.apply %23, %arg6 : (!hlfir.expr<1xi32>, index) -> i32
    %56 = hlfir.designate %7#0 (%arg6)  : (!fir.ref<!fir.array<1xi32>>, index) -> !fir.ref<i32>
    hlfir.assign %55 to %56 : i32, !fir.ref<i32>
  }

The do-loop is the result of inlining this assignment:

  hlfir.assign %23 to %7#0 : !hlfir.expr<1xi32>, !fir.ref<!fir.array<1xi32>>

After LLVM inlining and other optimizations, we have the following minloc loop:

.lr.ph.i:                                         ; preds = %.lr.ph.i.preheader, %.lr.ph.i
  %42 = phi i32 [ %51, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
  %43 = phi float [ %52, %.lr.ph.i ], [ %41, %.lr.ph.i.preheader ]
  %44 = phi i64 [ %53, %.lr.ph.i ], [ 1, %.lr.ph.i.preheader ]
  %45 = shl nsw i64 %44, 2
  %46 = getelementptr i8, ptr %10, i64 %45
  %47 = load float, ptr %46, align 4, !tbaa !60
  %48 = fcmp fast uge float %47, %43
  %49 = trunc i64 %44 to i32
  %50 = add nuw i32 %49, 1
  %51 = select i1 %48, i32 %42, i32 %50
  %52 = select i1 %48, float %43, float %47
  %53 = add nuw nsw i64 %44, 1
  %exitcond.not.i = icmp eq i64 %53, %9
  br i1 %exitcond.not.i, label %_FortranAMinlocReal4x1_i32_fast_simplified.exit, label %.lr.ph.i, !llvm.loop !64

The select operations form a cyclic (loop-carried) dependency, and they are later lowered to cmov/minss instructions. Before my change, the SimplifyCFG pass could not produce the selects, presumably because the result of the integer select was stored into the temporary <1 x i32> array.

The cyclic dependency introduced by the integer cmov seems to limit the performance of the loop. The loop would be better off with a compare and jump.
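To make the two loop shapes concrete, here is a minimal C sketch of the same minloc computation written both ways. It is only an illustration: the function names `minloc_select` and `minloc_branch` are hypothetical, NaNs are ignored (matching the fast-math flags in the IR above), and the source form does not dictate what the optimizer finally emits, since SimplifyCFG may still convert the branch form into selects.

```c
#include <assert.h>
#include <stddef.h>

/* Select form (hypothetical name): both loop-carried values, loc and min,
   are updated through conditional expressions on every iteration. This
   mirrors the pair of selects in the IR above. */
int minloc_select(const float *a, size_t n) {
    assert(n > 0);
    float min = a[0];
    int loc = 1;                        /* 1-based result, as in Fortran */
    for (size_t i = 1; i < n; ++i) {
        int keep = a[i] >= min;         /* plays the role of the fcmp uge */
        loc = keep ? loc : (int)i + 1;
        min = keep ? min : a[i];
    }
    return loc;
}

/* Branch form (hypothetical name): the update happens only when a new
   minimum is found, so the common path is a compare and a jump. */
int minloc_branch(const float *a, size_t n) {
    assert(n > 0);
    float min = a[0];
    int loc = 1;
    for (size_t i = 1; i < n; ++i) {
        if (a[i] < min) {
            min = a[i];
            loc = (int)i + 1;
        }
    }
    return loc;
}
```

Both functions return the 1-based position of the first minimum; they differ only in whether the update is expressed as data flow (selects) or control flow (a branch).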

Performance is restored with -disable-select-optimize=false; unfortunately, the select optimization is disabled by default on X86.

Possible solutions:

  • Try to enable the select optimization for X86, but I guess it is disabled for a reason.
  • We can try to trick SimplifyCFG into not converting the compare-and-jump into selects by attaching the following branch probability to the jump instruction:
  %35 = load float, ptr %34, align 4, !tbaa !1609
  %36 = fcmp fast olt float %35, %23
  %37 = fcmp fast une float %23, %23
  %38 = fcmp fast oeq float %35, %35
  %39 = and i1 %37, %38
  %40 = or i1 %36, %39
  %41 = trunc i32 %27 to i1
  %42 = xor i1 %41, true
  %43 = or i1 %40, %42
  br i1 %43, label %44, label %47, !prof !2000

44:                                               ; preds = %26
  store i32 1, ptr %5, align 4, !tbaa !1609
  %45 = trunc i64 %22 to i32
  %46 = add i32 %45, 1
  store i32 %46, ptr %14, align 4, !tbaa !1609
  br label %47

47:                                               ; preds = %44, %26
...

!2000 = !{!"branch_weights", !"expected",  i32 1, i32 99}

This is just a trick, though; the right solution would be to allow the select optimization to apply its heuristics.
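For readers more familiar with source-level hints, a rough C analogue of the branch-weight trick is sketched below. It assumes GCC or Clang, where `__builtin_expect` is available; the `UNLIKELY` macro and the function name `minloc_hinted` are hypothetical. The weights a compiler derives from such a hint need not match the 1:99 ratio shown above, and the hint does not guarantee that the branch survives optimization.

```c
#include <assert.h>
#include <stddef.h>

/* __builtin_expect is a GCC/Clang builtin; other compilers get no hint. */
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Hypothetical name. The update block is marked as rarely taken, which is
   the same intent as weighting the taken edge 1 against 99 in the IR. */
int minloc_hinted(const float *a, size_t n) {
    assert(n > 0);
    float min = a[0];
    int loc = 1;                        /* 1-based result, as in Fortran */
    for (size_t i = 1; i < n; ++i) {
        if (UNLIKELY(a[i] < min)) {
            min = a[i];
            loc = (int)i + 1;
        }
    }
    return loc;
}
```

The hint only changes the profile information attached to the branch; the computed result is identical to the unhinted loop.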

@vzakhari vzakhari self-assigned this Jan 3, 2025
@github-actions github-actions bot added the flang Flang issues not falling into any other category label Jan 3, 2025