
[x86] Un-optimal use of upper registers while generating vpmaddwd for reductions #121548

Open
venkataramananhashkumar opened this issue Jan 3, 2025 · 5 comments

Comments

venkataramananhashkumar commented Jan 3, 2025

For the given code below

#include <stdint.h>
int32_t foo(const int16_t *arr16, const uint8_t *arr8, int N) {
    int32_t sum = 1;
    for ( int k = 0 ; k < N ; k++ ) {
        sum += arr16[k] * arr8[k];
    }
    return sum;
}

LLVM with -O3 -march=znver5 -S -fno-unroll-loops generates the code below.

.LBB0_8:
        vpmovzxbw       (%rsi,%rax), %ymm1
        vpmaddwd        (%rdi,%rax,2), %ymm1, %ymm1  
        addq    $16, %rax
        vpaddd  %zmm0, %zmm1, %zmm0  <== upper half of  zmm1 and zmm0 are zero.
        cmpq    %rax, %rdx
        jne     .LBB0_8

        vextracti64x4   $1, %zmm0, %ymm1 <== Un-optimal extract 
        vpaddd  %zmm1, %zmm0, %zmm0 
        vextracti128    $1, %ymm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpshufd $238, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpshufd $85, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax

"vpmaddwd" in the above code only writes to lower half of zmm1 (ie. ymm1). But we are using full zmm add for the sum. Also we have un-optimal extract and add in the reduction code.

venkataramananhashkumar (Author) commented Jan 3, 2025

for ( int k = 0 ; k < N ; k++ ) {
    sum += arr16[k] * arr8[k];
}

There are three data types in this loop: "sum" is int32, arr16 is int16, and arr8 is uint8. AOCC/LLVM decides to vectorize using the maximum type (i32, for the sum), so it chooses VF = 16 x i32 (ZMM).

Later, the pre-codegen pass "X86 Partial Reduction" recognizes the dot-product pattern and inserts extra shuffles and adds to prepare the SelectionDAG to generate vpmaddwd.

--Snip--
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <16 x i32> [ zeroinitializer, %vector.ph ], [ %9, %vector.body ]
  %0 = shl nuw nsw i64 %index, 1
  %scevgep1 = getelementptr i8, ptr %arr16, i64 %0
  %wide.load = load <16 x i16>, ptr %scevgep1, align 2, !tbaa !5
  %1 = sext <16 x i16> %wide.load to <16 x i32>
  %2 = getelementptr inbounds i8, ptr %arr8, i64 %index
  %wide.load12 = load <16 x i8>, ptr %2, align 1, !tbaa !9
  %3 = zext <16 x i8> %wide.load12 to <16 x i32>
  %4 = mul <16 x i32> %3, %1
  %5 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %6 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %7 = add <8 x i32> %5, %6
  %8 = shufflevector <8 x i32> %7, <8 x i32> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %9 = add <16 x i32> %8, %vec.phi
  %index.next = add nuw i64 %index, 16
  %10 = icmp eq i64 %n.vec, %index.next
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !10

middle.block:                                     ; preds = %vector.body
  %rdx.shuf = shufflevector <16 x i32> %9, <16 x i32> poison, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %bin.rdx = add <16 x i32> %9, %rdx.shuf
--Snip--

vpmaddwd only writes 8 lanes.

Can this pass be modified to update the phi, add, and reduction instructions to use only 8 lanes instead of 16, as shown below?

--Snip--
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %x, %vector.body ]
  %0 = shl nuw nsw i64 %index, 1
  %scevgep1 = getelementptr i8, ptr %arr16, i64 %0
  %wide.load = load <16 x i16>, ptr %scevgep1, align 2, !tbaa !5
  %1 = sext <16 x i16> %wide.load to <16 x i32>
  %2 = getelementptr inbounds i8, ptr %arr8, i64 %index
  %wide.load12 = load <16 x i8>, ptr %2, align 1, !tbaa !9
  %3 = zext <16 x i8> %wide.load12 to <16 x i32>
  %4 = mul <16 x i32> %3, %1
  %5 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %6 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %7 = add <8 x i32> %5, %6
  %x = add <8 x i32> %7, %vec.phi
  %index.next = add nuw i64 %index, 16
  %8 = icmp eq i64 %n.vec, %index.next
  br i1 %8, label %middle.block, label %vector.body, !llvm.loop !10

middle.block:                                     ; preds = %vector.body
  %rdx.shuf2 = shufflevector <8 x i32> %x, <8 x i32> poison, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
  %bin.rdx3 = add <8 x i32> %x, %rdx.shuf2
--Snip--

With the above IR change I noticed we generate the vpdpwssd instruction. However, it later needs to be split into vpmaddwd and vpaddd when the loop-carried critical dependency runs through the vpaddd, since that form is cheaper on Znver5.
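
For illustration, a rough hand-written sketch of the two loop-body forms being weighed here (addressing follows the loop above; not compiler output):

        # fused form (AVX-512 VNNI): one multiply-accumulate on the loop-carried chain
        vpmovzxbw       (%rsi,%rax), %ymm1
        vpdpwssd        (%rdi,%rax,2), %ymm1, %ymm0

        # split form: the loop-carried dependency runs only through the cheap vpaddd
        vpmovzxbw       (%rsi,%rax), %ymm1
        vpmaddwd        (%rdi,%rax,2), %ymm1, %ymm1
        vpaddd  %ymm1, %ymm0, %ymm0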

RKSimon self-assigned this Jan 3, 2025
venkataramananhashkumar (Author) commented:
@RKSimon Can I change the X86 Partial Reduction pass to use only the lower vector lanes and post a patch?

llvmbot (Member) commented Jan 3, 2025

@llvm/issue-subscribers-backend-x86


RKSimon (Collaborator) commented Jan 3, 2025

I suspect this will be addressed with a fix for #121529

venkataramananhashkumar (Author) commented:
I suspect this will be addressed with a fix for #121529

vextracti64x4 $1, %zmm0, %ymm1
vpaddd %zmm1, %zmm0, %zmm0

I thought #121529 still needs the extract and only changes the vpaddd to the ymm form. Here we don't need the first extract or the first add.
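
To illustrate the distinction as I read it (hand-written sketch):

        # with a fix for #121529 alone: the extract stays, only the add narrows to ymm
        vextracti64x4   $1, %zmm0, %ymm1
        vpaddd  %ymm1, %ymm0, %ymm0

        # what this case additionally allows, since the upper half of zmm0 is known zero:
        # skip straight to the 128-bit stage of the reduction
        vextracti128    $1, %ymm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0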
