
[x86] Un-optimal use of upper registers while generating vpmaddwd for reductions #121548

Open
venkataramananhashkumar opened this issue Jan 3, 2025 · 5 comments

Comments

venkataramananhashkumar commented Jan 3, 2025

For the given code below

#include <stdint.h>
int32_t foo(const int16_t *arr16, const uint8_t *arr8, int N) {
    int32_t sum = 1;
    for ( int k = 0 ; k < N ; k++ ) {
        sum += arr16[k] * arr8[k];
    }
    return sum;
}

LLVM with -O3 -march=znver5 -S -fno-unroll-loops generates the code below.

.LBB0_8:
        vpmovzxbw       (%rsi,%rax), %ymm1
        vpmaddwd        (%rdi,%rax,2), %ymm1, %ymm1  
        addq    $16, %rax
        vpaddd  %zmm0, %zmm1, %zmm0  <== upper half of  zmm1 and zmm0 are zero.
        cmpq    %rax, %rdx
        jne     .LBB0_8

        vextracti64x4   $1, %zmm0, %ymm1 <== Un-optimal extract 
        vpaddd  %zmm1, %zmm0, %zmm0 
        vextracti128    $1, %ymm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpshufd $238, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vpshufd $85, %xmm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0
        vmovd   %xmm0, %eax

"vpmaddwd" in the above code only writes to lower half of zmm1 (ie. ymm1). But we are using full zmm add for the sum. Also we have un-optimal extract and add in the reduction code.

venkataramananhashkumar (Author) commented Jan 3, 2025

for ( int k = 0 ; k < N ; k++ ) {
    sum += arr16[k] * arr8[k];
}

There are three data types in this loop: "sum" is int32, arr16 is int16, and arr8 is uint8. AOCC/LLVM decides to vectorize using the maximum type (i32, for the sum), so it chooses VF = 16 x i32 (ZMM).

Later, the pre-codegen pass "X86 Partial Reduction" recognizes the dot-product pattern and inserts extra shuffles and adds to prepare the SelectionDAG to generate vpmaddwd.

--Snip--
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <16 x i32> [ zeroinitializer, %vector.ph ], [ %9, %vector.body ]
  %0 = shl nuw nsw i64 %index, 1
  %scevgep1 = getelementptr i8, ptr %arr16, i64 %0
  %wide.load = load <16 x i16>, ptr %scevgep1, align 2, !tbaa !5
  %1 = sext <16 x i16> %wide.load to <16 x i32>
  %2 = getelementptr inbounds i8, ptr %arr8, i64 %index
  %wide.load12 = load <16 x i8>, ptr %2, align 1, !tbaa !9
  %3 = zext <16 x i8> %wide.load12 to <16 x i32>
  %4 = mul <16 x i32> %3, %1
  %5 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %6 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %7 = add <8 x i32> %5, %6
  %8 = shufflevector <8 x i32> %7, <8 x i32> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %9 = add <16 x i32> %8, %vec.phi
  %index.next = add nuw i64 %index, 16
  %10 = icmp eq i64 %n.vec, %index.next
  br i1 %10, label %middle.block, label %vector.body, !llvm.loop !10

middle.block:                                     ; preds = %vector.body
  %rdx.shuf = shufflevector <16 x i32> %9, <16 x i32> poison, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
  %bin.rdx = add <16 x i32> %9, %rdx.shuf
--Snip--

vpmaddwd only writes 8 lanes.

Can this pass be modified to update the phi, add, and reduction instructions to use only 8 lanes instead of 16, as shown below?

--Snip--
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.phi = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %x, %vector.body ]
  %0 = shl nuw nsw i64 %index, 1
  %scevgep1 = getelementptr i8, ptr %arr16, i64 %0
  %wide.load = load <16 x i16>, ptr %scevgep1, align 2, !tbaa !5
  %1 = sext <16 x i16> %wide.load to <16 x i32>
  %2 = getelementptr inbounds i8, ptr %arr8, i64 %index
  %wide.load12 = load <16 x i8>, ptr %2, align 1, !tbaa !9
  %3 = zext <16 x i8> %wide.load12 to <16 x i32>
  %4 = mul <16 x i32> %3, %1
  %5 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
  %6 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
  %7 = add <8 x i32> %5, %6
  %x = add <8 x i32> %7, %vec.phi
  %index.next = add nuw i64 %index, 16
  %8 = icmp eq i64 %n.vec, %index.next
  br i1 %8, label %middle.block, label %vector.body, !llvm.loop !10

middle.block:                                     ; preds = %vector.body
  %rdx.shuf2 = shufflevector <8 x i32> %x, <8 x i32> poison, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
  %bin.rdx3 = add <8 x i32> %x, %rdx.shuf2
--Snip--

With the above IR change I noticed we generate the vpdpwssd instruction. However, it later needs to be split into vpmaddwd and vpaddd when the loop-carried critical dependency runs through the vpaddd, since that form is cheaper on Znver5.
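
For illustration, a rough hand-written sketch of the two loop-body forms being weighed here (addressing follows the loop above; not compiler output):

        # fused form (AVX-512 VNNI): one multiply-accumulate on the loop-carried chain
        vpmovzxbw       (%rsi,%rax), %ymm1
        vpdpwssd        (%rdi,%rax,2), %ymm1, %ymm0

        # split form: the loop-carried dependency runs only through the cheap vpaddd
        vpmovzxbw       (%rsi,%rax), %ymm1
        vpmaddwd        (%rdi,%rax,2), %ymm1, %ymm1
        vpaddd  %ymm1, %ymm0, %ymm0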

RKSimon self-assigned this Jan 3, 2025
venkataramananhashkumar (Author) commented:
@RKSimon Can I change the X86 Partial Reduction pass to use only the lower vector lanes and post a patch?

llvmbot (Member) commented Jan 3, 2025

@llvm/issue-subscribers-backend-x86


RKSimon (Collaborator) commented Jan 3, 2025

I suspect this will be addressed with a fix for #121529

venkataramananhashkumar (Author) commented:
I suspect this will be addressed with a fix for #121529

vextracti64x4 $1, %zmm0, %ymm1
vpaddd %zmm1, %zmm0, %zmm0

I thought #121529 still needs the extract and only changes the vpaddd to the ymm form. Here we don't need the first extract or the first add.
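
To illustrate the distinction as I read it (hand-written sketch):

        # with a fix for #121529 alone: the extract stays, only the add narrows to ymm
        vextracti64x4   $1, %zmm0, %ymm1
        vpaddd  %ymm1, %ymm0, %ymm0

        # what this case additionally allows, since the upper half of zmm0 is known zero:
        # skip straight to the 128-bit stage of the reduction
        vextracti128    $1, %ymm0, %xmm1
        vpaddd  %xmm1, %xmm0, %xmm0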
