[x86] Un-optimal use of upper registers while generating vpmaddwd for reductions #121548
```c
for ( int k = 0 ; k < N ; k++ ) {
    sum += arr16[k] * arr8[k];
}
```
There are 3 data types in this loop: "sum" is int32, arr16 is int16, and arr8 is uint8. AOCC/LLVM decides to vectorize using the max type (i32, for the sum), so it chooses VF = 16 x i32 (ZMM). Later, the pre-codegen pass "X86 Partial Reduction" recognizes the dot-product pattern and inserts extra shuffles and adds to prepare the SelectionDAG to generate vpmaddwd.
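As a reminder of the pattern being targeted, here is a minimal scalar sketch (the helper name is illustrative, not from the source) of what one 32-bit lane of vpmaddwd computes; the even/odd shufflevectors plus the <8 x i32> add in the IR below reassociate the widened multiplies into exactly this shape:

```c
#include <stdint.h>

/* One i32 lane of vpmaddwd: multiply adjacent signed 16-bit pairs and add
   the two 32-bit products. The zero-extended uint8_t operand always fits
   in int16_t, so treating it as signed here is safe. */
static inline int32_t pmaddwd_lane(const int16_t a[2], const int16_t b[2]) {
    return (int32_t)a[0] * (int32_t)b[0] + (int32_t)a[1] * (int32_t)b[1];
}
```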
--Snip--
```llvm
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <16 x i32> [ zeroinitializer, %vector.ph ], [ %9, %vector.body ]
%0 = shl nuw nsw i64 %index, 1
%scevgep1 = getelementptr i8, ptr %arr16, i64 %0
%wide.load = load <16 x i16>, ptr %scevgep1, align 2, !tbaa !5
%1 = sext <16 x i16> %wide.load to <16 x i32>
%2 = getelementptr inbounds i8, ptr %arr8, i64 %index
%wide.load12 = load <16 x i8>, ptr %2, align 1, !tbaa !9
%3 = zext <16 x i8> %wide.load12 to <16 x i32>
%4 = mul <16 x i32> %3, %1
%5 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
%6 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
%7 = add <8 x i32> %5, %6
%8 = shufflevector <8 x i32> %7, <8 x i32> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
%9 = add <16 x i32> %8, %vec.phi
%index.next = add nuw i64 %index, 16
%10 = icmp eq i64 %n.vec, %index.next
br i1 %10, label %middle.block, label %vector.body, !llvm.loop !10
middle.block: ; preds = %vector.body
%rdx.shuf = shufflevector <16 x i32> %9, <16 x i32> poison, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx = add <16 x i32> %9, %rdx.shuf
```
--Snip--

vpmaddwd only writes 8 lanes of the i32 result. Can this pass be modified to update the phi, add, and reduction instructions to use only 8 lanes instead of 16?

--Snip--
```llvm
vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
%vec.phi = phi <8 x i32> [ zeroinitializer, %vector.ph ], [ %x, %vector.body ]
%0 = shl nuw nsw i64 %index, 1
%scevgep1 = getelementptr i8, ptr %arr16, i64 %0
%wide.load = load <16 x i16>, ptr %scevgep1, align 2, !tbaa !5
%1 = sext <16 x i16> %wide.load to <16 x i32>
%2 = getelementptr inbounds i8, ptr %arr8, i64 %index
%wide.load12 = load <16 x i8>, ptr %2, align 1, !tbaa !9
%3 = zext <16 x i8> %wide.load12 to <16 x i32>
%4 = mul <16 x i32> %3, %1
%5 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
%6 = shufflevector <16 x i32> %4, <16 x i32> %4, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
%7 = add <8 x i32> %5, %6
%x = add <8 x i32> %7, %vec.phi
%index.next = add nuw i64 %index, 16
%8 = icmp eq i64 %n.vec, %index.next
br i1 %8, label %middle.block, label %vector.body, !llvm.loop !10
middle.block: ; preds = %vector.body
%rdx.shuf2 = shufflevector <8 x i32> %x, <8 x i32> poison, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 poison, i32 poison, i32 poison, i32 poison>
%bin.rdx3 = add <8 x i32> %x, %rdx.shuf2
```
--Snip--

With the above IR change I noticed we are generating the vpdpwssd instruction. But it then needs to be split into vpmaddwd and vpaddd when the critical dependency is on the vpaddd, since that is cheaper for Znver5.
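For context, a minimal C-intrinsics sketch of the two forms being compared (illustrative only, not the compiler's output; the fused variant assumes AVX512VNNI + AVX512VL): the fused update puts the whole multiply-add on the loop-carried accumulator chain, while the split form leaves only the vpaddd there:

```c
#include <immintrin.h>

/* Fused VNNI form: one vpdpwssd; the accumulator update waits on the
   complete multiply-add (requires AVX512VNNI + AVX512VL). */
static inline __m256i dot_step_fused(__m256i acc, __m256i a16, __m256i b16) {
    return _mm256_dpwssd_epi32(acc, a16, b16);
}

/* Split form: vpmaddwd is off the loop-carried chain; only the final
   vpaddd feeds the accumulator, which is cheaper on Znver5 per the above. */
static inline __m256i dot_step_split(__m256i acc, __m256i a16, __m256i b16) {
    return _mm256_add_epi32(acc, _mm256_madd_epi16(a16, b16));
}
```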
@RKSimon Can I change the X86 Partial Reduction pass to use only the lower vector lanes and post a patch?
@llvm/issue-subscribers-backend-x86 Author: None (venkataramananhashkumar)
For the given code below
```c
#include <stdint.h>
int32_t foo(const int16_t *arr16, const uint8_t *arr8, int N) {
int32_t sum = 1;
for ( int k = 0 ; k < N ; k++ ) {
sum += arr16[k] * arr8[k];
}
return sum;
}
```
LLVM with `-O3 -march=znver5 -S -fno-unroll-loops` generated the code below.
```s
.LBB0_8:
vpmovzxbw (%rsi,%rax), %ymm1
vpmaddwd (%rdi,%rax,2), %ymm1, %ymm1
addq $16, %rax
vpaddd %zmm0, %zmm1, %zmm0   # <== upper halves of zmm1 and zmm0 are zero
cmpq %rax, %rdx
jne .LBB0_8
```
I suspect this will be addressed with a fix for #121529
"vpmaddwd" in the above code only writes to lower half of zmm1 (ie. ymm1). But we are using full zmm add for the sum. Also we have un-optimal extract and add in the reduction code.