Possible optimization on Windows with AVX-512 vs AVX2 #91478
-
Per the information in https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=msvc-170#callercallee-saved-registers, I think it could be faster if the JIT used the volatile XMM0-5 and XMM16-31 registers first, instead of allocating registers sequentially (from 0 to 31). With sequential allocation, once the first six registers are used up the JIT moves on to the non-volatile XMM6-15, which then have to be saved on the stack. Some profitability checks may also be needed so that this ordering is applied only when appropriate, for example when there are no calls to other functions and only the XMM parts of XMM6-15 would be used (no upper YMM or ZMM usage).
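A rough Python model of the trade-off described above. It assumes the Windows x64 volatility rules from the linked page (XMM0-5 and XMM16-31 volatile, XMM6-15 non-volatile) and ignores everything else a register allocator weighs. `callee_saved_touched` is a hypothetical helper that lists which non-volatile registers a given allocation order would touch when N values are live at the same time.

```python
# Hypothetical model of the proposal, not JIT code.
# Windows x64: XMM0-5 and XMM16-31 are volatile, XMM6-15 are non-volatile.
WIN_X64_NONVOLATILE = set(range(6, 16))

def callee_saved_touched(order, live_values):
    """Non-volatile XMM registers used when `live_values` simultaneously
    live values take the first registers of `order`."""
    used = order[:live_values]
    return sorted(r for r in used if r in WIN_X64_NONVOLATILE)

# Sequential order: 0, 1, ..., 31.
sequential = list(range(32))
# Volatile-first order: 0-5, 16-31, then the non-volatile 6-15.
volatile_first = ([r for r in range(32) if r not in WIN_X64_NONVOLATILE]
                  + sorted(WIN_X64_NONVOLATILE))

print(callee_saved_touched(sequential, 8))      # [6, 7]
print(callee_saved_touched(volatile_first, 8))  # []
```

Every register in the sequential result would need a save in the prolog and a restore in the epilog. The model ignores calls: a value that is live across a call and sits in a volatile register must be spilled around that call instead, which is why a profitability check would be needed.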
-
The register allocator already takes this into account. We independently track callee-save vs callee-trash sets, and within those we order registers by encoding cost. Thus xmm0-5 are done first and xmm16-xmm31 are taken after those are used up. Register preferencing, ABI conventions for passing/returning values, and other factors can also influence the ultimate register selection.
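A minimal Python sketch of the ordering described above, under simplifying assumptions: integers stand in for XMM registers, the callee-save set is the Windows x64 one (XMM6-15), and encoding cost is reduced to two tiers (XMM0-15 are encodable with VEX, XMM16-31 require an EVEX prefix). The function names are hypothetical; this illustrates the idea and is not the JIT's actual code or list.

```python
# Hypothetical sketch; not the JIT's real ordering logic.
CALLEE_SAVE = set(range(6, 16))  # Windows x64 non-volatile XMM registers

def encoding_cost(reg):
    # Simplified: XMM0-15 can use VEX encoding, XMM16-31 need EVEX (AVX-512).
    return 0 if reg < 16 else 1

def allocation_order(regs):
    regs = list(regs)
    callee_trash = [r for r in regs if r not in CALLEE_SAVE]
    callee_save = [r for r in regs if r in CALLEE_SAVE]
    # Callee-trash registers come first; sorted() is stable, so registers
    # with equal cost keep their numeric order within each set.
    return (sorted(callee_trash, key=encoding_cost)
            + sorted(callee_save, key=encoding_cost))

avx512_order = allocation_order(range(32))  # XMM0-31 available
avx2_order = allocation_order(range(16))    # only XMM0-15 without AVX-512
print(avx512_order)
print(avx2_order)
```

With AVX-512 the sketch yields xmm0-5, then xmm16-31, then xmm6-15; with only 16 registers it reduces to xmm0-5 followed by xmm6-15.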
The baseline list/order can be seen here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/targetamd64.h#L277-L287 (Unix is in the ifdef just above that).
Note that this order was adjusted as part of adding AVX-512 support in .NET 8.