-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Winch: implement fpu to int conversions for aarch64 #9889
base: main
Are you sure you want to change the base?
Conversation
2d48a96
to
9392766
Compare
one more in the #8321 series |
Subscribe to Label Action
This issue or pull request has been labeled: "winch"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
I can help reviewing this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Left some comments inline.
As a general comment, the emit_*
naming convention is not standard across the assembler implementations (as noted and fixed in #9918), so I´d suggest dropping the emit_
prefix in the assembler methods introduced in this PR.
winch/codegen/src/isa/aarch64/asm.rs
Outdated
} | ||
|
||
/// Emit the `fcmp` instruction with `rn` and `rm` as source registers. | ||
fn emit_fpu_cmp(&mut self, rn: Reg, rm: Reg, size: OperandSize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we don´t need to duplicate this method, fcmp
already exists.
winch/codegen/src/isa/aarch64/asm.rs
Outdated
/// Load the max value for an integer of size out_size, as a floating-point | ||
/// of size `in-size`, into register `rd`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Load the max value for an integer of size out_size, as a floating-point | |
/// of size `in-size`, into register `rd`. | |
/// Load the max value for an integer of size out_size, as a floating-point | |
/// of size `in_size`, into register `rd`. |
winch/codegen/src/isa/aarch64/asm.rs
Outdated
use OperandSize::*; | ||
|
||
match in_size { | ||
OperandSize::S32 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I believe that here and in the next match arm below you can omit the OperandSize::
prefix given that you're already importing it above (e.g., use OperandSize::*
)
winch/codegen/src/isa/aarch64/asm.rs
Outdated
use OperandSize::*; | ||
|
||
match in_size { | ||
OperandSize::S32 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment here, I think it should be possible to omit the OperandSize
prefix in both match arms.
winch/codegen/src/isa/aarch64/asm.rs
Outdated
// - check bounds | ||
self.emit_check_nan(src, src_size); | ||
|
||
self.emit_min_fp_value(signed, src_size, dst_size, Writable::from_reg(tmp_reg)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're writing to tmp_reg
, I think this should be defined in the signature, similar to dst
: tmp_reg: Writable<Reg>
src_size, | ||
dst_size, | ||
kind, | ||
// TODO: strangely, this register is provided in unsigned_truncate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I´d suggest avoiding passing a placeholder register here. The temp register that is passed to unsigned_truncate
is an allocatable register, therefore there's very little chance of it getting unintentionally clobbered. Instead having a single implementation for both cases, I´d suggest breaking down the implementation in the assembler into two methods, one for each case (signed/unsigned), then in each implementation you can decide to use the scratch register, immediately before emission, without the risk of unintentionally clobbering it. This is similar to how the x64 assembler implementation works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make sure I understand, similarly to how x86 implementation, you would like to have two methods, fpu_to_sint
and fpu_to_uint
in asm, and also, those methods should be passed a temporary register in masm
, rather than using the provided one in the case of unsigned_truncate
?
I have a few questions/remarks:
- what is the purpose of the temp reg in the
unsigned
variant? Why must this registry be allocated rather than used as a temp register? Is the caller expecting to find something in there? It is not explicitly passed as a writable register, but the x64 implementation implicitly converts it to one. - While splitting the
convert_
methods for x64 makes sense to me since they correspond to a single instruction, we must emit a bunch of checks that are aware of the sign in the case of aarch64. Do you suggest duplicating all the checks in signed and unsigned variants? Or do we have two methods in ASM that call the current implementation with thesigned
argument set accordingly? - I'm sorry if that sounds very ignorant, but why do we have to worry about clobbering a tmp register anyway? Can't it, by definition, be used for any purpose, and one should not have too strong an assumption about what it may contain, especially when handing it over to a callee?
- I'm surely missing something, but it seems to me that splitting the
convert
method into a signed and unsigned variant, and the use of tmp regs are orthogonal things, so I want to make sure I understand what you want correctly :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of the allocatable temporary register is to provide an extra
register to aid in the emission of the instruction sequence needed for the
unsigned case; in x64, 4 registers are required for the instruction sequence:
1 for the destination and 3 temporary registers. Since in the case of x64 the
instruction sequence is completely handled by Cranelift, there's no risk of
unintentionally clobbering any of the registers passed to it, the unintentional
clobbering risk is particularly important in the case of scratch registers in Winch: even
though they can be used for any purpose, one important detail about them is that
they are not tracked by Winch's regalloc therefore they must be used sparingly
and with extreme caution: a misuse could result in unintentional clobbering and
therefore subtle bugs. Ideally, the live range of the scratch registers should be
as short as possible to avoid potential bugs.
In this particular case the reason why I am suggesting improving the usage of
the scratch register is to minimize the risk of unintentional clobbering of such
registers: once a scratch register is passed as a parameter, its live range is
extended and the risk of misuse is greater. A concrete example in this case of
a potential risk of unintentional clobbering is passing the scratch register to
fpu_to_int
and then using the float scratch register in any of the methods
needed for the instruction sequence needed to fulfill the truncate instructions
(this is not the case today, but by passing the float scratch register around as
if it were any other register, it becomes harder to reason and audit in the
future).
There are some alternatives that I've thought about to improve the
handling/usage of the scratch registers to address some of the challenges
exposed in the previous paragraph. We could discuss those in a separate
issue/PR (e.g., relying on the type system to identify/audit scratch register usage and/or
providing exclusive access to the scratch registers, similar to what we have for
the allocatable registers). Care needs to be taken here to ensure that none of
the solutions explored affect compilation performance.
Taking a deeper look at your changes for the unsigned case, only 3 registers are
needed here compared to the x64 implementation which demands 4. This means that
you can effectively ignore the temporary register defined in the Masm
unsigned_truncate
signature and only rely on the float scratch register.
Taking this into account, my concrete suggestion to move forward here is to:
- Redefine the
unsigned_truncate
signature to be:
fn unsigned_truncate(
&mut self,
context: &mut CodeGenContext<Emission>,
src_size: OperandSize,
dst_size: OperandSize,
kind: TruncKind
) -> Result<()>;
This allows maximum flexibility to encode the ISA differences and more
importantly reduces register pressure in cases where only 3 registers are
needed, reducing unnecessary register pressure where possible is important to
minimize memory traffic.
This is similar to how we handle the div
instruction, given that the
requirements vary between ISAs.
- Redefine
fpu_to_int
to be:
fn fpu_to_int(
&mut self,
dst: Writable<Reg>,
src: Reg,
src_size: OperandSize,
dst_size: OperandSize,
kind: TruncKind,
signed: bool
) { ... };
Avoiding passing the scratch registers as params and shortening their live
range as much as possible by using them explicitly in each of the helper
functions that you have to fulfill the instruction sequence.
I believe that with this approach, my concerns will be addressed and there won't
be a need to split the signed/unsigned implementations (initially I was confused
with the TODO
comment in the code and had assumed that the implementation was
making use of a scratch register that wasn't actually needed, although that's
not the case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the detailed writeup! This makes things a lot clearer. I'll make the changes 👍
@saulecabrera I addressed most of the review comments but left a few questions |
9392766
to
9311b6e
Compare
as agreed in review
@saulecabrera I have made the requested changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for all your patience iterating on this. Left two minor comments and then we should be able to land this one.
let dst_ty = match dst_size { | ||
OperandSize::S32 => WasmValType::I32, | ||
OperandSize::S64 => WasmValType::I64, | ||
_ => bail!("invalid conversion size"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you can use CodeGenError::unexpected_operand_size()
instead. https://github.com/bytecodealliance/wasmtime/blob/main/winch/codegen/src/codegen/error.rs#L139
let dst_ty = match dst_size { | ||
OperandSize::S32 => WasmValType::I32, | ||
OperandSize::S64 => WasmValType::I64, | ||
_ => bail!("invalid conversion size"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here, you can use the appropriate error constructor.
This PR implements FPU to int conversions for winch, with the
unsigned_truncate
andsigned_truncate
methods.