
hv_macro.h -> broken -> USE_UNALIGNED_PTR_DEREF & config.h -> unintelligible -> U32_ALIGNMENT_REQUIRED d_u32align #22886

Open
bulk88 opened this issue Jan 5, 2025 · 1 comment


bulk88 commented Jan 5, 2025

Description

Problem 1

/* U32_ALIGNMENT_REQUIRED:
 *	This symbol, if defined, indicates that you must access
 *	character data through U32-aligned pointers.  */
#ifndef U32_ALIGNMENT_REQUIRED
#$d_u32align U32_ALIGNMENT_REQUIRED	/**/
#endif

That comment doesn't match the macro's real meaning.

Checking to see whether you can access character data unalignedly...
EOM
$cat >try.c <<EOCP
#include <stdio.h>
#define U32 $u32type
#define BYTEORDER $byteorder
int main() {
#if BYTEORDER == 0x1234 || BYTEORDER == 0x4321
    U8 buf[] = "\0\0\0\1\0\0\0\0";
    U32 *up;
    int i;
    if (sizeof(U32) != 4) {
	printf("sizeof(U32) is not 4, but %d\n", sizeof(U32));
	exit(1);
    }
    fflush(stdout);
    for (i = 0; i < 4; i++) {
	up = (U32*)(buf + i);
	if (! ((*up == 1 << (8*i)) ||   /* big-endian */
	       (*up == 1 << (8*(3-i)))))  {  /* little-endian */
	    printf("read failed (%x)\n", *up);
	    exit(2);
	}
    }
    /* write test */
    for (i = 0; i < 4; i++) {
	up = (U32*)(buf + i);
	*up = 0xBeef;
	if (*up != 0xBeef) {
	    printf("write failed (%x)\n", *up);
	    exit(3);
	}
    }
    exit(0);
#else
    printf("1\n");
    exit(1);
#endif
    return 0;
}
EOCP
set try
if eval $compile_ok; then
	echo "(This test may dump core.)" >&4
	./try >&2 >/dev/null
	case "$?" in
	0)	cat >&4 <<EOM
You can access character data pretty unalignedly.
EOM
		d_u32align="$undef"
		;;
	*)	cat >&4 <<EOM
It seems that you must access character data in an aligned manner.
EOM
		d_u32align="$define"
		;;
	esac
	$rm -f core core.try.* try.core
else
	rp='Can you access character data at unaligned addresses?'
	dflt='n'
	. ./myread
	case "$ans" in
	[yY]*)	d_u32align="$undef"  ;;
	*)	d_u32align="$define" ;;
	esac
fi

"This symbol, if defined, indicates that you must access character data through U32-aligned pointers."

means that the CC, the OS, and the CPU that Perl was compiled for DO NOT HAVE 8-bit SIZED CHARS AT ALL. I cannot imagine Perl >= year 2000 running on PDP-11s or Burroughs minicomputers from the 60s/70s, where a char is 4 octets wide, if Perl or Unix ever even supported such a CPU arch.

DEC Alpha (yes, WinNT 3.1-4.0 ran on it) was the only CPU in Perl 1.0-5.41's lifetime that required U32 hardware alignment for "unsigned char" array reads/writes, since DEC Alpha did not implement 8-bit anything in hardware. All CCs, on Unix and on Windows, had to emulate the U8/I8 and U16/I16 ANSI C types in software/assembly: MASK; READ; ADD; MASK; READ; SHIFT; SHIFT; OR; each time, to emulate C's and ASCII's char type.

    /* write test */
    for (i = 0; i < 4; i++) {
	up = (U32*)(buf + i);
	*up = 0xBeef;
	if (*up != 0xBeef) {
	    printf("write failed (%x)\n", *up);
	    exit(3);
	}
    }

This part of the code above proves that unaligned U32 * accesses (READS!!!! as well as writes) are supported. It has nothing to do with proving "this is not an Alpha CPU" by proving that the alignment of type U8 is 1 byte.

What is the real meaning of U32_ALIGNMENT_REQUIRED? The "this is an Alpha CPU" comment doesn't match the test code, and I can't imagine a C compiler not emulating the I8/U8 types in software on Alpha.

Something needs changing, probably the comment, so that it matches the test, since the macro and the test are both so old. Changing what the existing macro means would require grepping metacpan for its users; a new #define is easier, safer, and can't break ancient private XS code.

Problem 2

Perl 5.10-5.41 does not take advantage of modern unaligned memory access, which is available on almost all OSes/CPUs that Perl supports in 2024. This list includes X86, X64, and ARM64. In 2024, SEGV platforms like Palm Pilot-era and J2ME-era ARM32, SPARC, IA64, PA-RISC, and Power/PowerPC are a tiny minority, or their build support has already been removed.

When Perl is built with GCC at -O2, GCC itself will emit an inlined/intrinsic unaligned memory access for memcpy(&var_u32, ua_ptr, 4); if the CPU/OS combo allows it. For tiny fixed-length mem reads/writes, GCC will not actually go through the ELF symbol table and execute memcpy in libc. Calling libc memcpy is fundamentally WRITE_PTR; WRITE_PTR; WRITE_PTR; JUMP_PTR; and can't optimize to anything better.
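The memcpy() idiom described above can be sketched as follows. This is a minimal illustration, not code from Perl core: the helper names load_u32_native(), store_u32_native() and roundtrip_all_offsets() are hypothetical, and whether a given compiler inlines the fixed-size memcpy() into a single load/store has to be checked in that compiler's asm output. The round trip mirrors the write test in the Configure probe, but without dereferencing a misaligned U32 *.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a Perl API): read a 32-bit value, in native
 * byte order, from an address with no alignment guarantee. */
static inline uint32_t load_u32_native(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);   /* fixed size, so the CC is free to inline it */
    return v;
}

/* Hypothetical helper: the matching unaligned store. */
static inline void store_u32_native(void *p, uint32_t v)
{
    assert(p != NULL);
    memcpy(p, &v, sizeof v);
}

/* Store and reload v at byte offsets 0..3 of a small buffer.
 * Returns 1 if every store/load pair round-trips, 0 otherwise. */
static int roundtrip_all_offsets(uint32_t v)
{
    unsigned char buf[8];
    int i;
    for (i = 0; i < 4; i++) {
        memset(buf, 0, sizeof buf);
        store_u32_native(buf + i, v);
        if (load_u32_native(buf + i) != v)
            return 0;
    }
    return 1;
}
```

Because the access goes through memcpy(), the code stays inside ISO C on alignment-required CPUs as well; only the quality of the generated code differs.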

If the CPU/OS combo prohibits unaligned access, GCC will emit something similar to this not-perfect pseudo-code:

MASK; JUMP_VAL_IN_REG; READ; OR; READ; OR; READ; OR; READ; OR;

switch( ptr_u32 & 0x3)  {
case 3: u32 = READ_CHR++;
case 2: u32 = (u32 << 8)  | READ_CHR++;
case 1: u32 = (u32 << 8)  | READ_CHR++;
default: u32 = (u32 << 8)  | READ_CHR++;   }

which is 10 ops vs memcpy's 4 ops. And didn't Perl request gcc -O2?

The other problem is that a C compiler's inlined intrinsic memcpy for alignment-required CPUs can't be hand-written in C, no matter how hard someone tries. You can't assign random integers to current_instruction_register (function ptrs are N/A), and you absolutely can't find out the machine code size of | or << in C.

U32 * pu32;
U32 u32_1, u32_2, u32_3, u32_4, u32_targ;
arm32_restore_multiple_zx_u8(pu32, &u32_1 , &u32_2 , &u32_3 , &u32_4);
arm32_pack_vector_horizontal(&u32_targ, u32_1, u32_2, u32_3, u32_4);

Nope. That is not C. That is assembly language. I'm not desperate enough to write or maintain that for job code. The most I'll do is manipulate cl.exe -O_ -G_ -Q_ -Z_ -ARCH:_ foo.c flags. I'll let MSVC deal with the rest.

The x86-32/x64 SSE mov_128bits_unaligned opcode (MOVUPS), https://www.felixcloutier.com/x86/movups

__m128 _mm_loadu_ps ( float * p);
__m128 _mm_maskz_loadu_ps( __mmask8 k, void * s);

is also pure assembly. The CC's inlined intrinsic memcpy() could very easily become that opcode on modern Perl on AMD x64; for me on MSVC 2022 it sometimes does, vs 2 U64 OPs. So blead perl's current default U8TO32_LE(ptr) _shifted_octet backend implementation in S_perl_hash_siphash_1_3_with_state_64() has problems, or at least risks missing the faster, better backends that any modern CC would provide if memcpy() were used.

So there is a config.h / Config.pm macro called d_u32align / U32_REQUIRES_ALIGNMENT from

4e0554e - jhi - 4/5/2001 1:47:01 PM
Introduce d_u32align / U32_REQUIRES_ALIGNMENT, needed for

It is outright broken on Win32/Win64 because those builds use a canned config.sh and never run the Configure app. I've never bothered making a patch, since there was nothing to optimize in core or core extensions if I turned it on for Win32 Perl.

Old ML post about UA memory https://www.nntp.perl.org/group/perl.perl5.porters/2015/11/msg232805.html

#12565 is a rejected patch I wrote years ago, which looked less than perfect after I reread it a few times. Any replacement would look so different, and carry such a different title, that it is a new ticket: it would have to create a NEW public API and use that API all over core, which means much less "visual" code than special-casing and branching in many places to fix each problem area using only pre-existing tools. Because I used a non-default hash algo, it wasn't a priority to write a new patch that would not change my libperl.dll.

More recently, the default HV hash algorithms were replaced/improved/upgraded, and a default-build-flags Perl actually has an optimization dependent on OS/CPU UA mem support. The test macro is USE_UNALIGNED_PTR_DEREF, but there is a bug in blead Perl and Perl >= 2019, and a lot of code motion has created new problems.

This commit abandons macro U32_ALIGNMENT_REQUIRED in core, the commit refers to https://rt.perl.org/Ticket/Display.html?id=133495

e8864db
Matt Turner - 9/5/2019 12:48:56 AM
Clean up U8TO*_LE macro implementations

Later on, another copy cat macro is added, now named USE_UNALIGNED_PTR_DEREF. Nothing in core/blead will set this macro or knows about it. No core dev, cpan dev, or humble perl dev, knows about it (no POD, no docs).

ed16b18
Yves Orton - 11/5/2019 6:05:17 PM
rework U8TOxx_LE macros to force unsigned access

The default build Perl's zaphod32_hash_with_state() actually uses macro USE_UNALIGNED_PTR_DEREF, and has 2 different implementations (performance) keyed off that macro. So Perl needs both backend CPP feature flag macro cleanup/Configure level cleanup, and front end CPP function-like macro cleanup. Win32/Win64 Perl needs its canned config.sh/config.h to turn on "UNALIGNED_OKAY". All "normal" CPU archs for WinNT 3.1-Win 11, or Win16 1.0-3.11, have fast hardware native unaligned memory access.

The current 5.41.7 code has problems and assumptions.

#define _shifted_octet(type,ptr,idx,shift) (((type)(((const U8*)(ptr))[(idx)]))<<(shift))
#define U8TO32_LE(ptr)   (_shifted_octet(U32,(ptr),0, 0)|\
                                 _shifted_octet(U32,(ptr),1, 8)|\
                                 _shifted_octet(U32,(ptr),2,16)|\
                                 _shifted_octet(U32,(ptr),3,24))

I think this is less portable than falling back to inline intrinsic memcpy(). See below for ASM output samples. I verified GCC will turn the above into a U32 op. MSVC 2022 refused. I see an ISO C spec problem with the above. How does the CC (any) know it's safe to combine 4x U8 into 1x U32 UA?
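To make the comparison concrete, here is a minimal sketch of the two backends side by side. The function names u8to32_le_shift() and u8to32_le_memcpy() are hypothetical; the first has the same shape as the quoted U8TO32_LE macro, the second is a memcpy() load followed by a byte swap on big-endian hosts (detected at run time here purely to keep the sketch self-contained). Both return the same value; which one a given CC turns into a single U32 load still has to be checked in the asm output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shift-and-OR form, same shape as the quoted U8TO32_LE macro. */
static uint32_t u8to32_le_shift(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] <<  8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* memcpy form: one native-order load, then a swap if the host is
 * big-endian, so the result is always the little-endian interpretation. */
static uint32_t u8to32_le_memcpy(const unsigned char *p)
{
    const uint16_t probe = 1;
    unsigned char first;
    uint32_t v;

    memcpy(&v, p, sizeof v);
    memcpy(&first, &probe, 1);        /* 1 on little-endian, 0 on big-endian */
    if (first == 0) {
        v = (v >> 24)
          | ((v >> 8) & 0x0000FF00u)
          | ((v << 8) & 0x00FF0000u)
          | (v << 24);
    }
    return v;
}
```

For the bytes 01 02 03 04 both functions return 0x04030201, at any alignment of p.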

How does the CC know that the user IS NOT FORCING U8* derefs, because memcpy() SEGVed in prior code versions? How does the CC know that the user is deref-ing malloc() or .so/.dll memory, and NOT a memory mapped I/O window to a ring buffer in an enterprise 100 Gbps PCIe Ethernet card with a full TCP/TLS/HTTP2 offload engine? A Mellanox or a Xilinx or this https://www.nvidia.com/en-us/networking/products/data-processing-unit/ ?

Of course there is UB in the C spec, but it's very reasonable in 2024 to find a production use case for "forced U8*" and file a bug ticket against the CC vendor. If the CC vendor rejects the ticket with a "Turing Tarpit" rationale, or academic [malicious] compliance with the abstract C virtual machine, that CC will be instantly gone from all downstreams. Maybe "forced U8*" can be a real bug ticket with something atomic/interlocked. Definitely mmap IO windows in user-mode, kernel-like, driver-like C libraries can justify U8* vs U32*.

The last 2 use cases can legitimately be rejected/closed by the CC vendor with "Why don't you use the volatile type decl tag?" if the CC project wants the easiest fix, i.e. no fix. I don't remember if the volatile tag is syntax-legal inside a pointer cast operator or on C autos/typedefs. If it isn't, then the ticket is valid.
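For reference, ISO C does accept the volatile qualifier both inside a pointer cast and in a typedef. A minimal sketch of "forced U8*" reads built that way follows; the names vc_u8 and read_u32_le_forced_bytes() are hypothetical. Each volatile access has to be performed as written, so the four byte reads are not supposed to be merged into one wider load, though that is still worth confirming in the asm output of each CC.

```c
#include <assert.h>
#include <stdint.h>

/* volatile in a typedef: every access through this type is a real byte read. */
typedef volatile const uint8_t vc_u8;

/* Hypothetical helper: little-endian 32-bit read done as four separate,
 * ordered byte accesses. */
static uint32_t read_u32_le_forced_bytes(const void *p)
{
    vc_u8 *b = (vc_u8 *)p;      /* volatile inside the pointer cast */
    uint32_t v;

    v  = (uint32_t)b[0];        /* one statement per access keeps the order explicit */
    v |= (uint32_t)b[1] << 8;
    v |= (uint32_t)b[2] << 16;
    v |= (uint32_t)b[3] << 24;
    return v;
}
```

The cost is that the CC can no longer fuse the reads even where fusing would be safe, which is the opposite of what the hash code wants.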

click to expand hidden examples - hardware examples and OSX example aren't critical to understand this ticket
Or 30 years ago, in the MS-DOS era, a soundcard with a terrible spaghetti API, where reading a mmap IO U8 * has side effects? Like executing a queued job, so the CPU gets blocked for 1000 nanoseconds, and on control return the U8 * popped out a U8 job status code? Reading all 4 U8 *s as one U32 * would fail and purge all 4 queued job objects.

I can write an unrealistic not production but real test, breaking U32 and shift, vs 3 x U8 UA, by using mprotect() and mmap() and 4096 page boundaries.
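A rough sketch of how such a test could be laid out on a POSIX system, assuming mmap() with MAP_ANONYMOUS and mprotect() are available. The function name three_bytes_before_guard_page() is hypothetical. It places 3 bytes at the very end of a readable page whose neighbour is PROT_NONE and reads them as three U8 loads. The 4-byte read at the same address, which would touch the protected page, is deliberately not performed here.

```c
#define _DEFAULT_SOURCE   /* assumption: needed for MAP_ANONYMOUS on glibc in strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#  define MAP_ANONYMOUS MAP_ANON
#endif

/* Returns 1 if the three byte reads next to the guard page give the
 * expected 24-bit value, 0 if not, -1 if the mapping could not be set up. */
static int three_bytes_before_guard_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len;
    unsigned char *base;
    volatile unsigned char *p;
    uint32_t v;

    if (page <= 0)
        return -1;
    len = 2 * (size_t)page;

    base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    /* Second page becomes inaccessible: any access to it faults. */
    if (mprotect(base + page, (size_t)page, PROT_NONE) != 0) {
        munmap(base, len);
        return -1;
    }

    /* Last 3 valid bytes; p + 3 is the first byte of the guard page. */
    p = base + page - 3;
    p[0] = 0x11;
    p[1] = 0x22;
    p[2] = 0x33;

    /* Three U8 reads stay inside the valid page. A 4-byte load from p
     * would cross into the PROT_NONE page. */
    v  = (uint32_t)p[0];
    v |= (uint32_t)p[1] << 8;
    v |= (uint32_t)p[2] << 16;

    munmap(base, len);
    return v == 0x332211u;
}
```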

memcpy can be argued to only be spec-defined for [DDR RAM stick] malloc() , .bin, .so, .dll memory. But read/write/C's = operator, are much more CPU hardware specific on what they do, vs some random libc's memcpy().

OSX's libc memcpy() will over-read most memory blocks by 1 byte, relying on the C spec's "suitably aligned" clause and the C spec's null-terminated clause and concept

  • combined with OSX's malloc, which always over-allocates by 1 byte internally, or which will never hand out a 16 byte long mem block pressed up against the very end of the very last valid 4096 byte page

  • combined with OSX's C linker, which will never align/push/lay out a C global var symbol right up to the absolute end of the very last valid 4096 byte page, at the very end of a Mach-O/ELF file.

ASAN/Valgrind users have been complaining for years about OSX overreading by 1 byte inside memcpy but I think overread is still Apple's policy.

Steps to Reproduce

Add all needed #ifdef #else #endif and eventually an assert()/croak() to zaphod32_hash_with_state() at

The assert/croak must verify or do:

if(  (_CPU_ARCH == _I386_  ||  _CPU_ARCH == _AMD64_ )
    && ! (USE_UNALIGNED_PTR_DEREF || U32_REQUIRES_ALIGNMENT)
)
    croak("This Perl will be very slow on this CPU without UA flag being on");

A chunk of Perl 5.41.7's S_perl_hash_siphash_1_3_with_state_64() compiled with MSVC 2022 x64 -O1. MSVC 2022 is incapable of optimizing 4 x U8 to 1 x U32.

movzx   eax, byte ptr cs:PL_hash_state_w
shl     r11, 8
or      r11, rax
movzx   eax, byte ptr cs:qword_180303768+6
or      r9, rax
movzx   eax, byte ptr cs:qword_180303768+5
shl     r9, 8
or      r9, rax
movzx   eax, byte ptr cs:qword_180303768+4
shl     r9, 8
or      r9, rax
movzx   eax, byte ptr cs:qword_180303768+3
shl     r9, 8
or      r9, rax
movzx   eax, byte ptr cs:qword_180303768+2
shl     r9, 8
or      r9, rax
movzx   eax, byte ptr cs:qword_180303768+1
shl     r9, 8
or      r9, rax
movzx   eax, byte ptr cs:qword_180303768
shl     r9, 8
or      r9, rax

Same code, same function, but Perl 5.40 compiled with GCC for Ubuntu for ARM64. GCC for ARM64 Linux, did optimize 4 x U8 to 1x U32 (aligned? unaligned?). Note it takes 4 OPs, to read 2 x U64.

"X2, [X2,PL_hash_state_w_ptr@PAGEOFF]" is ELF's overhead. "ADRP X2, PL_hash_state_w_ptr@PAGE" is the genetic flaw of ARM/all RISC, can't put a 32-bit const int, into a 32 bit long opcode slot, no space for the op type! All the RISC CPUs must read a U32 const into a register using a +/- 4096 bytes away offset memory read. Or the RISC CPU/CC vendor, must glue together 2 OPs each holding a U16 inline const lit.

ADRP            X2, PL_hash_state_w_ptr@PAGE
LDR             X2, [X2,PL_hash_state_w_ptr@PAGEOFF]
AND             X7, X1, #0xFFFFFFFFFFFFFFF8
ADD             X7, X0, X7
AND             W8, W1, #7
LSL             X5, X1, #0x38
LDP             X3, X4, [X2]
LDP             X1, X2, [X2,#0x10]

Expected behavior

Use hardware UA reads and writes for tiny known unaligned U16s/U24s/U32s/U64s on all Intel 32/64 builds, all OSes. The fallback should be memcpy(), which is probably a compiler inline intrinsic for tiny fixed length calls. Do not use the code below, unless someone manually checked each OS/CC's assembly to make sure the asm output is sane/as intended. If a CC is so backwards/basic/ancient, that it can't inline fixed length memcpys, it can't combine U8* derefs either. memcpy has better chances than this << | code, see elsewhere in this ticket for reasoning.

#define _shifted_octet(type,ptr,idx,shift) (((type)(((const U8*)(ptr))[(idx)]))<<(shift))
#define U8TO32_LE(ptr)   (_shifted_octet(U32,(ptr),0, 0)|\
                                 _shifted_octet(U32,(ptr),1, 8)|\
                                 _shifted_octet(U32,(ptr),2,16)|\
                                 _shifted_octet(U32,(ptr),3,24))

More expected behavior needed to "close" this ticket: offer CPAN XS a portable, documented public API of macros and/or CPP function-like macros for tiny UA reads/writes. The goal is to centralize/abstract "tiny portable UA reads/writes" in Perl core, so CPAN XS authors have minimal thinking and minimal typing to do for portable, works-everywhere UA reads/writes.

The current situation places the burden on CPAN XS authors to #ifdef or if() {} else{} OS-specific, CPU-specific, CC-specific code for UA OKAY and NO UA Perls. CCs and OSes will obviously be missed by CPAN XS authors, since they will only test and write conditional logic regarding UA for their specific personal OS/CPU (unless community patches come later).
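One possible shape for such an API, purely as a sketch: every ua_* / UA_* name below is hypothetical and none of it exists in Perl today. The only name taken from this ticket is USE_UNALIGNED_PTR_DEREF, used here as the switch between a direct pointer dereference (only for builds where the CPU/OS/CC combination has been verified to tolerate it, since it is formally outside ISO C) and the memcpy() fallback.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef USE_UNALIGNED_PTR_DEREF
/* Direct dereference: assumes the build has verified UA access is OK. */
#  define UA_DEFINE(T, name)                                   \
    static inline T ua_read_##name(const void *p)              \
    { return *(const T *)p; }                                  \
    static inline void ua_write_##name(void *p, T v)           \
    { *(T *)p = v; }
#else
/* Portable fallback: fixed-size memcpy, normally inlined by the CC. */
#  define UA_DEFINE(T, name)                                   \
    static inline T ua_read_##name(const void *p)              \
    { T v; memcpy(&v, p, sizeof v); return v; }                \
    static inline void ua_write_##name(void *p, T v)           \
    { memcpy(p, &v, sizeof v); }
#endif

UA_DEFINE(uint16_t, u16)
UA_DEFINE(uint32_t, u32)
UA_DEFINE(uint64_t, u64)

/* Writes one value of each width at an odd offset and reads it back.
 * Returns 1 if all three round-trip. */
static int ua_selftest(void)
{
    unsigned char buf[16];

    memset(buf, 0, sizeof buf);
    ua_write_u16(buf + 1, 0xBEEFu);                          /* bytes 1..2  */
    ua_write_u32(buf + 3, 0xDEADBEEFu);                      /* bytes 3..6  */
    ua_write_u64(buf + 7, UINT64_C(0x0123456789ABCDEF));     /* bytes 7..14 */

    return ua_read_u16(buf + 1) == 0xBEEFu
        && ua_read_u32(buf + 3) == 0xDEADBEEFu
        && ua_read_u64(buf + 7) == UINT64_C(0x0123456789ABCDEF);
}
```

An XS author would then call ua_read_u32(ptr) and never write a per-OS or per-CPU #ifdef; the decision lives in one place in core.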

Also fix comment in to match actual executed code behavior

/* U32_ALIGNMENT_REQUIRED:
 *	This symbol, if defined, indicates that you must access
 *	character data through U32-aligned pointers.  */
#ifndef U32_ALIGNMENT_REQUIRED
#$d_u32align U32_ALIGNMENT_REQUIRED	/**/
#endif

Perl configuration

-V skipped since it's not easily possible to write a .t in PP or C code that verifies GCC/Clang/MSVC's final machine code.

** Win32 OS Bedtime story, not important enough to this ticket, read in spare time **

i386/x64 AMD64/ARM32 Win8/ARM64, have native hardware unaligned access. The museum-only WinNT for MIPS/SPARC/ALPHA and Server 2003 for IA64 require "__unaligned" C type token modifier for "fast enough" unaligned access (guessing 3-4 ns UA vs 1 ns regular), which means, the CPU has mandatory dedicated CPU opcodes for unaligned U16/U32/U64 load()/store() or some special recipe to do unaligned using 2-5 assembly opcodes instead of calling libc() memcpy.

The public API contract for WinNT 3.1-4.0-Server 2003 guaranteed that MIPS/SPARC/ALPHA/IA64 CPUs supported transparent unaligned memory access just like i386, without the __unaligned type modifier. But it was suicidal on those 4 CPUs for production code to fall back to user mode's kernel32.dll exception handler try{} catch(e){} finally{} of last resort, which executes as the last step, right before the SEGV popup.

MIPS/SPARC/ALPHA/IA64's kernel32.dll had code, to look at the faulting parameters, data address, instruction address, decode faulting instructions, and synthesize them, advance instruction address in the struct _CONTEXT, then call no_return NtKrnlResumeWithThreadContext(context); but this took 0.5-1.5 milliseconds each time, and all C++ call frames have to reject the SEGV exception object, before the user mode last-resort exception handler will swallow the exception object.

IDK what happens on Windows, if someone removes last-resort catch frame, or someone starts a raw NT Native process without the "Windows API", and a raw NT Native process SEGVs. My wild guess is an instant BSOD, or the "critical service terminated" 5 second countdown GUI popup and then power off. Which is very logical, since a process object can't exit or end, without an U32 exit code, and a crashed process doesn't have a U32 exit code, and WinNT will not lie. Public Win32 API, requires passing var U32 exit_code to TerminateProcess() to force kill a hung process from another process.

I have previously debugged this and determined that exit code 0xc0000005 STATUS_ACCESS_VIOLATION is purely the choice of kernel32.dll. If I broke/removed the last-resort handler, ntdll.dll RtlDispatchException() will throw a STATUS_INVALID_DISPOSITION or STATUS_NONCONTINUABLE_EXCEPTION back to Ring 0, and Ring 0 does a nested callback again to RtlDispatchException() [this is really for a C debugger]. If, a 2nd time, no try/catch frame swallows the exception, RtlDispatchException() will call Ring 0 with a NtRaiseHardError() call, which the public API says will become an instant BSOD, unless there is a Ring 0 kernel debugger attached.

I learned there is nothing to gain by removing the Kernel32.dll last-resort exception handler and leaving the process naked. Either replace the K32 handler with YOUR SEH handler that calls ExitProcess (but why? if you are on top of K32's, using public API, and catch all objects, it's the same behavior and a clean vanilla public API design).

If it's GCC/Perl, which can't do SEH on 32b (I object, I made a POC), the only intelligent choice on GCC is SetErrorMode (https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-seterrormode?redirectedfrom=MSDN) to disable the popup 99% of the time.

The popups in the Visual C IDE from C debugger can't be selectively filtered, from the target process side. K32 last-resort dispatched the event packets to the remote C debugger, but last-resort is always on top, and always on the bottom of the exception handler stack. Damaging/disabling K32 last-resort handler, is detected from Ring 0, and Ring 0 changes exception dispatch behavior for naked or native NT processes. So then even though I inserted a custom last-resort, Win32 Perl's exception handlers, and MS CRT's handlers, started trapping objects because K32's secret first-peek was disabled, that can create/dispatch/block on the VS IDE C Dbgr, before Perl/MS CRT get to see it.

So after that BSOD incident with the NtRaiseHardError() syscall, I've never tried again to remove the K32 exception handler of last resort. No benefit is gained at runtime if my new code is bug-free and stable. Any bug, any mistake, and I will instantly fire off an unstoppable NT kernel panic, losing a few mins of work in my IDE.

The original purpose/goal of removing K32's last-resort exception trap, was to emulate on Win2K the API added in Windows XP https://learn.microsoft.com/en-us/windows/win32/api/errhandlingapi/nf-errhandlingapi-addvectoredcontinuehandler

It wasn't worth the work to emulate AVCH() for the rapidly aging Win2K OS when WinXP had what I needed as public API.

The final solution I used was just compiling my own forked perl.exe with a patched perl_construct() for Win2k boxes at work, instead of doing crazy things at runtime hacking active_perl.exe .


bulk88 commented Jan 5, 2025

Update: I found this situation where I think GCC refused to fuse U8s into U32s/U64s. IDK why; I don't have enough background knowledge. This is from the official Ubuntu package, Perl 5.40 compiled with GCC for ARM64, inside S_perl_hash_siphash_1_3_with_state_64(). The U8TO32_LE()/_shifted_octet() backend failed to optimize with GCC here. I don't know where the 8th U8 is being read, if the highest U8 is read at all. ARM64 isn't my focus. But p[0] thru p[6] were read as U8s, with no optimization to U32 or U64.

; @S_perl_hash_siphash_1_3_with_state_64.constprop.0_1
.text:U8A53A0                 LDRB            W0, [X7,#6]
.text:U8A53A4                 ORR             X5, X5, X0,LSL#48
.text:U8A53A8                 LDRB            W0, [X7,#5]
.text:U8A53AC                 ORR             X5, X5, X0,LSL#40
.text:U8A53B0                 LDRB            W0, [X7,#4]
.text:U8A53B4                 ORR             X5, X5, X0,LSL#32
.text:U8A53B8                 LDRB            W0, [X7,#3]
.text:U8A53BC                 ORR             X5, X5, X0,LSL#24
.text:U8A53C0                 LDRB            W0, [X7,#2]
.text:U8A53C4                 ORR             X5, X5, X0,LSL#16
.text:U8A53C8                 LDRB            W0, [X7,#1]
.text:U8A53CC                 ORR             X5, X5, X0,LSL#8
.text:U8A53D0                 LDRB            W0, [X7]
.text:U8A53D4                 ORR             X5, X5, X0
