[AArch64] Improve post-inc stores of SIMD/FP values
Add patterns to match post-increment truncating stores from lane 0 of
wide integer vectors (v4i32/v2i64) to narrower types (i8/i16/i32).
This avoids transferring the value through a GPR when storing.
Also remove the pre-legalization early-exit in combineStoreValueFPToInt
as it prevented the optimization from applying in some cases.
[LoopPeel] Peel last iteration to enable load widening
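In loops that perform several consecutive small loads (e.g., three i8
loads covering 3 bytes), peeling the last iteration makes it safe for
the remaining iterations to read beyond the bytes they access, because
the extra byte belongs to the next iteration and is loaded there anyway.
This enables a single wider load (e.g., i32) in each of the first N-1
iterations.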
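As a rough source-level illustration of the idea (not the pass itself),
the sketch below compares a loop doing three byte loads per element with
a hand-peeled variant that uses one 4-byte load for the first N-1
elements. The function names are hypothetical, and the byte extraction
assumes a little-endian target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: three separate byte loads per 3-byte element. */
static uint32_t sum_bytes_scalar(const uint8_t *p, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i, p += 3)
        sum += (uint32_t)p[0] + p[1] + p[2];
    return sum;
}

/* Hand-peeled variant (hypothetical helper, little-endian assumed).
 * The first n-1 iterations read 4 bytes each; the 4th byte is the
 * first byte of the next element, so the read stays in bounds.
 * The last element is handled separately with byte loads only. */
static uint32_t sum_bytes_peeled(const uint8_t *p, size_t n) {
    uint32_t sum = 0;
    if (n == 0)
        return 0;
    for (size_t i = 0; i + 1 < n; ++i, p += 3) {
        uint32_t wide;
        memcpy(&wide, p, sizeof wide);  /* one 4-byte load */
        sum += (wide & 0xff) + ((wide >> 8) & 0xff) + ((wide >> 16) & 0xff);
    }
    /* Peeled last iteration: no read past the final element. */
    sum += (uint32_t)p[0] + p[1] + p[2];
    return sum;
}
```

Without the peeled tail, the final 4-byte read would touch one byte past
the end of the buffer, which is why the last iteration keeps the narrow
loads.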
Patterns such as:
```
%a = load i8, ptr %p
%b = load i8, ptr %p+1
%c = load i8, ptr %p+2
...
%p.next = getelementptr i8, ptr %p, 3
```
can be transformed to:
```
%wide = load i32, ptr %p ; Read 4 bytes
```
[9 lines not shown]
Slightly adjust BUGS section for X509_addr_add_range()
Since x509_addr.c r1.95, X509_addr_add_range() clears the unused bits in
the maximum, so this is only true in some implementations.