InstCombine: Handle fadd in SimplifyDemandedFPClass (#174853)
Note that some of the tests currently fail under alive, though not
due to this patch; namely, when performing the fadd x, 0 -> x
simplification in functions with non-IEEE denormal handling.
The existing instsimplify ignores the denormals-are-zero hazard by
checking cannotBeNegativeZero instead of isKnownNeverLogicalZero.
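The signed-zero half of this is easy to reproduce in plain IEEE arithmetic: adding +0.0 to -0.0 yields +0.0, so fadd x, 0.0 -> x flips the sign bit when x is -0.0 (while fadd x, -0.0 -> x is safe for any x). The denormal hazard cannot be shown in Python, but the shape is analogous: under denormals-are-zero the addend x may be flushed, so the result is a zero rather than x. A minimal sketch of the signed-zero case (Python floats are IEEE-754 binary64):

```python
import math

# fadd x, +0.0 -> x is only sound if x cannot be -0.0:
x = -0.0
y = x + 0.0                              # IEEE: -0.0 + +0.0 rounds to +0.0
assert math.copysign(1.0, x) == -1.0     # x is negative zero
assert math.copysign(1.0, y) == +1.0     # y has lost the sign bit

# fadd x, -0.0 -> x is sound for every x, including -0.0:
assert math.copysign(1.0, -0.0 + -0.0) == -1.0
```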
Also note that the self handling doesn't really do anything yet,
other than propagating consistent known-fpclass information, until
there is multiple-use support.
This also leaves behind the original ValueTracking support, without
switching to the new KnownFPClass::fadd utility. This will be easier
to clean up after the subsequent fsub support patch.
[LowerMemIntrinsics] Propagate value profile to branch weights (#174490)
If the mem intrinsics have associated value profile information, we can synthesize branch weights when converting the intrinsics to loops.
Issue #147390
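The idea can be sketched at the value level (everything here is hypothetical: the value profile is modeled as a map from observed length to count, and the loop is assumed to copy a fixed number of bytes per iteration). The guard branch around the loop is weighted by how often the length was zero, and the backedge by the expected trip count:

```python
# Hypothetical value profile for a memcpy length operand: {length: count}.
profile = {0: 10, 16: 70, 4096: 20}
total = sum(profile.values())

# Guard branch "length == 0 -> skip loop": weight by observed frequency.
skip = profile.get(0, 0)
enter = total - skip

# Backedge weight: total iterations across all entries, for a loop that
# copies `step` bytes per iteration (ceiling division per observed length).
step = 16
iters = sum(cnt * -(-length // step) for length, cnt in profile.items())

# branch_weights-style pair for the latch: (take-backedge, exit-loop).
# Each entering execution exits exactly once.
backedge = (iters - enter, enter)

print(skip, enter, backedge)
```

This is only a sketch of how profile counts could map onto weights, not the pass's actual arithmetic.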
[AMDGPU] Rematerialize VGPR candidates when SGPR spills to VGPR over the VGPR limit
Before, when selecting candidates to rematerialize, we would only
consider SGPR candidates when there was an excess of SGPR registers.
Failing to eliminate the excess would result in spills to VGPRs.
This is normally not an issue, unless spilling to VGPRs results in
excess VGPRs.
This patch does 2 things:
* It relaxes the GCNRPTarget success criteria: now we accept regions
where we spill SGPRs to VGPRs, as long as this does not end up in
excess VGPRs.
* It changes isSaveBeneficial to consider the excess VGPRs (which
includes the SGPRs that would be spilled to VGPR).
With these changes, the compiler rematerializes VGPRs when spilling
the excess SGPRs would result in excess VGPRs.
[4 lines not shown]
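A toy model of the relaxed criterion (all names and numbers are hypothetical; real budgets depend on occupancy and subtarget): an SGPR excess is tolerated as long as the VGPRs needed to hold the spilled SGPRs still fit under the VGPR limit.

```python
WAVE_SIZE = 64  # wave64: one VGPR provides 64 32-bit spill lanes

def satisfies_target(sgprs, vgprs, max_sgpr, max_vgpr):
    """Toy version of the relaxed success criterion."""
    spilled_sgprs = max(0, sgprs - max_sgpr)
    # Each VGPR can hold WAVE_SIZE spilled SGPRs (one per lane).
    spill_vgprs = -(-spilled_sgprs // WAVE_SIZE)
    # SGPR-to-VGPR spills are acceptable iff total VGPRs stay in budget.
    return vgprs + spill_vgprs <= max_vgpr

print(satisfies_target(110, 60, 102, 64))  # 8 SGPRs spill into 1 VGPR -> True
print(satisfies_target(110, 64, 102, 64))  # the spill VGPR goes over -> False
```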
Remove LLVM_ABI from symbolicate declaration in BacktraceTools.h (#175764)
The class is already annotated with LLVM_ABI, so individual members shouldn't be.
[MLIR][Python] Improve Iterator performance. Don't `throw` in `dunderNext` methods. (#175377)
In
https://github.com/llvm/llvm-project/pull/174139#issuecomment-3733259370
I wrote a scuffed benchmark that mostly iterates MLIR Container Types in
Python. My changes from that PR made the performance worse, so I closed
it.
However, when experimenting with that I also saw a large(?) performance
gain by changing the `dunderNext` methods of the various Iterators to
use `PyErr_SetNone(PyExc_StopIteration);` instead of `throw
nb::stop_iteration();`.
<details><summary>Benchmark attempt script</summary>
```python
import timeit
from mlir.ir import Context, Location, Module, InsertionPoint, Block, Region, OpView
[93 lines not shown]
[libc++] Simplify __unwrap_iter a bit (#175153)
`__unwrap_iter` doesn't need to SFINAE away, so we can just check inside
the function body whether an iterator is copy constructible. This
reduces the overload set, improving compile times a bit.
[AArch64][llvm] Improve codegen for svldr_vnum_za/svstr_vnum_za
When compiling `svldr_vnum_za` or `svstr_vnum_za`, the output
assembly has a superfluous `SXTW` instruction (GCC doesn't add
this); it should be excised, see https://godbolt.org/z/sz4s79rf8
In clang we're using `int64_t`, but `i32` in LLVM. The extra `SXTW`
is due to a call to `DAG.getNode(ISD::SIGN_EXTEND...)`. Make them
both 64-bit so the extra `SXTW` goes away.
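For reference, the `SXTW` being eliminated just sign-extends the low 32 bits of a register to 64 bits; a sketch of that operation (hypothetical helper name):

```python
def sxtw(v: int) -> int:
    """Sign-extend the low 32 bits of v to a 64-bit value (what SXTW does)."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v

# With an i32 vnum, ISD::SIGN_EXTEND materializes exactly this; carrying the
# offset as 64-bit from clang down through LLVM makes it unnecessary.
print(sxtw(5))           # 5
print(sxtw(0xFFFFFFFB))  # -5
```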
[X86] Add bf16 support to isFMAFasterThanFMulAndFAdd for basic FMA optimizations (#172006)
This PR extends `isFMAFasterThanFMulAndFAdd` in `X86ISelLowering` to
handle
bfloat types. This enables basic FMA optimizations for bf16
operations on AVX10.2 targets.
Includes tests for scalar and vector bf16 cases:
- Scalar bf16 FMA lowering (AVX10.2 does not support scalar bf16
operations)
- Vector bf16 FMA fusion for 128-bit, 256-bit, and 512-bit widths
AMDGPU: Change ABI of 16-bit element vectors on gfx6/7
Fix the ABI on old subtargets to match the new subtargets, packing
16-bit element subvectors into 32-bit registers. Previously
this would be scalarized and promoted to i32/float.
Note this only changes the vector cases. Scalar i16/half are
still promoted to i32/float for now. I've unsuccessfully tried
to make that switch in the past, so leave that for later.
This will help with removal of softPromoteHalfType.
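The packing can be pictured at the value level (hypothetical helpers, not backend code): adjacent 16-bit elements share one 32-bit register instead of each element being promoted to its own i32.

```python
def pack_v2i16(lo: int, hi: int) -> int:
    """Pack two 16-bit lanes into one 32-bit register (lo in bits 0-15)."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def pass_v4i16(elts):
    """New ABI: <4 x i16> occupies two 32-bit registers."""
    return [pack_v2i16(elts[0], elts[1]), pack_v2i16(elts[2], elts[3])]

def pass_v4i16_old(elts):
    """Old gfx6/7 ABI: scalarized, each element promoted to its own i32."""
    return [e & 0xFFFF for e in elts]

print(pass_v4i16([1, 2, 3, 4]))      # two packed registers
print(pass_v4i16_old([1, 2, 3, 4]))  # four registers
```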
GlobalISel: Fix mishandling vector-as-scalar in return values
This fixes two cases that arise when the AMDGPU ABI is fixed to pass
<2 x i16> values as packed on gfx6/gfx7. The ABI does not currently
pack values; this is a pre-fix for that change.
Insert a bitcast if there is a single part with a different size.
Previously this would miscompile by going through the scalarization
and extend path, dropping the high element.
Also fix assertions in odd cases, like <3 x i16> -> i32. This needs
to unmerge with excess elements from the widened source vector.
All of this code is in need of a cleanup; this should look more
like the DAG version using getVectorTypeBreakdown.
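The odd-element case can be sketched at the value level (hypothetical helpers; `None` stands in for undef lanes): <3 x i16> is widened to <4 x i16>, the widened vector is unmerged into 32-bit-sized parts, and the excess lane is simply unused.

```python
def widen_to_multiple(elts, per_part):
    """Pad a vector with undef (None) lanes until it splits evenly."""
    pad = (-len(elts)) % per_part
    return elts + [None] * pad

def unmerge(elts, per_part):
    """Split a widened vector into per_part-element register pieces."""
    w = widen_to_multiple(elts, per_part)
    return [w[i:i + per_part] for i in range(0, len(w), per_part)]

# <3 x i16>, two i16 lanes per 32-bit part: one excess (undef) element.
parts = unmerge([10, 20, 30], 2)
print(parts)  # [[10, 20], [30, None]]
```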
AMDGPU: Directly use v2bf16 as register type for bf16 vectors
Previously we were casting v2bf16 to i32, unlike the f16 case. Simplify
this by using the natural vector type. This is probably a leftover from
before v2bf16 was treated as legal. This is preparation for fixing a
miscompile in globalisel.
[Hexagon] Fix PIC crash when lowering HVX vector constants (#175413)
Fix a PIC-only crash in Hexagon HVX lowering where we ended up treating
a vector-typed constant-pool reference as an address (e.g. when forming
PC-relative addresses), which triggers a type mismatch during lowering.
Build the constant-pool reference with the target pointer type instead,
then load the HVX vector from that address.