[DAG] Fold INT_TO_FP( FP_TO_INT (x) ) to FTRUNC(X) (#198477)
Extends the `foldFPToIntToFP` DAG Combine so that it can now be applied
when `FTRUNC` has a custom lowering, provided that `INT_TO_FP
(FP_TO_INT (X))` is not already legal.
On AArch64 targets with SVE, this change simplifies the codegen of
`INT_TO_FP (FP_TO_INT (X))` conversions by making use of the `frintz`
instruction.
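As a rough illustration of why the fold is sound for in-range values, the sketch below compares an integer round trip with a plain truncation. It is not the DAGCombiner code: `std::trunc` stands in for `FTRUNC`, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Models INT_TO_FP(FP_TO_INT(x)) for a signed 32-bit integer type.
// Only meaningful when x is finite and its truncation fits in int32_t;
// outside that range the C++ cast itself is undefined behaviour.
inline float round_trip_via_int(float x) {
  return static_cast<float>(static_cast<std::int32_t>(x));
}

// Models the folded form: truncate toward zero without leaving
// floating point (std::trunc is a stand-in for FTRUNC here).
inline float folded_trunc(float x) {
  return std::trunc(x);
}
```

For inputs such as `2.75f` or `-3.5f` both helpers return the same value, which is the property the combine relies on.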
[IR] Add fast-math support to {u,s}itofp (#198470)
- `{u,s}itofp` are floating point typed values.
- The CodeGen part (`foldFPToIntToFP` in DAGCombiner) needs `nsz` to fold
the pattern (uitofp (fptoui x)) -> (trunc x).
- LLVM has intrinsic variants of `{u,s}itofp`, which already support
fast-math flags.
Optimization flags now require 9 bits in bitcode; the fast-math flags of
`uitofp` are stored in the high 8 bits.
The VPlan part may need some extra work, since it assumes optimization
flags from different categories are disjoint.
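The `nsz` requirement can be seen with a negative zero input. The sketch below is only a C++ model of the two forms, with hypothetical helper names: the unsigned round trip yields `+0.0`, whereas truncation preserves the sign and yields `-0.0`, so the fold is only valid when the sign of zero may be ignored.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Models (uitofp (fptoui x)); -0.0f converts to 0u, then to +0.0f.
inline float unsigned_round_trip(float x) {
  return static_cast<float>(static_cast<std::uint32_t>(x));
}

// Models (trunc x); the sign of a zero input is preserved.
inline float truncated(float x) {
  return std::trunc(x);
}

// True when the two forms agree bit-for-bit on the sign of the result.
inline bool same_sign(float x) {
  return std::signbit(unsigned_round_trip(x)) == std::signbit(truncated(x));
}
```

With `-0.0f` as input the two results compare equal under `==` but differ in their sign bit, which is exactly the difference `nsz` permits the optimizer to disregard.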
[LoopFusion] Reform LCSSA after peelFusionCandidate's peelLoop (#200442)
peelLoop's internal simplifyLoop call requires LCSSA to be preserved
across it, but the cloned exit edges and cloned defs that peelLoop
introduces are not reflected in the existing LCSSA phis, so the contract
cannot be honoured. Pass PreserveLCSSA=false to peelLoop here and reform
LCSSA on the affected nest immediately afterward. LCSSA is expected
before and after peel+fuse, just not during it.
Caught by yarpgen fuzzing of clang -O3 -fexperimental-loop-fusion -mllvm
-loop-fusion-peel-max-count=8 on AArch64.
Fixes #199418
[libc++] Simplify unique_ptr constructor SFINAE (#201305)
This patch does a couple of things:
- inline aliases to `__enable_if_t`s, making it easier to understand
what's actually going on
- make the `enable_if`s dependent via a `class _Deleter = deleter_type`
instead of a `bool` and `__dependent_type`, reducing the number of
instantiated classes
- remove `__unique_ptr_deleter_sfinae`
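A minimal sketch of the second point, using a toy class rather than libc++'s actual `unique_ptr`: the defaulted template parameter `class Deleter = deleter_type` makes the `enable_if` condition dependent, so no helper `bool` or `__dependent_type` is needed. All names below are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Toy stand-in for unique_ptr; not the libc++ implementation.
template <class T, class D>
class toy_ptr {
public:
  using deleter_type = D;

  // Disabled when the deleter is a raw function pointer. The defaulted
  // parameter Deleter makes the condition dependent, so SFINAE applies.
  template <class Deleter = deleter_type,
            typename std::enable_if<!std::is_pointer<Deleter>::value,
                                    int>::type = 0>
  toy_ptr() noexcept : ptr_(nullptr), del_() {}

  toy_ptr(T* p, D d) noexcept : ptr_(p), del_(d) {}
  toy_ptr(const toy_ptr&) = delete;
  toy_ptr& operator=(const toy_ptr&) = delete;
  ~toy_ptr() { if (ptr_) del_(ptr_); }

  T* get() const noexcept { return ptr_; }

private:
  T* ptr_;
  D del_;
};

// A class-type deleter and a function-pointer deleter for comparison.
struct int_deleter {
  void operator()(int* p) const noexcept { delete p; }
};
using fn_deleter = void (*)(int*);
```

Here `toy_ptr<int, int_deleter>` is default constructible while `toy_ptr<int, fn_deleter>` is not, mirroring the constraint the standard places on `unique_ptr`'s default constructor.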
[mlir][tosa] Improve folder conformance to TOSA specification (#200223)
This commit fixes some bugs in TOSA folders that cause non-conformant
results. The fixes include:
- tosa.intdiv - Folding when the lhs and rhs are zero. In the TOSA
specification this is undefined behaviour.
- tosa.div_ceil_shape/tosa.div_floor_shape - Folding when the lhs is
negative or the rhs is non-positive. In the TOSA specification this is
undefined behaviour.
In addition, some test cases have been added for non-exercised code
paths, including:
- tosa.intdiv - Rejects overflow cases
- tosa.greater/tosa.greater_equal/tosa.equal - Correctly evaluates NaN
cases to False.
- tosa.cast - Saturating rounding when input is out of range of the
output type.
- tosa.mod_shape - Rejects cases where lhs is negative or rhs is
non-positive.
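The guarded folds described above could be sketched as follows. This is a standalone C++17 model, not the MLIR folder code: the function names are hypothetical, and declining to fold on any zero divisor (rather than only when both operands are zero) is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Returns a folded value only when the division is well defined;
// std::nullopt means "leave the op unfolded".
inline std::optional<std::int32_t> fold_intdiv(std::int32_t lhs,
                                               std::int32_t rhs) {
  if (rhs == 0)
    return std::nullopt;  // covers the lhs == 0 && rhs == 0 case
  if (lhs == std::numeric_limits<std::int32_t>::min() && rhs == -1)
    return std::nullopt;  // result would overflow int32
  return lhs / rhs;       // C++ division truncates toward zero
}

// Ceiling division for shape values; declines when lhs is negative
// or rhs is non-positive.
inline std::optional<std::int64_t> fold_div_ceil_shape(std::int64_t lhs,
                                                       std::int64_t rhs) {
  if (lhs < 0 || rhs <= 0)
    return std::nullopt;
  return lhs / rhs + (lhs % rhs != 0 ? 1 : 0);
}

// Ordered comparison: evaluates to false if either operand is NaN.
inline bool fold_greater(float a, float b) { return a > b; }
```

Returning an empty optional keeps the "do not fold" decision explicit at the call site instead of producing an arbitrary constant for inputs the specification leaves undefined.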
[NVPTX] Fix illegal combineInsertEltToShuffle pattern (#198259)
Adds a condition to the `isShuffleMaskLegal` method to prevent
`combineInsertEltToShuffle` from creating an illegal `vector_shuffle`
after type legalization, which leads to a crash.
Context:
This is triggered when bitcasting a `v2f16` into a vector that type
legalizes to a `v2i32`. This happens on architectures supporting packed
`f32` operations (>= `sm_100`). In certain cases, this leads to
`combineInsertEltToShuffle` creating a vector shuffle with `v4f16` for
simplification. Since this happens after type legalization and `v4f16`
is an illegal type, it leads to a crash.
[GlobalISel] Combine sext(load), zext(load) patterns when the load has multiple uses (#182831)
Extend the existing combiners for sext(load), zext(load) patterns to also work when the load has multiple uses.
[LV] Optimize partial reduction extends before handling inloop subs (#199665)
The crash avoided in #194660 was caused by the extend optimizations
failing to match due to the extra sub/negation added to the
"ExtendedOp".
A similar crash exists for [us]abs partial reductions (see
https://godbolt.org/z/MerMon5rE), which is fixed with this patch.
This patch solves the underlying issue by running the extend
optimizations before any inloop sub/fsub handling.
Fixes #194000
[clang-tidy] Avoid brace fix-it crash in macro body expansion (#198788)
`readability-braces-around-statements` could assert when diagnosing an
unbraced statement that ends in the middle of a macro body expansion. It
would be hard/unsafe to give fix-its for such cases, so treat them as
diagnostic-only.
Closes https://github.com/llvm/llvm-project/issues/198711
[AMDGPU] In `LowerDYNAMIC_STACKALLOC`, hoist the readfirstlane up one instruction
Instead of:
```
$max_size_vgpr = wave_reduction_umax($vgpr_alloca_size)
$sgpr_newsp = readfirstlane($max_size_vgpr + $sgpr_sp)
```
Hoist the readfirstlane up to perform the addition using scalar
registers:
```
$max_size_sgpr = readfirstlane(wave_reduction_umax($vgpr_alloca_size))
$sgpr_newsp = $max_size_sgpr + $sgpr_sp
```
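The equivalence of the two forms can be checked with a toy lane model. This is only a sketch under assumptions not stated in the commit: a 4-lane wave, a reduction whose result is broadcast to every lane, and lane 0 being the first active lane. The function names mirror the pseudocode but are ordinary C++ helpers.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;  // toy wave size (assumption)
using Vgpr = std::array<std::uint32_t, kLanes>;

// Toy reduction: every lane receives the maximum across the wave.
inline Vgpr wave_reduction_umax(const Vgpr& v) {
  Vgpr out;
  out.fill(*std::max_element(v.begin(), v.end()));
  return out;
}

// Toy readfirstlane: assumes lane 0 is the first active lane.
inline std::uint32_t readfirstlane(const Vgpr& v) { return v[0]; }

// Before: add the stack pointer per lane, then read one lane.
inline std::uint32_t new_sp_vector_add(const Vgpr& sizes, std::uint32_t sp) {
  Vgpr sum = wave_reduction_umax(sizes);
  for (auto& lane : sum)
    lane += sp;
  return readfirstlane(sum);
}

// After: read one lane first, then do a single scalar addition.
inline std::uint32_t new_sp_scalar_add(const Vgpr& sizes, std::uint32_t sp) {
  return readfirstlane(wave_reduction_umax(sizes)) + sp;
}
```

Because the reduced value and the stack pointer are both uniform across the wave, the two orderings give the same result; the second simply avoids the per-lane addition.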
[SelectionDAG] Pre-commit tests for dagcombine improvements (#201270)
I've got a stack of dagcombine improvements that together make an
infinite cycle relating to freeze insertion in vector-manipulation IR.
Here we have
- Handling freeze(undef) in demanded-elts for shufflevector
- Improvements to noundef checks for bitcast, concat, and select
- Improvements to extract(concat), extract(extract), and extract(insert)
handling
Even though the regression I'm fixing is an AMDGPU one, these tests are
mainly X86 because the AMDGPU calling convention makes it hard to
demonstrate the folds I'm adding.
AI note: I got an LLM to find most of these tests, especially some of
the fiddly ones that needed control flow.
[lldb] Use batched memory reads in ClassDescriptorV2::relative_list_entry_t (#201284)
This reduces the number of memory reads performed when reading
Objective-C class metadata.
Note: these addresses are indeed sequential (with a small offset between
them), but there are so many of them that they would not fit into a
single Process::ReadMemory cache line, so this is still a win, and it
also puts the code into the right shape for vectorizing the next read in
the same loop, which will see the biggest savings.
[AArch64][llvm] Restrict luti6 (4 regs, 8-bit) to 0 <= Zn <= 7
The `luti6` instruction (table, four registers, 8-bit) should only
allow `0 <= Zn <= 7`, since there's only 3 bits. It actually allows:
```
luti6 { z0.b - z3.b }, zt0, { z8 - z10 }
```
which produces a duplicate encoding to the following:
```
luti6 { z0.b - z3.b }, zt0, { z0 - z2 }
```
Fix tablegen to ensure Zn is only allowed in the correct range of 0 to 7.
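The aliasing can be illustrated with a small model of a 3-bit register field. This is a sketch, not the tablegen change: it assumes an unchecked encoder simply keeps the low bits of the register number, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kZnFieldBits = 3;  // field width, per the commit message

// What an unchecked encoder would effectively do: keep the low 3 bits.
constexpr std::uint32_t encode_zn_unchecked(unsigned zn) {
  return zn & ((1u << kZnFieldBits) - 1u);
}

// The range restriction: only register numbers 0..7 fit in the field.
constexpr bool is_valid_zn(unsigned zn) {
  return zn < (1u << kZnFieldBits);
}
```

Under this model `z8` and `z0` produce the same field value, which is why the out-of-range operand has to be rejected rather than encoded.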
[NFC][GlobalISel] Refactor ownership of InstructionMatchers (#200798)
- Clarify that the array of InstructionMatchers in the RuleMatcher is
for the roots only.
- Let RuleMatcher own all of the InstructionMatchers used for/by
predicates. They are all kept in an array in which the index of the
InstructionMatcher is equal to its InsnID, which eliminates some
redundant tracking.
- Remove duplicate tracking of InsnID from RuleMatcher;
InstructionMatcher does it on its own already.
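The ownership scheme described above might look roughly like the following. The `Toy*` types are hypothetical stand-ins, not the real GlobalISel TableGen classes; the point is only that one owning array indexed by InsnID replaces any separate ID bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy matcher; it records its own InsnID.
struct ToyInstructionMatcher {
  std::size_t InsnID;
  std::string Name;
};

class ToyRuleMatcher {
  // Owns every matcher; the index in this vector equals the InsnID.
  std::vector<std::unique_ptr<ToyInstructionMatcher>> Insns;
  // IDs of the root matchers only.
  std::vector<std::size_t> Roots;

public:
  ToyInstructionMatcher& addInsn(std::string Name, bool IsRoot) {
    const std::size_t ID = Insns.size();  // next free slot is the new ID
    Insns.push_back(std::make_unique<ToyInstructionMatcher>(
        ToyInstructionMatcher{ID, std::move(Name)}));
    if (IsRoot)
      Roots.push_back(ID);
    return *Insns.back();
  }

  // Lookup by InsnID is a plain bounds-checked index.
  ToyInstructionMatcher& getInsn(std::size_t ID) { return *Insns.at(ID); }

  std::size_t numInsns() const { return Insns.size(); }
  std::size_t numRoots() const { return Roots.size(); }
};
```

Since the ID is derived from the vector position at insertion time, the rule never needs a second map from matcher to ID.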
Co-authored-by: Pierre-vh <29600849+Pierre-vh at users.noreply.github.com>
[GlobalISel] Do not depend on the RuleMatcher at MatchTable emission (#200799)
Some PredicateMatchers/MatchAction/OperandRenderers relied on accessing
RuleMatcher at emission as a crutch.
Instead, make these classes collect all necessary information in the
constructor so the `emit` methods don't depend on RuleMatcher anymore.
The primary motivation for this is that I've been looking at ways to optimize the MatchTable better,
and the fact that Predicates/Actions/Renderers are not "pure" objects, in the sense that they keep
accessing a bunch of data all over the place even as late as emission, was a consistent pain.
This is NFCI. There are no changes to any of the match table for AMDGPU/AArch64 in this patch.
This patch has a bunch of noise due to function signature changes so I'll highlight the following interesting changes:
- `SameOperandMatcher` needed a bit of an update in its `canHoistOutsideOf` function. I had to rewrite it
but I think the end result is the same.
- `EraseInstAction` has been updated as well, and its users in both Combiner/ISel backends have been updated too.
Instead of ignoring this action if the Inst was already erased, it's now the responsibility of the
builder to never insert it in the first place. `BuildMIAction` had a small update because of that too.
[4 lines not shown]
[AMDGPU][SIMemoryLegalizer] Consider scratch operations as NV=1 if GAS is disabled
- Clarify that `thread-private` MMO flag is still useful.
- If GAS is not enabled (which is the default as of the last patch), consider an op as `NV=1` if it's a `scratch_` opcode, or if the MMO is in the private AS.
- Add tests for the new cases.
- Update AMDGPUUsage GFX12.5 memory model
[AMDGPU] Make globally-addressable-scratch opt-in
This feature is meant to be opt-in for more advanced users, not default-enabled.
It may reduce performance otherwise as we can't assume private AS is thread-local
when it is enabled.
- Add `HasGloballyAddressableScratchSupport` feature to check if a target's scratch
addressing is changed due to support for globally addressable scratch.
- Use `EnableGloballyAddressableScratch` to check whether the user opted into
globally addressable scratch. This affects whether to lower scratch atomics as flat,
and in the future will affect whether NV=1 can be set on scratch accesses.