[AArch64] Fix scheduling info for Armv8.4-a LDAPUR* instructions (#171637)
They were using the wrong scheduler resource. They are also missing from
the optimisation guides, but WriteLD should at least be a closer match.
[OpenMP][MLIR] Hoist static `alloca`s emitted by private `init` regions to the allocation IP of the construct
Having more than one descriptor (allocatable or array) on the same `private` clause currently triggers a runtime crash on GPUs.
For SPMD kernels, the issue happens because the initialization logic includes:
* Allocating a number of temporary structs (these are emitted by flang when `fir` is lowered to `mlir.llvm`).
* There is a conditional branch that determines whether we allocate storage for the descriptor and initialize array bounds from the original descriptor, or initialize the private descriptor to null.
Because of these 2 things, temporary allocations needed for descriptors beyond the first are preceded by branching, which causes the observed runtime crash.
This PR solves the issue by hoisting these static `alloca` instructions to the suitable alloca IP of the parent construct, as sketched below.
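The fix amounts to something like the following minimal sketch (illustrative names, not the actual flang/OpenMP translation code):

```cpp
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"

// Walk the blocks emitted for a private `init` region and move
// constant-sized (static) allocas up to the construct's allocation
// insertion point, so they are no longer preceded by conditional branches.
static void hoistStaticAllocas(llvm::BasicBlock &InitBlock,
                               llvm::Instruction *AllocaIP) {
  for (llvm::Instruction &I : llvm::make_early_inc_range(InitBlock))
    if (auto *AI = llvm::dyn_cast<llvm::AllocaInst>(&I))
      if (llvm::isa<llvm::ConstantInt>(AI->getArraySize()))
        AI->moveBefore(AllocaIP); // the alloca now dominates all branching
}
```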
[X86] LowerATOMIC_STORE - on 32-bit targets see if i64 values were originally legal f64 values that we can store directly. (#171602)
Based off feedback from #171478
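For illustration, this is the kind of source pattern affected (my example, not from the patch):

```cpp
#include <atomic>

// On 32-bit x86 an atomic 64-bit store must still be a single 8-byte store.
// When the stored i64 is really a bitcast f64, it can now be selected as a
// direct FP store instead of first bouncing the value through integer
// registers (assuming the usual lock-free lowering of std::atomic<double>).
std::atomic<double> g{0.0};

void set(double d) { g.store(d, std::memory_order_relaxed); }
```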
(reland) [AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077)
Fixed a crash in Blender due to some weird control flow.
The issue was with the "merge" function, which was only looking at the
keys of the "Other" VMem/SGPR maps. It needs to look at the keys of both
maps and merge them.
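The fixed merge looks roughly like this (a sketch with illustrative types, not the pass's actual code):

```cpp
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallSet.h"
#include <algorithm>

// Hypothetical score interval; the real pass tracks wait-count scores.
struct Interval { unsigned Lo = 0, Hi = 0; };

// Merge `Other` into `Ours` over the *union* of both key sets. The bug was
// iterating only Other's keys, so entries present only in Ours were never
// reconciled with the incoming state.
static bool mergeMaps(llvm::DenseMap<unsigned, Interval> &Ours,
                      const llvm::DenseMap<unsigned, Interval> &Other) {
  llvm::SmallSet<unsigned, 16> Keys;
  for (const auto &KV : Ours)
    Keys.insert(KV.first);
  for (const auto &KV : Other)
    Keys.insert(KV.first);

  bool Changed = false;
  for (unsigned Key : Keys) {
    auto OurIt = Ours.find(Key);
    auto OtherIt = Other.find(Key);
    Interval Merged;
    if (OurIt != Ours.end() && OtherIt != Other.end())
      Merged = {std::min(OurIt->second.Lo, OtherIt->second.Lo),
                std::max(OurIt->second.Hi, OtherIt->second.Hi)};
    else
      Merged = OurIt != Ours.end() ? OurIt->second : OtherIt->second;
    if (OurIt == Ours.end() || Merged.Lo != OurIt->second.Lo ||
        Merged.Hi != OurIt->second.Hi)
      Changed = true;
    Ours[Key] = Merged; // insert-or-update once OurIt is no longer needed
  }
  return Changed;
}
```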
Original commit message below
----
The pass was already "reinventing" the concept just to deal with 16-bit
registers. Clean up the entire tracking logic to only use register
units.
There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.
[9 lines not shown]
[AArch64][NFC] Add isTRNMask improvements to isZIPMask (#171532)
Some [ideas for
improvement](https://github.com/llvm/llvm-project/pull/169858#pullrequestreview-3525357470)
came up during review of recent changes to `isTRNMask`.
This PR applies them also to `isZIPMask`, which is implemented almost
identically.
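For context, a ZIP mask interleaves two n-element vectors: `zip1` yields `<0, n, 1, n+1, ...>` and `zip2` does the same starting from the high halves. A standalone re-statement of the check (illustrative only; the real helper also handles undef lanes):

```cpp
#include "llvm/ADT/ArrayRef.h"

// Returns true if M is a zip1/zip2 shuffle mask for two N-element vectors.
static bool isZipMaskSketch(llvm::ArrayRef<int> M, unsigned N, bool &IsZip2) {
  if (M.size() != N || N % 2 != 0)
    return false;
  for (bool Hi : {false, true}) {
    unsigned Base = Hi ? N / 2 : 0; // zip2 starts at the high halves
    bool Match = true;
    for (unsigned I = 0; I != N / 2 && Match; ++I)
      Match = M[2 * I] == int(Base + I) && M[2 * I + 1] == int(Base + I + N);
    if (Match) {
      IsZip2 = Hi;
      return true;
    }
  }
  return false;
}
```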
[libc++] Don't instantiate __split_buffer with an allocator reference (#171651)
Allocators should be extremely cheap, if not free, to copy. Furthermore,
we have requirements on allocator types that copies must compare equal,
and that move and copy must be the same.
Hence, taking an allocator by reference should not provide any benefit
over making a copy of it. However, taking the allocator by reference
leads to complexity in __split_buffer, which can be removed if we stop
using that pattern.
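The shape of the change, sketched with simplified names (not libc++'s actual internals):

```cpp
#include <memory>

// Allocator copies must compare equal, so a by-value copy can deallocate
// anything the original allocated; holding the allocator by value avoids
// instantiating the buffer with an `Alloc&` template argument.
template <class T, class Alloc = std::allocator<T>>
struct split_buffer_sketch {
  Alloc alloc_; // previously, in spirit: Alloc& alloc_;
  T *first_ = nullptr;
  T *end_ = nullptr;
  T *cap_ = nullptr;

  explicit split_buffer_sketch(const Alloc &a) : alloc_(a) {}
};
```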
[AMDGPU][SDAG] Add missing cases for SI_INDIRECT_SRC/DST (#170323)
Before this patch, `insertelement`/`extractelement` with dynamic indices
would fail to select with `-O0` for vectors of 32-bit elements with sizes
3, 5, 6, and 7, which did not map to a `SI_INDIRECT_SRC/DST` pattern.
Other "weird" sizes bigger than 8 (like 13) are already handled properly.
To solve this issue, we add the missing patterns for the problematic
sizes.
Solves SWDEV-568862
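A minimal reproducer shape (my illustration, not from the patch):

```cpp
// Dynamic-index extract from a 3 x 32-bit vector: at -O0 this selects
// through SI_INDIRECT_SRC, which previously had no pattern for width 3
// (likewise widths 5, 6, and 7). Uses Clang's ext_vector_type extension.
typedef int v3i32 __attribute__((ext_vector_type(3)));

int extract_dyn(v3i32 v, int idx) { return v[idx]; }
```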
[SPIRV] Promote scalar arguments to vector for `OpExtInst` in `generateExtInst` instead of `SPIRVRegularizer` (#170155)
This patch consists of two parts:
* A first part that removes the scalar to vector promotion for built-ins
in the `SPIRVRegularizer`;
* and a second part that implements the promotion for built-ins from
scalar to vector in `generateExtInst`.
The implementation in `SPIRVRegularizer` had several issues:
* It rolled its own built-in pattern matching that was extremely
permissive.
* The compiler would crash if the built-in had a definition.
* The compiler would crash if the built-in had no arguments.
* The compiler would crash if there were more than 2 function
definitions in the module.
* It'd be better if this was implemented as a module pass, where we
iterate over the users of the function instead of scanning the whole
module for callers.
[13 lines not shown]
[NFC][RISCV] Unify all zvfbfa vl patterns and sd node patterns (#171072)
This patch moves all vl patterns and sd node patterns to
RISCVInstrInfoVVLPatterns.td and RISCVInstrInfoVSDPatterns.td,
respectively. It removes the redefinition of pattern classes for zvfbfa
and makes them easier to maintain and change.
Note: this does not include intrinsic patterns; if we want to unify
intrinsic patterns as well, we also need to move the pseudo instruction
definitions of zvfbfa to RISCVInstrInfoVPseudos.td.
[PowerPC][AIX] Specify correct ABI alignment for double (#144673)
Add `f64:32:64` to the data layout for AIX, to indicate that doubles
have a 32-bit ABI alignment and 64-bit preferred alignment.
Clang was already taking this into account, but it was not reflected in
LLVM's data layout.
A notable effect of this change is that `double` loads/stores with 4-byte
alignment are no longer considered "unaligned" and avoid the
corresponding unaligned access legalization. I assume that this is
correct/desired for AIX. (The codegen previously already relied on this
in some places related to the call ABI simply by dint of assuming
certain stack locations were 8-byte aligned, even though they were only
actually 4-byte aligned.)
Fixes https://github.com/llvm/llvm-project/issues/133599.
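For illustration (my example, assuming the AIX "power" alignment rule that Clang already implements):

```cpp
// With f64:32:64, a double that is not the first member is ABI-aligned to
// only 4 bytes on AIX, so `d` sits at offset 4 and sizeof(S) == 12 there.
// On most other targets `d` would sit at offset 8 with sizeof(S) == 16.
struct S {
  int i;    // offset 0
  double d; // offset 4 when targeting AIX; preferred alignment is still 8
};
```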
[InstCombine][CmpInstAnalysis] Use consistent spelling and function names. NFC. (#171645)
Both `decomposeBitTestICmp` and `decomposeBitTest` have a parameter
called `lookThroughTrunc`. This was spelled in full (i.e. `lookThroughTrunc`)
in the header. However, in the implementation, it's written as `lookThruTrunc`.
I opted to convert all instances of `lookThruTrunc` into
`lookThroughTrunc` to reduce surprise while reading the code and for
consistency.
---
The other change in this PR is the renaming of the wrapper around
`decomposeBitTest()`. Even though it was a wrapper around
`CmpInstAnalysis.h`'s `decomposeBitTest`, the function was called
`decomposeBitTestICmp`. This is quite confusing because such a function
_also_ exists in `CmpInstAnalysis.h`, but it is _not_ the one actually
being used in `InstCombineAndOrXor.cpp`.
[Linalg] Add *Conv2D* matchers (#168362)
-- This commit is the fourth in the series of adding matchers
for linalg.*conv*/*pool*. Refer:
https://github.com/llvm/llvm-project/pull/163724
-- In this commit all variants of Conv2D convolution ops have been
added.
-- It also refactors the way these matchers work to make adding more
matchers concise.
---------
Signed-off-by: Abhishek Varma <abhvarma at amd.com>
Signed-off-by: hanhanW <hanhan0912 at gmail.com>
Co-authored-by: hanhanW <hanhan0912 at gmail.com>
[ConstantFolding] Support ptrtoaddr in ConstantFoldCompareInstOperands (#162653)
This folds `icmp (ptrtoaddr x), (ptrtoaddr y)` to `icmp x, y`, matching
the existing ptrtoint fold. Restrict both folds to only the case where
the result type matches the address type.
I think that all folds this can do in practice end up actually being
valid for ptrtoint to a type larger than the address size as well, but I
don't really see a way to justify this generically without making
assumptions about what kind of folding the recursive calls may do.
This is based on the icmp semantics specified in
https://github.com/llvm/llvm-project/pull/163936.
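For reference, the pre-existing ptrtoint version of the fold can be exercised through the public API roughly like this (my sketch; the patch extends the same behavior to ptrtoaddr expressions):

```cpp
#include "llvm/Analysis/ConstantFolding.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"
using namespace llvm;

// Assumes a module whose pointers are 64-bit, so the ptrtoint result type
// matches the address type, which is the case the folds are restricted to.
Constant *foldSelfCompare(Module &M, GlobalVariable *GV) {
  Constant *Addr =
      ConstantExpr::getPtrToInt(GV, Type::getInt64Ty(M.getContext()));
  // icmp eq (ptrtoint @g), (ptrtoint @g) --> icmp eq @g, @g --> i1 true
  return ConstantFoldCompareInstOperands(CmpInst::ICMP_EQ, Addr, Addr,
                                         M.getDataLayout());
}
```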
[Verifier] Make sure all constexprs in instructions are visited (#171643)
Previously this only happened for constants of some types and missed
incorrect ptrtoaddr.
[ValueTracking] Enhance overflow computation for unsigned mul (#171568)
Changed the range computation in computeOverflowForUnsignedMul to use
computeConstantRange as well.
This expands the set of patterns where InstCombine manages to narrow a
mul whose operands come from zexts. For example, if a value comes from a
div operation, then known bits alone do not give the narrowest possible
range for that value.
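A source-level illustration (my example) of a pattern that can now be narrowed:

```cpp
#include <cstdint>

// x / 3 is at most 0x55555555 and y / 0x55555555 is at most 3, so the
// product is at most 0xFFFFFFFF and provably fits in 32 bits. Known bits
// can only bound x / 3 by 0x7FFFFFFF (one leading zero bit), and
// 0x7FFFFFFF * 3 overflows 32 bits; computeConstantRange instead bounds
// it by 0x55555555, and 0x55555555 * 3 == 0xFFFFFFFF fits exactly, so
// the 64-bit multiply can be narrowed to a 32-bit one.
uint64_t scaled(uint32_t x, uint32_t y) {
  return (uint64_t)(x / 3) * (y / 0x55555555u);
}
```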
---------
Co-authored-by: Adar Dagan <adar.dagan at mobileye.com>
[RISCV] Generate Xqcilsm LWMI/SWMI load/store multiple instructions (#171079)
This patch adds support for generating the Xqcilsm load/store multiple
instructions as a part of the RISCVLoadStoreOptimizer pass. For now we
only combine two load/store instructions into a load/store multiple.
Support for converting more loads/stores will be added in follow-up
patches. These instructions are only applicable to 32-bit loads/stores
with an alignment of 4 bytes.
[LoongArch] Add support for the ud macro instruction (#171583)
This patch adds support for the `ud ui5` macro instruction. The `ui5`
operand must be in the range `0-31`. The macro expands to:
`amswap.w $rd, $r1, $rj`
where `ui5` specifies the register number used for `$rd` in the expanded
instruction, and `$rd` is the same as `$rj`.
Relevant binutils patch:
https://sourceware.org/pipermail/binutils/2025-December/146042.html
[RISCV] Add Xsfmm vlte and vste intrinsics to getTgtMemIntrinsics. (#171747)
Replace dyn_cast with cast. The dyn_cast can never fail now. Previously
it never succeeded.
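For readers less familiar with LLVM's casting utilities, the general idiom behind the change (illustrative, not the RISC-V code itself):

```cpp
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

void example(Value *V) {
  // dyn_cast: for values that may or may not have the expected type;
  // returns nullptr on mismatch.
  if (auto *II = dyn_cast<IntrinsicInst>(V))
    (void)II;

  // cast: asserts the type is already known; never returns nullptr. Per
  // this commit, the value in getTgtMemIntrinsics can no longer have the
  // wrong type, so cast is the right tool there.
  auto *Known = cast<IntrinsicInst>(V);
  (void)Known;
}
```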