[InstCombine] Fix crash in `foldReversedIntrinsicOperands` for struct-return intrinsics (#186339)
Fixes #186334
Similar to #176556, add the missing result type check in
`foldReversedIntrinsicOperands()`. This prevents `CreateVectorReverse()`
from being applied to struct-returning intrinsics.
[AArch64] Fold NEON splats into users by using SVE immediates (#165559)
This patch adds patterns that attempt to fold NEON constant splats into
users by promoting the users to use SVE, when the splat immediate is a
legal SVE immediate operand.
This is done as ISEL patterns to avoid folding to SVE too early, which
can disrupt other patterns/combines.
[AMDGPU][Doc] GFX12.5 Barrier Execution Model
- Document GFX12.5-specific intrinsics.
- Rename signal -> arrive, leave -> drop to match C++ terminology.
- Update execution model to support GFX12.5 semantics (e.g. threads can arrive w/o waiting)
- Various clean-ups & wording updates on the model.
- Added "mutually exclusive" barrier objects.
- Added barrier-phase-with + related constraints.
- Document that barriers can exist at cluster scope too.
- Update GFX12 target semantics/code sequences to include GFX12.5.
The model is no longer marked as incomplete; it is now just experimental.
More updates are planned to support additional features and to improve
some known shortcomings of the model. For example, many relations currently
encode too much semantic information, which means the model fails to build
when barriers aren't used correctly. I'd like the model to eventually
represent broken executions as well, just as a memory model can.
[SCEV] Introduce SCEVUse wrapper type (NFC)
Add SCEVUse as a PointerIntPair wrapper around const SCEV * to prepare
for storing additional per-use information.
This commit contains the mechanical changes of adding an initial SCEVUse
wrapper and updating all relevant interfaces to take SCEVUse. Note that
currently the integer part is never set, and all SCEVUses are
considered canonical.
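The underlying trick of a PointerIntPair is that a sufficiently aligned pointer has free low bits, so a small integer can be packed alongside it. A minimal standalone sketch (the names SCEVUseSketch and Flags are illustrative, not LLVM's actual llvm::PointerIntPair API):

```cpp
#include <cassert>
#include <cstdint>

struct SCEV { int dummy; };  // stand-in for the real SCEV node type

// Sketch of a PointerIntPair-style wrapper: the pointer's low alignment
// bits are always zero, so a small integer can live there. The real
// llvm::PointerIntPair is more general (traits-based bit counts, etc.).
class SCEVUseSketch {
  uintptr_t Value = 0;
  static constexpr uintptr_t IntMask = alignof(SCEV) - 1;  // free low bits

public:
  explicit SCEVUseSketch(const SCEV *S, unsigned Flags = 0)
      : Value(reinterpret_cast<uintptr_t>(S) | (Flags & IntMask)) {
    assert((reinterpret_cast<uintptr_t>(S) & IntMask) == 0 &&
           "pointer must be aligned so its low bits are free");
  }
  const SCEV *getPointer() const {
    return reinterpret_cast<const SCEV *>(Value & ~IntMask);
  }
  unsigned getInt() const { return static_cast<unsigned>(Value & IntMask); }
};
```

With the integer part left at zero, as in this commit, the wrapper round-trips the pointer unchanged, which is why the change can be purely mechanical.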
[MLIR][OpenACC] Fix crash in verifyDeviceTypeCountMatch when deviceTypes is null (#186279)
When an acc.parallel op has async operands (via operandSegmentSizes) but
no corresponding asyncOperandsDeviceType attribute, the verifier called
verifyDeviceTypeCountMatch with a null ArrayAttr. The function then
dereferenced the null pointer via deviceTypes.getValue(), causing a
segfault instead of a diagnostic.
Fix by guarding the getValue() call with a null check. When deviceTypes
is absent but operands are present, the mismatch is now reported as a
proper verifier error.
Fixes #107027
Assisted-by: Claude Code
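The shape of the fix, guarding a possibly-null attribute and turning the mismatch into a diagnostic, can be sketched generically; the function and types below are illustrative stand-ins, not the MLIR API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for MLIR's ArrayAttr: null models a missing attribute.
using ArrayAttr = const std::vector<int> *;

// Sketch of the guarded check: when the attribute is absent but async
// operands are present, report a count mismatch rather than calling
// through the null pointer (which is what caused the segfault).
bool verifyDeviceTypeCountMatchSketch(std::size_t numOperands,
                                      ArrayAttr deviceTypes,
                                      std::string &error) {
  std::size_t numDeviceTypes = deviceTypes ? deviceTypes->size() : 0;
  if (numDeviceTypes != numOperands) {
    error = "expected " + std::to_string(numOperands) +
            " device_type entries, found " + std::to_string(numDeviceTypes);
    return false;  // proper verifier error instead of a crash
  }
  return true;
}
```

The key design point is that an absent attribute is treated as a count of zero, so the existing mismatch diagnostic covers the null case for free.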
[LowerMemIntrinsics][AMDGPU] Optimize memset.pattern lowering (#185901)
This patch changes the lowering of the [experimental.memset.pattern intrinsic](https://llvm.org/docs/LangRef.html#llvm-experimental-memset-pattern-intrinsic)
to match the optimized memset and memcpy lowering when possible. (The tl;dr of
memset.pattern is that it is like memset, except that you can use it to set
values that are wider than a single byte.)
The memset.pattern lowering now queries `TTI::getMemcpyLoopLoweringType` for a
preferred memory access type. If the size of that type is a multiple of the set
value's type, and if both types have consistent store and alloc sizes (since
memset.pattern behaves in a way that is not well suited to access widening
if store and alloc size differ), the memset.pattern is lowered into two loops:
a main loop that stores a sufficiently wide vector splat of the SetValue with
the preferred memory access type and a residual loop that covers the remaining
set values individually.
In contrast to the memset lowering, this patch doesn't include a specialized
lowering for residual loops with known constant lengths. Loops that are
statically known to be unreachable will not be emitted.
[6 lines not shown]
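The two-loop structure can be sketched as a software model in plain C++ (a model of the loops the lowering emits, not the actual IR-level transformation; the 16-byte wide type and 4-byte pattern sizes are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Software model of the two-loop memset.pattern lowering. Assume the
// preferred access type from TTI is 16 bytes and the pattern is a
// 4-byte value, so each wide store writes 4 replicated copies of the
// pattern; the residual loop covers any remaining full patterns.
void memsetPatternSketch(uint32_t *dst, uint32_t pattern, std::size_t count) {
  constexpr std::size_t PatternsPerWideStore = 16 / sizeof(uint32_t);  // 4
  const uint32_t wide[PatternsPerWideStore] = {pattern, pattern, pattern,
                                               pattern};
  std::size_t i = 0;
  // Main loop: one wide (splatted) store per iteration.
  for (; i + PatternsPerWideStore <= count; i += PatternsPerWideStore)
    std::memcpy(dst + i, wide, sizeof(wide));
  // Residual loop: remaining patterns stored individually.
  for (; i < count; ++i)
    dst[i] = pattern;
}
```

The multiple-of check described above is what guarantees the wide store holds a whole number of pattern copies, so the main loop never has to split a pattern across two stores.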
[C++20] [Modules] [Reduced BMI] Try not to write merged lookup table (#186337)
Update:
Close https://github.com/llvm/llvm-project/issues/184957
The root cause of the problem is that reduced BMI may not emit everything
in the lookup table. If reduced BMI **partially** emits some decls, the
generator may skip emitting the corresponding entry because the name is
already there. See MultiOnDiskHashTableGenerator::insert and
MultiOnDiskHashTableGenerator::emit for details. So we won't emit the
lookup table if we're generating a reduced BMI.
[libc] Fix load_aligned big-endian handling. (#185937)
The variadic template helper `load_aligned` performs a specific case of
an unaligned integer load, by loading a sequence of integers from memory
at addresses expected to be aligned, and glues the results back together
with shifts and ORs into an output.
The implementation works by performing the first load, recursing on a
shorter parameter type list for the rest, and recombining via
first | (rest << size_of_first) // if little-endian
(first << size_of_first) | rest // if big-endian
But the big-endian case is wrong: it should shift left by the size of
the _rest_ of the types, not the size of the first. In the case where
you load 8, 16 and 8 bits from an odd address, you want
(first_byte << 24) | (middle_halfword << 8) | (last_byte)
[5 lines not shown]
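The corrected big-endian recombination for the 8/16/8-bit example above can be sketched concretely (the function name is illustrative; libc's load_aligned is a variadic template over the unit types):

```cpp
#include <cstdint>

// Sketch of the corrected big-endian recombination for an unaligned
// 32-bit load split into 8, 16 and 8-bit aligned loads starting at an
// odd address. In big-endian order the first unit is the most
// significant, so it must be shifted left by the total width of the
// *rest* of the loads (24 bits here), not by its own width (8 bits),
// which was the bug.
uint32_t combineBigEndian(uint8_t first_byte, uint16_t middle_halfword,
                          uint8_t last_byte) {
  // rest = middle and last units glued together, 24 bits wide.
  uint32_t rest = (uint32_t(middle_halfword) << 8) | last_byte;
  // Correct: first << size_of_rest, then OR in the rest.
  return (uint32_t(first_byte) << 24) | rest;
}
```

With the buggy `first << size_of_first` the top byte would only be shifted by 8 and collide with the middle halfword, corrupting the result.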