[MLIR][SCF] Fix scf.index_switch lowering to preserve large case values (#189230)
`IndexSwitchLowering` stored case values as `SmallVector<int32_t>`,
which silently truncated any `int64_t` case value larger than INT32_MAX
(e.g. `4294967296` became `0`). The `cf.switch` flag was also created
via `arith.index_cast index -> i32`, losing the upper 32 bits on 64-bit
platforms.
Fix: store case values as `SmallVector<APInt>` with 64-bit width, cast
the index argument to `i64`, and use the `ArrayRef<APInt>` overload of
`cf::SwitchOp::create` so the resulting switch correctly uses `i64` case
values and flag type.
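A minimal sketch of the corrected shape, assuming the pattern's `op`,
`rewriter`, and already-collected `defaultBlock`, `caseBlocks`, and
`caseOperands` (spellings are approximate, not the exact upstream code):
```
// Collect 64-bit case values instead of truncating to int32_t.
SmallVector<APInt> caseValues;
for (int64_t caseValue : op.getCases())
  caseValues.push_back(APInt(/*numBits=*/64, caseValue, /*isSigned=*/true));

// Cast the index selector to i64 rather than i32 so no bits are dropped.
Value flag = arith::IndexCastOp::create(
    rewriter, op.getLoc(), rewriter.getI64Type(), op.getArg());

// The ArrayRef<APInt> overload keeps the case values (and flag type) at i64.
cf::SwitchOp::create(rewriter, op.getLoc(), flag, defaultBlock,
                     /*defaultOperands=*/ValueRange(), caseValues, caseBlocks,
                     caseOperands);
```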
Fixes #111589
Assisted-by: Claude Code
[mlir][scf] Fix FoldTensorCastOfOutputIntoForallOp write order bug (#189162)
`FoldTensorCastOfOutputIntoForallOp` incorrectly updated the
destinations of `tensor.parallel_insert_slice` ops in the `in_parallel`
block by zipping `getYieldingOps()` with `getRegionIterArgs()`
positionally. This assumed that the i-th yielding op writes to the i-th
shared output, which is not required by the IR semantics. When slices
are written to shared outputs in non-positional order, the
canonicalization would silently reverse the write targets, producing
incorrect output.
Fix by replacing the positional zip with a per-destination check: for
each yielding op's destination operand, if it is a `tensor.cast` result
whose source is one of the new `scf.forall` region iter args (i.e., a
cast we introduced to bridge the type change), replace the destination
with the cast's source directly. This correctly handles all orderings.
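A minimal sketch of that per-destination check, assuming the pattern's
`rewriter` and the freshly built `newForallOp` (spellings are approximate, not
the exact upstream code):
```
scf::InParallelOp term = newForallOp.getTerminator();
for (Operation &yielding : term.getYieldingOps()) {
  auto insertOp = dyn_cast<tensor::ParallelInsertSliceOp>(&yielding);
  if (!insertOp)
    continue;
  // Only rewrite destinations that go through a cast whose source is one of
  // the new region iter args, i.e. a cast introduced to bridge the type change.
  auto castOp = insertOp.getDest().getDefiningOp<tensor::CastOp>();
  if (!castOp ||
      !llvm::is_contained(newForallOp.getRegionIterArgs(), castOp.getSource()))
    continue;
  rewriter.modifyOpInPlace(insertOp, [&] {
    // ParallelInsertSliceOp operands are (source, dest, offsets, sizes,
    // strides), so operand #1 is the destination.
    insertOp->setOperand(1, castOp.getSource());
  });
}
```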
Add a regression test that exercises the multi-result case where
`parallel_insert_slice` ops write to shared outputs in non-sequential
[4 lines not shown]
[MLIR][SparseTensor] Fix fingerprint changes in SparseFuncAssembler (#188958)
SparseFuncAssembler::matchAndRewrite was calling funcOp.setName(),
funcOp.setPrivate(), and funcOp->removeAttr() directly without notifying
the rewriter, causing "operation fingerprint changed" errors under
MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS.
Wrap all in-place funcOp mutations with rewriter.modifyOpInPlace.
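A minimal sketch of the wrapped mutations, assuming the pattern's `rewriter`
and `funcOp`; `wrapperName` and `attrName` are placeholders for values computed
elsewhere in the pattern:
```
rewriter.modifyOpInPlace(funcOp, [&]() {
  // All three in-place mutations now go through the rewriter, so the
  // expensive-checks fingerprinting sees them.
  funcOp.setName(wrapperName);  // assumed: new symbol name computed earlier
  funcOp.setPrivate();
  funcOp->removeAttr(attrName); // assumed: attribute dropped by the pattern
});
```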
Assisted-by: Claude Code
Fix a failure present with MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS=ON.
Co-authored-by: Claude Sonnet 4.6 <noreply at anthropic.com>
[MLIR][SparseTensor] Fix domination violation in co-iteration for dense iterators (#188959)
In exitWhileLoop, random-accessible (dense) iterators were being located
using whileOp.getResults().back() while the insertion point was still
inside the while loop's after block. This caused a domination violation:
the ADDI created by locate() was inside the after block, but it was
later used (via derefImpl's SUBI) after the while loop exits.
Move the locate() calls for random-accessible iterators to after
builder.setInsertionPointAfter(whileOp), where the while results are
properly in scope.
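A minimal sketch of the reordering, assuming the emitter's `builder`, `loc`,
`whileOp`, and a list `raIters` of the random-accessible iterators (member
spellings follow the commit text and may differ from the upstream code):
```
// Move past the while op first, so the ops that locate() builds (the ADDI)
// are created where the while results dominate every later use.
builder.setInsertionPointAfter(whileOp);
Value crd = whileOp.getResults().back();
for (auto *it : raIters)
  it->locate(builder, loc, crd); // previously emitted inside the after block
```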
Fixes 10 failing tests under MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS.
Assisted-by: Claude Code
Co-authored-by: Claude Sonnet 4.6 <noreply at anthropic.com>
[AArch64][NFC] Move `isZExtLoad/isSExtLoad` from `AArch64FastISel` to `AArch64InstrInfo` (#189486)
Move the static `isZExtLoad` and `isSExtLoad` helper functions from
`AArch64FastISel` into `AArch64InstrInfo` so they can be reused by other
passes.
[SandboxVec][VecUtils] Lane Enumerator (#188355)
This patch introduces an iterator that helps us iterate over lane-value
pairs in a range. For example, given a container `(i32 %v0, <2 x i32>
%v1, i32 %v2)` we get:
```
Lane  Value
0     %v0
1     %v1
3     %v2
```
We use this iterator to replace the lane counting logic in
BottomUpVec.cpp.
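A self-contained illustration of the lane counting involved (plain C++ with
made-up names, not the SandboxIR API): each value starts at the next free
lane, and vector values advance the lane counter by their element count.
```
#include <cstdio>
#include <utility>
#include <vector>

struct Val {
  const char *Name;
  unsigned NumLanes; // 1 for scalars, element count for vectors
};

std::vector<std::pair<unsigned, const Val *>>
enumerateLanes(const std::vector<Val> &Vals) {
  std::vector<std::pair<unsigned, const Val *>> Pairs;
  unsigned Lane = 0;
  for (const Val &V : Vals) {
    Pairs.push_back({Lane, &V});
    Lane += V.NumLanes; // e.g. <2 x i32> %v1 occupies lanes 1 and 2
  }
  return Pairs;
}

int main() {
  std::vector<Val> Vals = {{"%v0", 1}, {"%v1", 2}, {"%v2", 1}};
  for (auto [Lane, V] : enumerateLanes(Vals))
    std::printf("%u %s\n", Lane, V->Name); // prints 0 %v0, 1 %v1, 3 %v2
  return 0;
}
```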
[SPIRV][AMDGPU][clang][CodeGen][opt] Add late-resolved feature identifying predicates (#134016)
This change adds two builtins for AMDGPU:
- `__builtin_amdgcn_processor_is`, which is similar in observable
behaviour to `__builtin_cpu_is`, except that it is never "evaluated"
at run time;
- `__builtin_amdgcn_is_invocable`, which is behaviourally similar to
`__has_builtin`, except that it is not a macro (i.e. not evaluated at
preprocessing time).
Neither of these is `constexpr`, even though when compiling for
concrete (i.e. `gfxXXX` / `gfxXXX-generic`) targets they get evaluated
in Clang, so they shouldn't tear the AST too badly / at all for
multi-pass compilation cases like HIP. They can only be used in specific
contexts (as args to control structures).
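A hedged usage sketch in device code; `__builtin_amdgcn_s_sleep_var` here is
just an example of a builtin being guarded, not something added by this change:
```
void sleep_if_supported(int cycles) {
  // Both predicates may only appear as conditions of control structures; they
  // are folded once the concrete gfxXXX target is known, never at run time.
  if (__builtin_amdgcn_processor_is("gfx1201"))
    __builtin_amdgcn_s_sleep_var(cycles);

  if (__builtin_amdgcn_is_invocable(__builtin_amdgcn_s_sleep_var))
    __builtin_amdgcn_s_sleep_var(cycles);
}
```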
The motivation for adding these is two-fold:
[18 lines not shown]
[lldb][macOS] Recognize new layouts for DeviceSupport directories (#188646)
When debugging a remote Darwin device (iOS, macOS, etc), lldb needs to
find a local copy of all the system libraries (the system's shared
cache) so we don't need to read them over the gdb-remote serial protocol at
the start of every debug session.
Xcode etc. normally creates these expanded shared caches in
~/Library/Developer/Xcode/<OS> DeviceSupport/<OS VER> (<OS
BUILD>)/Symbols
So when lldb sees a file like /usr/lib/libSystem.B.dylib, it may find a
copy at
~/L/D/Xcode/iOS DeviceSupport/26.2
(23B87)/Symbols/usr/lib/libSystem.B.dylib
There may be multiple expanded shared caches in these DeviceSupport
directories, so we try to parse the "os version" and "os build" out of
the filepath name, and look in a directory that matches the target
[23 lines not shown]
[AArch64][llvm] Gate some `tlbip` insns with +tlbid or +d128
Change the gating of `tlbip` instructions containing `*E1IS*`, `*E1OS*`,
`*E2IS*` or `*E2OS*` so that they can be used with either `+tlbid` or
`+d128`. This is because the 2025 Armv9.7-A MemSys specification says:
```
All TLBIP *E1IS*, TLBIP *E1OS*, TLBIP *E2IS* and TLBIP *E2OS* instructions
that are currently dependent on FEAT_D128 are updated to be dependent
on FEAT_D128 or FEAT_TLBID
```