[mlir][CSE] Fix CSE markAnalysesPreserved<DominanceInfo, PostDominanceInfo> comment (#190471)
The original comment claimed that DominanceInfo and PostDominanceInfo
could be preserved because region operations are not removed. However,
the real reason was that the original CSE only deleted redundant
operations without moving any operation to a different block, leaving
the dominance tree structure unchanged. Part of
https://github.com/llvm/llvm-project/pull/180556.
[libc++][docs] Update paper and LWG issue lists after 2026-03 meeting (#189901)
[P3726R2](https://wg21.link/P3726R2) is a Core paper but adds
`std::start_lifetime`, so it needs to be listed in libc++'s
documentation.
For LWG issues, see [P4145R0](https://wg21.link/P4145R0) and
[P4146R0](https://wg21.link/P4146R0).
[CodeGenPrepare] Use Instruction::comesBefore instead of manual ordering (#190485)
After #172329, we noticed that some sources compiled with MSan take
1000x longer to compile. This is caused by quadratic complexity in
tryToSinkFreeOperands, which can be called on a significant number
of instructions within huge basic blocks.
This inefficiency was introduced in 9cfa9b4, which manually iterates
and creates a DenseMap of entire basic blocks for each interesting
instruction.
This patch avoids the manual ordering by using
Instruction::comesBefore(), which provides the exact same
ordering much more efficiently.
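The cost difference described above can be sketched with a toy model. This is not LLVM's implementation: `Block`, `comes_before`, and `comes_before_naive` are hypothetical names, and the lazily cached numbering is only an assumption about how an efficient ordering query can work.

```python
class Block:
    """Toy basic block: an ordered list of unique instruction names."""

    def __init__(self, instructions):
        self.instructions = list(instructions)
        self._order = None  # lazily built name -> position cache

    def insert(self, index, inst):
        self.instructions.insert(index, inst)
        self._order = None  # any mutation invalidates the cached numbering

    def comes_before(self, a, b):
        # Renumber only when the cache is stale; later queries are O(1).
        if self._order is None:
            self._order = {inst: i for i, inst in enumerate(self.instructions)}
        return self._order[a] < self._order[b]


def comes_before_naive(block, a, b):
    # The costly pattern: rebuild a full position map for every query,
    # which is O(block size) per interesting instruction.
    order = {inst: i for i, inst in enumerate(block.instructions)}
    return order[a] < order[b]


block = Block(f"i{n}" for n in range(1000))
```

Both functions give the same answer. The cached variant pays for numbering once per block modification, where the naive variant pays on every call.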
[clang-tidy] Fix performance-trivially-destructible with C++20 modules (#178471)
When a class definition is seen through both a header include and a
C++20 module import, destructors may appear multiple times in the AST's
redeclaration chain. The original matcher used `isFirstDecl()`, which
fails in this scenario because the same declaration can appear as both
first and non-first depending on the view.
Replace `unless(isFirstDecl())` with `isOutOfLine()`, which correctly
identifies out-of-line definitions by checking whether the lexical
context differs from the semantic context.
Also update clang-tools-extra's lit.cfg.py to call `use_clang()` instead
of `clang_setup()` to make the `%clang` substitution available for
tests.
Fixes #178102
Co-authored-by: Chuanqi Xu <yedeng.yd at linux.alibaba.com>
[flang][cuda] Lower unified variables as cuf.alloc in main program scope (#190713)
Remove the unified exception from CanCUDASymbolBeGlobal so unified
variables follow the same cuf.alloc lowering path as other CUDA data
attributes.
[AMDGPU] asyncmark support for ASYNC_CNT (#185813)
The ASYNC_CNT is used to track the progress of asynchronous copies
between global and LDS memories. By including it in asyncmark, the
compiler can now assist the programmer in generating waits for
ASYNC_CNT.
Assisted-By: Claude Sonnet 4.5
This is part of a stack:
- #185813
- #185810
Fixes: LCOMPILER-332
[AMDGPU] Fix setreg handling in the VGPR MSB lowering
There are multiple issues with it:
1. It can skip inserting S_SET_VGPR_MSB if we set the mode via
piggybacking. We are currently relying on a HW bug for correct
behavior. If/when the bug is fixed, the lowering will be incorrect.
2. We should just unconditionally update the MSBs if the immediate
allows it. We should set the correct bits and keep the rest of the
immediate (that is done). There is no reasonable way for a user to
change the MSBs, nor does it do any good to set them with SETREG and
then immediately overwrite them with S_SET_VGPR_MSB.
3. We can always update the immediate if Offset is zero.
4. Redundant mode changes are created, as seen in
hazard-setreg-vgpr-msb-gfx1250.mir.
With the immediate updated unconditionally most of the time, and no
reliance on SETREG for setting the MSBs, there is no good reason to
complicate handling by supporting SETREG as a piggybacking target. Moreover,
[10 lines not shown]
Move {load,store}(llvm.protected.field.ptr) lowering to InstCombine.
The previous position of llvm.protected.field.ptr lowering for loads
and stores was problematic as it not only inhibited optimizations such
as DSE (as stores to a llvm.protected.field.ptr were not considered to
must-alias stores to the non-protected.field pointer) but also required
changes to other optimization passes to avoid transformations that would
reduce PFP coverage.
Address this by moving the load/store part of the lowering to
InstCombine, where it will run earlier than the PFP-breaking and
AA-relying transformations. The deactivation symbol, null comparison
and EmuPAC parts of the lowering remain in PreISelLowering.
Now that the transformation inhibitions are no longer needed, remove them
(i.e. partially revert #151649, and revert #182976).
This change resulted in a 2.4% reduction in Fleetbench .text size and
the following improvements to PFP performance overhead for BM_PROTO_Arena
[11 lines not shown]
[RISCV] Use per-SEW immediate inversion for vrol intrinsic patterns (#190113)
The VPatBinaryV_VI_VROL multiclass was using InvRot64Imm for all SEW
widths when converting vrol immediate intrinsics to vror.vi. This
produced unnecessarily large immediates for narrower element types
(e.g., 61 instead of 5 for SEW=8 rotate-left by 3).
Use the appropriate InvRot{SEW}Imm transform to match what the SDNode
patterns already do.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply at anthropic.com>
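The arithmetic behind the vrol example can be checked with a short sketch. The helper names are hypothetical, and the model assumes rotate amounts wrap modulo SEW. The `(sew - imm) % sew` formula is an illustration of the per-SEW inversion, not a copy of the TableGen transform.

```python
def rotl(x, k, sew):
    """Rotate the sew-bit value x left by k bits."""
    mask = (1 << sew) - 1
    k %= sew  # assumption: the amount wraps modulo SEW
    x &= mask
    return ((x << k) | (x >> (sew - k))) & mask


def rotr(x, k, sew):
    """Rotate the sew-bit value x right by k bits."""
    mask = (1 << sew) - 1
    k %= sew
    x &= mask
    return ((x >> k) | (x << (sew - k))) & mask


def inv_rot_imm(imm, sew):
    """Rotate-right amount equivalent to rotating left by imm at width sew."""
    return (sew - imm) % sew


narrow = inv_rot_imm(3, 8)   # per-SEW inversion for SEW=8
wide = inv_rot_imm(3, 64)    # 64-bit inversion applied to every SEW
```

Here `narrow` is 5 and `wide` is 61, matching the numbers in the commit message. Under the modulo assumption both give the same rotation at SEW=8, so the per-SEW form only makes the immediate smaller.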
[Inliner] Put inline history into IR as !inline_history metadata (#190700)
(Reland of #190092 with verifier change to look through GlobalAliases)
So that it's preserved across all inline invocations rather than just
one inliner pass run.
This prevents cases where devirtualization in the simplification
pipeline uncovers inlining opportunities that should be discarded due to
inline history, but we dropped the inline history between inliner pass
runs, causing code size to blow up, sometimes exponentially.
For compile time reasons, we want to limit this to only call sites that
have the potential to inline through SCCs, potentially with the help of
devirtualization. This means that the callee is in a non-trivial
(Ref)SCC, or the call site was previously an indirect call, which can
potentially be devirtualized to call any function.
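The effect of keeping versus dropping inline history can be sketched with a toy worklist model. Everything here is hypothetical (`count_inlines`, the `callees_of` map, the `limit` cap) and it ignores cost modeling entirely. It only shows why a history that survives across runs bounds inlining through a recursive cycle.

```python
def count_inlines(start, callees_of, keep_history, limit=50):
    """Count inlining steps, starting from one call to `start`.

    Each call site carries the tuple of functions already inlined to
    produce it. A callee found in that tuple is not inlined again.
    """
    worklist = [(start, ())]
    inlined = 0
    while worklist and inlined < limit:
        callee, history = worklist.pop()
        if callee in history:
            continue  # already inlined through this function on this chain
        inlined += 1
        # Dropping the history models losing it between inliner runs.
        new_history = (history + (callee,)) if keep_history else ()
        worklist.extend((c, new_history) for c in callees_of[callee])
    return inlined


# Two mutually recursive functions: f calls g, g calls f.
callees_of = {"f": ["g"], "g": ["f"]}
with_history = count_inlines("f", callees_of, keep_history=True)
without_history = count_inlines("f", callees_of, keep_history=False)
```

With history kept, inlining stops after each function in the cycle has been inlined once. Without it, the model only stops at the artificial `limit`.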
The CGSCCUpdater::InlinedInternalEdges logic still seems to be relevant
[5 lines not shown]
[CIR] Implement __builtin_flt_rounds and __builtin_set_flt_rounds (#190706)
This adds CIR handling for the __builtin_flt_rounds and
__builtin_set_flt_rounds builtin functions. Because the LLVM dialect
does not have dedicated operations for these, I have chosen not to
implement them as operations in CIR either. Instead, we just call the
LLVM intrinsic.
[CIR][NFC] Use tablegen to create CIRAttrToValue visitor declarations (#187607)
This change introduces TableGen support for indicating CIR attributes
that require a CIRAttrToValue visitor, adds the new flag to all
attributes to which it applies, and replaces the explicit declarations
with the tablegen output.
[AMDGPU] Fix verifier crash caused by multiple live range components.
In the Rewrite AGPR-Copy-MFMA pass, after replacing spill instructions,
the replacement register may have multiple live range components when
the spill slot was stored to more than once. The verifier then crashes
with a bad machine code error. This patch fixes the problem by splitting
the live range while assigning the same physical register in this
scenario. A new test verifies the absence of this verifier error.
Assisted-by: Claude Opus
[CodeGen] Fix incorrect rematerialization order in rematerializer
When rematerializing DAGs of registers wherein multiple paths exist
between some registers of the DAG, the rematerializer can determine an
incorrect rematerialization order, one that does not ensure that a
register's dependencies are rematerialized before the register itself.
That ordering is a required invariant.
This patch fixes the issue by using simpler recursive logic to determine
a correct rematerialization order that honors this invariant. A minimal
unit test is added that fails on the current implementation.
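The invariant can be illustrated with a post-order depth-first traversal, one common recursive way to order a DAG so that dependencies come first. This is a sketch, not the patch's code: `remat_order` and the register names are hypothetical.

```python
def remat_order(deps, roots):
    """Return registers so that each appears after all of its dependencies.

    `deps` maps a register to the registers it depends on and is
    assumed to describe a DAG (no cycles).
    """
    order, done = [], set()

    def visit(reg):
        if reg in done:
            return
        for dep in deps.get(reg, ()):
            visit(dep)  # dependencies are emitted before reg itself
        done.add(reg)
        order.append(reg)

    for root in roots:
        visit(root)
    return order


# Several paths lead from d to a: d->b->a, d->c->a, and d->c->b->a.
deps = {"d": ["b", "c"], "b": ["a"], "c": ["a", "b"], "a": []}
order = remat_order(deps, ["d"])
```

A register is appended only after all of its dependencies have been visited, so reaching it through more than one path cannot place it too early.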
[CodeGen] Fix multiple connected component issue in rematerializer (#186674)
This fixes a rematerializer issue wherein re-creating the interval of a
non-rematerializable super-register defined over multiple MIs, some of
which define entirely dead sub-registers, could cause a crash when
changing the order of sub-definitions (for example during scheduling)
because the re-created interval could end up with multiple connected
components, which is illegal. The solution is to split separate
components of the interval in such cases. The added unit test crashes
without that added behavior.
[MLIR][test] Re-disable FileCheck on async.mlir integration test (#190702)
#190563 re-enabled FileCheck on `Integration/GPU/CUDA/async.mlir`, but
the buildbot has shown intermittent wrong-output failures
([example](https://lab.llvm.org/buildbot/#/builders/116/builds/27026)):
the test produces `[42, 42]` instead of the expected `[84, 84]`.
This wrong-output flakiness is distinct from the cleanup-time
`cuModuleUnload` errors that #190563 actually fixes — it's the
underlying issue tracked by #170833. The merged commit message for
#190563 incorrectly says `Fixes #170833`; that issue should be reopened,
since the cleanup-error fix doesn't address the wrong-output behavior.
This PR puts the test back in its previously-disabled state. The runtime
cleanup fix in #190563 is unaffected.
[CIR] Handle static local var decl constants (#190699)
This adds the handling for the case where the address of a static local
variable is used to initialize another static local. In this case, the
address of the first variable is emitted as a constant in the
initializer of the second variable.
[CodeGen] Fix multiple connected component issue in rematerializer
This fixes a rematerializer issue wherein re-creating the interval of a
non-rematerializable super-register defined over multiple MIs, some of
which define entirely dead subregisters, could cause a crash when
changing the order of sub-definitions (for example during scheduling)
because the re-created interval could end up with multiple connected
components, which is illegal. The solution is to split separate
components of the interval in such cases. The added unit test crashes
without that added behavior.