[MLIR][Python] Support `has_trait` for operations (#188492)
This PR adds a `has_trait(trait_cls)` API to `_OperationBase` that can
be used for:
- C++-defined operations and C++-defined traits (e.g.
`func_return_op.has_trait(IsTerminatorTrait)`)
- Python-defined operations and C++-defined traits (e.g.
`my_python_op.has_trait(IsTerminatorTrait)`)
- Python-defined operations and Python-defined traits (e.g.
`my_python_op.has_trait(MyPythonTrait)`)
---------
Co-authored-by: Maksim Levental <maksim.levental at gmail.com>
[lit] Explicitly unset timer to free thread stack (#188717)
Currently the virtual address space usage of lit fluctuates wildly, with
peak usage exceeding 4GB, which results in subsequent thread spawning
errors on 32-bit systems.
The cause of this is a circular reference in TimeoutHelper._timer (via the
callback), which causes the 8MB thread stack to not be immediately
reclaimed when the timer is cancelled.
We can avoid this by explicitly unsetting the timer.
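The reference cycle described above can be sketched in a few lines. This is
a minimal illustration, not lit's actual code: the class name
`TimeoutHelperSketch` and its methods are hypothetical, and only
`threading.Timer` is a real standard-library API. The timer holds a bound
method as its callback, the bound method holds the helper, and the helper
holds the timer, so the objects form a cycle until the attribute is cleared.

```python
import threading


class TimeoutHelperSketch:
    """Hypothetical stand-in for a helper that owns a timeout timer."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._timer = None
        self.timed_out = False

    def start(self):
        # The bound method self._on_timeout references self, and self
        # references the timer: helper -> timer -> callback -> helper.
        self._timer = threading.Timer(self.timeout, self._on_timeout)
        self._timer.start()

    def _on_timeout(self):
        self.timed_out = True

    def cancel(self):
        if self._timer is not None:
            self._timer.cancel()
            # Explicitly drop the reference so the cycle is broken and the
            # timer object can be released by reference counting instead
            # of waiting for the cyclic garbage collector.
            self._timer = None
```

Clearing the attribute in `cancel()` is the step that corresponds to
"explicitly unsetting the timer"; without it, the timer stays reachable
through the cycle until a later garbage collection pass.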
[DA] Fix -Wunused-variable
A couple of these variables are only used within LLVM_DEBUG statements,
which the preprocessor removes in non-assertions builds, leaving the
variables unused. Mark them maybe_unused, given that the names make the
code more readable.
[flang][OpenMP] Support user-defined declare reduction with derived types (#184897)
Fix lowering of `!$omp declare reduction` for intrinsic operators applied
to user-defined derived types (e.g., `+` on `type(t)`). Previously, this
hit a TODO in `ReductionProcessor::getReductionInitValue` because the code
tried to compute an init value for a non-predefined type, when it should
instead use the initializer region from the `DeclareReductionOp`.
This fixes the issue #176278: [Flang][OpenMP] Compilation error when
type-list in declare reduction directive is derived type name.
The root cause was a naming mismatch: `genOMP` for
`OpenMPDeclareReductionConstruct` used a raw operator string (e.g., "Add")
as the reduction name, while `processReductionArguments` at the use site
computed a canonical name via `getReductionName` (e.g.,
"add_reduction_byref_rec__QFTt"). The `lookupSymbol` in
[83 lines not shown]
Improvements to cost-model
The chosen costs are more precise because the cost-model now makes better use of the target features to determine whether something can be expanded.
The costs in sdot-i16-i32 are now more accurate and the loops that didn't vectorise before result in equivalent or better codegen.
Various changes to the cost-model.
This has a number of changes to the partial reduction cost-model:
* Implement the fact that *MLALB/T instructions can be used for
16-bit -> 32-bit partial reductions (or *MLAL/MLAL2 for NEON).
* Fix the cost of reductions that don't have specific lowering: rather
than returning a random number, we now return the cost of
expanding the partial reduction in ISel.
For sub-reductions we scale the cost to make them slightly cheaper,
so that they're still candidates for forming cdot operations.
* Reduce the cost of FP reductions, which are currently prohibitively
expensive.
[libc] Increase the maximum RPC port size for future hardware (#188756)
Summary:
We store the locks in local device memory for performance and
simplicity. The number here needs to correspond to the maximum occupancy
so that we never have a situation where a GPU thread is blocking another
GPU thread.
The number now is sufficient for most hardware, but modern compute chips
like the MI300x are already pushing ~12000 resident waves. This has ABI
implications, so I'd like to bump it up sooner rather than later. The
ABI change is within what OpenMP expects, LLVM major versions, and it
will be caught statically so there's no risk of silent corruption (size
doesn't match).
[compiler-rt] Rework profile data handling for GPU targets (#187136)
Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.
This PR changes the interface to move towards the start / stop section
handling. AMDGPU supports this natively as an ELF target, so few changes
are needed. Instead of overriding visibility, we use a single table
to define the bounds that we can obtain with one contiguous load.
Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.
[13 lines not shown]
[AMDGPU] Remove AMDGPUISD::FFBH_I32 and add ISD::CTLS lowering (#187694)
This is a continuation of the previously reverted
https://github.com/llvm/llvm-project/pull/178420
The patch removes the custom AMDGPUISD::FFBH_I32 SelectionDAG node. Call
sites that need raw hardware semantics (LowerINT_TO_FP32, legalizeITOFP)
now use the amdgcn_sffbh intrinsic directly. ISD::CTLS is added as a Custom
operation for i32.
Previous attempt had an issue:
The hardware v_ffbh_i32 instruction (v_cls_i32 on newer targets) has
different semantics than ISD::CTLS:
- sffbh returns [1, BitWidth-1] for normal values, -1 for all-same-bits
- CTLS returns [0, BitWidth-2] for normal values, BitWidth-1 for
all-same-bits
Now LowerCTLS handles this by: sffbh -> umin(sffbh, BitWidth) -> sub 1.
[6 lines not shown]
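The semantic difference between sffbh and CTLS, and the
`umin(sffbh, BitWidth) - 1` fix-up, can be checked with a small bit-level
model. This is only a sketch of the semantics as described in the message
above, for 32-bit values; the function names `sffbh`, `ctls_reference` and
`lower_ctls` are hypothetical and do not correspond to LLVM code.

```python
BITS = 32
MASK = (1 << BITS) - 1


def sffbh(x):
    """Model of the hardware op: position (counted from the MSB) of the
    first bit that differs from the sign bit, or -1 (all ones when read
    as unsigned) if every bit equals the sign bit."""
    x &= MASK
    sign = x >> (BITS - 1)
    for pos in range(1, BITS):
        if (x >> (BITS - 1 - pos)) & 1 != sign:
            return pos
    return MASK  # -1 as an unsigned 32-bit value


def ctls_reference(x):
    """Model of CTLS: number of bits below the sign bit that repeat it."""
    x &= MASK
    sign = x >> (BITS - 1)
    count = 0
    for pos in range(1, BITS):
        if (x >> (BITS - 1 - pos)) & 1 != sign:
            break
        count += 1
    return count


def lower_ctls(x):
    # sffbh -> umin(sffbh, BitWidth) -> sub 1
    # The unsigned min clamps the all-same-bits result (0xFFFFFFFF) to 32,
    # so the subtraction yields BitWidth-1 = 31 in that case.
    return min(sffbh(x), BITS) - 1
```

For normal values the two results differ by exactly one, and the unsigned
minimum only changes the all-same-bits case, which is why a single clamp
followed by a subtraction is enough in this model.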
[TargetLowering] In prepareUREMEqFold/prepareSREMEqFold, fix K=-1 for i64 elements. (#188600)
K is an unsigned, so it will be zero extended to uint64_t for
the APInt constructor. If the ShSVT has more than 32 bits, we won't
create an all ones ConstantSDNode.
To fix this, explicitly push an all ones constant to KAmts. This
also fixes an APInt ImplicitTrunc.
This allows turnVectorIntoSplatVector to work for this case.
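The arithmetic behind the bug can be reproduced outside of LLVM. The sketch
below only models the integer widths involved; `zext_u32_to_u64` and
`all_ones` are hypothetical helper names, and a 32-bit `unsigned` is
assumed, as on common targets.

```python
def zext_u32_to_u64(k):
    """Zero extension keeps the low 32 bits and fills the rest with 0."""
    return k & 0xFFFFFFFF


def all_ones(width):
    """The value an all ones constant of the given bit width should have."""
    return (1 << width) - 1


# K = -1 stored in a 32-bit unsigned wraps to 0xFFFFFFFF.
k_minus_one = -1 & 0xFFFFFFFF

# Passing it where a uint64_t is expected zero extends it, so the upper
# 32 bits are 0 and the result is not all ones for a 64-bit element.
widened = zext_u32_to_u64(k_minus_one)
```

Here `widened` is `0x00000000FFFFFFFF`, while `all_ones(64)` is
`0xFFFFFFFFFFFFFFFF`; building the all ones constant explicitly for the
element width avoids the mismatch.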
[OpenMP][flang] Fix crash in host offload (#187847)
Guard `getGridValue` in `OMPIRBuilder` to avoid reaching the
`unreachable` in `getGridValue` when offloading to host device without
an explicit num_threads clause.
[CIR] Add support for __atomic_fetch_uinc and __atomic_fetch_udec (#188050)
This patch adds CIRGen and LLVM lowering support for the
`__atomic_fetch_uinc` and the `__atomic_fetch_udec` built-in functions.
Assisted-by: Claude Opus 4.6
[Driver][HIP] Bundle AMDGPU -S output under the new offload driver (#188262)
The old offload driver emits bundled assembly code for -S in textual
clang-offload-bundler format. This allows a single .s file to contain
assembly code for both host and devices, which can be consumed by clang.
This eases manual optimization of assembly code for host and device. There
are existing HIP tests and examples depending on this feature. The new
offload driver does not support it, causing regressions. This patch adds
support for this feature with minor changes to the job action creations.
Fixes: LCOMPILER-553
[OpenMP] Fix non-contiguous array omp target update (#156889)
The existing implementation has three issues which this patch addresses.
1. The last dimension, which represents the bytes in the type, has the
wrong stride and count. For example, for a 4 byte int, count=1 and
stride=4. The correct representation here is count=4 and stride=1
because there are 4 bytes (count=4) that we need to copy and we do not
skip any bytes (stride=1).
2. The size of the data copy was computed using the last dimension.
However, this is incorrect in cases where some of the final dimensions
get merged into one. In this case we need to take the combined size of
the merged dimensions, which is (Count * Stride) of the first merged
dimension.
3. The Offset into a dimension was computed as a multiple of its Stride.
However, this Stride, which is in bytes, already includes the stride
multiplier given by the user. This means that when the user specified
[3 lines not shown]
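Points 1 and 2 above can be illustrated with a small model of the dimension
descriptors. This is a sketch under stated assumptions, not the runtime's
implementation: the `Dim` class, `byte_dim`, and `contiguous_copy_size` are
hypothetical names, and the merge rule used here (an outer dimension merges
with the inner one when its stride equals the inner Count * Stride) is an
assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Dim:
    count: int   # number of items copied along this dimension
    stride: int  # distance between consecutive items, in bytes


def byte_dim(elem_size):
    # Point 1: the last dimension describes the bytes of one element.
    # All elem_size bytes are copied (count) and none are skipped (stride).
    return Dim(count=elem_size, stride=1)


def contiguous_copy_size(dims):
    """dims is ordered outermost to innermost.

    Point 2: when trailing dimensions are merged into one contiguous
    chunk, its size is Count * Stride of the first merged dimension.
    """
    first_merged = len(dims) - 1
    while first_merged > 0:
        inner = dims[first_merged]
        outer = dims[first_merged - 1]
        # Assumed merge rule: no gap between consecutive outer items.
        if outer.stride != inner.count * inner.stride:
            break
        first_merged -= 1
    merged = dims[first_merged]
    return merged.count * merged.stride


# Two full rows of an int a[4][5]: everything is contiguous.
full_rows = [Dim(2, 20), Dim(5, 4), byte_dim(4)]
# Every other column of two rows: only the bytes of one int are contiguous.
strided_cols = [Dim(2, 20), Dim(3, 8), byte_dim(4)]
```

In this model `contiguous_copy_size(full_rows)` is 40 bytes, because all
three dimensions merge, while `contiguous_copy_size(strided_cols)` is 4
bytes, because nothing beyond the byte dimension is contiguous.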