[clang][bytecode][NFC] Move some opcode impls to the source file (#177543)
They aren't templated, so move them to Interp.cpp to make the header
file a bit shorter.
[mlir][spirv] Add Conv operations for TOSA Extended Instruction Set (001000.1) (#176908)
This patch expands support for the TOSA Extended Instruction Set
(001000.1) to the SPIR-V dialect in MLIR. The TOSA extended instruction
set provides a standardized set of machine learning operations designed
to be used within `spirv.ARM.Graph` operations (corresponding to
OpGraphARM in SPV_ARM_graph) and typed with `!spirv.arm.tensor<...>`
(corresponding to OpTypeTensorARM in SPV_ARM_tensor).
The change introduces:
* Extending dialect plumbing for import, serialization, and
deserialization of the TOSA extended instruction set.
* The `spirv.Tosa.*Conv*` convolution operations from the TOSA extended
instruction set, each lowering to the corresponding `OpExtInst`.
* Verification enforcing that the new convolution operations appear only
within `spirv.ARM.Graph` regions, operate on `!spirv.arm.tensor<...>`
types, and are well-formed according to the TOSA 001000.1 specification.
All convolution operations from TOSA 001000.1 extended instructions are
[11 lines not shown]
[X86] Enable custom lowering of 256/512-bit vXi32 and vXi64 CLMUL nodes (#177554)
Similar to the 128-bit v4i32/v2i64 support, these can now be efficiently
lowered to PCLMUL nodes through unrolling, shuffle combining, and
concatenation.
If the target only supports PCLMUL then they will remain as 128-bit
nodes, but if VPCLMULQDQ is supported then they should merge into wider
types.
[mlir][shard,mpi] Lowering shard.allgather to MPI (#177202)
- lowering `shard.allgather` to `mpi.allgather`
- fixing lowering of `shard.allreduce`
- minor refactoring
InstCombine: Use SimplifyDemandedFPClass on fmul
Start using SimplifyDemandedFPClass on instructions, beginning with
fmul. This subsumes the old transform on multiplication by 0. The
main change is the introduction of nnan/ninf. I do not think anywhere
was systematically trying to introduce fast math flags before, though
a few odd transforms would set them.
Previously we only called SimplifyDemandedFPClass on function returns
with nofpclass annotations. Start following the pattern of
SimplifyDemandedBits, where this will be called from relevant root
instructions.
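As a rough sketch (the signature and demanded-mask plumbing here are
assumptions, not taken from the patch), the root-driven pattern looks
like this:

```cpp
// Sketch: drive the demanded-FP-class query from a root fmul, the way
// visit functions drive SimplifyDemandedBits from integer roots.
Instruction *InstCombinerImpl::visitFMul(BinaryOperator &I) {
  // nnan/ninf on the fmul shrink the set of classes to preserve.
  FPClassTest Demanded = fcAllFlags;
  if (I.hasNoNaNs())
    Demanded &= ~fcNan;
  if (I.hasNoInfs())
    Demanded &= ~fcInf;

  KnownFPClass Known;
  if (Value *V = SimplifyDemandedFPClass(&I, Demanded, Known))
    return replaceInstUsesWith(I, V);

  // ... existing fmul folds ...
  return nullptr;
}
```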
I was wondering if this should go into AggressiveInstCombine, but that
apparently does not make use of InstCombineInternal's worklist.
[NFC][ARC] Tidy Up RegState in ARC Backend (#177546)
This was missed in llvm/llvm-project#177090 because GitHub CI and my
local build don't have experimental targets enabled.
This is the only problematic RegState use in the experimental targets.
[Clang] Fix the normalization of fold constraints (#177531)
Fold constraints can contain packs expanded from different locations.
For `C<Ps...>`, where the ellipsis immediately follows the argument, the
pack should be expanded in place regardless of the fold expression. For
`C<Ps> && ...`, the fold expression itself is responsible for expanding
Ps.
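For illustration (hypothetical names, not the reproducer from the linked
issue), both kinds can appear in a single fold constraint:

```cpp
template <typename...> concept C = true;

template <typename... Us> struct S {
  // Us... is expanded in place inside C's template-argument list;
  // Ps is left for the enclosing fold expression to expand.
  template <typename... Ps>
    requires (C<Us..., Ps> && ...)
  static void f();
};
```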
Previously, both kinds of packs were expanded by the fold expression,
which broke assumptions within concept caching. This patch fixes that by
preserving PackExpansionTypes for the first kind of pack while rewriting
them to non-packs for the second kind.
This patch also removes an unused function and performs some cleanup of
the evaluation contexts. Hopefully it is viable for backporting.
No release note, as this issue was a regression.
Fixes https://github.com/llvm/llvm-project/issues/177245
[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication
without additional synchronization.
The previous built-in made it work because one could just override the CPol
bits, but that bypasses the memory model and forces the user to learn the
ISA's bit encodings.
Making load monitor an atomic operation has a couple of advantages. First,
the memory model foundation for it is stronger. We just lean on the existing
rules for atomic operations. Second, the CPol bits are abstracted away from
the user, which avoids leaking ISA details into the API.
This patch also adds supporting memory model and intrinsics documentation to
AMDGPUUsage.
Solves SWDEV-516398.
[libclang/python] Add CompletionChunkKind enum and deprecate old CompletionChunk.Kind (#176631)
This addresses point 1 from
https://github.com/llvm/llvm-project/issues/156680.
Since step 4 is already completed, `CompletionChunk.Kind` becomes unused
in this PR, so it is removed.
Revert "[AMDGPU] Allow amdgpu-waves-per-eu to lower target occupancy range" (#177544)
Reverts llvm/llvm-project#168358
Reverting due to a buildbot failure, as commented on the original PR.
[NFCI][bolt][test] Enable AT&T syntax generally (#172355)
Having it in the X86 subdirectory only affects tests in that directory.
That is not sufficient, however: for example, runtime/X86/pie-exceptions-split.test is affected but
isn't located in the X86 directory.
This essentially fixes the fix for the original commit by properly guarding it on the X86
target having been built and the flag being recognized.
Fixes: 6c48fbc1dcfbd44a47f126f21e575340b67aac06
[MC][X86/M68k] Emit syntax directive for AT&T (#167234)
This eases interoperability by making it explicit in emitted assembly code which syntax is used.
Refactored to remove X86-specific directives and logic from the generic MC(Asm)Streamer.
Motivated by building LLVM with `-mllvm -x86-asm-syntax=intel` (i.e. a global preference for Intel
syntax). A Bolt test (`runtime/X86/fdata-escape-chars.ll`) was using `llc` to compile to assembly
and then assembling with `clang`. The specific option causes Clang to assume Intel syntax, but only
for assembly, not for inline assembly.
[NFC][MI] Tidy Up RegState enum use (2/2) (#177090)
This change makes `RegState` into an enum class, with bitwise operators.
It also:
- Updates declarations of flag variables/arguments/returns from
`unsigned` to `RegState`.
- Updates empty RegState initializers from 0 to `{}`.
If this is causing problems in downstream code:
- Adopt the `RegState getXXXRegState(bool)` functions instead of using a
ternary operator such as `bool ? RegState::XXX : 0`.
- Adopt the `bool hasRegState(RegState, RegState)` function instead of
using a bitwise check of the flags.
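A minimal sketch of that migration (assuming the usual LLVM context of
MachineInstrBuilder.h, where these helpers live):

```cpp
#include "llvm/CodeGen/MachineInstrBuilder.h"
using namespace llvm;

// Instead of `IsDef ? RegState::Define : 0` (ill-formed now that
// RegState is an enum class), build flags with the getter helpers
// and the bitwise operators.
RegState buildFlags(bool IsDef, bool IsKill) {
  return getDefRegState(IsDef) | getKillRegState(IsKill);
}

// Instead of a raw bitwise test like `Flags & RegState::Kill`,
// use the hasRegState helper.
bool killsReg(RegState Flags) {
  return hasRegState(Flags, RegState::Kill);
}
```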
[TargetLowering] Avoid unnecessary EVT -> Type -> EVT roundtrip (NFC) (#177328)
For pointers, this gets the pointer EVT, then converts it back into a
type, and then gets the EVT for that type again. We can directly use the
pointer EVT.
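In sketch form (hypothetical surrounding code, not the exact call site):

```cpp
// Before: EVT -> Type -> EVT roundtrip for a pointer value.
EVT PtrVT = TLI.getPointerTy(DAG.getDataLayout());
Type *PtrTy = PtrVT.getTypeForEVT(*DAG.getContext());
EVT VT = TLI.getValueType(DAG.getDataLayout(), PtrTy); // same as PtrVT

// After: use the pointer EVT directly.
EVT VT2 = PtrVT;
```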
[AMDGPU] Fix use-after-erase in OpenCL printf runtime binding (#177356)
When handling OpenCL printf calls, the AMDGPU backend replaces the
actual function call with a runtime binding. However, this replacement
currently assumes that the original call's result has no uses. If there
are uses, erasing the function call leads to errors.
This patch replaces all uses of the original printf call with a 0 value
constant, signalling success of the printf operation.
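In sketch form (simplified; `CI` stands for the original printf
CallInst):

```cpp
// Forward all uses of the printf call to the constant 0 (its success
// value), then erase the now-unused call.
Value *Zero = ConstantInt::get(CI->getType(), 0);
CI->replaceAllUsesWith(Zero);
CI->eraseFromParent();
```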
Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen at amd.com>
Co-authored-by: Steffen Holst Larsen <HolstLarsen.Steffen at amd.com>
DAG: Remove softPromoteHalfType
Remove the now unimplemented target hook and associated DAG machinery
for the old half legalization path.
Really fixes #97975
R600: Remove softPromoteHalfType
Also includes a somewhat hacky, minimal change to avoid assertions
once softPromoteHalfType is removed, to fix kernel arguments
lowered as f16. Half support was never really implemented
for R600, and there just happened to be a few incidental tests
which included a half argument (which were also not meaningful,
since the function body just folded away due to the lack of callable
function support).
AMDGPU: Move softPromoteHalfType override to R600 only
As expected, the generated code is much worse, but more correct.
We could do a better job with source modifier management around
fp16_to_fp/fp_to_fp16.
[AMDGPU] Fix inline constant encoding for `v_pk_fmac_f16` (#176659)
This PR handles `v_pk_fmac_f16` inline constant encoding/decoding
differences between pre-GFX11 and GFX11+ hardware.
- Pre-GFX11: fp16 inline constants produce `(f16, 0)` - value in low 16
bits, zero in high.
- GFX11+: fp16 inline constants are duplicated to both halves `(f16,
f16)`.
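Concretely, for the fp16 inline constant 1.0 (bit pattern 0x3C00), the
packed 32-bit operand would be interpreted as (illustrative values):

```cpp
#include <cstdint>

// Pre-GFX11: value in the low half, zero in the high half.
uint32_t PreGfx11  = 0x00003C00; // (1.0h, 0)
// GFX11+: value duplicated into both halves.
uint32_t Gfx11Plus = 0x3C003C00; // (1.0h, 1.0h)
```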
Fixes #94116.
(cherry picked from commit c253b9f9caf0be95bb16e973f216489d894370e1)
[AArch64] Fix partial_reduce v16i8 -> v2i32 (#177119)
The lowering doesn't need to check for `ConvertToScalable`, because it
lowers to another `PARTIAL_REDUCE_*MLA` node, which is subsequently
lowered using either fixed-length or scalable types.
This fixes https://github.com/llvm/llvm-project/issues/176954
Re-generate check lines
The check lines for SME were different because of sub-register liveness,
which is enabled for streaming functions on trunk, but isn't enabled on
the release branch.
(cherry picked from commit de997639876db38d20c7ed9fb0c683a239d56bf5)
[MC] Explicitly use memcpy in emitBytes() (NFC) (#177187)
We've observed a compile-time regression in LLVM 22 when including large
blobs. The root cause was that emitBytes() was copying bytes one-by-one,
which is much slower than using memcpy for large objects.
Optimization of std::copy to memmove is apparently much less reliable
than one might think. In particular, when using a non-bleeding-edge
libstdc++ (anything older than version 15), this does not happen if the
types of the input and output iterators do not match (like here, where
there is a signed/unsigned mismatch).
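For illustration (a standalone sketch, not the emitBytes code itself):

```cpp
#include <algorithm>
#include <cstring>

// With mismatched iterator value types (unsigned char* vs. char*),
// libstdc++ before version 15 does not dispatch std::copy to memmove
// and copies element by element instead.
void slowAppend(const unsigned char *Src, std::size_t N, char *Dst) {
  std::copy(Src, Src + N, Dst);
}

// memcpy is always a single bulk copy, regardless of iterator types.
void fastAppend(const unsigned char *Src, std::size_t N, char *Dst) {
  std::memcpy(Dst, Src, N);
}
```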
As this code is performance sensitive, I think it makes sense to
directly use memcpy.
Previously this code used SmallVector::append, which explicitly uses
memcpy.
(cherry picked from commit 15e421dc643ce4d9d79174fec585cf787e56b1a0)