[CIR][AMDGPU] Add lowering for amdgcn_div_scale builtins (#192931)
Upstreaming clangIR PR: https://github.com/llvm/clangir/pull/2050
This PR adds support for lowering the `__builtin_amdgcn_div_scale*` AMDGPU
builtins to ClangIR.
The lowering follows the reference Clang -> LLVM IR lowering in
clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp.
[flang][OpenMP] Support lowering of metadirective (part 1)
This patch implements the following metadirective features:
- implementation={vendor(...)}
- device={kind(...), isa(...), arch(...)}
- user={condition(<constant-expr>)}
- construct={parallel, target, teams}
- the default, nothing, and otherwise clauses
Dynamic user conditions and loop-associated variants are deferred
to follow-up patches.
This patch is part of the feature work for #188820.
Assisted with copilot and GPT-5.4
[NVPTX] Improve error diagnostic when handling unknown intrinsics (#191194)
Following up on #146726, it may be desirable to fail the compilation
gracefully in the presence of unknown NVVM intrinsics, which the NVPTX
backend cannot lower, rather than silently emitting invalid PTX.
[LoongArch] Type legalize v2f32 loads by using an f64 load and a scalar_to_vector (#164943)
On 64-bit targets the generic legalizer will use an i64 load and a
scalar_to_vector for us. But on 32-bit targets, i64 isn't legal, and the
generic legalizer will end up emitting two 32-bit loads. This patch uses
f64 to avoid the splitting entirely and the redundant int->fp
conversion.
[MLIR][OpenMP] Post-translate declare-target USM indirection in OpenMPIRBuilder
When lowering OpenMP to LLVM IR for the target device, record pairs of the
`declare target` device global and the OMPIRBuilder "ref" pointer global
(used for unified shared memory) via `OpenMPIRBuilder`. During the
`OpenMPIRBuilder::finalize` pass, run a post-pass that rewrites remaining uses
of the original global to load from the ref global and adjust the pointer
(a shared path handles both `ConstantExpr` addrspace/bitcast chains and
direct instruction uses).
This follows what is done by clang for similar cases:
https://reviews.llvm.org/D63108.
Co-authored-by: Composer
Co-authored-by: Gemini Pro
[Flang][OpenMP] Clear close on descriptor members for box parents in USM
Extend the MapInfoFinalization walk introduced in #185330 so
parent/member close consistency is enforced whenever
unified_shared_memory is in effect, not only when the parent map's
variable is a fir.RecordType. Allocatable (box) roots expand to member
maps the same way as derived-type instances; getDescriptorMapType may
add OMP_MAP_CLOSE to implicit descriptor members while the parent map
does not set close, which led to bad device behavior under
-fopenmp-force-usm with multiple mapped allocatables.
Co-authored-by: Composer (Cursor) <ai at cursor.com>
[LoopFusion][NFC] UTC gen some tests (#193755)
Some variables needed renaming because UTC normalizes IR value names.
Also, remove the dead variables `%M` and `%N` from
`double_loop_nest_inner_guard.ll`.
AMDGPU: Back-propagate wqm for sources of side-effect instruction (#193395)
A readfirstlane instruction yields an undefined value when exec is zero.
To handle the case where only helper lanes execute the parent block, we
let the readfirstlane execute under WQM. But this is not enough: if the
parent block is also executed by non-helper lanes, we must make sure its
sources were computed under WQM as well. Otherwise, if the instruction
that generates the source of the readfirstlane was executed in exact
mode, the value would contain garbage data in the helper lanes, and that
garbage could be returned by the readfirstlane running under WQM.
To fix this issue, we enforce back-propagation of WQM for instructions
like readfirstlane. This is only done when the instruction is possibly
in the middle of a WQM region (determined by checking OutNeeds).
[GVN] Propagate isMemorySSAEnabled() into ValueTable (#193938)
`GVNPass::runImpl()` calls `VN.setMemorySSA(MSSA)` with a single
argument. The second parameter of `ValueTable::setMemorySSA()`,
`MSSAEnabled`, defaults to `false`, so `ValueTable::IsMSSAEnabled`
remains false even when the pass is configured with
`-enable-gvn-memoryssa=1` or `-passes='gvn<memoryssa>'`.
The MemorySSA-backed value-numbering paths in
`ValueTable::lookupOrAddCall()` and `ValueTable::computeLoadStoreVN()`
are gated on `IsMSSAEnabled`, making them unreachable from runImpl() on
main today.
This patch forwards isMemorySSAEnabled() as the second argument to
setMemorySSA(), so selecting the MemorySSA backend actually enables
MemorySSA-aware value numbering.
[X86] Mark machine-block-hash.mir as XFAIL on big-endian hosts (#194279)
The test introduced in #193107 assumes `stable_hash_combine` produces
the same result on all hosts, but it turns out that is not true: the
hash differs on big-endian hosts.