[OpenMP][offload] Cross-team reductions with variable number of teams (#195102)
This is a part of a series of patches that rework OpenMP cross-team
reductions.
This patch changes the cross-team reduction runtime to no longer work
through a larger number of teams in chunks. Instead, we allocate a
suitable-sized global buffer for the team values and let all teams run
at once. The last team that finishes uses a strided loop to reduce the
team values from the global buffer.
We also use `mapping::getNumberOfThreadsInBlock()` instead of
`omp_get_num_threads()` because the reduction of the team values runs
outside of the parallel region device code, which would make
`omp_get_num_threads()` always return 1. For Generic-SPMD mode, we also
want to use all available threads, which means that we need to copy the
reduction data from LDS (where it lives in that mode by default) to
scratch in codegen before calling the cross-team reduction.
[48 lines not shown]
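For illustration only, the strided loop run by the last team can be modeled on the host as below. This is a minimal sketch, not the runtime code: `reduceTeamValuesStrided`, the `long` element type, and the plain sum are assumptions made for the example, and the threads are simulated sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the final reduction step (not the actual
// device runtime): thread `tid` of the last team visits elements
// tid, tid + numThreads, tid + 2 * numThreads, ... of the global buffer.
long reduceTeamValuesStrided(const std::vector<long> &teamValues,
                             std::size_t numThreads) {
  assert(numThreads > 0 && "need at least one thread");

  // One private partial result per simulated thread.
  std::vector<long> partial(numThreads, 0);
  for (std::size_t tid = 0; tid < numThreads; ++tid)
    for (std::size_t i = tid; i < teamValues.size(); i += numThreads)
      partial[tid] += teamValues[i];

  // Combine the per-thread partials into the final value.
  long result = 0;
  for (long p : partial)
    result += p;
  return result;
}
```

Because the stride equals the thread count, the buffer can hold any number of team values without being processed in chunks.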
[DirectX] Handle llvm.dx.resource.getbasepointer intrinsic in DXILResourceAccess pass (#204732)
The `llvm.dx.resource.getbasepointer` intrinsic is emitted for
`ConstantBuffer<T>` element access and needs to be translated to
`llvm.dx.resource.load.cbufferrow` calls in the `DXILResourceAccess`
pass. The handling is identical to `llvm.dx.resource.getpointer` with a
0 offset.
Fixes #204234
[LifetimeSafety] Allow configuring lifetimebound fix-it spelling (#204045)
When suggesting `[[clang::lifetimebound]]` fix-its, allow users to
provide a project-specific macro spelling with
`-lifetime-safety-lifetimebound-macro=...`.
If no spelling is configured, use a visible macro whose replacement
tokens spell the attribute, preferring the most recently defined
matching macro, and fall back to `[[clang::lifetimebound]]` or
`__attribute((lifetimebound))` otherwise.
Closes https://github.com/llvm/llvm-project/issues/200232
[BOLT][AArch64] Align tentative layout bases using per-section alignment (#204262)
Move `AssignSections` pass before `AlignerPass` so it can record the max
code alignment per output section, then align the tentative hot/cold
section bases using the recorded alignment, which makes the tentative
layout better match the layout that is actually emitted.
[Clang][UBSan] Use EmitCheckedLValue for C++ trivial operator= operands (#203737)
Further to https://github.com/llvm/llvm-project/pull/190739, use
EmitCheckedLValue for trivial operator= operands
* for the LHS (`lhs->` not handled yet), and
* for the RHS also for function call syntax.
[Support] Add a parser for cl::opt<ElementCount> (#203969)
This adds command-line option parsing support for ElementCount.
This allows the following syntax:
```
--my-option=4 ; Maps to ElementCount::getFixed(4)
--my-option="vscale x 8" ; Maps to ElementCount::getScalable(8)
```
This is intended to unify fixed/scalable option handling in the loop
vectorizer. Currently, we have options like
`EpilogueVectorizationForceVF` defined as `cl::opt<unsigned>` which do
not allow specifying scalable VFs.
Assisted-by: Codex
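As a rough illustration of the two accepted spellings, a standalone parser might look like the sketch below. It does not use the LLVM `cl::parser` machinery; `ParsedElementCount` and `parseElementCount` are hypothetical names, and the exact whitespace handling of the real parser is an assumption.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical stand-in for ElementCount: a minimum value plus a flag that
// says whether it is multiplied by vscale.
struct ParsedElementCount {
  unsigned MinValue;
  bool Scalable;
};

// Accepts "N" (fixed) or "vscale x N" (scalable); returns std::nullopt for
// anything else. Assumes exactly one space on each side of the 'x'.
std::optional<ParsedElementCount> parseElementCount(const std::string &Arg) {
  const std::string Prefix = "vscale x ";
  bool Scalable = Arg.compare(0, Prefix.size(), Prefix) == 0;
  std::string Digits = Scalable ? Arg.substr(Prefix.size()) : Arg;

  // Limit the length so the accumulation below cannot overflow 32 bits.
  if (Digits.empty() || Digits.size() > 9)
    return std::nullopt;

  unsigned Value = 0;
  for (char C : Digits) {
    if (!std::isdigit(static_cast<unsigned char>(C)))
      return std::nullopt;
    Value = Value * 10 + static_cast<unsigned>(C - '0');
  }
  return ParsedElementCount{Value, Scalable};
}
```

With this sketch, `parseElementCount("4")` yields a fixed count of 4 and `parseElementCount("vscale x 8")` yields a scalable count of 8.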
[AMDGPU] Use explicit carry nodes for i64 wide integer lowering (#204694)
This PR switches widened i64 add/sub lowering to use explicit UADDO/USUBO
carry nodes instead of glue-based carry chains.
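The arithmetic behind an explicit carry can be sketched in plain C++ as follows. This models only the data flow of an add that returns its carry-out as an ordinary value; it is not the SelectionDAG lowering itself, and `U128`, `uaddo`, and `add128` are hypothetical names chosen for the example.

```cpp
#include <cstdint>

// Hypothetical 128-bit value split into two 64-bit limbs.
struct U128 {
  std::uint64_t Lo;
  std::uint64_t Hi;
};

// Result of an add that also reports unsigned overflow, in the spirit of a
// UADDO node: the carry is a regular value, not an implicit side channel.
struct AddWithCarry {
  std::uint64_t Sum;
  bool CarryOut;
};

AddWithCarry uaddo(std::uint64_t A, std::uint64_t B) {
  std::uint64_t Sum = A + B; // wraps modulo 2^64
  return {Sum, Sum < A};     // wrapped iff the sum is smaller than an input
}

// Wide add: the low limb produces the carry, the high limb consumes it.
U128 add128(U128 A, U128 B) {
  AddWithCarry Lo = uaddo(A.Lo, B.Lo);
  std::uint64_t Hi = A.Hi + B.Hi + (Lo.CarryOut ? 1u : 0u);
  return {Lo.Sum, Hi};
}
```

The point of the sketch is that the carry travels between the limbs as a named value, which is the property the explicit carry nodes provide.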
[SPIR-V] Lower undef nested in a constant aggregate (#204377)
A constant aggregate whose element is itself an aggregate `undef` was
never lowered to a placeholder. The raw aggregate operand reached
IRTranslator on the llvm.spv.const.composite call and aborted with
"unable to translate instruction".
A similar issue was found and fixed during the SPV_KHR_poison_freeze
implementation. Instead of reinventing the wheel, unify the lowering with
that of poison.
Addresses the following observation:
https://github.com/llvm/llvm-project/pull/198037#discussion_r3304013315
[LV] Unify header phi fixup and remove fixNonInductionPHIs (NFC). (#204886)
Unify the execute logic for VPPhi and VPWidenPHIRecipe into a shared
executePhiRecipe helper that handles both scalar and vector phis. For
header phis, only the preheader incoming value is added during execute;
the backedge is fixed up later by VPlan::execute().
This allows generalizing the VPlan::execute() fixup loop to handle all
loop headers (not just the first), removing the VPWidenPHIRecipe skip,
and eliminating fixNonInductionPHIs entirely.
[Verifier] Verify AMX tile-register index operands are in range
AMX has 8 physical tile registers (TMM0-TMM7), so the tile-index operands
of the AMX intrinsics must be in [0, 8): operand 0 for the tile
load/store/zero intrinsics, operands 0-2 for the tdp* family.
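The range rule can be sketched as a small standalone check. This is not the Verifier code: `isValidTileIndex` and `tileOperandsInRange` are hypothetical helpers, and the caller is assumed to pass the already-extracted constant operand values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// AMX exposes TMM0-TMM7, so valid tile indices are 0 through 7.
constexpr std::uint64_t NumAMXTileRegs = 8;

constexpr bool isValidTileIndex(std::uint64_t Idx) {
  return Idx < NumAMXTileRegs;
}

// Checks the first NumTileOperands entries of Operands: 1 for the tile
// load/store/zero intrinsics, 3 for the tdp* family. Returns false if fewer
// operands than requested are available.
bool tileOperandsInRange(const std::vector<std::uint64_t> &Operands,
                         std::size_t NumTileOperands) {
  if (Operands.size() < NumTileOperands)
    return false;
  for (std::size_t I = 0; I < NumTileOperands; ++I)
    if (!isValidTileIndex(Operands[I]))
      return false;
  return true;
}
```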
clang/AMDGPU: Remove driver restriction on --gpu-max-threads-per-block
Previously this flag was only handled for HIP, and would produce an unused
argument warning. There is a custom warning produced by cc1 that the
argument isn't supported, but practically speaking that was unreachable
due to not forwarding the argument. Also add a test for the untested warning.
Also use a simpler method for forwarding the flag to cc1.
clang/AMDGPU: Merge toolchain subclasses
Simplify the toolchain implementations by collapsing
them into one. Previously we had a confusing split. The
AMDGPUToolChain base class implemented much of the base
support. It was subclassed by ROCMToolChain, which would
have been more accurately described as the offloading subclass.
That was further subclassed into HIP and OpenMP specific subclasses.
Deleting those two is the important part of this change. There was
code duplication, and features arbitrarily handled in one but not
the other. The offload kind is passed in almost everywhere if you
really need to know the original language. However, I consider
this an antifeature, and it is really poor QoI to have the HIP
and OpenMP toolchains behave differently in any way. The platform
should be consistent and the driver behaviors should not depend
on the language.
There is additional mess in the handling of spirv, which this
[9 lines not shown]
clang/AMDGPU: Fix double linking opencl libs with --libclc-lib
Noticed by inspection. If using an explicit --libclc-lib flag,
do not attempt to also link the rocm device libs which will contain
different implementations of the same opencl symbols.
Co-Authored-By: Claude <noreply at anthropic.com>
[offload] Fix teams/threads limits in record replay (#200639)
The recording phase now sets the teams and threads limits provided by
the user (in the corresponding OpenMP clauses) or zero if not specified.
Additionally, PR #199483 already enforces that the replay's configuration
of threads and teams is respected.
This commit also changes how we test record and replay when multiple
kernels are recorded in the same test. We now use the record report to
associate each JSON record descriptor file with its target region in the code.
We no longer rely on the modification time of the files to determine the order,
which was problematic.