LLVM/project e9acb01clang/test/OpenMP nvptx_teams_reduction_codegen.cpp target_teams_reduction_codegen.cpp, llvm/lib/Frontend/OpenMP OMPIRBuilder.cpp

[OpenMP][offload] Cross-team reductions with variable number of teams (#195102)

This is a part of a series of patches that rework OpenMP cross-team
reductions.

This patch changes the cross-team reduction runtime to no longer work
through a larger number of teams in chunks. Instead, we allocate a
suitable-sized global buffer for the team values and let all teams run
at once. The last team that finishes uses a strided loop to reduce the
team values from the global buffer.

We also use `mapping::getNumberOfThreadsInBlock()` instead of
`omp_get_num_threads()` because the reduction of the team values runs
outside of the parallel region device code, which would make
`omp_get_num_threads()` always return 1. For Generic-SPMD mode, we also
want to use all available threads, which means that we need to copy the
reduction data from LDS (where it lives in that mode by default) to
scratch in codegen before calling the cross-team reduction.
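
As a rough host-side illustration of the strided loop described above (not
the code in `openmp/device/src/Reduction.cpp`), the sketch below sums one
value per team from a buffer. The names `reduceTeamValues`, `teamValues`,
and `numThreads` are assumptions made for this example, and the sequential
outer loop stands in for the threads of the last team to finish.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: `teamValues` plays the role of the global buffer that
// holds one partial result per team; `numThreads` plays the role of the
// number of threads in the block that performs the final reduction.
long reduceTeamValues(const std::vector<long> &teamValues,
                      std::size_t numThreads) {
  assert(numThreads > 0 && "need at least one thread");
  std::vector<long> partial(numThreads, 0);
  // "Thread" tid visits elements tid, tid + numThreads, tid + 2 * numThreads,
  // and so on, so any number of team values is covered in a single pass.
  for (std::size_t tid = 0; tid < numThreads; ++tid)
    for (std::size_t i = tid; i < teamValues.size(); i += numThreads)
      partial[tid] += teamValues[i];
  // Combine the per-thread partial sums. On a device this step would be an
  // intra-team reduction rather than a sequential loop.
  long result = 0;
  for (long p : partial)
    result += p;
  return result;
}
```

The stride equals the thread count, so the number of team values does not
need to be a multiple of, or bounded by, the number of threads.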


    [48 lines not shown]
DeltaFile
+0-3,642clang/test/OpenMP/nvptx_teams_reduction_codegen.cpp
+2,331-0clang/test/OpenMP/target_teams_reduction_codegen.cpp
+155-169openmp/device/src/Reduction.cpp
+144-73llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
+60-60clang/test/OpenMP/teams_distribute_parallel_for_simd_schedule_codegen.cpp
+60-60clang/test/OpenMP/teams_distribute_parallel_for_schedule_codegen.cpp
+2,750-4,004168 files not shown
+4,266-5,534174 files

LLVM/project 2678b8fllvm/lib/Target/DirectX DXILResourceAccess.cpp DXILOpLowering.cpp, llvm/test/CodeGen/DirectX/ResourceAccess load-constant-buffer-t.ll

[DirectX] Handle llvm.dx.resource.getbasepointer intrinsic in DXILResourceAccess pass (#204732)

The `llvm.dx.resource.getbasepointer` intrinsic is emitted for
`ConstantBuffer<T>` element access and needs to be translated to
`llvm.dx.resource.load.cbufferrow` calls in the `DXILResourceAccess`
pass. The handling is identical to `llvm.dx.resource.getpointer` with a
0 offset.

Fixes #204234
DeltaFile
+189-0llvm/test/CodeGen/DirectX/ResourceAccess/load-constant-buffer-t.ll
+12-3llvm/lib/Target/DirectX/DXILResourceAccess.cpp
+1-0llvm/lib/Target/DirectX/DXILOpLowering.cpp
+202-33 files

LLVM/project 359bfe6clang/docs LifetimeSafety.rst, clang/include/clang/Basic LangOptions.h

[LifetimeSafety] Allow configuring lifetimebound fix-it spelling (#204045)

When suggesting `[[clang::lifetimebound]]` fix-its, allow users to
provide a project-specific macro spelling with
`-lifetime-safety-lifetimebound-macro=...`.

If no spelling is configured, use a visible macro whose replacement
tokens spell the attribute, preferring the most recently defined
matching macro, and fall back to `[[clang::lifetimebound]]` or
`__attribute((lifetimebound))` otherwise.
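
For illustration, a project-specific macro of the kind described above
might look like the following sketch. The macro name `MYPROJ_LIFETIMEBOUND`
and the function `longer` are assumptions made for this example; only the
attribute spelling comes from the commit message.

```cpp
#include <string>

// Hypothetical project macro whose replacement tokens spell the attribute.
// It expands to nothing on compilers other than Clang.
#if defined(__clang__)
#define MYPROJ_LIFETIMEBOUND [[clang::lifetimebound]]
#else
#define MYPROJ_LIFETIMEBOUND
#endif

// The returned reference may alias either argument, so both parameters are
// annotated through the macro instead of the raw attribute.
const std::string &longer(const std::string &a MYPROJ_LIFETIMEBOUND,
                          const std::string &b MYPROJ_LIFETIMEBOUND) {
  return a.size() >= b.size() ? a : b;
}
```

With such a macro visible, the fix-it would presumably suggest
`MYPROJ_LIFETIMEBOUND` rather than the raw attribute, keeping suggested
edits consistent with the project's existing annotations.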

Closes https://github.com/llvm/llvm-project/issues/200232
DeltaFile
+76-0clang/test/Sema/LifetimeSafety/annotation-suggestions-fixits.cpp
+49-2clang/test/Sema/LifetimeSafety/misplaced-lifetimebound-intra-tu.cpp
+31-6clang/lib/Sema/SemaLifetimeSafety.h
+9-0clang/include/clang/Options/Options.td
+7-1clang/docs/LifetimeSafety.rst
+3-0clang/include/clang/Basic/LangOptions.h
+175-96 files

LLVM/project 0928584clang/lib/Format FormatTokenLexer.cpp FormatTokenLexer.h

[clang-format][NFC] Clean up FormatTokenLexer (#203825)
DeltaFile
+11-4clang/lib/Format/FormatTokenLexer.cpp
+0-1clang/lib/Format/FormatTokenLexer.h
+11-52 files

LLVM/project e47530bbolt/include/bolt/Core BinaryContext.h, bolt/lib/Passes Aligner.cpp LongJmp.cpp

[BOLT][AArch64] Align tentative layout bases using per-section alignment (#204262)

Move `AssignSections` pass before `AlignerPass` so it can record the max
code alignment per output section, then align the tentative hot/cold
section bases using the recorded alignment, which makes the tentative layout
better match the layout that is actually emitted.
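
The idea can be sketched as follows. This is not BOLT's code: the names
`SectionAlignment`, `record`, and `alignTo` below are assumptions made for
this example, and alignments are assumed to be powers of two.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round Addr up to the next multiple of Align (Align must be a power of two).
std::uint64_t alignTo(std::uint64_t Addr, std::uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "power of two expected");
  return (Addr + Align - 1) & ~(Align - 1);
}

// Hypothetical per-section record of the largest code alignment seen so far.
struct SectionAlignment {
  std::uint64_t MaxAlign = 1;
  void record(std::uint64_t Align) { MaxAlign = std::max(MaxAlign, Align); }
};

// Place a tentative section base at or after Addr, honoring the recorded
// maximum alignment of the functions assigned to that section.
std::uint64_t tentativeBase(std::uint64_t Addr, const SectionAlignment &SA) {
  return alignTo(Addr, SA.MaxAlign);
}
```

If the tentative base ignored the section's alignment, every address
computed from it could be off by the padding the emitter inserts later.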
DeltaFile
+24-0bolt/include/bolt/Core/BinaryContext.h
+20-0bolt/lib/Passes/Aligner.cpp
+8-3bolt/lib/Passes/LongJmp.cpp
+5-3bolt/lib/Rewrite/BinaryPassManager.cpp
+57-64 files

LLVM/project b32488fclang/lib/CodeGen CGExprCXX.cpp, clang/test/CodeGen ubsan-aggregate-null-align-bounds.c

[Clang][UBSan] Use EmitCheckedLValue for C++ trivial operator= operands (#203737)

Further to https://github.com/llvm/llvm-project/pull/190739, use
EmitCheckedLValue for trivial operator= operands
* for the LHS (`lhs->` not handled yet), and
* for the RHS also for function call syntax.
DeltaFile
+46-23clang/test/CodeGen/ubsan-aggregate-null-align-bounds.c
+27-16clang/lib/CodeGen/CGExprCXX.cpp
+73-392 files

LLVM/project ba5384allvm/include/llvm/Support CommandLine.h, llvm/lib/Support CommandLine.cpp

[Support] Add a parser for cl::opt<ElementCount> (#203969)

This adds command-line option parsing support for ElementCount.

This allows the following syntax:
```
  --my-option=4 ; Maps to ElementCount::getFixed(4)
  --my-option="vscale x 8" ; Maps to ElementCount::getScalable(8)
```
This is intended to unify fixed/scalable option handling in the loop
vectorizer. Currently, we have options like
'`EpilogueVectorizationForceVF`' defined as `cl::opt<unsigned>` which do
not allow specifying scalable VFs.
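
The two accepted forms can be sketched with a small standalone parser. This
is not the `cl::parser` specialization added by the patch:
`ParsedElementCount` and `parseElementCount` are hypothetical names, and
details such as whitespace handling in the real parser may differ.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical stand-in for ElementCount: a minimum value plus a flag that
// says whether the count is a multiple of vscale.
struct ParsedElementCount {
  unsigned MinValue;
  bool Scalable;
};

// Accepts "N" (fixed) or "vscale x N" (scalable); rejects everything else.
std::optional<ParsedElementCount> parseElementCount(std::string_view Text) {
  constexpr std::string_view Prefix = "vscale x ";
  bool Scalable = false;
  if (Text.substr(0, Prefix.size()) == Prefix) {
    Scalable = true;
    Text.remove_prefix(Prefix.size());
  }
  unsigned Value = 0;
  const char *End = Text.data() + Text.size();
  auto [Ptr, Ec] = std::from_chars(Text.data(), End, Value);
  // Require that the whole remaining string is a valid unsigned number.
  if (Ec != std::errc() || Ptr != End)
    return std::nullopt;
  return ParsedElementCount{Value, Scalable};
}
```

In this sketch, `parseElementCount("4")` yields a fixed count of 4 and
`parseElementCount("vscale x 8")` yields a scalable count with minimum 8.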

Assisted-by: Codex
DeltaFile
+85-0llvm/unittests/Support/CommandLineTest.cpp
+46-0llvm/lib/Support/CommandLine.cpp
+23-0llvm/include/llvm/Support/CommandLine.h
+154-03 files

LLVM/project a8aba70flang/lib/Lower ConvertVariable.cpp MultiImageFortran.cpp, flang/test/Lower/MIF coarray_allocation5.f90 coarray_allocation4.f90

[Flang] Standardize coarray TODO() diagnostic messages (#204708)
DeltaFile
+5-4flang/lib/Lower/ConvertVariable.cpp
+3-3flang/lib/Lower/MultiImageFortran.cpp
+3-1flang/lib/Lower/Bridge.cpp
+1-1flang/test/Lower/MIF/coarray_allocation5.f90
+1-1flang/test/Lower/MIF/coarray_allocation4.f90
+1-1flang/test/Lower/MIF/coarray_allocation3.f90
+14-112 files not shown
+16-138 files

LLVM/project c890f4dutils/bazel/llvm-project-overlay/mlir BUILD.bazel

[Bazel] Fixes 95e3219 (#204873)

This fixes 95e321951ad3041998e49bc0353482bcd27c65db.

Co-authored-by: Google Bazel Bot <google-bazel-bot at google.com>
DeltaFile
+1-0utils/bazel/llvm-project-overlay/mlir/BUILD.bazel
+1-01 files

LLVM/project 5e7727doffload/ci openmp-offload-amdgpu-libc-runtime.py

Revert "Revert "[AMDGPU] Add compiler-rt checks for the GPU runtime" (#204370)"

This reverts commit 24f4fbf89d7e1c6e7b00efde469adb0a8c529cd2.
DeltaFile
+7-0offload/ci/openmp-offload-amdgpu-libc-runtime.py
+7-01 files

LLVM/project 90b2048llvm/lib/Bitcode/Reader BitcodeReader.cpp, llvm/test/Bitcode invalid-summary-version.test

bitcode: Improve invalid summary version error (#204888)
DeltaFile
+3-4llvm/lib/Bitcode/Reader/BitcodeReader.cpp
+5-0llvm/test/Bitcode/invalid-summary-version.test
+0-0llvm/test/Bitcode/Inputs/invalid-summary-version.bc
+8-43 files

LLVM/project f9fa598llvm/test/CodeGen/AMDGPU rem_i128.ll div_v2i128.ll

[AMDGPU] Use explicit carry nodes for i64 wide integer lowering (#204694)

This PR switches widened i64 add/sub lowering to use explicit UADDO/USUBO carry
nodes instead of glue-based carry chains.
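
To illustrate what an explicit carry means, the sketch below adds two
64-bit integers as two 32-bit halves, with the carry passed as an ordinary
value from the low add to the high add. This is a plain C++ analogy, not
the SelectionDAG code; `addWithCarryOut` and `add64ViaHalves` are names
chosen for this example.

```cpp
#include <cstdint>

// Unsigned add that also reports its carry-out, in the spirit of an
// add-with-overflow node: the carry is a regular result value.
std::uint32_t addWithCarryOut(std::uint32_t A, std::uint32_t B, bool &Carry) {
  std::uint32_t Sum = A + B; // wraps modulo 2^32
  Carry = Sum < A;           // wrapped if and only if a carry occurred
  return Sum;
}

// 64-bit add built from two 32-bit adds linked by the explicit carry.
std::uint64_t add64ViaHalves(std::uint64_t A, std::uint64_t B) {
  bool Carry = false;
  std::uint32_t Lo = addWithCarryOut(static_cast<std::uint32_t>(A),
                                     static_cast<std::uint32_t>(B), Carry);
  std::uint32_t Hi = static_cast<std::uint32_t>(A >> 32) +
                     static_cast<std::uint32_t>(B >> 32) + (Carry ? 1u : 0u);
  return (static_cast<std::uint64_t>(Hi) << 32) | Lo;
}
```

Because the carry is a named value rather than an implicit link between two
instructions, the two halves can be scheduled and optimized like any other
data dependency.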
DeltaFile
+1,255-1,278llvm/test/CodeGen/AMDGPU/rem_i128.ll
+950-975llvm/test/CodeGen/AMDGPU/div_v2i128.ll
+758-780llvm/test/CodeGen/AMDGPU/div_i128.ll
+460-514llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll
+226-250llvm/test/CodeGen/AMDGPU/flat_atomics_i64.ll
+192-216llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll
+3,841-4,01317 files not shown
+4,729-4,74523 files

LLVM/project 086f633llvm/lib/Target/AMDGPU AMDGPURegBankLegalizeRules.cpp, llvm/test/CodeGen/AMDGPU llvm.amdgcn.load.async.to.lds.ll

AMDGPU/GlobalISel: RegBankLegalize rules for load_async_to_lds (#204683)
DeltaFile
+2-1llvm/lib/Target/AMDGPU/AMDGPURegBankLegalizeRules.cpp
+1-1llvm/test/CodeGen/AMDGPU/llvm.amdgcn.load.async.to.lds.ll
+3-22 files

LLVM/project 4195b29.github/workflows subscriber.yml

workflows/subscriber: Update to latest github automation container (#204692)

This one is about 33% smaller than the previous version.
DeltaFile
+1-1.github/workflows/subscriber.yml
+1-11 files

LLVM/project 39f8f90llvm/lib/Target/SPIRV SPIRVEmitIntrinsics.cpp, llvm/test/CodeGen/SPIRV/instructions undef-composite.ll

[SPIR-V] Lower undef nested in a constant aggregate (#204377)

A constant aggregate whose element is itself an aggregate `undef` was
never lowered to a placeholder. The raw aggregate operand reached
IRTranslator on the llvm.spv.const.composite call and aborted with
"unable to translate instruction".

A similar issue was found and fixed during the SPV_KHR_poison_freeze
implementation. Instead of reinventing the wheel, unify the lowering with
that of poison.

Addresses the following observation:
https://github.com/llvm/llvm-project/pull/198037#discussion_r3304013315
DeltaFile
+61-71llvm/lib/Target/SPIRV/SPIRVEmitIntrinsics.cpp
+45-0llvm/test/CodeGen/SPIRV/instructions/undef-composite.ll
+106-712 files

LLVM/project 6f05646llvm/include/llvm/Transforms/Vectorize SLPVectorizer.h, llvm/lib/Transforms/Vectorize SLPVectorizer.cpp

[𝘀𝗽𝗿] initial version

Created using spr 1.3.7
DeltaFile
+249-15llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+21-191llvm/test/Transforms/SLPVectorizer/X86/masked-stores.ll
+2-1llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
+272-2073 files

LLVM/project fe9521dllvm/lib/Transforms/Vectorize VPlanRecipes.cpp VPlan.cpp

[LV] Unify header phi fixup and remove fixNonInductionPHIs (NFC). (#204886)

Unify the execute logic for VPPhi and VPWidenPHIRecipe into a shared
executePhiRecipe helper that handles both scalar and vector phis. For
header phis, only the preheader incoming value is added during execute;
the backedge is fixed up later by VPlan::execute().

This allows generalizing the VPlan::execute() fixup loop to handle all
loop headers (not just the first), removing the VPWidenPHIRecipe skip,
and eliminating fixNonInductionPHIs entirely.
DeltaFile
+22-19llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+15-22llvm/lib/Transforms/Vectorize/VPlan.cpp
+0-22llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+37-633 files

LLVM/project 7472a0ellvm/lib/IR Verifier.cpp, llvm/test/Verifier x86-amx-tile-register-index.ll

[Verifier] Verify AMX tile-register index operands are in range

AMX has 8 physical tile registers (TMM0-TMM7), so the tile-index operands
of the AMX intrinsics must be in [0, 8): operand 0 for the tile
load/store/zero intrinsics, operands 0-2 for the tdp* family.
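
The range rule can be sketched in isolation as follows. This is not the
code added to `Verifier.cpp`: `isValidTileIndex` and `tileOperandsInRange`
are hypothetical helpers, and the operand values are assumed to be already
extracted as constants.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// TMM0-TMM7: eight physical tile registers, so valid indices are 0..7.
constexpr std::uint64_t NumTileRegisters = 8;

constexpr bool isValidTileIndex(std::uint64_t Index) {
  return Index < NumTileRegisters;
}

// Checks the first NumTileOperands operands, which are assumed to be the
// tile indices: 1 for load/store/zero, 3 for the tdp* family.
bool tileOperandsInRange(const std::vector<std::uint64_t> &Operands,
                         std::size_t NumTileOperands) {
  for (std::size_t I = 0; I < NumTileOperands && I < Operands.size(); ++I)
    if (!isValidTileIndex(Operands[I]))
      return false;
  return true;
}
```

Only the leading tile-index operands are checked, so later operands such as
addresses or strides may hold any value.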
DeltaFile
+30-0llvm/test/Verifier/x86-amx-tile-register-index.ll
+24-0llvm/lib/IR/Verifier.cpp
+54-02 files

LLVM/project bd70fc0llvm/lib/Bitcode/Reader BitcodeReader.cpp, llvm/test/Bitcode invalid-summary-version.test

bitcode: Improve invalid summary version error

Include the filename in the description.
DeltaFile
+3-4llvm/lib/Bitcode/Reader/BitcodeReader.cpp
+5-0llvm/test/Bitcode/invalid-summary-version.test
+0-0llvm/test/Bitcode/Inputs/invalid-summary-version.bc
+8-43 files

LLVM/project 776cea3llvm/test/CodeGen/AMDGPU rem_i128.ll div_v2i128.ll

[AMDGPU] Use explicit carry nodes for i64 wide integer lowering

This PR switches widened i64 add/sub lowering to use explicit UADDO/USUBO carry
nodes instead of glue-based carry chains.
DeltaFile
+1,255-1,278llvm/test/CodeGen/AMDGPU/rem_i128.ll
+950-975llvm/test/CodeGen/AMDGPU/div_v2i128.ll
+758-780llvm/test/CodeGen/AMDGPU/div_i128.ll
+460-514llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll
+226-250llvm/test/CodeGen/AMDGPU/flat_atomics_i64.ll
+192-216llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll
+3,841-4,01317 files not shown
+4,729-4,74523 files

LLVM/project 2f0ae3allvm/lib/Bitcode/Reader BitcodeReader.cpp, llvm/test/Bitcode invalid-summary-version.test

bitcode: Improve invalid summary version error

Include the filename in the description.
DeltaFile
+5-0llvm/test/Bitcode/invalid-summary-version.test
+2-1llvm/lib/Bitcode/Reader/BitcodeReader.cpp
+0-0llvm/test/Bitcode/Inputs/invalid-summary-version.bc
+7-13 files

LLVM/project f193189libcxx/include/__cstddef byte.h, libcxx/test/libcxx/language.support nodiscard.verify.cpp

[libc++][byte] Apply [[nodiscard]] to std::byte (#204674)

https://libcxx.llvm.org/CodingGuidelines.html#apply-nodiscard-where-relevant

Towards: #172124
DeltaFile
+24-0libcxx/test/libcxx/language.support/nodiscard.verify.cpp
+6-6libcxx/include/__cstddef/byte.h
+30-62 files

LLVM/project 6bc3ea3clang/lib/Driver/ToolChains AMDGPU.cpp, clang/test/Driver amdgpu-openmp-gpu-max-threads-per-block.c

clang/AMDGPU: Remove driver restriction on --gpu-max-threads-per-block

Previously this flag was only handled for HIP, and would produce an unused
argument warning. There is a custom warning produced by cc1 that the
argument isn't supported, but practically speaking that was unreachable
due to not forwarding the argument. Also add a test for the untested warning.
Also use a simpler method for forwarding the flag to cc1.
DeltaFile
+14-0clang/test/Frontend/openmp-warn-gpu-max-threads-per-block.c
+2-8clang/lib/Driver/ToolChains/AMDGPU.cpp
+6-0clang/test/Driver/amdgpu-openmp-gpu-max-threads-per-block.c
+22-83 files

LLVM/project 22995a6clang/lib/Driver/ToolChains AMDGPU.cpp, clang/test/Driver amdgpu-openmp-gpu-max-threads-per-block.c

clang/AMDGPU: Remove driver restriction on --gpu-max-threads-per-block

Previously this flag was only handled for HIP, and would produce an unused
argument warning. There is a custom warning produced by cc1 that the
argument isn't supported, but practically speaking that was unreachable
due to not forwarding the argument. Also add a test for the untested warning.
Also use a simpler method for forwarding the flag to cc1.
DeltaFile
+14-0clang/test/Frontend/openmp-warn-gpu-max-threads-per-block.c
+2-8clang/lib/Driver/ToolChains/AMDGPU.cpp
+5-0clang/test/Driver/amdgpu-openmp-gpu-max-threads-per-block.c
+21-83 files

LLVM/project cbf215cclang/lib/Driver/ToolChains AMDGPU.cpp, clang/test/Driver amdgpu-openmp-max-threads.c

clang/AMDGPU: Remove driver restriction on --gpu-max-threads-per-block

Previously this flag was only handled for HIP, and would produce an unused
argument warning. There is a custom warning produced by cc1 that the
argument isn't supported, but practically speaking that was unreachable
due to not forwarding the argument. Also add a test for the untested warning.
Also use a simpler method for forwarding the flag to cc1.
DeltaFile
+14-0clang/test/Frontend/openmp-warn-gpu-max-threads-per-block.c
+2-8clang/lib/Driver/ToolChains/AMDGPU.cpp
+5-0clang/test/Driver/amdgpu-openmp-max-threads.c
+21-83 files

LLVM/project 013dffeclang/lib/Driver/ToolChains AMDGPU.cpp HIPAMD.cpp

clang/AMDGPU: Merge toolchain subclasses

Simplify the toolchain implementations by collapsing
them into one. Previously we had a confusing split. The
AMDGPUToolChain base class implemented much of the base
support. It was subclassed by ROCMToolChain, which would
have been more accurately described as the offloading subclass.

That was further subclassed into HIP and OpenMP specific subclasses.
Deleting those two is the important part of this change. There was
code duplication, and features arbitrarily handled in one but not
the other. The offload kind is passed in almost everywhere if you
really need to know the original language. However, I consider
this an antifeature, and it is really poor QoI to have the HIP
and OpenMP toolchains behave differently in any way. The platform
should be consistent and the driver behaviors should not depend
on the language.

There is additional mess in the handling of spirv, which this

    [9 lines not shown]
DeltaFile
+264-123clang/lib/Driver/ToolChains/AMDGPU.cpp
+2-193clang/lib/Driver/ToolChains/HIPAMD.cpp
+0-94clang/lib/Driver/ToolChains/AMDGPUOpenMP.cpp
+48-23clang/lib/Driver/ToolChains/AMDGPU.h
+0-68clang/lib/Driver/ToolChains/AMDGPUOpenMP.h
+1-50clang/lib/Driver/ToolChains/HIPAMD.h
+315-5514 files not shown
+340-56610 files

LLVM/project b17e6f7clang/lib/Driver Driver.cpp

Fix more Windows paths
DeltaFile
+4-4clang/lib/Driver/Driver.cpp
+4-41 files

LLVM/project 18f45d9clang/include/clang/Driver CommonArgs.h, clang/lib/Driver/ToolChains CommonArgs.cpp AMDGPU.cpp

clang/AMDGPU: Fix double linking opencl libs with --libclc-lib

Noticed by inspection. If using an explicit --libclc-lib flag,
do not attempt to also link the ROCm device libs, which will contain
different implementations of the same OpenCL symbols.

Co-Authored-By: Claude <noreply at anthropic.com>
DeltaFile
+8-7clang/lib/Driver/ToolChains/CommonArgs.cpp
+9-0clang/test/Driver/opencl-libclc.cl
+5-1clang/include/clang/Driver/CommonArgs.h
+2-1clang/lib/Driver/ToolChains/AMDGPU.cpp
+24-94 files

LLVM/project e6a92e0offload/plugins-nextgen/common/include PluginInterface.h, offload/plugins-nextgen/common/src RecordReplay.cpp

[offload] Fix teams/threads limits in record replay (#200639)

The recording phase now sets the teams and threads limits provided by
the user (in the corresponding OpenMP clauses) or zero if not specified.
Additionally, PR #199483 already enforces that the replay's configuration
of threads and teams is respected.

This commit also changes the way we test record and replay when multiple
kernels are recorded in the same test. We use the record report to
associate each JSON record descriptor file with its target region in the code.
We no longer rely on the modification time of the files to determine the
order, which was problematic.
DeltaFile
+31-6offload/test/tools/omp-kernel-replay/record-replay-diff-teams-threads.cpp
+22-4offload/tools/kernelreplay/llvm-omp-kernel-replay.cpp
+12-6offload/plugins-nextgen/common/src/RecordReplay.cpp
+8-5offload/test/tools/omp-kernel-replay/record-replay-diff-threads.cpp
+3-0offload/plugins-nextgen/common/include/PluginInterface.h
+76-215 files

LLVM/project a665432clang/include/clang/Driver Driver.h, clang/lib/Driver Driver.cpp

Fix using unsanitized target id in filename
DeltaFile
+6-6clang/lib/Driver/Driver.cpp
+2-1clang/include/clang/Driver/Driver.h
+8-72 files