[PhaseOrdering][X86] Copied codegen add/fadd reduction pattern tests to ensure middle-end is creating reduction intrinsics (#206101)
AVX512 is missing an llvm.vector.reduce.add.v16i32 call - will investigate
[llvm][GVNSink] Avoid non-deterministic iteration order over NeededPHIs (#205952)
The iteration order of DenseSet is not guaranteed, which affects the
output of code generated with GVNSink enabled. This can cause code to be
emitted in differing order, affect section ordering, and in some cases
was reported to result in larger binaries due to increased padding between
sections.
This patch addresses this by using SetVector, which has a deterministic
iteration order.
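A minimal sketch of why an insertion-ordered set gives stable output. It does
not use LLVM's SetVector; OrderedSet, iterationOrder and demoOrder are
hypothetical stand-ins built from the standard library, and int keys stand in
for the PHI pointers that the pass actually stores.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an insertion-ordered set (not llvm::SetVector).
// A hash set answers "have we seen this?", a vector remembers the order.
template <typename T> class OrderedSet {
public:
  // Returns true if the value was newly inserted.
  bool insert(const T &V) {
    if (!Seen.insert(V).second)
      return false;
    Order.push_back(V);
    return true;
  }

  // Iteration walks the vector, so it never depends on hash values.
  typename std::vector<T>::const_iterator begin() const { return Order.begin(); }
  typename std::vector<T>::const_iterator end() const { return Order.end(); }
  std::size_t size() const { return Order.size(); }

private:
  std::unordered_set<T> Seen;
  std::vector<T> Order;
};

// Inserts every input and returns the order in which the set iterates.
inline std::vector<int> iterationOrder(const std::vector<int> &Inputs) {
  OrderedSet<int> S;
  for (int V : Inputs)
    S.insert(V);
  return std::vector<int>(S.begin(), S.end());
}

// Usage sketch: duplicates are dropped and first-insertion order is kept.
inline void demoOrder() {
  const std::vector<int> Order = iterationOrder({30, 10, 30, 20, 10});
  assert((Order == std::vector<int>{30, 10, 20}));
}
```

With pointer keys, a hash-ordered container iterates according to the
addresses involved, which can differ from run to run; the vector-backed order
above depends only on the sequence of insertions.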
Revert "[Flang]Add support for inlining hlfir.assign operation where both LHS and RHS are slices of the same array" (#206103)
Reverts llvm/llvm-project#204532 due to regressions in numerous Fujitsu
tests and several important apps
[AMDGPU][NFC] Use compact enum table for PALMetadata (#206085)
Instead of storing a pointer+value pair, use the new enum tables to store
the same information more compactly and without dynamic relocations.
[SLP] Fix reused-scalar reduction counters for copyable root nodes
The horizontal reduction reuse-counter scale is built in
getRootNodeScalars() order and applied positionally to the emitted
reduction vector. For a root node with copyable elements the scalar
order is reordered while the emitted lanes still follow the reduced
values (candidates) order, so the repeat count was applied to the wrong
lane, producing a wrong reduction result.
Fixes #205614
Reviewers:
Pull Request: https://github.com/llvm/llvm-project/pull/206102
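The SLP mismatch described above can be illustrated with a small sketch. This
is not the patch's code: Lane, Repeat, reducePositional, reduceKeyed and
demoReuseCounts are hypothetical names, and looking counts up by scalar
identity is only one assumed way to keep counts and lanes aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model: each emitted lane holds one scalar, and each scalar
// has a repeat count saying how many times it occurs in the reduction.
struct Lane { char Id; long Value; };
struct Repeat { char Id; long Count; };

// Applies counts by position. Only correct when Repeats is in lane order.
inline long reducePositional(const std::vector<Lane> &Lanes,
                             const std::vector<Repeat> &Repeats) {
  assert(Lanes.size() == Repeats.size());
  long Sum = 0;
  for (std::size_t I = 0; I < Lanes.size(); ++I)
    Sum += Lanes[I].Value * Repeats[I].Count;
  return Sum;
}

// Applies counts by scalar identity, so the two orders may differ.
inline long reduceKeyed(const std::vector<Lane> &Lanes,
                        const std::vector<Repeat> &Repeats) {
  std::map<char, long> CountFor;
  for (const Repeat &R : Repeats)
    CountFor[R.Id] = R.Count;
  long Sum = 0;
  for (const Lane &L : Lanes)
    Sum += L.Value * CountFor.at(L.Id);
  return Sum;
}

// Usage sketch: 'a' occurs once and 'b' three times, so a + b + b + b = 31.
inline void demoReuseCounts() {
  const std::vector<Lane> Lanes{{'a', 1}, {'b', 10}};     // emitted lane order
  const std::vector<Repeat> Repeats{{'b', 3}, {'a', 1}};  // reordered scalars
  assert(reducePositional(Lanes, Repeats) == 13);         // wrong lane scaled
  assert(reduceKeyed(Lanes, Repeats) == 31);              // intended result
}
```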
[DTLTO] Do not serialize inputs that hit in the ThinLTO cache (#204104)
To handle bitcode inputs that are not in individual files on disk, such
as members of non-thin archives, DTLTO serializes those inputs to
temporary individual bitcode files.
This patch changes LLVM to serialize only uncached input modules and any
modules they import from.
For a link of Clang 22 (debug build with sanitizers and
instrumentation), I performed measurements with and without this patch
for an optimized toolchain (PGO non-LTO, based on recent main commit
c264e07c2f3d9f25a2526e69926daea3a68be74b). The measurements were run on:
- Windows 11 Pro build 26200, AMD Family 25 at approximately 4.5 GHz,
16 cores / 32 threads, and 64 GB RAM.
- Ubuntu 24.04.3 LTS, Ryzen 9 5950X with 32 threads, and 62 GiB RAM.
There was no difference in serialization time when the cache was
disabled.
[4 lines not shown]
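The selection rule in the DTLTO change (serialize only uncached modules and
the modules they import from) can be sketched as a set computation. ModuleInfo,
modulesToSerialize and demoSelection are hypothetical names, not DTLTO's API,
and the sketch assumes each module's import list is already complete, so no
transitive walk is performed.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical per-module summary: whether the ThinLTO cache already has the
// result, and which other modules this module imports from.
struct ModuleInfo {
  bool CacheHit;
  std::vector<std::string> Imports;
};

// Returns the modules that still need a temporary bitcode file: every
// uncached module, plus every module an uncached module imports from.
inline std::set<std::string>
modulesToSerialize(const std::map<std::string, ModuleInfo> &Modules) {
  std::set<std::string> Result;
  for (const auto &Entry : Modules) {
    const ModuleInfo &Info = Entry.second;
    if (Info.CacheHit)
      continue; // cached: its own bitcode is not needed for a backend job
    Result.insert(Entry.first);
    Result.insert(Info.Imports.begin(), Info.Imports.end());
  }
  return Result;
}

// Usage sketch: only b.bc misses the cache, and it imports from c.bc.
inline void demoSelection() {
  std::map<std::string, ModuleInfo> Modules;
  Modules["a.bc"] = ModuleInfo{true, {"c.bc"}};
  Modules["b.bc"] = ModuleInfo{false, {"c.bc"}};
  Modules["c.bc"] = ModuleInfo{true, {}};
  Modules["d.bc"] = ModuleInfo{true, {}};
  assert((modulesToSerialize(Modules) ==
          std::set<std::string>{"b.bc", "c.bc"}));
}
```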
[AArch64] Increase the max interleave factor to 4 for loops with reductions (#205612)
The default max interleave factor is 2. Increasing it to 4 universally
can spend code size on something that does not always
increase performance (especially if the loop gets over-unrolled). Small
reduction loops often benefit from extra interleaving due to the
multiple independent streams that can execute in parallel. This patch
increases the max interleave factor to 4 for such loops, limited to
where the VF is <= 4 to limit the impact for already highly vectorized
loops.
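As a scalar analogy for the independent streams mentioned above, the sketch
below sums an array with four separate accumulators. It is hand-written for
illustration and is not the vectorizer's output, where each accumulator would
be a vector register of VF lanes; sumInterleaved4 and demoInterleave are
hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums Data with four independent accumulators. Each accumulator forms its
// own dependency chain, so the four additions per iteration can overlap.
inline std::int64_t sumInterleaved4(const std::vector<std::int32_t> &Data) {
  std::int64_t Acc0 = 0, Acc1 = 0, Acc2 = 0, Acc3 = 0;
  const std::size_t N = Data.size();
  std::size_t I = 0;
  for (; I + 4 <= N; I += 4) {
    Acc0 += Data[I];
    Acc1 += Data[I + 1];
    Acc2 += Data[I + 2];
    Acc3 += Data[I + 3];
  }
  // Combine the partial sums once, after the main loop.
  std::int64_t Sum = (Acc0 + Acc1) + (Acc2 + Acc3);
  // Handle the remaining zero to three elements.
  for (; I < N; ++I)
    Sum += Data[I];
  return Sum;
}

// Usage sketch: 1 + 2 + ... + 10 = 55, with two elements left for the tail.
inline void demoInterleave() {
  assert(sumInterleaved4({1, 2, 3, 4, 5, 6, 7, 8, 9, 10}) == 55);
}
```

The cost is the larger loop body and the extra combine step, which is the code
size trade-off the message refers to.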
[gn] port 0eefb2682bf8c (C++26 for libc++) (#206100)
Other than in 8a7846fe86f95e82c (the C++23 bump), we apparently only
bump the standard for libc++, but not for libc++abi.
[mlir][acc] Rewrite acc routine bind calls inside gpu.func (#204220)
Run `acc-bind-routine` on `FunctionOpInterface` and rewrite calls to
bound symbols in offload regions and `gpu.func`. For string bind names,
declare private functions in the enclosing `gpu.module` symbol table
when the call is inside device code.
Reapply "[Dexter] Add ability to rewrite scripts to fill-in unknown values" (#206034)
Reverts llvm/llvm-project#205657
The original commit was causing pre-merge CI to fail for AArch64, as one
of the tests expects stepping behaviour that is not seen on
AArch64 targets; the test suite containing the failing test is meant to
be configured to not run for AArch64, but the unsupported label was not
being applied, due to an error in the unsupported check. This patch
fixes the unsupported check in scripts/lit.local.cfg, which should
prevent further errors.
AMDGPU: Use -mtriple= instead of with a space for llc run lines (#206067)
-mtriple=amdgcn is by far the dominant form over space separation.
Convert these to simplify future bulk test updates.