[MLIR][XeGPU] Avoid chained-reductions in multi_reduction unrolling (#198307)
The PR adds a new unrolling pattern for `vector.multi_reduction` to the
`xegpu-blocking` pass. In comparison with [the upstream reduction
unrolling](https://github.com/llvm/llvm-project/blob/2da84a8307e4ef729458d990b221650a5da22639/mlir/lib/Dialect/Vector/Transforms/VectorUnroll.cpp#L372),
the new pattern performs partial row-wise reductions via elementwise
ops, instead of generating a chain of several multi-reduction ops:
```mlir
// reduction to unroll:
// tile-shape: [8x16]
vector.multi_reduction <add>, %vec, %cst [1] : vector<8x48xf32> to vector<8xf32>
// upstream unrolling:
%3 = vector.multi_reduction <add>, %tile_0, %cst [1] : vector<8x16xf32> to vector<8xf32>
%4 = vector.multi_reduction <add>, %tile_1, %3 [1] : vector<8x16xf32> to vector<8xf32>
%5 = vector.multi_reduction <add>, %tile_2, %4 [1] : vector<8x16xf32> to vector<8xf32>
// new xegpu-unrolling
%3 = arith.addf %tile_0, %tile_1 : vector<8x16xf32>
[6 lines not shown]
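The difference between the two unrolling strategies can be modelled outside MLIR. The following is a minimal Python sketch, not the pass implementation: vectors are plain lists of rows, the helper names (`split_tiles`, `row_reduce`, `chained`, `elementwise_then_reduce`) are hypothetical, and the single final row-wise reduction is an assumption, since the rest of the example is elided above.

```python
# Hypothetical model: a vector<8x48xf32> is a list of 8 rows of 48 floats,
# and the tile shape is 8x16, so each row splits into 3 column tiles.
ROWS, COLS, TILE = 8, 48, 16

def split_tiles(vec, tile_cols):
    """Split every row into column tiles of width tile_cols."""
    count = len(vec[0]) // tile_cols
    return [[row[t * tile_cols:(t + 1) * tile_cols] for row in vec]
            for t in range(count)]

def row_reduce(tile, acc):
    """Model of a row-wise <add> reduction with an accumulator operand."""
    return [a + sum(row) for row, a in zip(tile, acc)]

def chained(vec, acc):
    """Upstream style: each tile's reduction feeds the next one."""
    for tile in split_tiles(vec, TILE):
        acc = row_reduce(tile, acc)
    return acc

def elementwise_then_reduce(vec, acc):
    """Sketch of the new style: add tiles elementwise, then reduce once
    (the single final reduction is an assumption of this sketch)."""
    tiles = split_tiles(vec, TILE)
    partial = tiles[0]
    for tile in tiles[1:]:
        partial = [[x + y for x, y in zip(r1, r2)]
                   for r1, r2 in zip(partial, tile)]
    return row_reduce(partial, acc)

# Integer-valued floats keep every sum exact, so both orders agree bit for bit.
vec = [[float(r + c) for c in range(COLS)] for r in range(ROWS)]
acc = [0.0] * ROWS
```

With general floating-point inputs the two orders of addition may round differently, so the results are only guaranteed to match exactly for inputs like the ones above.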
[VPlan] Fix convoluted logic in simpl ext-last-lane (#196355)
Checking the users is unnecessary; if it is single-scalar, it means the
same value is splatted across all lanes. Also, the transformation does
not depend on the Plan being unrolled.
[libc] Refactor qsort code
This patch makes the following changes:
- Refactor the internal sorting functions to reduce code duplication.
- Move the test machinery written for `qsort_r` to a shared place.
These changes are done in anticipation of the introduction of Annex K's
`qsort_s`. This function shares most of its semantics with `qsort_r`,
therefore most of the testing logic can be shared between the two.
Besides, `qsort`, `qsort_r` and `qsort_s` are all very similar, hence we
can attempt to reduce duplication a bit more.
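One way to picture the kind of sharing described here is a single internal sort that always takes a context-aware comparator, with thin wrappers adapting each public entry point. This is a rough Python sketch of that idea only; the names `_sort_impl`, `qsort_like` and `qsort_r_like` are hypothetical and do not reflect the actual libc sources, and unlike the C functions this model returns a new list instead of sorting in place.

```python
from functools import cmp_to_key

def _sort_impl(items, compare3, context):
    """Shared implementation: compare3(a, b, context) returns <0, 0 or >0."""
    return sorted(items, key=cmp_to_key(lambda a, b: compare3(a, b, context)))

def qsort_like(items, compare):
    """Two-argument comparator: adapt it by ignoring the context."""
    return _sort_impl(items, lambda a, b, _ctx: compare(a, b), None)

def qsort_r_like(items, compare, context):
    """Context-taking comparator: forwarded to the shared code unchanged."""
    return _sort_impl(items, compare, context)

# Usage: the context can steer the ordering without global state.
ascending = qsort_like([3, 1, 2], lambda a, b: a - b)
descending = qsort_r_like([3, 1, 2], lambda a, b, sign: sign * (a - b), -1)
```

Because both wrappers funnel into the same function, a test harness written against the context-taking form can exercise every variant.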
[CIR] Lower calling through a variable (#198672)
We managed to miss a condition when lowering emitCallee, where a
DeclRefExpr referenced a function object. This patch adds that
condition, which will result in these being lowered properly as an
indirect call.
[flang][OpenMP] Limit scope creation to constructs with data environment
Identify specific constructs that require data environments, and only
create scopes for them. This avoids scopes for loop-transformation
constructs, for example.
This isn't a correctness fix, but a clarification and a simplification
of the name-resolution code for OpenMP.
Reland "[AMDGPU] Account for inline asm size in inst_pref_size calculation" (#197227)
This relands commit 7ddee0b619f658cef905a69427ef9531fd1d229d (PR
#192306) which was reverted in 70a70e0ed664 (#197070) due to a missing
MC assembler parser case for the `instprefsize` MCExpr, breaking text
assembly roundtrip tests.
Fix:
- Add `"instprefsize"` to the `StringSwitch` in
`AMDGPUAsmParser::parsePrimaryExpr` so the MC assembler can parse
`instprefsize(...)` expressions emitted by `llc` in text assembly mode.
- Add roundtrip lit tests (`llc -filetype=asm | llvm-mc -filetype=obj |
llvm-objdump`) for both GFX11 and GFX12 to prevent regressions.
Confirmed that the new lit test fails with the original commit and
passes with this change.
_Original PR description_
[15 lines not shown]
[Flang][OpenMP] Add combined construct information
This patch adds the `omp.combined` attribute to OpenMP dialect
operations following changes to the `ComposableOpInterface`.
This attribute is added to operations representing non-innermost leaf
constructs of a combined construct and to standalone block-associated
constructs that can be combined with their parent construct.
Changes are made to the OpenMP lowering logic, as well as the
do-concurrent, workshare and workdistribute transformation passes.
[analyzer] Fix false positive in CStringChecker for offset buffer arguments (#198346)
CStringChecker::checkInit() was checking the wrong array elements when
the buffer argument pointed into the middle of an array (e.g.,
memcpy(dst, &arr[i], size)). It was called with BufEnd instead of
BufStart, making the ElementRegion index off by (size-1), and the
element lookups were relative to array index 0 instead of the actual
buffer start offset.
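The effect of looking up elements relative to index 0 can be shown with a small model. This is a simplified Python sketch, not the checker's logic: `init` is a hypothetical per-element "initialized" flag array, and `uninit_in_range` is an illustrative helper that scans a whole range, which may differ from which elements the real checker inspects.

```python
def uninit_in_range(init, start, size):
    """Return True if any element in [start, start + size) is uninitialized."""
    return any(not init[i] for i in range(start, start + size))

# arr[0] and arr[1] are uninitialized; arr[2..5] are initialized.
init = [False, False, True, True, True, True]

# Models memcpy(dst, &arr[2], 3 elements): the read covers arr[2..4] only.
buf_start, size = 2, 3

# Lookups relative to the actual buffer start: no uninitialized read.
correct = uninit_in_range(init, buf_start, size)

# Lookups relative to array index 0: inspects arr[0..2] and reports a
# problem even though the copied range is fully initialized.
buggy = uninit_in_range(init, 0, size)
```

Here `correct` is `False` while `buggy` is `True`, which is the shape of the false positive the commit describes.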
[clang] implement CWG2064: ignore value dependence for decltype
The 'decltype' of a value-dependent (but non-type-dependent) expression should be known,
so this patch makes such types non-opaque instead.
This patch also implements what's necessary to allow overloading
on pure differences in instantiation dependence, making `std::void_t`
usable for SFINAE purposes.
This also re-adds a few test cases from da98651, which was a previous attempt
at resolving CWG2064.
Fixes #8740
Fixes #61818
Fixes #190388
[Offload] fix OffloadAPI unittests discovery (#198750)
Commit 3383f0d repointed LIBOMPTARGET_LIBRARY_DIR to a different
runtimes lib dir, but the unit lit config still derived the unittest
binary path from it. Pass the unittest directory explicitly instead.
[X86] Update PSADBW tests to more closely match middle-end vector.reduce.add codegen (#198760)
The middle-end will detect vector.reduce.add patterns, so update the
CodeGen tests to use the intrinsics directly and add PhaseOrdering tests
to ensure vector.reduce.add intrinsics are created.
[Flang][tests] Add a missing REQUIRES. (#198753)
A newly added test uses `x86_64-unknown-linux-gnu` as a triple without
a `REQUIRES: x86-registered-target` line, so it fails in builds of LLVM
that do not include the X86 target.
[AArch64][TTI][EarlyCSE] Add support for ld1xN and st1xN intrinsics
Handle ld1x2, ld1x3, ld1x4, st1x2, st1x3, st1x4 in:
- AArch64TTIImpl::getTgtMemIntrinsic
- AArch64TTIImpl::getOrCreateResultFromMemIntrinsic
This enables EarlyCSE to optimize these NEON load/store intrinsics.
To test the changes, a new testcase (intrinsics-1xN.ll) derived from
llvm/test/Transforms/EarlyCSE/AArch64/intrinsics.ll is added.