[flang][cuda] Fix const mismatch in CUFRegisterManagedVariable for __cudaRegisterManagedVar (#188142)
Change varName parameter from `const char *` to `char *` in
CUFRegisterManagedVariable to match the CUDA runtime API signature of
__cudaRegisterManagedVar, which declares deviceAddress as `char *`.
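A minimal sketch of why the parameter type matters, assuming a callee that takes a non-const `char *`. `fakeRegisterVar` and `registerVar` are hypothetical stand-ins for illustration only; they are not the CUDA runtime entry point or the flang runtime function.

```cpp
#include <cstddef>

// Hypothetical stand-in for a runtime entry point whose parameter is a
// non-const `char *`. It only measures the string so the call is observable.
inline std::size_t fakeRegisterVar(char *deviceAddress) {
  std::size_t n = 0;
  while (deviceAddress[n] != '\0')
    ++n;
  return n;
}

// Hypothetical wrapper. Because it takes `char *`, it can forward the pointer
// without a cast. Had it taken `const char *`, the call below would be
// ill-formed in C++ (no implicit conversion drops const).
inline std::size_t registerVar(char *varName) {
  return fakeRegisterVar(varName);
}
```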
[AArch64][clang][llvm] Add support for Armv9.7-A lookup table intrinsics
Add support for the following Armv9.7-A Lookup Table (lut)
instruction intrinsics:
SVE2.3
```c
// Variants are also available for _u8 and _mf8.
svint8_t svluti6[_s8](svint8x2_t table, svuint8_t indices);
```
SVE2.3 and SME2.3
```c
// Variants are also available for _u16_x2 and _f16_x2.
svint16_t svluti6_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);
```
SME2.3
```c
[9 lines not shown]
[LAA] Allow vectorizing `A[NonZeroNonConstantStride*I] += 1`
This patch only does that when we can statically prove that the
non-constant stride is non-zero and the resulting index doesn't
overflow. That can later be extended to introduce a run-time check when
this is not provable at compile time.
My main motivation for this is to move unit-strideness speculation to a
VPlan-based transformation. However, that cannot be done right now because
such speculation sometimes affects legality, and we simply avoid
vectorizing the loop if it's not done. As such, we need to extend LAA to
properly support dependence analysis/RT checks for strided accesses
without speculating that the stride is one. This PR is expected to be the
first one on that journey.
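For illustration, the kind of loop named in the title might look like the sketch below. The function name `bumpStrided` and the use of `std::vector` are assumptions made for this example, not code from the patch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example of the access pattern A[Stride * I] += 1.
// Stride is only known at run time. The caller is assumed to guarantee
// Stride != 0 and Stride * (N - 1) < A.size(), so no index overflows.
inline void bumpStrided(std::vector<int> &A, std::size_t Stride,
                        std::size_t N) {
  for (std::size_t I = 0; I < N; ++I)
    A[Stride * I] += 1;
}
```

If the stride were zero, every iteration would read and write `A[0]`, which is a loop-carried dependence. With a non-zero stride and no index overflow, each iteration touches a distinct element.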
[AMDGPU][DAGCombiner][GlobalISel] Extend allMulUsesCanBeContracted with FMA/FMAD pattern
Add conservative FMA/FMAD recognition to allMulUsesCanBeContracted:
a multiply used by an existing FMA/FMAD is assumed to be contractable
(it's already being contracted elsewhere). This avoids unnecessary
contraction blocking for multiplies that feed into FMA chains.
Also adds FMA/FMAD to the FPEXT user set (fpext(fmul) --> fma is
recognized as contractable when isFPExtFoldable).
Guards all remaining FMA-chain reassociation fold sites in both
SDAG (visitFADDForFMACombine/visitFSUBForFMACombine, 8 sites) and
GISel (matchCombineFAddFpExtFMulToFMadOrFMAAggressive, 4 sites).
This re-enables contractions that were conservatively blocked in
earlier patches where the multiply had an FMA use that wasn't yet
recognized: dagcombine-fma-crash.ll and dagcombine-fma-fmad.ll
CHECK lines revert to upstream behavior.
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
[AMDGPU][DAGCombiner][GlobalISel] Extend allMulUsesCanBeContracted with FPEXT pattern
Extend the allMulUsesCanBeContracted analysis to recognize FPEXT patterns
where the multiply result flows through fpext before being used in
contractable operations (fadd, fsub). This covers:
- fmul --> fpext --> {fadd, fsub}: FPEXT folds if isFPExtFoldable
- fmul --> fpext --> fneg --> fsub: FPEXT then FNEG to FSUB
- fmul --> fneg --> fpext --> fsub: FNEG then FPEXT folds if foldable
Also adds allMulUsesCanBeContracted guards to all FPEXT fold sites in
both SDAG (visitFADDForFMACombine, visitFSUBForFMACombine) and GISel
(matchCombineFAddFpExtFMulToFMadOrFMA, matchCombineFSubFpExtFMulToFMadOrFMA,
matchCombineFSubFpExtFNegFMulToFMadOrFMA).
Fixes a missing isFPExtFoldable check in GISel's
matchCombineFSubFpExtFMulToFMadOrFMA which could fold without verifying
the extension is actually foldable.
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
[AMDGPU][DAGCombiner][GlobalISel] Extend allMulUsesCanBeContracted with FNEG pattern
Extend allMulUsesCanBeContracted() to recognize fmul -> fneg -> fsub
chains as contractable uses. This allows FMA contraction when a multiply
feeds an fneg that is only used by fsub operations.
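As a rough illustration of the arithmetic behind the fsub(fneg(fmul)) fold, the sketch below compares an unfused expression with a single fused multiply-add. The helper names are hypothetical, and the exact node form the combiners emit is not shown in this message; only the algebraic identity -(A*B) - C == fma(-A, B, -C) is assumed.

```cpp
#include <cmath>

// Unfused form: fsub(fneg(fmul(A, B)), C).
inline double unfused(double A, double B, double C) {
  return -(A * B) - C;
}

// Contracted form: a single fused multiply-add computing (-A) * B + (-C).
inline double contracted(double A, double B, double C) {
  return std::fma(-A, B, -C);
}
```

For inputs whose product is exactly representable the two forms agree. In general the fused form rounds once instead of twice, which is why contraction is gated on contract-style permissions.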
Changes:
- DAGCombiner.cpp: Add ISD::FNEG case to allMulUsesCanBeContracted()
checking that all FNEG users are ISD::FSUB. Update 1 fold site guard
in visitFSUBForFMACombine (fsub(fneg(fmul))).
- CombinerHelper.cpp: Add G_FNEG case to allMulUsesCanBeContracted()
checking that all FNEG users are G_FSUB. Update 2 fold site guards
in matchCombineFSubFNegFMulToFMadOrFMA. Fix guard ordering to check
isContractableFMul before allMulUsesCanBeContracted (cheap first).
- Add 7 new test functions to fma-multiple-uses-contraction.ll covering
fneg single-use, multi-use, mixed contractable/non-contractable, and
cross-pattern (P1 direct + P2 fneg) interactions.
- Update mad-combine.ll CHECK lines affected by the guard changes.
[4 lines not shown]
[DAGCombiner][GlobalISel] Prevent FMA contraction when fmul cannot be eliminated (FADD/FSUB pattern)
fmul nodes with multiple uses can currently be contracted into FMA
operations even when the fmul itself cannot be eliminated, resulting in
a redundant multiply (wasted power and compute). The existing guard
`Aggressive || N0->hasOneUse()` allows contraction under Aggressive mode
regardless of whether the multiply can be removed.
This patch tightens the guard to:
`N0->hasOneUse() || (Aggressive && allMulUsesCanBeContracted(N0))`
`allMulUsesCanBeContracted()` iterates all users of the multiply and
returns true only if every use is itself contractable into an FMA.
For this first patch, only direct FADD and FSUB uses are recognized as
contractable (FNEG, FPEXT, and FMA/FMAD patterns follow in subsequent
patches).
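The guard change can be sketched with a toy node graph. Everything here (`Op`, `Node`, and the `Toy`/`Old` function names) is a simplified model invented for illustration; it is not the DAGCombiner or CombinerHelper API.

```cpp
#include <vector>

// Toy opcode set; the real code works on ISD / G_* opcodes.
enum class Op { FMul, FAdd, FSub, Store };

// Toy node: an opcode plus the list of nodes that use its result.
struct Node {
  Op Opcode;
  std::vector<const Node *> Users;
};

inline bool hasOneUse(const Node &N) { return N.Users.size() == 1; }

// Toy version of the first-patch rule: every user must be a direct
// FAdd or FSub for the multiply to be fully contractable.
inline bool allMulUsesCanBeContractedToy(const Node &Mul) {
  for (const Node *U : Mul.Users)
    if (U->Opcode != Op::FAdd && U->Opcode != Op::FSub)
      return false;
  return true;
}

// Old guard: Aggressive mode contracts regardless of the other users.
inline bool mayContractOld(const Node &Mul, bool Aggressive) {
  return Aggressive || hasOneUse(Mul);
}

// Tightened guard as described in the commit message.
inline bool mayContractNew(const Node &Mul, bool Aggressive) {
  return hasOneUse(Mul) || (Aggressive && allMulUsesCanBeContractedToy(Mul));
}
```

In this model, a multiply used by an FAdd and a Store passes the old guard under Aggressive mode but fails the new one, because the Store keeps the multiply alive after contraction.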
The change is applied symmetrically to both DAGCombiner and GlobalISel:
- DAGCombiner: 4 fold sites in visitFADDForFMACombine (2 sites) and
[8 lines not shown]
Reapply "[LV] Simplify and unify resume value handling for epilogue vec." (#187504)
This reverts commit cdaf29f84dd0abbd1f961982799059c92d76625b.
This version skips removeBranchOnConst when vectorizing the epilogue, as
it may trigger folds that remove the resume phi used as resume value
from the epilogue.
This fixes https://github.com/llvm/llvm-project/issues/187323.
Original message:
This patch tries to drastically simplify resume value handling for the
scalar loop when vectorizing the epilogue.
It uses a simpler, uniform approach for updating all resume values in
the scalar loop:
1. Create ResumeForEpilogue recipes for all scalar resume phis in the
main loop (the epilogue plan will have exactly the same scalar resume
[23 lines not shown]
[MLIR] [Mem2Reg] [NFC] Update pass documentation (#188140)
In #185036, we added region control flow support but forgot to update
the pass declaration documentation. This NFC change addresses that.
[HLSL] Allow 1x1 matrices to be splatted like scalars (#188119)
Fixes #186859 by allowing 1x1 matrices to be splatted like the scalar
and vec1 cases.
Assisted-by: GitHub Copilot (powered by Claude Opus 4.6)
libclc: Use nextup and nextdown in place of nextafter
Unfortunately, it seems the optimizer isn't able to clean up the
nextafter calls, so this is a code quality improvement.
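For context, the relationship between the two families can be sketched in standard C++ as below. The helper names are hypothetical and this is not the libclc implementation; it only shows that stepping toward +infinity or -infinity with `std::nextafter` gives the neighbouring representable value, which is what nextup and nextdown are defined to return.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: next representable float above X, expressed
// through nextafter with a fixed direction of +infinity.
inline float nextUpViaNextafter(float X) {
  return std::nextafter(X, std::numeric_limits<float>::infinity());
}

// Hypothetical helper: next representable float below X, expressed
// through nextafter with a fixed direction of -infinity.
inline float nextDownViaNextafter(float X) {
  return std::nextafter(X, -std::numeric_limits<float>::infinity());
}
```

Because the direction is fixed, a dedicated nextup or nextdown does not need the comparison against a second operand that a general nextafter performs.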