[AMDGPU] Adding FoldMemRefOpsIntoTransposeLoadOp pattern (#183330)
Before this fix, a trivial expand_shape was not folded into the index
computation of transpose_load. That later forced the expand_shape to
materialize as an extract_strided_metadata and a reinterpret_cast
unnecessarily. The example below shows source IR that cannot be folded
today.
```mlir
%expanded = memref.expand_shape %buf [[0, 1], [2, 3]]
: memref<32x128xf16, strided<[128, 1], offset: ?>, #gpu.address_space<workgroup>>
into memref<1x32x8x16xf16, strided<..., offset: ?>, #gpu.address_space<workgroup>>
amdgpu.transpose_load %expanded[%i, %j, %k, %l]
: memref<1x32x8x16xf16, ...> -> vector<4xf16>
```
With this pattern, which mirrors the more generic
`FoldMemRefAliasOpsPass`, the expand_shape now folds into the
transpose_load op like other loads and stores.
[4 lines not shown]
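To make "folding expand_shape as index computation" concrete, here is a minimal Python sketch of the arithmetic involved. It is not the pattern's implementation (the real rewrite emits IR, not Python), and the helper name `fold_expanded_indices` is hypothetical. Each reassociation group is linearized row-major to recover the index into the source memref.

```python
def fold_expanded_indices(indices, expanded_shape, reassociation):
    """Map indices into the expanded memref back to indices into the source.

    Hypothetical helper: each reassociation group of expanded dimensions is
    linearized row-major into a single source index.
    """
    folded = []
    for group in reassociation:
        linear = 0
        for dim in group:
            linear = linear * expanded_shape[dim] + indices[dim]
        folded.append(linear)
    return folded


# Shapes from the example above: 32x128 expanded to 1x32x8x16 via [[0, 1], [2, 3]].
expanded_shape = [1, 32, 8, 16]
reassociation = [[0, 1], [2, 3]]
print(fold_expanded_indices([0, 5, 3, 7], expanded_shape, reassociation))  # [5, 55]
```

With these shapes, `%expanded[%i, %j, %k, %l]` corresponds to `%buf[%i * 32 + %j, %k * 16 + %l]`, so the access can target the original buffer directly.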
[CodeGen] Expand power-of-2 div/rem at IR level in ExpandIRInsts. (#180654)
Previously, power-of-2 div/rem operations wider than
MaxLegalDivRemBitWidth were excluded from IR expansion and left for
backend peephole optimizations. Some backends can fail to process such
instructions when DAGCombiner is switched off.
Now ExpandIRInsts expands them into shift/mask sequences:
- udiv X, 2^C -> lshr X, C
- urem X, 2^C -> and X, (2^C - 1)
- sdiv X, 2^C -> bias adjustment + ashr X, C
- srem X, 2^C -> X - (((X + Bias) >> C) << C)
Special cases handled:
- Division/remainder by 1 or -1 (identity, negation, or zero)
- Exact division (sdiv exact skips bias, produces ashr exact)
- Negative power-of-2 divisors (result is negated)
- INT_MIN divisor (correct via countr_zero on bit pattern)
[2 lines not shown]
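The four rewrites listed above can be checked with a small Python model. This is a sketch of the arithmetic only, not the code in ExpandIRInsts. It assumes the usual bias of 2^C - 1 for negative dividends (the commit message does not spell the bias out), and it relies on Python's `>>` being an arithmetic shift on negative integers. All function names are hypothetical.

```python
def udiv_pow2(x, c):
    return x >> c                      # udiv X, 2^C -> lshr X, C


def urem_pow2(x, c):
    return x & ((1 << c) - 1)          # urem X, 2^C -> and X, (2^C - 1)


def sdiv_pow2(x, c):
    # Assumed bias: 2^C - 1 for negative X, 0 otherwise, so that the
    # arithmetic shift rounds toward zero instead of toward -infinity.
    bias = (1 << c) - 1 if x < 0 else 0
    return (x + bias) >> c             # bias adjustment + ashr


def srem_pow2(x, c):
    bias = (1 << c) - 1 if x < 0 else 0
    return x - (((x + bias) >> c) << c)


def trunc_div(x, d):
    """Reference: division rounding toward zero, as sdiv does."""
    q = abs(x) // abs(d)
    return q if (x < 0) == (d < 0) else -q


print(sdiv_pow2(-7, 2), srem_pow2(-7, 2))  # -1 -3
```

Without the bias, `-7 >> 2` would give -2 (rounding toward negative infinity), which is why the signed forms need the adjustment and the unsigned forms do not.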
[AArch64] Decompose FADD reductions with known zero elements
FADDV is matched into FADDPv4f32 + FADDPv2f32p, but this can be relaxed
when one or more elements (usually the 4th) are known to be zero.
Before:
movi d1, #0000000000000000
mov v0.s[3], v1.s[0]
faddp v0.4s, v0.4s, v0.4s
faddp s0, v0.2s
After:
mov s1, v0.s[2]
faddp s0, v0.2s
fadd s0, s0, s1
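As a rough illustration of why the two sequences agree, the Python sketch below models the lane arithmetic of both. It is an assumption-laden model: the helper names are hypothetical, Python floats are doubles rather than f32, and the `==` comparison does not distinguish +0.0 from -0.0, so signed-zero behavior is not examined.

```python
def faddp_4s(a, b):
    """Pairwise add of two 4-lane vectors (models `faddp vD.4s, vA.4s, vB.4s`)."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]


def faddp_2s(v):
    """Add the two low lanes into a scalar (models `faddp s0, v0.2s`)."""
    return v[0] + v[1]


def reduce_before(v):
    v = v[:3] + [0.0]                  # lane 3 is overwritten with zero
    return faddp_2s(faddp_4s(v, v))    # (v0 + v1) + (v2 + 0.0)


def reduce_after(v):
    return faddp_2s(v) + v[2]          # (v0 + v1) + v2


lanes = [1.5, 2.25, -4.0, 9.0]
print(reduce_before(lanes), reduce_after(lanes))  # -0.25 -0.25
```

Both forms add lanes 0 and 1 first and then add lane 2, so the association order of the floating-point additions is unchanged.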
[CIR] Update cir::ResumeOp to require an EH token (#183192)
This updates the cir::ResumeOp operation to require an EH token operand.
We already had the token available at both locations where the operation
was being created. Adding this operand makes finding the token more
robust during CFG flattening.
This change was entirely AI generated, but I have reviewed it closely.
[flang][openmp] Add support for ordered regions in SIMD directives (#… (#183379)
Add support for ordered regions within SIMD directives (!$omp simd
ordered and !$omp do simd ordered). This initial implementation matches
Clang's behavior.
In SIMD directives, loop induction variables have an implicit linear
clause with deferred store semantics (storing to .linear_result). To
properly support ordered regions, the LinearClauseProcessor rewrites
variable references to use .linear_result in:
- omp.ordered.region: Code inside ordered blocks
- omp_region.finalize: Code after ordered blocks
Note: The vectorizer cannot currently vectorize loops with ordered
regions. A future enhancement would require generating lane loops or
unrolling ordered regions across SIMD lanes while maintaining ordering
semantics.
This PR is a reland of https://github.com/llvm/llvm-project/pull/181012
and fixes the regression caused by the IR syntax change for the linear clause.
[CodeGen] Add tests for ShadowStackGCLowering IR pass (#183167)
Add llvm/test/CodeGen/Generic/shadow-stack-gc-lowering.ll testing the
opt-level behavior of the shadow-stack-gc-lowering module pass,
covering:
- Single root: frame push/pop at entry and return
- Two roots: multi-slot frame, NumRoots=2/NumMeta=0 in the frame map
- Root with non-null metadata: NumMeta=1, metadata array in gc_map
- Mixed metadata: CollectRoots ordering (metadata roots sorted first)
- No roots: pass must leave the function unchanged
- Invoke: EscapeEnumerator inserts pop on both normal and unwind exits
Added as requested in https://github.com/llvm/llvm-project/pull/178436,
since the only existing tests (in llvm/test/CodeGen/X86/GC) seem to
check only that llc doesn't crash.
Co-authored-by: Claude Sonnet 4.6 <noreply at anthropic.com>
[AMDGPU] Implement -amdgpu-spill-cfi-saved-regs
These spills need special CFI anyway, so implementing them directly
where CFI is emitted avoids the need to invent a mechanism to track them
from ISel.
Change-Id: If4f34abb3a8e0e46b859a7c74ade21eff58c4047
Co-authored-by: Scott Linder scott.linder at amd.com
Co-authored-by: Venkata Ramanaiah Nalamothu VenkataRamanaiah.Nalamothu at amd.com
[AMDGPU] Make slow VOPD assert under EXPENSIVE_CHECKS (#183166)
The assert is algorithmically slow. To preserve the usability of release_assert
builds, move it under EXPENSIVE_CHECKS. On some workloads it increased compile
time by 40-50x.
[libc][math] Refactor fdim family to header-only (#182190)
Refactors the fdim math family to be header-only.
Closes https://github.com/llvm/llvm-project/issues/182188
Target Functions:
- fdim
- fdimbf16
- fdimf
- fdimf128
- fdimf16
- fdiml
[win] Control Flow Guard: Add support for the MSVC /d2guardnochecks command (#182967)
This adds support for MSVC's undocumented `/d2guardnochecks` flag. This
flag is similar to `-guard:cf,nochecks` and `-cfguard-no-checks` in that
it instructs the compiler to emit the metadata for Control Flow Guard
WITHOUT emitting checks (aka "table only" mode), but it differs from
those existing flags because it only takes effect if another flag is
used to enable Control Flow Guard (i.e., `/d2guardnochecks` by itself
does nothing, while `/d2guardnochecks /guard:cf` enables table-only mode
for Control Flow Guard).
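The flag interaction described above can be summarized as a small decision function. This Python sketch is only a model of the described behavior, not the driver code. The function name and the returned mode strings are hypothetical, the flags are spelled with a leading slash for uniformity, and it assumes that `/guard:cf` on its own enables Control Flow Guard with checks.

```python
# Flags that enable Control Flow Guard in this model (slash spelling assumed).
ENABLING_FLAGS = {"/guard:cf", "/guard:cf,nochecks"}
# Flags that request table-only mode once Control Flow Guard is enabled.
NO_CHECK_FLAGS = {"/guard:cf,nochecks", "/d2guardnochecks"}


def cf_guard_mode(flags):
    """Return 'off', 'table-only', or 'checks' for a list of flag strings."""
    enabled = any(f in ENABLING_FLAGS for f in flags)
    if not enabled:
        return "off"  # /d2guardnochecks by itself does nothing
    no_checks = any(f in NO_CHECK_FLAGS for f in flags)
    return "table-only" if no_checks else "checks"


print(cf_guard_mode(["/d2guardnochecks"]))               # off
print(cf_guard_mode(["/d2guardnochecks", "/guard:cf"]))  # table-only
```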
[clang][modules-driver] Generate jobs from Standard library module manifest entries (#182182)
This patch is part of a series to support driver-managed module builds.
To support imports of the Standard library modules (std and
std.compat), the driver must generate frontend jobs for each module
before performing the dependency scan.
The source paths for these modules, along with additional information
required to precompile them, are provided by the Standard library
modules manifest.
This change implements the parsing and handling of the manifest in the
modules-driver.
This change is part of an effort to split #152770 into smaller, more
manageable pieces.
RFC for driver-managed module builds:
https://discourse.llvm.org/t/rfc-modules-support-simple-c-20-modules-use-from-the-clang-driver-without-a-build-system
[NFC][VPlan] Add initial tests for future VPlan-based stride MV
I tried to include both the features that the current
LoopAccessAnalysis-based transformation supports (e.g., trunc/sext of
stride) and cases where the current implementation behaves poorly,
e.g., https://godbolt.org/z/h31c3zKxK, as well as some other potentially
interesting scenarios I could imagine.
There are two test files with the same content. One covers the VPlan dump change
of the future transformation alone (I'll update `-vplan-print-after` in the next
PR); the other covers the full vectorizer pipeline. The latter has two `RUN:`
lines:
* No multiversioning, so the next PR diff can show the transformation itself
* Stride multiversioning performed in LAA, so that we can compare the future
VPlan-based transformation with the old behavior.
Revert "[lldb] Batch breakpoint step-over for threads stopped at the … (#183378)
…same site (re-land) (#182944)"
This reverts commit 94d9f1b3cbb02700d9cd3339c1dbf44c0d13b550.
[mlir][xegpu] Add vector layout conflict handling in XeGPU layout propagation pass. (#182402)
This PR adds support for layout conflict handling for vector operands. A
conflict for a vector operand occurs when a value consumed at a given
operand is not in the layout the consumer expects (for example, the
source of a `vector.multi_reduction` op requires a specific layout
inferred from its current result layout). To resolve this conflict, we
insert an `xegpu.convert_layout` right after the producer (essentially
duplicating the producer with the expected layout) and use the new value
in the consumer.
[mlir][llvmir][OpenMP] Translate affinity clause in task construct to llvmir
Translate affinity entries to LLVMIR by passing affinity information to
createTask (__kmpc_omp_reg_task_with_affinity is created inside PostOutlineCB).