[flang][OpenACC] Don't hoist declare directive out of interface bodies (#202806)
Example:
```fortran
program main
real :: a(10, 60)
interface
subroutine compute(a)
real :: a(10, 60)
!$acc declare present(a)
end subroutine
end interface
call compute(a)
end program
```
In this code, the `!$acc declare` inside the interface body is hoisted
into the host program unit and lowered there, where its operand (the interface
[12 lines not shown]
[SPIR-V] Lower `select` instructions with aggregate operands (#201417)
Context: `SPIRVEmitIntrinsics` represents aggregate (array/struct) SSA
values as i32 value-ids, keeping the real type on the side for SPIR-V
emission. `preprocessCompositeConstants()` rewrites composite constant
operands into those value-ids.
A `select` takes its result type from its operands, so rewriting one arm
leaves the select with an aggregate result type but an i32 operand,
which is invalid. The exact failure mode depends on the arm: a composite-constant
arm tripped the verifier ("Select values must have same type as select
instruction"), while a non-constant arm (say a load) only became a
value-id later, in the visitor pass, at which point
`replaceMemInstrUses()` found a `select` among its users and hit an
unreachable.
I pushed two commits fixing this, one limited to my use case, another
more general:
[20 lines not shown]
[Demangle] Guard DEMANGLE_ABI and add missing annotation (#202920)
This updates the DEMANGLE_ABI annotation to only be defined if it is not
already defined. This is required to parse the Demangle headers with the
ids-check script.
In addition, this adds one missing DEMANGLE_ABI annotation.
This effort is tracked in #109483.
[flang][OpenMP] Model target in_reduction through map entries
Model omp.target in_reduction so the target body uses the mapped
map_entries block argument instead of a separate in_reduction entry
block argument.
The in_reduction operands remain on the op for host-side translation.
For the host-fallback path, the matching map block argument is redirected
to the pointer returned by __kmpc_task_reduction_get_th_data, so the
target body accumulates into the task reduction-private storage.
Flang lowering now relies on the implicit address-preserving map for the
target body binding, while task and taskloop keep their existing
in_reduction block-argument behavior.
Offload/device compilation is still diagnosed as not yet implemented, and
each target in_reduction variable must have a matching map_entries entry.
[flang][OpenMP] Lower target in_reduction for host fallback
Teach Flang lowering and MLIR OpenMP translation to carry
in_reduction through omp.target for the host-fallback path.
The translation looks up task reduction-private storage with
__kmpc_task_reduction_get_th_data and binds the target region's
in_reduction block argument to that private pointer, so uses inside the
region do not keep referring to the original variable.
The patch also preserves in_reduction operands in the TargetOp builder
path and ensures target in_reduction list items are mapped into the
target region when needed.
The device/offload-entry path remains diagnosed as not yet implemented.
[LoopFusion] Drop duplicate write-write dependence check (NFC) (#203173)
`dependencesAllowFusion()` re-tested every FC0-write vs FC1-write pair
in the second loop nest, duplicating the checks already done in the
first. Iterate only the remaining FC0-read vs FC1-write pairs; the set
of checked dependences (W0xW1, W0xR1, R0xW1) is unchanged.
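The pair enumeration described above can be sketched as follows. This is a minimal model, not the pass's code: `LoopAccesses`, `CheckLog`, `checksBefore`, and `checksAfter` are hypothetical names, and memory instructions are reduced to integer ids.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical model: each fusion candidate is just its read and write ids.
struct LoopAccesses {
  std::vector<int> Reads, Writes;
};

// Records which (FC0 access, FC1 access) pairs were tested and how often.
struct CheckLog {
  std::set<std::pair<int, int>> Pairs;
  int NumChecks = 0;
  void check(int I0, int I1) {
    Pairs.insert({I0, I1});
    ++NumChecks;
  }
};

// Old shape: the second nest re-tests every W0 x W1 pair.
CheckLog checksBefore(const LoopAccesses &FC0, const LoopAccesses &FC1) {
  CheckLog Log;
  for (int W0 : FC0.Writes) {
    for (int W1 : FC1.Writes) Log.check(W0, W1); // W0 x W1
    for (int R1 : FC1.Reads) Log.check(W0, R1);  // W0 x R1
  }
  for (int W1 : FC1.Writes) {
    for (int W0 : FC0.Writes) Log.check(W0, W1); // W0 x W1 again (duplicate)
    for (int R0 : FC0.Reads) Log.check(R0, W1);  // R0 x W1
  }
  return Log;
}

// New shape: the second nest visits only the remaining R0 x W1 pairs.
CheckLog checksAfter(const LoopAccesses &FC0, const LoopAccesses &FC1) {
  CheckLog Log;
  for (int W0 : FC0.Writes) {
    for (int W1 : FC1.Writes) Log.check(W0, W1);
    for (int R1 : FC1.Reads) Log.check(W0, R1);
  }
  for (int W1 : FC1.Writes)
    for (int R0 : FC0.Reads) Log.check(R0, W1);
  return Log;
}
```

In this model both functions record the same set of pairs; only the number of checks drops, by one per FC0-write/FC1-write pair.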
[OpenACC][flang] Emit NYI when unstructured loops are associated with OpenACC directives
When an unstructured loop is associated with a loop or a combined
directive, we emit an unstructured CFG for the loop's logic nested
within the OpenACC op. This effectively serializes the nested loop on
the device, which is not desirable. For now, emit NYIs while working on
a longer-term solution.
The NYI is restricted to the cases where the loop will be lowered with
`independent` parallelism semantics for the default device_type -- i.e.,
the user has explicitly promised the loop is parallel. This covers:
- combined `acc parallel loop`,
- standalone `acc loop` inside `acc parallel`,
- orphan `acc loop` inside a non-`seq` acc routine,
- explicit `independent` clause.
For `auto` (`acc kernels loop` and `acc loop` inside `acc kernels`) and
for `seq` (`acc serial loop`, `acc loop` inside `acc serial`, explicit
`seq`, or orphan inside a `seq` routine), the user has not made a
[4 lines not shown]
[libc++] Hoist <compare> outside the threads guard in <thread> (#202535)
The standard ([thread.syn]) mandates that <thread> include <compare> as
part of its synopsis. This is a standards-mandated dependency, not a
thread-feature dependency, so it should be visible regardless of
_LIBCPP_HAS_THREADS.
This matches how we handle standard-mandated includes elsewhere, see for
example #134877.
[mlir][OpenMP] Translate reductions on taskloop
Add LLVM IR translation for reduction and in_reduction clauses on omp.taskloop.context.
For taskloop reduction, emit the implicit taskgroup reduction setup and map each generated task to runtime-provided private reduction storage through __kmpc_task_reduction_get_th_data. For in_reduction, use the same runtime lookup path with a null descriptor to join an enclosing task reduction context.
Unsupported byref, cleanup, and two-argument initializer forms remain diagnosed.
Add MLIR translation tests for the supported taskloop reduction and in_reduction cases.
[RISCV] Set CostPerUse to 1 only when optimizing for size (#201501)
We saw some regressions caused by poor register allocation, because
registers outside x8-x15 have a higher cost. This is why `DisableCostPerUse` was added
in https://github.com/llvm/llvm-project/issues/83320.
In this PR, we change it to set `CostPerUse=1` only when optimizing
for size.
Code size increases by less than 0.1% in llvm-test-suite.
Reland emitc lower multi return functions (#203026)
Reland #200659 reverted by #202911.
Fixed GCC 7 func-to-emitc build: Use the adaptor operand types
when creating the multi-return struct type instead of relying on an
implicit conversion from ValueRange to TypeRange.
Failed buildbot:
https://lab.llvm.org/buildbot/#/builders/116/builds/29302
Assisted-by: Copilot
[LoongArch] Propagate demanded bits for CRC[C].W.{B,H}.W
CRC byte and halfword instructions only use the low 8 or 16 bits of
their data operand. Propagate these demanded-bit requirements through
SimplifyDemandedBitsForTargetNode() so redundant masking operations can
be removed during DAG combining.
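As a rough illustration of why the masking becomes redundant, the sketch below models only the bit arithmetic. It does not use the SelectionDAG API; `CrcDataWidth`, `demandedDataBits`, and `maskIsRedundant` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the two data widths named in the commit title.
enum class CrcDataWidth : unsigned { Byte = 8, Half = 16 };

// Bits of the data operand that the instruction actually reads.
constexpr uint32_t demandedDataBits(CrcDataWidth W) {
  return (1u << static_cast<unsigned>(W)) - 1u;
}

// An `and X, Mask` feeding the data operand changes nothing observable
// when the mask keeps every demanded bit.
constexpr bool maskIsRedundant(uint32_t AndMask, CrcDataWidth W) {
  return (AndMask & demandedDataBits(W)) == demandedDataBits(W);
}

// Usage: a 0xFF mask is removable before a byte CRC, but not before a
// halfword CRC, which also reads bits 8..15.
void demandedBitsExamples() {
  assert(maskIsRedundant(0xFFu, CrcDataWidth::Byte));
  assert(maskIsRedundant(0xFFFFu, CrcDataWidth::Half));
  assert(!maskIsRedundant(0xFFu, CrcDataWidth::Half));
}
```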
[APInt] Provide sqrtFloor (floor of square root) instead of sqrt (rounded) (#197406)
This simplifies both the implementation and the only in-tree user.
I changed the name to avoid silently changing the behaviour of an
existing function that might have out-of-tree users.
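To make the difference between the two roundings concrete, here is a small sketch on plain `uint64_t`. It is not the APInt implementation; `sqrtFloorU64` and `sqrtRoundedU64` are hypothetical helpers, and the binary search is just one simple way to compute the floor.

```cpp
#include <cassert>
#include <cstdint>

// Floor of the square root: the largest R with R * R <= N.
// R never exceeds 2^32 - 1, so Mid * Mid cannot overflow 64 bits.
uint64_t sqrtFloorU64(uint64_t N) {
  uint64_t Lo = 0, Hi = 0xFFFFFFFFull;
  while (Lo < Hi) {
    uint64_t Mid = Lo + (Hi - Lo + 1) / 2; // round up so the loop terminates
    if (Mid * Mid <= N)
      Lo = Mid;
    else
      Hi = Mid - 1;
  }
  return Lo;
}

// Round-to-nearest variant, shown only for contrast: round up when
// N lies above (R + 0.5)^2, i.e. when N - R * R > R.
uint64_t sqrtRoundedU64(uint64_t N) {
  uint64_t R = sqrtFloorU64(N);
  return (N - R * R > R) ? R + 1 : R;
}

// Usage: the two variants agree on perfect squares and can differ otherwise.
void sqrtExamples() {
  assert(sqrtFloorU64(16) == 4 && sqrtRoundedU64(16) == 4);
  assert(sqrtFloorU64(15) == 3 && sqrtRoundedU64(15) == 4);
}
```

A caller that silently switched from the rounded result to the floor would see different values for inputs such as 15, which is the kind of change the rename is meant to surface.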
[LV][NFC] Remove instcombine pass from RUN lines in ARM tests (#202913)
Following on from PR #197448, I've now removed the instcombine pass from
RUN lines in the ARM test directory, which exposes some potential
missing optimisations in VPlan:
1. We could be folding IR into saturating math intrinsic calls to better
reflect the cost.
2. Masked load + select -> masked load with different passthru.
3. icmp + select -> smin/smax.
Some of these were already observed in #197448.
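For item 3 in the list above, the scalar equivalence behind the fold can be sketched as below. This only models the arithmetic on `int32_t`; it says nothing about how VPlan would perform the rewrite, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Models `select (icmp slt A, B), A, B`.
constexpr int32_t selectOfIcmpSlt(int32_t A, int32_t B) { return A < B ? A : B; }

// Models `select (icmp sgt A, B), A, B`.
constexpr int32_t selectOfIcmpSgt(int32_t A, int32_t B) { return A > B ? A : B; }

// Checks that the compare-and-select forms agree with signed min/max
// on a handful of sample values, including the extremes.
void checkMinMaxEquivalence() {
  const int32_t Samples[] = {INT32_MIN, -7, 0, 3, INT32_MAX};
  for (int32_t A : Samples) {
    for (int32_t B : Samples) {
      assert(selectOfIcmpSlt(A, B) == std::min(A, B)); // smin
      assert(selectOfIcmpSgt(A, B) == std::max(A, B)); // smax
    }
  }
}
```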
[GlobalISel] Remove `fp_to_[s/u]int_sat_gi` node (#202908)
Instead of having a separate node, reuse `fp_to_[s/u]int_sat`
but drop the saturation width from it.
Assisted-by: Claude Code
[LoopInterchange] Use UTC as much as possible (NFC) (#202096)
Historically, the loop-interchange tests have relied heavily on checks
via pass remarks. This is because pass remarks are more human-readable
than the CHECK directives generated by UTC. However, during recent
development, I found some downsides:
- Updating them manually is a bit tedious.
- We need to carefully keep the remarks and the code consistent with
each other. In other words, we don't have any way to verify whether the
remarks themselves are reasonable.
For these reasons, I now think it makes more sense to rely on UTC as
much as possible, and this patch does that. Some tests are left as-is,
e.g., the test for checking remarks.
Disclosure: This patch is assisted-by Claude Code.
Reapply "[GlobalISel] Add a shared matcher for memcpy-family instructions (NFC)" (#202275) (#202298)
sanitizer-aarch64-linux-bootstrap-ubsan broke after #201766:
lab.llvm.org/buildbot/#/builders/85/builds/22356
failed tests:
LLVM :: CodeGen/AArch64/aarch64-mops.ll
LLVM :: CodeGen/AArch64/memsize-remarks.ll
The culprit is canLowerMemCpyFamily returning true for zero-length ops
before initializing IsVolatile. The memcpy-family lowering helpers don't
use IsVolatile; it's only needed while building the lowering plan with
findGISelOptimalMemOpLowering and shouldn't have been forwarded.
I've also checked the other arguments and simplified alignment.
This reverts commit 2de2edb943fe1b83d79bdffa03606eb8c5452e9b.
[NFC][Support] Implement slash-agnostic path matching in GlobPattern (#202854)
Add a SlashAgnostic option to GlobPattern to allow matching path
separators (both forward slashes and backslashes) agnostically.
When enabled:
- We conservatively reduce the plain prefix and suffix by treating path
separators as metacharacters. This ensures that path separators are
matched via the slash-agnostic state machine rather than plain string
comparison.
- Brackets containing slashes are adjusted to match both separators.
- Character comparisons in the state machine (matchChar) treat '/' and
'\' as equivalent.
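The character-comparison rule in the last bullet can be sketched as follows. This is an assumption-laden model, not GlobPattern's code: `matchCharSketch` and `literalPathEquals` are hypothetical names, and the real `matchChar` signature is not shown in this message.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr bool isSeparator(char C) { return C == '/' || C == '\\'; }

// Hypothetical stand-in for the comparison step: with SlashAgnostic set,
// any separator in the pattern matches any separator in the text.
constexpr bool matchCharSketch(char Pattern, char Text, bool SlashAgnostic) {
  if (SlashAgnostic && isSeparator(Pattern) && isSeparator(Text))
    return true;
  return Pattern == Text;
}

// Literal-only comparison built on the rule above (no wildcards or brackets).
bool literalPathEquals(std::string_view Pattern, std::string_view Text,
                       bool SlashAgnostic) {
  if (Pattern.size() != Text.size())
    return false;
  for (std::size_t I = 0; I < Pattern.size(); ++I)
    if (!matchCharSketch(Pattern[I], Text[I], SlashAgnostic))
      return false;
  return true;
}

// Usage: mixed separators match only when the option is enabled.
void slashAgnosticExamples() {
  assert(literalPathEquals("src/lib\\a.c", "src\\lib/a.c", true));
  assert(!literalPathEquals("src/lib\\a.c", "src\\lib/a.c", false));
}
```

This also hints at why the plain prefix and suffix have to be shortened: a plain string comparison would reject `src\lib` against `src/lib` before the separator-aware comparison ever ran.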
For #149886.
Co-authored-by: Devon Loehr <DKLoehr at users.noreply.github.com>
Assisted-by: Gemini