[mlir][OpenMP] Translate task_reduction on omp.taskgroup (#199565)
This patch adds LLVM IR translation for `task_reduction` on
`omp.taskgroup`.
Flang already parses, checks, and lowers the relevant task-reduction
constructs to OpenMP MLIR, but the LLVM IR translation path was
incomplete. This patch implements the taskgroup reduction setup needed
by the follow-up taskloop and task `in_reduction` work.
For each reducer on `omp.taskgroup`, the translation emits init and
combiner helpers from the corresponding `omp.declare_reduction` regions,
builds the `kmp_taskred_input_t` descriptor array, and calls
`__kmpc_taskred_init` before entering the taskgroup body.
This patch intentionally keeps `reduction` / `in_reduction` on
`omp.taskloop.context` unsupported. Those are handled in follow-up PRs.
### Stack / review order
[20 lines not shown]
[PGO] Implement PGO counter promotion for atomic updates (#202487)
Currently PGO counter updates are promoted/hoisted out of loops where
possible, in order to reduce memory accesses. The promotion is
implemented via the LoadAndStorePromoter and SSAUpdater classes.
When the updates are relaxed atomic, however, hoisting doesn't happen.
Judging by the semantics of relaxed atomics, similar promotions should
be legal, but teaching LoadAndStorePromoter and SSAUpdater to handle
them seems like a lot of work and would touch common code used by many
LLVM optimizations such as SROA.
An easier approach, implemented here, is to perform the promotions on
non-atomic updates, then transform the promoted updates to (relaxed)
atomic.
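A minimal sketch of the effect in plain C++ rather than LLVM IR, assuming a single counter and a trivial loop body. The names `countPerIteration` and `countPromoted` are hypothetical and not part of the patch. The first function issues a relaxed atomic update on every iteration; the second accumulates in a local and publishes once with a single relaxed atomic add, which is roughly the shape of a promoted update that was made atomic afterwards.

```cpp
#include <atomic>
#include <cstdint>

// Unpromoted form: one relaxed atomic read-modify-write per iteration.
inline void countPerIteration(std::atomic<std::uint64_t> &counter, int n) {
  for (int i = 0; i < n; ++i)
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Promoted form: accumulate in a local, then publish the total with a
// single relaxed atomic add after the loop.
inline void countPromoted(std::atomic<std::uint64_t> &counter, int n) {
  std::uint64_t local = 0;
  for (int i = 0; i < n; ++i)
    ++local;
  counter.fetch_add(local, std::memory_order_relaxed);
}
```

Both functions leave the counter with the same final value; the promoted form touches memory once instead of `n` times.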
I also added a flag-guarded sanity check that users can enable to make
sure all PGO counter updates have been made atomic (in case we miss
some).
[32 lines not shown]
clang: enable `swiftasynccall` for Wasm (#203330)
Follow-up to https://github.com/llvm/llvm-project/pull/188296, which
lowers `swiftasynccall` in LLVM to Wasm `return_call` and
`return_call_indirect` instructions when tail calls are enabled. The
calling convention still needed to be enabled at the Clang level in
`checkCallingConvention` in `lib/Basic/Targets/WebAssembly.h`.
Revert "[AMDGPU] Capping max number of registers to function's occupancy budget for indirect calls" (#204605)
Reverts llvm/llvm-project#199765
Broke https://lab.llvm.org/buildbot/#/builders/10
[LifetimeSafety] Propagate loans through the GNU binary conditional (#204439)
FactsGenerator only handled the ternary, so a borrow used through the
GNU binary conditional `a ?: b` was silently dropped. Handle both via
VisitAbstractConditionalOperator, flowing from
getTrueExpr()/getFalseExpr(). For `a ?: b` getTrueExpr() is an
OpaqueValueExpr, so make OpaqueValueExpr transparent in the origin
manager and peel it in the arm-reachability check; guard against flowing
a void (e.g. throw) arm.
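For readers unfamiliar with the extension, here is a small sketch of the two spellings whose results can borrow from either operand. It relies on the GNU `?:` extension (accepted by GCC and Clang), and the function names are hypothetical; it does not exercise the lifetime analysis itself.

```cpp
// GNU binary conditional: `a ?: b` yields `a` when it is non-null (truthy)
// and `b` otherwise, evaluating `a` only once.
inline int *pickFirstNonNull(int *a, int *b) {
  return a ?: b; // the result may point into either operand's storage
}

// Equivalent ternary spelling, which names `a` twice.
inline int *pickFirstNonNullTernary(int *a, int *b) {
  return a ? a : b;
}
```

In both forms the returned pointer may alias either argument, so a borrow has to be tracked through both arms.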
Assisted-by: Claude Opus 4.8
Co-authored-by: Gabor Horvath <gaborh at apple.com>
[libc++][mdspan][test] Correct `mapping::operator()` constraint tests (#201061)
The previous requires-expression only checked that `std::is_same_v<...>`
was a well-formed expression, so the test would pass even when the
result was false.
[mlir][VectorToLLVM] add opt-in `enable-gep-inbounds-nuw` pass flag for `vector.load/store` (#202118)
> This patch follows up on #201180 and the refactoring #202766 (which
made `affine-super-vectorize` emit `in_bounds = [true]` on
`vector.transfer_read`/`write` when accesses are statically provable to
be within bounds). With that in place, the `VectorToLLVM` lowering was
still emitting `llvm.getelementptr` without `inbounds`/`nuw`, so LLVM
could not exploit the no-wrap guarantee: SCEV could not prove the index
arithmetic monotone (loop vectorizer bailed out) and BasicAliasAnalysis
fell back to conservative aliasing.
Without `inbounds`/`nuw` on the GEP that `vector.load`/`vector.store`
lower to, LLVM cannot exploit no-wrap guarantees: SCEV fails to prove
loop-index monotonicity (loop vectorizer bails), and BasicAliasAnalysis
falls back to conservative aliasing.
### Why opt-in
Unlike `memref.load`, `vector.load`/`vector.store` intentionally allow
[37 lines not shown]
[libc++] Restore release note dropped during rebase of #196495 (#204435)
A release note about transitive include removals was inadvertently
dropped during a rebase of #196495 before merge. This restores it.
[llvm-pdbutil] Add DXContainer support to `llvm-pdbutil dump` (#200485)
This patch adds a `--dxcontainer` option that attempts to parse a
`DXContainer` from the stream 5 data (generated by DirectX tools) of a
PDB file and, if successful, dumps basic information about it. If no
`DXContainer` could be parsed, it reports that none is present in the
file.
[mlir][MemRefToLLVM] fix incorrect `nuw` on `GEP/mul` when lowering `memref.load/store` with negative strides (#204309)
`MemRefToLLVM` was unconditionally emitting `getelementptr inbounds|nuw`
(and consequently `mul overflow<nsw,nuw>` on every intermediate index
computation inside `getStridedElementPtr`) for all `memref.load` and
`memref.store` lowerings.
This is _unsound_ when any stride is negative or dynamic.
`getStridedElementPtr` propagates `GEPNoWrapFlags::nuw` to
`IntegerOverflowFlags::nuw` on every intermediate `llvm.mul` and
`llvm.add` it emits. With a negative stride (e.g. `-1`, which is
`2^64-1` unsigned), an access like index=5 produces `mul nuw 5,
(2^64-1)`, which unsigned-overflows and yields poison per LangRef —
regardless of whether the final offset happens to be non-negative.
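The arithmetic can be checked with a short sketch. `unsignedMulOverflows` and `kNegOneStride` are hypothetical names used only for illustration; the helper models the condition under which a `mul nuw` on 64-bit operands would produce poison.

```cpp
#include <cstdint>
#include <limits>

// True if a * b does not fit in 64 unsigned bits, i.e. the case in which
// `mul nuw a, b` would yield poison.
constexpr bool unsignedMulOverflows(std::uint64_t a, std::uint64_t b) {
  return a != 0 && b > std::numeric_limits<std::uint64_t>::max() / a;
}

// A stride of -1 reinterpreted as an unsigned 64-bit value is 2^64 - 1.
constexpr std::uint64_t kNegOneStride = static_cast<std::uint64_t>(-1);

// index = 5 with stride -1 overflows as an unsigned multiply...
static_assert(unsignedMulOverflows(5, kNegOneStride), "nuw would be violated");
// ...while the same index with stride 1 does not.
static_assert(!unsignedMulOverflows(5, 1), "positive stride is fine");
// The wrapped product is still the two's-complement encoding of -5, so the
// offset itself is meaningful; only the nuw flag is wrong.
static_assert(5 * kNegOneStride == static_cast<std::uint64_t>(-5),
              "wrapped result encodes -5");
```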
This issue came up in the discussion in PR #202118. Thanks to
@banach-space for the detailed discussion.
This PR hopefully concludes the path to fix the regression related to
[6 lines not shown]
AMDGPU: Use module flags to control xnack and sramecc
This ensures these ABI details are encoded in the IR module
rather than depending on external state from command-line flags.
Previously, these were encoded as function-level subtarget features.
The code object output was a single target ID directive implied
by the global subtarget. The backend would previously check if a
function's subtarget feature mismatched the global subtarget. This
is avoided by making xnack and sramecc module-level properties from
the start. This also provides proper linker compatibility
enforcement, moving the error point earlier.
The old encoding was also an abuse of the subtarget feature system.
Subtarget features are a bitvector, and later features in the string
can override earlier ones. The old handling added a special case
where explicit settings were preserved: ordinarily +feature,-feature
should result in the feature being disabled, but +xnack,-xnack would
preserve the explicit "-xnack" state, which differs from the absence
of any xnack setting.
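The difference between the two views can be illustrated with a small sketch. This is not the backend's parsing code; `Setting`, `explicitSetting`, and `isEnabled` are hypothetical names, and the feature string format is simplified to comma-separated `+name` / `-name` items.

```cpp
#include <sstream>
#include <string>

// Tri-state view: an explicit "-name" is distinct from no mention at all.
enum class Setting { Unspecified, Off, On };

// Walks a comma-separated feature string; the last mention of `name` wins.
inline Setting explicitSetting(const std::string &features,
                               const std::string &name) {
  Setting setting = Setting::Unspecified;
  std::istringstream stream(features);
  std::string item;
  while (std::getline(stream, item, ',')) {
    if (item == "+" + name)
      setting = Setting::On;
    else if (item == "-" + name)
      setting = Setting::Off;
  }
  return setting;
}

// Plain bitvector view: "explicitly off" and "never mentioned" collapse
// into the same disabled state.
inline bool isEnabled(const std::string &features, const std::string &name) {
  return explicitSetting(features, name) == Setting::On;
}
```

With `"+xnack,-xnack"`, `isEnabled` reports the same result as for an empty string, while `explicitSetting` distinguishes `Off` from `Unspecified`. That extra state is what the old special case had to smuggle through the feature system.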
[25 lines not shown]
[clang][bytecode][NFC] Mark results as non-empty when taking a value (#204568)
This was missing, so every EvaluationResult ended up empty even though
its APValue was set. Nobody noticed because `takeAPValue()` lacked an
`assert(!empty())`.
[libc++] Use public os_sync API instead of private __ulock on newer Apple platforms (#202519)
The atomic wait and wake implementation on Apple platforms currently
relies on `__ulock_wait` and `__ulock_wake`, which are private kernel
APIs. This is a problem for anyone shipping apps through the App Store
since Apple flags private symbol usage during review.
Starting with macOS 14.4 and iOS 17.4, Apple ships public replacements
through `os_sync_wait_on_address` and `os_sync_wake_by_address_any/all`
in `<os/os_sync_wait_on_address.h>`. These cover the same functionality
and are documented, stable, and safe for App Store submissions.
This patch makes use of the public APIs instead of the private ones
whenever the underlying OS permits it.
This takes over #182947.
Fixes #182908
Fixes #146142
Co-authored-by: Bbn08 <atrancendentbeing at gmail.com>
[libc] Include linux headers to get ioctl macros (#204555)
Linux has many existing ioctls and keeps adding them, so a
hand-maintained list would always be out of date. Additionally, some
ioctls have architecture specific numbers (some in a very subtle way --
by having the number depend on the size of a structure).
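To show how a structure size ends up in the number, here is a sketch that mirrors the generic field layout from `asm-generic/ioctl.h` (number in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31). Some architectures use a different layout, and the payload structs and helper names below are hypothetical rather than real ioctls.

```cpp
#include <cstdint>

// Direction value for "read" in the generic layout (_IOC_READ).
constexpr std::uint32_t kDirRead = 2;

// Packs the four fields the same way the generic _IOC macro does.
constexpr std::uint32_t encodeIoctl(std::uint32_t dir, std::uint32_t type,
                                    std::uint32_t nr, std::uint32_t size) {
  return (dir << 30) | (size << 16) | (type << 8) | nr;
}

// Two payloads that differ only in size, as a struct containing a `long`
// would between 32-bit and 64-bit targets.
struct Payload32 { std::uint32_t value; };
struct Payload64 { std::uint64_t value; };

// Rough analogue of _IOR(type, nr, T): the payload size is baked in.
template <typename T>
constexpr std::uint32_t readIoctlFor(char type, std::uint32_t nr) {
  return encodeIoctl(kDirRead, static_cast<std::uint32_t>(type), nr,
                     static_cast<std::uint32_t>(sizeof(T)));
}

static_assert(readIoctlFor<Payload32>('T', 1) == 0x80045401u, "4-byte payload");
static_assert(readIoctlFor<Payload64>('T', 1) == 0x80085401u, "8-byte payload");
```

The same type and number yield different request values once the payload size changes, which is why a hand-written constant can be subtly wrong on another architecture.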
asm/ioctls.h and linux/sockios.h are pretty clean, and are already
included by glibc, so we can just do the same to get the latest
definitions.
[AMDGPU] Mark all instructions in WWM region as convergent
Mark instructions between ENTER_STRICT_WWM and EXIT_STRICT_WWM as
convergent, so they don't get moved out of the whole wave mode region
(see the licm-wwm.mir test). This doesn't automagically fix all our
woes, since things can still be moved out of the region before we even
run si-wqm, but there are rumours about moving WWM formation earlier
anyway.
This is not a substitute for proper WWM support - in particular, this
would inhibit most optimizations inside WWM regions with complex control
flow. Right now most WWM is relatively limited in size and complexity,
so I think this is acceptable until we get a more principled solution.
I haven't thought too much about whether or not we need this for WQM as
well.
Assisted by: Claude Sonnet
commit-id:9204c7e2
[AMDGPU][doc] Refactor Barrier Execution Model
Remove everything that has to do with named barriers and put it in a series of model extensions specific to /sbarrier/named-barriers.
I had to change a few things to make it fit, in summary:
Base Model:
* Stylistic changes that make it easier to refer to specific rules. Each rule is in a rubric instead of a bullet point.
* (-) No longer defines `barrier-mutually-exclusive`
* (-) No longer defines barrier `join` and any associated rule.
New named barrier extensions:
* Define "named barrier" as a sub-type of barrier objects. This makes barrier-mutually-exclusive redundant.
* Define barrier join as an operation that can only be performed on `named barrier objects`.
* Define rules relating to join and its ordering with other barrier operations.
Following these changes, the target tables changed a bit as well.
[2 lines not shown]
[libomp] Parse OMP_DEFAULT_DEVICE with new device trait parser
... but do not yet expose the new functionality to the user. This is a
backward-compatible update that will be followed by a move to the
OpenMP 6.0 semantics as defined in 4.3.8.