Fix declare simd linear stride rescaling and arg_types verifier
1. Rescale constant linear steps from source-level element counts to byte
strides in Flang's processLinear(). For reference-like parameters
(pointers or non-VALUE dummy arguments) with Linear or LinearRef ABI
kind, the step must be multiplied by the element size in bytes. This
matches Clang's rescaling in CGOpenMPRuntime.cpp. Val and UVal kinds
are not rescaled as they describe value changes, not pointer strides.
Var-strides are also not rescaled as the value is an argument index.
2. Add a verifier check in DeclareSimdOp to ensure 'arg_types' length
matches the number of function arguments, preventing out-of-bounds
access during MLIR-to-LLVM IR translation.
Also restructure processLinear() to compute stepOperand per-variable
instead of appending the same operand for all objects in the clause,
enabling per-variable rescaling.
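
The rescaling rule described above can be sketched as a small standalone
function. This is only an illustration under stated assumptions, not the
code in processLinear(): the names `LinearKind`, `rescaleLinearStep`,
`isReferenceLike`, `isVarStride` and `elemSizeInBytes` are hypothetical
and do not come from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the ABI kinds named in the commit message.
enum class LinearKind { Linear, LinearRef, Val, UVal };

// Returns the step to record for one linear variable.
//   isReferenceLike: pointer or non-VALUE dummy argument.
//   isVarStride:     the step is an argument index, not a constant.
std::int64_t rescaleLinearStep(std::int64_t step, LinearKind kind,
                               bool isReferenceLike, bool isVarStride,
                               std::int64_t elemSizeInBytes) {
  assert(elemSizeInBytes > 0 && "element size must be positive");
  // Var-strides hold an argument index, so they are never rescaled.
  if (isVarStride)
    return step;
  // Only Linear and LinearRef describe pointer strides.
  bool isPointerStride =
      kind == LinearKind::Linear || kind == LinearKind::LinearRef;
  if (isReferenceLike && isPointerStride)
    return step * elemSizeInBytes; // element count -> byte stride
  // Val and UVal describe value changes and are left untouched.
  return step;
}
```

For example, a step of 2 on a reference-like parameter with 4-byte
elements becomes a byte stride of 8, while the same step with the Val
kind stays 2.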
Assisted with copilot.
[mlir][acc] Sink constants into acc.compute_region when creating (#187777)
When converting OpenACC compute constructs to acc.compute_region, also
sink constants inside so they do not become live-ins.
[flang][OpenMP] Provide reasons for calculated depths
If the depth (either semantic or perfect) was limited by some factor,
include the reason for the reduction.
Issue: https://github.com/llvm/llvm-project/issues/185287
[libc][x86] Add Non-temporal code path for large memcpy (#187108)
Large memcpy calls are fairly rare, but they are more common in ML
workloads (copying large matrices/tensors, often to/from the CPU host).
For large copies, non-temporal (NTA) stores can provide performance
advantages for both memcpy itself and the rest of the workload (by
reducing cache pollution). Other runtimes already have an NTA path for
large copies, so add one to llvm-libc.
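
As a rough illustration of the idea (not the llvm-libc implementation),
a copy routine can switch to streaming stores above a size cutoff. The
function name `copyMaybeNonTemporal` and the 1 MiB value of
`kNonTemporalThreshold` are assumptions made for this sketch; the commit
message does not state the real threshold. On targets without SSE2 the
sketch simply falls back to `std::memcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical cutoff; the real threshold is not given in the message.
constexpr std::size_t kNonTemporalThreshold = std::size_t{1} << 20;

void *copyMaybeNonTemporal(void *dst, const void *src, std::size_t n) {
  assert((dst && src) || n == 0);
#if defined(__SSE2__)
  if (n >= kNonTemporalThreshold) {
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    // Copy a short head so the destination becomes 16-byte aligned,
    // which _mm_stream_si128 requires.
    std::size_t head = (16 - reinterpret_cast<std::uintptr_t>(d) % 16) % 16;
    std::memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;
    // Stream 16 bytes at a time; the stores bypass the cache.
    for (; n >= 16; d += 16, s += 16, n -= 16) {
      __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s));
      _mm_stream_si128(reinterpret_cast<__m128i *>(d), v);
    }
    _mm_sfence();         // order streaming stores before later accesses
    std::memcpy(d, s, n); // remaining tail, fewer than 16 bytes
    return dst;
  }
#endif
  return std::memcpy(dst, src, n);
}
```

Small copies keep the regular path because their data is likely to be
reused soon, so bypassing the cache would tend to hurt rather than help.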
Internal whole-program load tests show a small but statistically
significant improvement of 0.1%. ML-specific benchmarks showed a 10-20%
performance gain, and fleetbench (https://github.com/google/fleetbench,
which has a more up-to-date version of the libc benchmarks) shows a ~3%
gain (ns/byte for distributions taken from various applications).
```
[Memcpy_0]_L1 0.01950n ± 3% 0.01900n ± 5% ~ (p=0.390 n=20)
[Memcpy_0]_L2 0.02300n ± 0% 0.02300n ± 0% ~ (p=0.256 n=20)
[35 lines not shown]
```
[AMDGPU][SIInsertWaitcnts] Add test functions in waitcnt-wcg-attributes.mir (#186504)
This patch adds two more functions for exercising the target-cpu
attribute.
[Clang] Use stable_sort for UnqualUsingDirectiveSet for determinism in ambiguity notes (#187750)
In SemaLookup.cpp, `UnqualUsingDirectiveSet::done()` uses `llvm::sort`
with a comparator that only checks the ancestor relationships. So, if
there are multiple "neighbor" namespaces, they are considered equal, and
thus `llvm::sort` may return the using directives in a non-deterministic
order.
This was observed as a test failure on clang/test/CXX/drs/cwg0xx.cpp at
line 220 after PR #187219 started verifying the diagnostics ordering.
The two "candidate found by name lookup" notes were emitted in the
opposite order from the test's expectations -- in some builds of Clang,
but not others.
Switching to `llvm::stable_sort` ensures that using-directives are
always traversed in a deterministic order, and thus the notes are
emitted deterministically.
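
The effect can be reproduced outside Clang with a comparator that treats
some elements as equivalent. The sketch below uses `std::stable_sort`
from the standard library rather than `llvm::stable_sort`; the
`Directive` struct, its `depth` field and `sortedNames` are hypothetical
stand-ins for the using-directive entries and their ancestor
relationship.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Directive {
  std::string name;
  int depth; // hypothetical stand-in for the ancestor relationship
};

// Orders by depth only, so entries with equal depth compare equivalent.
// A stable sort keeps such entries in their original relative order.
std::vector<std::string> sortedNames(std::vector<Directive> dirs) {
  assert(!dirs.empty() && "expected at least one directive");
  std::stable_sort(dirs.begin(), dirs.end(),
                   [](const Directive &a, const Directive &b) {
                     return a.depth < b.depth;
                   });
  std::vector<std::string> names;
  for (const Directive &d : dirs)
    names.push_back(d.name);
  return names;
}
```

With the input `{B, 1}, {A, 1}, {C, 0}`, the result is always C, B, A:
B and A compare equivalent, and a stable sort preserves their input
order. An unstable sort is free to return either C, B, A or C, A, B.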
[flang][OpenMP] Introduce `WithReason<T>` for nest/sequence properties (#187563)
This helper class contains an optional value and a "reason" message. It
replaces the uses of std::pair<optional<...>, Reason>.
Issue: https://github.com/llvm/llvm-project/issues/185287
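
A helper of the shape described above might look roughly like the
following. This is a sketch based only on the description (an optional
value plus a reason message); the member names, the constructor, and the
`limitDepth` example are assumptions, not the actual Flang definition.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// An optional value together with a message explaining how it was
// obtained or why it was limited.
template <typename T> struct WithReason {
  std::optional<T> value;
  std::string reason;

  WithReason() = default;
  WithReason(std::optional<T> v, std::string r = {})
      : value(std::move(v)), reason(std::move(r)) {}

  explicit operator bool() const { return value.has_value(); }
};

// Hypothetical use: a requested depth capped by what is available.
WithReason<int> limitDepth(int requested, int available) {
  assert(requested >= 0 && available >= 0);
  if (requested <= available)
    return WithReason<int>(requested);
  return WithReason<int>(available, "only " + std::to_string(available) +
                                        " nested loops are present");
}
```

Compared with a `std::pair`, named members make call sites read as
`result.value` and `result.reason` instead of `first` and `second`.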
[RISCV] Fix the pipe used by `fmv.x.<fp>/<fp>.x` in SiFive7 sched model (#187740)
These FP <-> Integer conversion instructions should use PipeA instead.
[NFC][LV] Fix what seems to be a typo in the test
The test was added in https://github.com/llvm/llvm-project/commit/4e9894498e166ef6b207c25e780db0b6f006cc89.
Alternative fixes would be:
* Remove the unused GEP, although it is not clear why we would want to
overwrite the stored `i64` with a `ptr` store.
* Keep this patch, but perform both GEPs with the `i64` element type to
reduce the diff. It is not clear whether the scalarization caused by
that type mismatch is intentional/relevant for the original change.
[CGP][PAC] Flip PHI and blends when all immediate modifiers are the same
GVN PRE, SimplifyCFG and possibly other passes may hoist calls to the
`@llvm.ptrauth.blend` intrinsic, introducing multiple duplicate call
instructions hidden behind a PHI node. This prevents the instruction
selector from generating safer code by absorbing the address and
immediate modifiers into separate operands of the AUT, PAC, etc. pseudo
instructions.
This patch makes the CodeGenPrepare pass detect when a discriminator is
computed as a PHI node whose incoming values are all blends with the
same immediate modifier. Each such discriminator value is replaced by a
single blend, whose address argument is computed by a PHI node.
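
The rewrite is value-preserving because the immediate is the same on
every incoming edge. The sketch below models this in plain C++ rather
than LLVM IR. The `blend` function is an assumed model in which the
16-bit immediate replaces the top 16 bits of the address; the function
names and the use of a `bool` to stand in for the incoming edge are
hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed model of a blend: the 16-bit immediate modifier replaces the
// top 16 bits of the address modifier.
std::uint64_t blend(std::uint64_t addr, std::uint16_t imm) {
  return (addr & 0x0000FFFFFFFFFFFFull) | (std::uint64_t{imm} << 48);
}

// Before: each predecessor computes its own blend, and the PHI (modeled
// by the conditional) merges the already-blended values.
std::uint64_t discBefore(bool fromLeft, std::uint64_t a, std::uint64_t b,
                         std::uint16_t imm) {
  return fromLeft ? blend(a, imm) : blend(b, imm);
}

// After: the PHI merges the addresses, followed by a single blend whose
// address and immediate operands are both directly visible.
std::uint64_t discAfter(bool fromLeft, std::uint64_t a, std::uint64_t b,
                        std::uint16_t imm) {
  std::uint64_t addr = fromLeft ? a : b;
  return blend(addr, imm);
}

// Checks that both forms yield the same discriminator on either edge.
void checkFlipPreservesValue(std::uint64_t a, std::uint64_t b,
                             std::uint16_t imm) {
  assert(discBefore(true, a, b, imm) == discAfter(true, a, b, imm));
  assert(discBefore(false, a, b, imm) == discAfter(false, a, b, imm));
}
```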
[llvm] Silence llvm-debuginfod-find/headers-winhttp.test on Windows bots temporarily (#187753)
Windows bots are still failing after a3db68a97b2c321e and
d7dbba55bff52f342. This test is new, so disable it while
we investigate.
[OpenMP] Emit aggregate kernel prototypes and remove libffi dependency (#186261)
Summary:
This PR changes the kernels emitted when targeting a CPU to take a
single pointer to a struct of arguments.
The old handling emitted a standard function prototype, which
necessitated a target-specific ABI to call it because the signature
differed with the number of arguments. Instead, this PR emits a void
pointer to a naturally aligned struct, which is what APIs like
`pthreads` expect.
This allows us to remove all the complexity around launching host
kernels and just pass the argument list.
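
The calling convention described above can be illustrated with a small
sketch. It is not the code emitted by the compiler or used by the
runtime: `KernelArgs`, `KernelFn`, `scaleKernel` and `launch` are
hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical argument block: one naturally aligned field per kernel
// argument.
struct KernelArgs {
  const std::int32_t *in;
  std::int32_t *out;
  std::int64_t n;
};

// Every kernel shares this prototype, however many arguments it takes.
using KernelFn = void (*)(void *);

// Example kernel: unpacks its arguments from the struct, then runs.
void scaleKernel(void *raw) {
  auto *args = static_cast<KernelArgs *>(raw);
  for (std::int64_t i = 0; i < args->n; ++i)
    args->out[i] = 2 * args->in[i];
}

// Hypothetical launcher: no per-signature ABI handling is required.
void launch(KernelFn fn, void *args) {
  assert(fn && "kernel entry point must be non-null");
  fn(args);
}
```

Because the entry point always has the shape `void (*)(void *)`, the
caller no longer needs a foreign-function library to build a call whose
argument count is only known at run time.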
[CIR][NFC] Minor cleanups to missing feature markers (#187754)
This fixes a few places where MissingFeatures asserts were incorrect,
extends the text of two errorNYI diagnostics to disambiguate them, and
fixes a typo in an adjacent comment.
[AArch64][PAC] Rework discriminator analysis for calls and tail calls
Make use of fixupBlendComponents for AUTH_TCRETURN[_BTI] and for
BLRA[_RVMARKER] pseudos the same way it is done for AUT/PAC/AUTPAC.
This patch unifies discriminator analysis for DAGISel and GlobalISel
and improves cross-BB analysis in case of DAGISel.