[AArch64][clang][llvm] Add ACLE Armv9.7 MMLA intrinsics
Implement new ACLE matrix multiply-accumulate intrinsics for Armv9.7:
```c
// 16-bit floating-point matrix multiply-accumulate.
// Only if __ARM_FEATURE_SVE_B16MM
// Variant also available for _f16 if (__ARM_FEATURE_SVE2p2 && __ARM_FEATURE_F16MM).
svbfloat16_t svmmla[_bf16](svbfloat16_t zda, svbfloat16_t zn, svbfloat16_t zm);
// Half-precision matrix multiply, accumulating to single precision.
// Instruction from Armv9.7-A; requires the +f16f32mm architecture extension.
float32x4_t vmmlaq_f32_f16(float32x4_t r, float16x8_t a, float16x8_t b);
// Non-widening half-precision matrix multiply. Requires the
// +f16mm architecture extension.
float16x8_t vmmlaq_f16_f16(float16x8_t r, float16x8_t a, float16x8_t b);
```
Fix formatting in recent redefine_extname changes. (#189938)
My recent commits fd7388d14083bb5094bce6a75444a37e424689d7 and
37888541a96e4f10bf1b71b869145f0d31a9d580 had minor formatting issues,
introduced while editing the PR because I had forgotten to rerun
clang-format. Sorry for that; this commit applies the missing formatting.
I did not fix line lengths in the FileCheck tests, as exceeding the line
length there seems more consistent with the surrounding tests.
[mlir][math] Add constant folding for `math.fpowi` (#193761)
Adds a constant folder for `math.fpowi` when both operands are constant
and the integer exponent is exactly representable in the floating-point
type of the base.
[LangRef][AMDGPU] Specify that syncscope can cause atomic operations to race
Targets should be able to specify that the syncscope of atomic operations
influences whether they participate in data races with each other.
For example, in AMDGPU, we want (and already implement) the load in the
following case to be in a data race (i.e., return `undef` according to the
current definition), because there is an atomic store with workgroup syncscope
executing in a different workgroup:
```
; workgroup 0:
store atomic i32 1, ptr %p syncscope("workgroup") monotonic, align 4
; workgroup 1:
store atomic i32 2, ptr %p syncscope("workgroup") monotonic, align 4
load atomic i32, ptr %p syncscope("workgroup") monotonic, align 4
```
[3 lines not shown]
[LangRef] Specify that syncscopes can affect the monotonic modification order
If a target specifies that atomics with mismatching syncscopes appear
non-atomic to each other, there is no point in requiring them to be ordered in
the monotonic modification order. Notably, the [AMDGPU target user
guide](https://llvm.org/docs/AMDGPUUsage.html#memory-scopes) has specified
syncscopes to relax the modification order for years.
So far, I haven't found an example where this less constrained ordering would
be observable (at least with the AMDGPU inclusive scope rules). Whenever a load
would be able to see two monotonic stores with non-inclusive scope, that's
considered a data race (i.e., the load would return `undef`), so it cannot be
used to observe the order of the stores.
[AMDGPUUsage] Specify what one-as syncscopes do
This matches the currently implemented and (as far as I could determine)
intended semantics of these syncscopes.
The sync scope table is unchanged except for removing its indentation;
otherwise it would be rendered as part of the preceding note.
[LangRef] Allow monotonic & seq_cst accesses to inter-operate with other accesses
Currently, the LangRef says that atomic operations (which includes `unordered`
operations, which don't participate in the monotonic modification order) must
read a value from the modification order of monotonic operations.
In the following example, this means that the load does not have a store it
could read from, because all stores it may see do not participate in the
monotonic modification order:
```
; thread 0:
store atomic i32 1, ptr %p unordered, align 4
; thread 1:
store atomic i32 2, ptr %p unordered, align 4
load atomic i32, ptr %p unordered, align 4
```
[18 lines not shown]
[Flang] Add `INLINEALWAYS` Compiler Directive (#192674)
Adds support for the INLINEALWAYS Compiler Directive to Flang. This was
previously supported in Classic-Flang, and works in the same way as
FORCEINLINE.
It can either be defined at the call site, or within the function the
user wishes to inline.
The missing support was highlighted while building an open-source
benchmark, as build warnings indicated that this compiler directive was
being ignored.
[mlir][math] Use APFloat::SemanticsToEnum in constant folding (#193914)
Refactor constant folding in the Math dialect to use APFloat::SemanticsToEnum() instead of getSizeInBits() when checking
floating-point semantics. Inferring semantics from bitwidth is fragile: different formats may share the same bit width but have distinct semantics, leading to incorrect dispatch. SemanticsToEnum() matches on the exact semantics descriptor, making the intent explicit and ensuring correct dispatch.
[lldb] Remove full stop from AppendErrorWithFormat format strings (part 1) (#193750)
To fit the style guide:
https://llvm.org/docs/CodingStandards.html#error-and-warning-messages
I found these with:
* Find `(\.AppendErrorWithFormat\(([\s\r\n]+)?"(?:(?:\\.|[^"\\])*))\."`
and replace with `$1"` using Visual Studio Code.
* Putting a call to `validate_diagnostic` in `AppendErrorWithFormat`.
* Manual inspection.
Note that this change *does not* include a call to `validate_diagnostic`
because I do not know what's going to crash on platforms that I haven't
tested on.
[SPIRV] Do not add aliasing decorations to OpAtomicStore/OpAtomicLoad (#193779)
Do not attach aliasing decorations to atomic store or load intrinsics /
instructions, since the extension is currently inconsistent: we cannot
add the decoration to atomic stores because they do not have a result
id.
[SDAG] Minor cleanup to TargetLowering::expandFP_ROUND. NFC (#193793)
Noticed while porting to GISel: constants might as well be added to the
RHS of an add, and a bitcast from the same type can be removed.
[DirectX] Replace non-const count of DISubrange with -1 (#192576)
Non-const count is only emitted for C99 VLAs, which are not supported.
Co-authored-by: Andrew Savonichev <andrew.savonichev at gmail.com>