[AArch64][SME] Split FP8 FTMOPA intrinsics (#203310)
Introduce separate FP8 FTMOPA intrinsics for ZA16 and ZA32:
llvm.aarch64.sme.fp8.ftmopa.za16
llvm.aarch64.sme.fp8.ftmopa.za32
The FP8 FTMOPA forms need to model their FPMR dependency, so they should
not share the same intrinsic definitions as the non-FP8 FTMOPA forms.
Update the Clang SME builtin definitions and AArch64 instruction
patterns to use the new intrinsics, and add AutoUpgrade support for the
previous FP8-shaped llvm.aarch64.sme.ftmopa.* spellings so existing IR
and bitcode continue to work.
This was split out from #154144 because the intrinsic upgrade needs to
be handled separately to avoid breaking existing bitcode.
[CIR][AArch64] Lower NEON laneq FMA builtins (#202337)
Lower additional AArch64 NEON laneq fused multiply-accumulate builtins
in CIR.
This covers:
- `BI__builtin_neon_vfmaq_laneq_v`
- `vfmaq_laneq_f16`
- `vfmaq_laneq_f32`
- `vfmaq_laneq_f64`
- `BI__builtin_neon_vfmad_laneq_f64`
- `vfmad_laneq_f64`
For `vfmaq_laneq_v`, the lowering bitcasts the operands, splats the
selected lane source, and emits the `llvm.fma` intrinsic with the
operand order matching classic AArch64 CodeGen.
For `vfmad_laneq_f64`, the lowering extracts the selected lane from the
`float64x2_t` source and emits scalar `llvm.fma.f64`.
[7 lines not shown]
[CIR][AArch64] Lower NEON subtraction intrinsics (#202857)
### summary
part of : https://github.com/llvm/llvm-project/issues/185382
- Add CIR lowering for the scalar AArch64 NEON subtraction builtins
`vsubd_s64` and `vsubd_u64`.
- Verify that the remaining signed, unsigned, and floating-point
`vsub/vsubq` intrinsics are correctly expanded through arm_neon.h and
emitted as `cir.sub`.
[Dexter] Add condition check to state nodes
This patch enables the ability for state nodes to check conditions, meaning
they will be active only if the condition is met.
Condition evaluation is somewhat language specific; we directly check
whether the value of the evaluated expression is "true" (case-insensitive),
which works for the languages we actually use Dexter with, but may require
generalizing in future.
We also cache conditions as they are evaluated; each time we step, we clear
all cached conditions for the current frame and any expired frames, but we
keep the cached conditions for any frames rootwards from the current frame;
this prevents us from unexpectedly exiting out of a callee frame because of
debug info not surviving a stack unwind; if the early exit is desired, an
!and{at_frame_idx, condition} under the lower frame may suffice.
[Dexter] Enable after_hit_count for state nodes
The after_hit_count attribute for a state node causes it to become active
only after it would have become active N times. This uses the existing logic
for incrementing hit counts, i.e. after the node becomes "active", we will
not add another hit count until it stops being active for at least one step.
Since state nodes with after_hit_count do not become active before reaching
the required hit count, this requires us to keep track of an "early" set of
state nodes, meaning nodes that would be active if not for their
after_hit_count.
[Dexter] Add support for writing !step values
Following from the previous patch, this patch adds support to Dexter for
generating expected values for !step nodes. This is relatively limited:
the kind of !step which this is most well-suited to this is !step exactly,
as the !step order of ignoring extra lines is redundant (all lines are added
as expected values), and !step never can't know what lines could have been
stepped on but weren't without some extra work (e.g. finding viable
breakpoint locations in the enclosing state node).
[Dexter] Add !step node for testing stepping behaviour
This patch adds a node for generating metrics based on lines stepped on. The
new node has 3 versions: !step exactly, !step order, and !step never, which
check an expected list of line numbers against the actual line numbers seen
while the expect is active.
[tools] Register analyses correctly (#203808)
- Analyses with custom parameter must be registered before
register*Analyses, otherwise it will be skipped.
- Remove redundant LibcallLoweringModuleAnalysis, pass builder will
register it automatically.
[Valuetracking] Use all FPClasses ordering information for min/max (#199651)
Min/Max functions can exclude more FPClasses than
OrderedLessThanZeroMask/OrderedGreaterThanZeroMask. Now it excludes all
analyzable FPClasses, of which +/-Inf are the most useful.
This enhances analysis for transforms which need to exclude Inf.
Here is a simplified example: 0*y -> 0 is only correct if y cannot be
Inf or NaN, otherwise it may be NaN.
[clang][bytecode] Add `Block::invokeCtorNoMemset()` (#203749)
`invokeCtor()` first memsets the memory to zero, then calls the
descriptor ctor function. The memset is unnecessary if we're already
working with zero-ed memory, like the one we get from
`std::make_unique`.
[LoopVectorize] Fix nondeterminism in loop-vectorize (#200833)
The nondeterministic iteration over `AddrDefs` (SmallPtrSet) causes
nondeterministic output for the test case in this patch (reduced from a
C codebase). One of two different outputs is generated arbitrarily,
chosen roughly equally.
Between the two different outputs sometimes the instruction
`%3 = load i64, ptr %2, align 8`
has an associated cost of 4 and othertimes 9. The instruction is visited
twice in `setCostBasedWideningDecision` in the `AddrDefs` loop: once
directly as an element of `AddrDefs`, and the other time indirectly in
the lambda `UpdateMemOpUserCost` as a User of another `AddrDefs`
element. Each of those times `setWideningDecision` is called with a
different cost value; the final of the two calls sets the final value
(previous is overwritten). Because `AddrDefs` iteration is
nondeterministic, the order of those two calls to `setWideningDecision`
is also nondeterministic, hence we see two different costs arbitrarily
between runs.
[13 lines not shown]
[ObjectYAML] Make BBAddrMap encoder diagnostics format-neutral (#202524)
In preparation for sharing the yaml2obj BBAddrMap encoder with COFF.
1. Drop the now-dead `Section.Type == SHT_LLVM_BB_ADDR_MAP` guards (#146186).
2. Reword the two warnings that will move into the shared helper.
3. Fix a "PBOBBEntries" -> "PGOBBEntries" typo.
[LoopInterchange] Reject inner-latch lcssa PHI feeding the exit condition (#202863)
In a multi-level nest, an lcssa PHI in the inner loop latch that feeds
the latch's exit condition can be left with a stale incoming block after
a subsequent interchange rewires the CFG, producing invalid IR. This
happened even when the outer latch had a single predecessor, where the
legality check returned early. Instead, reject the interchange when such
a PHI feeds the exit condition.
Fixes #202027
[DA] Add test for addrec can wrap in GCD MIV (NFC) (#203526)
This patch adds a test that should have been included in #186892. The
test demonstrates a case where the GCD MIV test would miss a dependency
if the presence of nsw flags were not checked.
[clang][bytecode] Add an `ExplicitThisParam` flag to `Function` (#203672)
We unfortunately have to check this for every function call, so don't
consult the decl every time here.
[NFC][MC] Initialize all fields of DebugName::Parameters in default constructor (#202701)
Initialized both variables **Flags** and **NameLength** of
**DebugNameHeader** structure.
[X86] Record the enclosed register in X86DomainReassignment::buildClosure (#202534)
buildClosure recorded the seed register Reg in the function-wide
EnclosedEdges map on every worklist iteration instead of CurReg, the
register actually being added to the closure. EnclosedEdges therefore
only ever contained the seed of each closure.
The driver loop in runOnMachineFunction skips registers already present
in EnclosedEdges before starting a new closure. Because only seeds were
recorded, every non-seed member of an already-built closure looked like
a fresh seed, so a redundant closure was built for it and then
immediately discarded by the EnclosedInstrs cross-closure check. The
emitted code is unchanged; the pass just performed redundant work
proportional to closure size.
Key EnclosedEdges by CurReg so each enclosed register is recorded once.
This was found as part of @jlebar's X86 LLVM bug hunt / FuzzX effort:
[2 lines not shown]
[clang][bytecode] Add an on-by-default `CanFail` flag to opcodes (#203671)
We have several opcodes that can't fail, so add a flag to them
indicating that they always return `true` anyway.
This simplifies the generated code from e.g.
```c++
PRESERVE_NONE
static bool Interp_Activate(InterpState &S, CodePtr &PC) {
if (!Activate(S, PC))
return false;
#if USE_TAILCALLS
MUSTTAIL return InterpNext(S, PC);
#else
return true;
#endif
}
```
[12 lines not shown]
Revert "Remove default setting signaling_nan attribute for strictfp functions"
Restore the previous behavior, where a strictfp function implicitly got
the `singaling_nans` attribute. Difficulty in explaining the behavior to
users is not an acceptable reason for changing the default behavior.
Previously, this behavior was also undocumented. Assuming
`signaling_nans` by default in `strictfp` functions is safer and
maintains compatibility.
This reverts commit 1c9601c52e8f396d024e4c3032047dce87b288b8.
[CIR][AMDGPU] Adds lowering for amdgcn extended image sample/gather4 builtins (#201761)
Support for lowering of` __builtin_amdgcn_image_sample/gather4` for
AMDGPU builtins to clangIR.
Followed similar lowering from clang->llvmir:
`clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp`.
Upstreaming clangIR PR:
[llvm/clangir#2083](https://github.com/llvm/clangir/pull/2083)