[InstCombine] Combine extract from get_active_lane_mask where all lanes inactive (#183329)
When extracting a subvector from the result of a get_active_lane_mask, return
a constant zero vector if it can be proven that all lanes will be inactive.
For example, the extract below yields a subvector in which every lane is
inactive if X and Y are constants and `Y * VScale >= X`:
`vector.extract(get.active.lane.mask(Start, X), Y)`
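A hypothetical IR instance of the fold, using fixed-width vectors for simplicity (the intrinsic names follow the standard mangling; the concrete values are illustrative, not taken from the PR):

```llvm
; Illustrative only: get.active.lane.mask(0, 4) activates lanes [0, 4),
; so an extract starting at lane 8 reads only inactive lanes and can
; be folded to zeroinitializer.
define <4 x i1> @extract_all_inactive() {
  %mask = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i64(i64 0, i64 4)
  %sub = call <4 x i1> @llvm.vector.extract.v4i1.v16i1(<16 x i1> %mask, i64 8)
  ret <4 x i1> %sub
}
```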
[AArch64] optimize vselect of bitcast (#180375)
Using code/ideas from the x86 backend to optimize a select whose mask
comes from a bitcast integer. The previous AArch64 approach extracted
the bits from the mask individually, which produced poor code.
https://rust.godbolt.org/z/576sndT66
```llvm
define void @if_then_else8(ptr %out, i8 %mask, ptr %if_true, ptr %if_false) {
start:
%t = load <8 x i32>, ptr %if_true, align 4
%f = load <8 x i32>, ptr %if_false, align 4
%m = bitcast i8 %mask to <8 x i1>
%s = select <8 x i1> %m, <8 x i32> %t, <8 x i32> %f
store <8 x i32> %s, ptr %out, align 4
ret void
}
```
[64 lines not shown]
[AArch64] Add vector expansion support for ISD::FPOW when using ArmPL (#183526)
This patch is split off from PR #183319 and teaches the backend how to
lower the FPOW DAG node to the vector math library function when using
ArmPL. This is similar to what we already do for llvm.sincos/FSINCOS
today.
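As a sketch of what this enables (assuming `-vector-library=ArmPL`; the ArmPL routine name mentioned in the comment comes from LLVM's vector-function mappings and may differ by configuration), a scalable-vector `llvm.pow` call like the following can now be lowered to a single vector math call instead of being scalarized:

```llvm
; Hypothetical input: a scalable-vector pow that previously had to be
; scalarized because FPOW had no vector expansion for ArmPL. With this
; patch it can be lowered to a routine such as armpl_svpow_f64_x.
define <vscale x 2 x double> @vec_pow(<vscale x 2 x double> %x,
                                      <vscale x 2 x double> %y) {
  %p = call <vscale x 2 x double> @llvm.pow.nxv2f64(<vscale x 2 x double> %x,
                                                    <vscale x 2 x double> %y)
  ret <vscale x 2 x double> %p
}
```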
[NFC][analyzer] Remove NodeBuilders: part I (#183354)
This commit simplifies some parts of the engine by replacing short-lived
`NodeBuilder`s with `CoreEngine::makeNode`.
Additionally, the three-argument overload of `CoreEngine::enqueue` is
renamed to `enqueueStmtNodes` to highlight that it just calls
`enqueueStmtNode` in a loop.
[WebAssembly][FastISel] Emit signed loads for sext of i8/i16/i32 (#182767)
FastISel currently defaults to unsigned loads for i8/i16/i32 types,
leaving any sign-extension to be handled by a separate instruction. This
patch optimizes this by folding the SExtInst into the LoadInst, directly
emitting a signed load (e.g., i32.load8_s).
When a load has a single SExtInst use, selectLoad emits a signed load
and removes the now-redundant SExtInst.
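A minimal IR example of the pattern being folded (illustrative; the actual test cases live in the PR):

```llvm
; The sext here is the load's only user, so selectLoad can emit
; i32.load8_s directly instead of an unsigned load followed by a
; separate sign-extension instruction.
define i32 @load_sext_i8(ptr %p) {
  %b = load i8, ptr %p, align 1
  %s = sext i8 %b to i32
  ret i32 %s
}
```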
Fixed: #180783
Revert "[VPlan] Don't drop NUW flag on tail folded canonical IVs (#183301)" (#183698)
This reverts commit b0b3e3e1c7f6387eabc2ef9ff1fea311e63a4299.
After thinking about this for a bit, I don't think this is correct:
vscale being a power of 2 only guarantees that the canonical IV
increment, if it overflows, wraps to zero; it does not rule out
overflow in general.
[AMDGPU] Multi dword spilling for unaligned tuples
When spilling an unaligned tuple, rather than breaking the whole
spill into 32-bit accesses, spill the first register as a single
32-bit spill, and spill the remainder of the tuple as an aligned
tuple.
Some additional bookkeeping is required in the spilling
loop to manage the state.
[clang][bytecode][NFC] Refactor visitDeclRef() (#183690)
Move the `!VD` case up so we can assume `VD` to be non-null earlier and
use a local variable instead of calling `D->getType()` several times.
[LV] NFCI: Move extend optimization to transformToPartialReduction. (#182860)
The reason for doing this in `transformToPartialReduction` is so that we
can create the VPExpressions directly when transforming reductions into
partial reductions (to be done in a follow-up PR).
I also intend to see if we can merge the in-loop reductions with partial
reductions, so that there will be no need for the separate
`convertToAbstractRecipes` VPlan Transform pass.