[X86][Clang] VectorExprEvaluator::VisitCallExpr / InterpretBuiltin - allow AVX512 mask predicate intrinsics to be used in constexpr (#165054)
Enables constexpr evaluation for the following AVX512 intrinsics:
```
_mm_movepi8_mask _mm256_movepi8_mask _mm512_movepi8_mask
_mm_movepi16_mask _mm256_movepi16_mask _mm512_movepi16_mask
_mm_movepi32_mask _mm256_movepi32_mask _mm512_movepi32_mask
_mm_movepi64_mask _mm256_movepi64_mask _mm512_movepi64_mask
```
Part of #162072
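A minimal sketch of the kind of compile-time use this enables (illustrative only; assumes an AVX512BW-enabled C++ build and that the `_mm_set1_epi8` splat is also constant-evaluable):
```c++
#include <immintrin.h>

// Each mask bit is the sign bit of the corresponding byte lane, so an
// all-ones vector yields an all-ones 16-bit mask.
constexpr __mmask16 kAllSet = _mm_movepi8_mask(_mm_set1_epi8(-1));
static_assert(kAllSet == 0xFFFF, "every lane has its sign bit set");
```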
[AArch64][SVE] Implement demanded bits for @llvm.aarch64.sve.cntp (#168714)
This allows DemandedBits to see that the SVE CNTP intrinsic will only
ever produce small positive integers. The maximum value you could get
here is 256: CNTP on an nxv16i1 predicate on a machine with a 2048-bit
vector length (the maximum for SVE).
Using this, various redundant operations (zexts, sexts, ands, ors, etc.)
can be eliminated.
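A quick sanity check of that bound (a sketch, not code from the patch):
```c++
// SVE vectors are at most 2048 bits, so vscale <= 2048 / 128 = 16 and an
// <vscale x 16 x i1> predicate has at most 16 * 16 = 256 active lanes.
// CNTP can therefore never set more than the low 9 bits of its result.
constexpr unsigned kMaxSveBits = 2048;
constexpr unsigned kMaxVscale = kMaxSveBits / 128; // 16
constexpr unsigned kMaxCntp = kMaxVscale * 16;     // 256
static_assert(kMaxCntp == 256 && kMaxCntp < (1u << 9));
```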
[MLIR][LLVM] Extend DIScopeForLLVMFuncOp to handle cross-file operatio… (#167844)
The current `DIScopeForLLVMFuncOp` pass handles debug information for
inlined code by processing `CallSiteLoc` attributes. However, some
compilation scenarios compose code from multiple source files directly
into a single function without generating `CallSiteLoc`.
**Scenario:**
```python
# a.py
def kernel_a(tensor):
    print("a: {}", tensor)  # a.py:3
    jit_func_b(tensor)      # Calls b.py code

# b.py
def func_b(tensor):
    print("b: {}", tensor)  # b.py:7
```
[18 lines not shown]
[X86] BuiltinsX86.td - attempt to pack the builtins for each SSE level close together. NFC. (#168844)
Avoid some repeated feature blocks - we should have a single place in
each file that we can find most builtins for a particular ISA level.
Also, avoid some of the 80-column wrapping that just makes it harder to
find anything at all.
There's a lot more we can do, but I don't want to completely refactor
this while we still have so much work to do for #30794.
[LV] Allow partial reductions with an extended bin op (#165536)
A pattern of the form reduce.add(ext(mul)) is valid for a partial
reduction as long as the mul and its operands fulfill the requirements
of a normal partial reduction. The mul's extend operands will be
optimised to the wider extend, and we already have oneUse checks in
place to make sure the mul and operands can be modified safely.
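As an illustration (a sketch, not taken from the PR's tests), this is roughly the shape of loop that yields such a pattern: the product is kept at i16 and only widened when added into the i32 accumulator, which after mid-end canonicalisation typically becomes reduce.add(sext(mul(sext, sext))).
```c++
#include <cstdint>

int dot_i8(const int8_t *a, const int8_t *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    // The i16 product of two extended i8 loads...
    int16_t prod = static_cast<int16_t>(a[i]) * static_cast<int16_t>(b[i]);
    // ...is extended again into the wider accumulator: reduce.add(ext(mul)).
    sum += prod;
  }
  return sum;
}
```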
1. https://github.com/llvm/llvm-project/pull/165536 (this PR)
2. https://github.com/llvm/llvm-project/pull/165543
[DebugInfo] Force early line-zero calls to have meaningful locations (#156850)
In functions that have been seriously deformed during optimisation,
there can be call instructions with line-zero immediately after frame
setup (see C reproducer in the test added). Our previous algorithms for
prologue_end ignored these, meaning someone entering a function at
prologue_end would break in after a function call had completed. Prefer
instead to place prologue_end and the function scope-line on the line-zero
call: this isn't false (it is the first meaningful instruction of the
function) and is approximately true. Given a less-than-ideal function,
this is an OK solution.
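For intuition only (the actual reproducer is the C test added by the patch), line-zero calls like this commonly arise when identical calls from different source lines are commoned and their debug locations are merged away:
```c++
int leaf(int);

int entry(bool cond, int x) {
  if (cond)
    return leaf(x); // a call on this line...
  return leaf(x);   // ...and an identical call here; if the two are hoisted
                    // into one call site, the merged location becomes line 0,
                    // possibly the first instruction after frame setup.
}
```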
[mlir][Pass] Fix crash when applying a pass to an optional interface (#168499)
Interfaces can be optional: whether an op implements an interface or not
can depend on the state of the operation.
```
// An optional code block for adding additional "classof" logic. This can
// be used to better enable "optional" interfaces, where an entity only
// implements the interface if some dynamic characteristic holds.
// `$_attr`/`$_op`/`$_type` may be used to refer to an instance of the
// interface instance being checked.
code extraClassOf = "";
```
The current `Pass::canScheduleOn(RegisteredOperationName)` hook is
insufficient here, because it only sees the registered operation name and
not the operation's dynamic state. This commit adds an additional overload
that inspects the concrete `Operation *`.
This commit fixes a crash when scheduling an `InterfacePass` for an
optional interface on an operation that does not actually implement the
interface.
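As a sketch of why the name alone is not enough (`MyOptionalOpInterface` is a hypothetical interface declared with a non-trivial `extraClassOf`), the check has to run against the operation instance:
```c++
#include "mlir/IR/Operation.h"

// Whether `op` implements an optional interface can only be answered for a
// concrete operation: isa<> runs the interface's extraClassOf logic, which
// may depend on the op's attributes or other dynamic state.
bool implementsOptionalInterface(mlir::Operation *op) {
  return mlir::isa<MyOptionalOpInterface>(op);
}
```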
[AllocToken] Enable alloc token instrumentation for size-returning functions (#168840)
Consider the newly added "malloc_span" attribute in the allocation token
instrumentation, ensuring that allocation functions carrying the
"malloc_span" attribute are processed in the same way as other memory
allocation functions.
Update the tests to demonstrate applicability to __size_returning_new.
[X86] EltsFromConsecutiveLoads - recognise reverse load patterns. (#168706)
See if we can create a vector load from the source elements in reverse
order and then shuffle them back into place.
SLP will (usually) catch this in the middle-end, but there are a few
BUILD_VECTOR scalarizations etc. that appear during DAG legalization.
I did start looking at a more general permute fold, but I haven't found
any good test examples for this yet - happy to take another look if
somebody has examples.
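An illustrative example of the pattern (not from the PR's tests): four consecutive ints inserted in reverse lane order, which can instead become a single 16-byte load followed by a lane-reversing shuffle.
```c++
#include <immintrin.h>

__m128i load_reversed(const int *p) {
  // Lane 3 gets p[0], lane 2 gets p[1], ..., lane 0 gets p[3]: the elements
  // are consecutive in memory, just placed into the vector back to front.
  return _mm_set_epi32(p[0], p[1], p[2], p[3]);
}
```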
[WebAssembly] Lower ANY_EXTEND_VECTOR_INREG (#167529)
Treat it in the same manner as zero_extend_vector_inreg and generate an
extend_low_u if possible. This is to try and prevent expensive shuffles
from being generated instead. computeKnownBitsForTargetNode has also
been updated to specify known zeros on extend_low_u.
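For reference, a sketch of the instruction being targeted, via the public wasm intrinsics (illustrative only; the change itself is in instruction selection, not user code):
```c++
#include <wasm_simd128.h>

// Widens the low 8 byte lanes, filling the upper bits of each result lane
// with zeroes; after this change an any-extend of the low half can be lowered
// to the same instruction, and known-bits can report those upper bits as zero.
v128_t widen_low_u8(v128_t v) {
  return wasm_u16x8_extend_low_u8x16(v); // i16x8.extend_low_i8x16_u
}
```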