[JITLink][CompactUnwind] Express mergeability via +ve predicate. NFCI. (#176313)
Compact unwind record merging is an optimization. Using a "can-be-merged"
predicate is preferable to a "cannot-be-merged" predicate, as the former
encourages conservatively correct implementations: "what is safe to
merge" is easier to reason about than "what is not safe to merge".
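To illustrate the reasoning, here is a minimal sketch of a positive predicate. The `UnwindRecord` struct, its fields, and the `canMerge`/`mergeRecords` helpers are hypothetical and simplified; they are not JITLink's actual types or conditions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified record; not JITLink's real data structure.
struct UnwindRecord {
  uint64_t Start;    // start address of the covered range
  uint64_t Size;     // length of the covered range in bytes
  uint32_t Encoding; // compact unwind encoding
  bool HasLSDA;      // record references language-specific data
  bool NeedsDwarf;   // encoding defers to DWARF unwind info
};

// Positive predicate: every condition required for a merge is listed here.
// Any case not explicitly shown to be safe is simply left unmerged.
inline bool canMerge(const UnwindRecord &Prev, const UnwindRecord &Next) {
  return Prev.Start + Prev.Size == Next.Start && // ranges are contiguous
         Prev.Encoding == Next.Encoding &&
         !Prev.HasLSDA && !Next.HasLSDA &&
         !Prev.NeedsDwarf && !Next.NeedsDwarf;
}

// Merge adjacent records wherever canMerge allows it.
inline std::vector<UnwindRecord>
mergeRecords(const std::vector<UnwindRecord> &In) {
  std::vector<UnwindRecord> Out;
  for (const UnwindRecord &R : In) {
    // Input is expected to be sorted by start address.
    assert(Out.empty() || Out.back().Start <= R.Start);
    if (!Out.empty() && canMerge(Out.back(), R))
      Out.back().Size += R.Size;
    else
      Out.push_back(R);
  }
  return Out;
}
```

With this shape, forgetting a condition can only cause a missed merge if the condition is forgotten in the caller; a new record property that nobody thought about defaults to "do not merge" once it is added to the predicate's conjunction.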
[LV] Prevent `extract-lane` generate unused IRs with single vector operand. (#172798)
When `extract-lane` has only a single vector operand, it can be
simplified to a plain `extractelement`.
This patch makes `extract-lane` emit a simple `extractelement` in that
case, so that no unused IR is generated.
This patch is mostly NFC, since the unused IR would be removed by later
IR passes anyway.
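A rough model of why the single-operand case is simpler, written as plain C++ rather than VPlan code. It assumes that `extract-lane` selects a lane from the concatenation of its vector operands, which is an assumption about its semantics that this message does not spell out; `extractLaneGeneral` and `extractLaneSingle` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// General form (assumed semantics): the lane indexes into the concatenation
// of all parts, so each part needs a bounds check and an index adjustment.
inline int extractLaneGeneral(std::size_t Lane, const std::vector<Vec> &Parts) {
  for (const Vec &P : Parts) {
    if (Lane < P.size())
      return P[Lane];
    Lane -= P.size();
  }
  assert(false && "lane index out of range");
  return 0;
}

// Single-operand form: no per-part bookkeeping, just one indexed read,
// which corresponds to a single extractelement.
inline int extractLaneSingle(std::size_t Lane, const Vec &V) {
  assert(Lane < V.size() && "lane index out of range");
  return V[Lane];
}
```

For a single part, both functions return the same value; the second simply avoids the intermediate comparisons and subtractions that would otherwise be left for later cleanup.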
[MLIR][NVVM][Tests] Re-enable matmul.py tests (#175728)
This patch re-enables the matmul.py tests:
* Fix gpu.wait usages
* Fix gpu.launchOp usage
* Fix format-string for gpu.printf
* Fix verification failure by removing the block[0] append.
This is now done by the python script's init.
* Fix the runtime error by adding the missing initialize() call during
JIT.
* Add the missing waitGroup(0) for _ws implementation.
This was mistakenly removed in PR #113713. Without this fix,
I see timing issues and the _ws tests with stage>1 randomly show output
mismatch.
With all these fixes, the test compiles and
executes successfully on an sm90a machine.
(locally verified for 1K iterations)
Signed-off-by: Durgadoss R <durgadossr at nvidia.com>
[RISCV] Store original LocVT/LocInfo in PendingLocs instead of XLenVT/Indirect. NFC (#176193)
Convert to XLenVT/Indirect when we use the PendingLocs. This allows the
2*XLen case to use the original LocVT and not the overridden XLenVT.
Hoping this reduces some of the changes from #176093.
[libc][math] Refactor dfmal to Header Only. (#175359)
Builds correctly with both Clang and GCC 12.2.
Since `fma` is not `constexpr`, `dfmal` cannot be declared `constexpr`
either.
Closes #175316.
[AMDGPU] Disable generic DAG combines at -O0 to preserve debuggability.
Disable generic DAG combines for AMDGPU at -O0 via disableGenericCombines()
to preserve instructions that users may want to set breakpoints
on during debugging.
Since power-of-2 division/remainder for types wider than i64 depended on
DAG combine optimizations, this patch adds shouldExpandPowerOf2DivRem()
to request IR-level expansion for these cases at -O0.
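For reference, the arithmetic identities that a power-of-2 division/remainder expansion typically relies on can be sketched as follows. This is not the AMDGPU or LLVM implementation: the helper names are hypothetical, 64-bit integers are used only for brevity (the patch concerns wider types), and the signed variants assume that `>>` on a negative value is an arithmetic shift (guaranteed since C++20, implementation-defined before).

```cpp
#include <cassert>
#include <cstdint>

// Unsigned: divide by 2^K is a logical shift, remainder is a mask.
inline uint64_t udivPow2(uint64_t X, unsigned K) {
  assert(K < 64);
  return X >> K;
}

inline uint64_t uremPow2(uint64_t X, unsigned K) {
  assert(K < 64);
  return X & ((uint64_t(1) << K) - 1);
}

// Signed: C++ division truncates toward zero, while an arithmetic shift
// rounds toward negative infinity. Adding (2^K - 1) to negative inputs
// before shifting corrects the rounding direction.
inline int64_t sdivPow2(int64_t X, unsigned K) {
  assert(K >= 1 && K < 63);
  int64_t Mask = (int64_t(1) << K) - 1;
  int64_t Bias = (X >> 63) & Mask; // Mask if X is negative, otherwise 0
  return (X + Bias) >> K;
}

// Signed remainder follows from the quotient: X - (X / 2^K) * 2^K.
inline int64_t sremPow2(int64_t X, unsigned K) {
  return X - sdivPow2(X, K) * (int64_t(1) << K);
}
```

For example, `sdivPow2(-7, 1)` yields -3 and `sremPow2(-7, 1)` yields -1, matching `-7 / 2` and `-7 % 2`.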
[NFC] Reduce fragility of swdev503538-... test.
The original test was created in PR #120815, but it depends on -O0 and
implicitly relies on DAGCombiner (which is enabled by default at -O0).
This patch reduces the fragility of the test and removes its dependency
on DAGCombiner.
[orc-rt] Make WrapperFunctionResult constructor explicit. (#176298)
The WrapperFunctionBuffer(orc_rt_WrapperFunctionBuffer) constructor
takes ownership of the underlying buffer (if one exists). Making the
constructor explicit makes this clearer at the call site.
This mirrors a similar change to the LLVM-side API in dec5d663745.
[X86][NewPM] Fix X86CodeGenPassBuilder
There were two passes in there that have not actually been ported, and
x86-seses got ported earlier today before this landed, so adding it as
well.
[ValueTypes] Add types for v256bf16 and v512bf16 (#176287)
There are v256f16 and v512f16 types for f16. This PR adds bf16 types
with the same element counts.