[flang][OpenMP] Lower target in_reduction for host fallback
Enable host-fallback lowering for target in_reduction in Flang and MLIR OpenMP translation.
Model target in_reduction through the matching map entry, force address-preserving implicit mapping for Flang in_reduction list items, and emit the host-side task-reduction lookup with __kmpc_task_reduction_get_th_data. The runtime entry point takes and returns a generic, default-address-space pointer, so normalize a non-default-address-space captured pointer to the generic address space before the call and cast the returned private pointer back to the map block argument's address space, mirroring the in_reduction handling on omp.taskloop. Unsupported device/offload-entry and richer reduction forms remain diagnosed.
Add Flang lowering, MLIR verifier/translation, and LLVM IR tests for the supported host-fallback path, including a non-default-address-space case, and the remaining unsupported cases.
[AArch64][llvm] Define APAS, BRB and TRCIT as SYS aliases (#203563)
`APAS`, `BRB IALL/INJ` and `TRCIT` use `SYS` encodings, so define them
as aliases of `SYSxt` instead of separate instructions.
Check that the preferred architectural aliases are printed when their
features are enabled and that disassembly falls back to the generic `SYS`
spelling when not enabled.
[lldb][test] Introduce build_and_run test utility (#194386)
We currently have several hundred tests that require a running process in a
given state, and therefore perform the same three tasks:
* compile a test executable
* set a breakpoint by finding a source regex
* then launch the test process to hit that breakpoint.
A large chunk of these tests do this exact same setup with various
versions of copied boilerplate code. The different versions we have all
have different conventions of naming the breakpoint comment, the main
file (and whether it should be resolved), and different generated error
messages if things go wrong.
We already have a standardized and much shorter way of doing this in
LLDB (see below), but this still encourages test writers to specify
non-standard file names and non-standard breakpoint comment names.
[15 lines not shown]
[lldb][test] Faster shut down for pexpect tests (#201171)
Our pexpect tests spend most of their time in the shutdown logic
waiting for the test child to shut down. For example, our editline
tests spend about 95% of their 40s runtime just waiting for the
pexpect child to terminate.
One of the reasons is that the ptyprocess terminate approach
uses a timeout to give the child time to shut down and be cleaned
up by the kernel. While this timeout makes sense, our timeout is
extremely long (6s) since 56fb7456950d2564d16500e40c5719c954a6987a.
Because the default ptyprocess implementation is designed for very
short timeouts (0.1s), it just sleeps and then checks the process
status. For our long timeout, the child most likely already terminated
way before the timeout on a fast system. However, because we have
some very slow builders, we cannot reduce this timeout without
making tests flaky again.
[7 lines not shown]
AMDGPU: Reland: Codegen for v_dual_dot2acc_f32_f16/bf16 from VOP3
For V_DOT2_F32_F16 and V_DOT2_F32_BF16 add their VOPDName and mark
them with usesCustomInserter which will be used to add pre-RA register
allocation hints to preferably assign dst and src2 to the same physical
register. When the hint is satisfied, canMapVOP3PToVOPD recognises the
instruction as eligible for VOPD pairing by checking if it is VOP2 like:
dst==src2, no source modifiers, no clamp, and src1 is a register.
Mark both instructions as commutable to allow a literal in src1 to be
moved to src0, since VOPD only permits a literal in src0.
The original patch had a bug: it did not check whether physical src
registers match the register class of the corresponding operand in fullVOPD
instructions. That check is now done via isValidVOPDSrc.
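The VOP2-like eligibility conditions listed above (dst==src2, no source modifiers, no clamp, src1 is a register) and the role of commutation can be illustrated with a small model. This is only a sketch: the `Operand` and `Dot2` classes and the `is_vop2_like` and `commute` helpers are hypothetical names, not the backend's API, and the register-class validation done by isValidVOPDSrc is not modelled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Operand:
    kind: str            # "reg" or "literal" (hypothetical operand model)
    value: str           # register name or literal spelling
    modifiers: int = 0   # non-zero means source modifiers (neg/abs) are set


@dataclass(frozen=True)
class Dot2:
    dst: Operand
    src0: Operand
    src1: Operand
    src2: Operand
    clamp: bool = False


def is_vop2_like(inst: Dot2) -> bool:
    """Model of the conditions in the commit message, nothing more."""
    sources = (inst.src0, inst.src1, inst.src2)
    return (
        inst.dst.kind == "reg"
        and inst.src2.kind == "reg"
        and inst.dst.value == inst.src2.value      # dst == src2
        and all(s.modifiers == 0 for s in sources)  # no source modifiers
        and not inst.clamp                          # no clamp
        and inst.src1.kind == "reg"                 # src1 is a register
    )


def commute(inst: Dot2) -> Dot2:
    """Swap src0 and src1, as a commutable instruction allows."""
    return replace(inst, src0=inst.src1, src1=inst.src0)


acc = Operand("reg", "v2")
literal_in_src1 = Dot2(dst=acc, src0=Operand("reg", "v0"),
                       src1=Operand("literal", "0x3c003c00"), src2=acc)

print(is_vop2_like(literal_in_src1))           # literal sits in src1
print(is_vop2_like(commute(literal_in_src1)))  # literal moved to src0
```

In this model the first call reports the instruction as ineligible because the literal occupies src1, and commuting it moves the literal to src0, the only slot where VOPD permits one.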
[lldb] Avoid calling dyld's versions of libc functions (#201829)
dyld ships with its own version of various libc functions that we are
not supposed to call. This patch prevents the expression evaluator from
calling them by respecting the existing list of forbidden modules.
[flang][mem2reg] promote memory slots through declares (#196975)
Leverage the new mem2reg APIs for views to remove the
"same block" limitation over fir.declare mem2reg, and to allow mem2reg
over fir.convert so that mixed dialect mem2reg with fir + memref is
possible.
Note that fir.declare_value for memory used with different value types
will be dropped (e.g. EQUIVALENCE). A later patch will improve
fir.declare_value to carry the variable type independently of the value
(like in LLVM), but more work is needed anyway to enable mem2reg with
equivalence, given that their storage is an array of bytes.
Assisted by: Claude
[MIPS] soft-promote `f16` also when using `+msa` (#203065)
Fixes https://github.com/llvm/llvm-project/issues/202808
Make use of the default soft-promote mechanism for f16, rather than an
ad-hoc approach making f16 storage-only.
In theory you could leave it at that, but I added custom implementations
to make use of the instructions for `FP16_TO_FP` and `FP_TO_FP16`, and
manually apply the "fptoui to fptosi trick" which generates shorter
code.
I don't really have a good way of testing this. The assembly changes
look reasonable but it's easy to miss something subtle of course. I've
tried to break the change up into smaller commits but it's still kind of
a lot.
[SelectionDAG] Fold subvector inserts into concat operands (#200937)
Push insert_subvector into the containing CONCAT_VECTORS operand when
the insertion is wholly contained there.
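The fold can be illustrated with a toy model in which vectors are Python lists. This is only a sketch of the idea, not the DAG combine itself: the helper names are hypothetical, and it assumes all concat operands have the same length.

```python
def concat_vectors(operands):
    """Toy CONCAT_VECTORS: join equally sized operand lists."""
    return [x for op in operands for x in op]


def insert_subvector(vec, sub, idx):
    """Toy insert_subvector: overwrite vec[idx : idx + len(sub)] with sub."""
    assert 0 <= idx and idx + len(sub) <= len(vec)
    return vec[:idx] + sub + vec[idx + len(sub):]


def fold_insert_into_concat(operands, sub, idx):
    """Push the insert into a single concat operand when it fits there.

    Returns the new operand list, or None when the inserted range
    straddles two operands and the fold does not apply.
    """
    width = len(operands[0])
    first = idx // width
    last = (idx + len(sub) - 1) // width
    if first != last:
        return None
    new_operands = list(operands)
    new_operands[first] = insert_subvector(
        operands[first], sub, idx - first * width)
    return new_operands


ops = [[0, 1, 2, 3], [4, 5, 6, 7]]
sub = [90, 91]

folded = fold_insert_into_concat(ops, sub, 4)
# Inserting after the concat and inserting into the operand agree.
assert concat_vectors(folded) == insert_subvector(concat_vectors(ops), sub, 4)

# Elements 3..4 cross the operand boundary, so no fold is possible.
assert fold_insert_into_concat(ops, sub, 3) is None
```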
AI note: an LLM generated the code and the test, I've read them
[AMDGPU] Fix lowerFCOPYSIGN dropping the sign bit when narrowing the sign operand (#203492)
TRUNCATE of the v2i32-bitcast sign kept the low 16 bits of each lane but
dropped the f32 sign bit at bit 31.
Shift right by 16 first so the sign bit lands in the f16 sign position.
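The bit arithmetic behind the fix can be checked on a single 32-bit lane. This is only an illustrative sketch with hypothetical helper names, not the lowering code: it shows that plain truncation discards bit 31, while shifting right by 16 first moves it to bit 15, the f16 sign position.

```python
import struct

F16_SIGN_MASK = 0x8000  # bit 15 is the sign bit of an f16


def f32_bits(x: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def truncate_to_16(bits32: int) -> int:
    """Keep only the low 16 bits, as a plain truncate does."""
    return bits32 & 0xFFFF


def shift_then_truncate(bits32: int) -> int:
    """Shift right by 16 first so bit 31 lands in bit 15."""
    return (bits32 >> 16) & 0xFFFF


neg = f32_bits(-1.0)  # bit pattern 0xBF800000, sign bit set

# Truncating alone loses the sign bit.
assert (truncate_to_16(neg) & F16_SIGN_MASK) == 0

# Shifting first preserves it in the f16 sign position.
assert (shift_then_truncate(neg) & F16_SIGN_MASK) == F16_SIGN_MASK
```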
---------
Co-authored-by: Jay Foad <jay.foad at gmail.com>
[LV] Add `-vplan-print-before=<pass-regex>` (#203933)
This can be helpful for debugging and for VPlan check tests (showing
before/after a specific transform).
This also adds `-vplan-print-before-all` for parity with
`-vplan-print-after-all`.
[clang][bytecode] Add more checks around _Complex values (#204076)
Check the actual source type when converting a pointer to an rvalue. We
otherwise allow converting from a two-element primitive array.
[VPlan] Recognize lshr in getSCEVExprForVPValue. (#203496)
When lshr v, const occurs and const is less than the type's bitwidth, it
can be treated as udiv v, (1 << const). This enables vectorizer to
convert more gathers into strided loads.
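The equivalence can be checked exhaustively on a small bit width. This is only a sketch using hypothetical helpers that model fixed-width unsigned arithmetic, not the VPlan code. It also shows why the constant must be below the bit width: otherwise `1 << const` does not fit in the type.

```python
def lshr(v: int, shift: int, bitwidth: int) -> int:
    """Logical shift right on a bitwidth-bit unsigned value."""
    assert 0 <= shift < bitwidth
    mask = (1 << bitwidth) - 1
    return (v & mask) >> shift


def udiv(v: int, d: int, bitwidth: int) -> int:
    """Unsigned division on bitwidth-bit values."""
    mask = (1 << bitwidth) - 1
    d &= mask
    assert d != 0, "divisor does not fit in the type"
    return (v & mask) // d


BITWIDTH = 8

# For every 8-bit value and every in-range shift amount,
# lshr v, c matches udiv v, (1 << c).
for v in range(1 << BITWIDTH):
    for c in range(BITWIDTH):
        assert lshr(v, c, BITWIDTH) == udiv(v, 1 << c, BITWIDTH)

# With c == BITWIDTH the divisor 1 << c wraps to 0 in the type,
# so the rewrite is only valid for c < BITWIDTH.
assert ((1 << BITWIDTH) & ((1 << BITWIDTH) - 1)) == 0
```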
Pre-commit test #203488
[MLGO][EmitC] Scalarize single-element tensor returns (#199686)
Add an EmitC-owned preparation pass,
`mlgo-scalarize-single-element-tensor-return`, that rewrites private
functions returning a statically-shaped ranked tensor with exactly one
element into functions returning the element type directly.
Assisted-by: Codex (refine implementation + tests). I reviewed all code
and tests before submission.
## Example
Before:
```mlir
func.func private @rank1(%arg0: tensor<1xi64>) -> tensor<1xi64> {
return %arg0 : tensor<1xi64>
}
```
[44 lines not shown]