[ExpandIRInsts] Support llvm.fpto{u,s}i.sat (#199174)
Previously, running ExpandIRInsts on a program which needs to expand a
vector fptoui.sat would hit llvm_unreachable, because the `scalarize`
function didn't handle this intrinsic.
This bug was found by a large run of Opus 4.7 looking for bugs in LLVM.
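For reference, a minimal sketch of the saturating-conversion semantics and of what lane-by-lane scalarization amounts to. This is a Python model, not the ExpandIRInsts code; the helper names (`fptoui_sat`, `fptosi_sat`, `scalarize_lanes`) are hypothetical and only mirror the LangRef behaviour (NaN becomes 0, out-of-range values clamp).

```python
import math

def fptoui_sat(x: float, bits: int) -> int:
    """Model of llvm.fptoui.sat: NaN -> 0, clamp to [0, 2**bits - 1]."""
    if math.isnan(x):
        return 0
    hi = (1 << bits) - 1
    if x <= 0.0:
        return 0
    if x >= hi:
        return hi
    return math.trunc(x)  # round toward zero

def fptosi_sat(x: float, bits: int) -> int:
    """Model of llvm.fptosi.sat: NaN -> 0, clamp to the signed range."""
    if math.isnan(x):
        return 0
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    if x <= lo:
        return lo
    if x >= hi:
        return hi
    return math.trunc(x)  # round toward zero

def scalarize_lanes(op, vec, bits):
    """Hypothetical stand-in for scalarization: apply the scalar op per lane."""
    return [op(lane, bits) for lane in vec]
```

A vector conversion then reduces to one scalar conversion per element, which is the case the pass previously could not handle.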
[flang][OpenMP] Lower target in_reduction for host fallback
Teach Flang lowering and MLIR OpenMP translation to carry
in_reduction through omp.target for the host-fallback path.
The translation looks up task reduction-private storage with
__kmpc_task_reduction_get_th_data and binds the target region's
in_reduction block argument to that private pointer, so uses inside the
region do not keep referring to the original variable.
The patch also preserves in_reduction operands in the TargetOp builder
path and ensures target in_reduction list items are mapped into the
target region when needed.
The device/offload-entry path remains diagnosed as not yet implemented.
[InstCombine] Use sadd.sat for chained ldexp fold (#199274)
The fold ldexp(ldexp(x, a), b) -> ldexp(x, a + b) didn't account for the
fact that `a + b` may overflow. Use a saturating add instead.
This bug was found by a large run of Opus 4.7 looking for bugs in LLVM.
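The overflow case can be illustrated with a small Python model. This is a sketch assuming 32-bit exponents; it only demonstrates the overflow problem and does not model the fold's other preconditions. `ldexp_model` is a hypothetical helper that maps Python's OverflowError to infinity.

```python
import math

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1

def wrap_add_i32(a: int, b: int) -> int:
    """Two's-complement i32 add that wraps on overflow."""
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s >= 2**31 else s

def sadd_sat_i32(a: int, b: int) -> int:
    """Saturating i32 add, as in llvm.sadd.sat."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

def ldexp_model(x: float, e: int) -> float:
    """ldexp that returns +/-inf on overflow instead of raising."""
    try:
        return math.ldexp(x, e)
    except OverflowError:
        return math.copysign(math.inf, x)

def chained(x, a, b):
    return ldexp_model(ldexp_model(x, a), b)

def folded_wrapping(x, a, b):
    return ldexp_model(x, wrap_add_i32(a, b))

def folded_saturating(x, a, b):
    return ldexp_model(x, sadd_sat_i32(a, b))
```

With x = 1.0 and a = b = INT32_MAX, the wrapping sum becomes a small negative exponent and the folded result is finite, while the chained form and the saturating fold both overflow to infinity.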
[X86][AvoidStoreForwardingBlocks] Skip volatile/atomic accesses. (#199698)
The pass splits an XMM/YMM load+store pair into smaller copies when a
preceding narrower store would block store-to-load forwarding into the
load, but it didn't check the MachineMemOperand's isVolatile/isAtomic
bits.
This bug was found by a large run of Opus 4.7 looking for bugs in LLVM.
[win][x64] Updated `llvm-objdump` and `llvm-readobj` to be able to dump Windows x64 Unwind v3 information. (#199120)
Public docs:
<https://learn.microsoft.com/en-us/cpp/build/x64-unwind-information-v3?view=msvc-170>
The change adds Windows x64 unwind v3 info decoding and printing support
in LLVM, including new data structures, enums, and decoding functions to
handle the different WOD opcodes and epilog descriptors. It also updates
the dumping utilities (llvm-readobj and llvm-objdump) to correctly
interpret v3 unwind info.
[RISCV][P-ext] Make the direction argument for RVPPairShift* classes required. NFC (#199799)
It's part of the encoding. I don't think we should have a preference for
one of the bit values being the default.
[RISCV][P-ext] Add missing let Inst{31} = 0b0 to RVPPairShift_rr. (#199885)
This bit was accidentally left unset. I think this means the
disassembler treated this bit as a don't-care and could disassemble
some invalid encodings to these instructions. I didn't check the opcode
map closely enough to confirm this.
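As a generic illustration of why an unset bit can matter to a mask/match style decoder, here is a Python sketch. The mask and match constants are hypothetical and are not the real RVPPairShift_rr encoding.

```python
def decodes_as(word: int, mask: int, match: int) -> bool:
    """A word matches an instruction if its fixed bits equal the match value."""
    return (word & mask) == match

# Hypothetical constants for illustration only.
MASK_WITHOUT_BIT31 = 0x7E00707F                     # bit 31 is a don't-care
MASK_WITH_BIT31 = MASK_WITHOUT_BIT31 | (1 << 31)    # bit 31 is fixed
MATCH = 0x0200201B                                  # bit 31 must be 0

valid_word = MATCH
invalid_word = MATCH | (1 << 31)                    # same bits, but bit 31 set
```

With the looser mask, `invalid_word` still decodes as the instruction; once bit 31 is part of the mask, only `valid_word` matches.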
[AMDGPU] Remove explicit PartialThreshold setting in loop unrolling (#198901)
Remove UP.PartialThreshold = UP.Threshold / 4 from AMDGPU TTI, restoring
the default PartialThreshold of 150.
This was introduced in #194924 to limit code-size growth from runtime
unrolling, but PartialThreshold also gates compile-time partial
unrolling of constant-trip-count loops. This change restores the default
PartialThreshold for both compile-time partial unrolling and runtime
partial unrolling.
Benchmarked across CK, llama.cpp, and xpu-perf: no performance impact
from restoring the default.
Fixes #196372, replaces #196818.
Assisted-by: Claude Code
[LLVM] Add per-target runtime directory to rpath (#199755)
Summary:
This simply adds the LLVM_DEFAULT_TARGET_TRIPLE to the LLVM build's
rpath if present. This keeps things hermetic for the library (offload)
that depends on it.
This is required because `llvm-gpu-loader` calls `DynamicLibrary` on the
Offload runtime. However, in a shared library build the actual `dlopen`
call is made from libLLVMSupport.so, which does not have this RPath and
so does not know how to find the runtime.
The only options to fix this are to use `dlopen` directly in the loader,
or add the rpath to the LLVM binaries.
I think this makes sense for LLVM, because the target-specific directory
can contain LLVM related libraries.
[libc][bazel] Add rules for __support/threads tests. (#199871)
* Add Bazel BUILD rules for three `__support/threads` unit tests.
* Fix/expand BUILD rules for the support libraries they depend on
(clock_gettime and vdso) that were previously incorrectly missing `.cpp`
files with implementations.
* Minor fix to use `internal::exit` in `raw_mutex_test` to avoid adding
a dependency on `exit` entrypoint, which doesn't yet exist in Bazel.
Assisted by: Gemini
[AMDGPU] Fix SuperReg to MCRegister conversion (#199993)
This is a fix for "[AMDGPU] Implement CFI for non-kernel functions
(#183153)" f78a233ac89dc0f9f0f26dfe051874013ae6e242. It uses
"SuperReg.asMCReg()" instead of "MCRegister(SuperReg)", because the
latter leads to an "ambiguous call" error when using the MSVC compiler.
[flang][OpenMP] Remove ompFlagsRequireMark from symbol resolution (#198591)
The `ompFlagsRequireMark` set was there to make sure that we put the
flags from it on symbols even when no new symbols needed to be created.
Instead of doing that, we can just put the flag on the symbol every
time. There is no harm in having these flags; it's just extra
information.
[flang][OpenMP] Support in_reduction on target
Teach Flang lowering and MLIR OpenMP translation to carry
in_reduction through omp.target.
The translation looks up the task reduction-private storage with
__kmpc_task_reduction_get_th_data and binds the target region's
in_reduction block argument to that private pointer, so uses inside the
region do not keep referring to the original variable.
The patch also preserves in_reduction operands in the TargetOp builder
path and makes sure target in_reduction list items are mapped into the
target region when needed.
[InstCombine] Narrow llvm.abs through trunc. (#199643)
Update EvaluateInDifferentType / canEvaluateTruncated to narrow abs
intrinsics when the operand has at least OrigBitWidth - BitWidth + 1
sign bits. The transform always emits the narrow abs with
IsIntMinPoison=false, as the narrowed value may be INT_MIN in the narrow
type, while not in the original width.
Alive2 Proof with weaker precondition (top and truncated sign bits must
match): https://alive2.llvm.org/ce/z/AMQRmi
End-to-end C pixel math example: https://clang.godbolt.org/z/Ma8bsTGTY
PR: https://github.com/llvm/llvm-project/pull/199643
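The sign-bit precondition can be checked exhaustively on a small case. The Python sketch below models an i32 -> i8 trunc (so the operand needs at least 32 - 8 + 1 = 25 sign bits); the helper names are hypothetical and `abs_int` models llvm.abs with IsIntMinPoison=false, where abs(INT_MIN) wraps to INT_MIN.

```python
def to_signed(v: int, bits: int) -> int:
    """Interpret the low `bits` bits of v as a two's-complement integer."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v

def abs_int(v: int, bits: int) -> int:
    """abs at a fixed width; abs(INT_MIN) wraps back to INT_MIN."""
    return to_signed(abs(to_signed(v, bits)), bits)

def num_sign_bits(v: int, bits: int) -> int:
    """Count of leading bits equal to the sign bit (including the sign bit)."""
    v = to_signed(v, bits)
    sign = 1 if v < 0 else 0
    n = 0
    for i in range(bits - 1, -1, -1):
        if (v >> i) & 1 != sign:
            break
        n += 1
    return n

def trunc_of_wide_abs(x: int) -> int:
    """trunc i32 -> i8 of abs computed at 32 bits."""
    return to_signed(abs_int(x, 32), 8)

def narrow_abs_of_trunc(x: int) -> int:
    """abs computed at 8 bits on the truncated operand."""
    return abs_int(to_signed(x, 8), 8)
```

For every x with at least 25 sign bits the two forms agree, including x = -128, where the narrow value is INT_MIN in i8. A value such as 200 has only 24 sign bits and the two forms differ.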
[AMDGPU] Fix ShiftAmt32Imm to use unsigned comparison (#199052)
ShiftAmt32Imm used a signed 'Imm < 32' predicate, which incorrectly
matched negative immediates such as -1. The scalar fshr fast path:
def : GCNPat<(UniformTernaryFrag<fshr> i32:$src0, i32:$src1,
(i32 ShiftAmt32Imm:$src2)),
(i32 (EXTRACT_SUBREG (S_LSHR_B64 ..., $src2), sub0))>;
When fshl(scalar, X, Z) is lowered via expandFunnelShift for any
constant Z in [0, 31], the generic code converts it to fshr(..., ~Z) or
fshr(..., -Z), producing a negative shift amount. Because all such
values satisfy Imm < 32 in a signed comparison, ShiftAmt32Imm matched
and the pattern passed the negative immediate directly to S_LSHR_B64
without the S_AND_B32 masking. S_LSHR_B64 then shifted by the wrong
amount, producing an incorrect result.
Fix by changing the predicate to an unsigned comparison so that only
values in [0, 31] match, and negative values fall through to the general
[8 lines not shown]
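A small Python model of the mismatch described above. It assumes S_LSHR_B64 uses the low 6 bits of its shift amount, while a 32-bit fshr takes the amount modulo 32; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def matches_signed(imm: int) -> bool:
    """Old predicate: signed comparison, so negative immediates match."""
    return imm < 32

def matches_unsigned(imm: int) -> bool:
    """Fixed predicate: compare the i32 bit pattern as unsigned."""
    return (imm & MASK32) < 32

def fshr32(hi: int, lo: int, amt: int) -> int:
    """Reference funnel shift right: shift amount is taken modulo 32."""
    concat = (hi << 32) | lo
    return (concat >> (amt % 32)) & MASK32

def fast_path(hi: int, lo: int, amt: int) -> int:
    """64-bit logical shift right without masking to 5 bits, then take sub0.
    Assumption: the hardware shift uses the low 6 bits of the amount."""
    concat = (hi << 32) | lo
    return (concat >> (amt & 63)) & MASK32
```

For amounts in [0, 31] the two agree, so the fast path is valid there. For -1 the reference shifts by 31 but the unmasked 64-bit shift uses 63, which is why the predicate must reject negative immediates.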
[SystemZ] Don't fold memops after SSA if tied regs don't match. (#197475)
When foldMemoryOperandImpl() is called during register allocation,
folding into a reg/mem opcode mustn't be done if the tied def and use
operands do not end up referencing the same register.
Fixes #197414
[Hexagon] Fix up vector predicate before compressing it for bitcast (#199283)
In a v64i1 vector predicate, each i1 is represented by 2 bits of the
predicate register. The predicate register needs to be fixed up before
we compress it.
Signed-off-by: Alexey Karyakin <akaryaki at qti.qualcomm.com>
Co-authored-by: Ikhlas Ajbar <iajbar at quicinc.com>
[AMDGPU] Refactor insertRelease into insertWriteback + insertWait (NFC) (#199486)
A release consists of two actions: write back the current cache, and
wait for "relevant" outstanding operations to complete. With the new
memory model, it is possible to disable the cache write-back using
"non-av". This patch cleanly separates the existing implementation so
that the write-backs can be selectively applied after checking for
non-av semantics.
Part of a stack:
- #199486
- #199621
- #199489
- #199622
Assisted-By: Claude Opus 4.6
---------
Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve at amd.com>
[flang][OpenMP] Fix copyprivate crash with unlimited polymorphic pointer (#199768)
Lowering a copyprivate clause whose list item is an unlimited
polymorphic pointer (class(*), pointer) crashed in TypeInfo::typeScan.
The scan descends through the fir.class box and the fir.ptr, reaching a
`none` element type, which the terminal assertion did not allow.
Fixes #198770