[RISCV][P-ext] Add missing let Inst{31} = 0b0 to RVPPairShift_rr. (#199885)
This bit was accidentally left unset. I think this means we treated the
bit as a don't-care, so the disassembler could decode some invalid
encodings as these instructions. I didn't check the opcode map closely
enough to confirm this.
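The effect of an unset fixed bit on a mask/match style decoder can be sketched as follows. This is only an illustration: `Pattern`, `matches`, and the bit values are hypothetical and are not the real RVPPairShift_rr encoding or the generated RISC-V decoder tables.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoder entry: an instruction word matches when every bit
// selected by Mask equals the corresponding bit in Value.
struct Pattern {
  uint32_t Mask;
  uint32_t Value;
};

constexpr bool matches(Pattern P, uint32_t Insn) {
  return (Insn & P.Mask) == P.Value;
}

// Illustrative fixed bits only (assumed, not taken from the opcode map).
constexpr uint32_t FixedBits = 0x7E00007Fu;
constexpr uint32_t FixedValue = 0x1A00003Bu;
constexpr uint32_t Bit31 = 0x80000000u;

// Bit 31 left unspecified: the decoder treats it as a don't-care.
constexpr Pattern Loose{FixedBits, FixedValue};
// Bit 31 pinned to zero, which is what `let Inst{31} = 0b0` expresses.
constexpr Pattern Strict{FixedBits | Bit31, FixedValue};
```

With the loose pattern, a word that has bit 31 set still decodes as the instruction; the strict pattern rejects it.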
[AMDGPU] Remove explicit PartialThreshold setting in loop unrolling (#198901)
Remove UP.PartialThreshold = UP.Threshold / 4 from AMDGPU TTI, restoring
the default PartialThreshold of 150.
This was introduced in #194924 to limit code-size growth from runtime
unrolling, but PartialThreshold also gates compile-time partial
unrolling of constant-trip-count loops. This change restores the default
PartialThreshold for both compile-time and runtime partial unrolling.
Benchmarked across CK, llama.cpp, and xpu-perf — no performance impact
from restoring the default.
Fixes #196372, replaces #196818.
Assisted-by: Claude Code
[flang][OpenMP] Lower target in_reduction for host fallback
Teach Flang lowering and MLIR OpenMP translation to carry
in_reduction through omp.target for the host-fallback path.
The translation looks up task reduction-private storage with
__kmpc_task_reduction_get_th_data and binds the target region's
in_reduction block argument to that private pointer, so uses inside the
region do not keep referring to the original variable.
The patch also preserves in_reduction operands in the TargetOp builder
path and ensures target in_reduction list items are mapped into the
target region when needed.
The device/offload-entry path remains diagnosed as not yet implemented.
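For readers less familiar with the construct, here is a C++ analogue of the source pattern being lowered. It is a sketch, not the Fortran test case from the patch; the function name is made up. When compiled without OpenMP support the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>

// Sums Data[0..N) inside a target region that participates in the
// enclosing taskgroup's reduction through in_reduction.
int sumWithTargetInReduction(const int *Data, int N) {
  assert(N >= 0);
  int Sum = 0;
#pragma omp parallel
#pragma omp single
#pragma omp taskgroup task_reduction(+ : Sum)
  {
    // Inside the region, Sum refers to reduction-private storage rather
    // than the original variable; the taskgroup combines the results.
#pragma omp target in_reduction(+ : Sum) map(to : Data[0:N])
    for (int I = 0; I < N; ++I)
      Sum += Data[I];
  }
  return Sum;
}
```

On the host-fallback path described above, the region's reference to `Sum` is what gets bound to the pointer returned by `__kmpc_task_reduction_get_th_data`.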
[LLVM] Add per-target runtime directory to rpath (#199755)
Summary:
This simply adds the LLVM_DEFAULT_TARGET_TRIPLE to the LLVM build's
rpath if present. This keeps things hermetic for the library (offload)
that depends on it.
This is required because `llvm-gpu-loader` calls `DynamicLibrary` on the
Offload runtime. However, in a shared library build the actual call is
made from libLLVMSupport.so, which does not have this RPath, so `dlopen`
does not know how to find the library.
The only options to fix this are to use `dlopen` directly in the loader,
or add the rpath to the LLVM binaries.
I think this makes sense for LLVM, because the target-specific directory
can contain LLVM related libraries.
[libc][bazel] Add rules for __support/threads tests. (#199871)
* Add Bazel BUILD rules for three `__support/threads` unit tests.
* Fix/expand BUILD rules for the support libraries they depend on
(clock_gettime and vdso) that were previously incorrectly missing `.cpp`
files with implementations.
* Minor fix to use `internal::exit` in `raw_mutex_test` to avoid adding
a dependency on the `exit` entrypoint, which doesn't yet exist in Bazel.
Assisted by: Gemini
[AMDGPU] Fix SuperReg to MCRegister conversion (#199993)
This is a fix for "[AMDGPU] Implement CFI for non-kernel functions
(#183153)" f78a233ac89dc0f9f0f26dfe051874013ae6e242 to use
"SuperReg.asMCReg()" instead of "MCRegister(SuperReg)"; the latter
produces an "ambiguous call" error when using the MSVC compiler.
[flang][OpenMP] Remove ompFlagsRequireMark from symbol resolution (#198591)
The `ompFlagsRequireMark` set was there to make sure that we put the
flags from it on symbols even when no new symbols needed to be created.
Instead of doing that, we can just put the flag on the symbol every
time. There is no harm in having these flags; they are just extra
information.
[flang][OpenMP] Support in_reduction on target
Teach Flang lowering and MLIR OpenMP translation to carry
in_reduction through omp.target.
The translation looks up the task reduction-private storage with
__kmpc_task_reduction_get_th_data and binds the target region's
in_reduction block argument to that private pointer, so uses inside the
region do not keep referring to the original variable.
The patch also preserves in_reduction operands in the TargetOp builder
path and makes sure target in_reduction list items are mapped into the
target region when needed.
[InstCombine] Narrow llvm.abs through trunc. (#199643)
Update EvaluateInDifferentType / canEvaluateTruncated to narrow abs
intrinsics when the operand has at least OrigBitWidth - BitWidth + 1
sign bits. The transform always emits the narrow abs with
IsIntMinPoison=false, as the narrowed value may be INT_MIN in the narrow
type, while not in the original width.
Alive2 Proof with weaker precondition (top and truncated sign bits must
match): https://alive2.llvm.org/ce/z/AMQRmi
End-to-end C pixel math example: https://clang.godbolt.org/z/Ma8bsTGTY
PR: https://github.com/llvm/llvm-project/pull/199643
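The precondition can be checked on a small case. The sketch below models an i32 to i8 narrowing in plain C++ rather than LLVM IR, so OrigBitWidth is 32 and BitWidth is 8; all helper names are invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of leading bits equal to the sign bit, counting the sign bit itself.
int numSignBits(int32_t X) {
  uint32_t U = static_cast<uint32_t>(X);
  uint32_t Sign = U >> 31;
  int N = 1;
  for (int B = 30; B >= 0 && ((U >> B) & 1u) == Sign; --B)
    ++N;
  return N;
}

// trunc(abs(x)): take the absolute value in 32 bits, keep the low 8 bits.
uint8_t wideAbsThenTrunc(int32_t X) {
  uint32_t U = static_cast<uint32_t>(X);
  uint32_t Abs = X < 0 ? 0u - U : U;
  return static_cast<uint8_t>(Abs);
}

// abs(trunc(x)) with wrapping semantics, so the 8-bit minimum maps to itself
// (the behavior of an abs that is not poison on INT_MIN).
uint8_t truncThenNarrowAbs(int32_t X) {
  uint8_t U = static_cast<uint8_t>(static_cast<uint32_t>(X));
  bool Negative = (U & 0x80u) != 0;
  return Negative ? static_cast<uint8_t>(0u - U) : U;
}

// Checks that both orders agree whenever x has at least 32 - 8 + 1 sign bits.
bool agreeWhenPreconditionHolds() {
  for (int32_t X = -1000; X <= 1000; ++X)
    if (numSignBits(X) >= 32 - 8 + 1 &&
        wideAbsThenTrunc(X) != truncThenNarrowAbs(X))
      return false;
  return true;
}
```

The input -128 satisfies the precondition, and both orders produce the bit pattern 0x80, which is INT_MIN in the narrow type. That is the case that rules out IsIntMinPoison=true on the narrowed abs.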
[AMDGPU] Fix ShiftAmt32Imm to use unsigned comparison (#199052)
ShiftAmt32Imm used a signed 'Imm < 32' predicate, which incorrectly
matched negative immediates such as -1. The scalar fshr fast path:
def : GCNPat<(UniformTernaryFrag<fshr> i32:$src0, i32:$src1,
(i32 ShiftAmt32Imm:$src2)),
(i32 (EXTRACT_SUBREG (S_LSHR_B64 ..., $src2), sub0))>;
When fshl(scalar, X, Z) is lowered via expandFunnelShift for any
constant Z in [0, 31], the generic code converts it to fshr(..., ~Z) or
fshr(..., -Z), producing a negative shift amount. Because all such
values satisfy Imm < 32 in a signed comparison, ShiftAmt32Imm matched
and the pattern passed the negative immediate directly to S_LSHR_B64
without the S_AND_B32 masking. S_LSHR_B64 then shifted by the wrong
amount, producing an incorrect result.
Fix by changing the predicate to an unsigned comparison so that only
values in [0, 31] match, and negative values fall through to the general
[8 lines not shown]
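The difference between the two predicates, and why an unmasked negative amount gives the wrong answer, can be sketched in plain C++. The function names are hypothetical, and the model assumes the 64-bit shift consumes the low six bits of its amount.

```cpp
#include <cassert>
#include <cstdint>

// Signed predicate, as before the fix: every negative immediate is < 32.
bool matchesSigned(int64_t Imm) { return Imm < 32; }

// Unsigned predicate: only values in [0, 31] pass.
bool matchesUnsigned(int64_t Imm) { return static_cast<uint64_t>(Imm) < 32; }

// Reference fshr on i32: concatenate Hi:Lo, shift right by Amt modulo 32,
// and keep the low half.
uint32_t fshr32(uint32_t Hi, uint32_t Lo, uint32_t Amt) {
  uint64_t Concat = (static_cast<uint64_t>(Hi) << 32) | Lo;
  return static_cast<uint32_t>(Concat >> (Amt & 31u));
}

// Model of the fast path without masking to 5 bits. Assumption: the 64-bit
// shift uses the low 6 bits of the amount, so -1 becomes a shift by 63.
uint32_t fastPathUnmasked(uint32_t Hi, uint32_t Lo, int32_t Imm) {
  uint64_t Concat = (static_cast<uint64_t>(Hi) << 32) | Lo;
  return static_cast<uint32_t>(Concat >> (static_cast<uint32_t>(Imm) & 63u));
}
```

For an amount of -1, the reference result shifts by 31 while the unmasked model shifts by 63, so the two disagree.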
[SystemZ] Don't fold memops after SSA if tied regs don't match. (#197475)
When foldMemoryOperandImpl() is called during register allocation,
folding into a reg/mem opcode mustn't be done if the tied def and use
operands do not end up referencing the same register.
Fixes #197414
[Hexagon] Fix up vector predicate before compressing it for bitcast (#199283)
In a v64i1 vector predicate, each i1 is represented by 2 bits of the
predicate register. The predicate register needs to be fixed up before
we compress it.
Signed-off-by: Alexey Karyakin <akaryaki at qti.qualcomm.com>
Co-authored-by: Ikhlas Ajbar <iajbar at quicinc.com>
[AMDGPU] Refactor insertRelease into insertWriteback + insertWait (NFC) (#199486)
A release consists of two actions: write back the current cache, and
wait for "relevant" outstanding operations to complete. With the new
memory model, it is possible to disable the cache write-back using
"non-av". This patch cleanly separates the existing implementation so
that the write-backs can be selectively applied after checking for
non-av semantics.
Part of a stack:
- #199486
- #199621
- #199489
- #199622
Assisted-By: Claude Opus 4.6
---------
Co-authored-by: Pierre van Houtryve <pierre.vanhoutryve at amd.com>
[flang][OpenMP] Fix copyprivate crash with unlimited polymorphic pointer (#199768)
Lowering a copyprivate clause whose list item is an unlimited
polymorphic pointer (class(*), pointer) crashed in TypeInfo::typeScan.
The scan descends through the fir.class box and the fir.ptr, reaching a
`none` element type, which the terminal assertion did not allow.
Fixes #198770
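The failing descent can be pictured with a toy type model. Everything below is hypothetical (the `Kind` enum, `Type` struct, and function names are not Flang's data structures), and accepting `none` at the leaf is only one plausible shape for the fix.

```cpp
#include <cassert>
#include <memory>

// Toy stand-ins for fir.class, fir.ptr, a concrete element, and `none`.
enum class Kind { ClassBox, Pointer, Integer, None };

struct Type {
  Kind K;
  std::shared_ptr<Type> Element; // null for leaf types
};

// Peels box and pointer wrappers, then reports whether the terminal element
// type is one the scan accepts. AllowNone models relaxing the terminal check.
bool scanReachesAcceptedLeaf(const Type &T, bool AllowNone) {
  const Type *Cur = &T;
  while (Cur->K == Kind::ClassBox || Cur->K == Kind::Pointer) {
    assert(Cur->Element && "wrapper types must have an element type");
    Cur = Cur->Element.get();
  }
  return Cur->K == Kind::Integer || (AllowNone && Cur->K == Kind::None);
}

// Models class(*), pointer: a class box around a pointer to `none`.
Type makeUnlimitedPolymorphicPointer() {
  auto NoneTy = std::make_shared<Type>(Type{Kind::None, nullptr});
  auto PtrTy = std::make_shared<Type>(Type{Kind::Pointer, NoneTy});
  return Type{Kind::ClassBox, PtrTy};
}
```

With `AllowNone` false the scan rejects the unlimited polymorphic pointer, which corresponds to the assertion failure described above.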
[clang] fix getTemplateInstantiationArgs
This implements a new strategy for collecting the template arguments, by
relying on the qualifiers and template parameter lists to navigate the template
context of out-of-line definitions.
This greatly simplifies the signature of that function, by removing a bunch
of workarounds, and simplifying a couple that weren't removed yet.
Since this now relies on qualifiers and template parameter lists,
this patch expends most of its effort making sure these are placed,
transformed and propagated to template instantiations.
Also makes the explicit specialization AST nodes stop abusing the template
parameter lists: they now store their own template parameter list in a
dedicated field, similar to partial specializations.
[lld][ELF] Exclude SHT_NOBITS sections from LMA overlap checks (#196423)
In embedded applications it's sometimes useful to load a section at the
same virtual address as the .bss section. For example, one possible use
case is for temporary code/data that is only needed for a short time
when the program is starting up:
REGIONS {
RAM : ORIGIN = 0x100000, LENGTH = 1M
INIT : ORIGIN = 0x200000, LENGTH = 1M
}
.text { *(.text); } > RAM
.bss (NOLOAD) : { *(.bss); } > RAM
.init : AT(LOADADDR(.bss)) { *(.init); } > INIT
The .init section gets placed in the file immediately after the .text
section. At startup the .init section contents are copied to the INIT
region before zeroing .bss. Once the .init section is no longer needed
[14 lines not shown]
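The idea of excluding SHT_NOBITS sections from the load-address overlap check can be sketched as below. This is not lld's implementation: the `Section` struct, the function names, and the addresses are made up, loosely following the layout in the example script.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// ELF section type values for SHT_PROGBITS and SHT_NOBITS.
constexpr uint32_t kShtProgbits = 1;
constexpr uint32_t kShtNoBits = 8;

struct Section {
  const char *Name;
  uint32_t Type;
  uint64_t LMA;  // load address
  uint64_t Size; // size in memory
};

// Returns true if two sections occupy overlapping load-address ranges.
// With SkipNoBits set, sections that take no space in the file are ignored.
bool hasLMAOverlap(const std::vector<Section> &Secs, bool SkipNoBits) {
  for (std::size_t I = 0; I < Secs.size(); ++I) {
    for (std::size_t J = I + 1; J < Secs.size(); ++J) {
      const Section &A = Secs[I];
      const Section &B = Secs[J];
      if (SkipNoBits && (A.Type == kShtNoBits || B.Type == kShtNoBits))
        continue;
      if (A.Size == 0 || B.Size == 0)
        continue;
      if (A.LMA < B.LMA + B.Size && B.LMA < A.LMA + A.Size)
        return true;
    }
  }
  return false;
}

// Hypothetical layout: .init is loaded at the load address of .bss.
std::vector<Section> exampleLayout() {
  return {{".text", kShtProgbits, 0x100000, 0x1000},
          {".bss", kShtNoBits, 0x101000, 0x2000},
          {".init", kShtProgbits, 0x101000, 0x800}};
}
```

Without the exclusion, `.bss` and `.init` are reported as overlapping even though `.bss` contributes no bytes to the file.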
[clang] fix finding class template instantiation pattern for member specializations (#199979)
Stop treating the member which a member specialization specializes as
the pattern of the former.
Split off from https://github.com/llvm/llvm-project/pull/199528
Reapply "[clang][ssaf][NFC] Rework how the Force linker anchors are defined and used" (#194693)
This reverts commit 582958c4337f539e650096c0257a322315298e1a.
Drop "const" from these anchor variables, as is done in clang-tidy.
Turns out, MSVC likely doesn't conform with the C++ standard and makes
`const volatile` global variables have *internal* linkage - while they
should have *external* linkage.
https://eel.is/c++draft/basic.link#3.2
```
(3) The name of an entity that belongs to a namespace scope has internal linkage if it is the name of
(3.1) a variable, variable template, function, or function template that is explicitly declared static; or
(3.2) a non-template variable of non-volatile const-qualified type, unless
(3.2.1) it is declared in the purview of a module interface unit (outside the private-module-fragment, if any) or module partition, or
(3.2.2) it is explicitly declared extern, or
(3.2.3) it is inline, or
(3.2.4) it was previously declared and the prior declaration did not have internal linkage; or
[3 lines not shown]