[AMDGPU] Report only local per-function resource usage when object linking is enabled (#192594)
With object linking the linker aggregates resource usage across TUs, so
compile-time pessimism and call-graph propagation duplicate the linker's
work or pollute its inputs.
In this mode, skip the per-callsite conservative bumps in
`AMDGPUResourceUsageAnalysis` and assign each resource symbol in
`AMDGPUMCResourceInfo` a concrete local constant instead of building
call-graph `max`/`or` expressions.
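A minimal sketch of the two reporting modes described above, assuming a toy call graph where each function has a plain integer VGPR count. The names `FuncInfo`, `CallGraph`, `aggregatedVGPRs` and `reportedVGPRs` are hypothetical. The real pass builds MC symbol expressions rather than computing integers.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of one function's resource usage.
struct FuncInfo {
  int LocalVGPRs = 0;               // registers used by this function's own body
  std::vector<std::string> Callees; // direct callees
};

using CallGraph = std::map<std::string, FuncInfo>;

// Without object linking: report the worst case over everything reachable
// from F (max of local usage over the call graph). Seen guards against cycles.
int aggregatedVGPRs(const CallGraph &CG, const std::string &F) {
  std::set<std::string> Seen;
  std::vector<std::string> Worklist{F};
  int Max = 0;
  while (!Worklist.empty()) {
    std::string Cur = Worklist.back();
    Worklist.pop_back();
    if (!Seen.insert(Cur).second)
      continue;
    const FuncInfo &Info = CG.at(Cur);
    Max = std::max(Max, Info.LocalVGPRs);
    for (const std::string &Callee : Info.Callees)
      Worklist.push_back(Callee);
  }
  return Max;
}

// With object linking: report only the local value and leave the call-graph
// aggregation to the linker, which sees every TU.
int reportedVGPRs(const CallGraph &CG, const std::string &F, bool ObjectLinking) {
  return ObjectLinking ? CG.at(F).LocalVGPRs : aggregatedVGPRs(CG, F);
}
```

For a kernel using 10 VGPRs that calls a helper using 32, the non-linking mode reports 32 for the kernel, while the object-linking mode reports 10 and relies on the linker to take the maximum.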
[CodeGen] Fix non-determinism in MachineBlockHashInfo (#192826)
The previous implementation used `hash_value(MachineOperand)`, which
is not guaranteed to be stable across different executions because it
hashes pointers for certain operand types (like MBB, GlobalAddress,
etc.).
Use the existing `stableHashValue`, which does not have this problem.
The rest of the file should remain the same, but this change may break
profile compatibility.
Changing the behavior for operands is not an issue, as the existing
hash is effectively a low-quality RNG.
The code does not have test coverage yet; this will be fixed in #192911.
Fixes #173933.
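The difference between pointer-based and content-based hashing can be sketched as follows, assuming a toy `Operand` that carries both an address and a stable symbolic name. `Operand`, `unstableHash` and `stableHash` are hypothetical, and FNV-1a is used only as a simple deterministic hash. This is not how LLVM's `stableHashValue` is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1a 64-bit: a simple hash that depends only on the input bytes.
uint64_t fnv1a(const std::string &Bytes, uint64_t H = 14695981039346656037ULL) {
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 1099511628211ULL;
  }
  return H;
}

// Hypothetical operand model.
struct Operand {
  const void *Target; // e.g. a basic block or global: its address varies per run
  std::string Name;   // a symbolic identity that is the same in every run
};

// Unstable: the result depends on where the object happens to live in memory
// (ASLR, allocation order), so it can differ between executions.
uint64_t unstableHash(const Operand &Op) {
  auto Addr = reinterpret_cast<std::uintptr_t>(Op.Target);
  return fnv1a(std::to_string(Addr));
}

// Stable: the result depends only on properties that are identical across runs.
uint64_t stableHash(const Operand &Op) { return fnv1a(Op.Name); }
```

Two operands that refer to the same symbol through objects at different addresses hash identically with `stableHash`, which is the property a block hash used for profiles needs.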
AMDGPU/GlobalISel: RegbankLegalize rules for merge-like opcodes
Move RegbankLegalize handling for G_BUILD_VECTOR, G_MERGE_VALUES and
G_CONCAT_VECTORS from AMDGPURegBankLegalize to AMDGPURegBankLegalizeRules
by implementing rules for all supported types.
AMDGPU/GlobalISel: RegbankLegalize rules for G_BITCAST
Move RegbankLegalize handling for G_BITCAST from AMDGPURegBankLegalize to
AMDGPURegBankLegalizeRules by implementing rules for all supported types.
AMDGPU/GlobalISel: RegbankLegalize rules for undef and constants
Move RegbankLegalize handling for G_IMPLICIT_DEF, G_CONSTANT and G_FCONSTANT
from AMDGPURegBankLegalize to AMDGPURegBankLegalizeRules by implementing
rules for all supported types.
[Clang][AMDGPU] Deprecate `amdgpu-num-vgpr` and `amdgpu-num-sgpr`
For now, we only emit a deprecation warning. The attributes still take effect
in regular compilation, but are ignored when object linking is enabled.
[CIR] Implement emitNewArrayInit for constant and strings (#192666)
This patch further fleshes out emitNewArrayInit for constant and
string variables. Implementation-wise, this is much the same as
classic codegen, though a few differences were required. First, our use
of cir.copy instead of a memcpy call means we had to 'lift' a
dyn_allocated pointer type to the array type. Second, we had to make
sure that an 'empty' extra init is skipped in a place where we
previously did not skip it.
In order to test this, I found 2 tests from classic codegen that I
pulled in nearly verbatim. The CHECK lines from paren-list-agg-init.cpp
are converted to LLVM lines with slight relaxation, mostly to account
for cases where CIR lowering introduces extra branches or GEPs during
conversion.
new-array-init.cpp's CHECK lines were particularly sparse and lacking in
detail, so I wrote new ones.
One test was commented out, as it requires the rest of emitNewArrayInit
to be implemented.
[OpenACC] Make sure array-section diag normalizes width for diag (#193013)
The issue below exposed that the comparison of the values did not
properly adjust the width of the constant values before comparing them,
so adding the two values triggered an assertion in APSInt.
This patch makes sure that the 'adjust width + sign' branch is taken if
either the sign or the width does not match. Previously we only took it
on a sign mismatch.
Fixes: #192783
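A minimal sketch of the corrected normalization condition, assuming a hypothetical `FixedInt` stand-in for `llvm::APSInt` in which width and signedness are metadata and overflow is ignored. The choice of widening by one bit on a sign mismatch is this sketch's own assumption, not necessarily what Clang does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::APSInt. Value holds the mathematical value;
// Width and IsUnsigned are metadata only (overflow is ignored in this sketch).
struct FixedInt {
  int64_t Value;
  unsigned Width;
  bool IsUnsigned;
};

// Mirrors APSInt's requirement that both operands agree on width and sign.
FixedInt add(const FixedInt &A, const FixedInt &B) {
  assert(A.Width == B.Width && A.IsUnsigned == B.IsUnsigned &&
         "operands must be normalized first");
  return {A.Value + B.Value, A.Width, A.IsUnsigned};
}

// Normalize when either the width or the sign differs. Checking only the
// sign would leave, e.g., a 32-bit and a 64-bit signed operand mismatched.
void normalize(FixedInt &A, FixedInt &B) {
  if (A.Width == B.Width && A.IsUnsigned == B.IsUnsigned)
    return;
  unsigned Width = std::max(A.Width, B.Width);
  bool SignMismatch = A.IsUnsigned != B.IsUnsigned;
  if (SignMismatch)
    ++Width; // one extra bit so the unsigned operand's range fits when signed
  A.Width = B.Width = Width;
  if (SignMismatch)
    A.IsUnsigned = B.IsUnsigned = false;
}
```

With the old sign-only condition, two signed operands of widths 32 and 64 would skip normalization and reach `add` mismatched, which is the assertion the fix avoids.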
[CIR] Fix dynamic cast of const types (#192751)
When a dynamic cast was performed using const-qualified values, we were
generating a reference to const-qualified typeinfo but never emitting
such const-qualified typeinfo, leading to an undefined reference at link
time.
This change fixes that by stripping the type qualifiers before
processing the cast. This matches the behavior of classic codegen in
ItaniumCXXABI::emitDynamicCastCall.
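At the language level, `typeid` already ignores top-level cv-qualifiers, which is why stripping them before looking up type information is safe. The sketch below shows that rule and a const-qualified `dynamic_cast` of the kind described above. `Base`, `Derived`, `typeInfoFor` and `asDerived` are hypothetical names; the sketch does not reproduce CIR's codegen.

```cpp
#include <cassert>
#include <type_traits>
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Derived : Base {};

// typeid ignores top-level cv-qualifiers: typeid(const Derived) refers to the
// type_info object for Derived. This hypothetical helper strips qualifiers
// explicitly, analogous to what the codegen fix does before the cast.
template <typename T>
const std::type_info &typeInfoFor() {
  return typeid(std::remove_cv_t<T>);
}

// A dynamic_cast between const-qualified pointers.
const Derived *asDerived(const Base *B) {
  return dynamic_cast<const Derived *>(B);
}
```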
[HLSL][DirectX] Emit convergence control tokens when targeting DirectX (#188792)
This PR allows codegen to generate convergence control tokens, which
give a more accurate description of convergence behaviour, preventing
invalid control flow graph transforms (and allowing valid ones).
Convergence control tokens are the preferred representation, and this PR
follows that by enabling them for `DirectX`.
This is done now to prevent a convergent exit condition of a loop from
being illegally moved across control flow. Test cases for this are
explicitly added.
Please see the individual commits for logically similar chunks.
Unfortunately, it is tricky to stage this in smaller individual commits.
Resolves https://github.com/llvm/llvm-project/issues/180621.
https://github.com/llvm/llvm-project/pull/188537 is a prerequisite for
this to pass the HLSL offload suite tests.
Assisted by: GitHub Copilot
[AMDGPU] Report only local per-function resource usage when object linking is enabled
With object linking the linker aggregates resource usage across TUs, so
compile-time pessimism and call-graph propagation duplicate the linker's work or
pollute its inputs.
In this mode, skip the per-callsite conservative bumps in
`AMDGPUResourceUsageAnalysis` and assign each resource symbol in
`AMDGPUMCResourceInfo` a concrete local constant instead of building call-graph
`max`/`or` expressions.
[AArch64][clang][llvm] Add ACLE Armv9.7 MMLA intrinsics
Implement new ACLE matrix multiply-accumulate intrinsics for Armv9.7:
```c
// 16-bit floating-point matrix multiply-accumulate.
// Only if __ARM_FEATURE_SVE_B16MM
// Variant also available for _f16 if (__ARM_FEATURE_SVE2p2 && __ARM_FEATURE_F16MM).
svbfloat16_t svmmla[_bf16](svbfloat16_t zda, svbfloat16_t zn, svbfloat16_t zm);
// Half-precision matrix multiply accumulating to single-precision
// instruction from Armv9.7-A. Requires the +f16f32mm architecture extension.
float32x4_t vmmlaq_f32_f16(float32x4_t r, float16x8_t a, float16x8_t b);
// Non-widening half-precision matrix multiply instruction. Requires the
// +f16mm architecture extension.
float16x8_t vmmlaq_f16_f16(float16x8_t r, float16x8_t a, float16x8_t b);
```
[GlobalISel] Fix -Wunused-variable (#193009)
These variables are only used in assertions and set outside of the
variable definition, so mark them [[maybe_unused]].
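A minimal sketch of the pattern this fix addresses, assuming a hypothetical function `sumNonEmpty`. When `NDEBUG` is defined, `assert()` expands to nothing, so a variable that is only read inside an assertion can trigger an unused-variable warning. `[[maybe_unused]]` (C++17) suppresses it without changing behavior.

```cpp
#include <cassert>
#include <vector>

int sumNonEmpty(const std::vector<int> &Values) {
  [[maybe_unused]] bool Checked; // declared here...
  Checked = !Values.empty();     // ...and set outside the definition
  // Only read inside assert(), which disappears in NDEBUG builds.
  assert(Checked && "expected at least one value");
  int Sum = 0;
  for (int V : Values)
    Sum += V;
  return Sum;
}
```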
[VectorCombine] Fix transitive Uses in foldShuffleToIdentity (#188989)
The Uses in foldShuffleToIdentity are intended to detect where an operand
is used, to distinguish between splats, identities and concats of the
same value. When looking through multiple unsimplified shuffles, however,
the same Use could be both a splat and an identity. This patch changes
each Use to a Value and an original Use, so that even when we are looking
through multiple vectors we correctly recognise whether each use is a
splat, an identity or a concat.
Fixes #180338
(cherry picked from commit fd40c606652137706bc336ef80ed1814ab3d3680)
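The keying change can be sketched with a toy classifier, assuming that values and original uses are identified by a string and an integer respectively. `UseKind`, `UseClassifier` and `record` are hypothetical and not the VectorCombine data structures. With a key on the original use alone, the first two `record` calls in the usage below would collide. With the (value, original use) key they coexist, and only a genuinely conflicting entry is rejected.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

enum class UseKind { Splat, Identity, Concat };

// Hypothetical stand-ins: an id for the original Use and a name for the Value
// reached after looking through shuffles.
using OrigUseId = int;
using ValueName = std::string;

struct UseClassifier {
  // Keyed by (value, original use) rather than by the original use alone.
  std::map<std::pair<ValueName, OrigUseId>, UseKind> Kinds;

  // Records K for (V, U); returns false if that exact pair already has a
  // different classification.
  bool record(const ValueName &V, OrigUseId U, UseKind K) {
    auto Result = Kinds.try_emplace(std::make_pair(V, U), K);
    return Result.second || Result.first->second == K;
  }
};
```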
[NFC][test] Precommit test for pr188989 (#188667)
Precommit test for #188989.
This test case covers a scenario in the vector combine
foldShuffleToIdentity function where incorrect folding was caused when
different shuffle sequences shared the same initial Use *. This issue
may be due to cost model differences and currently reproduces only on
LoongArch for this test case.
(cherry picked from commit 3e015b89e8bd9c71f6bb1cf38747d2862f5d5a3d)
[X86] Fix missing ByValTemporaries update in CopyViaTemp path for musttail calls (#190540)
This fixes a miscompilation in musttail calls with byval arguments on
X86.
In the CopyViaTemp path, a temporary stack object is created and the
argument is copied into it.
However, the temporary is not recorded in ByValTemporaries,
so the final lowering phase does not emit the copy to the real outgoing
argument slot.
As a result, the callee may read incorrect values from the stack.
Fix this by recording the temporary in ByValTemporaries so that the
final lowering step correctly copies the argument to the expected stack
location.
Reproducer: https://github.com/llvm/llvm-project/issues/190429
(cherry picked from commit abd502a44e5ef19a302d943eeb017c29124b96e9)
[libsycl] Fix _LIBSYCL_EXPORT placement (#192243)
Current placement of _LIBSYCL_EXPORT in usm_functions.hpp causes
compilation errors on Windows and is not aligned with other header
files.
[PatternMatchHelpers] Factor deferred and bind matchers (NFC) (#191373)
Factor bind_ty and deferredval_ty as match_bind and match_deferred from
existing PatternMatch implementations into PatternMatchHelpers.
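The semantics of bind and deferred matchers can be sketched with two small templates. `Value`, `BindMatcher`, `DeferredMatcher` and `bothOperandsSame` are hypothetical simplifications in the spirit of `bind_ty`/`deferredval_ty`, not the real `match_bind`/`match_deferred` templates. The key point is that the deferred matcher holds a reference to the binder's slot, so it compares against whatever was bound earlier in the same match.

```cpp
#include <cassert>

// Toy IR value.
struct Value { int Id; };

// BindMatcher stores whatever it matches into a caller-owned slot.
template <typename T> struct BindMatcher {
  const T *&Slot;
  bool match(const T *V) const { Slot = V; return true; }
};

// DeferredMatcher refers to that same slot and compares against its contents
// at match time, i.e. after an earlier BindMatcher has filled it.
template <typename T> struct DeferredMatcher {
  const T *const &Slot;
  bool match(const T *V) const { return V == Slot; }
};

// Matches an operand pair (X, X): the second operand must be the value bound
// from the first.
bool bothOperandsSame(const Value *LHS, const Value *RHS) {
  const Value *X = nullptr;
  BindMatcher<Value> Bind{X};
  DeferredMatcher<Value> Deferred{X};
  return Bind.match(LHS) && Deferred.match(RHS);
}
```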
[lldb][docs] Add conference talks to the links page (#192724)
At EuroLLVM, I mentioned a previous LLDB talk and realized that such talks
would be a lot more discoverable if we linked them from the website.
[LV][NFC] Rename PreferPredicateOverEpilogue to TailFoldingPolicy (#191803)
Rename the -prefer-predicate-over-epilogue flag and its associated
enum values to use 'TailFold' terminology instead of 'Predicate'. The
term 'Predicate' is overloaded in the vectorizer context and would
cause further confusion as more tail-folding styles are added.
[VPlan] CSE ScalarIVSteps recipes (#191307)
Extend getOpCodeOrIntrinsicID to return a pseudo opcode for
ScalarIVSteps so that these recipes can be CSE'd, with the equivalence
check extended to also compare the InductionOpcode.
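A minimal sketch of CSE keyed on (opcode, induction opcode, operands), assuming a hypothetical pseudo opcode value chosen outside the range of real opcodes. `RecipeDesc`, `CSEKey`, `findOrInsert` and `ScalarIVStepsPseudoOpcode` are illustrative names only, not VPlan APIs. Two ScalarIVSteps recipes with identical operands but different induction opcodes must not be merged, which is why the key includes that field.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical pseudo opcode for ScalarIVSteps, outside the range of real
// opcodes so it cannot collide with them.
constexpr unsigned ScalarIVStepsPseudoOpcode = ~0u;

struct RecipeDesc {
  unsigned Opcode;           // real opcode, or the pseudo opcode above
  unsigned InductionOpcode;  // e.g. integer add vs. floating-point add steps
  std::vector<int> Operands; // ids of the operand values
};

using CSEKey = std::tuple<unsigned, unsigned, std::vector<int>>;

// Returns the id of an equivalent recipe seen earlier, or Id if R is new.
int findOrInsert(std::map<CSEKey, int> &Seen, const RecipeDesc &R, int Id) {
  CSEKey Key{R.Opcode, R.InductionOpcode, R.Operands};
  return Seen.try_emplace(Key, Id).first->second;
}
```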
[lldb] Fix crash when evaluating expressions in Wasm targets (#192893)
LLDB crashes with "LLVM ERROR: Incompatible object format!" when
evaluating expressions while debugging WebAssembly because ProcessWasm
never disables JIT. RuntimeDyld only supports ELF, MachO, and COFF
object formats, so attempting to JIT-compile an expression for a Wasm
target produces the aforementioned fatal error.
This PR avoids the crash by calling `SetCanJIT(false)` in the
`ProcessWasm` ctor. Simple expressions will still work via the IR
interpreter, while expressions requiring the JIT now produce a proper error
message instead of crashing.
Fixes #179915
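The resulting control flow can be modelled in a few lines, assuming a toy process whose only relevant state is a can-JIT flag. `ToyExpr`, `ToyProcess` and `evaluate` are hypothetical; `SetCanJIT` mirrors the name of the LLDB method mentioned above but this is not LLDB's implementation. Expressions the interpreter can handle still succeed, and the rest report an error instead of reaching an unsupported JIT path.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of an expression: either the IR interpreter can handle
// it, or it needs to be JIT-compiled.
struct ToyExpr {
  bool NeedsJIT;
};

// Hypothetical process model; only the can-JIT flag matters here.
struct ToyProcess {
  bool CanJIT = true;
  void SetCanJIT(bool Value) { CanJIT = Value; }
};

// Interpreter first; JIT only when needed and allowed; otherwise report an
// error instead of attempting an unsupported JIT compilation.
std::string evaluate(const ToyProcess &P, const ToyExpr &E) {
  if (!E.NeedsJIT)
    return "interpreted";
  if (!P.CanJIT)
    return "error: expression needs JIT, which this target does not support";
  return "jitted";
}
```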
[clang] implement CWG2064: ignore value dependence for decltype
The type of a `decltype` applied to a value-dependent (but not type-dependent)
expression should be known, so this patch makes such types non-opaque.
This patch also implements what is necessary to allow overloading
on pure differences in instantiation dependence, making `std::void_t`
usable for SFINAE purposes.
This also re-adds a few test cases from da98651, which was a previous attempt
at resolving CWG2064.
Fixes #8740
Fixes #61818
Fixes #190388
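For context, the sketch below shows the long-standing class-template form of the `std::void_t` detection idiom, plus a `decltype` over an expression that is value-dependent but not type-dependent. `HasSize` and `Plus1` are hypothetical names. The sketch does not reproduce the function-overloading cases the patch newly enables; it only illustrates the concepts involved.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// std::void_t detection idiom: the partial specialization is viable only if
// T has a callable .size(); otherwise substitution fails (SFINAE) and the
// primary template (false) is used.
template <typename T, typename = void>
struct HasSize : std::false_type {};

template <typename T>
struct HasSize<T, std::void_t<decltype(std::declval<T &>().size())>>
    : std::true_type {};

// N is value-dependent but not type-dependent: whatever N is, N + 1 has
// type int.
template <int N>
struct Plus1 {
  using Type = decltype(N + 1);
  static constexpr Type value = N + 1;
};
```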
[SPIR-V] Handle [N x i8] byte addressing in SPIRVEmitIntrinsics
LLVM started generating [N x i8] types for array-indexing GEPs.
SPIRVEmitIntrinsics did not know how to handle them, so it generated a
cast to [N x i8] to perform the GEP. This does not work in logical
addressing.
To handle this, we extend the `i8` GEP handling for logical addressing
mode to work for arbitrary-size byte addressing.
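The arithmetic behind this translation can be sketched as follows, assuming the byte offset must land on a whole element of the underlying array type to be expressible without a pointer cast. `byteGEPToElementIndex` is a hypothetical helper, overflow of the multiplication is ignored, and the actual pass may handle more cases (for example, struct members).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// A GEP over [ByteStride x i8] with index GEPIndex advances by
// GEPIndex * ByteStride bytes. In logical addressing this must be expressed
// as an element index of the real element type (ElemSize bytes each).
std::optional<uint64_t> byteGEPToElementIndex(uint64_t GEPIndex,
                                              uint64_t ByteStride,
                                              uint64_t ElemSize) {
  assert(ElemSize != 0 && "element size must be non-zero");
  uint64_t ByteOffset = GEPIndex * ByteStride;
  if (ByteOffset % ElemSize != 0)
    return std::nullopt; // not on an element boundary: no logical equivalent
  return ByteOffset / ElemSize;
}
```

For example, index 3 into `[8 x i8]` over an array of 4-byte elements corresponds to element 6, while a 2-byte offset into 4-byte elements has no logical-addressing equivalent.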