[AMDGPU] Introduce ASYNC_CNT on GFX1250 (#185810)
Async operations transfer data between global memory and LDS. Their
progress is tracked by the ASYNC_CNT counter on GFX1250 and later
architectures. This change introduces the representation of that counter
in SIInsertWaitCnts. For now, the programmer must manually insert
s_wait_asyncnt instructions. Later changes will add compiler assistance
for generating the waits by including this counter in the asyncmark
instructions.
Assisted-by: Claude Sonnet 4.5
This is part of a stack:
- #185813
- #185810
[AArch64][GlobalISel] Remove fallback for scalar usqadd/suqadd intrinsics (#187513)
Previously, GlobalISel failed to select these intrinsics when given
scalar operands, because RegBankSelect placed them on GPR banks. Fixing
this lets GlobalISel lower them correctly, since during instruction
selection the intrinsics match the SIMD patterns in AArch64InstrInfo.td.
[clang-tidy] Fix "effective" -> "efficient". (#187536)
"Effective" is the wrong word: Both overloads are effective; they do
what they're supposed to do. But the character overload does less work.
[LV] Simplify `matchExtendedReductionOperand()` (NFCI) (#185821)
This updates `matchExtendedReductionOperand` so the simple case of
`UpdateR(PrevValue, ext(...))` is matched first as an early exit. The
binop matching is then flattened to remove the extra layer of the
`MatchExtends` lambda.
Reapply "[clang][bytecode] Allocate local variables in `InterpFrame` tail storage" (#187410) (#187644)
This reverts commit bf1db77fc87ce9d2ca7744565321b09a5d23692f.
Avoid using an `InterpFrame` member after calling its destructor this
time. I hope that was the only problem.
[X86] Perform i128/i256/i512 BITREVERSE on the FPU (#187502)
Bitcast the large scalar integer to a vXi64 vector, reverse the
elements, and then perform a per-element vXi64 bitreverse.
If we have SSSE3 or later, BITREVERSE expansion using PSHUFB is always
more efficient than performing it as a scalar sequence (no need for
mayFoldIntoVector check).
Fixes #187353
Windows release build: Add checksum verification for downloaded source archives (#187113)
Add checksum verification for libxml2, zlib, and zstd source archives
via `cmake -E *sum` and `cmake -E compare_files` commands.
This also adds the following minor changes:
* Factor out libxml2 version into variable.
* Check `tar` return code.
[llc] Add -mtune option (#186998)
This patch adds a Clang-compatible -mtune option to llc, to enable
decoupled ISA and microarchitecture targeting, which is especially
important for backend development. For example, it makes it easy to
test a subtarget feature or scheduling model's effect on codegen across
a variety of workloads in the IR corpus benchmark:
https://github.com/dtcxzyw/llvm-codegen-benchmark.
The implementation adds an isolated generic codegen flag to establish a
base for wider usage; the plan is to add it to `opt` as well in a
follow-up patch. `llc` then consumes the flag and sets `tune-cpu`
attributes on functions, which are in turn consumed by the backend.
[clang][cir] Adding myself in CODEOWNERS for CIRGenBuiltinAArch64.cpp (#187570)
This is to help with #185382 and to make sure that I don't miss any PRs.
libclc: Use log intrinsic for half and float cases for amdgpu (#187538)
This is pretty verbose and ugly. We pull the base implementation in for
the double cases and scalarize it, and fully define the half and float
cases to use the intrinsic directly for all vector types. It would be
much more convenient if we had linker-based overrides for the generic
implementations, rather than per-source-file ones.
libclc: Rewrite log implementation as gentype inc file (#187537)
Follow the ordinary gentype conventions for the log implementation,
instead of using a plain header. This doesn't quite yet enable
vectorization, due to how the table is currently indexed. This should
make it easier for targets to selectively overload the function for
a subset of types.
[AArch64] Use an unknown size for memcpy ops with non-constant sizes. (#187445)
The previous size of 0 allowed loads to be moved past the MOPS
operations when that is not valid. Use a LocationSize::afterPointer()
size instead.
The GISel lowering currently loses the MMO, which is fine as it should
be conservatively treated as a load/store to any location.
libclc: Update trigpi functions (#187579)
These were originally ported from the ROCm device
libs in bc81ebefb7d9d9d71d20bfee2ce4cccb09701e9b.
Merge in more recent changes.