[AMDGPU] Pack overflow inreg args into VGPR lanes
When inreg function arguments overflow the available SGPRs, pack multiple values
into lanes of a single VGPR using writelane/readlane instead of consuming one
VGPR per overflow argument.
The feature is behind a flag (default off) and currently only supports the
SelectionDAG path.
Known issue: if the register allocator does not coalesce the COPY between the
writelane chain and the physical call argument register, the resulting v_mov_b32
is EXEC-dependent and will not transfer inactive lanes. This is correct when
EXEC is all-ones (the common case at call sites) but would be incorrect inside
divergent control flow.
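For illustration, a minimal IR-level sketch of the packing scheme using the
existing llvm.amdgcn.writelane/readlane intrinsics (the patch itself emits the
equivalent nodes during SelectionDAG lowering; `B`, `OverflowArgs`, and
`NumOverflowArgs` are illustrative names):
```cpp
// Caller side: pack each overflow inreg argument into its own lane of a
// single VGPR instead of using one VGPR per argument.
Value *Packed = PoisonValue::get(B.getInt32Ty());
for (unsigned Lane = 0; Lane != NumOverflowArgs; ++Lane)
  Packed = B.CreateIntrinsic(Intrinsic::amdgcn_writelane, {B.getInt32Ty()},
                             {OverflowArgs[Lane], B.getInt32(Lane), Packed});
// Callee side: recover argument N from lane N.
Value *Arg0 = B.CreateIntrinsic(Intrinsic::amdgcn_readlane, {B.getInt32Ty()},
                                {Packed, B.getInt32(0)});
```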
[ValueTracking] Add CharWidth argument to getConstantStringInfo (NFC)
The method assumes that host chars and target chars have the same width.
Add a CharWidth argument so that it can bail out if the requested char
width differs from the host char width.
Alternatively, the check could be done at call sites, but this is more
error-prone.
In the future, this method will be replaced with a different one that
allows host/target chars to have different widths. The prototype will
be the same except that StringRef is replaced with something that is
byte width agnostic. Adding CharWidth argument now reduces the future
diff.
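A call site under the new interface might look like this (the parameter
position of CharWidth is a guess, and `DL.getByteWidth()` stands in for
however the caller obtains the target char width):
```cpp
StringRef Str;
// Returns false (bails out) when CharWidth differs from the host char width.
if (!getConstantStringInfo(V, Str, /*TrimAtNul=*/true,
                           /*CharWidth=*/DL.getByteWidth()))
  return nullptr;
```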
[SimplifyLibCalls] Add initial support for non-8-bit bytes
The patch makes the CharWidth argument of `getStringLength` mandatory
and ensures the correct values are passed in most cases.
This is *not* complete support for unusual byte widths in
SimplifyLibCalls, since `getConstantStringInfo` returns false for those.
The code guarded by `getConstantStringInfo` returning true is unchanged
because the changes are currently not testable.
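A sketch of what affected call sites look like once the argument is mandatory
(the byte-width accessor is assumed from the DataLayout patch below):
```cpp
// Callers must now spell out the char width; no default of 8 is implied.
uint64_t Len = getStringLength(Src, /*CharWidth=*/DL.getByteWidth());
if (!Len)
  return nullptr; // length unknown
```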
[IR] Account for byte width in m_PtrAdd
The method has few uses yet, so just pass a DL argument to it. The change
follows m_PtrToIntSameSize, and I don't see a better way of delivering
the byte width to the method.
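A sketch of the resulting usage, mirroring how m_PtrToIntSameSize already
takes DL as its first argument:
```cpp
Value *Base, *Off;
// With b:16 in the data layout this matches a GEP over i16 instead of i8.
if (match(V, m_PtrAdd(DL, m_Value(Base), m_Value(Off))))
  ; // V is Base plus Off target bytes
```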
[SystemZ, LoopVectorizer] Enable vectorization of epilogue loops. (#172925)
This enables vectorization of epilogue loops produced by LoopVectorizer on
SystemZ.
LoopVectorizationCostModel::isEpilogueVectorizationProfitable() and
TTI.preferEpilogueVectorization() have been refactored slightly so that
targets can override preferEpilogueVectorization(ElementCount Iters) and
directly control this, whereas before this depended on
TTI.getMaxInterleaveFactor() as well.
The Iters passed to preferEpilogueVectorization() reflects the total number
of scalar iterations performed in the vectorized loop (including interleaving).
The default implementation of preferEpilogueVectorization() now subsumes
the old check against getMaxInterleaveFactor(). This patch should be NFC for
other targets.
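A sketch of what a target override could look like under the new hook (the
threshold is illustrative, not SystemZ's actual heuristic):
```cpp
bool SystemZTTIImpl::preferEpilogueVectorization(ElementCount Iters) const {
  // Iters is VF * IC of the main loop, i.e. the scalar iterations it covers.
  return Iters.getKnownMinValue() >= 16; // illustrative threshold only
}
```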
[AMDGPU] Fix caller/callee mismatch in SGPR assignment for inreg args
On the callee side, `LowerFormalArguments` marks SGPR0-3 as allocated in
`CCState` before running the CC analysis. On the caller side, `LowerCall` (and
GlobalISel's `lowerCall`/`lowerTailCall`) added the scratch resource to
`RegsToPass` without marking it in `CCState`. This caused `CC_AMDGPU_Func` to
treat SGPR0-3 as available on the caller side, assigning user inreg args there,
while the callee skipped them.
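A sketch of the shape of the fix on the caller side: reserve the
scratch-resource SGPRs in `CCState` before the analysis runs, mirroring the
callee (surrounding names are illustrative):
```cpp
CCState CCInfo(CallConv, IsVarArg, MF, ArgLocs, *DAG.getContext());
// Mirror LowerFormalArguments: these SGPRs carry the scratch resource and
// must not be handed out to user inreg arguments.
for (MCPhysReg Reg :
     {AMDGPU::SGPR0, AMDGPU::SGPR1, AMDGPU::SGPR2, AMDGPU::SGPR3})
  CCInfo.AllocateReg(Reg);
CCInfo.AnalyzeCallOperands(Outs, CC_AMDGPU_Func);
```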
[DAGCombiner] Add legality check for CLMULR fold to prevent infinite loop (#182376)
The bitreverse(clmul(bitreverse, bitreverse)) -> clmulr fold was missing
a legality check, causing an infinite loop when CLMULR isn't supported
on the target. Added the check to match other folds in visitBITREVERSE.
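The added guard is along these lines (sketched; assumes the node name
ISD::CLMULR from the title):
```cpp
// Match the other folds in visitBITREVERSE: only form CLMULR when the
// target can handle it, so the combine cannot loop forever.
if (!TLI.isOperationLegalOrCustom(ISD::CLMULR, VT))
  return SDValue();
```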
Fixes #182270
[NFC][AMDGPU] Add test showing caller/callee SGPR mismatch for inreg args
Add a test demonstrating a bug where the caller and callee disagree on which
SGPRs hold user inreg arguments when there are enough to reach the SGPR0-3
range.
On the callee side, `LowerFormalArguments` marks SGPR0-3 as allocated in
`CCState` before the CC analysis runs. On the caller side, `LowerCall` adds the
scratch resource to `RegsToPass` without marking SGPR0-3 in `CCState`. This
causes `CC_AMDGPU_Func` to assign user inreg args to SGPR0-3 on the caller side
(they appear free) while the callee skips them.
In the test, the caller writes arg 0 (value 42) to s0, but the callee reads arg
0 from s16.
[CIR] Emit cir.zero directly in Vector logical ops (#182703)
Emit `cir.zero` directly instead of a `vec.create<n, 0>` that would only be
folded to `cir.const_vector<n, 0>` later.
[IRBuilder] Add getByteTy and use it in CreatePtrAdd
The change requires a DataLayout instance to be available, which, in turn,
requires the insertion point to be set. In-tree tests detected only one case
where the function was called without setting an insertion point; that call
site was changed to create a constant expression directly.
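A plausible shape for the new method (assumed, not copied from the patch;
`getByteWidth()` is the DataLayout accessor assumed throughout this series):
```cpp
Type *IRBuilderBase::getByteTy() {
  // Requires an insertion point: the byte width lives in the module's
  // DataLayout, reached through the current basic block.
  const DataLayout &DL = BB->getModule()->getDataLayout();
  return Type::getIntNTy(Context, DL.getByteWidth());
}
```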
[ValueTracking] Make isBytewiseValue byte width agnostic
This is a simple change to show how easy it can be to support unusual
byte widths in the middle end.
[IR] Make @llvm.memset prototype byte width dependent
This patch changes the type of the value argument of @llvm.memset and
similar intrinsics from i8 to iN, where N is the byte width specified
in data layout string.
Note that the argument still has a fixed type (not overloaded), but the type
checker will complain if the type does not match the byte width.
Ideally, the type of the argument would be dependent on the address
space of the pointer argument. It is easy to do this (and I did it
downstream as a PoC), but since the data layout string doesn't currently
allow different byte widths for different address spaces, I refrained
from doing it now.
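For example, building a memset through IRBuilder under a 16-bit-byte layout
would look roughly like this (`getByteTy` is from the IRBuilder patch above):
```cpp
// With b:16, the value operand is i16; passing an i8 would now be
// rejected by the type checker.
Value *Fill = ConstantInt::get(B.getByteTy(), 0);
B.CreateMemSet(Dst, Fill, /*Size=*/NumBytes, MaybeAlign());
```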
[DataLayout] Add byte specification
This patch adds a byte specification to the data layout string.
The specification is `b:<size>`, where `<size>` is the size of a byte
in bits (later referred to as "byte width").
Limitations:
* The only values allowed for byte width are 8, 16, and 32.
16-bit bytes are popular, and my downstream target has 32-bit bytes.
These are the widths I'm going to add tests for in follow-up patches,
so this restriction only exists because other widths are untested.
* It is assumed that bytes are the same in all address spaces.
Supporting different byte widths in different address spaces would
require adding an address space argument to all DataLayout methods
that query ABI / preferred alignments because they return *byte*
alignments, and those will be different for different address spaces.
This is too much effort, but it can be done in the future if the need
arises; the specification reserves a spot for an address space number
before the ':'.
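A sketch of the specification in use (`getByteWidth()` is the assumed
accessor that accompanies the parser change):
```cpp
DataLayout DL16("b:16");           // 16-bit bytes
assert(DL16.getByteWidth() == 16);
DataLayout DL8("");                // no b: entry; bytes default to 8 bits
assert(DL8.getByteWidth() == 8);
```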
[llvm] Remove the docs for the (now removed) LLVM test-suite Makefiles (#179288)
The LLVM test suite used to provide a Makefile-based suite, which had
been deprecated and mostly unmaintained for many years. As explained in
https://discourse.llvm.org/t/llvm-test-suite-removing-the-deprecated-makefiles,
we recently got consensus to remove that test suite, which was done in
llvm/llvm-test-suite#320. This patch cleans up the related
documentation.
[libc++] Enable additional tests when Clang modules are enabled (#168967)
Disabling tests when Clang modules are enabled is not great because we
are moving more and more tests towards using Clang modules by default.
Instead, disable Clang modules on a per-test basis.
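For instance, a test that cannot be built with Clang modules can now opt out
on its own via a lit directive (hedged example; the feature name follows the
libc++ test suite's conventions):
```cpp
// UNSUPPORTED: clang-modules-build
#include <cassert>
int main(int, char**) {
  assert(true); // the test body itself is unaffected
  return 0;
}
```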
[HLSL][DXIL][SPRIV] Added WaveActiveProduct intrinsic #164385 (#165109)
From issue #99165, this adds the implementation of WaveActiveProduct, mainly
following how WaveActiveSum was implemented:
- [x] Implement WaveActiveProduct clang builtin,
- [x] Link WaveActiveProduct clang builtin with hlsl_intrinsics.h
- [x] Add sema checks for WaveActiveProduct to
CheckHLSLBuiltinFunctionCall in SemaChecking.cpp
- [x] Add codegen for WaveActiveProduct to EmitHLSLBuiltinExpr in
CGBuiltin.cpp
- [x] Add codegen tests to
clang/test/CodeGenHLSL/builtins/WaveActiveProduct.hlsl
- [x] Add sema tests to
clang/test/SemaHLSL/BuiltIns/WaveActiveProduct-errors.hlsl
- [x] Create the int_dx_WaveActiveProduct intrinsic in
IntrinsicsDirectX.td
- [x] Create the DXILOpMapping of int_dx_WaveActiveProduct to 119 in
DXIL.td
[11 lines not shown]
[SPARC] Set how many bytes load from or store to stack slot (#182674)
Reference: https://reviews.llvm.org/D44782
The testcase is copied from
llvm/test/CodeGen/RISCV/stack-slot-coloring.mir.
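A sketch of the overload this wires up, following the RISC-V precedent from
D44782 (the opcode handling is illustrative, not the patch's exact list):
```cpp
Register SparcInstrInfo::isLoadFromStackSlot(const MachineInstr &MI,
                                             int &FrameIndex,
                                             unsigned &MemBytes) const {
  if (MI.getOpcode() == SP::LDri && MI.getOperand(1).isFI()) {
    FrameIndex = MI.getOperand(1).getIndex();
    MemBytes = 4; // a 32-bit load moves 4 bytes to/from the slot
    return MI.getOperand(0).getReg();
  }
  return Register();
}
```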
[MLIR][Python] Allow passing dialect as a class keyword argument (#182465)
Previously, we constructed new ops using the pattern
`class MyOp(MyInt.Operation)`.
Now we've added a new pattern, `class MyOp(Operation, dialect=MyInt)`,
which allows more flexible composition. For example:
```python
class BinOpBase(Operation):  # it can be used in any dialect!
    res: Result[Any]
    lhs: Operand[Any]
    rhs: Operand[Any]

class MyInt(Dialect, name="myint"):
    pass

class AddOp(BinOpBase, dialect=MyInt, name="add"):
    ...
```