[SLP]Convert compares from zexts, promoted to selects, to inversed op, if improves codegen
Some of the zext i1 (cmp) + select sequences can be transformed by
inverting compare predicates to remove extra shuffles, like
zext i1 (cmp ne) + select (cmp eq), 0, 2 can be modeled as
select <2 x > (cmp ne), <1, 2>, zeroinitializer
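A scalar sketch of why the inversion is safe, one function per lane. The
function names are illustrative only and are not part of the SLP
vectorizer; integer operands are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Lane 0 in the original form: zext i1 (cmp ne a, b).
constexpr std::uint32_t zextOfNe(int a, int b) {
  return static_cast<std::uint32_t>(a != b);
}

// Lane 1 in the original form: select (cmp eq a, b), 0, 2.
constexpr std::uint32_t selectOnEq(int a, int b) {
  return (a == b) ? 0u : 2u;
}

// Unified form: select (cmp ne a, b), trueVal, 0, where trueVal is the
// lane's entry in the constant vector <1, 2>.
constexpr std::uint32_t selectOnNe(int a, int b, std::uint32_t trueVal) {
  return (a != b) ? trueVal : 0u;
}

// Both lanes agree with the unified form for equal and unequal operands.
static_assert(zextOfNe(3, 4) == selectOnNe(3, 4, 1), "lane 0, operands differ");
static_assert(zextOfNe(5, 5) == selectOnNe(5, 5, 1), "lane 0, operands equal");
static_assert(selectOnEq(3, 4) == selectOnNe(3, 4, 2), "lane 1, operands differ");
static_assert(selectOnEq(5, 5) == selectOnNe(5, 5, 2), "lane 1, operands equal");
```

With both lanes using the same predicate, a single vector compare can feed
a single vector select.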
Reviewers: RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/181580
[LLDB][ELF CORE] Only display a stop reason when there is a valid signo (#172781)
This patch fixes an issue where ELF cores report all threads as `STOP
REASON 0`.
This was/is a large personal annoyance of mine; added a test to verify
that a default ELF core process/thread has no valid stop reason.
[AArch64][GlobalISel] Remove fallbacks for fpcvt intrinsics with 16-bit operands (#179693)
Previously, GlobalISel failed to lower neon fpcvt intrinsics, as
RegBankSelect was not keeping the result on a fpr.
An additional fix is needed for the fpcvtz intrinsics, as these are the
"default" floating point convert intrinsics. As a result, Instruction
Selection has patterns mapping the FPCVTZ intrinsic to the
architecture-agnostic G_FP_TO*I_SAT node.
This also provides the opportunity for more optimisations to be made to
the code before Selection.
[SLP]Fix an ArrayRef out-of-bounds access in slice
If revec is enabled, the number of parts (registers) may be computed for
the combined node rather than for a single-element node, so we need to
check for a potential out-of-bounds access.
Fixes #181798
[CIR] Fix emission of functions referenced by member-pointer (#181452)
While working on attributes for these, I discovered that when a function
was referenced only via a member function pointer (see the no-odr-use.cpp
test for the example that failed!), we were incorrectly generating
the type of the function without the 'this' pointer. This
restores the correct behavior by making sure we generate the type for the
member-pointer type correctly.
[llvm-mca] Missing data dependencies due to constant registers not being cached (#177990)
Commit 385f59f modified MCA InstrBuilder methods `populateReads` and
`populateWrites` to discard information about constant registers and
avoid creating non-existent dependency chains.
However, information about reads and writes is cached based on
instruction descriptions. As a result, if the same instruction is
encountered first with and later without a constant register, the
cached entry will not contain information about that specific
register, resulting in missing data dependencies.
This patch moves the check of constant registers to `createInstruction`,
so that cached entries will also take into account constant registers
and, if necessary, they will be discarded later when creating the
instruction.
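A toy model of the caching problem, as a sketch only. Every name below
(`Inst`, `DescCache`, `ZeroReg`, both functions) is hypothetical and does
not correspond to the llvm-mca API; the cache is assumed to be keyed by
opcode alone.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical toy model; none of these names are llvm-mca APIs.
constexpr int ZeroReg = 0; // stands in for a constant register

struct Inst {
  int Opcode;
  std::vector<int> ReadRegs; // register read by each operand
};

using Desc = std::vector<std::size_t>; // operand indices that are reads
using DescCache = std::map<int, Desc>; // keyed by opcode only

inline bool isConstantReg(int Reg) { return Reg == ZeroReg; }

// Old scheme: constant registers are dropped while the descriptor is
// built, so the cached entry depends on the first instance seen.
std::vector<int> readsFilteredInDescriptor(DescCache &Cache, const Inst &I) {
  auto It = Cache.find(I.Opcode);
  if (It == Cache.end()) {
    Desc D;
    for (std::size_t Idx = 0; Idx < I.ReadRegs.size(); ++Idx)
      if (!isConstantReg(I.ReadRegs[Idx]))
        D.push_back(Idx);
    It = Cache.emplace(I.Opcode, D).first;
  }
  std::vector<int> Reads;
  for (std::size_t Idx : It->second)
    Reads.push_back(I.ReadRegs[Idx]);
  return Reads;
}

// New scheme: the descriptor records every read operand; constant
// registers are discarded per instance when the instruction is created.
std::vector<int> readsFilteredAtCreation(DescCache &Cache, const Inst &I) {
  auto It = Cache.find(I.Opcode);
  if (It == Cache.end()) {
    Desc D;
    for (std::size_t Idx = 0; Idx < I.ReadRegs.size(); ++Idx)
      D.push_back(Idx);
    It = Cache.emplace(I.Opcode, D).first;
  }
  std::vector<int> Reads;
  for (std::size_t Idx : It->second)
    if (!isConstantReg(I.ReadRegs[Idx]))
      Reads.push_back(I.ReadRegs[Idx]);
  return Reads;
}
```

In this model, seeing `{ZeroReg, 5}` and then `{7, 5}` for the same opcode
loses the read of register 7 under the old scheme and keeps it under the
new one.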
[Clang][HLSL] Fix struct semantic store (#181681)
The store to a nested semantic had an issue where the field index was not
incremented when walking through it.
One of the checked-in tests was bad, causing this to slip by.
Fixes #181674
[AArch64][GlobalISel] Add other factors to comment
An fp conversion result may also be stored on an fpr if the result is of
equal size to its input, or if PRCVT is present.
[AMDGPU][SIInsertWaitcnts][NFC] Move soft xcnt deletion to separate function (#181760)
This patch simplifies the logic of `insertWaitcntInBlock()` by moving
the code that removes the redundant soft xcnt instructions to a new
function: `removeRedundantSoftXcnts()`.
While doing so, this patch also cleans up the logic a bit by dropping
the AtomicRMWState and the corresponding functions.
This helps in several ways:
- insertWaitcntInBlock() will now do what its name suggests, i.e., only
insert and not remove.
- it makes it clear that removal of softxcnts is orthogonal to insertion
of waitcnts.
- we won't have to worry about both erased and new instruction in
insertWaitcntInBlock()'s loop.
The change should be NFC.
[CIR] Fix handling of boolean builtin expressions (#181444)
Previously we were generating a signed 1-bit integer constant for
builtin expressions that returned a boolean value. This caused a
verification error of mismatched types when we tried to store this
constant result to a pointer-to-bool location. This change adds a check
for boolean types.
[AArch64][GlobalISel] Re-add necessary brackets
Some brackets needed to allow "Build and Test Linux" CI test to pass.
This is because some configurations of clang see the order of operations
in A || B && C as ambiguous. Add the brackets to avoid this.
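For reference, a small sketch of the two possible groupings. The function
names are made up for illustration. In C++, && binds tighter than ||, so
the unbracketed A || B && C means the first grouping; writing the brackets
out keeps that meaning and makes it explicit.

```cpp
#include <cassert>

// What A || B && C means under the language rules, with the brackets
// written out explicitly.
constexpr bool andBindsTighter(bool A, bool B, bool C) {
  return A || (B && C);
}

// The other grouping a reader might assume; it is a different function.
constexpr bool orGroupedFirst(bool A, bool B, bool C) {
  return (A || B) && C;
}

// The two groupings disagree whenever A is true and C is false.
static_assert(andBindsTighter(true, false, false), "A alone is enough");
static_assert(!orGroupedFirst(true, false, false), "C is required here");
```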
[NFC][Flang][OpenMP] Remove obsolete declare simd lowering TODO test (#181756)
The TODO test for Flang OpenMP `declare simd` lowering is no longer
needed, as the lowering was implemented in
https://github.com/llvm/llvm-project/pull/175604.
[ELF][SystemZ] Fix R_390_TLS_LDO32/64 in non-SHF_ALLOC sections
These can appear in .debug_info so, like other architectures (e.g.
X86_64), we still need to handle them in getRelExpr.
Fixes: aec1c984266c ("[ELF] Add target-specific relocation scanning for SystemZ (#181563)")
[mlir][AMDGPU] Allow packing of exactly 4 elements. (#181843)
`amdgpu.scaled_mfma` ops ingest byte sized scales stored in 4-byte
registers. To avoid unnecessary padding (where we only ever use the
first byte in this 4-byte register), this canonicalization finds
opportunities to enable packing multiple scales into 4-byte chunks
whenever possible. Note this is necessary but not sufficient to avoid
byte loads from LDS.
This canonicalization should try to pack scales that are extracted from
an alloc in shared mem of size 4 bytes or larger (meaning packing to 4
bytes is possible). Currently we bail out if it is exactly 4 bytes long
which is incorrect and fixed in this PR.
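The boundary condition can be sketched as below. Both predicates are
hypothetical stand-ins for the check in the canonicalization, not the
actual code; only the "4 bytes or larger" rule comes from the description
above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical predicates over the size in bytes of the shared-memory
// alloc that the scales are extracted from.

// Previous behaviour as described: an alloc of exactly 4 bytes bails out.
constexpr bool canPackBefore(std::uint64_t allocBytes) {
  return allocBytes > 4;
}

// Intended behaviour: 4 bytes or larger is enough to pack into a 4-byte chunk.
constexpr bool canPackAfter(std::uint64_t allocBytes) {
  return allocBytes >= 4;
}

static_assert(!canPackBefore(4) && canPackAfter(4), "only the boundary changes");
static_assert(canPackBefore(8) == canPackAfter(8), "larger allocs unaffected");
static_assert(canPackBefore(3) == canPackAfter(3), "smaller allocs unaffected");
```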
---------
Signed-off-by: Muzammiluddin Syed <muzasyed at amd.com>
[DAGCombiner] Combine (fshl A, X, Y) | (shl X, Y) --> fshl (A|X), X, Y (#180887)
Similar for (fshr X, B, Y) | (srl X, Y) --> fshr X, (X|B), Y
This is similar to the FSHL/FSHR handling in
hoistLogicOpWithSameOpcodeHands but here we treat a shl/shr like a
fshl/fshr with 0.
The pattern doesn't require X to be the same in both sides, but that's
what occurred in the case I was looking at so that's what is
implemented.
Alive2: https://alive2.llvm.org/ce/z/eUou-u
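Independently of the Alive2 proof, both identities can be checked
exhaustively at 8 bits. This is a sketch with hand-written funnel shift
helpers, not the DAGCombiner code, and the shift amount is restricted to
0..7 so that the plain shl/srl is well defined.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned BW = 8; // bit width used for the exhaustive check

// fshl: concatenate Hi:Lo, shift left by Amt, return the high half.
constexpr std::uint8_t fshl(std::uint8_t Hi, std::uint8_t Lo, unsigned Amt) {
  Amt %= BW;
  return Amt == 0 ? Hi
                  : static_cast<std::uint8_t>((Hi << Amt) | (Lo >> (BW - Amt)));
}

// fshr: concatenate Hi:Lo, shift right by Amt, return the low half.
constexpr std::uint8_t fshr(std::uint8_t Hi, std::uint8_t Lo, unsigned Amt) {
  Amt %= BW;
  return Amt == 0 ? Lo
                  : static_cast<std::uint8_t>((Hi << (BW - Amt)) | (Lo >> Amt));
}

// Checks, for every 8-bit A and X and every Y in [0, 8):
//   (fshl A, X, Y) | (shl X, Y) == fshl (A|X), X, Y
//   (fshr X, A, Y) | (srl X, Y) == fshr X, (X|A), Y
bool orOfFunnelShiftFoldsHold() {
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned x = 0; x < 256; ++x)
      for (unsigned y = 0; y < BW; ++y) {
        const auto A = static_cast<std::uint8_t>(a);
        const auto X = static_cast<std::uint8_t>(x);
        const auto AorX = static_cast<std::uint8_t>(a | x);
        const auto Shl = static_cast<std::uint8_t>(X << y);
        const auto Srl = static_cast<std::uint8_t>(X >> y);
        if ((fshl(A, X, y) | Shl) != fshl(AorX, X, y))
          return false;
        if ((fshr(X, A, y) | Srl) != fshr(X, AorX, y))
          return false;
      }
  return true;
}
```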
[NFC][analyzer] Remove StmtNodeBuilder (#181431)
The class `StmtNodeBuilder` was practically equivalent to its base class
`NodeBuilder` -- its data members and constructors were identical and
the only distinguishing feature was that it supported two additional
methods that were not present in `NodeBuilder`.
This commit moves those two methods to `NodeBuilder` (there is no reason
why they cannot be defined there) and replaces all references to
`StmtNodeBuilder` with plain `NodeBuilder`.
Note that previously `StmtNodeBuilder` had a distinguishing feature
where its destructor could pass nodes to an "enclosing node builder", but
this became dead code at some point in the past, so my previous commit
320d0b5467b9586a188e06dd2620126f5cb99318 removed it.
[mlir][acc] Add pass to insert acc declare globals into GPU module (#181383)
Adds a new OpenACC pass that copies globals with the `acc.declare`
attribute into the GPU module so that device code (acc routine, compute
regions) can reference them.
---------
Co-authored-by: Susan Tan <zujunt at nvidia.com>
[ScalarizeMaskedMemIntr][ProfCheck] Correctly annotate branch weights (#181568)
There are two cases in ScalarizeMaskedMemIntr where conditional branches
are created using conditions derived from the mask. Given these are
synthesized and we do not have VP metadata for them, we need to mark them
as unknown.
[NFC][VPlan] Test showing that unit-stride-mv should be done later in pipeline (#180292)
Right now memory dependency checks and speculation for unit-strideness
are performed somewhat simultaneously. This is wrong because:
* Ideally, if accesses aren't unit-strided at runtime we might want to
take a version with gather/strided load (longer term). Those two loops
should share legality checks and the dispatch based on stride should
only happen after the legality condition has been satisfied.
* Even if we don't generate multiple vector loops (current situation),
not vectorizing at all is worse than generating gather-only vector loop.
This PR adds a test for the latter as that could be a first step in
adding full support for the former.
This isn't target-specific, but gathers aren't supported on the generic
target and result in very ugly scalarized code/CHECKs, hence the test is
put under RISCV/.
Co-authored-by: Florian Hahn <flo at fhahn.com>