[ARM] Restore hasSideEffects flag on t2WhileLoopSetup (#168948)
ARM relies on deprecated TableGen behavior of guessing instruction
properties from patterns (`def ARM : Target` doesn't have
`guessInstructionProperties` set to false).
Before #168209, TableGen conservatively guessed that `t2WhileLoopSetup`
has side effects because the instruction wasn't matched by any pattern.
After the patch, TableGen guesses it has no side effects because the
added pattern uses only `arm_wlssetup` node, which has no side effects.
Add `SDNPSideEffect` to the node so that TableGen guesses the property
right, and also `hasSideEffects = 1` to the instruction in case ARM ever
sets `guessInstructionProperties` to false.
X86: Stop overriding getRegClass
This function should not be virtual; making this virtual was
an AMDGPU hack that should be removed not spread to other
backends.
This does not need to be overridden to reserve registers. The
register reservation mechanism is orthogonal to to the register
class constraints of the instruction, this should be reporting
the underlying instruction constraint. The registers are separately
reserved, so they will be removed from the allocation order anyway.
If the actual class needs to change based on the subtarget,
it should probably generalize the LookupPtrRegClass mechanism.
This was added by #70958. The new tests there for the class are
probably not useful anymore. These instead should compile to the
end and try to stress the allocation behavior.
[OpenMP][OMPIRBuilder] Use runtime CC for runtime calls (#168608)
Some targets have a specific calling convention that should be used for
generated calls to runtime functions.
Pass that down and use it.
Signed-off-by: Nick Sarnie <nick.sarnie at intel.com>
[acc][flang] Implement acc interface for tracking type descriptors (#168982)
FIR operations that use derived types need to have type descriptor
globals available on device when offloading. Examples of this can be
seen in `CUFDeviceGlobal` which ensures that such type descriptor uses
work on device for CUF.
Similarly, this is needed for OpenACC. This change introduces a new
interface to the OpenACC dialect named
`IndirectGlobalAccessOpInterface` which can be attached to operations
that may result in generation of accesses that use type descriptor
globals. This functionality is needed for the `ACCImplicitDeclare` pass
that is coming in a follow-up change which implicitly ensures that all
referenced globals are available in OpenACC compute contexts.
The interface provides a `getReferencedSymbols` method that collects all
global symbols referenced by an operation. When a symbol table is
provided, the implementation for FIR recursively walks type descriptor
globals to find all transitively referenced symbols.
[13 lines not shown]
[AMDGPU] Handle AV classes in SIFixSGPRCopies::processPHINode (#169038)
Fix a problem exposed by #166483 using AV classes in more places.
`isVectorRegister` only accepts registers of VGPR or AGPR classes.
`hasVectorRegisters` additionally accepts the combined AV classes.
Fixes: #168761
AMDGPU: Stop implementing shouldCoalesce (#168988)
Use the default, which freely coalesces anything it can.
This mostly shows improvements, with a handful of regressions.
The main concern would be if introducing wider registers is more
likely to push the register usage up to the next occupancy tier.
Revert "[MLIR][GPU] subgroup_mma fp64 extension" (#169049)
Reverts llvm/llvm-project#165873
The revert is triggered by a failing integration test on a couple of
buildbots.
[VPlan] Only apply forced cost to recipes with underlying values. (#168372)
Only apply forced instruction costs to recipes with underlying values to
match the legacy cost model. A VPlan may have a number of additional
VPInstructions without underlying values that are not considered for its
cost, and assigning forced costs to them would incorrectly inflate its
cost.
This fixes a cost divergence between legacy and VPlan-based cost models
with forced instruction costs.
PR: https://github.com/llvm/llvm-project/pull/168372
[Flang][OpenMP] Add semantic support for Loop Sequences and OpenMP loop fuse (#161213)
This patch adds semantics for the `omp fuse` directive in flang, as
specified in OpenMP 6.0. This patch also enables semantic support for
loop sequences which are needed for the fuse directive along with
semantics for the `looprange` clause. These changes are only semantic.
Relevant tests have been added , and previous behavior is retained with
no changes.
---------
Co-authored-by: Ferran Toda <ferran.todacasaban at bsc.es>
Co-authored-by: Krzysztof Parzyszek <Krzysztof.Parzyszek at amd.com>
[AMDGPU] Enable multi-group xnack replay in hardware (GFX1250) (#169016)
This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.
[MLIR][GPU] subgroup_mma fp64 extension (#165873)
This PR extends the `gpu.subgroup_mma_*` ops to support fp64 type.
The extension requires special handling during the lowering to `nvvm`
due to the return type for load ops for fragment a and b (they return a
scalar instead of a struct).
[LoopCacheAnalysis] Replace delinearization for fixed size array (#164798)
This patch replaces the delinearization function used in
LoopCacheAnalysis, switching from one that depends on type information
in GEPs to one that does not. Once this patch and
https://github.com/llvm/llvm-project/pull/161822 are landed, we can
delete `tryDelinearizeFixedSize` from Delienarization, which is an
optimization heuristic guided by GEP type information. After Polly
eliminates its use of `getIndexExpressionsFromGEP`, we will be able to
completely delete GEP-driven heuristics from Delinearization.
[ORC] Tailor ELF debugger support plugin to load-address patching only (#168518)
In 4 years the ELF debugger support plugin wasn't adapted to other
object formats or debugging approaches. After the renaming NFC in
https://github.com/llvm/llvm-project/pull/168343, this patch tailors the
plugin to ELF and section load-address patching. It allows removal of
abstractions and consolidate processing steps with the newly enabled
AllocActions from https://github.com/llvm/llvm-project/pull/168343.
The key change is to process debug sections in one place in a
post-allocation pass. Since we can handle the endianness of the ELF file
the single `visitSectionLoadAddresses()` visitor function now, we don't
need to track debug objects and sections in template classes anymore. We
keep using the `DebugObject` class and drop `DebugObjectSection`,
`ELFDebugObjectSection<ELFT>` and `ELFDebugObject`.
Furthermore, we now use the allocation's working memory for load-address
fixups directly. We can drop the `WritableMemoryBuffer` from the debug
object and most of the `finalizeWorkingMemory()` step, which saves one
[5 lines not shown]
[clang][OpenMP][CodeGen] Use an else if instead of checking twice (#168776)
These two classes are mutually exclusive so avoid doing the two checks
when the first succeeded.
[RISCV] Update SpacemiT-X60 vector mask instructions latencies (#150644)
This PR adds hardware-measured latencies for all instructions defined in
Section 15 of the RVV specification: "Vector Mask Instructions" to the
SpacemiT-X60 scheduling model.
[LowerMemIntrinsics] Optimize memset lowering
This patch changes the memset lowering to match the optimized memcpy lowering.
The memset lowering now queries TTI.getMemcpyLoopLoweringType for a preferred
memory access type. If that type is larger than a byte, the memset is lowered
into two loops: a main loop that stores a sufficiently wide vector splat of the
SetValue with the preferred memory access type and a residual loop that covers
the remaining bytes individually. If the memset size is statically known, the
residual loop is replaced by a sequence of stores.
This improves memset performance on gfx1030 (AMDGPU) in microbenchmarks by
around 7-20x.
I'm planning similar treatment for memset.pattern as a follow-up PR.
For SWDEV-543208.
[OpenMP] Introduce "loop sequence" as directive association (#168934)
OpenMP 6.0 introduced a `fuse` directive, and with it a "loop sequence"
as the associated code. What used to be "loop association" has become
"loop-nest association".
Rename Association::Loop to LoopNest, add Association::LoopSeq to
represent the "loop sequence" association.
Change the association of fuse from "block" to "loop sequence".
[AArch64] Avoid introducing illegal types in LowerVECTOR_COMPRESS (NFC) (#168520)
This does not seem to be an issue currently, but when using
VECTOR_COMPRESS as part of another lowering, I found these BITCASTs
would result in "Unexpected illegal type!" errors.
For example, this would convert the legal nxv2f32 type into the illegal
nxv2i32 type. This patch avoids this by using no-op casts for unpacked
types.
[mlir][py][c] Enable setting block arg locations. (#169033)
This enables changing the location of a block argument. Follows the
approach for updating type of block arg.
[LowerMemIntrinsics] Optimize memset lowering
This patch changes the memset lowering to match the optimized memcpy lowering.
The memset lowering now queries TTI.getMemcpyLoopLoweringType for a preferred
memory access type. If that type is larger than a byte, the memset is lowered
into two loops: a main loop that stores a sufficiently wide vector splat of the
SetValue with the preferred memory access type and a residual loop that covers
the remaining bytes individually. If the memset size is statically known, the
residual loop is replaced by a sequence of stores.
This improves memset performance on gfx1030 (AMDGPU) in microbenchmarks by
around 7-20x.
I'm planning similar treatment for memset.pattern as a follow-up PR.
For SWDEV-543208.