[VPlan] Only apply forced cost to recipes with underlying values. (#168372)
Only apply forced instruction costs to recipes with underlying values to
match the legacy cost model. A VPlan may have a number of additional
VPInstructions without underlying values that are not considered for its
cost, and assigning forced costs to them would incorrectly inflate its
cost.
This fixes a cost divergence between legacy and VPlan-based cost models
with forced instruction costs.
PR: https://github.com/llvm/llvm-project/pull/168372
[Flang][OpenMP] Add semantic support for Loop Sequences and OpenMP loop fuse (#161213)
This patch adds semantics for the `omp fuse` directive in flang, as
specified in OpenMP 6.0. This patch also enables semantic support for
loop sequences which are needed for the fuse directive along with
semantics for the `looprange` clause. These changes are only semantic.
Relevant tests have been added, and previous behavior is retained with
no changes.
---------
Co-authored-by: Ferran Toda <ferran.todacasaban at bsc.es>
Co-authored-by: Krzysztof Parzyszek <Krzysztof.Parzyszek at amd.com>
[AMDGPU] Enable multi-group xnack replay in hardware (GFX1250) (#169016)
This patch enables the multi-group xnack replay mode by
configuring the hardware MODE register at kernel entry.
This aligns the hardware behavior with the compiler's
existing multi-group s_wait_xcnt insertion logic.
[MLIR][GPU] subgroup_mma fp64 extension (#165873)
This PR extends the `gpu.subgroup_mma_*` ops to support the fp64 type.
The extension requires special handling during the lowering to `nvvm`
because the load ops for fragments a and b return a scalar instead of a
struct.
[LoopCacheAnalysis] Replace delinearization for fixed size array (#164798)
This patch replaces the delinearization function used in
LoopCacheAnalysis, switching from one that depends on type information
in GEPs to one that does not. Once this patch and
https://github.com/llvm/llvm-project/pull/161822 are landed, we can
delete `tryDelinearizeFixedSize` from Delinearization, which is an
optimization heuristic guided by GEP type information. After Polly
eliminates its use of `getIndexExpressionsFromGEP`, we will be able to
completely delete GEP-driven heuristics from Delinearization.
[ORC] Tailor ELF debugger support plugin to load-address patching only (#168518)
In 4 years the ELF debugger support plugin wasn't adapted to other
object formats or debugging approaches. After the renaming NFC in
https://github.com/llvm/llvm-project/pull/168343, this patch tailors the
plugin to ELF and section load-address patching. This allows removing
abstractions and consolidating processing steps with the newly enabled
AllocActions from https://github.com/llvm/llvm-project/pull/168343.
The key change is to process debug sections in one place in a
post-allocation pass. Since we can now handle the endianness of the ELF file
in the single `visitSectionLoadAddresses()` visitor function, we no longer
need to track debug objects and sections in template classes. We
keep using the `DebugObject` class and drop `DebugObjectSection`,
`ELFDebugObjectSection<ELFT>` and `ELFDebugObject`.
Furthermore, we now use the allocation's working memory for load-address
fixups directly. We can drop the `WritableMemoryBuffer` from the debug
object and most of the `finalizeWorkingMemory()` step, which saves one
[5 lines not shown]
[clang][OpenMP][CodeGen] Use an else if instead of checking twice (#168776)
These two classes are mutually exclusive so avoid doing the two checks
when the first succeeded.
[RISCV] Update SpacemiT-X60 vector mask instructions latencies (#150644)
This PR adds hardware-measured latencies for all instructions defined in
Section 15 of the RVV specification: "Vector Mask Instructions" to the
SpacemiT-X60 scheduling model.
[LowerMemIntrinsics] Optimize memset lowering
This patch changes the memset lowering to match the optimized memcpy lowering.
The memset lowering now queries TTI.getMemcpyLoopLoweringType for a preferred
memory access type. If that type is larger than a byte, the memset is lowered
into two loops: a main loop that stores a sufficiently wide vector splat of the
SetValue with the preferred memory access type and a residual loop that covers
the remaining bytes individually. If the memset size is statically known, the
residual loop is replaced by a sequence of stores.
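The two-loop shape described above can be sketched in plain C++. This is only an illustration of the idea, not the IR that the pass emits: the function name `memset_two_loops` is hypothetical, and a 4-byte access type is assumed here in place of whatever TTI.getMemcpyLoopLoweringType would return for a real target.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not LLVM code): models the main loop + residual loop
// structure. The 4-byte "preferred access type" is an assumption made for
// illustration only.
void memset_two_loops(unsigned char *dst, unsigned char value, std::size_t n) {
  using Wide = std::uint32_t;
  constexpr std::size_t W = sizeof(Wide);

  // Splat the byte value across every byte of the wide type.
  const Wide splat = 0x01010101u * value;

  // Number of bytes covered by whole wide stores.
  const std::size_t mainBytes = (n / W) * W;

  // Main loop: one wide store per iteration (memcpy keeps the store
  // well-defined for unaligned destinations).
  for (std::size_t i = 0; i < mainBytes; i += W)
    std::memcpy(dst + i, &splat, W);

  // Residual loop: cover the remaining n % W bytes one at a time.
  for (std::size_t i = mainBytes; i < n; ++i)
    dst[i] = value;
}
```

When `n` is a compile-time constant, the residual loop has a known trip count below `W`, which is why it can be replaced by a short sequence of stores.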
This improves memset performance on gfx1030 (AMDGPU) in microbenchmarks by
around 7-20x.
I'm planning similar treatment for memset.pattern as a follow-up PR.
For SWDEV-543208.
[OpenMP] Introduce "loop sequence" as directive association (#168934)
OpenMP 6.0 introduced a `fuse` directive, and with it a "loop sequence"
as the associated code. What used to be "loop association" has become
"loop-nest association".
Rename Association::Loop to LoopNest, add Association::LoopSeq to
represent the "loop sequence" association.
Change the association of fuse from "block" to "loop sequence".
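As a rough illustration of the terminology, the sketch below shows a loop sequence (two consecutive loops) next to a hand-fused equivalent. It is plain C++ with no OpenMP directives; the function names are hypothetical, both vectors are assumed to have the same length, and the fused form only conveys the intent of `fuse`, not what any compiler generates.

```cpp
#include <cstddef>
#include <vector>

// A loop sequence: two canonical loops that follow one another.
// This is the kind of code a "loop sequence" association refers to.
void update_sequence(std::vector<int> &a, std::vector<int> &b) {
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] *= 2;
  for (std::size_t i = 0; i < b.size(); ++i)
    b[i] += 1;
}

// Hand-written fused form of the sequence above.
// Assumption: a.size() == b.size(), so one loop can run both bodies.
void update_fused(std::vector<int> &a, std::vector<int> &b) {
  for (std::size_t i = 0; i < a.size(); ++i) {
    a[i] *= 2;
    b[i] += 1;
  }
}
```

A loop nest, by contrast, is one loop placed inside another, which is what the renamed "loop-nest association" covers.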
[AArch64] Avoid introducing illegal types in LowerVECTOR_COMPRESS (NFC) (#168520)
This does not seem to be an issue currently, but when using
VECTOR_COMPRESS as part of another lowering, I found these BITCASTs
would result in "Unexpected illegal type!" errors.
For example, this would convert the legal nxv2f32 type into the illegal
nxv2i32 type. This patch avoids this by using no-op casts for unpacked
types.
[mlir][py][c] Enable setting block arg locations. (#169033)
This enables changing the location of a block argument. Follows the
approach for updating type of block arg.
[LifetimeSafety] Detect expiry of loans to trivially destructed types (#168855)
Handling Trivially Destructed Types
This PR uses `AddLifetime` to handle expiry of loans to trivially
destructed types.
Example:
```cpp
int *trivial_uar() {
  int *ptr;
  int x = 1;
  ptr = &x;
  return ptr;
}
```
The CFG created now has an Expire Fact for trivially destructed types:
```
[19 lines not shown]
[OpenMP][libomp] Add transparent task flag bit to kmp_tasking_flags (#168873)
Clang is adding support for the new OpenMP `transparent` clause on
`task` and `taskloop` directives.
The parsing and semantic handling for this clause is introduced in
https://github.com/llvm/llvm-project/pull/166810 .
To allow the compiler to communicate this clause to the `OpenMP`
runtime, a dedicated bit in `kmp_tasking_flags` is required.
This patch adds a new compiler-reserved bit `transparent` to the
`kmp_tasking_flags` structure.
[libc++] Revert fstream::read optimizations (#168894)
The optimizations cause various runtime failures, as reported in #168628.
This reverts both #165223 and #167779.
[AMDGPU][gfx1250] Add wait_xcnt before any access that cannot be repeated
This applies to all volatile accesses, and also to buffer operations.
[libc++] Optimize std::has_single_bit (#133063)
Clang translates most implementations of has_single_bit to `(v ^ (v-1))
> v-1` - except the one definition libc++ actually uses.
Proof of correctness: https://godbolt.org/z/d61bxW4r1
(Could also be fixed by teaching Clang to optimize better, but making
source match output feels clearer to me. And it improves unoptimized
performance.)
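The equivalence behind the rewritten expression can be checked with a small sketch. The function names below are hypothetical and this is not the libc++ source; it compares the `(v ^ (v-1)) > v-1` form against the familiar "non-zero and no bits shared with v-1" test on a 32-bit unsigned type.

```cpp
#include <cstdint>

// Form quoted in the commit message. For v == 0, v - 1 wraps to the maximum
// value, so the comparison is false, as required.
constexpr bool single_bit_xor(std::uint32_t v) {
  return (v ^ (v - 1)) > (v - 1);
}

// Classic formulation, used here only as a reference for comparison.
constexpr bool single_bit_classic(std::uint32_t v) {
  return v != 0 && (v & (v - 1)) == 0;
}

// A few compile-time spot checks.
static_assert(!single_bit_xor(0u));
static_assert(single_bit_xor(1u));
static_assert(!single_bit_xor(6u));
static_assert(single_bit_xor(0x80000000u));
```

The sketch uses `std::uint32_t` so that all arithmetic happens in an unsigned type; narrower types would be promoted to `int` and need extra care.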
[OpenMP] Fix firstprivate pointer handling in target regions (#167879)
Firstprivate pointers in OpenMP target regions were not being lowered
correctly, causing the runtime to perform unnecessary present table
lookups instead of passing pointer values directly.
This patch adds the OMP_MAP_LITERAL flag for firstprivate pointers,
enabling the runtime to pass pointer values directly without lookups.
The fix handles both explicit firstprivate clauses and implicit
firstprivate semantics from defaultmap clauses.
Key changes:
- Track defaultmap(firstprivate:...) clauses in MappableExprsHandler
- Add isEffectivelyFirstprivate() to check both explicit and implicit
firstprivate semantics
- Apply OMP_MAP_LITERAL flag to firstprivate pointers in
generateDefaultMapInfo()
Map type values:
[21 lines not shown]
[Clang] Fix handling of explicit parameters in `SemaLambda` (#168558)
Previously, the presence of an explicit parameter list was detected by
querying `getNumTypeObjects()` from the `Declarator` block of the lambda
definition. This breaks for lambdas which do not have a parameter list
but _do_ have a trailing return type; that is, both of
```
[]() -> int { return 0; };
[] -> int { return 0; };
```
would return `true` when inspecting
`LambdaExpr::hasExplicitParameters()`.
Fix this by instead querying the `LParenLoc()` from the `Declarator`'s
`FunctionTypeInfo`. If `isValid() == true`, then an explicit parameter
list must be present, and if it is `false`, then it is not.
[6 lines not shown]
[NVPTX] Support for dense and sparse MMA intrinsics with block scaling. (#163561)
This change adds dense and sparse MMA intrinsics with block scaling. The
implementation is based on [PTX ISA version
9.0](https://docs.nvidia.com/cuda/parallel-thread-execution/). Tests for
new intrinsics are added for PTX 8.7 and SM 120a and are generated by
`llvm/test/CodeGen/NVPTX/wmma-ptx87-sm120a.py`. The tests have been
verified with ptxas from CUDA-13.0 release.
Dense MMA intrinsics with block scaling were supported by
@schwarzschild-radius.
[Utils][update_mc_test_checks] Support generating asm tests from templates.
Reduces the pain of manually editing tests when applying the same
changes across multiple instructions and keeping them consistent.