[AMDGPU][Scheduler] Revert all regions when remat fails to increase occ. (#177205)
When the rematerialization stage fails to increase occupancy in all
regions, the current implementation only reverts the effect of
re-scheduling in regions in which the increased occupancy target could
not be achieved. However, given that re-scheduling with a higher
occupancy target puts more pressure on the scheduler to achieve lower
maximum RP at the cost of potentially lower ILP as well, region
schedules made with higher occupancy targets are generally less
desirable if the whole function is not able to meet that target.
Therefore, if at least one region cannot reach its target, it makes
sense to revert re-scheduling in all affected regions to go back to a
schedule that was made with a lower occupancy target.
This implements such logic for the rematerialization stage, and adds a
test to showcase that re-scheduling is indeed interrupted/reverted as
soon as a re-scheduled region that does not meet the increased target
occupancy is encountered.
[4 lines not shown]
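The all-or-nothing revert described above can be sketched as follows. This is a minimal illustration, not the actual scheduler code: `Region`, `reschedule`, and `rescheduleAllOrRevert` are hypothetical names, and "re-scheduling" is modelled as simply reversing the instruction order so that a revert is observable.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scheduling region (not an LLVM type).
struct Region {
  std::vector<int> Order;  // current instruction order
  unsigned AchievableOcc;  // occupancy this region reaches once re-scheduled
};

// Hypothetical re-scheduling step: changes the order and reports the
// occupancy that the new schedule achieves.
static unsigned reschedule(Region &R) {
  std::reverse(R.Order.begin(), R.Order.end());
  return R.AchievableOcc;
}

// Re-schedules regions one by one with the increased target. As soon as one
// region misses the target, re-scheduling is interrupted and every region
// touched so far is restored to its saved order.
bool rescheduleAllOrRevert(std::vector<Region> &Regions, unsigned TargetOcc) {
  std::vector<std::vector<int>> Saved;
  for (std::size_t I = 0; I < Regions.size(); ++I) {
    Saved.push_back(Regions[I].Order);
    if (reschedule(Regions[I]) >= TargetOcc)
      continue;
    for (std::size_t J = 0; J <= I; ++J)
      Regions[J].Order = Saved[J];
    return false;
  }
  return true;
}
```

In this sketch, regions after the first failing one are never re-scheduled, which mirrors the "interrupted/reverted as soon as" wording of the commit message.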
[clang-tidy] Speed up `modernize-use-nullptr` (#178829)
As noted in [this
comment](https://github.com/llvm/llvm-project/pull/178149#discussion_r2732896149),
it appears that registering one `anyOf(a, b, ...)` matcher is generally
slower than registering `a, b, ...` all individually. Applying that
knowledge to this check gives us an easy 3x speedup:
```txt
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name ---
Status quo: 0.3281 ( 6.1%) 0.0469 ( 5.2%) 0.3750 ( 6.0%) 0.3491 ( 5.5%) modernize-use-nullptr
With this change: 0.0938 ( 1.8%) 0.0156 ( 1.8%) 0.1094 ( 1.8%) 0.1260 ( 2.1%) modernize-use-nullptr
```
I'm not exactly sure *why* this works, but it seems pretty consistent.
I've seen a similar result trying this with `bugprone-infinite-loop`.
[ELF,test] Improve riscv and aarch64 relocation error tests
Adopt modern test patterns for relocation overflow and alignment error
tests:
* Use `rm -rf %t && mkdir %t && cd %t` pattern for isolation. Use simple
filenames (32.o, 64.o, out.32) instead of %t-prefixed names
* Use `--defsym` instead of external input files where possible
* Omit `-o /dev/null` for negative tests (implicit when errors occur)
* Add `--implicit-check-not=error:` to catch unexpected errors
[Matrix] Update test to make sure tiled loops can be used. (NFC)
The 6x6x6 and 7x7x7 matrix multiply used previously could not
use tiled loop codegen. Update to 8x8x8 and forced tile-size of 2.
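For readers unfamiliar with the term, a tiled loop nest for an 8x8x8 multiply with a tile size of 2 looks roughly like the sketch below. This is plain C++ and not the IR the matrix lowering emits; `tiledMatMul` and `Mat` are illustrative names. The commit message does not say why the 6x6x6 and 7x7x7 cases could not be tiled, and the divisibility `static_assert` is only this sketch's own simplification.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 8;     // matrix dimension used by the updated test
constexpr std::size_t Tile = 2;  // forced tile size used by the updated test
static_assert(N % Tile == 0, "this sketch assumes the tile size divides N");

using Mat = std::array<std::array<double, N>, N>;

// Computes C = A * B by walking Tile x Tile x Tile blocks.
Mat tiledMatMul(const Mat &A, const Mat &B) {
  Mat C{};  // zero-initialized accumulator
  for (std::size_t II = 0; II < N; II += Tile)
    for (std::size_t JJ = 0; JJ < N; JJ += Tile)
      for (std::size_t KK = 0; KK < N; KK += Tile)
        // Inner loops handle one tile of the iteration space.
        for (std::size_t I = II; I < II + Tile; ++I)
          for (std::size_t J = JJ; J < JJ + Tile; ++J)
            for (std::size_t K = KK; K < KK + Tile; ++K)
              C[I][J] += A[I][K] * B[K][J];
  return C;
}
```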
[VPlan] Handle Mul/UDiv in getSCEVExprForVPValue (NFCI).
Support Mul/UDiv and AND-variant (https://alive2.llvm.org/ce/z/rBJVdg)
in getSCEVExprForVPValue.
This is used in code paths when computing SCEV expressions in the
VPlan-based cost model, which should produce costs matching the legacy
cost model.
[AMDGPU][SIInsertWaitcnt][NFC] Replace if/else with switch (#178956)
This is an NFC patch that replaces the consecutive ifs and else ifs in
generateWaitcntInstBefore() with a switch. This makes it a bit easier to
read.
Use reportFatalInternalError in DialectConversion (#178612)
Migrate from deprecated report_fatal_error to reportFatalInternalError
in DialectConversion.cpp. All 6 instances are internal consistency
checks in MLIR's dialect conversion system, so reportFatalInternalError
is the appropriate replacement.
Part of #138914
AMDGPU: Use SimplifyQuery in AMDGPUCodeGenPrepare
Enables assumes in more contexts. Of particular interest is the
nan check for the fract pattern.
The device libs f32 and s64 sin implementations have a range check,
and inside the large path this pattern appears. After a small patch
to invert this check to send nans down the small path, this will
enable the fold unconditionally on the large path.
[Hexagon] Track type locally in HexagonVectorCombine (#179066)
Replace getAllocatedType calls with tracked types from alloca creation.
The types are known at the CreateAlloca call sites, so we track them
locally instead of re-querying through getAllocatedType, to facilitate
someday possibly removing getAllocatedType from the API of AllocaInst.
Co-authored-by: Claude Sonnet 4.5 <noreply at anthropic.com>
[DirectX] remove getAllocatedType in DXILDataScalarization (#179067)
Update dynamicallyLoadArray to take the allocated type as a parameter
instead of querying getAllocatedType. This is to facilitate removing
other incorrect uses of getAllocatedType, and eventually possibly even
getAllocatedType itself.
Co-authored-by: Claude Sonnet 4.5 <noreply at anthropic.com>
[ELF,test] Improve error/warning message checks
Update tests to include proper `error:` or `warning:` prefixes and
file/section information in CHECK patterns. Add
--implicit-check-not=error: to ensure no unexpected errors are produced.
[libc] Address sincosf size bloat (#179004)
The recent refactoring in #177523 marked some functions as static, which
increased the size of the sinf/cosf functions. Remove the static storage
for these functions to eliminate the bloat, which is especially
problematic in size-constrained baremetal target builds.
[ELF,test] Improve error message checks with proper format
Update tests to use the canonical error message format with `error:`
prefix and file:section information. Add `--implicit-check-not=error:`
to ensure no unexpected errors are produced.
This commit focuses on "out of range" and "not aligned" errors.
[AMDGPU][Scheduler] Simplify scheduling revert logic (#177203)
When scheduling must be reverted for a region, the current
implementation re-orders non-debug instructions and debug instructions
separately; the former in a first pass and the latter in a second pass
handled by a generic machine scheduler helper whose state is tied to the
current region being scheduled, in turn limiting the revert logic to
only work on the active scheduling region.
This makes the revert logic work in a single pass for all MIs, and
removes the restriction that it works exclusively on the active
scheduling region. The latter enables future use cases such as reverting
scheduling of multiple regions at once.
Reapply "[VPlan] Detect and create partial reductions in VPlan. (NFCI) (#167851)"
This reverts commit d1e477b00b49c63ff4dd513eeb14a5b18bc055d7.
Recommit with extra checks making sure extends are VPWidenCastRecipes,
rejecting VPReplicateRecipes.
Original message:
As a first step, move the existing partial reduction detection logic to
VPlan, trying to preserve the existing code structure & behavior as
closely as possible.
With this, partial reductions are detected and created together in a
single step.
This allows forming partial reductions and bundling them up if
profitable together in a follow-up.
PR: https://github.com/llvm/llvm-project/pull/167851
[llvm-lipo] Fix handling of archives in universal binaries (#176448)
When extracting slices from a universal binary, llvm-lipo was not
handling the case where the slice is an archive.
Fixes #90156
[X86] getScalarMaskingNode - FIXUPIMM scalar ops take upper elements from second operand (#179101)
FIXUPIMMSS/SD instructions pass through the upper elements of the SECOND operand, not the first like most (2-op) instructions.
Fixes #179057
[Analysis] Add Intrinsics::CLMUL case to cost calculations to getIntrinsicInstrCost / getTypeBasedIntrinsicInstrCost (#176552)
This patch adds a case in getIntrinsicInstrCost and
getTypeBasedIntrinsicInstrCost in
llvm/include/llvm/CodeGen/BasicTTIImpl.h for Intrinsic::clmul. This
patch uses TLI->isOperationLegalOrCustom to check if the instruction is
cheap. If not cheap, it sums up the cost of the arithmetic operations
(AND, SHIFT, XOR) multiplied by the bit width.
Fixes #176354
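The cost rule described above can be sketched in standalone C++. This is not the code in BasicTTIImpl.h: `OpCosts`, `expandedClmulCost`, `clmulCost`, and the `CheapCost` parameter are hypothetical, and the value returned when the operation is legal or custom is an assumption, since the commit message does not state it. `clmulRef` is a reference carry-less multiply included only to show why the expansion scales with the bit width (one AND, shifts, and one XOR per bit).

```cpp
#include <cstdint>

// Reference 32-bit carry-less multiply: XOR together the shifted copies of A
// selected by the set bits of B. Bits above position 31 are discarded.
uint32_t clmulRef(uint32_t A, uint32_t B) {
  uint32_t R = 0;
  for (unsigned I = 0; I < 32; ++I) {
    uint32_t Mask = 0u - ((B >> I) & 1u);  // all ones if bit I of B is set
    R ^= (A << I) & Mask;
  }
  return R;
}

// Hypothetical per-operation costs; in LLVM these would come from the
// target's arithmetic cost queries.
struct OpCosts {
  unsigned And, Shift, Xor;
};

// Cost of the generic expansion: (AND + SHIFT + XOR) multiplied by bit width.
unsigned expandedClmulCost(unsigned BitWidth, OpCosts C) {
  return BitWidth * (C.And + C.Shift + C.Xor);
}

// LegalOrCustom models the result of TLI->isOperationLegalOrCustom.
// CheapCost is an assumed placeholder for the "cheap" case.
unsigned clmulCost(bool LegalOrCustom, unsigned BitWidth, OpCosts C,
                   unsigned CheapCost = 1) {
  return LegalOrCustom ? CheapCost : expandedClmulCost(BitWidth, C);
}
```

With unit costs for each operation, a 32-bit clmul that is neither legal nor custom would cost 32 * 3 = 96 under this sketch.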