[SLP] Fix extract-cost scale using NCD of all external-user sites
ExtractCostCalculated deduplicates by scalar, so only the first
ExternalUser seen for a scalar determines the scale. The cost therefore
depends on IR block ordering via LLVM's reverse-insertion use-list order.
Add a pre-pass computing ScalarToExtractBlock, which maps each scalar to
the nearest common dominator (NCD) of all its effective extract sites.
For PHI users inside a loop the effective site is the incoming block; for
PHI users outside all loops it is the PHI's own block (scale = 1). The
extract cost is then scaled by getLoopNestScale of the NCD block, which
makes the cost independent of use-list order.
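
The pre-pass described above can be sketched on a toy CFG. This is an
illustration only, not the code in the patch: ToyBlock, ToyUser,
effectiveSite, extractBlockFor and toyLoopNestScale are hypothetical
names, the dominator tree is a hand-built array rather than LLVM's
DominatorTree, non-PHI users are assumed to extract in their own block,
and the scale is assumed to be a fixed trip count raised to the loop
depth.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy CFG node (hypothetical, not an LLVM type).
struct ToyBlock {
  int IDom;      // index of the immediate dominator, -1 for the entry block
  int DomDepth;  // depth in the dominator tree (entry = 0)
  int LoopDepth; // number of loops containing this block
};

// Toy external user of a vectorized scalar (hypothetical).
struct ToyUser {
  bool IsPhi;
  int Parent;        // block containing the user
  int IncomingBlock; // for PHIs: block the scalar flows in from
};

// PHI inside a loop: the incoming block. PHI outside all loops: its own
// block. Non-PHI users: their own block (an assumption of this sketch).
int effectiveSite(const std::vector<ToyBlock> &Blocks, const ToyUser &U) {
  if (U.IsPhi && Blocks[U.Parent].LoopDepth > 0)
    return U.IncomingBlock;
  return U.Parent;
}

// Walk the deeper block up the dominator tree until both meet.
int nearestCommonDominator(const std::vector<ToyBlock> &Blocks, int A, int B) {
  while (A != B) {
    if (Blocks[A].DomDepth < Blocks[B].DomDepth)
      std::swap(A, B);
    A = Blocks[A].IDom;
    assert(A >= 0 && "blocks must share the entry block as a dominator");
  }
  return A;
}

// Fold all extract sites of one scalar into a single block. The result
// does not depend on the order of Sites.
int extractBlockFor(const std::vector<ToyBlock> &Blocks,
                    const std::vector<int> &Sites) {
  assert(!Sites.empty() && "a scalar needs at least one extract site");
  int NCD = Sites.front();
  for (int S : Sites)
    NCD = nearestCommonDominator(Blocks, NCD, S);
  return NCD;
}

// Assumed scale model: a fixed trip count per enclosing loop.
uint64_t toyLoopNestScale(const ToyBlock &B, uint64_t AssumedTripCount) {
  uint64_t Scale = 1;
  for (int I = 0; I < B.LoopDepth; ++I)
    Scale *= AssumedTripCount;
  return Scale;
}
```

With sites in an inner-loop body and an outer-loop latch, the NCD is the
inner-loop header whichever site is visited first, so the scale is the
same for any use-list order.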
Fixes #199548
Reviewers: bababuck, RKSimon, hiraditya
Pull Request: https://github.com/llvm/llvm-project/pull/199962
[MLIR][GPU][NFC] Reformat GPU target attachment tests (#199339)
Reformat attach-targets.mlir so each GPU module has a labeled check
block, split target-attachment RUN lines, and keep comments tied to the
expected target-specific matches.
[SLP] Fix extract-cost scale for LCSSA-phi external users in nested loops
getScaleToLoopIterations() used U->getParent() for all PHI-node external
users. For an LCSSA phi at an inner-loop exit that is still inside an
outer loop, this gave the outer-loop scale instead of the inner*outer
scale. Because ExtractCostCalculated deduplicates by scalar, only the
first ExternalUser determines the scale, making the cost dependent on
use-list ordering (and thus on .ll block ordering).
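
The order dependence can be reproduced with a toy model of first-wins
deduplication. This is a sketch under stated assumptions, not the SLP
vectorizer's code: ToyExternalUse and firstWinsExtractCost are
hypothetical names, and the per-use scales are supplied directly instead
of being derived from loop info.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// One external use of a scalar, with the scale its block would imply
// (hypothetical type for illustration).
struct ToyExternalUse {
  std::string Scalar;
  uint64_t Scale;
};

// Charge each scalar once, using the scale of whichever use comes first.
// Later uses of the same scalar are skipped, so the total depends on the
// order of Uses.
uint64_t firstWinsExtractCost(const std::vector<ToyExternalUse> &Uses,
                              uint64_t BaseCost) {
  std::set<std::string> Seen;
  uint64_t Total = 0;
  for (const ToyExternalUse &U : Uses) {
    if (!Seen.insert(U.Scalar).second)
      continue; // scalar already charged
    Total += BaseCost * U.Scale;
  }
  return Total;
}
```

Two uses of the same scalar with scales 10 and 100 yield a total of 10
or 100 times BaseCost depending only on which use is visited first.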
Reviewers: hiraditya, RKSimon, bababuck
Pull Request: https://github.com/llvm/llvm-project/pull/199954
[SLP] Recompute copyable operand deps for duplicate copyable nodes
A bundle may duplicate a previously built node that has copyable elements
(same schedulable instructions, different copyable lane) while the parent
node also has copyable elements. An operand modeled as a copyable element
in the previous node is then used directly by the new node, which is not
registered in the tree yet. Recomputing that operand's direct
dependencies at this point misses the direct use, so the scheduler
decrements the operand more times than its dependency count and trips the
unscheduled-deps assertion.
Defer recomputation of such operand dependencies via
RecalcCopyableOperandDeps and redo it at the next bundle scheduling, when
the duplicate node is part of the tree. Also clear and recompute the
direct dependencies of bundles whose user is a gather node referenced
through EdgeIdx == UINT_MAX in scheduleBlock, so combined gather
sub-entries get correct dependencies against the full tree.
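
The effect of deferring the recomputation can be illustrated with a toy
dependency counter. This is a loose sketch, not the scheduler in the
patch: ToyDeps and its members are hypothetical, a dependency count is
assumed to be the number of already-registered nodes that use the operand
directly, and PendingRecalc only stands in for the role the message gives
to RecalcCopyableOperandDeps.

```cpp
#include <map>
#include <set>
#include <vector>

// Toy dependency bookkeeping (hypothetical, for illustration only).
struct ToyDeps {
  std::map<int, std::vector<int>> UsesOf; // operand -> nodes using it directly
  std::set<int> RegisteredNodes;          // nodes already part of the tree
  std::map<int, int> DepCount;            // operand -> counted dependencies
  std::set<int> PendingRecalc;            // operands whose recount is deferred

  // Count only the direct users that are registered in the tree so far.
  int countRegisteredUses(int Op) const {
    auto It = UsesOf.find(Op);
    if (It == UsesOf.end())
      return 0;
    int Count = 0;
    for (int Node : It->second)
      if (RegisteredNodes.count(Node))
        ++Count;
    return Count;
  }

  // Immediate recount: misses users that are not registered yet.
  void recalcNow(int Op) { DepCount[Op] = countRegisteredUses(Op); }

  // Deferred recount: remember the operand and redo it later.
  void deferRecalc(int Op) { PendingRecalc.insert(Op); }

  // Called at the start of the next bundle scheduling, when the
  // duplicate node has been registered.
  void beginBundleScheduling() {
    for (int Op : PendingRecalc)
      recalcNow(Op);
    PendingRecalc.clear();
  }
};
```

In this model an immediate recount made before the duplicate node is
registered undercounts the operand, while the deferred recount sees both
users.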
Reviewers:
Pull Request: https://github.com/llvm/llvm-project/pull/200564