[X86] canonicalizeShuffleWithOp - add handling for SHUFFLE(PSADBW(X,Y),PSADBW(Z,W)) -> PSADBW(SHUFFLE(X,Z),SHUFFLE(Y,W)) (#188072)
PSADBW takes vXi8 inputs and gives a vXi64 result so we need to tweak
the bitcasts (shuffle type checks will already ensure that the result
type isn't affected).
Minor improvement to #187447
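
The identity behind this fold can be checked with a small model. The sketch below is not the LLVM implementation: `psadbw` and `shuffle` are hypothetical pure-Python helpers that model a 128-bit PSADBW (two 64-bit result lanes from sixteen byte inputs) and a lane shuffle. It shows that shuffling the i64 results equals applying the same mask to 8-byte groups of the inputs.

```python
def psadbw(a, b):
    """Model of PSADBW: one sum of absolute byte differences per 8-byte group."""
    assert len(a) == len(b) and len(a) % 8 == 0
    return [
        sum(abs(x - y) for x, y in zip(a[i:i + 8], b[i:i + 8]))
        for i in range(0, len(a), 8)
    ]


def shuffle(lhs, rhs, mask, width=1):
    """Select `width`-element groups from the concatenation of lhs and rhs."""
    src = lhs + rhs
    out = []
    for m in mask:
        out.extend(src[m * width:(m + 1) * width])
    return out


# Four v16i8 inputs (values are arbitrary bytes chosen for illustration).
X = [(3 * i) % 256 for i in range(16)]
Y = [(255 - 5 * i) % 256 for i in range(16)]
Z = [(17 * i + 1) % 256 for i in range(16)]
W = [(11 * i + 200) % 256 for i in range(16)]

# Mask over i64 lanes: lane 1 of the first operand, lane 0 of the second.
mask = [1, 2]

shuffled_results = shuffle(psadbw(X, Y), psadbw(Z, W), mask)
# The same mask applied to 8-byte groups of the inputs (the "tweaked bitcasts").
folded = psadbw(shuffle(X, Z, mask, width=8), shuffle(Y, W, mask, width=8))
print(shuffled_results == folded)
```

The `width=8` argument stands in for the type change: one i64 result lane corresponds to eight i8 input elements.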
[AMDGPU] Add structural stall heuristic to scheduling strategies
Implements a structural stall heuristic that considers both resource
hazards and latency constraints when selecting instructions. In coexec,
this changes the pending queue from a binary “not ready to issue”
distinction into part of a unified candidate comparison. Pending
instructions still identify structural stalls in the current cycle, but
they are now evaluated directly against available instructions by stall
cost, making the heuristics both more intuitive and more expressive.
- Add getStructuralStallCycles() to GCNSchedStrategy that computes the
  number of cycles an instruction must wait due to:
  - Resource conflicts on unbuffered resources (from the SchedModel)
  - Sequence-dependent hazards (from GCNHazardRecognizer)
- Add getHazardWaitStates() to GCNHazardRecognizer that returns the number
  of wait states until all hazards for an instruction are resolved,
  providing cycle-accurate hazard information for scheduling heuristics.
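
A minimal sketch of the idea, under stated assumptions: the `Candidate` class, its fields, and the tie-break order are hypothetical and only model the description above. Combining the two stall sources with `max` is also an assumption (the instruction waits for whichever constraint clears last), not something the commit message confirms.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    resource_stall: int  # cycles until a needed unbuffered resource is free (illustrative input)
    hazard_wait: int     # wait states until all hazards are resolved (illustrative input)
    pending: bool        # True if the instruction sits in the pending queue


def structural_stall_cycles(c):
    # Assumption: the instruction must wait for the slower of the two constraints.
    return max(c.resource_stall, c.hazard_wait)


def pick(candidates):
    # Compare pending and available instructions in one flow: lowest stall
    # cost wins, and on a tie an available instruction is preferred.
    return min(candidates, key=lambda c: (structural_stall_cycles(c), c.pending))


queue = [
    Candidate("available_valu", resource_stall=0, hazard_wait=4, pending=False),
    Candidate("pending_salu", resource_stall=1, hazard_wait=0, pending=True),
]
print(pick(queue).name)
```

Here the pending instruction is chosen because its one-cycle stall is cheaper than the four wait states of the available one, which a binary ready/not-ready split would not express.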
[LoopVectorize] Fix an integer narrowing conversion in `getPredBlockCostDivisor(...)` (#187605)
`LoopVectorizationCostModel::getPredBlockCostDivisor(...)` may return
large `uint64_t` values that get coerced to an `unsigned` by
`VPCostContext::getPredBlockCostDivisor(...)`, which can cause division
by zero.
Fixes #187584
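
The failure mode can be modeled in a few lines. This is a sketch, not the LLVM code: `narrow_to_unsigned` is a hypothetical helper that mimics a C++ `uint64_t` to `unsigned` conversion, assuming `unsigned` is 32 bits wide.

```python
UINT32_MASK = 0xFFFFFFFF  # assumes a 32-bit `unsigned`


def narrow_to_unsigned(value):
    """Model the implicit uint64_t -> unsigned conversion: keep the low 32 bits."""
    return value & UINT32_MASK


def scaled_cost(cost, divisor):
    """Divide a block cost by the predicated-block divisor."""
    return cost // divisor


wide_divisor = 1 << 32                        # fits in uint64_t, not in 32 bits
narrowed = narrow_to_unsigned(wide_divisor)   # low 32 bits are all zero

try:
    scaled_cost(100, narrowed)
except ZeroDivisionError:
    print("divisor was truncated to zero")
```

Any multiple of 2^32 narrows to zero in this model, which is why the conversion can turn a valid divisor into a division by zero.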
[HLSL] handle hlslAttributedResourceType in init list code (#187813)
Handle HLSL Attributed Resource Type in the init list code. Treat it
like it's a scalar value.
Closes #187568
[MLIR][Mem2Reg] Add support for region control flow and SCF (#185036)
This PR adds support for region control-flow. Region control-flow and
CFG can be mixed together in the same program. See the [accompanying
RFC](https://discourse.llvm.org/t/rfc-support-region-control-flow-in-mem2reg/90082)
for some design considerations.
Beyond the considerations in the RFC, a few minor changes were
introduced:
- Calling the visitor hook for defined values is now deferred to the end
of promotion.
- The lazy creation of default values has been moved to the places where
it happens to prepare for a future change where it is actually lazy.
Documentation about it not working as intended for now was also added.
All SCF operations are supported, including `forall` and `parallel`,
which is pretty cool I think.
[11 lines not shown]
[AMDGPU][GlobalISel] Add RegBankLegalize rules for permlane16/permlanex16 (#187906)
Add RegBankLegalize rules for the amdgcn_permlane16
and amdgcn_permlanex16 intrinsics. Both intrinsics
are sources of divergence, so only the divergent
case is needed: result, old, and src0 map to VGPR,
while src1 and src2 are SGPR with ReadFirstLane if
divergent.
Update the GISEL RUN lines in llvm.amdgcn.permlane.ll
and permlane16_opsel.ll to use -new-reg-bank-select,
and regenerate check lines. The v8i16 test cases now
produce identical SDAG/GISEL output so their checks
are unified.
pyzfs: update license tags/classifiers
The standard for package license metadata[1] is an SPDX identifier in
the `license` field and that's all. So, update that, remove the deprecated
license classifier, and add a tag at the top of the file for
spdxcheck to find.
1. https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#license
Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18356
[LifetimeSafety] Fix compiler crash with `static operator()` (#187853)
This PR removes the first argument from the `Args` list (which is `S()`)
before doing lifetime safety checks to ensure correct indexing.
It also adds a test to prevent regressions in the future.
Fixes #187426
<details>
<summary>Bug details</summary>
When calling a `static operator()` directly (with `S()(...)`), we also
store `S()` in `Args` as the first argument, so all indexing is off by
one.
The most interesting part is that `S::operator()(...)` works correctly
and does not add `S()` at the beginning of the argument list, so it does
not crash during lifetime checks.
This solution is probably not the cleanest, but I would love to hear
feedback on where to put it!
</details>
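
The off-by-one described in the bug details can be modeled as follows. This is only an illustration: `pair_args_with_params` and its `has_implicit_object_arg` flag are hypothetical names, not Clang APIs.

```python
def pair_args_with_params(args, params, has_implicit_object_arg):
    """Pair call arguments with callee parameters by index.

    Hypothetical model: `has_implicit_object_arg` is True when the argument
    list starts with the object expression (`S()` in `S()(a, b)`) even though
    the static callee has no parameter for it.
    """
    if has_implicit_object_arg:
        args = args[1:]  # drop `S()` so the indices line up again
    if len(args) != len(params):
        raise IndexError("argument/parameter count mismatch")
    return list(zip(args, params))


# `S()(a, b)`: the object expression is stored first, so it must be skipped.
print(pair_args_with_params(["S()", "a", "b"], ["x", "y"], True))

# `S::operator()(a, b)`: no object expression is stored, nothing to skip.
print(pair_args_with_params(["a", "b"], ["x", "y"], False))
```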
Use the new M68K_EC_VAC and M68K_EC_PAC options, based on configured
model.
As a transitional step, ensure that the new options are consistent with
the legacy CACHE_HAVE_{PAC,VAC} defines.
AMDGPU: Codegen for v_dual_dot2acc_f32_f16/bf16 from VOP3
Codegen for v_dual_dot2acc_f32_f16/bf16 for targets that only have the VOP3
version of the instruction.
Since there is no VOP2 version, introduce a temporary MIR DOT2ACC pseudo
that is selected when there are no src_modifiers. This DOT2ACC pseudo
has src2 tied to dst (like the VOP2 version); PostRA pseudo expansion will
restore the pseudo to the VOP3 version of the instruction.
CreateVOPD will recognize such a VOP3 pseudo and generate v_dual_dot2acc.
[SLP]Use reduction root explicitly from reduction analysis to avoid non-determinism
Initially, the reduction root was detected using the last member of the UserIgnoreList set, which is unordered. Better to use the reduction root explicitly to avoid non-determinism in the reduction parent block, which may cause incorrect scale factor estimation for the reduction cost.
release: Remove not-NO_ROOT cases
We always use NO_ROOT for release artifact builds, so remove the
alternate code paths.
For the first step we set NO_ROOT unconditionally in cases that invoke
submakes, and turn NO_ROOT being unset into an error in lower-level
targets so that we can catch potential out-of-tree build scripts (or
missed in-tree cases) that expect to run not-NO_ROOT builds. The second
step will be to remove those entirely.
Reviewed by: cperciva
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D54179
(cherry picked from commit 54e006369c9aab4f3a22f026eb6924c0f9cafda8)
release: Use make's `:H` rather than `/..`
In general we want to strip subdir components, rather than appending
`..`s.
Reviewed by: lwhsu
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D54373
(cherry picked from commit 3949c2b8c4691a6dff8be7b38805d56faab91187)
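
For readers unfamiliar with the modifier, the difference can be illustrated outside of make. The sketch below uses Python's `posixpath` as a rough analogue: `head` approximates what `:H` does to a single path (drop the last component), and `dotdot` models appending `/..`. Both helper names and the example path are made up for illustration.

```python
import posixpath


def head(path):
    """Rough analogue of make's :H modifier: drop the last path component."""
    return posixpath.dirname(path) or "."


def dotdot(path):
    """The alternative: keep the component and append `..`."""
    return path + "/.."


p = "/usr/src/release/amd64"
print(head(p))    # /usr/src/release
print(dotdot(p))  # /usr/src/release/amd64/..
```

Both spellings usually name the same directory, but the `:H` form yields a shorter path and does not require the stripped component to exist.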
[libc] Support Windows test executables in LibcTest lit format (#188057)
Updated LibcTest to handle Windows test executables:
* Added support for .exe extensions when identifying test executables.
* Skipped the executable bit check on Windows as it is not applicable.
* Updated .params file discovery to look for both <test>.exe.params and
<test>.params.
This allows running libc tests on Windows hosts.
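
A sketch of the discovery rules listed above, not the actual LibcTest code: `is_test_executable` and `params_candidates` are hypothetical helpers, and the order in which the two `.params` names are probed is an assumption.

```python
import os


def is_test_executable(path, on_windows):
    """Decide whether `path` looks like a runnable test binary."""
    if on_windows:
        # The executable bit is not meaningful on Windows; go by extension.
        return path.lower().endswith(".exe")
    return os.path.isfile(path) and os.access(path, os.X_OK)


def params_candidates(test_path):
    """Return the .params file names to probe for a test executable."""
    candidates = [test_path + ".params"]
    root, ext = os.path.splitext(test_path)
    if ext.lower() == ".exe":
        candidates.append(root + ".params")
    return candidates


print(params_candidates("build/foo_test.exe"))
print(params_candidates("build/foo_test"))
```

For `build/foo_test.exe` this yields both `build/foo_test.exe.params` and `build/foo_test.params`; for a path without `.exe` only the single name is tried.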
[MLIR][TableGen] Make optional enum parser not consume the token when it is not matched (#188008)
Previously the optional parser would consume the token even when it
failed to match a value of the enum and prevented parsers later in the
op syntax from having an attempt. This PR changes that so that the token
is consumed only when the parsing succeeds. This change is made to the
emitted `FieldParser<std::optional<T>>` for enums.
This, for example, allows having a simple list of default valued props
in the assembly format without needing decorations around them. This
mimics the behaviour that is emitted for `DefaultValuedAttribute` when
it is used with `EnumAttr`.
This PR also adds `parseOptionalString` variant with an allow-list
argument as `parseOptionalKeyword` has and adds
`parseOptionalKeywordOrString` allow-list variant which combines these
two into a single utility wrapper. These methods do not consume the
token unless it is from the allow-list.
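
The peek-then-consume behaviour can be sketched with a toy token stream. This is not the MLIR parser API: `Tokens` and `parse_optional_keyword` are hypothetical stand-ins, and the two enums are invented for the example.

```python
class Tokens:
    """Minimal token stream with a cursor."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None


def parse_optional_keyword(stream, allowed):
    """Consume the next token only if it is in the allow-list."""
    tok = stream.peek()
    if tok is not None and tok in allowed:
        stream.pos += 1
        return tok
    return None  # no match: the token stays available for later parsers


COLORS = {"red", "green"}
SHAPES = {"circle", "square"}

stream = Tokens(["square"])
color = parse_optional_keyword(stream, COLORS) or "red"     # default-valued
shape = parse_optional_keyword(stream, SHAPES) or "circle"  # default-valued
print(color, shape)
```

Because the first parser leaves `square` in place when it fails to match a color, the second parser still sees it. That is what lets a plain list of default-valued properties work without extra syntax around each one.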
[AMDGPU] Add ML-oriented coexec scheduler selection and queue handling (#169616)
This patch adds the initial coexec scheduler scaffold for machine
learning workloads on gfx1250.
It introduces function and module-level controls for selecting the
AMDGPU preRA and postRA schedulers, including an `amdgpu-workload-type`
module flag that maps ML workloads to coexec preRA scheduling and a nop
postRA scheduler by default.
It also updates the coexec scheduler to use a simplified top-down
candidate selection path that considers both available and pending
queues through a single flow, setting up follow-on heuristic work.
[WebAssembly] Add initial shuffle cost capabilities (#187596)
Fixes #178940
Fixes the case where a manual splat of i16x8 or i8x16 is not recognized; the i32x4 case still remains.