[mlir][Interfaces] Add generic pattern for region inlining (#176641)
Add a new canonicalization pattern that inlines the body of acyclic
`RegionBranchOpInterface` ops. This pattern is a generalization and
replacement for the following existing patterns:
* `SingleBlockExecuteInliner`: inlines `scf.execute_region` ops with a
single block.
* `SimplifyTrivialLoops`: inlines / folds away `scf.for` ops with 0 or 1
iterations.
* `RemoveStaticCondition`: inlines `scf.if` ops with a static condition.
* `FoldConstantCase`: inlines `scf.index_switch` ops with a constant
operand.
Additionally, this new pattern is enabled for `scf.while` ops.
Loops with `scf.condition(%false)` are now inlined as well. (New test case
added.)
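As a rough illustration of the "0 or 1 iterations" case for `scf.for`, the following sketch classifies a loop with constant bounds. It is not MLIR code: `LoopShape` and `classifyLoop` are hypothetical names, and the sketch assumes a positive step and bounds small enough that `ub - lb` does not overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification used only for this sketch; not an MLIR API.
enum class LoopShape { ZeroIterations, OneIteration, ManyIterations };

// Classify `for (i = lb; i < ub; i += step)` with constant bounds.
// Assumption: step > 0 and `ub - lb` does not overflow.
LoopShape classifyLoop(int64_t lb, int64_t ub, int64_t step) {
  assert(step > 0 && "sketch only handles positive steps");
  if (lb >= ub)
    return LoopShape::ZeroIterations; // loop results are the init values
  if (step >= ub - lb)
    return LoopShape::OneIteration;   // body can be inlined with i = lb
  return LoopShape::ManyIterations;   // the loop has to stay
}
```

For example, `classifyLoop(0, 4, 4)` reports a single iteration, so the body could be inlined with the induction variable replaced by the lower bound.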
The new pattern looks for region branch ops with a single acyclic path
[3 lines not shown]
[NVPTX] fix illegal name for .extern .shared global variables (#173018)
For NVPTX, a global variable (GV) in AS(3) can be compiled to a `.extern
.shared` declaration in PTX. Because `.extern .shared` is not an "extern" in the
traditional sense of the word, it is not linked by name but
instead refers to the shared memory allocated at kernel launch.
Since the name does not matter, it is tempting to leave the GV unnamed.
The problem is that `nameUnamedGlobals` then assigns the global a name
that is invalid PTX. For non-extern globals, this is later
fixed by running the `NVPTXAssignValidGlobalNames` pass. However, that pass
deliberately leaves externs untouched, because changing the name of "traditional
externs" would cause linking issues down the road.
This MR treats `.extern .shared` in the same manner as non-extern
globals during `NVPTXAssignValidGlobalNames` to fix the invalid names
given by `nameUnamedGlobals`.
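To make the renaming step concrete, here is a minimal sketch of turning an arbitrary name into a valid PTX identifier. `sanitizePtxName` is a hypothetical helper, not the pass's actual code, and the `_$_` replacement string is an assumption chosen for illustration. The sketch relies only on the fact that PTX identifiers are built from letters, digits, `_` and `$`.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: replace every character that is not a letter, digit,
// '_' or '$' with the (assumed) marker "_$_".
std::string sanitizePtxName(const std::string &name) {
  std::string out;
  for (unsigned char c : name) {
    if (std::isalnum(c) || c == '_' || c == '$')
      out.push_back(static_cast<char>(c));
    else
      out += "_$_"; // e.g. '.' is not allowed inside a PTX identifier
  }
  return out;
}
```

With this sketch, a made-up name such as `anon.0` becomes `anon_$_0`, while a name that is already valid is returned unchanged.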
Co-authored-by: Kjetil Kjeka <kjetil at muybridge.com>
[AMDGPU] Allow amdgpu-waves-per-eu to lower target occupancy range (#168358)
`amdgpu-waves-per-eu` currently does not allow users to lower the target
occupancy range that the backend will try to achieve. For example, for a
kernel targeting gfx942 with default flat workgroup sizes and no LDS
usage, `AMDGPUSubtarget::getWavesPerEU` will by default produce the
range [4,10]. Setting `"amdgpu-waves-per-eu=M,N"` (N being optional)
will only have an effect if 4 <= M <= N <= 10, in which case the [M,N]
range will be produced instead. Advanced developers may in some cases
know that a specific kernel would not benefit from higher occupancies and
wish to communicate that to the backend, which could in turn make better
codegen decisions if it knows that increasing occupancy is not a
priority.
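The pre-change behavior described above can be sketched as follows. This is not the backend's code: `Range` and `oldWavesPerEU` are hypothetical names, and treating an omitted N as the default maximum is an assumption of the sketch.

```cpp
#include <optional>
#include <utility>

using Range = std::pair<unsigned, unsigned>; // {min, max} waves per EU

// Sketch of the old rule: the request [m, n] is honored only when it lies
// entirely inside the default range; otherwise the default is kept.
Range oldWavesPerEU(Range def, unsigned m, std::optional<unsigned> n) {
  // Assumption: a missing N falls back to the default maximum.
  unsigned reqMax = n.value_or(def.second);
  if (def.first <= m && m <= reqMax && reqMax <= def.second)
    return {m, reqMax};
  return def;
}
```

With the default range [4,10] from the example, a request of `2,8` is ignored under this rule because 2 lies below the default minimum, which is exactly the limitation the change addresses.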
This modifies the computation of the waves/EU range to enable this
behavior. User-provided minimum/maximum number of waves/EU are now able
to change the default waves/EU range almost arbitrarily, with only the
subtarget's specifications and the maximum occupancy induced by
workgroup size and LDS usage limiting the target occupancy range. In the
[7 lines not shown]
IR: Add !nofpclass metadata (#177140)
This adds metadata analogous to the nofpclass attribute,
asserting that a value does not belong to a given set of floating-point classes.
This allows the same information to be expressed if a function
argument is passed indirectly. The metadata uses the same bitmask encoding
as nofpclass.
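For reference, the nofpclass bitmask assigns one bit per floating-point class. The sketch below mirrors that encoding and checks a `double` against a mask. The names `FPClass`, `classify`, and `violates` are hypothetical, and the NaN case assumes the usual IEEE 754 binary64 layout in which bit 51 marks a quiet NaN.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// One bit per class, in the order used by the nofpclass bitmask.
enum FPClass : uint32_t {
  SNan = 1u << 0, QNan = 1u << 1, NegInf = 1u << 2, NegNormal = 1u << 3,
  NegSubnormal = 1u << 4, NegZero = 1u << 5, PosZero = 1u << 6,
  PosSubnormal = 1u << 7, PosNormal = 1u << 8, PosInf = 1u << 9
};

// Return the single class bit that describes x.
uint32_t classify(double x) {
  const bool neg = std::signbit(x);
  switch (std::fpclassify(x)) {
  case FP_NAN: {
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    // Assumption: bit 51 is the quiet bit of a binary64 NaN.
    return (bits & (1ull << 51)) ? QNan : SNan;
  }
  case FP_INFINITE:  return neg ? NegInf : PosInf;
  case FP_NORMAL:    return neg ? NegNormal : PosNormal;
  case FP_SUBNORMAL: return neg ? NegSubnormal : PosSubnormal;
  default:           return neg ? NegZero : PosZero; // FP_ZERO
  }
}

// True if x belongs to a class that the mask says must not occur.
bool violates(uint32_t mask, double x) { return (classify(x) & mask) != 0; }
```

A mask of `SNan | QNan` (value 3) corresponds to "no NaNs", and `NegInf | PosInf` (value 516) to "no infinities".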
I also think this should be allowed for stores to symmetrically handle
sret, but leave that for later.
Alternatively we could add a more expressive !fprange metadata,
but that would be much more complex. It's useful to match the attribute,
and more annotations can always be added.
Fixes #133560
[VPlan] Replace ComputeFindIVRes with ComputeRdxRes + cmp + sel (NFC) (#176672)
Replace ComputeFindIVResult with ComputeReductionResult plus an explicit
compare and select. This models finding the first/last induction more
explicitly and more simply: it boils down to a min/max reduction
followed by a compare and select against the sentinel value.
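The idea can be illustrated in scalar form. The sketch below is not VPlan code: `findLastIndex` is a hypothetical function that finds the last matching index with a max reduction over candidates, then uses a compare and select to map the sentinel back to the start value. Using the minimum `int64_t` as the sentinel is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Return the last index i with data[i] == needle, or `start` if none matches.
int64_t findLastIndex(const std::vector<int> &data, int needle, int64_t start) {
  // Assumed sentinel: smaller than any valid index.
  const int64_t sentinel = std::numeric_limits<int64_t>::min();
  int64_t rdx = sentinel;
  for (int64_t i = 0; i < static_cast<int64_t>(data.size()); ++i) {
    // Non-matching positions contribute the sentinel to the reduction.
    const int64_t candidate = (data[i] == needle) ? i : sentinel;
    rdx = std::max(rdx, candidate); // plain max reduction
  }
  // Explicit compare + select: the sentinel means "nothing was found".
  return (rdx != sentinel) ? rdx : start;
}
```

Finding the first index would use a min reduction with a sentinel larger than any valid index instead.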
PR: https://github.com/llvm/llvm-project/pull/176672
[NFCI] replace getValueType with new getGlobalSize query (#177186)
Returns uint64_t to simplify callers. The goal is to eventually replace
getValueType with this query, which should return the known minimum
reference-able size, as provided (instead of a Type) during create.
Additionally, the common isSized query would be replaced with an
isExactKnownSize query to test if that size is an exact definition.
[ProfCheck] Add LoopInterchange test to xfail list
LoopInterchange is not currently on in the default pipeline and we have
not done any work around profile propagation within the pass yet, so
disable the test for now to get the profcheck bot back to green.
Move John McCall to the inactive maintainers list (#177406)
While reaching out to folks for a maintainers list refresh, John asked
to step down due to other commitments. Thank you for all your help!
AMDGPU: Change ABI of 16-bit scalar values for gfx6/gfx7 (#175795)
Keep bf16/f16 values encoded as the low half of a 32-bit register,
instead of promoting to float. This avoids unwanted FP effects
from the fpext/fptrunc which should not be implied by just
passing an argument. This also fixes ABI divergence between
SelectionDAG and GlobalISel.
I've wanted to make this change for ages, and failed the last
few times. The main complication was the hack to return
shader integer types in SGPRs, which now needs to inspect
the underlying IR type.