[CIR] Implement non-odr use of reference type lowering (#185720)
This is used somewhat rarely, but is a pretty simple emission of
pointers, and ends up using infrastructure we already have.
Additionally, this is the first use of `getNaturalTypeAlignment` that
uses the `pointee` argument, so this adds the implementation for it.
That implementation includes some alignment work for CXXRecordDecls,
which this patch implements as well.
[CIR] Implement 'builtin-addressof' for 'getPointerWithAlignment' (#185684)
The 'getPointerWithAlignment' is really only called when evaluating
arguments for builtins, so the test is a touch weird as it tests through
bcopy. However, this shows up in some headers, so it is important that
we support this.
This patch just adds the implementation, which mirrors classic-codegen,
except that we don't generate TBAA.
[CIR] Implement deferred V-Table emission (#185655)
We are currently only emitting vtables that have an 'immediate' need to
be emitted. The rest we are supposed to add to a list and emit at the
end of the translation unit if necessary. This patch implements that
infrastructure.
The test added is from classic-codegen, where it came in at the same
time as deferred vtable emission. It only works with deferred vtable
emission, and although it does test the deferred emission, it tests
quite a bit more than that. Since it came in with the same
functionality in classic codegen, it seemed to make sense to bring it
in here too.
[mlir][dialect-conversion] Fix OOB crash in convertFuncOpTypes for funcs with extra block args (#185060)
Some function ops (e.g., gpu.func with workgroup memory arguments) have
more entry block arguments than their FunctionType has inputs. The
workgroup memory arguments are not part of the public function signature
but are present as additional block arguments.
`convertFuncOpTypes` previously created a `SignatureConversion` sized
only for `type.getNumInputs()`, then called `applySignatureConversion`
on the entry block. When the block had more arguments (e.g., workgroup
args), the loop in `applySignatureConversion` would call
`getInputMapping(i)` with out-of-bounds indices, causing an assertion
failure in `SmallVector::operator[]`.
Fix this by:
1. Sizing the `SignatureConversion` for all entry block arguments.
2. Adding identity mappings for extra block args beyond the function
type inputs.
3. Using only the converted function-type-input types when updating the
[5 lines not shown]
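The sizing problem can be illustrated with a toy model that does not
use MLIR at all. `ToySignatureConversion` and `buildConversion` below
are hypothetical stand-ins, not the real `SignatureConversion` API; the
sketch only shows why the mapping has to cover every entry block
argument and how identity mappings for the extra arguments keep the
lookup in bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a signature conversion: one slot per original
// entry block argument, holding the new argument index (-1 = no mapping yet).
struct ToySignatureConversion {
  explicit ToySignatureConversion(std::size_t numOrigArgs)
      : mappings(numOrigArgs, -1) {}

  void addMapping(std::size_t origIdx, int newIdx) {
    mappings.at(origIdx) = newIdx;
  }

  // at() throws std::out_of_range where the toy "old" sizing would have
  // indexed past the end of the vector.
  int getInputMapping(std::size_t i) const { return mappings.at(i); }

  std::vector<int> mappings;
};

// Size the conversion for ALL block arguments, not just the function-type
// inputs, and give the extra (e.g. workgroup) arguments identity mappings.
inline ToySignatureConversion buildConversion(std::size_t numFuncInputs,
                                              std::size_t numBlockArgs) {
  ToySignatureConversion conv(numBlockArgs);
  // Function-type inputs: in this toy they map 1:1; real code would record
  // the converted types here.
  for (std::size_t i = 0; i < numFuncInputs; ++i)
    conv.addMapping(i, static_cast<int>(i));
  // Extra block arguments beyond the function-type inputs: identity.
  for (std::size_t i = numFuncInputs; i < numBlockArgs; ++i)
    conv.addMapping(i, static_cast<int>(i));
  return conv;
}
```

With two function inputs and four block arguments, looking up index 3
succeeds, whereas a conversion sized for only two entries would go out
of bounds on the same lookup.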
[mlir][scf] Fix crash in extractFixedOuterLoops with iter_args loops (#184106)
The stripmineSink helper splices loop body operations into a new inner
scf.for that has no iter_args. When the target loop carries iter_args,
values yielded by the spliced body are moved inside the inner loop, but
the outer loop's yield terminator still references those values,
creating an SSA invariant violation. In debug builds this triggers the
assertion
use_empty() && "Cannot destroy a value that still has uses!"
when the outer RewriterBase tries to erase the now-broken operations.
Fix: in extractFixedOuterLoops, skip the strip-mining transformation if
any of the collected perfectly-nested loops have iter_args.
Add a regression test to parametric-tiling.mlir.
Fixes #129044
Assisted-by: Claude Code
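The guard described above amounts to a simple pre-check over the
collected loop nest. A minimal sketch, assuming a hypothetical `ToyLoop`
record in place of `scf.for` (this is not the MLIR API):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an scf.for op: only the number of loop-carried
// values (iter_args) matters for this check.
struct ToyLoop {
  unsigned numIterArgs;
};

// Returns true when strip-mining may proceed, i.e. when none of the
// perfectly nested loops carries iter_args.
inline bool canStripMine(const std::vector<ToyLoop> &nest) {
  return std::none_of(nest.begin(), nest.end(), [](const ToyLoop &loop) {
    return loop.numIterArgs != 0;
  });
}
```

A nest where any loop has iter_args is left untouched rather than
transformed into invalid IR.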
[MLIR] Fix crash in ValueBoundsConstraintSet for non-entry block args (#185048)
When two vector transfer ops share a non-entry block argument as an
index (e.g., in a loop with unstructured control flow), calling
`ValueBoundsConstraintSet::areEqual` on those values caused a crash.
The first `populateConstraints` call would insert the block argument
into the constraint set. The second call found it already mapped and
called `getPos`, which hit an assert requiring the value to be either an
OpResult or an entry-block argument.
Fix with two changes:
1. In `insert()`, suppress adding non-entry block arguments to the
worklist. `ValueBoundsOpInterface` cannot derive bounds for such values,
so the worklist push was a no-op and triggered the re-entrant `getPos`
call.
2. Remove the overly conservative assert in `getPos`. Looking up a
previously inserted non-entry block argument is valid; the assert was
preventing legitimate use after the value had already been inserted.
[3 lines not shown]
[X86] Add mayLoad/mayStore to legacy instructions CMPS/LODS/MOVS/SCAS/STOS (#185689)
When LLVM is used to disassemble instructions, legacy X86 string
instructions don't report memory accesses with mayLoad and mayStore.
Note that INS and OUTS may also need such flags, but I'm not totally
sure which one.
[VPlan] Handle FindLast in VPIRFlags::printFlags (#185857)
Noticed this when -vplan-print-after-all crashed on a find-last
reduction. We don't yet return an opcode for it because there's no
in-loop reduction.
[X86] Optimized ADD + ADC to ADC (#173543)
This patch folds an `adc` followed by an `add` into a single `adc` instruction when adding constants.
Fixes #173408
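The arithmetic identity behind this kind of fold can be checked in
isolation. The sketch below models only the 32-bit result value with
wrapping arithmetic; it does not model flags or the operand forms the
patch actually matches, and the helper names are made up for
illustration.

```cpp
#include <cassert>
#include <cstdint>

// Models the result of `adc`: a + b + carry-in, wrapping modulo 2^32.
inline std::uint32_t adc(std::uint32_t a, std::uint32_t b, bool carryIn) {
  return a + b + (carryIn ? 1u : 0u);
}

// Two instructions: adc with constant c1, then add of constant c2.
inline std::uint32_t adcThenAdd(std::uint32_t x, std::uint32_t c1,
                                std::uint32_t c2, bool carryIn) {
  return adc(x, c1, carryIn) + c2;
}

// One instruction: adc with the combined constant c1 + c2.
inline std::uint32_t foldedAdc(std::uint32_t x, std::uint32_t c1,
                               std::uint32_t c2, bool carryIn) {
  return adc(x, c1 + c2, carryIn);
}
```

Because unsigned addition is associative modulo 2^32, both forms
produce the same value for every input, including ones that wrap.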
libclc: Add frexp_exp utility function
Many functions want to extract the exponent and
currently rely on bithacking to do it. These can be
better handled with frexp. AMDGPU has a dedicated
instruction for each of the frexp return values. Other
targets could override this to do the bithacking (though
they would be better off teaching codegen to optimize
frexp with a discarded output).
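For comparison, here is a rough C++ analogue of the two approaches
(libclc itself is OpenCL C, and these function names are hypothetical,
not the utility added by the patch). The bit-hacking variant is only
valid for normal, finite, nonzero floats, which is part of why frexp is
the more robust choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Exponent via frexp: x == m * 2^e with 0.5 <= |m| < 1.
inline int exponentViaFrexp(float x) {
  int e = 0;
  (void)std::frexp(x, &e); // mantissa is discarded on purpose
  return e;
}

// Bit-hacking equivalent: read the biased exponent field of an IEEE-754
// binary32 value. Only correct for normal, finite, nonzero inputs.
inline int exponentViaBits(float x) {
  std::uint32_t bits = 0;
  std::memcpy(&bits, &x, sizeof bits);
  int biased = static_cast<int>((bits >> 23) & 0xFFu);
  // Bias is 127; frexp's mantissa range [0.5, 1) shifts the result by one.
  return biased - 126;
}
```

For inputs such as 8.0f or 0.75f both functions agree; they diverge on
zero, denormals, infinities, and NaNs, which the bit-field version does
not handle.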
[StackSlotColoring] Check for zero stack slot size in RemoveDeadStores (#182673)
The default implementations of the methods isLoadFromStackSlot() and
isStoreToStackSlot() used in StackSlotColoring::RemoveDeadStores() set
the number of bytes loaded from the stack (MemBytes) to zero to indicate
that the value is unknown. This means that
StackSlotColoring::RemoveDeadStores() must abort if the size is zero;
otherwise the stack slot size check doesn't mean anything.
As backends that use this are required to override the default
implementations, this should not impose any degradation of the code.
As the registers also must match in
StackSlotColoring::RemoveDeadStores() for the store to be optimized away,
there is only a small risk of this being a real bug.
---------
Co-authored-by: Karl-Johan Karlsson <karl-johan.karlsson at ericsson.com>
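To see why a zero size defeats the check, consider a toy comparison of
the byte counts reported for a load and a store. `sizesKnownAndEqual` is
a hypothetical helper; the exact condition used in
StackSlotColoring::RemoveDeadStores() may differ.

```cpp
#include <cassert>

// Toy version of the size check. A byte count of 0 means "unknown", so two
// unknown sizes must not be treated as a match even though 0 == 0.
inline bool sizesKnownAndEqual(unsigned loadBytes, unsigned storeBytes) {
  if (loadBytes == 0 || storeBytes == 0)
    return false; // unknown size: bail out instead of comparing
  return loadBytes == storeBytes;
}
```

Without the early return, two accesses of unknown size would compare
equal and the store could be removed on the basis of no information.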
[MLIR][Arith] Add canonicalization rules for int-to-float of integer extension (#185386)
Three patterns are valid but were missing:
1. `sitofp(extsi(x)) → sitofp(x)`: extsi preserves the sign and value,
so it represents the same signed integer as x.
2. `uitofp(extui(x)) → uitofp(x)`: same reasoning as above, but for
unsigned extension.
3. `sitofp(extui(x)) → uitofp(x)`: extui zero-extends, so the extended
value is always non-negative. For non-negative integers, sitofp and
uitofp produce the same result, meaning we could replace the left
expression by `uitofp(extui(x))`. At this point rule 2 above can be
used to simplify further to `uitofp(x)`.
All three rewrites have been verified with Alive2.
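As a small sanity check (not a substitute for the Alive2 proofs), the
three rewrites can be tested exhaustively for one concrete case, an
8-bit value extended to 32 bits and converted to `float`. The function
name is made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Checks the three rewrites for every 8-bit pattern x, with i8 -> i32
// extension and conversion to float.
inline bool allRewritesHoldForI8() {
  for (int raw = 0; raw < 256; ++raw) {
    std::uint8_t ux = static_cast<std::uint8_t>(raw); // x read as unsigned
    std::int8_t sx = static_cast<std::int8_t>(ux);    // x read as signed

    std::int32_t extsi = static_cast<std::int32_t>(sx);   // sign-extend
    std::uint32_t extui = static_cast<std::uint32_t>(ux); // zero-extend

    // 1. sitofp(extsi(x)) == sitofp(x)
    if (static_cast<float>(extsi) != static_cast<float>(sx))
      return false;
    // 2. uitofp(extui(x)) == uitofp(x)
    if (static_cast<float>(extui) != static_cast<float>(ux))
      return false;
    // 3. sitofp(extui(x)) == uitofp(x): reinterpret the zero-extended
    //    value as signed before converting.
    if (static_cast<float>(static_cast<std::int32_t>(extui)) !=
        static_cast<float>(ux))
      return false;
  }
  return true;
}
```

Rule 3 works here because a zero-extended 8-bit value never sets the
sign bit of the 32-bit result, so the signed and unsigned readings
coincide.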
[flang][OpenMP][DoConcurrent] Emit declare mapper for records (#179936)
Extends `do concurrent` device support by emitting compiler-generated
declare mapper ops for live-ins whose types are record types and have
allocatable members.
[mlir][affine] Fix crash in affine-super-vectorize for index constants inside loops (#184614)
When an arith.constant of index type is defined inside the loop body
being vectorized, vectorizeConstant creates a vector<Nxindex> constant
and registers it as the vector replacement. However,
getScalarValueReplacementsFor (used by vectorizeAffineStore to compute
indices for vector.transfer_write) looks only in the scalar replacement
map. With no scalar replacement registered for the index constant, it
falls back to the original scalar value, which is erased when the scalar
loop is cleaned up. This results in an "operation destroyed but still has
uses" crash.
Fix: when vectorizeConstant processes an index-typed constant, also
create a new scalar constant in the vector loop body and register it as
the scalar replacement. This ensures that memory operation index
computation can find a live value in the vectorized IR.
Fixes #122213
Assisted-by: Claude Code
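The two-map lookup described above can be modeled with plain
containers. Everything in this sketch (`ToyVectorizationState`, the
string value names, the `_vec` and `_scalar` suffixes) is hypothetical
and only mirrors the shape of the fix, not the actual vectorizer code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model of the vectorizer's bookkeeping: values are just names.
struct ToyVectorizationState {
  std::map<std::string, std::string> vectorReplacement;
  std::map<std::string, std::string> scalarReplacement;

  // Mirrors the described fallback: with no scalar replacement registered,
  // the original (soon to be erased) value is returned.
  std::string getScalarReplacementFor(const std::string &value) const {
    auto it = scalarReplacement.find(value);
    return it == scalarReplacement.end() ? value : it->second;
  }
};

// The fix, in toy form: for an index-typed constant, register a fresh scalar
// copy in addition to the vector replacement.
inline void vectorizeIndexConstant(ToyVectorizationState &state,
                                   const std::string &constant) {
  state.vectorReplacement[constant] = constant + "_vec";
  state.scalarReplacement[constant] = constant + "_scalar";
}
```

After `vectorizeIndexConstant`, index computation for memory operations
resolves to the live scalar copy instead of the original constant.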
[IR] Split Br into UncondBr and CondBr (#184027)
BranchInst currently represents both unconditional and conditional
branches. However, these are quite different operations that are often
handled separately. Therefore, split them into separate opcodes and
classes to allow distinguishing these operations in the type system.
This also slightly improves compile-time performance.
[AMDGPU] Set preferred function alignment based on icache geometry (#183064)
Non-entry functions were unconditionally aligned to 4 bytes with no
architecture-specific preferred alignment, and setAlignment() was used
instead of ensureAlignment(), overwriting any explicit IR attributes.
Add instruction cache line size and fetch alignment data to GCNSubtarget
for each generation (GFX9: 64B/32B, GFX10: 64B/4B, GFX11+: 128B/4B). Use
this to call setPrefFunctionAlignment() in SITargetLowering, aligning
non-entry functions to the cache line size by default. Change
setAlignment to ensureAlignment in AMDGPUAsmPrinter so explicit IR align
attributes are respected.
Empirical thread trace analysis on gfx942, gfx1030, gfx1100, and gfx1200
showed that only GFX9 exhibits measurable fetch stalls when functions
cross the 32-byte fetch window boundary. GFX10+ showed no alignment
sensitivity. A hidden option -amdgpu-align-functions-for-fetch-only is
provided to use the fetch granularity instead of cache line size.
Assisted-by: Claude Opus
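The difference between overwriting and raising an alignment, plus the
per-generation numbers quoted above, can be sketched as follows.
`ToyFunction`, `Gen`, and `geometryFor` are hypothetical names for this
illustration; only the byte values come from the description.

```cpp
#include <algorithm>
#include <cassert>

// Toy function object: setAlignment overwrites, ensureAlignment only raises.
struct ToyFunction {
  unsigned align = 1;
  void setAlignment(unsigned a) { align = a; }
  void ensureAlignment(unsigned a) { align = std::max(align, a); }
};

enum class Gen { GFX9, GFX10, GFX11Plus };

struct ICacheGeometry {
  unsigned cacheLineBytes; // instruction cache line size
  unsigned fetchBytes;     // fetch alignment
};

// Values as listed in the commit description.
inline ICacheGeometry geometryFor(Gen gen) {
  switch (gen) {
  case Gen::GFX9:
    return {64, 32};
  case Gen::GFX10:
    return {64, 4};
  case Gen::GFX11Plus:
    return {128, 4};
  }
  return {0, 0}; // unreachable for valid enumerators
}
```

With an explicit alignment of 256 on a function, `ensureAlignment(4)`
leaves it at 256, while `setAlignment(4)` would silently drop it to 4.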
[X86] LowerINTRINSIC_W_CHAIN - ensure the X86ISD::CMPCCXADD X86CondCode is a i8 target constant (#185856)
Fixes verification failure in X86SelectionDAGInfo::verifyTargetNode (#185649)
[Clang] Fix ICE in constraint normalization when substituting concept template parameters (#184406)
23341c3d139b889e8c46867f8d704ab3c22b51f8 introduced
`SubstituteConceptsInConstraintExpression` to substitute non-dependent
concept template arguments into a concept's constraint expression during
normalization, as part of the P2841R7 implementation
([temp.constr.normal]/1.4).
The `ConstraintExprTransformer` added in that commit overrides
`TransformTemplateArgument` to only transform concept-related arguments
and preserve all others. However, `TransformUnresolvedLookupExpr` called
`Sema::SubstExpr`, which creates a separate `TemplateInstantiator` that
performs full substitution, bypassing the selective override entirely.
This caused all template parameters in the constraint expression to be
substituted using the concept's MLTAL. For example, given:
```cpp
template <class A, template <typename...> concept C>
[22 lines not shown]