[MLIR][XeGPU] Refactor XeGPU layout propagation: passing lane_layout/lane_data with inst_data (#203156)
**Motivation**
Enhance the setup* rules in layout propagation to pass lane_layout and
lane_data information during inst_data propagation, so that the
propagation has lane-level information when choosing an optimal
inst_data. This change makes that relationship explicit and uniform
across all setup rules.
**Invariant**
All setup rules now produce layouts that satisfy:
Nd ops + dpas/dpas_mx: inst_data = k * (lane_layout * lane_data), k ≥ 1
Scatter/matrix ops + non-anchor ops: inst_data = lane_layout * lane_data
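A minimal sketch of how the invariant above could be checked, assuming the products are taken per dimension and that k may differ between dimensions (the commit message does not say). `satisfiesInstDataInvariant` and its `allowMultiple` flag are hypothetical names for illustration, not part of XeGPULayoutImpl.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an XeGPU API. Checks, per dimension, that
//   inst_data[i] == k * (lane_layout[i] * lane_data[i])
// with k >= 1 when allowMultiple is true (Nd ops, dpas/dpas_mx), or
// k == 1 when it is false (scatter/matrix ops, non-anchor ops).
bool satisfiesInstDataInvariant(const std::vector<int64_t> &instData,
                                const std::vector<int64_t> &laneLayout,
                                const std::vector<int64_t> &laneData,
                                bool allowMultiple) {
  assert(instData.size() == laneLayout.size() &&
         instData.size() == laneData.size() && "rank mismatch");
  for (std::size_t i = 0; i < instData.size(); ++i) {
    const int64_t unit = laneLayout[i] * laneData[i];
    if (unit <= 0 || instData[i] < unit)
      return false;
    if (allowMultiple) {
      // Assumption: any positive integer multiple of the lane tile is fine.
      if (instData[i] % unit != 0)
        return false;
    } else if (instData[i] != unit) {
      return false;
    }
  }
  return true;
}
```

With lane_layout = [1, 16] and lane_data = [1, 1], an inst_data of [8, 16] passes the first rule but not the second, which requires exactly [1, 16].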
**Key changes in XeGPULayoutImpl**
- New per-op anchor setup rules: setupStoreNdAnchorLayout,
[38 lines not shown]
[LoopVectorize] Don't assert in getVectorCallCost for vector library variants (#202085)
During loop vectorization, `computePredInstDiscount` queries the cost of
instructions at vector VF using `getInstructionCost`. A `CallInst` with
a vector library variant delegates to `getVectorCallCost`, which
asserted that such variants should not reach it.
A predicated call can however reach `getVectorCallCost` via
`computePredInstDiscount` — before its widening decision is made — when
a predicated user (e.g. a scatter store) is being considered for
scalarization. Remove the assert and fall through to the existing
scalarization cost, which is the cost relevant to that analysis.
Adds a regression test exercising that path.
[AMDGPU] Change static NOP last terminator SI_DEMOTE_I1 to be replaced by S_BRANCH instead of assert (#204649)
This issue was first discovered in some testing downstream. A specific
chain of transformations on a ballot instruction with a constant
argument followed by an llvm.amdgcn.wqm.demote call leads to an
instruction of `SI_DEMOTE_I1 -1, 0` being the last terminator of a block
with a single successor. This instruction is a NOP and can safely be
replaced with an S_BRANCH to the block's successor instead of asserting
failure.
The test added in this change is a very simplified recreation of the
pattern seen in the downstream shader compilation that led to the
assertion failure.
sched_ule: sched_priority(): More accurate __unused annotation
Change a '__unused' to '__diagused', which is more precise for that use.
No functional change.
MFC after: 1 week
Event: Halifax Hackathon 202606
Sponsored by: The FreeBSD Foundation
[CIR] Lower byval/byref args in CallConvLowering (#201717)
ArgKind::Indirect arguments were hitting an errorNYI in
CIRABIRewriteContext. Add the lowering: in the callee the block argument
type changes to !cir.ptr<T>, a load is inserted at entry so the body sees
the original value type, and llvm.byval or llvm.byref is attached based on
ownership. At call sites, both byval and byref are lowered by allocating a
stack slot, copying the value in, and passing the pointer.
For byval, llvm.noalias and llvm.noundef are also added -- llvm.noalias
because the call-site rewrite always produces a fresh alloca+store
(equivalent to -fpass-by-value-is-noalias), and llvm.noundef because the
copy is always fully defined. byref carries only llvm.byref and llvm.align
since it does not assert exclusive ownership.
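As a source-level analogy only, the rewrite described above can be sketched in plain C++. This is not CIR and not the pass's output; `Big`, `sumLowered` and `callSumLowered` are hypothetical names standing in for an indirectly passed aggregate, the rewritten callee, and the rewritten call site.

```cpp
// Hypothetical aggregate that the ABI passes indirectly (ArgKind::Indirect).
struct Big {
  int values[4];
};

// Callee after lowering: the parameter type becomes a pointer, and a load
// at entry gives the body the original value type.
int sumLowered(Big *slot) {
  Big arg = *slot;     // entry load: body sees a Big, not a Big*
  slot->values[0] = 0; // writes to the slot stay in the slot
  int total = 0;
  for (int v : arg.values)
    total += v;
  return total;
}

// Call site after lowering: allocate a stack slot, copy the value in,
// and pass the pointer. The slot is fresh on every call, which is the
// property that justifies noalias in the byval case.
int callSumLowered(const Big &original) {
  Big slot = original; // stands in for alloca + store
  return sumLowered(&slot);
}
```

Because the callee only ever sees the fresh copy, its write to the slot does not reach the caller's object.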
Preserve descriptor strides for iterator map bounds
Iterator map lowering keeps a stable array base pointer and describes the
selected element or section with bounds inside omp.iterator. For boxed arrays,
those bounds must use the byte stride stored for each descriptor dimension.
Use fir.box_dims for boxed iterator map bounds and pass through each
dimension's descriptor byte stride. This preserves non-contiguous
assumed-shape actuals while keeping unit element strides for non-box arrays.
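A minimal sketch of why the descriptor byte stride matters, assuming each dimension is described by the lower bound, extent and byte stride that fir.box_dims yields. `DimInfo` and `byteOffset` are hypothetical names for illustration and do not appear in the lowering code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mirror of one descriptor dimension.
struct DimInfo {
  std::ptrdiff_t lowerBound;
  std::ptrdiff_t extent;
  std::ptrdiff_t byteStride; // distance in bytes between consecutive elements
};

// Byte offset of an element from the stable array base pointer.
// For a contiguous array of 4-byte elements byteStride is 4; for a
// non-contiguous assumed-shape actual such as a(1:9:2) it is 8.
std::ptrdiff_t byteOffset(const std::vector<DimInfo> &dims,
                          const std::vector<std::ptrdiff_t> &index) {
  assert(dims.size() == index.size() && "rank mismatch");
  std::ptrdiff_t offset = 0;
  for (std::size_t i = 0; i < dims.size(); ++i)
    offset += (index[i] - dims[i].lowerBound) * dims[i].byteStride;
  return offset;
}
```

Assuming a unit element stride for the strided section would place the third element 8 bytes from the base instead of 16, which is the mismatch the descriptor stride avoids.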
[X86] Hoist getMOVriOpcode to X86InstrInfo.h and share it, NFC (#205187)
The x86 backend often needs to materialize potentially 64-bit immediates
into registers, and the logic to pick between the available opcodes
exists in at least three places. Move it to X86InstrInfo.h so it can be
shared across the x86 backend without copying it.
An LLM did the refactoring.
[DebugInfo][CodeView] Resolve forward references to types without unique name (#203781)
In the following code:
```cpp
// header.h
typedef struct lua_State lua_State;
lua_State *getState();
// source.c
#include "header.h"
struct lua_State { int field; };
lua_State *getState() {
static lua_State state = {.field=42}; // make sure the type is emitted
return &state;
}
// main.cpp
extern "C" {
[16 lines not shown]
[RISCV][XCV] Add missing IsRV32 predicate to the XCVmac block (#205095)
The XCVmac instruction block was missing the `IsRV32` predicate that
every other XCV block already carries. `HasVendorXCVmac` on its own does
not require RV32, so `-mtriple=riscv64 -mattr=+xcvmac` could select
these RV32-only vendor instructions on RV64. Add `IsRV32` to the XCVmac
block to match the other XCV extensions and prevent selecting invalid
instructions on RV64.
Split out of #204879 at review request (one fix per PR).
Part of a CORE-V (XCV) series; see RFC:
https://discourse.llvm.org/t/rfc-core-v-xcv-support-for-cv32e40p-clang-builtins-xcvsimd-intrinsics-and-generic-auto-selection/91111
[CIR] Implement support for emitting label address constants (#203644)
The evalloop.c test in the llvm-test-suite single source tests contains
a static array that is initialized with the address of labels within the
enclosing function. This wasn't implemented in CIR.
This change adds an implementation. The constant emitter change was
trivial. We just needed to create a #cir.block_addr_info attribute.
However, using that attribute as an initializer for a global requires
some additional handling and special lowering for the initializer.
The goto solver also needed to be updated to consider uses of labels in
global initializers.
The test case here was copied over directly from classic codegen. The
original test has an additional test case for the difference between two
label addresses. Support for that case will be added in a future change.
Assisted-by: Cursor / claude-opus-4.8
Remove global variable from multicast routing.
Global variable struct sockaddr_in sin is used to pre-initialize
length and family. Changing sin_addr dynamically does not work in
a multiprocessor environment. Allocate and initialize sin on the
stack.
OK claudio@