[MLIR][XeGPU] Add Layout Propagation support for multi-reduction/reduction op with scalar result (#189133)
This PR adds layout propagation support for multi-reduction/reduction ops
with scalar results:
1) Enhance setupMultiReductionResultLayout() and
LayoutInfoPropagation::visitVectorMultiReductionOp() to support scalar
result
2) Add propagation support for vector.reduction op at the lane level,
since the op is only introduced at the lane level.
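For reference, the two op forms involved look like this in vector dialect IR (shapes are illustrative): a `vector.multi_reduction` that reduces all dimensions collapses to a scalar, and at the lane level the reduction appears as `vector.reduction`, which always produces a scalar.

```mlir
// Reducing every dimension of the source yields a scalar result,
// the case the propagation now handles.
%s = vector.multi_reduction <add>, %src, %acc [0, 1]
       : vector<16x16xf32> to f32

// vector.reduction only exists after distribution to lanes and
// likewise produces a scalar, so it needs its own propagation rule.
%r = vector.reduction <add>, %lane : vector<16xf32> into f32
```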
[clang] Introduce `ModuleCache::write()` (#188877)
This introduces a new `ModuleCache` interface for writing PCM files.
Together with #188876, this will enable adding a caching layer into the
`InProcessModuleCache` implementation, hopefully reducing IO cost.
Moreover, this makes it explicit that the PCM contents are written before
the timestamp is updated, an important invariant that we've broken before.
[OpenMP][MLIR] Fix GPU teams reduction buffer size for by-ref reductions (#185460)
The `ReductionDataSize` field in `KernelEnvironmentTy` and the
`MaxDataSize` used to compute the `reduce_data_size` argument to
`__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer
types for by-ref reductions instead of the actual element types. This
caused the global teams reduction buffer to be undersized relative to
the offsets used by the copy/reduce callbacks, resulting in
out-of-bounds access faults at runtime.
For example, a by-ref reduction over `[4 x i32]` (16 bytes) would
allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the
generated callbacks would access 16 bytes per slot.
Fix both computation sites:
1. In MLIR's `getReductionDataSize()`, use
`DeclareReductionOp::getByrefElementType()` instead of `getType()` when
the reduction is by-ref, so the reduction buffer struct layout (and more
[16 lines not shown]