[NewPM] Wire up gc-empty-basic-blocks into pipeline (#194179)
Same setup as the old pipeline and resolves a testing TODO now that we
have a full pipeline for x86.
[llvm][.gitignore] Ignore .agents directory (#194236)
The `.agents` directory is a common convention for storing coding-agent
files such as rules and skills, recognized by major coding agents
like Claude Code, Codex, GitHub Copilot, OpenCode, etc. Ignore this
directory to avoid accidentally committing coding-agent-related
content to the repo.
[lldb][test] Modernize TestModulesAutoImport (#194357)
This replaces all the custom test setup logic with the newer test
utilities. Not technically NFC as the newer checks are more strict.
[AArch64][llvm][clang] Remove `int_aarch64_sve_bfmmla` and reuse existing def (NFC)
Remove the dedicated (superfluous) `int_aarch64_sve_bfmmla` def and
change `svbfmmla` to use the existing shared fmmla intrinsic instead.
No functional change.
[CIR][OpenMP] Enable emission of target functions (#193204)
This PR allows generation of target device functions for OpenMP. It also
handles filtering out host functions that do not contain target regions.
Assisted-by: Cursor / claude-4.6-opus-high
[AArch64] Enable Spillage Copy Elimination by default (#186093)
In times of high register pressure, the greedy register allocator can
emit large eviction chains that consist of many `mov` instructions. The
Spillage Copy Elimination pass handles this by finding these chains and
decreasing their impact. Take a mov chain such as the following where
`x8` is used for an 8-byte Folded Reload:
```
mov x7, x6
mov x6, x5
mov x5, x4
mov x4, x3
mov x3, x2
mov x2, x1
mov x1, x30
mov x30, x8
< use x8 >
mov x8, x30
mov x30, x1
```
[28 lines not shown]
[MLIR][OpenMP] Unify device shared memory logic, NFCI (#182856)
This patch creates a utils library for the OpenMP dialect with functions
used by MLIR to LLVM IR translation as well as the stack-to-shared pass
to determine which allocations must use local stack memory or device
shared memory.
[MLIR][OpenMP][OMPIRBuilder] Improve shared memory checks (#161864)
This patch refines checks to decide whether to use device shared memory
or regular stack allocations. In particular, it adds support for
parallel regions residing on standalone target device functions.
The changes are:
- Shared memory is introduced for `omp.target` implicit allocations,
such as those related to privatization and mapping, as long as they are
shared across threads in a nested parallel region.
- Standalone target device functions are interpreted as being part of a
Generic kernel, since the fact that they are present in the module after
filtering means they must be reachable from a target region.
- Prevent allocations whose only shared uses inside of an `omp.parallel`
region are as part of a `private` clause from being moved to device
shared memory.
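The rules listed above can be read as a small decision procedure. The following Python sketch is only a toy model of that reading: the `Use` and `Allocation` records and the `needs_device_shared_memory` helper are hypothetical names invented for illustration and do not correspond to the actual MLIR or OMPIRBuilder API. It assumes the caller already knows whether the allocation lives in a Generic kernel context (which, per the second rule, includes standalone target device functions).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Use:
    # True if the use sits inside a nested `omp.parallel` region.
    in_parallel: bool
    # True if the use is only an operand of a `private` clause.
    via_private_clause: bool = False


@dataclass
class Allocation:
    name: str
    uses: List[Use] = field(default_factory=list)


def needs_device_shared_memory(alloc: Allocation,
                               in_generic_kernel: bool = True) -> bool:
    """Toy check: pick device shared memory only when some use inside a
    parallel region is more than a `private` clause operand."""
    if not in_generic_kernel:
        # Assumption: outside a Generic context, regular stack memory is kept.
        return False
    return any(u.in_parallel and not u.via_private_clause for u in alloc.uses)


# Hypothetical allocations used purely for illustration.
shared = Allocation("mapped_var", [Use(in_parallel=True)])
private_only = Allocation("priv_var", [Use(in_parallel=True, via_private_clause=True)])
local = Allocation("tmp", [Use(in_parallel=False)])
```

In this model, `shared` would be moved to device shared memory, while `private_only` and `local` would stay on the regular stack.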
[Flang][OpenMP] Add pass to replace allocas with device shared memory (#161863)
This patch introduces a new Flang OpenMP MLIR pass, only run for target
device modules, that identifies `fir.alloca` operations that should use
device shared memory and replaces them with pairs of
`omp.alloc_shared_mem` and `omp.free_shared_mem` operations.
This works in conjunction with the MLIR to LLVM IR translation pass'
handling of privatization, mapping and reductions in the OpenMP dialect
to properly select the right memory space for allocations based on where
they are made and where they are used.
This pass, in particular, handles explicit stack allocations in MLIR,
whereas the aforementioned translation pass takes care of implicit ones
represented by entry block arguments.
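As a rough illustration of the rewrite described above, the following Python sketch models a block as a flat list of `(operation, value)` pairs. It is not the Flang pass: the `replace_allocas` helper and the `shared_values` set (standing in for the analysis result) are hypothetical, and both the placement of the frees before the block terminator and their reverse allocation order are assumptions made for this sketch.

```python
from typing import List, Set, Tuple

Op = Tuple[str, str]  # (operation name, value name)


def replace_allocas(ops: List[Op], shared_values: Set[str]) -> List[Op]:
    """Replace selected `fir.alloca` ops with alloc/free pairs.

    Assumes `ops` ends with a terminator. How the real pass decides which
    values belong in `shared_values` is not modeled here.
    """
    rewritten: List[Op] = []
    frees: List[Op] = []
    for name, value in ops:
        if name == "fir.alloca" and value in shared_values:
            rewritten.append(("omp.alloc_shared_mem", value))
            frees.append(("omp.free_shared_mem", value))
        else:
            rewritten.append((name, value))
    if not frees:
        return rewritten
    # Insert the frees just before the terminator, in reverse allocation
    # order (an assumption of this sketch).
    body, terminator = rewritten[:-1], rewritten[-1:]
    return body + frees[::-1] + terminator


# Hypothetical input block: only %a is used across threads.
block = [
    ("fir.alloca", "%a"),
    ("fir.alloca", "%b"),
    ("omp.parallel", "%a"),
    ("return", ""),
]
result = replace_allocas(block, {"%a"})
```

Here `%b` keeps its `fir.alloca`, while `%a` is allocated with `omp.alloc_shared_mem` and released by a matching `omp.free_shared_mem` before the terminator.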
[Flang][MLIR][OpenMP] Add explicit shared memory (de-)allocation ops (#161862)
This patch introduces the `omp.alloc_shared_mem` and
`omp.free_shared_mem` operations to represent explicit allocations and
deallocations of shared memory across threads in a team, mirroring the
existing `omp.target_allocmem` and `omp.target_freemem`.
The `omp.alloc_shared_mem` op goes through the same Flang-specific
transformations as `omp.target_allocmem`, so that the size of the buffer
can be properly calculated when translating to LLVM IR.
The corresponding runtime functions produced for these new operations
are `__kmpc_alloc_shared` and `__kmpc_free_shared`, which previously
could only be created for implicit allocations (e.g. privatized and
reduction variables).
[MLIR][OpenMP] Refactor omp.target_allocmem to allow reuse, NFC (#161861)
This patch moves tablegen definitions that could be used for all kinds
of heap allocations out of `omp.target_allocmem` and into a new
`OpenMP_HeapAllocClause` that can be reused.
Descriptions are updated to follow the format of most other operations
and the custom verifier for `omp.target_allocmem` is removed as it only
made a redundant check on its result type.
[OMPIRBuilder] Add support for explicit deallocation points (#154752)
In this patch, some OMPIRBuilder codegen functions and callbacks are
updated to work with arrays of deallocation insertion points. The
purpose of this is to enable the replacement of `alloca`s with other
types of allocations that require explicit deallocations in a way that
makes it possible for `CodeExtractor` instances created during
OMPIRBuilder finalization to also use them.
The OpenMP to LLVM IR MLIR translation pass is updated to properly store
and forward deallocation points together with their matching allocation
point to the OMPIRBuilder.
Currently, only the `DeviceSharedMemCodeExtractor` uses this feature to
get the `CodeExtractor` to use device shared memory for intermediate
allocations when outlining a parallel region inside of a Generic kernel
(a code path currently used only by Flang via MLIR). However,
long term this might also be useful to refactor finalization of
variables with destructors, potentially reducing the use of callbacks
[6 lines not shown]
[OpenMPOpt] Make parallel regions reachable from new DeviceRTL loop functions (#150927)
This patch updates the OpenMP optimization pass to know about the new
DeviceRTL functions for loop constructs.
This change marks these functions as potentially containing parallel
regions, which fixes a current bug with the state machine rewrite
optimization. It previously failed to identify parallel regions located
inside of the callbacks passed to these new DeviceRTL functions, causing
the resulting code to skip executing these parallel regions.
As a result, Generic kernels produced by Flang that contain parallel
regions now work properly.
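The failure mode described above can be pictured as a reachability question over calls and callbacks. The Python sketch below is a simplified model, not how OpenMPOpt is implemented: `reaches_parallel`, the `callback_invokers` set, and the function names `kernel`, `loop_rtl_fn` and `loop_body` are all hypothetical. A callback is only followed when the callee is known to invoke its callback arguments, which mirrors the effect of teaching the pass about the new DeviceRTL loop functions.

```python
from typing import Dict, List, Set, Tuple

Call = Tuple[str, List[str]]  # (callee, functions passed to it as callbacks)


def reaches_parallel(entry: str,
                     calls: Dict[str, List[Call]],
                     contains_parallel: Set[str],
                     callback_invokers: Set[str]) -> bool:
    """Return True if a function containing a parallel region is reachable
    from `entry`, following callbacks only through known invokers."""
    seen: Set[str] = set()
    worklist = [entry]
    while worklist:
        fn = worklist.pop()
        if fn in seen:
            continue
        seen.add(fn)
        if fn in contains_parallel:
            return True
        for callee, callbacks in calls.get(fn, []):
            worklist.append(callee)
            if callee in callback_invokers:
                # The callee is known to run its callbacks, so they are
                # reachable too.
                worklist.extend(callbacks)
    return False


# Hypothetical kernel whose loop body (with a parallel region) is passed
# as a callback to a runtime loop function.
calls = {"kernel": [("loop_rtl_fn", ["loop_body"])]}
contains_parallel = {"loop_body"}

before = reaches_parallel("kernel", calls, contains_parallel, set())
after = reaches_parallel("kernel", calls, contains_parallel, {"loop_rtl_fn"})
```

With an empty `callback_invokers` set the parallel region in `loop_body` is missed; once `loop_rtl_fn` is registered as invoking its callbacks, it is found.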
One known related issue not fixed by this patch is that the presence of
calls to these functions will prevent the SPMD-ization of Generic
kernels by OpenMPOpt. Previously, this was due to assuming there was no
parallel region. This is changed by this patch, but instead we now mark
it temporarily as unsupported in an SPMD context. The reason is that,
[2 lines not shown]
[OpenMP][OMPIRBuilder] Support parallel in Generic kernels (#150926)
This patch introduces codegen logic to produce a wrapper function
argument for the `__kmpc_parallel_51` DeviceRTL function needed to
handle arguments passed using device shared memory in Generic mode.
[OpenMP][OMPIRBuilder] Use device shared memory for arg structures (#150925)
Argument structures are created when sections of the LLVM IR
corresponding to an OpenMP construct are outlined into their own
function. For this, stack allocations are used.
This patch modifies this behavior when compiling for a target device and
outlining `parallel`-related IR, so that it uses device shared memory
instead of private stack space. This is needed in order for threads to
have access to these arguments.
[MLIR][OpenMP] Support allocations of device shared memory (#150924)
This patch updates the allocation of some reduction and private
variables within target regions to use device shared memory rather than
private memory. This is a prerequisite to produce working Generic
kernels containing parallel regions.
In particular, the following situations result in the usage of device
shared memory (only when compiling for the target device if they are
placed inside of a target region representing a Generic kernel):
- Reduction variables on `teams` constructs.
- Private variables on `teams` and `distribute` constructs that are
reduced or used inside of a `parallel` region.
Currently, there is no support for delayed privatization on `teams`
constructs, so private variables on these constructs are not yet
affected. When support is added, if it uses the existing
`allocatePrivateVars` and `cleanupPrivateVars` functions, usage of
device shared memory will be introduced automatically.
[OpenMP][OMPIRBuilder] Add device shared memory allocation support (#150923)
This patch adds the `__kmpc_alloc_shared` and `__kmpc_free_shared`
DeviceRTL functions to the list of those the OMPIRBuilder is able to
create.
[MLIR][OpenMP] Remove Generic-SPMD early detection (#150922)
This patch removes logic from MLIR to attempt identifying Generic
kernels that could be executed in SPMD mode.
This optimization is done by the OpenMPOpt pass for Clang and is only
required here to circumvent missing support for the new DeviceRTL APIs
used in MLIR to LLVM IR translation that Clang doesn't currently use
(e.g. `kmpc_distribute_static_loop`). Removing checks in MLIR avoids
duplicating the logic that should be centralized in the OpenMPOpt pass.
Additionally, offloading kernels currently compiled through the OpenMP
dialect fail to run parallel regions properly when in Generic mode. By
disabling early detection, this issue becomes apparent for a range of
kernels where this was masked by having them run in SPMD mode.
[HLSL] Handle logical pointer for array assign (#193227)
This commit adds SPIR-V testing to an existing test (almost-NFC on DXIL
testing). It also copies that test and invokes Clang using the
experimental logical pointer flag.
Adding this flag exposes a missing case in the frontend, which this
commit handles.
Due to the difference in index handling between the structured.gep and
legacy one, the Cbuffer load codegen had to be rewritten. It's a bit
more naive, as we get one gep per level, but this will be handled by
optimizations later on.