[mlir][acc] Introduce privatization operations for codegen (#195273)
This change adds codegen-oriented operations for representing
private-variable storage and materializing the storage that a particular
parallel execution actually uses.
The two operations are meant to be used together:
- acc.privatize introduces an abstract handle for the privatized storage,
including the parallel levels that determine the ultimate size of the
storage needed. Which parallel levels apply can be stated when that
structure is known, or omitted so the same representation can be refined
later as launch and loop parallelism are decided.
- acc.private_local takes that handle and yields the concrete storage
for the current execution context (for example, the slice that corresponds
to this gang or worker).
[flang] Fix missed access group attribute when converting FIR to LLVM dialect. (#195376)
Apply the access group attribute to the memcpy when lowering fir.load/fir.store
of a box if the original FIR operation had it.
[asan] Change error to note when poison record is not found (#195669)
When `CheckPoisonRecords` fails to find a record, it's often due to the
history buffer being too small rather than a functional error in the
logic.
[GIsel] Add combine (sub a, (mul x, C)) -> (add a, (mul x, -C)) (#194282)
Copy this canonicalization from InstCombine so it can run on
post-legalized expansions. This is especially useful if the sub is a
neg.
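The identity behind this combine can be checked with a small sketch in 32-bit wraparound arithmetic. The function names `subOfMul` and `addOfMulNeg` are hypothetical and only illustrate the arithmetic; they are not part of the GlobalISel combiner.

```cpp
#include <cstdint>

// Original form: a - x * C, in 32-bit modular (wraparound) arithmetic.
constexpr std::uint32_t subOfMul(std::uint32_t a, std::uint32_t x,
                                 std::uint32_t c) {
  return a - x * c;
}

// Canonical form: a + x * (-C). Unsigned negation is modular, so the two
// forms agree for every input, including when the multiply overflows.
constexpr std::uint32_t addOfMulNeg(std::uint32_t a, std::uint32_t x,
                                    std::uint32_t c) {
  return a + x * (0u - c);
}

// The a == 0 case corresponds to the sub being a neg.
static_assert(subOfMul(10, 3, 4) == addOfMulNeg(10, 3, 4), "small values");
static_assert(subOfMul(0, 7, 5) == addOfMulNeg(0, 7, 5), "neg case");
static_assert(subOfMul(1, 0xFFFFFFFFu, 0x80000000u) ==
                  addOfMulNeg(1, 0xFFFFFFFFu, 0x80000000u),
              "overflowing multiply");
```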
[MLIR][AMDGPU] Add amdgpu.global_transpose_load op for gfx1200+ global memory transpose loads (#195287)
Adds a new `amdgpu.global_transpose_load` op to the AMDGPU dialect that
wraps the `global_load_tr` family of instructions introduced in RDNA4
(gfx1200+). Each thread reads a column of a matrix from global memory
and receives the corresponding transposed row in its result register.
The op is kept separate from the existing `amdgpu.transpose_load` (which
targets LDS via `ds_read_tr` on gfx950+) because the two variants target
different GPU architecture families, have different chipset
requirements, and differ in their valid (element size, num elements)
combinations — in particular the 16-bit case produces a 128-bit
(8-element) result via `global_load_tr.b128` rather than the 64-bit
(4-element) result from `ds_read_tr16.b64`.
Lowering to the existing ROCDL `global.load.tr{4,6,.}.b{64,96,128}`
intrinsics is added for gfx1200+.
---------
[2 lines not shown]
[mlir][MathToLLVM] Fix vector type checks in math.absi lowering. (#195360)
For vector types, the lowered type is LLVMArrayType not VectorType. We
should use the original result type to guide if we can do the lowering
for vectors or not.
Signed-off-by: hanhanW <hanhan0912 at gmail.com>
[mlir][SPIRV] Add named-barrier type and OpNamedBarrierInitialize / OpMemoryNamedBarrier (#195664)
Adds the SPIR-V named-barrier object (TypeNamedBarrier) along with
NamedBarrierInitialize and MemoryNamedBarrier ops, gated on the
NamedBarrier capability and SPIR-V 1.1+.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply at anthropic.com>
[flang][semantics] Add a flag to relax some of the semantic constraints on C_LOC (#195112)
This PR adds a flag that downgrades some of the semantic constraints on
C_LOC so that it can be used more like LOC. Without the flag, behavior is
unchanged; with the flag, the constraint that the address be an object
pointer or target is removed. There are other constraints we might
consider relaxing, but I think this is a start.
[clang][NFC] Mark CWG2785 as implemented and add a test (#195547)
[CWG2785](https://wg21.link/cwg2785) clarifies that a
*requires-expression* is never type-dependent: it always has type
`bool`. That means that in a snippet like this:
```cpp
void g(void *);
template <typename T>
void f() {
  g(requires { T(); });
}
```
The call to `g` should be diagnosed as invalid (`bool` is not
convertible to `void *`) even if the template is never instantiated.
Clang has done the right thing since version 10:
https://godbolt.org/z/s61rEbsfz
[flang][CUDA] Only apply implicit managed attribute when CUDA Fortran is enabled (#195353)
The implicit-managed tagging added in #175648 was intended for CUDA
Fortran allocatables. However, the gate was just
LanguageFeature::CudaManaged, so the tagging also fires on
non-CUDA-Fortran translation units when -gpu=mem:managed is in effect.
This patch adds a LanguageFeature::CUDA check so the implicit tagging
only fires for CUDA Fortran TUs (driver-set -fcuda or .cuf/.CUF source).
Adds a regression test that bbc -gpu=managed without -fcuda on a .f90
source must not produce any cuf.* ops or #cuf.cuda<managed> attributes.
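The change to the gate can be sketched as a pair of predicates. This is a simplified model, not the flang implementation: the `Features` struct and the functions `oldGate` and `newGate` are hypothetical stand-ins for the `LanguageFeature::CUDA` and `LanguageFeature::CudaManaged` checks named above.

```cpp
// Hypothetical model of the two language features involved.
struct Features {
  bool CUDA;        // CUDA Fortran enabled (-fcuda or .cuf/.CUF source)
  bool CudaManaged; // managed memory requested
};

// Before the patch: only the managed feature was checked.
constexpr bool oldGate(Features F) { return F.CudaManaged; }

// After the patch: the TU must also be a CUDA Fortran TU.
constexpr bool newGate(Features F) { return F.CUDA && F.CudaManaged; }
```

Under this model, a plain .f90 translation unit with managed memory requested passes the old gate but not the new one, which is the case the regression test covers.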
[SandboxIR][SandboxVec] Remove score tracking from Region, add RegionWithScore (#190293)
Up until now the `Region` class contained a `ScoreBoard` and was
tracking instruction costs by default. However, design-wise the Region
is a generic IR-level structure and should be independent from score
tracking.
So this patch removes the score tracking capability from the base
`Region` class and creates a separate `RegionWithScore` derived class
for that. The new class is placed in the vectorizer directory because
the score tracking is meant for the vectorizer.
Should be NFC.
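The shape of the split can be sketched as follows. This is only an illustration of the design: the `Instr` struct, the `add`, `size`, and `score` members, and the integer cost model are assumptions, and the real `Region` and `RegionWithScore` classes have different interfaces.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an instruction with a known cost.
struct Instr {
  int Cost;
};

// Generic IR-level container: knows nothing about scores.
class Region {
public:
  virtual ~Region() = default;
  virtual void add(const Instr &I) { Insts.push_back(I); }
  std::size_t size() const { return Insts.size(); }

protected:
  std::vector<Instr> Insts;
};

// Vectorizer-specific subclass: layers cost tracking on top of Region.
class RegionWithScore : public Region {
public:
  void add(const Instr &I) override {
    Region::add(I);
    Score += I.Cost; // accumulate cost only in the derived class
  }
  int score() const { return Score; }

private:
  int Score = 0;
};
```

Code that only needs a generic region can keep using the base class, while the vectorizer opts in to cost tracking through the derived one.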
[clang][CodeView] Prevent the input name from appearing in LF_BUILDINFO (#194140)
The implicit contract of an `LF_BUILDINFO` record (represented in LLVM
by
[`BuildInfoRecord`](https://github.com/llvm/llvm-project/blob/6f0b55ec55f3e5e1ccc0d6b0d04a307479218768/llvm/include/llvm/DebugInfo/CodeView/TypeRecord.h#L667))
is that its `CommandLine` field should not contain the input source file
— a separate `SourceFile` field is reserved for that.
When the command-line flattening was moved from `llvm/` to `clang/` in
#106369, the comparison value used to identify and strip the source
positional was switched from `MainSourceFile->getFilename()` (the full
input path resolved by clang) to `CodeGenOpts.MainFileName` (just the
basename, set via `-main-file-name`). As a result, when the driver is
invoked with an absolute source path, the cc1 positional is that absolute
path and no longer matches `MainFileName`, so the source filename leaks
into `CommandLine` as a trailing positional cc1 argument.
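The mismatch described above can be reproduced with a minimal sketch. The helper `flattenWithout` is hypothetical and much simpler than the actual flattening code in clang; it only shows why comparing against the basename fails to strip an absolute-path positional.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: join the arguments with spaces, dropping any
// argument that is exactly equal to SourceName.
std::string flattenWithout(const std::vector<std::string> &Args,
                           const std::string &SourceName) {
  std::string Out;
  for (const std::string &A : Args) {
    if (A == SourceName)
      continue; // strip the source positional
    if (!Out.empty())
      Out += ' ';
    Out += A;
  }
  return Out;
}
```

With arguments `-cc1 -O2 /src/foo.cpp`, comparing against the full path `/src/foo.cpp` strips the positional, whereas comparing against the basename `foo.cpp` leaves it in the flattened string.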
This is a regression in Clang 20. It breaks downstream tooling such as
Live++, whose unity-splitting feature relies on the embedded command
[12 lines not shown]
[BOLT] Gadget scanner: add less strict version of tail call checker
During a tail call, it may be worth making sure the link register is as
trusted as during a regular call, though it may require inserting
expensive checking code by the compiler.
On the other hand, with pac-ret hardening enabled, there should be no
reason not to protect tail-calling functions at least as well as those
exited via a regular return instruction.
This commit splits the tail call checker into two versions: the basic one,
which is suitable to make sure regular `PAC*` + `AUT*` are emitted as
needed, and the strict one, which additionally ensures the authentication
(if any) succeeded.