[AMDGPU][SIMemoryLegalizer] Consider scratch operations as NV=1 if GAS is disabled
- Clarify that `thread-private` MMO flag is still useful.
- If GAS is not enabled (the default as of the previous patch), treat an op as `NV=1` if it uses a `scratch_` opcode or if its MMO is in the private AS.
- Add tests for the new cases.
- Update AMDGPUUsage GFX12.5 memory model
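The rule in the second bullet above can be sketched as a small predicate. This is a minimal illustration, not the code in SIMemoryLegalizer: `MemOpInfo`, `isNVByScratchRule` and `PrivateAddressSpace` are hypothetical names, and the sketch models only the new rule, not other ways an access may end up with `NV=1` (such as the `thread-private` MMO flag).

```cpp
// Hypothetical stand-ins for illustration; these names are not from the patch.
constexpr unsigned PrivateAddressSpace = 5; // AMDGPU private (scratch) AS number.

struct MemOpInfo {
  bool IsScratchOpcode;     // The instruction is a scratch_* opcode.
  unsigned MMOAddressSpace; // Address space recorded on the memory operand.
};

// Models only the rule described above: without GAS, scratch opcodes and
// private-AS MMOs are treated as NV=1. Other paths to NV=1 are not modeled.
constexpr bool isNVByScratchRule(const MemOpInfo &Op, bool GASEnabled) {
  if (GASEnabled)
    return false; // Private AS can no longer be assumed thread-local.
  return Op.IsScratchOpcode || Op.MMOAddressSpace == PrivateAddressSpace;
}
```

With GAS disabled, either condition alone is enough; with GAS enabled, this particular rule never fires.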
[RFC][AMDGPU] Add BARRIER address space
Add a new BARRIER address space for global variables that represent the barrier IDs in GFX12.5.
These barrier addresses just have values corresponding 1-1 to barrier IDs. They are still implemented on top of LDS, but the offsetting happens during an addrspacecast to generic, not whenever the barrier GV is used.
The motivation for this is to make the relation between LDS and barrier GVs explicit in the compiler. It does add a bit more complexity, but that complexity was already there, just hidden by pretending barrier GVs were actual LDS.
[NFCI][clang] Allow overriding any global variable address space
Allow the target to change the AS of a global variable at will, not just whenever Clang cannot assign one.
This enables the next patch that will specialize LDS GVs for barriers as a separate address space.
[Sema] Call ActOnFields before late parsing in ParseStructUnionBody (#187166)
Implements #186914.
Reorder ParseStructUnionBody so that `ActOnFields()` is called before
`ParseLexedCAttributeList()`. The struct type is then complete when
late-parsed attributes such as counted_by are evaluated. This is a
prerequisite for supporting sizeof/offsetof expressions in counted_by
evaluation.
Update the heuristic in `GetEnclosingNamedOrTopAnonRecord`. Remove the
`isCompleteDefinition()` condition, since it always returns true
under the new ordering. `GetEnclosingNamedOrTopAnonRecord` is intended to
treat unnamed and anonymous structs permissively.
Add one test to verify that unnamed and anonymous structs still work
as before under the new ordering.
[NFC][AMDGPU] Generalize some LDS MemoryUtils
In preparation for upcoming work, I need some functions used by the LDS lowering
system to work on any GV. I removed the LDS specific queries inside these functions
and replaced them with functors passed by the caller, so these utility functions can be reused.
I also cleaned up a few things that weren't up to code, such as lowercase variable names.
[AMDGPU] Make globally-addressable-scratch opt-in
This feature is meant to be opt-in for more advanced users, not default-enabled.
It may reduce performance otherwise as we can't assume private AS is thread-local
when it is enabled.
- Add `HasGloballyAddressableScratchSupport` feature to check if a target's scratch
addressing is changed due to support for globally addressable scratch.
- Use `EnableGloballyAddressableScratch` to check whether the user opted into
globally addressable scratch. This affects whether to lower scratch atomics as flat,
and in the future will affect whether NV=1 can be set on scratch accesses.
[GlobalISel] Recursively Optimise MatchTable Matchers (#197143)
The core of this change is the additional call to `Matcher::optimize()`
in the `optimizeRules` function, which enables the match table
optimization logic to recurse on the children of every GroupMatcher,
forming additional groups (which hoist more common predicates into a
shared group).
To enable that, I had to update the `getFirstConditionAsRootType`
implementation to support `GroupMatcher`.
I also included a small refactoring of the match table optimization
pipeline that was identical between the GlobalISel and
GlobalISelCombiner emitters.
This change reduces the size of GlobalISel match tables by up to 25%.
There is a tiny increase (a few bytes) in a combiner table because we
now create new groups
[16 lines not shown]
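To illustrate why recursing into groups hoists more shared predicates, here is a toy model of the idea from the GlobalISel MatchTable commit above. It is not the TableGen emitter code: `Rule`, `Node`, `buildGroups` and `countPredicateChecks` are hypothetical, and a rule is reduced to an ordered list of predicate names. Requires C++17.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model: a rule is a list of predicate names checked in order.
using Rule = std::vector<std::string>;

struct Node {
  std::string Predicate;      // Predicate hoisted for all children; empty at the root.
  std::vector<Node> Children; // Nested groups.
  std::size_t NumRules = 0;   // Rules whose predicates end at this node.
};

// Groups adjacent rules that share the predicate at position Depth, then
// recurses into each group so deeper shared predicates are hoisted as well.
Node buildGroups(const std::vector<Rule> &Rules, std::size_t Depth = 0,
                 const std::string &Pred = "") {
  Node N;
  N.Predicate = Pred;
  std::size_t I = 0;
  while (I < Rules.size()) {
    if (Rules[I].size() == Depth) { // Rule has no predicates left.
      ++N.NumRules;
      ++I;
      continue;
    }
    const std::string &Head = Rules[I][Depth];
    std::vector<Rule> Run;
    while (I < Rules.size() && Rules[I].size() > Depth &&
           Rules[I][Depth] == Head)
      Run.push_back(Rules[I++]);
    N.Children.push_back(buildGroups(Run, Depth + 1, Head));
  }
  return N;
}

// Number of predicate entries the grouped table has to store.
std::size_t countPredicateChecks(const Node &N) {
  std::size_t Count = N.Predicate.empty() ? 0 : 1;
  for (const Node &C : N.Children)
    Count += countPredicateChecks(C);
  return Count;
}
```

For the rules `{a,b,c}`, `{a,b,d}` and `{a,e}`, a flat table stores 8 predicate entries, while the recursively grouped tree stores 5 (`a`, `b`, `c`, `d`, `e`).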
[AMDGPU] Clamp load_monitor scope to minimum SCOPE_SE
The load_monitor instructions monitor L2 cache lines and therefore require at
least SCOPE_SE to ensure the L2 cache is hit. The current memory model requires
the user to ensure that the specified scope is such that it results in at least
SCOPE_SE, otherwise the behaviour is undefined. Instead, we now clamp the
emitted scope at a minimum of SCOPE_SE, so that the undefined behaviour is
converted into a performance loss instead.
Assisted-By: Claude Opus 4.6
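The clamping described in the load_monitor commit above amounts to taking the maximum of the requested scope and SCOPE_SE. A minimal sketch follows, assuming the scopes are ordered CU < SE < DEV < SYS; the `Scope` enum and `clampLoadMonitorScope` are hypothetical names, not the ones used in the backend.

```cpp
// Hypothetical scope enum; assumes the ordering CU < SE < DEV < SYS.
enum class Scope : unsigned { CU = 0, SE = 1, DEV = 2, SYS = 3 };

// Returns the scope to emit: never narrower than SE, otherwise unchanged.
constexpr Scope clampLoadMonitorScope(Scope Requested) {
  return Requested < Scope::SE ? Scope::SE : Requested;
}
```

A request for CU scope is widened to SE, which costs performance but avoids the undefined behaviour; wider scopes pass through untouched.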
[mlir][spirv] Set signed coop matrix operands (#197932)
Populate CooperativeMatrixOperandsKHR on KHR cooperative matrix
multiply-add based on the cooperative matrix element types. Signed
integer A, B, C and result matrices require their corresponding signed
component bits; otherwise SPIR-V treats those integer components as
unsigned.
Added a lit test.
Co-authored-by: Hsiangkai Wang <hsiangkai.wang at arm.com>
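The operand mask described in the coop matrix commit above can be sketched as follows. The bit values are those of the SPIR-V `CooperativeMatrixOperands` mask; `ElemKind` and `signedOperandMask` are hypothetical and stand in for whatever the MLIR lowering actually inspects.

```cpp
#include <cstdint>

// Bit values of the SPIR-V CooperativeMatrixOperands mask.
enum CoopMatrixOperands : std::uint32_t {
  NoneKHR = 0x0,
  MatrixASignedComponentsKHR = 0x1,
  MatrixBSignedComponentsKHR = 0x2,
  MatrixCSignedComponentsKHR = 0x4,
  MatrixResultSignedComponentsKHR = 0x8,
};

// Hypothetical summary of an element type.
struct ElemKind {
  bool IsInteger;
  bool IsSigned;
};

constexpr bool isSignedInt(ElemKind K) { return K.IsInteger && K.IsSigned; }

// Sets one bit per matrix whose element type is a signed integer.
constexpr std::uint32_t signedOperandMask(ElemKind A, ElemKind B, ElemKind C,
                                          ElemKind Result) {
  return (isSignedInt(A) ? MatrixASignedComponentsKHR : NoneKHR) |
         (isSignedInt(B) ? MatrixBSignedComponentsKHR : NoneKHR) |
         (isSignedInt(C) ? MatrixCSignedComponentsKHR : NoneKHR) |
         (isSignedInt(Result) ? MatrixResultSignedComponentsKHR : NoneKHR);
}
```

Floating-point and unsigned integer matrices contribute no bits, so an all-float multiply-add yields `NoneKHR`.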
[libc] Introduce a typed syscall wrapper and use it in mmap (#197459)
Linux reserves a range of values (everything above -4096u, i.e.
-MAX_ERRNO through -1) as error values, so the check can be performed
without knowing the details of the specific syscall. libc functions where
these values would be a valid result (e.g. PTRACE_PEEKDATA) are implemented
differently at the kernel level (e.g. returning the result through a
pointer argument). The only exceptions are a handful of syscalls (getpid,
getuid, ...) which can never fail, and where the returned value could be
an actual user/group ID (particularly on 32-bit systems).
Specifically, for mmap, this lets us remove the is_valid_mmap helper and
SYS_mmap2 ifdefs in various places.
More generally, this can simplify many syscall wrappers as often the
only thing they are doing is converting the return value into an
ErrorOr.
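The generic error check described in the libc commit above can be sketched like this. It is not the wrapper added by the patch: `SyscallResult`, `isSyscallError` and `decodeSyscallReturn` are hypothetical names, and the result struct is only a stand-in for an ErrorOr-like type.

```cpp
// Hypothetical names throughout; this is not the wrapper added by the patch.
constexpr long MaxErrno = 4095; // Matches the Linux kernel's MAX_ERRNO.

// Raw Linux syscall returns in [-MaxErrno, -1] encode a negated errno.
// Comparing as unsigned keeps large valid results (e.g. addresses) out.
constexpr bool isSyscallError(long Raw) {
  return static_cast<unsigned long>(Raw) >=
         static_cast<unsigned long>(-MaxErrno);
}

// Minimal stand-in for an ErrorOr-like result.
struct SyscallResult {
  bool Ok;    // True when the syscall succeeded.
  long Value; // Valid only when Ok is true.
  int Errno;  // Valid only when Ok is false.
};

constexpr SyscallResult decodeSyscallReturn(long Raw) {
  return isSyscallError(Raw)
             ? SyscallResult{false, 0, static_cast<int>(-Raw)}
             : SyscallResult{true, Raw, 0};
}
```

A wrapper built this way needs no per-syscall validity helper: a raw return of -12 decodes to errno 12, while -4096 is treated as an ordinary value.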
[MLIR][NVVM] Add sqrt Ops (#197422)
Adds two NVVM dialect ops covering all 14 floating-point `sqrt` forms:
- `nvvm.sqrt` -- IEEE-compliant sqrt with explicit rounding mode
(`sqrt.<RM>[.ftz].{f32,f64}`), 12 forms.
- `nvvm.sqrt.approx` -- fast approximate sqrt (`sqrt.approx[.ftz].f32`),
2 forms; uses the `NVVM_F32UnaryApproxOp` base class.
The two ops are split because the rounded forms require an explicit rounding mode and support both f32 and f64, while the approx forms have no rounding mode and are f32-only.
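As a check on the count of 14, the forms can be enumerated from the patterns given in the NVVM sqrt commit above. This sketch assumes the four PTX rounding modes (`rn`, `rz`, `rm`, `rp`) and that `.ftz` applies only to f32; `enumerateSqrtForms` is a hypothetical helper, not part of the dialect.

```cpp
#include <string>
#include <vector>

// Lists the PTX-style sqrt spellings covered by the two ops.
std::vector<std::string> enumerateSqrtForms() {
  std::vector<std::string> Forms;
  // Rounded forms: sqrt.<RM>[.ftz].{f32,f64}; .ftz is f32-only.
  for (const char *RM : {"rn", "rz", "rm", "rp"}) {
    std::string Base = std::string("sqrt.") + RM;
    Forms.push_back(Base + ".f32");
    Forms.push_back(Base + ".ftz.f32");
    Forms.push_back(Base + ".f64");
  }
  // Approximate forms: sqrt.approx[.ftz].f32.
  Forms.push_back("sqrt.approx.f32");
  Forms.push_back("sqrt.approx.ftz.f32");
  return Forms;
}
```

Four rounding modes times three type variants give the 12 rounded forms, and the two approximate forms bring the total to 14.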
[flang-rt][test] Fix write01.f90 missing LD_LIBRARY_PATH (introduced in #187662)
The test binary was run without setting LD_LIBRARY_PATH, causing
libflang_rt.runtime.so to not be found at runtime. Match the pattern
used by exec.f90 and ctofortran.f90.
Co-Authored-By: Claude Sonnet 4.6 <noreply at anthropic.com>