[CIR] Implement Namespace/global TLS CIR CodeGen (#196332)
Unlike local TLS, global TLS functions need to be initialized upon their
first use in a thread.
First, all attempts to 'get' said TLS global are replaced with calls to
a 'wrapper' function, which calls an 'init' alias function, then returns
the global. While classic codegen sometimes manages to omit this in
simple cases, this CIR implementation doesn't attempt such constant
folding/inlining. The call to 'init' is omitted if no ctor/dtor setup is
required, so sometimes the wrapper is just a no-op (intentionally!).
There are also two types of 'global' TLS init functions: unordered and
ordered. Unordered ones are typically variable templates, and their
'init' function initializes JUST them. The rest are ordered, which
requires all ordered initializations to happen as soon as any one does.
The Wrapper:
[25 lines not shown]
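As a rough sketch of when the wrapper kicks in (hypothetical identifiers, not the actual generated CIR):

```cpp
// A sketch of the C++ source pattern that requires the TLS wrapper.
// Because `tls_obj` has a non-trivial constructor, every use of it is
// lowered to a call to a per-variable wrapper that runs the dynamic
// initializer on first use in each thread, then returns the variable.
struct Counter {
  Counter() : value(42) {} // non-trivial ctor => dynamic TLS init needed
  int value;
};

thread_local Counter tls_obj; // namespace-scope TLS global

int read_value() {
  // Conceptually lowered to something like: __tls_wrapper_for_tls_obj()->value
  return tls_obj.value;
}
```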
AMDGPU/GlobalISel: Switch to extended LLTs
The switch is required to be able to translate bfloat.
After the switch, most codegen patterns now require an explicit type on
the register to match, instead of LLT::scalar. We can still use
LLT::scalar for type checks, but new instructions created during
lowerings/combines need to use the proper extended LLT.
Inst-select test sources are fully switched to i32/f32 so patterns can
match; tests for the legalizer and regbanklegalize are left as is (they
should probably be switched as well).
New functionality worth noting is the f16 bitcast lowering via i32:
f16 = g_bitcast i16
->
i32 = g_anyext i16
f16 = g_trunc i32
where f16 = g_trunc i32 is legal.
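The anyext/trunc round trip above is value-preserving because truncating the widened i32 recovers the original low 16 bits. A small C++ sketch of the bit-level effect (illustrative only, not the actual GlobalISel code):

```cpp
#include <cstdint>

// Models the g_anyext + g_trunc pair: widening the 16 bits that hold the
// f16 value into an i32 and truncating back yields the original bit
// pattern, so the bitcast's result is unchanged.
uint16_t f16BitsViaI32(uint16_t bits16) {
  uint32_t widened = bits16;                  // g_anyext i16 -> i32
  return static_cast<uint16_t>(widened);      // g_trunc  i32 -> f16 bits
}
```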
[clang][bytecode] Fix a crash in Descriptor::getElemDataSize() (#196929)
`FIXED_SIZE_INT_TYPE_SWITCH` does not handle `PT_Bool`, so handle it
explicitly beforehand.
[clang][OpenMP 6.0][CodeGen] Codegen for declare_target 'local' clause (#196431)
Implement code generation for the OpenMP 6.0 declare_target 'local'
clause, which creates device-only variables with per-device static
storage.
A 'local' variable exists in the device image with its static
initializer and is always accessed directly by device code. This is the
same as 'to'/'enter' without unified shared memory, except that no
offload entry is registered.
Using 'device_type(nohost)' with 'local' is not yet supported. Sema
generates a warning and converts it to 'device_type(any)'.
Testing:
- Updated tests:
clang/test/OpenMP/declare_target_messages.cpp
clang/test/OpenMP/declare_target_ast_print.cpp
- New tests:
[2 lines not shown]
[gn] port 07b5dfe9473c6 + deps (LLVMABI dep in clang) (#196944)
Also adds build files for llvm/lib/ABI, which was dead code before
07b5dfe9473c6 (at least in the GN build).
[AMDGPU] Replace vdst_in opcode exclusion list with position check
Use getNamedOperandIdx to detect if vdst_in has already been added
by a prior converter, instead of maintaining a hardcoded opcode list.
[LV] Add test showing lack of gather/scatter can prevent if-convert
This introduces a new force-target-supports-gather-scatter-ops CLI
option for testing, as well as a new isLegalMaskedLoadOrStore() helper.
[LV] Use isLegalMaskedLoadOrStore for interleaved accesses too (NFC) (#195243)
isLegalMaskedLoadOrStore is now the central place for querying target
capabilities for masked accesses. Access pattern legality checks are
hoisted outside of it.
[Flang][Semantics] Treat host/use-associated objects as externally visible. (#192892)
This patch fixes a false semantic error in Flang where function result
variables were incorrectly treated as externally visible in
pure-definability checks.
As a result, valid code assigning a pointer component of a function
result (as in flang/test/Semantics/pure-function-result-pointer.f90) was
rejected with “not definable in a pure subprogram.”
The fix updates _FindExternallyVisibleObject_ to treat function result
symbols as local, which matches Fortran semantics for function result
variables.
[Flang][OpenMP] Fix COPYIN of derived types with allocatable components at -O3 (#196063)
COPYIN of threadprivate derived types with allocatable components
segfaults at -O3 because the OpenMP runtime zero-fills per-thread
storage, leaving allocatable component descriptors with invalid
metadata. This patch skips the copy on the master thread (where source
and destination alias) and uses temporary_lhs assignment on worker
threads so the runtime initializes descriptors before the deep copy.
Assisted-by: Claude Opus 4.6
Fixes: https://github.com/llvm/llvm-project/issues/196134
Minimal reproducing test case:
```
program repro_o3_segv
use omp_lib
implicit none
[64 lines not shown]
Reapply [AA] No synchronization effects for never-escaping identified local (#196923)
Relative to the previous attempt, this makes sure that the location does
not alias with the pointer operand first. If it aliases, then we need to
consider the direct ModRef effects of the instruction, not just the
synchronization effects.
-----
Fences and other synchronizing operations (such as atomic accesses
stronger than monotonic) are modelled as reading and writing all memory,
in order to enforce their implied ordering constraints.
Currently, this happens even for identified function locals that do not
escape. This patch excludes those objects.
Notably, we cannot reason based on captures-before here, because the
synchronizing operation still has an effect even if the object only
escapes later.
[2 lines not shown]
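A minimal C++ sketch of the case being optimized (illustrative, not the patch's code): the fence cannot order any access to a local whose address never escapes, so alias analysis may report it as NoModRef for that object.

```cpp
#include <atomic>

// `local` is an identified function local whose address never escapes,
// so the seq_cst fence cannot affect it; only escaping memory needs the
// conservative read/write-all treatment.
int sumWithFence() {
  int local = 0; // never-escaping stack object
  for (int i = 0; i < 4; ++i)
    local += i;
  std::atomic_thread_fence(std::memory_order_seq_cst); // no effect on `local`
  return local; // 0 + 1 + 2 + 3
}
```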
[Dexter] Add basic structured script parsing (#193710)
See PSA:
https://discourse.llvm.org/t/psa-planned-changes-to-dexter/90402
This patch begins adding support for "structured scripts" to Dexter,
starting with some of the core classes and the ability to parse script
files. This patch does not add the ability to actually run scripts, or
any of the underlying functionality required to do so.
NB: This patch adds a dependency on PyYAML, which is specified in a new
requirements.txt file.
[mlir][dataflow] IntRange: Replace yield-based widening with per-state lattice budget (#196616)
IntegerRangeAnalysis can hang on `scf.while` loops with dynamic bounds:
a loop-carried range ratchets [0,0]->[0,1]->[0,2]->... by one per
worklist visit, requiring up to 2^31 iterations on i32. The new
`int-range-analysis-convergence.mlir` test reproduces this.
The ratchet lives at framework merge sites (region successors, callable
args) where the solver joins lattices via virtual
`Lattice::join(const AbstractSparseLattice &)`. The pre-existing
`isYieldedResult`/`isYieldedValue` heuristic in
`IntegerRangeAnalysis::visitOperation` doesn't help: it runs in the
transfer-function callback for inferrable-op results used by a
terminator, not on the merge path. It is also harmful where it fires: it
slams to maxRange on the *second* visit (after, say, [1,1]->[1,2]), so
naturally bounded accumulators (e.g. `arith.minsi`-clamped iter args)
widen to [INT_MIN, INT_MAX].
[8 lines not shown]
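A minimal sketch of the budget idea (hypothetical types, not MLIR's actual lattice API): after a fixed number of strictly growing joins, the state widens straight to the full range instead of ratcheting one step per worklist visit.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Per-state lattice budget: each strictly growing join spends one unit of
// budget; once exhausted, the range widens to the full i32 range.
struct RangeState {
  static constexpr std::pair<int64_t, int64_t> kFull = {INT32_MIN, INT32_MAX};
  std::pair<int64_t, int64_t> range{0, 0};
  bool initialized = false;
  int budget = 4;

  // Joins `other` into the state; returns true if the state changed.
  bool join(std::pair<int64_t, int64_t> other) {
    if (!initialized) {
      initialized = true;
      range = other;
      return true;
    }
    std::pair<int64_t, int64_t> merged = {std::min(range.first, other.first),
                                          std::max(range.second, other.second)};
    if (merged == range)
      return false;
    std::pair<int64_t, int64_t> old = range;
    range = (--budget <= 0) ? kFull : merged; // budget spent => widen to full
    return range != old;
  }
};

// Simulates a loop-carried value whose range grows by one per visit; it
// converges after a handful of joins rather than ~2^31 iterations.
int visitsToConverge() {
  RangeState s;
  int visits = 0;
  bool changed = true;
  while (changed) {
    int64_t next = s.initialized ? s.range.second + 1 : 0;
    changed = s.join({0, next});
    ++visits;
  }
  return visits;
}
```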
[MLIR][GPU] Add gpu-lower-to-rocdl-pipeline meta-pass (#196751)
Add `gpu-lower-to-rocdl-pipeline` meta-pass which lowers common MLIR
dialects (gpu/arith/scf/vector) to binary, similar to the existing
XeVM/NVVM pipelines.