[Offload][libsycl][clang-sycl-linker] Simplify SYCL Offload wrapping (#193876)
Replace the __sycl_tgt_bin_desc/__sycl_tgt_device_image-based fat binary
registration with a simpler OffloadBinary-native approach:
- __sycl_register_lib/__sycl_unregister_lib now take (BinaryStart, Size)
instead of a __sycl_tgt_bin_desc pointer; __sycl_unregister_lib only
needs BinaryStart since the runtime looks up the binary by its start
address (see the sketch below).
- OffloadWrapper's SYCL wrapping is significantly simplified: the
__tgt_bin_desc/__tgt_device_image structs and the descriptor
construction code are replaced by a single embedded OffloadBinary blob
passed directly to the register/unregister entry points.
- clang-sycl-linker generates a single OffloadBinary, which contains
multiple images.
- ProgramAndKernelManager::registerFatBin parses the blob via
OffloadBinary::create, keying MDeviceImageManagers by BinaryStart to
eliminate the reparse on unregister.
- DeviceImageManager owns std::unique_ptr<OffloadBinary> instead of
[8 lines not shown]
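A minimal sketch of the new entry-point shape described in the first bullet above; the exact parameter types in the SYCL runtime may differ:
```
#include <cstddef>

extern "C" {
// Registers a single embedded OffloadBinary blob with the SYCL runtime.
void __sycl_register_lib(const void *BinaryStart, std::size_t Size);
// The runtime keys registered binaries by their start address, so
// unregistering only needs BinaryStart.
void __sycl_unregister_lib(const void *BinaryStart);
}
```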
[CIR] Eliminate SymbolTable::lookupSymbolIn hotspots (#193362)
mlir::SymbolTable::lookupSymbolIn is O(n) per lookup, so cumulative
symbol lookups during CIRGen are O(n^2) in the number of global symbols.
On template-heavy translation units this becomes a significant
compile-time hotspot.
Replace the SymbolTable lookup path with a per-CIRGenModule DenseMap
cache keyed by symbol name, giving O(1) lookups.
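A hedged sketch of the caching pattern (the member and helper names are illustrative, not the actual CIRGenModule fields):
```
#include "llvm/ADT/DenseMap.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"

struct GlobalSymbolCache {
  // Keyed by symbol name; assumes the key strings outlive the cache.
  llvm::DenseMap<llvm::StringRef, mlir::Operation *> Cache;

  mlir::Operation *lookup(mlir::ModuleOp Mod, llvm::StringRef Name) {
    auto It = Cache.find(Name);
    if (It != Cache.end())
      return It->second; // O(1) on every repeat lookup
    // Miss: do the O(n) walk once, then memoize the result.
    mlir::Operation *Op =
        mlir::SymbolTable::lookupSymbolIn(Mod.getOperation(), Name);
    Cache[Name] = Op;
    return Op;
  }
};
```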
On a synthetic template-heavy stress test, end-to-end compile time on
`clang -fclangir -S -emit-llvm -O0` improves by ~11% on a 33K-LOC input
(5.86s -> 5.21s) and ~16% on a 67K-LOC input (16.09s -> 13.52s). The
super-linear growth of the win with input size confirms the O(n^2) ->
O(n) effect.
As with the previous compile-time fix, the repro has the same shape (scale
records and template instantiations into the hundreds/thousands to amplify):
[7 lines not shown]
[VPlan] Pass TTI + CostKind to spillCost instead of CostCtx (NFC) (#194417)
Instead of passing a VPCostContext, directly pass the needed TTI and
CostKind. This makes the function easier to reuse in other places without
having to construct a VPCostContext.
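Illustrative before/after signatures (not the exact in-tree declarations):
```
#include "llvm/Analysis/TargetTransformInfo.h"

struct VPCostContext; // heavyweight context callers no longer need to build

// Before: callers had to construct a full VPCostContext.
llvm::InstructionCost spillCost(const VPCostContext &CostCtx);

// After: pass only the pieces the function actually uses.
llvm::InstructionCost
spillCost(const llvm::TargetTransformInfo &TTI,
          llvm::TargetTransformInfo::TargetCostKind CostKind);
```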
Split off from https://github.com/llvm/llvm-project/pull/194267 as
suggested.
[NVPTX] Add reverseBranchCondition and CBranch inverted flag (#191889)
Add a flag to the `CBranch` instruction for inverted predicate branches (`@!p
bra`) and implement `reverseBranchCondition` to support branch condition
inversion.
This enables passes like branch folding to properly reverse branch
conditions, and is a prerequisite for SETP predicate inversion CSE.
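A hedged sketch of the `reverseBranchCondition` pattern (the operand layout of the analyzed branch condition is an assumption, not NVPTX's actual encoding):
```
#include <cassert>
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineOperand.h"

bool reverseBranchCondition(llvm::SmallVectorImpl<llvm::MachineOperand> &Cond) {
  // Assumed layout: {predicate register, inverted flag}. Toggling the flag
  // turns `@p bra` into `@!p bra` and back.
  assert(Cond.size() == 2 && "expected predicate + inverted flag");
  Cond[1].setImm(Cond[1].getImm() ? 0 : 1);
  return false; // false == condition successfully reversed
}
```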
Assisted-by: Cursor / Claude
[Sema] Enforce parameter match for ownership_returns attribute (#192339)
Previously parsing multiple ownership_returns attributes with different
arguments could lead to a crash. The documentation states that if
forward declarations have ownership_returns, they must have the same
arguments, and it may appear at most once per declaration.
This patch ensures that if multiple ownership_returns attributes are
present, their arguments (the identifier and the optional index) must
match exactly. The diagnostic err_ownership_param_mismatch is introduced
for clarity.
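An illustrative example of the rule now enforced (function and module names are hypothetical):
```
#include <stddef.h>

void *my_alloc(size_t n) __attribute__((ownership_returns(my_module, 1)));
// OK: redeclaration with identical arguments.
void *my_alloc(size_t n) __attribute__((ownership_returns(my_module, 1)));
// error: err_ownership_param_mismatch (previously this could crash):
// void *my_alloc(size_t n) __attribute__((ownership_returns(my_module, 2)));
```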
Test cases for f15, C::f, and the newly added f22 were also updated to
match the requirement that all declarations with ownership_returns have
the same arguments, using the err_ownership_param_mismatch diagnostic.
Fixes #188733
[CIR] Set the static_local attribute if needed when initializing (#194094)
There was a case where we were creating a GetGlobalOp when initializing
a static local variable that required a guard variable but failing to
set the static_local attribute on the GetGlobalOp. This led to a CIR
verification error. This change sets the attribute when it is needed.
Assisted-by: Cursor / claude-4.7-opus-high
[CodeGen] Change -O0 bool load codegen to have nonzero model (#193783)
The main follow-up item to
https://github.com/llvm/llvm-project/pull/160790 was changing -O0
codegen to convert in-memory i8 bool values to i1 with the `nonzero`
rule (`icmp ne i8 %val, 0`) rather than the `truncate` rule (`trunc i8
%val to i1`).
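Sketched as C++ source with the corresponding -O0 IR in comments (value names illustrative):
```
bool load_bool(bool *p) { return *p; }
//   %v = load i8, ptr %p
//   %b = icmp ne i8 %v, 0   ; new "nonzero" rule: any nonzero byte is true
// Previously -O0 used the "truncate" rule, which keeps only the low bit:
//   %b = trunc i8 %v to i1
```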
Bool values can only be `true` or `false`. While they are notionally a
single bit, the smallest addressable unit is CHAR_BIT bits wide, and
CHAR_BIT is typically 8. Programming errors (such as memcpying a random
byte to a `bool`) can cause the 8-bit storage for a `bool` value to have
a bit pattern that is different from `true` or `false`, which then leads
to undefined behavior.
Clang has historically taken advantage of this in optimized builds
(everything other than -O0) by attaching range metadata to `bool` loads
to assume that the value loaded can only be 0 or 1. This leads to
exploitable security issues, and the correct behavior is not always easy
[31 lines not shown]
fix preserve_none in X86 backend (#192300)
This fixes a bug in the X86 backend affecting `preserve_none` at -O1,
where clang emits:
```
tail call preserve_nonecc void @_Z14func(ptr noundef nonnull byval(%struct.S) align 8 %1)
```
The `preserve_none` + `byval` never emitted the required caller-side
copy of the pointee. The callee received a pointer directly to the
original memory, violating the byval contract from LangRef: "a hidden
copy of the pointee is made between the caller and the callee, so the
callee is unable to modify the value in the caller".
The root cause is that CC_X86_64_Preserve_None in X86CallingConv.td had
no byval handling. As a result, byval arguments were falling through to
the register assignment rules and being assigned to registers instead of
stack slots.
The bug is fixed by one line in CC_X86_64_Preserve_None:
[7 lines not shown]
[DAG] Precommit tests for computeKnownFPClass - ISD::EXTRACT_SUBVECTOR and ISD::INSERT_SUBVECTOR. (#190694)
This patch adds baseline tests for `ISD::EXTRACT_SUBVECTOR` and `ISD::INSERT_SUBVECTOR` in `computeKnownFPClass`, ahead of #190378.
[flang][OpenMP] Move some utility functions from Semantics to Parser, NFC (#194434)
Functions that only operate on AST nodes and do not require any semantic
information belong in the parser library.
[CIR] Implement static-local-tls lowering (#194059)
thread_local variables at function scope work very much like
static-locals, except with slightly different lowering from
cir-lowering-prepare. This patch implements that lowering. Global tls
variables are left to a later patch.
One decision I made here was that LocalInitOp lost its
'static-local'-ness and now assumes it is always static-local. Global TLS
is probably just going to use Global directly, so we don't need to
permit it.
I DID leave it in the printing, as it makes what is happening clearer and
keeps symmetry with get_global/global.
---------
Co-authored-by: Henrich Lauko <henrich.lau at gmail.com>
[CIR] Implement declref-lvalue lambda lowering (#194409)
This NYI showed up a few times and is pretty easy to get to; the
implementation is equally trivial.
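A minimal C++ case of the shape involved (a guess at a trigger: a DeclRefExpr to a capture used as an lvalue inside the lambda body):
```
int demo() {
  int x = 1;
  auto f = [&x] { x += 1; return x; }; // captured variable used as an lvalue
  return f();
}
```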
---------
Co-authored-by: Andy Kaylor <akaylor at nvidia.com>
[DataLayout] Add null pointer value infrastructure
Add support for specifying the null pointer bit representation per address space
in DataLayout via new pointer spec flags:
- 'z': null pointer is all-zeros
- 'o': null pointer is all-ones
When neither flag is present, the address space inherits the default set by the
new 'N<null-value>' top-level specifier ('Nz' or 'No'). If that is also absent,
the null pointer value is zero.
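A sketch of that resolution order (hypothetical helper, not the real DataLayout API):
```
#include <optional>

enum class NullValue { AllZeros, AllOnes };

NullValue nullValueFor(std::optional<NullValue> PerASFlag,     // 'z' / 'o'
                       std::optional<NullValue> ModuleDefault) // 'Nz' / 'No'
{
  if (PerASFlag)
    return *PerASFlag; // explicit per-address-space flag wins
  if (ModuleDefault)
    return *ModuleDefault; // otherwise the 'N<null-value>' default
  return NullValue::AllZeros; // otherwise null is all-zeros
}
```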
No target DataLayout strings are updated in this change. This is pure
infrastructure for a future ConstantPointerNull semantic change to support
targets with non-zero null pointers (e.g. AMDGPU).
[SLP] Use VF for build-vector slice size when GatheredScalars were resized
After ExtractShuffles processing resizes GatheredScalars to VF and
recomputes NumParts, isGatherShuffledEntry populates Mask of size VF
and partitions it according to the post-resize NumParts. The previous
SliceSize formula based on E->Scalars.size() no longer matches that
layout and triggers an out-of-bounds ArrayRef::slice while iterating
gather-shuffle entries. Use getPartNumElems(VF, NumParts) by default;
fall back to the original-scalars formula only when
Mask.size() == E->Scalars.size() (no resize happened).
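A hedged sketch of that selection (simplified; the exact in-tree expressions in SLPVectorizer.cpp differ):
```
// As in SLPVectorizer.cpp: splits Size across NumParts.
unsigned getPartNumElems(unsigned Size, unsigned NumParts);

unsigned selectSliceSize(unsigned MaskSize, unsigned NumScalars, unsigned VF,
                         unsigned NumParts) {
  // No resize happened: the original-scalars formula still matches the mask.
  if (MaskSize == NumScalars)
    return getPartNumElems(NumScalars, NumParts);
  // GatheredScalars were resized to VF: partition the mask by VF instead.
  return getPartNumElems(VF, NumParts);
}
```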
Fixes https://github.com/llvm/llvm-project/pull/194189#pullrequestreview-4182214738
Pull Request: https://github.com/llvm/llvm-project/pull/194455
[Clang] Emit LLVM flatten attribute instead of per-callsite alwaysinline (#188615)
Follow-up to #174899 which added the flatten function attribute to LLVM
IR and implemented recursive inlining in the `AlwaysInliner` pass.
This patch updates Clang to emit the LLVM flatten attribute on functions
with `__attribute__((flatten))`, instead of the previous approach of
marking each call site with `alwaysinline`. This completes the
transition to matching GCC's flatten semantics.
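Source-level view of the change (function names illustrative):
```
void inner();

__attribute__((flatten)) void outer() { inner(); }
// Before: each call site inside outer() was annotated 'alwaysinline'.
// After:  outer() itself carries the LLVM 'flatten' attribute and the
//         AlwaysInliner pass inlines its call tree recursively, matching GCC.
```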
Changes:
- Remove the callsite `alwaysinline` annotation logic from CGCall.cpp
- Emit the flatten function attribute in CodeGenModule.cpp
- Update clang/test/CodeGen/flatten.c to reflect the new IR output
- Update clang/test/CodeGen/AArch64/sme-inline-callees-streaming-attrs.c
to reflect the new behavior
- Add release notes documenting the behavior change
RFC:
https://discourse.llvm.org/t/rfc-function-level-flatten-depth-attribute-for-depth-limited-inlining
Allow DIBasicType (and others) to have a scope and location (#190217)
DIBasicType derives from DIType and so it already has slots to store the
scope and location of the type. However, the DIBasicType constructor
(and corresponding DIBuilder method) does not expose this, presumably
because it is not needed by C or C++.
In Ada it is more common to create one's own basic types, so exposing
this functionality there would be handy.
This patch adds a new overload of DIBuilder::createBasicType to allow
this. Because only Ada uses fixed-point types, the patch simply updates
these DIBuilder methods unconditionally.
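A hedged sketch of using the new overload (the added parameters and their order are an assumption based on the description above, not the exact signature):
```
#include "llvm/BinaryFormat/Dwarf.h"
#include "llvm/IR/DIBuilder.h"

llvm::DIBasicType *makeScopedBasicType(llvm::DIBuilder &DIB,
                                       llvm::DIScope *Scope,
                                       llvm::DIFile *File) {
  // Assumed: the new overload threads Scope/File/Line through to DIBasicType.
  return DIB.createBasicType("Small_Int", /*SizeInBits=*/16,
                             llvm::dwarf::DW_ATE_signed,
                             llvm::DINode::FlagZero, Scope, File,
                             /*LineNo=*/3);
}
```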
Note that DwarfUnit already handles this scenario and so no changes were
needed there.
[PowerPC] Further refactor atomic loads
Depending on the availability of the word-part feature, different code
is generated for 1- and 2-byte atomic loads. This change moves the decision
to use the word-part feature from C++ into TableGen patterns. This is done
via:
- move code from `EmitPartwordAtomicBinary()` into the new function
`signExtendOperandIfUnknown()`
- decouple functions `EmitPartwordAtomicBinary()` and `EmitAtomicBinary()`
- remove the size from the name of the pseudo instructions; instead,
introduce a pseudo instruction which is used in case the word-part
feature is missing
- update the handling of the pseudo instruction insertion accordingly
A side effect of this change is that the implementation requires 11 fewer
pseudo instructions.
[AMDGPU] Validate user SGPR count against HW range, not field width
The previous validation checked only the field width,
allowing values that exceeded the actual hardware limits
(e.g. 0–16 on gfx6-gfx120 and 0–32 on gfx125x) as long
as they fit in the bit width.
Tighten validation to reject out-of-range user SGPR counts.
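A sketch of range-based validation (limits taken from the text above; the helper and its target check are illustrative):
```
bool isValidUserSGPRCount(unsigned Count, bool IsGfx125x) {
  // Check against the hardware limit, not merely what fits in the
  // encoding field.
  unsigned MaxUserSGPRs = IsGfx125x ? 32 : 16;
  return Count <= MaxUserSGPRs;
}
```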