[CIR][OpenCL] Add kernel argument metadata attribute (#199530)
Add a CIR attribute that carries OpenCL kernel argument metadata in
source argument order. Verify that each metadata field has the expected
element type and that all present arrays describe the same number of
arguments.
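The count check described above can be sketched in plain C++. This is
only an illustration: `KernelArgMetadata`, its field names, and
`haveConsistentArgCounts` are hypothetical stand-ins, not the CIR
attribute or its verifier. Element types are enforced here by the
static types of the vectors.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the attribute; field names are illustrative only.
struct KernelArgMetadata {
  std::optional<std::vector<int>> addrSpaces;
  std::optional<std::vector<std::string>> accessQuals;
  std::optional<std::vector<std::string>> typeNames;
  std::optional<std::vector<std::string>> argNames;
};

// Returns true when every array that is present has the same length.
bool haveConsistentArgCounts(const KernelArgMetadata &md) {
  std::optional<std::size_t> expected;
  auto check = [&expected](const auto &field) {
    if (!field)
      return true; // absent arrays impose no constraint
    if (!expected)
      expected = field->size(); // first present array fixes the count
    return *expected == field->size();
  };
  return check(md.addrSpaces) && check(md.accessQuals) &&
         check(md.typeNames) && check(md.argNames);
}
```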
[AMDGPU] Add a few wmma co-execution hazard checks, NFC (#203658)
This is to reflect the gfx1251 update regarding wmma*8f6f4 with
matrix format as F4.
Also fix a comment in GCNHazardRecognizer.cpp.
[Clang] Add AttrDocs entry for OverflowBehavior (#203392)
These docs were previously missing.
Fixes: #203322
Signed-off-by: Justin Stitt <justinstitt at google.com>
[libc++] Fix bug where `optional<T&>` couldn't be constructed from `transform()` (#203462)
- Add the proper from monadic base constructor
- Fix the constraint so it allows references.
- Add tests
[libc++] P3369R0: constexpr for `uninitialized_default_construct` (#200163)
Remarks:
- Tests also verify that `uninitialized_default_construct(_n)`
algorithms do not initialize trivially default-constructible elements
(`int` in these tests) to determined values during constant evaluation.
[GlobalISel] Fix sign-extended byte mask in lowerBswap (#199387)
The per-byte mask in `LegalizerHelper::lowerBswap` was constructed via
```
APInt APMask(SizeInBytes * 8, 0xFF << (i * 8));
```
where `0xFF << (i * 8)` is evaluated as a signed `int`. For `i*8 >= 24`
(byte-3 mask of an s64 G_BSWAP) the value `0xFF000000` does not fit in a
positive 32-bit `int`; the conversion to signed `int` is
implementation-defined under C++17 (UB under C++11, fully defined under
C++20) and on two's-complement targets produces `-16777216`. The modular
conversion to `uint64_t` in the `APInt` constructor then materializes
that negative `int` as `0xFFFFFFFFFF000000` — the intended mask was
`0x00000000FF000000`. The over-wide mask preserved bytes 4-7 of the
source where only byte 3 was intended, and the spurious bytes propagated
through the subsequent shift/OR chain.
[3 lines not shown]
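The sign-extension effect can be reproduced without `APInt`. The
sketch below is an assumption-laden illustration: `byteMaskBuggy` and
`byteMaskFixed` are hypothetical names, and shifting in an unsigned
64-bit type is just one way to avoid the problem, not necessarily what
the patch does.

```cpp
#include <cstdint>

// Mimics the buggy pattern: the shift is done in `int`, then widened.
// Only meaningful for i <= 3; larger shifts of a 32-bit int are undefined.
std::uint64_t byteMaskBuggy(unsigned i) {
  // For i == 3 the result does not fit in a positive int; on
  // two's-complement targets it becomes -16777216.
  int shifted = 0xFF << (i * 8);
  // Modular conversion of the negative int sets the upper 32 bits.
  return static_cast<std::uint64_t>(shifted);
}

// One possible fix (assumed): widen first, then shift.
std::uint64_t byteMaskFixed(unsigned i) {
  return std::uint64_t{0xFF} << (i * 8);
}
```

With `i == 3` the first helper yields the over-wide mask and the second
yields the single-byte mask quoted in the description.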
[VectorCombine] Use TCK_CodeSize for size-optimized functions (#202207)
VectorCombine currently uses `TCK_RecipThroughput` for all functions,
including functions optimized for size.
Select `TCK_CodeSize` when `Function::hasOptSize()` is true, covering
both `-Os` (`optsize`) and `-Oz` (`minsize`), while retaining
`TCK_RecipThroughput` for the default optimization mode.
The X86 regression test demonstrates a sign-bit reduction where the
throughput cost model folds an `or` reduction into a `umax` reduction.
The code-size model preserves the smaller form for `optsize` and
`minsize` functions, while the default function retains the existing
throughput-oriented transformation.
Fixes #153375.
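The selection rule above can be sketched as a small standalone model.
`TargetCostKind`, `FunctionInfo`, and `selectCostKind` are hypothetical
stand-ins for the LLVM types, used only to show the decision. The model
assumes, as stated above, that `hasOptSize()` is true for both `optsize`
and `minsize`.

```cpp
// Hypothetical mirror of the two cost kinds involved.
enum class TargetCostKind { RecipThroughput, CodeSize };

// Hypothetical stand-in for the function attributes that matter here.
struct FunctionInfo {
  bool optsize = false;
  bool minsize = false;
  // Models Function::hasOptSize(): true for optsize and for minsize.
  bool hasOptSize() const { return optsize || minsize; }
};

// Size-optimized functions get the code-size model; others keep throughput.
TargetCostKind selectCostKind(const FunctionInfo &f) {
  return f.hasOptSize() ? TargetCostKind::CodeSize
                        : TargetCostKind::RecipThroughput;
}
```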
[HLSL][NFC] Move HLSLBufferCopyEmitter class (#203595)
Move `HLSLBufferCopyEmitter` class to the anonymous namespace at the top
of `CGHLSLRuntime.cpp` and use it directly from
`CGHLSLRuntime::createBufferMatrixTempAddress` instead of going through
the `CGHLSLRuntime::emitBufferCopy` call. No changes were made to the
`HLSLBufferCopyEmitter` code.
This is preparation for work related to resources in cbuffer structs
which will be changing the signature of `CGHLSLRuntime::emitBufferCopy`
and modifying the `HLSLBufferCopyEmitter`.
[RISCV] Add PseudoClearGPR to the special cases in RISCVInstrInfo::getInstSizeInBytes. (#203637)
This instruction is expanded to an ADDI with an immediate of 0, which
should then be compressed to c.li with Zca. The compression code cannot
see this because the instruction is still a pseudo, so manually give it
a size of 2 for Zca.
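As a rough standalone model of the special case (the function name is
hypothetical and this is not the RISCVInstrInfo code), the size depends
only on whether Zca is available:

```cpp
// Hypothetical model: size in bytes reported for PseudoClearGPR.
// With Zca the expanded ADDI is assumed to compress to c.li (2 bytes);
// without it the uncompressed ADDI takes 4 bytes.
unsigned pseudoClearGPRSizeInBytes(bool hasZca) {
  return hasZca ? 2u : 4u;
}
```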
[RISCV] Mark HW shadow stack ops as frame setup/destroy (#203362)
This change follows up on PR #200182 and addresses the issue in the
[related
comment](https://github.com/llvm/llvm-project/pull/200182#discussion_r3329197379).
It sets `FrameSetup` on SSPUSH/C_SSPUSH and `FrameDestroy` on SSPOPCHK
instructions emitted by RISCVFrameLowering for the HW shadow stack path.
The test was written manually (instead of using
`utils/update_mir_test_checks.py`) to keep it simple and avoid
unnecessary fragility.
[NFC][LLVM] Refactor IIT_ANY payload for vector/element constraint (#203506)
Change `IIT_ANY` payload from a single packed OverloadIndex + AnyKind
byte to 2 bytes:
- An 8-bit OverloadIndex
- An 8-bit packed vector + element type constraint.
This will enable `IIT_ANY` to express constraints on the overload type
in a more general fashion compared to a flat `AnyKind` enum.
Also fixed a latent bug in fixed encodings generated by the intrinsic
emitter (exposed by this change). Existing `encodePacked` packs the
type-signature as 8 nibbles into a 32-bit word and then checks if the
MSB bit position (i.e., bit 15) is 0 (to allow its use in fixed
encoding). This effectively drops any zero-valued bytes in the encoding
in the upper 4 nibbles. Fix this by changing `encodePacked` to use the
actual fixed encoding type and its size.
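The two-byte `IIT_ANY` payload described above could be modeled as
follows. The split of the second byte into a 4-bit vector constraint and
a 4-bit element constraint is an assumption made for illustration, as
are all names; the commit message only states that the byte is packed.

```cpp
#include <array>
#include <cstdint>

// Hypothetical decoded form of the payload.
struct AnyPayload {
  std::uint8_t overloadIndex; // full 8 bits
  std::uint8_t vectorKind;    // assumed to fit in 4 bits
  std::uint8_t elementKind;   // assumed to fit in 4 bits
};

// Byte 0: OverloadIndex. Byte 1: vector kind (high nibble, assumed)
// combined with element kind (low nibble, assumed).
std::array<std::uint8_t, 2> encodeAnyPayload(const AnyPayload &p) {
  auto packed = static_cast<std::uint8_t>(((p.vectorKind & 0x0F) << 4) |
                                          (p.elementKind & 0x0F));
  return {p.overloadIndex, packed};
}

AnyPayload decodeAnyPayload(const std::array<std::uint8_t, 2> &bytes) {
  return {bytes[0], static_cast<std::uint8_t>(bytes[1] >> 4),
          static_cast<std::uint8_t>(bytes[1] & 0x0F)};
}
```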
[flang][CUDA] Keep host literals from using unified-memory generic distance (#201257)
Fix CUDA generic resolution under `-gpu=mem:unified` so unattributed
literals and expression temporaries are not treated as unified-memory
actuals.
Previously, a host scalar literal such as `1.0` could score as
compatible with a `DEVICE` dummy and incorrectly select the
device-scalar overload. This could pass a host stack address to a device
helper and fail at runtime. The fix applies the unified/managed memory
distance columns only to symbol-backed actuals.
[flang][cuda] Fix host loads from CUDA constant globals (#203064)
This fixes CUDA Fortran lowering for scalar module variables with the
constant attribute that are read from host code, such as launch
configuration expressions or CUF kernel loop bounds.
Previously, host-side declarations for these globals could be rewritten
to device constant-memory addresses, causing host loads to dereference
the result of _FortranACUFGetDeviceAddress. The fix preserves host reads
from the host-visible global while still using the device address for
host-to-device assignment updates.
A FIR regression test covers host reads and assignment updates for
scalar CUDA constant globals.
[BOLT] Change DataAggregator error types (#203651)
1. In `filterBinaryMMapInfo`, replace `inconvertibleErrorCode` with an
errc code, as `parseMainEvents` converts the returned Error to
std::error_code.
2. In `parsePerfData`, pass through Error returned by `prepareToParse`
for memory events.
Test Plan: updated perf_test.test
[libc] fix EAGAIN being treated as timeout in mutex and rwlock (#203574)
Fixes #203411.
This PR addresses the problem that `EAGAIN` may be treated as timeout in
mutex and rwlock. Two changes are applied:
1. timeout sites always explicitly check for timeout now to make the
logic more robust;
2. the futex wait now discards the error of `EAGAIN/EWOULDBLOCK` and
returns 0;
We don't distinguish waking up from a signal and waking up from a
mismatch for the following 3 reasons:
- We have a userspace guard to avoid the futex syscall if we already
know the value would match. It seems awkward to make that check return
an error, as we may wake up and loop back to the check, where the signal
is consumed but we still return an error;
- futex syscall can spuriously wake up anyway, there is no way to tell
[3 lines not shown]
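The two changes listed above can be sketched as a pair of helpers. This
is a simplified model with hypothetical names, not the libc code: the
first helper discards `EAGAIN`/`EWOULDBLOCK` from the futex wait, and
the second reports a timeout only when the error is explicitly
`ETIMEDOUT`.

```cpp
#include <cerrno>

// Hypothetical helper: maps a raw futex-wait error to what the lock sees.
// A value mismatch (EAGAIN/EWOULDBLOCK) is treated like a normal wake-up.
int normalizeFutexWaitError(int err) {
  if (err == EAGAIN || err == EWOULDBLOCK)
    return 0;
  return err;
}

enum class WaitOutcome { Retry, TimedOut };

// Hypothetical timeout site: only an explicit ETIMEDOUT counts as a
// timeout; every other result loops back and re-checks the lock state.
WaitOutcome classifyWait(int rawErr) {
  int err = normalizeFutexWaitError(rawErr);
  return err == ETIMEDOUT ? WaitOutcome::TimedOut : WaitOutcome::Retry;
}
```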
QuantileType bytecode patch (#203495)
Since this PR (https://github.com/llvm/llvm-project/pull/190321) was
merged, some issues have been identified, such as QuantileType not being
added in the ByteCode files. This PR fixes these missing pieces, which
should make QuantileType a complete and functional type.