[clang-cl][test] Use /Zs to avoid writing unnecessary output files (#204501)
#194779 adds a test clang/test/Preprocessor/init-datetime-macros.c which
verifies some diagnostics. However, it does so with `/c`, which will
unnecessarily generate an output, and when run on a build system that
does not run tests in a writeable dir by default, will cause the test to
fail.
Since we don't care about the resulting object file, use `/Zs`
(equivalent of `-fsyntax-only`) to check the diagnostics but not produce
any output files.
[offload][OpenMP] Fix record replay when no memory is used
Programs that do not use any memory (e.g., no mappings) were failing because
we were trying to execute zero-size transfers.
[RFC][BOLT] Add a new parallel DWARF processing(2/2) (#197859)
This PR implements a new parallel DWARF debug info processing pipeline
for BOLT that significantly speeds up `--update-debug-sections` for
large binaries. It is the second part of the split from the overall RFC
changes.
RFC - [[RFC][BOLT] A New Parallel DWARF Processing Approach in
BOLT](https://discourse.llvm.org/t/rfc-bolt-a-new-parallel-dwarf-processing-approach-in-bolt/90736)
(The overall changes.)
This PR does the following:
1. **Equivalence-class CU partitioning:** Replaces batchsize grouping
with union-find over DW_FORM_ref_addr references. Connected CUs share a
bucket; isolated CUs become singletons.
> For the non-LTO case, CUs have no cross-CU dependencies, so each CU is
placed into its own singleton bucket and processed fully in parallel.
> For the LTO case, CUs with cross-CU dependencies are grouped into the
same bucket and processed sequentially within that bucket, while
[7 lines not shown]
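
The equivalence-class partitioning in step 1 can be sketched roughly as follows. This is a minimal illustration, not BOLT's implementation: the function name `partition_cus`, the integer CU indices, and the edge list standing in for DW_FORM_ref_addr references are all hypothetical.

```python
from collections import defaultdict


def partition_cus(num_cus, ref_addr_edges):
    """Group CU indices into buckets; CUs linked by a cross-CU reference share one.

    ref_addr_edges is a hypothetical list of (referencing_cu, referenced_cu)
    pairs standing in for DW_FORM_ref_addr references.
    """
    parent = list(range(num_cus))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in ref_addr_edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src

    buckets = defaultdict(list)
    for cu in range(num_cus):
        buckets[find(cu)].append(cu)
    # Sort so the bucket order is deterministic.
    return sorted(buckets.values())


# Non-LTO-like input: no cross-CU references, so every CU is a singleton.
print(partition_cus(3, []))                # [[0], [1], [2]]
# LTO-like input: CU 0 -> CU 2 and CU 2 -> CU 3 end up in one bucket.
print(partition_cus(5, [(0, 2), (2, 3)]))  # [[0, 2, 3], [1], [4]]
```

In this model each returned bucket could be handed to a separate worker, with the CUs inside a bucket processed sequentially.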
[AMDGPU] Keep i64 carry chains on VCC when feeding VALU users
This PR fixes an issue where ISel could mix scalar and vector carry chains when
lowering widened integer add/sub operations. A scalar-looking i64 carry producer
may feed a divergent carry consumer, so ISel now keeps that carry chain on VCC
to avoid invalid MIR.
[LoongArch] Combine FP_TO_UINT/FP_TO_SINT with [X]VFTINTRZ instruction (#201569)
Combine double conversion to signed 32-bit integer with
`[X]VFTINTRZ_W_D` instructions.
There are three cases:
1. For VT smaller than i32, we promote it to i32 then truncate to the
final result.
2. For `fptoui double to i32`, we convert it to `fptosi double to i64`
and then truncate. We avoid doing so when LASX is enabled because the
corresponding pattern already exists in TableGen.
3. Finally, for `fptosi double to i32`, we split the input into blocks
(128-bit or 256-bit, depending on whether LASX is enabled) and feed
them into `[X]VFTINTRZ_W_D` instructions. When using the XV version, a
shuffle is needed because the data layout is per 128-bit lane.
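
The reasoning behind case 2 can be checked with scalar arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it covers only inputs in the range of an unsigned 32-bit result. Every such value also fits in a signed 64-bit integer, so converting as signed i64 and keeping the low 32 bits gives the unsigned result.

```python
import math


def fptosi_i64(x):
    """Round toward zero, as fptosi does. Only in-range inputs are modelled."""
    value = math.trunc(x)
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("out of range for i64")
    return value


def trunc_to_i32_bits(value):
    """Keep the low 32 bits, as an integer truncate from i64 to i32 does."""
    return value & 0xFFFFFFFF


def fptoui_i32_via_i64(x):
    """Hypothetical model of: fptosi double to i64, then truncate to i32."""
    return trunc_to_i32_bits(fptosi_i64(x))


# 2147483648.5 does not fit in a signed i32, but the i64 detour handles it.
print(fptoui_i32_via_i64(2147483648.5))  # 2147483648
```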
[LoopInterchange] Reject if inner loop header has duplicate successors (#204128)
Previously, loop interchange crashed in several cases where the inner
loop header had duplicate successors. In practice, the following was
happening:
- During the transformation phase, the inner loop header was not split
because its first non-PHI instruction was its terminator.
- `updateSuccessor` was called on the header with `MustUpdateOnce=true`,
which triggers an assertion failure.
This patch fixes the issue by rejecting such cases during the legality
check phase. I believe this situation is rare, so it should not
significantly affect real-world cases.
Fix #203887.
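
As a rough illustration of the rejected shape, the check amounts to asking whether the header's successor list names the same block twice. The dictionary-based CFG and the function names below are hypothetical and do not reflect the pass's actual data structures.

```python
def has_duplicate_successors(successors):
    """True if the same block appears more than once in the successor list."""
    return len(set(successors)) != len(successors)


def inner_header_is_legal(cfg, inner_header):
    """Hypothetical legality predicate: reject headers with duplicate successors."""
    return not has_duplicate_successors(cfg[inner_header])


# A conditional branch whose two targets are the same block.
cfg_bad = {"inner.header": ["inner.latch", "inner.latch"]}
# The usual shape: two distinct targets.
cfg_ok = {"inner.header": ["inner.body", "inner.exit"]}

print(inner_header_is_legal(cfg_bad, "inner.header"))  # False
print(inner_header_is_legal(cfg_ok, "inner.header"))   # True
```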
[Clang] Make the pointers to gpuintrin AS query const (#204492)
Summary:
Right now these force a const cast if the user is checking a read-only
pointer, not great.
[Instrumentor] Move NumericFlags into InstrumentorRuntimeHelper.h (#204068)
This patch makes the `NumericFlags` enum visible to the end user by
moving it into `InstrumentorRuntimeHelper.h`.
[CHERI] Fix incorrect MAX_E for RV64Y capabilities. (#204487)
Add tests for all capability formats at the upper end of their ranges, which would have caught this oversight.
[DirectX] Add DXILRemoveUnusedResources pass (#200965)
Adds `DXILRemoveUnusedResources` pass that scans the module and removes
any resource that is not used. It means that it removes calls to
`dx_resource_handlefrom{implicit}binding` whose return value is either
not used at all, or it is saved to a global variable that does not have
external linkage and is not used anywhere else in the module.
This pass needs to run before the implicit resource binding assignment pass.
The test `unused-resources-impl-binding.ll` makes sure the implicit
binding assignments are not affected by the unused resources.
Since we have many tests that are initializing resources without
actually using them, an internal option
`-disable-dxil-remove-unused-resource` has been added to `llc` so we can
keep these tests simple without adding extra code to artificially use
each resource.
Depends on #200312
Fixes #192524
[libc][math][c23] Improve rsqrtf16 function for targets without fp32 FPUs. (#160639)
Closes #159378
#### Changes
- This PR adds math approximation for targets that don't have hardware
for floats - in other words, targets that don't have
`LIBC_TARGET_CPU_HAS_FPU_FLOAT`
- This PR also introduces Google Benchmark for rsqrtf16
- Fixed typo in `+inf` case. Should return +0 according to
[F.10.4.9](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf)
[lldb] Report a generic wasm32 architecture for Wasm object files (#204496)
ObjectFileWasm hardcoded the architecture of every Wasm module as
"wasm32-unknown-unknown-wasm". A Wasm binary does not actually encode a
vendor or OS; those are properties of the runtime executing it.
When debugging via a runtime whose gdb stub reports a more specific
triple (e.g. WAMR reports "wasm32-wamr-wasi-wasm"), lldb adopts that
triple and clears the module list. The dynamic loader then tries to
reload the main executable, but GetOrCreateModule rejects the on-disk
file because the triples are incompatible. This causes lldb to fall back to
reading from memory.
Fix all this by reporting a bare "wasm32"/"wasm64" architecture instead.
[AtomicExpand] Add bitcasts when expanding store atomic vector
AtomicExpand fails for aligned `store atomic <n x T>` because it
does not find a compatible library call. This change adds appropriate
ptrtoint + bitcast so that the call can be lowered, mirroring the
load-side handling from #148900.
[SelectionDAG] Keep split vector atomic store value in a vector register
When the value of an ATOMIC_STORE has a vector type whose legalization
action is split (e.g. <4 x half>/<4 x bfloat> on X86 without F16C),
SplitVecOp_ATOMIC_STORE bitcast the value straight to a scalar integer
spanning the memory width. For a split vector that bitcast is expanded
element by element, reassembling the value in GPRs (a long pextrw/shl/or
sequence) before the store.
Instead, keep the value in a vector register when a legal vector form
exists: reinterpret it as a same-shaped integer-element vector (an FP
element type may have no legal vector form, e.g. bfloat on SSE2, while
the integer-of-element-size form does), widen that to a legal vector,
and extract the low integer element of the memory width. This issues the
store directly from a vector register (a single MOVQ/MOVD on X86),
matching the widen-path codegen already produced on AVX targets. Falls
back to the scalar bitcast when no suitable legal vector type exists.
[MLIR][XeGPU] Treat lane_data repacks as compatible layouts (#204016)
A subgroup-level convert_layout that only repacks lane_data while keeping
lane_layout unchanged (e.g. [N, 1] to [1, 1] with order = [1, 0]) is a no-op
after lane distribution: each lane owns the same elements in the same order.
Previously isCompatibleWith compared per-distribution-unit block starts, which
encode the lane_data blocking, so such layouts looked incompatible.
Handle this at the Lane level in isCompatibleWith by expanding the block
starts into per-element coordinates before comparing. The expansion only runs
when lane_data differ; otherwise the cheaper block-start comparison is exact.
The shared logic lives in a compareDistributedCoords helper used by both
LayoutAttr and SliceAttr. The Subgroup level is left for a follow-up (TODO).
Add a lit test covering the fold in sg-to-lane-distribute-unit.mlir.
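
The difference between comparing block starts and comparing per-element coordinates can be seen in a simplified model. The sketch below assumes round-robin tiling of the tensor by distribution units of size lane_layout * lane_data and row-major ordering; the function names are hypothetical and this is not the compareDistributedCoords implementation.

```python
from itertools import product


def block_starts(shape, lane_layout, lane_data, lane_id):
    """Start coordinate of the block a lane owns in each distribution unit."""
    unit = [l * d for l, d in zip(lane_layout, lane_data)]
    origins = product(*(range(0, s, u) for s, u in zip(shape, unit)))
    return [
        tuple(o + i * d for o, i, d in zip(origin, lane_id, lane_data))
        for origin in origins
    ]


def element_coords(shape, lane_layout, lane_data, lane_id):
    """Expand each block start into the coordinates of every element in it."""
    coords = []
    for start in block_starts(shape, lane_layout, lane_data, lane_id):
        for offset in product(*(range(d) for d in lane_data)):
            coords.append(tuple(s + o for s, o in zip(start, offset)))
    return coords


shape, layout, lane = (2, 16), (1, 16), (0, 3)
# Block starts differ between lane_data [2, 1] and [1, 1] ...
print(block_starts(shape, layout, (2, 1), lane))    # [(0, 3)]
print(block_starts(shape, layout, (1, 1), lane))    # [(0, 3), (1, 3)]
# ... but the lane owns the same elements in the same order.
print(element_coords(shape, layout, (2, 1), lane))  # [(0, 3), (1, 3)]
print(element_coords(shape, layout, (1, 1), lane))  # [(0, 3), (1, 3)]
```

In the same model, repacking along the dimension the lanes are spread over (for example lane_data [1, 2] versus [1, 1] on a 1x32 shape) changes which elements a lane owns, so the expanded coordinates differ and the layouts stay incompatible.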
[lldb] Skip the prologue when a function's entry has no line row (#204480)
Function::GetPrologueByteSize computed the prologue only when a line
table row contained the function's entry address (low_pc). When no row
covers low_pc it returned 0, leaving a name breakpoint sitting on the
function's entry address. For WebAssembly the entry address is the
function's locals-declaration byte rather than an instruction, so the
line table has no row there and the breakpoint is never hit.
When low_pc has no covering row, fall back to the first line row that
begins within the function's range and run the existing prologue logic
on it. For functions whose entry is already covered (all normally
compiled native code) this branch is not taken, so behavior remains
unchanged.
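
The row selection described above can be sketched as follows. This is an illustrative model, not lldb code: rows are hypothetical `(start, end, line)` tuples with half-open address ranges, and the prologue computation that would then run on the chosen row is not modelled.

```python
def find_entry_row(rows, low_pc, high_pc):
    """Pick the line row the prologue logic should start from.

    rows: list of (start, end, line) tuples sorted by start address.
    """
    # Usual case: a row covers the function's entry address.
    for row in rows:
        start, end, _line = row
        if start <= low_pc < end:
            return row
    # Fallback: the first row that begins inside the function's range.
    for row in rows:
        if low_pc <= row[0] < high_pc:
            return row
    return None


# Wasm-like function: entry at 0x10 is not covered, first row starts at 0x13.
wasm_rows = [(0x13, 0x20, 5), (0x20, 0x40, 6)]
print(find_entry_row(wasm_rows, 0x10, 0x40))    # (19, 32, 5)

# Native-like function: the entry address is covered, so no fallback is needed.
native_rows = [(0x10, 0x18, 5), (0x18, 0x40, 6)]
print(find_entry_row(native_rows, 0x10, 0x40))  # (16, 24, 5)
```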
This PR adds a hand (Claude) crafted regression test with a function
whose entry address is not covered by a line row.
[AMDGPU] Introduce TransCoexecutionHazard target feature (#204412)
TransCoexecutionHazard implies there is a data hazard between TRANS and
the following VALU instruction when they are co-executed. Currently
gfx1250 and gfx1251 have this target feature.
[AArch64][llvm] Some instructions should be `HINT` aliases (NFC)
Implement the following instructions as a `HINT` alias instead of a
dedicated instruction in separate classes:
* `stshh`
* `stcph`
* `shuh`
* `tsb`
Updated all their helper methods too, and updated the `stshh` pseudo
expansion for the intrinsic to emit `HINT #0x30 | policy`.
Code in AArch64AsmPrinter::emitInstruction identified an initial BTI using a
broad bitmask on the HINT immediate, which also matched shuh/stcph (50..52).
This could move the patchable entry label after a non-BTI instruction.
Replaced it with an exact BTI check using the BTI HINT range (32..63) and
AArch64BTIHint::lookupBTIByEncoding(Imm ^ 32).
A following change will remove duplicated code and simplify.
[2 lines not shown]
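
To see why a bitmask test can misfire on immediates 50..52 while an exact lookup does not, consider the sketch below. Both the mask and the table are assumptions for illustration: the mask is one plausible form of a "broad bitmask" and is not claimed to be the code that was replaced, and `BTI_OPERANDS` is a hypothetical stand-in for the lookup table, using the architectural encodings `bti c` = HINT #34, `bti j` = HINT #36, `bti jc` = HINT #38.

```python
# Hypothetical table keyed by (HINT immediate ^ 32).
BTI_OPERANDS = {2: "c", 4: "j", 6: "jc"}


def is_bti_broad_mask(imm):
    """Assumed example of a broad bitmask test on the HINT immediate."""
    return bool(imm & 32) and bool(imm & 6)


def is_bti_exact(imm):
    """Range check plus exact table lookup, mirroring the described fix."""
    return 32 <= imm <= 63 and (imm ^ 32) in BTI_OPERANDS


for imm in (34, 36, 38, 50, 51, 52):
    print(imm, is_bti_broad_mask(imm), is_bti_exact(imm))
```

Under these assumptions the mask accepts 50, 51, and 52 as well as the three BTI encodings, whereas the exact check accepts only 34, 36, and 38.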
[clang][AST] Fix StmtProfile handling of GCCAsmStmt asm strings and clobbers (#201481)
`VisitGCCAsmStmt` did not profile asm strings and clobbers because they
are not child statements.
As a result, different inline asm statements could produce the same
profile.
This fixes a false positive in `bugprone-branch-clone` where branches
containing inline asm were incorrectly reported as identical.
I used AI assistance when writing the test code, but I personally
reviewed it. 🤖
Fixes https://github.com/llvm/llvm-project/issues/198616