[clang] fix getTemplateInstantiationArgs
This implements a new strategy for collecting the template arguments, by
relying on the qualifiers and template parameter lists to navigate the template
context of out-of-line definitions.
This greatly simplifies the signature of that function by removing a bunch
of workarounds and simplifying a couple that weren't removed yet.
Since this now relies on qualifiers and template parameter lists,
this patch expends most of its effort making sure these are placed,
transformed and propagated to template instantiations.
Also stops the explicit specialization AST nodes from abusing the template
parameter lists to store their own template parameter list; they now get a
dedicated field for it, similar to partial specializations.
[coro] Use C calling convention for C++20 coroutines (#198943)
Change the calling convention for resume / destroy functions of C++
coroutines from `fastcc` to the C calling convention.
The resume / destroy functions are exposed as part of the coroutine ABI
and must be compatible with other compilers and other versions of LLVM.
However, `fastcc` is an LLVM-internal, unstable calling convention.
In practice, fastcc and the C calling convention are in sync for `void
func(void*)` function signatures on almost all platforms. Therefore, I
think we can still do this change without widespread ABI breakage.
`fastcc` and `ccc` do differ for i686 (x86-32), MIPS O32, PowerPC64
ELFv1 and Lanai. As far as I know, those are all legacy ABIs, and a recent
feature like C++20 coroutines is unlikely to be used by projects still
targeting legacy ABIs.
Historical context: I tried to figure out why `fastcc` was used. It is
[6 lines not shown]
[PGO][AMDGPU] Add basic HIP offload PGO support (#177665)
Provide the minimum HIP/offload path for device profile collection and
merge on HIP before layering profile-format and uniformity-specific
changes separately.
This adds the ROCm collection runtime, hooks device collection into the
host write-file path, lowers AMDGPU instrumentation to
__llvm_profile_instrument_gpu with regular counters, and disables GPU
indirect-call value profiling.
Clarify dynamic metadirective selection lowering
Explain that statically applicable variants are ranked before dynamic
user conditions. When a dynamic condition is selected, it is lowered to a
runtime branch whose else region continues selection among the remaining
candidates.
Add a begin/end variant test that includes clauses, and tighten checks
for the empty `nothing` fallback.
Place dynamic condition cleanups before branching
A dynamic user condition can create expression temporaries before the
selected variant is lowered. For example, a metadirective condition
such as:
when(user={condition(getbool("hello"))}: barrier)
passes a character literal through an associated temporary. That
temporary belongs to evaluating the condition, so it must be cleaned
up before lowering enters the generated fir.if that selects between
variants.
Finalize the statement context after evaluating the condition and
before creating the branch. Keep the condition expression and source
location together as DynamicUserCondition, use that source for
generated operations, and add a regression for the temporary-producing
condition case.
[AMDGPU][True16] Upstream some True16 test runlines
This testing batch preempts a change to G_MERGE_VALUES in True16 and will help
demonstrate the changes. The runlines all currently fail, so they are commented out.
[gsymutil] Disable readahead in `GsymReader::openFile()` (#199230)
`GsymReader::lookup()` has a random access pattern (i.e. it binary-searches
an address, then spot-loads/parses info from the rest of the GSYM data).
Kernel readahead strategies (which are enabled by default) don't
necessarily improve (and may degrade) performance. This patch disables
readahead.
In a production system, a similar change yielded a 5% improvement in IOPS
and data reads. An offline performance test on a Linux machine shows
similar results: it reduces total data read by 14.3%, CPU % by 3.5%, and
wall time by 2.9% (while adding 9.4% more page faults). The reduction in
total data read and CPU % may help the performance of a heavily loaded
production system.
```
┌────────────────┬─────────────┬─────────┬────────┐
│ Metric │ MADV_RANDOM │ Default │ Diff │
├────────────────┼─────────────┼─────────┼────────┤
│ Wall (s) │ 0.286 │ 0.294 │ -2.9% │
[18 lines not shown]
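For the gsymutil readahead change above, the general technique of marking a mapping as randomly accessed can be sketched with plain POSIX calls. This is only an illustration and not the code in `GsymReader::openFile()`: `RandomAccessMapping`, `mapForRandomAccess`, and `unmap` are hypothetical names, and the sketch assumes a Linux-like system where `madvise` and `MADV_RANDOM` are available.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical holder for a read-only mapping; not an LLVM type.
struct RandomAccessMapping {
  const unsigned char *data = nullptr;
  std::size_t size = 0;
  bool advised = false; // true if the kernel accepted MADV_RANDOM
};

// Map a file read-only and tell the kernel that accesses will be random,
// which asks it not to read ahead around each page fault.
inline RandomAccessMapping mapForRandomAccess(const char *path) {
  assert(path != nullptr && "path must not be null");
  RandomAccessMapping m;
  int fd = ::open(path, O_RDONLY);
  if (fd < 0)
    return m;
  struct stat st;
  if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
    ::close(fd);
    return m;
  }
  std::size_t len = static_cast<std::size_t>(st.st_size);
  void *addr = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
  ::close(fd); // the mapping stays valid after the descriptor is closed
  if (addr == MAP_FAILED)
    return m;
  m.advised = ::madvise(addr, len, MADV_RANDOM) == 0;
  m.data = static_cast<const unsigned char *>(addr);
  m.size = len;
  return m;
}

// Release the mapping and reset the holder.
inline void unmap(RandomAccessMapping &m) {
  if (m.data)
    ::munmap(const_cast<unsigned char *>(m.data), m.size);
  m = RandomAccessMapping{};
}
```

The advice is a hint only, so the sketch records whether it was accepted instead of treating a failure as an error.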
[DirectX] Add an "offset" operand to llvm.dbg.value (#197478)
The offset operand was removed in abe04759a6, so we need to bring it back
for DXIL. If the offset is not specified, it should be zero.
---------
Co-authored-by: Andrew Savonichev <andrew.savonichev at gmail.com>
[lldb][Darwin] Read Mach-O binaries out of memory more efficiently (#200072)
When lldb needs to read a Mach-O binary out of memory, it first reads
512 bytes to get the mach header, which includes the size of the load
commands, and then does a second read to get the mach header and load
commands.
I am changing the initial read to get 3192 bytes, which will include the
full load commands for most binaries.
In April I changed debugserver to return the correct size of the mach
header and load commands in a `sizeof_mh_and_loadcmds` key. If this
number is provided, refine the amount we read to this size.
This reduces the number of memory read packets we issue from 2 to 1 for
a memory module, outside of packets that may be needed to get the symbol
table.
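The read-size selection described in this commit can be modelled in a few lines. This is a simplified sketch, not lldb's code: `chooseInitialReadSize`, `readsNeeded`, and `kDefaultInitialRead` are hypothetical names, and only the 3192-byte default and the `sizeof_mh_and_loadcmds` hint come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Default size of the first read, as stated in the commit message.
constexpr std::size_t kDefaultInitialRead = 3192;

// Hypothetical helper: if the stub reported the exact size of the mach
// header plus load commands, read exactly that; otherwise use the default.
inline std::size_t
chooseInitialReadSize(std::optional<std::size_t> sizeofMhAndLoadCmds) {
  return sizeofMhAndLoadCmds.value_or(kDefaultInitialRead);
}

// Hypothetical helper: a second read is needed only when the header and
// load commands do not fit in what the first read returned.
inline int readsNeeded(std::size_t headerAndLoadCmdsSize,
                       std::size_t initialRead) {
  assert(initialRead > 0 && "the first read must request some bytes");
  return headerAndLoadCmdsSize <= initialRead ? 1 : 2;
}
```

With the hint present, the first read always covers the full header and load commands, so the model never needs a second read.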
[LifetimeSafety] Propagate inner origins through std::move and related casts (#199600)
std::move and related casts (std::forward, std::forward_like,
std::move_if_noexcept, std::as_const) are reference casts: the result
refers to the same object as the argument. Flow all origin levels for
this family.
Fixes #191954
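The claim that these casts are reference casts can be checked directly in standard C++. The snippet below is a small illustration, not part of the patch or its tests; `castsReferToSameObject` is a hypothetical name, and `std::forward_like` is left out because it requires C++23.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Returns true when every cast yields a reference to the original object.
inline bool castsReferToSameObject() {
  std::string s = "hello";
  std::string &&moved = std::move(s);
  std::string &&forwarded = std::forward<std::string>(s);
  std::string &&movedIfNoexcept = std::move_if_noexcept(s);
  const std::string &constView = std::as_const(s);
  // None of the casts runs a move constructor, so s is unchanged.
  assert(s == "hello" && "casts alone must not modify the object");
  return std::addressof(moved) == std::addressof(s) &&
         std::addressof(forwarded) == std::addressof(s) &&
         std::addressof(movedIfNoexcept) == std::addressof(s) &&
         std::addressof(constView) == std::addressof(s);
}
```

Because the result names the same object, anything the argument refers to (including pointers or views stored inside it) is still reachable through the cast result, which is why the analysis has to carry every origin level across the call.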
[VectorCombine] Fold deinterleave2 with smaller effective element size (#192121)
Found in real-world code where this sequence:
```
%d = llvm.vector.deinterleave2 <vscale x 16 x i32> %v
%f0 = extractvalue { <vscale x 8 x i32>, <vscale x 8 x i32> } %d, 0
%f1 = extractvalue { <vscale x 8 x i32>, <vscale x 8 x i32> } %d, 1
%low0 = and <vscale x 8 x i32> %f0, splat (i32 65535)
%low1 = shl <vscale x 8 x i32> %f1, splat (i32 16)
%merge0 = or disjoint <vscale x 8 x i32> %low0, %low1
%high0 = and <vscale x 8 x i32> %f1, splat (i32 -65536)
%high1 = lshr <vscale x 8 x i32> %f0, splat (i32 16)
%merge1 = or disjoint <vscale x 8 x i32> %high0, %high1
```
is really just doing `deinterleave2` but on `<vscale x 32 x i16>`. That
is, the same total vector size but with half the element width. So we
can turn it into:
[11 lines not shown]
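The equivalence claimed for the VectorCombine fold can be checked with a scalar model on fixed-length vectors. This is a sketch under stated assumptions, not the pass itself: `mergeOnI32` and `deinterleaveOnI16` are hypothetical names, and the reinterpretation of each `i32` as two `i16` lanes assumes little-endian lane order (low half first).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Lanes32 = std::vector<std::uint32_t>;

// Original sequence: deinterleave2 on i32 lanes, then and/shl/lshr/or merges.
inline std::pair<Lanes32, Lanes32> mergeOnI32(const Lanes32 &v) {
  assert(v.size() % 2 == 0 && "deinterleave2 needs an even lane count");
  Lanes32 merge0, merge1;
  for (std::size_t i = 0; i < v.size() / 2; ++i) {
    std::uint32_t f0 = v[2 * i];     // even lanes
    std::uint32_t f1 = v[2 * i + 1]; // odd lanes
    merge0.push_back((f0 & 0xFFFFu) | (f1 << 16));
    merge1.push_back((f1 & 0xFFFF0000u) | (f0 >> 16));
  }
  return {merge0, merge1};
}

// Folded form: view the same bits as i16 lanes, deinterleave those, and
// pack each result back into i32 lanes for comparison.
inline std::pair<Lanes32, Lanes32> deinterleaveOnI16(const Lanes32 &v) {
  assert(v.size() % 2 == 0 && "deinterleave2 needs an even lane count");
  std::vector<std::uint16_t> w;
  for (std::uint32_t x : v) {
    w.push_back(static_cast<std::uint16_t>(x & 0xFFFFu)); // low half first
    w.push_back(static_cast<std::uint16_t>(x >> 16));
  }
  std::vector<std::uint16_t> even, odd;
  for (std::size_t j = 0; j + 1 < w.size(); j += 2) {
    even.push_back(w[j]);
    odd.push_back(w[j + 1]);
  }
  auto pack = [](const std::vector<std::uint16_t> &h) {
    Lanes32 out;
    for (std::size_t i = 0; i + 1 < h.size(); i += 2)
      out.push_back(static_cast<std::uint32_t>(h[i]) |
                    (static_cast<std::uint32_t>(h[i + 1]) << 16));
    return out;
  };
  return {pack(even), pack(odd)};
}
```

Both functions return the same pair of vectors for any even-length input, which is the property the fold relies on.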
[CodeGenPrepare] Use recomputed split-branch weights. (#199822)
splitBranchCondition computes new branch weights after splitting an
and/or condition into two branches, but then passed the original weights
to createBranchWeights at each metadata update. The recomputed values
were discarded.
Pass the scaled NewTrueWeight/NewFalseWeight values when installing
metadata on both generated branches.
This bug was found by a large run of Opus 4.7 looking for bugs in LLVM.
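The bug pattern in the CodeGenPrepare fix (recompute, then install the stale values) can be shown with a toy model. Everything here is hypothetical: `scaleToUint32`, `recomputeFirstOrBranch`, `installBuggy`, and `installFixed` are illustrative names, and the recomputation formula is an assumption chosen only to make the difference visible; it is not taken from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>

using Weights = std::pair<std::uint64_t, std::uint64_t>;

// Scale both weights down together so the larger one fits in 32 bits.
inline Weights scaleToUint32(std::uint64_t t, std::uint64_t f) {
  const std::uint64_t limit = std::numeric_limits<std::uint32_t>::max();
  const std::uint64_t maxW = std::max(t, f);
  const std::uint64_t scale = maxW <= limit ? 1 : maxW / limit + 1;
  return {t / scale, f / scale};
}

// ASSUMED formula for the first branch after splitting `A || B`.
inline Weights recomputeFirstOrBranch(std::uint64_t trueW,
                                      std::uint64_t falseW) {
  assert(falseW <=
             (std::numeric_limits<std::uint64_t>::max() - trueW) / 2 &&
         "weights too large for this toy model");
  return scaleToUint32(trueW, trueW + 2 * falseW);
}

// Bug pattern: the recomputed weights are computed and then discarded.
inline Weights installBuggy(std::uint64_t trueW, std::uint64_t falseW) {
  Weights recomputed = recomputeFirstOrBranch(trueW, falseW);
  (void)recomputed;       // never used
  return {trueW, falseW}; // original weights end up in the metadata
}

// Fixed pattern: the recomputed weights are the ones installed.
inline Weights installFixed(std::uint64_t trueW, std::uint64_t falseW) {
  return recomputeFirstOrBranch(trueW, falseW);
}
```

The two functions differ only in which pair reaches the metadata, which matches the description of the fix as a one-argument change at each update site.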
[clang][deps] Disable app extensions during scanning (#200041)
Application extension contributes to the context hash, but only affects
the availability attribute on declarations. Since it cannot affect
dependencies, disable it for the scan to reduce the number of scanning
PCM variants.
[flang][FIRToMemRef] fix stride calculation for complex lowering (#200035)
**Summary**
When `fir.array_coor` targets a projected slice of a complex array (path
0 = real, 1 = imag), FIRToMemRef must not treat the result as a dense
memref.
**Bug:** The pass stopped after fir.convert to `memref<…×complex>` (or
static-shape fast path) and used default/dense strides. Loads/stores
then stepped by sizeof(complex) instead of sizeof(re)/sizeof(im).
**Fix:** For constant `%re/%im` on `complex<T>` storage:
- `fir.convert` storage to `memref<…×2×T>` and index the component (0 or
  1).
- Read layout from `fir.box_dims` on the box (even if the memref shape is
  static).
- Set each memref stride to `box_dims_byte_stride / sizeof(T)`.
Advised by Cursor
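The stride arithmetic in the FIRToMemRef fix can be illustrated outside of MLIR with `std::complex`, whose storage is guaranteed to be layout-compatible with an array of two `T`. This is a sketch of the idea only: `elementStride` and `gatherComponent` are hypothetical helpers, and the byte stride is taken to be `sizeof(std::complex<float>)`, which holds for a contiguous array.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Convert a byte stride (what fir.box_dims reports) into a stride counted
// in elements of the component type T.
inline std::size_t elementStride(std::size_t byteStride, std::size_t sizeofT) {
  assert(sizeofT != 0 && byteStride % sizeofT == 0 &&
         "byte stride must be a multiple of the component size");
  return byteStride / sizeofT;
}

// Gather the real (component 0) or imaginary (component 1) parts by viewing
// the complex storage as N x 2 floats and stepping by the element stride.
inline std::vector<float>
gatherComponent(const std::vector<std::complex<float>> &a,
                std::size_t component) {
  assert(component < 2 && "component must be 0 (real) or 1 (imag)");
  const float *flat = reinterpret_cast<const float *>(a.data());
  const std::size_t stride =
      elementStride(sizeof(std::complex<float>), sizeof(float));
  std::vector<float> out;
  out.reserve(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out.push_back(flat[i * stride + component]);
  return out;
}
```

Stepping by one `float` (a dense stride) would instead alternate between real and imaginary parts, which is the kind of misindexing a projected slice must avoid.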
[lldb] Edits and clarifications to DataFileCache comments, NFC (#199787)
I was reading through Greg Clayton's DataFileCache PR and fixed a few
small typos as I went along.
At first I also had a little trouble understanding the two types of hashes
that are calculated for a file, so I tried to write comments for
the relevant methods (in Module, ObjectFile, and DataFileCache) to be
more explicit about their role and the role of the other hashes that are
calculated. It may be more detail than necessary, but it would have been
helpful for me while reading this through.
[lldb] Keep addr for Memory Modules separate (#199810)
This change is to make DataFileCache symbol table caching work with
memory-read binary modules.
When we read a Module out of memory, we keep the address of the module
in Module's m_object_name field as a string. This is normally the name
of a file in a ranlib/static library/.a archive like the "main.o" in
"foo.a(main.o)". The address is most often seen in the "image list"
output, and is the only easy way to distinguish in that output which
binaries were read out of memory, versus found on local disk. The "name"
of the Module ends up being the combination of the FileSpec plus this
m_object_name.
Reading a binary out of memory is expensive, primarily because of
reading the symbol table. The DataFileCache feature that Greg introduced
five years ago can cache the Symbol Table for a binary locally, and when
we see the same binary loaded again in a future debug session/lldb
session, we can skip parsing the symbol table (or in the case of Memory
[26 lines not shown]
[libc] Add missing struct_mmsghdr dependency to sys_socket (#200051)
Updated libc/include/CMakeLists.txt to add
.llvm-libc-types.struct_mmsghdr to the sys_socket dependency list. This
ensures that the generated sys/socket.h correctly includes the
struct_mmsghdr.h type header.
Assisted-by: Automated tooling, human reviewed.
[AMDGPU] Implement -amdgpu-spill-cfi-saved-regs (#183149)
These spills need special CFI anyway, so implementing them directly
where CFI is emitted avoids the need to invent a mechanism to track them
from ISel.
Change-Id: If4f34abb3a8e0e46b859a7c74ade21eff58c4047
Co-authored-by: Scott Linder <scott.linder at amd.com>
Co-authored-by: Venkata Ramanaiah Nalamothu <VenkataRamanaiah.Nalamothu at amd.com>
[IR] Handle `expected` tag in switch branch weights. (#200025)
Switch branch weight metadata has an optional `expected` tag.
SwitchInstProfUpdateWrapper::getSuccessorWeight() did not handle this
tag; if it was present, it would return nullopt, effectively ignoring
the metadata.
This bug was found by a large run of Opus 4.7 looking for bugs in LLVM.
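The effect of the optional `expected` tag on operand positions can be shown with a toy model of the metadata node. This is not LLVM's API: `Operand`, `ProfNode`, `firstWeightIndex`, and `successorWeight` are hypothetical names, and the node is modelled as a flat list whose first operand is the string `branch_weights`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Toy stand-in for a metadata node: each operand is a string or a weight.
using Operand = std::variant<std::string, std::uint32_t>;
using ProfNode = std::vector<Operand>;

// Index of the first weight: 1 normally, 2 when an "expected" tag follows
// the "branch_weights" name.
inline std::optional<std::size_t> firstWeightIndex(const ProfNode &node) {
  if (node.empty())
    return std::nullopt;
  const auto *name = std::get_if<std::string>(&node[0]);
  if (!name || *name != "branch_weights")
    return std::nullopt;
  if (node.size() > 1) {
    const auto *tag = std::get_if<std::string>(&node[1]);
    if (tag && *tag == "expected")
      return std::size_t{2};
  }
  return std::size_t{1};
}

// Weight for a given successor, or nullopt if the node has no such operand.
inline std::optional<std::uint32_t> successorWeight(const ProfNode &node,
                                                    std::size_t successor) {
  const std::optional<std::size_t> first = firstWeightIndex(node);
  if (!first)
    return std::nullopt;
  const std::size_t idx = *first + successor;
  if (idx >= node.size())
    return std::nullopt;
  const auto *weight = std::get_if<std::uint32_t>(&node[idx]);
  if (!weight)
    return std::nullopt;
  assert(idx >= 1 && "weights never occupy the name slot");
  return *weight;
}
```

A reader that always assumes the first weight is at index 1 would hit the `expected` string there and give up, which is the failure mode the commit describes.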
[SeparateConstOffsetFromGEP] Set `inbounds` correctly. (#199304)
swapGEPOperand reorders the GEPs (ptr+off)+const into (ptr+const)+off.
When it does so, it needs to determine if the inner GEP is inbounds.
Previously the way it did this was to call
stripAndAccumulateInBoundsConstantOffsets on (ptr+const), and then check
if this offset was indeed in-bounds.
However, this GEP was not necessarily marked as `inbounds` itself. If it
was not, stripAndAccumulateInBoundsConstantOffsets would return 0 for
the offset (instead of `const`), in which case we'd check if
`0 < [obj width]`, which is trivially true, and then incorrectly mark
the GEP as inbounds.
This bug was found by a large run of Opus 4.7 looking for bugs in LLVM.
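The faulty reasoning in the SeparateConstOffsetFromGEP check can be reproduced with a toy model. All names here (`ToyGep`, `accumulateInBoundsOffset`, `buggyIsInBounds`, `guardedIsInBounds`) are hypothetical, and the guarded variant is only one plausible way to close the hole; the commit message does not say how the actual fix is written.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the (ptr + const) GEP.
struct ToyGep {
  bool inbounds;            // whether the GEP itself carries `inbounds`
  std::int64_t constOffset; // its constant offset in bytes
};

// Mimics the described behaviour: only inbounds GEPs are accumulated, so a
// GEP without the flag contributes an offset of 0.
inline std::int64_t accumulateInBoundsOffset(const ToyGep &gep) {
  return gep.inbounds ? gep.constOffset : 0;
}

// The check as described before the fix: for a non-inbounds GEP it compares
// 0 against the object size, which always succeeds.
inline bool buggyIsInBounds(const ToyGep &gep, std::int64_t objectSize) {
  assert(objectSize > 0 && "toy model expects a non-empty object");
  const std::int64_t offset = accumulateInBoundsOffset(gep);
  return offset >= 0 && offset < objectSize;
}

// ASSUMED correction: additionally require the flag on the GEP itself and
// compare its real constant offset.
inline bool guardedIsInBounds(const ToyGep &gep, std::int64_t objectSize) {
  assert(objectSize > 0 && "toy model expects a non-empty object");
  return gep.inbounds && gep.constOffset >= 0 && gep.constOffset < objectSize;
}
```

For a GEP without `inbounds` and an offset far past the object, the first check still answers true while the guarded one answers false.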