[Instrumentor] Add Alloca and Function support; stack usage example
This adds support for alloca instrumentation and function pre/post
instrumentation. Alloca support follows load/store support directly.
Functions require special care to determine the insertion points.
Together these allow profiling the stack high-water mark; see InstrumentorStackUsage.cpp.
[Instrumentor] Use the pass builder's FileSystem for reading files
In the IO sandbox, the old read calls caused CI failures. This change uses the PassBuilder's FileSystem, the same way other passes read files from disk during CI.
[clang][NFC] Mark CWG941 as implemented and add a test (#197202)
[CWG941](https://wg21.link/cwg941) allowed specializing deleted function
templates. Clang accepted this between 2.7 and 2.9, regressed and
started emitting redefinition errors between 3.0 and 3.8, then went back
to accepting in 3.9: https://godbolt.org/z/GKnf9je7j. I've marked it as
implemented since 3.9.
[RegisterScavenging] Respect early-clobber defs when scavenging registers (#197120)
When scavenging registers backwards for virtual registers introduced
during frame index elimination, the register scavenger was ignoring
early-clobber constraints on the instruction using the scavenged
register. This could lead to assigning a virtual register to a physical
register marked as early-clobber output, violating the constraint that
early-clobber outputs cannot overlap with inputs.
This change inspects `RestoreAfter` to determine whether the scavenged register will be used by the instruction pointed at by MBBI, and if so, removes any such registers from the scavengeable set.
This also adds a test checking that such early-clobber defs are respected where they previously were not.
co-authored-by: @uweigand
Co-authored-by: Matt Arsenault <arsenm2 at gmail.com>
[SLP] Do not account for scalable vectorized users when estimating GEP costs
We should not try to widen the scalable users of GEPs: they are not vectorized, and scalable vector types cannot be widened.
Fixes #197132
Pull Request: https://github.com/llvm/llvm-project/pull/197301
[flang][cuda] Place box value kernel args in managed memory (#197116)
Example:
```fortran
type deviceArray
integer, allocatable, dimension(:,:), device :: Arr
end type deviceArray
type(deviceArray), allocatable, dimension(:) :: DA
allocate(DA(2))
allocate(DA(1)%Arr(32,32))
call mykernel<<<1,32>>>(DA(1)%Arr, 32) ! cudaErrorIllegalAddress
```
In this code, `DA(1)%Arr` is a device allocatable component inside a
managed derived type. The compiler loads the descriptor, reboxes it on
the host stack, and passes it to the kernel. Since `!fir.box` is lowered
to a pointer in LLVM IR, the kernel receives a host-stack pointer it
cannot dereference — causing `cudaErrorIllegalAddress`.
[11 lines not shown]
Revert "[CodeGen] Use byte offsets and ptradd in ShadowStackGCLowering" (#197297)
Reverts llvm/llvm-project#178436. I need to update the tests that I
added for that PR.
[RISCV][P-ext] Add initial 64-bit support for RV32. (#197093)
Most operations are set to expand. A few operations that were easy to
support using isel patterns have been added. concat_vectors and
extract_subvector are supported in order to allow type legalization to
split 64-bit vectors into 32-bit vectors around the supported
operations.
Loads and stores are custom split into two i32 scalars or two v4i8/v2i16
vectors.
I've added new opcodes to build and split vectors into 2 GPRs at function arguments and returns. These are similar to the BuildPairF64 and SplitF64 nodes we use for the RV32D soft-float ABI. Long term we might want to use concat_vectors/build_vector and extract_subvector/extract_vector_elt.
[CodeGen] Use byte offsets and ptradd in ShadowStackGCLowering (#178436)
Replace typed struct GEPs with byte array allocation and ptradd
operations:
1. Track root offsets as byte offsets instead of building typed struct.
2. Use `ComputeFrameLayout` to compute byte offsets based on DataLayout,
properly accounting for each root's size and alignment.
3. Allocate frame as `[FrameSize x i8]` byte array instead of typed
struct.
4. Replace all CreateGEP operations with CreatePtrAdd using computed
offsets.
5. Frame layout unchanged: `[Next ptr | Map ptr | Root 0 | Root 1 | ...
| Root N]` where each root is placed at its computed aligned offset.
6. Zero out padding between roots with memset for deterministic frame
contents for GC.
Benefits:
- Removes dependency on `getAllocatedType` for building frame struct
[7 lines not shown]
[mlir][AMDGPU] Canonicalize masks on global_load_async_to_lds (#197280)
If the mask is always true, remove the mask operand (some patterns key off the absence of a mask operand to know when they can be more aggressive). If the mask is always false, delete the op entirely, as it won't write anything.
AI: I described the patterns, Codex 5.5 wrote them
[clang][NFC] Mark CWG730 as implemented and add a test (#197186)
[CWG730](https://wg21.link/cwg730) clarifies that it's allowed to
specialize templates that are members of a non-template class. Clang
implements this since 2.7: https://godbolt.org/z/bWzb766rz
[SLP] Disable reused reductions in revec mode for vector scalars
Reused reductions may require some special processing, but currently they crash the compiler. Disable reused reductions for vector scalars in revec mode to fix the crash.
Fixes #196914
Pull Request: https://github.com/llvm/llvm-project/pull/197291
[clang] use QualType addrspace when making an alloca (#181390)
Instead of assuming that QualType is in default addrspace (or
compatible with it), actually use the addrspace declared by the
frontend. That removes needless dueling addrspacecast calls and
associated IR noise. Any callers that intend to discard the attributes
of the type (e.g. because they are casting an rvalue through memory)
need to now be explicit about that (e.g. by calling getUnqualifiedType).
This is part of a commit sequence aimed at letting WASM have distinct pointer types for stack memory and local memory (attempting to emit an addrspacecast between the two is invalid).
Assisted-By: Claude Sonnet 4.5 <noreply at anthropic.com>
[flang][cuda] Fix CUDA generic matching with omitted optional args (#197275)
Skip omitted optional arguments when computing CUDA address-space
matching distances, so -gpu=unified overload resolution does not compare
expanded dummy-argument lists of different sizes. Adds a regression test
covering a unified-memory overload with optional extras.
[SimplifyCFG] correct and move debug info for mergeConditionalStoreToAddress (#180789)
Previously, a combination of TryToSimplifyUncondBranchFromEmptyBlock
and SpeculatedStoreValue was changing the separate conditional stores
into a store of one value, which was then being hoisted to a
non-conditional store of that one value (and a DCE of the other). This
makes all linked stores use the new value, which is still
unconditionally correct. It isn't easy for
TryToSimplifyUncondBranchFromEmptyBlock to otherwise guess why the
value is different and try to recover which one is correct when doing
the conditional update. The end result was that the debug info might carry the wrong value. This change instead updates the debug info at the same time, reflecting that the merged store is equivalent on both paths. Later passes then don't need to reconstruct how the separate stores mapped back to the new IR, since the debug info now contains correct information for either branch taken.
And additionally, without `combineMetadataForCSE`, it was dropping the
[8 lines not shown]
[scudo] Move MemMap tests to map_test.cpp
The tests VerifyGetResidentPages, VerifyReleasePagesToOS, and Zeros test
MemMapT functionality and fit better in map_test.cpp where other MemMapT
tests reside.
[Support] Always scale InstructionCost::Value (#178962)
Allows for fractional InstructionCost values up to a given granularity with little overhead. This will allow more accurate division results and support finer-grained TTI costing.
Before:
InstructionCost(2) / 4 = 0
After (with ScalingFactor 4):
InstructionCost(2) / 4 = 1 / 2
Also, there is a decrease in the maximum value of InstructionCost, as
the largest value is now `std::numeric_limits<CostType>::max() /
ScalingFactor`.
Addresses #174429