[clang][CUDA] Add new CUDA and PTX versions (#197992)
PTX 9.1 and 9.2 already exist in LLVM, so this change just plumbs these
versions into clang to allow using newer instructions when we're
compiling with cuda-13.x.
[flang-rt][NFC] Split up expensive .cpp files into multiple files (#198111)
Summary:
This PR simply takes the existing `.cpp` files for the heaviest
implementations and separates them logically, typically between real,
integer, and complex types. The existing `.cpp` file is turned into a
`.h` file and we create new `.cpp` files that *only* contain the old
portion that used `RTDEF`. This allows for far more build system
parallelism. Because of static library linking semantics, it also means
that if the user only uses integer routines, the linker will not include
the unused complex / real routines in the final executable. All around,
this is a good practice for runtime libraries. I verified that all
`_Fortran` entrypoint routines are still present; the port was strictly
mechanical.
The result of all of this is that I can now build `flang-rt` in ~10s
with all threads instead of ~50s due to the most expensive files being
split into parallelizable chunks.
[llvm-dwarfdump] Decode the virtual register names from the dwarf register numbers (#192353)
Backends like `NVPTX` encode virtual register names as the DWARF
register number: the ASCII bytes of the name are concatenated into a
`uint64_t`.
This change adds fallback logic to decode these dwarf register numbers
into strings.
This improves the readability of llvm-dwarfdump output.
For example, before the change:
` DW_AT_location (DW_OP_regx 0x25726431)`
After the change:
` DW_AT_location (DW_OP_regx %rd1)`
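The fallback decoding described in this commit can be sketched as follows. This is a minimal illustration and not the code from the patch: the name `decodeVirtualRegisterName`, the most-significant-byte-first order (inferred from the `0x25726431` example), and the choice to return an empty string for non-printable bytes are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not the llvm-dwarfdump implementation): treat the
// bytes of a DWARF register number as ASCII, most significant byte first.
// Returns an empty string if the value does not look like a packed name.
std::string decodeVirtualRegisterName(uint64_t DwarfRegNum) {
  std::string Name;
  for (int Shift = 56; Shift >= 0; Shift -= 8) {
    unsigned char Byte =
        static_cast<unsigned char>((DwarfRegNum >> Shift) & 0xFF);
    if (Byte == 0 && Name.empty())
      continue; // Skip the leading zero bytes of short names.
    if (Byte < 0x20 || Byte > 0x7E)
      return std::string(); // Not printable ASCII: no name to recover.
    Name.push_back(static_cast<char>(Byte));
  }
  return Name;
}

// The example from the commit message: 0x25 '%', 0x72 'r', 0x64 'd', 0x31 '1'.
inline void checkCommitMessageExample() {
  assert(decodeVirtualRegisterName(0x25726431) == "%rd1");
}
```

A real register number such as `1` contains a non-printable byte, so this sketch returns an empty string for it and a caller would print the number as usual.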
[SLP] Generate StoreChainContext for all chains for a given base pointer first (#193616)
Rather than generating the chains for a `RelatedStoreInsts` worth of
stores at a time and then vectorizing that group, create the
StoreChainContext for all chains in all `RelatedStoreInsts`, and then
vectorize at the end.
This will allow easier integration with runtime strided stores, since
those will exist across `RelatedStoreInsts`.
Bigger VF chains are now attempted before smaller VF chains across all
`RelatedStoreInsts` groups for a base value type. As a result, the
vectorization of overlapping chains may change in behavior, because the
relative order in which we attempt to vectorize them may have changed
(longer before shorter).
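The ordering change can be pictured with a small sketch. It only illustrates "collect all chains first, then try the longest first" and is not the SLP vectorizer's code: `ChainSketch`, its fields, and `orderLongestFirst` are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a store chain; not the real StoreChainContext.
struct ChainSketch {
  unsigned GroupId; // Which RelatedStoreInsts-like group the chain came from.
  unsigned VF;      // Vector factor (chain length) to attempt.
};

// Order chains from every group together so that larger VFs are attempted
// first. stable_sort keeps the original order among chains with equal VF.
std::vector<ChainSketch> orderLongestFirst(std::vector<ChainSketch> Chains) {
  std::stable_sort(Chains.begin(), Chains.end(),
                   [](const ChainSketch &L, const ChainSketch &R) {
                     return L.VF > R.VF;
                   });
  return Chains;
}
```

With chains from two groups, a VF-8 chain from the second group is now tried before a VF-4 chain from the first, which is the kind of reordering that can change how overlapping chains get vectorized.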
[clang][AArch64][NFC] Remove redundant bitcasts in builtin codegen (#196988)
Update CodeGen for the ACLE AdvSIMD “extract one element from vector”
builtins to avoid emitting unnecessary bitcasts:
* https://arm-software.github.io/acle/neon_intrinsics/advsimd.html#extract-one-element-from-vector
The existing tests continue to cover the generated IR and require no
updates, confirming that this is an NFC cleanup.
This is similar to #195825.
[MLIR][test] Add lit coverage for cf.br/cond_br/switch under narrow-type emulation (#198053)
Wires `cf::populateCFStructuralTypeConversionsAndLegality` into the
in-tree `TestEmulateNarrowType` pass and adds lit coverage that
exercises `cf.br` / `cf.cond_br` / `cf.switch` operand and successor
block-argument rewriting when emulating sub-byte element types:
* `memref<NxiW>` carried across `cf.br` / `cf.cond_br` / `cf.switch`.
* Sub-byte integer scalars across `cf.br`.
* Sub-byte integer vectors across `cf.br`.
This PR initially added thin wrapper functions
(`memref::populateMemRefNarrowTypeEmulationCFPatterns`,
`vector::populateVectorNarrowTypeEmulationCFPatterns`) over
`cf::populateCFStructuralTypeConversionsAndLegality`. Per review
feedback those wrappers were redundant, so callers (including the
in-tree test pass) now call
`cf::populateCFStructuralTypeConversionsAndLegality`
directly. Net contribution is the test-pass plumbing and the new lit
tests demonstrating that the existing cf structural type conversion
correctly handles narrow-type-emulated values.
[Instrumentor] Provide source location to runtime calls
To allow runtime calls to inspect the source location of the
instrumentation opportunity, we encode it in the module. This allows the
use in all environments, e.g., on GPUs, which might lack runtime dwarf
reading or libunwind. The stub printer is extended to make handling
the encoded location information easy.
[AArch64][GlobalISel] Add pre-legalizer combines for AVGFLOOR and AVGCEIL (#192866)
This patch adds GlobalISel pre-legalizer combines to pattern-match and
optimize average operations, bringing GlobalISel on par with
SelectionDAG.
Specifically, it matches:
- `(a + b) >> 1` into `G_UAVGFLOOR` / `G_SAVGFLOOR`
- `(a + b + 1) >> 1` into `G_UAVGCEIL` / `G_SAVGCEIL`
Support is included for both scalar and vector types, correctly handling
constants and splat vectors via `isOneOrOneSplat()`. This builds upon
the generic opcodes introduced for AArch64 intrinsics lowering and
enables optimal emission of Neon instructions (e.g., `urhadd`, `shadd`)
directly from generic IR.
Fixes #118083
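The arithmetic behind the two patterns above can be sketched in plain C++. This assumes the averaging opcodes compute the sum as if in a wider type, as the corresponding SelectionDAG nodes are documented to do; the function names are illustrative and are not LLVM APIs.

```cpp
#include <cstdint>

// Sketch only: model the unsigned averages on 8-bit values by widening to
// 16 bits so that the intermediate sum cannot wrap.
constexpr uint8_t uavgfloor(uint8_t A, uint8_t B) {
  return static_cast<uint8_t>((static_cast<uint16_t>(A) + B) >> 1);
}

constexpr uint8_t uavgceil(uint8_t A, uint8_t B) {
  return static_cast<uint8_t>((static_cast<uint16_t>(A) + B + 1) >> 1);
}

// A same-width add wraps before the shift and loses the top bit.
constexpr uint8_t naiveFloor(uint8_t A, uint8_t B) {
  return static_cast<uint8_t>(static_cast<uint8_t>(A + B) >> 1);
}

static_assert(uavgfloor(3, 4) == 3, "floor average rounds down");
static_assert(uavgceil(3, 4) == 4, "ceil average rounds up");
static_assert(uavgfloor(255, 255) == 255, "no wraparound when widened");
static_assert(naiveFloor(255, 255) == 127, "same-width add wraps");
```

The last two assertions show why a plain narrow `(a + b) >> 1` is not equivalent in general: a matcher has to know the add does not wrap, or see it performed in a wider type, before a dedicated averaging instruction is a safe replacement.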
[Instrumentor] Improve stub printer (for C/C++ and value packs)
The stub printer now emits a helper header to deal with value packs (in
C and C++). We also make the files C/C++ compatible and use the proper
format strings for int32_t and int64_t.
Make sure optional components are excluded (#187824)
Extends a fix from
https://github.com/llvm/llvm-project/commit/b1e92f8def98c5e34fdb3b4c18ac16d65fb613a2
to examples and docs, both of which may be missing but were
included unconditionally.
This fixes an issue where the Chapel team vendors LLVM (and subsequently
deletes directories like docs and examples for smaller file sizes), but
the build fails if those directories are missing.
Signed-off-by: Jade Abraham <jademabraham17 at gmail.com>
[Clang] Improve __block attribute coverage for ivars and static variables (#198167)
As discussed in #194856, we need to improve the diagnostic coverage for
the `__block` attribute.
The modifications I made are as follows:
1. added diagnostic definitions
2. modified diagnostic logic
3. added test cases
4. modified the affected test cases
close #197213
[Github][CI] Don't build analysis targets when no relevant projects present (#196882)
Fixes the error described in
[link](https://github.com/llvm/llvm-project/pull/194442#issuecomment-4330108752),
where the `clang-tools-extra` project was not among the projects computed
to build but `genconfusable` (part of `clang-tools-extra`) was built anyway.
Revert "Add clang warning if fp exception functions are called without appropriate flags/pragmas" (#198341)
Reverts llvm/llvm-project#187860
Reason: this breaks compiling several different versions of libc, and is
also issuing diagnostics for platforms that are incompatible (see
https://github.com/llvm/llvm-project/pull/187860 for details).
Revert for now until we resolve how to move forward and reland.
[mlir][AMDGPU] Move memory access op folding to memref interfaces (#197310)
This PR implements IndexedAccessOpInterface and
IndexedMemCopyOpInterface for relevant ops in the AMDGPU dialect,
removing the custom folding pass we used to have now that there are
interfaces for this sort of thing.
As a result:
- The in-bounds semantics of various AMDGPU ops have been clarified
- Interface methods to enable oob checks on DMA operations have been
added (to prevent accidental `disjoint`ing and the like)
- Said memref rewrite patterns have been hardened to allow for mixed
tensor/memref semantics.
- Helpers for detecting memory spaces were factored out of
`AMDGPUOps.cpp` so that they could be re-used in the interface
implementations.
# Breaking changes / migration
[4 lines not shown]
[mlir][GPU] Extend gpu.barrier with scope and named-barrier support (#195692)
This commit adds two features to gpu.barrier that are supported on
targets like recent AMDGPU chips, Nvidia's hardware, and SPIR-V.
The first of these is named barriers, which allow creating a barrier
object that is initialized with the number of subgroups that must arrive
at it before those subgroups are released. These are represented in MLIR
with a new `!gpu.named_barrier` type and created by the
`gpu.initialized_named_barrier` operation. These named barriers then
become arguments to `gpu.barrier`.
The other change is adding a "scope" enum and using it to specify the
execution scope of barriers. This allows for representing cluster- and
subgroup-wide barriers (the latter exists on AMDGPU and Nvidia, and
while I suspect Nvidia has cluster-scope barriers, I didn't go looking)
and allows us to fully lower to SPIR-V's OpControlBarrier.
While these are two different features, I figured I'd land them in one
[4 lines not shown]
[FIRToMemRef] Fix fir.convert insertion inside omp.wsloop (#197653)
When replaceFIRMemrefs inserted a fir.convert before an op inside a
LoopWrapperInterface region (e.g. omp.simd inside omp.wsloop), it
violated the single-nested-op invariant, producing a verifier error. Fix
by walking up the LoopWrapperInterface parent chain and inserting before
the outermost wrapper instead.
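The parent-chain walk can be sketched without MLIR. `OpSketch` and `findInsertionAnchor` are hypothetical stand-ins for an operation and the patch's logic; the real code works on MLIR operations and `LoopWrapperInterface`.

```cpp
// Hypothetical stand-in for an operation: a parent link plus a flag saying
// whether the op is a loop wrapper (like omp.wsloop or omp.simd).
struct OpSketch {
  const OpSketch *Parent;
  bool IsLoopWrapper;
};

// Starting from the op we wanted to insert before, climb while the parent
// is a loop wrapper. The result is the outermost enclosing wrapper, or the
// op itself when it is not nested in any wrapper.
const OpSketch *findInsertionAnchor(const OpSketch *Op) {
  const OpSketch *Anchor = Op;
  while (Anchor->Parent && Anchor->Parent->IsLoopWrapper)
    Anchor = Anchor->Parent;
  return Anchor;
}
```

For an `omp.simd` nested in an `omp.wsloop`, the anchor is the `omp.wsloop`, so the new op lands outside both wrappers and each wrapper region still holds a single nested op.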
Co-authored-by: Claude Sonnet 4.6 <noreply at anthropic.com>
[Driver] Uniform handling of invalid rtlib across drivers (#198219)
This is mostly an NFC except for a different diagnostic being emitted.
The goal is to unify validation and handling of invalid rtlib values
across different drivers to simplify supporting more -rtlib= values in
the future.
Add noreturn call count to FunctionPropertiesAnalysis pass (#198322)
Adds this metric to visualize how many noreturn functions there are,
with the idea of analyzing their relationship with unreachable
instructions.
[clang][deps] Move `ModuleDepCollectorPP` to .cpp file (#197964)
This PR moves the `ModuleDepCollectorPP` type into the .cpp file. It's
an implementation detail that the header doesn't need to expose.
[AMDGPU][GlobalISel] Remove dependency on legal ruleset (#197371)
This fills in always legal rules, to remove the dependency on the legacy
ruleset. This is not guaranteed to be all the rules, just the ones that
appear in tests.