[CIR] Upstream support for coroutine co_yield expression (#173162)
This PR upstreams support for the co_yield expression by emitting a
cir.await op with the yield kind.
[DebugInfo] Only generate template parameters in the skeleton CU for a template function/type with simplified name (3/3) (#175879)
Currently, when generating debug info for skeleton units, all template
parameters are emitted unconditionally. To reduce debug info size, emit
parameters only for template types/functions whose names have actually
been simplified, as described in [this
RFC](https://discourse.llvm.org/t/rfc-debuginfo-selectively-generate-template-parameters-in-the-skeleton-cu/89395).
Previous patches: #175130, #175708
[MachinePipeliner] Remove cheap check in dependence analysis (#174390)
In loop-carried dependence analysis of MachinePipeliner, there is
special handling for a specific case, referred to as a "cheap check".
This check is not sound and sometimes misses dependencies. If there is
no significant performance regression, this special logic should be
deleted.
Split off from https://github.com/llvm/llvm-project/pull/135148
[X86][NewPM] Add rest of non-ported passes to X86PassRegistry (#176068)
I noticed these when writing up the pass builder. Put them in the pass
registry to make it easier to see what is not done yet for when people
start working on more porting.
[NewPM][CodeGen] Add missing non-ported pass to registry
Not sure why this did not make it in the list originally. But adding it
so that someone looking for passes to port in the registry will see it.
[AMDGPU] Limit allocation of lo128 registers for occupancy
The parent change allows allocation of lo128 VGPRs from all 4 banks.
That may result in an undesired allocation that leaves a hole of up to
128 registers, for example if v0-v127 are allocated and v128-v255 are
free.
Limit the available allocation order to the occupancy. Both hard
occupancy limits and occupancy achieved during scheduling are
considered. It is better to spill a register than to drop occupancy
in this case.
[DTLTO] Fix handling of multi-module bitcode inputs (#174624)
This change fixes two issues when processing multi-module bitcode files
in DTLTO:
1. The DTLTO archive handling code incorrectly uses
getSingleBitcodeModule(), which asserts when the bitcode file contains
more than one module.
2. The temporary file containing the contents of an input archive member
was not emitted for multi-module bitcode files. This was due to
incorrect logic for recording whether a bitcode input contains any
ThinLTO modules. In a typical multi-module bitcode file, the first
module is a ThinLTO module while a subsequent auxiliary module is
non-ThinLTO. When modules are processed in order, the auxiliary module
causes the entire bitcode file to be classified as non-ThinLTO, and the
archive-member emission logic then incorrectly skips it.
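The second issue can be pictured with a small sketch. This is an assumption about the general shape of such a bug, not the actual DTLTO code: `ModuleInfo`, `lastModuleWins`, and `anyModuleIsThinLTO` are hypothetical names. A flag that is overwritten per module ends up reflecting only the last module, while an accumulated flag answers "does the file contain any ThinLTO module".

```cpp
#include <vector>

// Hypothetical stand-in for one module inside a bitcode file.
struct ModuleInfo {
  bool isThinLTO;
};

// Overwriting pattern: the result reflects only the last module visited,
// so a trailing non-ThinLTO auxiliary module flips the answer to false.
bool lastModuleWins(const std::vector<ModuleInfo> &modules) {
  bool hasThinLTO = false;
  for (const ModuleInfo &m : modules)
    hasThinLTO = m.isThinLTO;
  return hasThinLTO;
}

// Accumulating pattern: true as soon as any module is a ThinLTO module.
bool anyModuleIsThinLTO(const std::vector<ModuleInfo> &modules) {
  bool hasThinLTO = false;
  for (const ModuleInfo &m : modules)
    hasThinLTO |= m.isThinLTO;
  return hasThinLTO;
}
```

For the typical layout described above (a ThinLTO module followed by a non-ThinLTO auxiliary module), the first function reports false and the second reports true.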
In addition, this patch adds a test that verifies that multi-module
bitcode files can be successfully linked with DTLTO. The test reproduces
[2 lines not shown]
[CIR] Make cir.alloca alignment mandatory (#172663)
Fixed a crash in `CIRToLLVMAllocaOpLowering` where `cir.alloca`
operations without an explicit alignment attribute caused failures.
Modified the ODS definition of `cir.alloca` to use
`ConfinedAttr<I64Attr, [IntMinValue<0>]>`. This ensures the attribute is
always present.
Added a regression test in `clang/test/CIR/Lowering/alloca.cir`.
---------
Co-authored-by: Sirui Mu <msrlancern at gmail.com>
[MC/DC] Create dedicated MCDCCondBitmapAddr for each Decision (#125411)
MCDCCondBitmapAddr is moved from `CodeGenFunction` into `MCDCState` and
created for each Decision.
In `maybeCreateMCDCCondBitmap`, allocate bitmaps for all valid Decisions
and emit them ordered by ID to prevent nondeterminism.
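A minimal sketch of the ordering idea, under stated assumptions: `DecisionInfo` and `emissionOrder` are hypothetical, and the hash map merely stands in for any container whose iteration order is not stable. The source does not say which container Clang uses.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record for one MC/DC Decision.
struct DecisionInfo {
  unsigned id;
  bool valid;
};

// Collect the IDs of valid Decisions and sort them, so the emission order
// does not depend on the iteration order of the underlying container.
std::vector<unsigned>
emissionOrder(const std::unordered_map<std::string, DecisionInfo> &decisions) {
  std::vector<unsigned> ids;
  for (const auto &entry : decisions)
    if (entry.second.valid)
      ids.push_back(entry.second.id);
  std::sort(ids.begin(), ids.end());
  return ids;
}
```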
[X86] Lower scalar llvm.clmul intrinsics to PCLMULQDQ (#175189) (#175216)
Add support for lowering scalar llvm.clmul intrinsics (i8/i16/i32/i64)
to the PCLMULQDQ hardware instruction on X86 targets with the PCLMUL
feature, instead of using the default software expansion.
The lowering:
- Extends smaller types to the target's native width (i64 on x86-64, i32
on i686)
- Uses SCALAR_TO_VECTOR to create vectors (v2i64 on x86-64, v4i32 with
bitcast to v2i64 on i686)
- Performs X86ISD::PCLMULQDQ with immediate 0x00
- Extracts the result and truncates back to the original type
i8/i16/i32 CLMUL is enabled on both 32-bit and 64-bit targets. i64
CLMUL/CLMULH is only enabled on 64-bit targets.
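As a reference for what the instruction computes, here is a plain software sketch of carry-less multiplication. It is not the default expansion used by LLVM; the function names are illustrative, and the high-half helper assumes CLMULH means the upper 64 bits of the 128-bit carry-less product. The 8-bit wrapper mirrors the widen, multiply, truncate shape of the lowering.

```cpp
#include <cstdint>

// Low 64 bits of the carry-less product: XOR together a copy of `a`
// shifted left by i for every set bit i of `b` (no carries propagate).
std::uint64_t clmulLow64(std::uint64_t a, std::uint64_t b) {
  std::uint64_t result = 0;
  for (unsigned i = 0; i < 64; ++i)
    if ((b >> i) & 1)
      result ^= a << i;
  return result;
}

// Upper 64 bits of the 128-bit carry-less product (assumed CLMULH meaning).
// Bit 0 of `b` cannot contribute to the upper half, so start at i = 1.
std::uint64_t clmulHigh64(std::uint64_t a, std::uint64_t b) {
  std::uint64_t result = 0;
  for (unsigned i = 1; i < 64; ++i)
    if ((b >> i) & 1)
      result ^= a >> (64 - i);
  return result;
}

// Narrow types: widen, multiply, then truncate back to the original width.
// The low 8 result bits depend only on the low 8 bits of each input.
std::uint8_t clmul8(std::uint8_t a, std::uint8_t b) {
  return static_cast<std::uint8_t>(clmulLow64(a, b));
}
```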
Also adds ISD::CLMULH i64 support by extracting the upper element from
[2 lines not shown]
[mlir][Tensor] Add rank-reducing slice in generatedSlices (#174248)
When `replaceExtractSliceWithTiledProducer` creates a rank-reducing
slice to handle type mismatches, it should be tracked in
`generatedSlices` so downstream cleanup patterns (like IREE's
FoldExtractSliceOfBroadcast) can process it.
This PR also fixes an infinite loop in getUntiledProducerFromSliceSource
where adding the slice to generatedSlices caused the fusion worklist to
repeatedly try to re-fuse producers already inside the innermost loop;
the fix skips producers that are already inside the innermost loop via
an isProperAncestor check.
Added a lit test (@fuse_through_rank_reducing_slice) demonstrating
correct fusion through rank-reducing slices. Note that demonstrating the
generatedSlices tracking benefit requires a cleanup pattern
(SwapExtractSliceWithFillPatterns) to consume the slice; IREE's full CI
suite (iree-org/iree#23012) validates this works correctly in practice
with patterns like FoldExtractSliceOfBroadcast.
[3 lines not shown]
[LLDB][NativePDB] Introduce PdbAstBuilderClang (#175840)
This changes `PdbAstBuilder` to a language-neutral abstract interface
and moves all of its functionality to the `PdbAstBuilderClang` derived
class.
All Clang-specific methods with external callers are now public methods
on `PdbAstBuilderClang`. `TypeSystemClang` and `UdtRecordCompleter` use
`PdbAstBuilderClang` directly.
Did my best to clean up includes and unused methods.
RFC for context:
https://discourse.llvm.org/t/rfc-lldb-make-pdbastbuilder-language-agnostic/89117
[AMDGPU] Allow allocation of lo128 registers from all banks
We can encode 16-bit operands in a short form for VGPRs [0..127].
When we have 1K registers available we can in fact allocate 4
times more from all 4 banks. That, however, requires an allocatable
class for these operands. While for most instructions this results in
the longer VOP3 form, for V_FMAAMK/FMADAK_F16 it simply prohibits the
encoding because these do not have VOP3 forms.
A straightforward solution would be to create a register class
with all registers having bit 8 of the encoding zero, i.e. to
create a register class with holes punched in it: [0-127, 256-383,
512-639, 768-895]. LLVM, however, does not like register classes
with punched holes when they also have subregisters. The
cross-product of all classes explodes, and some combinations of a 'class
having a common subreg with another' become impossible. Just
doing so explodes our register info to 4+Gb, which is also uncompilable.
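The holed register set can be reproduced with a small sketch. It assumes "bit 8" is counted from one, i.e. the mask `0x80` on the VGPR index; that reading is an assumption chosen because it yields exactly the ranges listed above. `isLo128` and `lo128Ranges` are hypothetical helpers, not backend code.

```cpp
#include <utility>
#include <vector>

// Assumption: "bit 8 of the encoding" is the mask 0x80 on the VGPR index.
bool isLo128(unsigned vgprIndex) { return (vgprIndex & 0x80u) == 0; }

// Collapse the qualifying indices below numVGPRs into inclusive ranges.
std::vector<std::pair<unsigned, unsigned>> lo128Ranges(unsigned numVGPRs) {
  std::vector<std::pair<unsigned, unsigned>> ranges;
  unsigned i = 0;
  while (i < numVGPRs) {
    if (!isLo128(i)) {
      ++i;
      continue;
    }
    unsigned start = i;
    while (i < numVGPRs && isLo128(i))
      ++i;
    ranges.emplace_back(start, i - 1);
  }
  return ranges;
}
```

With 1024 registers this yields four ranges of 128 registers each, one per bank.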
The solution proposed is to define _lo128 RC with contiguous 896
[17 lines not shown]