[AMDGPU] SIFoldOperands: constant-fold S_ADD/S_SUB with immediate operands (#198410)
Extend SIFoldOperands::tryConstantFoldOp to recognise three patterns:
* ADD/SUB(imm, imm) -> S_MOV_B32 (LHS +/- RHS)
* ADD x, 0 -> COPY x (also `0 + x`)
* SUB x, 0 -> COPY x (SUB is not commutable, so `0 - x` is not folded)
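The three folds listed above can be modelled in a few lines of Python. This is only an illustrative sketch, not the pass itself: `fold_scalar_add_sub`, the `"ADD"`/`"SUB"` opcode strings and the tuple results are hypothetical names, immediates are modelled as Python ints and registers as strings, and the sketch assumes 32-bit wraparound and ignores any carry-out.

```python
MASK32 = 0xFFFFFFFF  # assumption: results wrap to 32 bits


def fold_scalar_add_sub(opcode, lhs, rhs):
    """Return a (new_opcode, operand) tuple for a foldable op, else None.

    Hypothetical model: ints are immediates, strings are register names.
    """
    lhs_imm = isinstance(lhs, int)
    rhs_imm = isinstance(rhs, int)

    # ADD/SUB(imm, imm) -> S_MOV_B32 of the folded constant.
    if lhs_imm and rhs_imm:
        value = lhs + rhs if opcode == "ADD" else lhs - rhs
        return ("S_MOV_B32", value & MASK32)

    # ADD x, 0 and SUB x, 0 -> COPY x.
    if rhs_imm and rhs == 0:
        return ("COPY", lhs)

    # Only ADD is commutable, so 0 + x folds but 0 - x does not.
    if opcode == "ADD" and lhs_imm and lhs == 0:
        return ("COPY", rhs)

    return None
```

Note that `fold_scalar_add_sub("SUB", 0, "s0")` returns `None`, which mirrors the "SUB is not commutable" remark.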
Assisted-by: Claude Opus 4.7
[X86] Fold splat AND on VGF2P8AFFINEQB source (#193364)
Given that each row within `vgf2p8affineqb`'s matrix controls which
source bits are selected, zeroing the same bit within all rows treats
that corresponding source bit like it is zero. This means an AND of the
input by any splatted 8-bit value can be folded into the matrix. This
patch:
- Can eliminate a constant and/or reduce the instruction count from 2
to 1.
- Only applies when the matrix is constant, ensuring that it can't
increase the dependency chain.
- Doesn't apply if the AND is multi-use while the splat isn't constant,
preventing additional operations.
- Works with both constant 8-bit splats and scalar values that were
splatted to a vector.
- Includes test coverage for positive cases (by constants, variable
scalars, non-zero immediates) and negative cases (multi-use, larger
splats, variable matrices).
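The reasoning behind the fold can be checked with a small bit-level model. The sketch below is an assumption-laden illustration rather than the DAG combine: `gf2p8affine_byte` models a single byte lane using my reading of the instruction's documented pseudocode (result bit i is the parity of matrix byte 7 - i ANDed with the source byte, XORed with bit i of the immediate), and `fold_and_into_matrix` is a hypothetical helper name. The fold itself does not depend on the row ordering.

```python
def parity(value):
    """Return 1 if value has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


def gf2p8affine_byte(matrix_rows, x, imm8=0):
    """Model one byte lane; matrix_rows[j] is byte j of the matrix qword."""
    out = 0
    for i in range(8):
        bit = parity(matrix_rows[7 - i] & x) ^ ((imm8 >> i) & 1)
        out |= bit << i
    return out


def fold_and_into_matrix(matrix_rows, mask):
    """Clear the same bit positions in every row (hypothetical helper)."""
    return [row & mask for row in matrix_rows]


def fold_is_equivalent(matrix_rows, mask, imm8=0):
    """Check affine(M, x & mask) == affine(M & splat(mask), x) for all x."""
    folded = fold_and_into_matrix(matrix_rows, mask)
    return all(
        gf2p8affine_byte(matrix_rows, x & mask, imm8)
        == gf2p8affine_byte(folded, x, imm8)
        for x in range(256)
    )
```

The equivalence holds because each output bit only depends on `row & x`, and `row & (x & mask)` equals `(row & mask) & x`.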
Fixes #191325
[libc] prefer *at syscalls in sys/stat wrappers (#197940)
- This change flips the #ifdef order to prefer the *at syscalls over
the normal ones.
- On modern architectures, *at system calls are preferred over the
normal system calls because of safety issues.
- Checking for the *at system calls first ensures better
compatibility with modern systems.
- The normal syscalls move to the #else or #elif branches to keep
support for older systems.
- Follows up on merged PR (#195792) and issue (#195620).
---------
Signed-off-by: udaykiriti <udaykiriti624 at gmail.com>
Co-authored-by: Jeff Bailey <jbailey at raspberryginger.com>
[X86] Lower vector i8 ashr-by-1 using pavgb (#198487)
For vector i8 arithmetic shift right by 1, the current lowering produces
a 5-instruction sequence (psrlw + pand + xor + psubb plus a constant
load) with a 4-deep dependency chain.
This patch uses the identity
ashr(x, 1) == avgceilu(x, -1) ^ (~x & 0x80)
to lower to ISD::AVGCEILU + a short fixup, producing 4 instructions on
SSE/AVX/AVX2 and 3 on AVX-512BW (after vpternlogd fusion of the AND/XOR
pair), with two parallel dependency chains instead of one long one.
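The identity can also be checked exhaustively for a single i8 lane, independently of the Alive2 proof. The following Python sketch is only a model of the arithmetic; the helper names `avgceilu8`, `ashr1_reference` and `ashr1_via_avg` are made up for illustration and bytes are represented as ints in the range 0..255.

```python
def avgceilu8(a, b):
    """Unsigned rounding-up average of two bytes, as pavgb computes it."""
    return (a + b + 1) >> 1


def ashr1_reference(x):
    """Arithmetic shift right by 1 of a byte, via a signed interpretation."""
    signed = x - 256 if x & 0x80 else x
    return (signed >> 1) & 0xFF


def ashr1_via_avg(x):
    """ashr(x, 1) == avgceilu(x, -1) ^ (~x & 0x80), with -1 == 0xFF."""
    return avgceilu8(x, 0xFF) ^ (~x & 0x80)


def identity_holds():
    """Compare both forms over all 256 byte values."""
    return all(ashr1_via_avg(x) == ashr1_reference(x) for x in range(256))
```

Averaging with 0xFF yields `0x80 | (x >> 1)`, so the top bit is always set; the XOR with `~x & 0x80` clears it again exactly when the input was non-negative.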
The freeze on R is required because the target reads it twice, matching
the pattern of the existing `shl R, 1 -> add R, R` case in
LowerShiftByScalarImmediate.
Alive2 proof: https://alive2.llvm.org/ce/z/LbXPhE
Fixes #198061
[flang][OpenMP] Skip declare simd lowering for interface bodies (#197010)
When DECLARE SIMD appears in the specification part of an interface
body, the PFT records the directive as an evaluation of the enclosing
program unit rather than of the interface body's subprogram. Its clause
operands (linear/aligned/uniform) reference dummy arguments local to the
interface body, which have no address in the enclosing scope, causing a
crash.
Detect the mismatch by comparing the program unit containing the
directive with the procedure currently being lowered, and skip op
emission when they differ.
This handles both explicit declare simd(proc-name) and implicit forms in
any enclosing context.
Fixes #192581
[Flang] [OpenMP] atomic compare (#184761)
Add support for `omp atomic compare` in flang.
Multiple clauses, like capture with compare, are not supported.
An issue for this was raised earlier at
[181116](https://github.com/llvm/llvm-project/issues/181116)
---------
Co-authored-by: Sunil Kuravinakop <kuravina at pe31.hpc.amslabs.hpecorp.net>
[RegisterCoalescer] Don't remat trivial defs without a size benefit
isAsCheapAsAMove doesn't imply "one machine instruction". AArch64 marks
multi-instruction pseudos cheap when their fused latency matches a real
move (MOVaddr = adrp+add, MOVi64imm = MOVZ+MOVK). The trivial remat
duplicates such defs at every COPY use.
[LLVM] Add a function to reset the opt bisector (#197723)
For daemonized testing, we need to be able to reset the global opt
bisector between test runs. This PR just adds a small function to the
OptPassGate class to reset its state.
[lldb] Fix possible invalidated iterator. (#198482)
The begin or end iterator may be invalidated when idx_pos is erased
from the vector.
Unblocks sanitised CI.
Revert "[AArch64] Copy x4/x5 vararg payload into the x64 stack in Arm64EC exit thunks" (#198540)
Reverts llvm/llvm-project#190933
Reported issues with an EXPENSIVE_CHECKS_BUILD. Reverting so this can be
fixed without undue time pressure.
[Flang] Adding -ffree-line-length-<value> flag (#192941)
Added support for the `-ffree-line-length-<value>` flag in Flang, which
is equivalent to `-ffixed-line-length-<value>` but in free form.
This flag is supported by gfortran and can be used in some applications.
---------
Co-authored-by: Tarun Prabhu <tarunprabhu at gmail.com>
Co-authored-by: Andre Kuhlenschmidt <andre.kuhlenschmidt at gmail.com>
[Clang] [NFC] Use `unique_ptr<Lexer>` everywhere (#198393)
Replace every instance of `new Lexer` with `make_unique<Lexer>` and
adjust `Lexer::Create_PragmaLexer()` to return a `std::unique_ptr<Lexer>`
instead.
The Preprocessor was already storing a `unique_ptr<Lexer>`, so there’s
no need to change how that works.
[DAG] scalarizeExtractedBinOp - extract from non-constant one use buildvectors (#198013)
When attempting to scalarize a vector binop that has a single extract,
we currently only fold if either of the binop's operands is a constant
buildvector - but we can extract from non-constant buildvectors without
increasing instruction count as long as the vector binop was the only
use of the buildvector.
More yak shaving for #196493
[flang][acc] Handle Fortran do loops as acc loops in acc routine (#198420)
As was previously done for do loops in acc compute constructs in
https://github.com/llvm/llvm-project/issues/149614, this PR does the
same for do loops in `acc routine`. The rules are as follows:
- Do loops not marked with `acc loop` are considered `auto`
- Do concurrent loops are considered `independent`
- Any loops in an `acc routine seq` are considered `seq`
This ensures that the IV is correctly privatized and attached to the
acc loop.
Reland "[CodeGen] Use byte offsets and ptradd in ShadowStackGCLowering" (#197436)
Replace typed struct GEPs with byte array allocation and ptradd
operations:
1. Track root offsets as byte offsets instead of building typed struct.
2. Use `ComputeFrameLayout` to compute byte offsets based on DataLayout,
properly accounting for each root's size and alignment.
3. Allocate frame as `[FrameSize x i8]` byte array instead of typed
struct.
4. Replace all CreateGEP operations with CreatePtrAdd using computed
offsets.
5. Frame layout unchanged: `[Next ptr | Map ptr | Root 0 | Root 1 | ...
| Root N]` where each root is placed at its computed aligned offset.
6. Zero out padding between roots with memset for deterministic frame
contents for GC.
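The offset computation described in the steps above can be sketched as follows. This is a hypothetical Python model, not the body of `ComputeFrameLayout`: the function name `compute_frame_layout`, the `(size, alignment)` root tuples and the default 8-byte pointer size are all assumptions, and the sketch does not round the final frame size up to any alignment.

```python
def align_to(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def compute_frame_layout(roots, pointer_size=8):
    """Lay out [Next ptr | Map ptr | Root 0 | ... | Root N] in bytes.

    roots is a list of (size, alignment) pairs (assumed representation).
    Returns (root_offsets, padding_ranges, frame_size), where each padding
    range is a (start, length) pair that would be zeroed with memset.
    """
    offset = 2 * pointer_size  # header: Next ptr followed by Map ptr
    root_offsets = []
    padding_ranges = []
    for size, alignment in roots:
        aligned = align_to(offset, alignment)
        if aligned > offset:
            padding_ranges.append((offset, aligned - offset))
        root_offsets.append(aligned)
        offset = aligned + size
    return root_offsets, padding_ranges, offset
```

For example, roots of `(8, 8)`, `(4, 4)` and `(16, 16)` land at byte offsets 16, 24 and 32, with four bytes of padding starting at offset 28.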
Benefits:
- Removes dependency on `getAllocatedType` for building frame struct
[7 lines not shown]
[AMDGPU][NFCI] Change MCSubtargetInfo references in AMDGPUBaseInfo.h/.cpp to be const ref instead of pointers (#197038)
Change all `AMDGPU::IsaInfo` functions and `initDefaultAMDKernelCodeT`
to take `const MCSubtargetInfo &` instead of `const MCSubtargetInfo *`.
These functions never accept null, so a reference better expresses the
contract.
Also change `AMDGPUMCKernelCodeT::initDefault` to take a const reference
for consistency, and convert local `MCSubtargetInfo` pointer variables
to references in `AMDGPUMCExpr.cpp` where the pointer is always
dereferenced.
Requested by @arsenm in
https://github.com/llvm/llvm-project/pull/192306#discussion_r2076113671.
Co-authored-by: Claude Opus 4 (1M context) <noreply at anthropic.com>
[Utils] Examine debug info type instead of alloca type to guess the debug behavior of the alloca uses (#177480)
Replace the `isArray` and `isStructure` helpers, which queried the alloca
IR type, with an `isCompositeType` helper that checks the debug variable's
source-level type from debug info metadata to decide whether it seems
profitable to convert the debug info from a #debug_declare to a
#debug_value.
This changes behavior: the lowering decision is now based on the
source-level type from debug info rather than the IR alloca type, which
is more semantically correct for debug info processing. This should
have minimal effect on clang, but may change behavior more
significantly on front-ends like Rust that have not used semantically
meaningful alloca element types.
Removes all uses of getAllocatedType() from Utils/Local.cpp.
This seemed slightly more semantically correct to me, though it is
slightly challenging to enumerate all of the possible scalar debug
[7 lines not shown]
[VPlan] Simplify select x, (i1 y | z), y -> y | (x && z) (#190196)
Fixes https://github.com/llvm/llvm-project/issues/189553
This adds a canonicalization `select x, (i1 y | z), y -> y | (x && z)`
([Alive2](https://alive2.llvm.org/ce/z/qcQRn6)). InstCombine already
performs this.
The canonicalization causes the `lhs | (headermask && rhs)
-> vp.merge rhs, true, lhs, evl` pattern in optimizeMasksToEVL to match,
improving the RISC-V codegen for an anyof select reduction.
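For plain boolean values the rewrite can be confirmed with a truth table. The Python sketch below is only a sanity check over the eight defined input combinations; it does not model poison or undef (the linked Alive2 proof covers that), and the function names are chosen purely for illustration.

```python
from itertools import product


def select(cond, on_true, on_false):
    """Model of an i1 select over plain booleans."""
    return on_true if cond else on_false


def original(x, y, z):
    """select x, (y | z), y"""
    return select(x, y or z, y)


def canonical(x, y, z):
    """y | (x && z)"""
    return y or (x and z)


def forms_agree():
    """Compare both forms on every combination of boolean inputs."""
    return all(
        original(x, y, z) == canonical(x, y, z)
        for x, y, z in product([False, True], repeat=3)
    )
```

When `x` is false both forms reduce to `y`, and when `x` is true both reduce to `y | z`.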