[SandboxVectorizer] Implement topdown vectorizer
This patch introduces the `top-down-vec` pass to the Sandbox Vectorizer,
adding the ability to traverse use-def chains top-down to discover and
collect vectorization opportunities.
Key changes include:
* TopDownVec Pass: Implemented `TopDownVec` which recursively processes
value bundles top-down, creates vectorization actions (widening, packing,
shuffles), and emits the final vector IR.
* Shared Infrastructure (VecPassBase): Extracted common IR emission logic
out of `BottomUpVec` and into a new shared base class, `VecPassBase`.
Functions for generating vector instructions, handling diamond reuse,
creating shuffles/packs, and collecting dead instructions are now shared
between the bottom-up and top-down vectorizers to prevent code
duplication.
* Pass Registration: Exposed `top-down-vec` in `PassRegistry.def` and
`SandboxVectorizerPassBuilder`, allowing it to be invoked within pass
pipelines via `opt`.
[3 lines not shown]
Reland [Allocator] Keep bump pointer at a minimum alignment (#205240)
Reland #203718 (reverted in #205091), now doing the computation in the
integer domain to avoid UB (nullptr + non-zero offset).
Add a `MinAlign` template parameter (default 8, sizeof(size_t) on 64-bit
platforms) so that the common case `Alignment <= MinAlign` can skip
realigning `CurPtr`.
This is achieved by rounding each allocation's size up to MinAlign, so
the bump pointer stays MinAlign-aligned between allocations.
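The rounding scheme described above can be sketched as follows. This is a minimal toy model, not the actual LLVM `BumpPtrAllocator` code: `ToyBumpAllocator` and `alignUp` are hypothetical names, and the single fixed slab is an assumption made to keep the example short. The bump pointer is held as an integer, so no pointer arithmetic on a possibly-null pointer occurs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: round Value up to a multiple of Align (a power of two).
constexpr std::uintptr_t alignUp(std::uintptr_t Value, std::size_t Align) {
  return (Value + Align - 1) & ~(static_cast<std::uintptr_t>(Align) - 1);
}

// Toy single-slab bump allocator (illustrative only).
template <std::size_t MinAlign = 8> class ToyBumpAllocator {
  std::uintptr_t Cur; // bump pointer, kept in the integer domain
  std::uintptr_t End; // one past the end of the slab

public:
  ToyBumpAllocator(void *Slab, std::size_t Size)
      : Cur(reinterpret_cast<std::uintptr_t>(Slab)), End(Cur + Size) {
    assert(Cur % MinAlign == 0 && "slab must be MinAlign-aligned");
  }

  void *allocate(std::size_t Size, std::size_t Alignment) {
    std::uintptr_t Start = Cur;
    // Common case: Alignment <= MinAlign, so Cur is already aligned enough.
    if (Alignment > MinAlign)
      Start = alignUp(Start, Alignment);
    // Round the size up so Cur stays MinAlign-aligned for the next call.
    std::uintptr_t Padded = alignUp(Size, MinAlign);
    if (Start > End || Padded > End - Start)
      return nullptr; // out of space in this toy model
    Cur = Start + Padded;
    return reinterpret_cast<void *>(Start);
  }
};
```

With `MinAlign = 8`, a 3-byte allocation followed by a 1-byte allocation lands 8 bytes apart; with `MinAlign = 1` the same two allocations are packed 3 bytes apart, which is the tight packing a fixed-stride walk would need.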
SpecificBumpPtrAllocator::DestroyAll() walks objects at a fixed
sizeof(T) stride and needs tight packing, so it uses MinAlign=1.
(alignof(T) would pack just as tightly and reuse the default
instantiation, but T may be incomplete here, e.g.
`SpecificBumpPtrAllocator<MCSectionELF>`.)
Its `Allocate` still skips the realign: the slab is max_align_t-aligned
[9 lines not shown]
[OpenMPOpt][Attributor] Selectively seed deglobalization AAs (#198710)
This addresses a compile-time issue observed on a large generated C++
translation unit compiled with `-fopenmp`.
The source code is not OpenMP-heavy. It mainly consists of generated
function-registration wrappers, template instantiations, lambdas, and
small helper functions. However, because the TU is compiled with OpenMP
enabled, `OpenMPOptCGSCCPass` runs and drives Attributor on a module
with many functions.
`OpenMPOpt::registerAAsForFunction` currently eagerly creates the
deglobalization AAs for every function in OpenMP device modules:
* `AAHeapToShared`
* `AAHeapToStack`
Most generated wrapper/helper functions in the motivating workload do
not contain `__kmpc_alloc_shared`, removable allocations, or free-like
[25 lines not shown]
[AMDGPU] Fold constant offsets into named barrier addresses
Allow isOffsetFoldingLegal to fold a constant offset into an LDS
named-barrier global, and include the node offset when materializing the
LDS address in LowerGlobalAddress. s_barrier_signal_var on a GEP'd named
barrier now selects the immediate form, matching a bare global and GlobalISel.
With object linking the offset folds into the relocation addend.
Change-Id: I639bc723eb001573585cc05d0ad19f2773054f21
Assisted-by: Cursor
[AMDGPU] Pre-commit test for constant-offset named barrier signal_var
A GEP into a named-barrier array (&bars[1]) lowers s_barrier_signal_var to
the dynamic m0 form on SelectionDAG, unlike the bare global and GlobalISel.
With object linking it emits a runtime add of the offset instead of folding
it into the relocation addend.
Change-Id: I7cea0dd64d050eb3e2143841e7136355cbb3bc50
Assisted-by: Cursor
[AMDGPU] Fold constant offsets into named barrier addresses
Allow isOffsetFoldingLegal to fold a constant offset into an LDS
named-barrier global, and include the node offset when materializing the
LDS address in LowerGlobalAddress. s_barrier_signal_var on a GEP'd named
barrier now selects the immediate form, matching a bare global and GlobalISel.
With object linking the offset folds into the relocation addend.
Change-Id: Ie05b8c8cd127604ff174c423a74340fd2de4e405
Assisted-by: Cursor
[AMDGPU] Pre-commit test for constant-offset named barrier signal_var
A GEP into a named-barrier array (&bars[1]) lowers s_barrier_signal_var to
the dynamic m0 form on SelectionDAG, unlike the bare global and GlobalISel.
With object linking it emits a runtime add of the offset instead of folding
it into the relocation addend.
Change-Id: I59f0e6fe6a72b4c96c8efb926610f7f2d3833e38
Assisted-by: Cursor
Merge tag 'erofs-for-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"The most notable change is the removal of the fscache backend: it has
been deprecated for almost two years, mainly because EROFS file-backed
mounts and fanotify pre-content hooks (together with erofs-utils) now
provide better functionality and a simpler codebase. In addition,
fscache has depended on netfslib for years, which is undesirable for
EROFS since it is a local filesystem. More details in [1].
In addition, sparse support has been added to the pcluster layout,
which is helpful for large sparse AI datasets, and map requests for
chunk-based inodes have been optimized to be more efficient as well.
There are also the usual fixes and cleanups.
Summary:
- Report more consecutive chunks of the same type for
each iomap request
[21 lines not shown]
don't increment scatterlist length twice
this occurs as sg_dma_len() returns the length member of struct scatterlist,
whereas on x86 Linux it returns a dma_length member of the struct
Problem reported by Ryan Fahy in FreeBSD drm-kmod PR 468.
Avoids a 'Data modified on freelist' panic on boot when using discrete
Intel cards (DG2). DG2 has other issues, so remains disabled for now.
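The aliasing problem described above can be illustrated with a simplified model. This is only a sketch under stated assumptions: `SgAliased`, `SgSeparate`, `dmaLen`, `growBoth` and `growOnce` are hypothetical names, not the real kernel or drm-kmod definitions, and the actual fix in the tree may look different.

```cpp
#include <cassert>

// Hypothetical model of a port where the DMA length aliases `length`.
struct SgAliased {
  unsigned length = 0;
  unsigned &dmaLen() { return length; } // same storage as `length`
};

// Hypothetical model of the x86 Linux layout with a separate member.
struct SgSeparate {
  unsigned length = 0;
  unsigned dma_length = 0;
  unsigned &dmaLen() { return dma_length; }
};

// Code written for the separate layout: bumps both fields.
template <typename Sg> void growBoth(Sg &S, unsigned Len) {
  S.length += Len;
  S.dmaLen() += Len; // second increment of `length` when aliased
}

// One possible guard: bump the DMA length only if it is distinct storage.
template <typename Sg> void growOnce(Sg &S, unsigned Len) {
  S.length += Len;
  if (&S.dmaLen() != &S.length)
    S.dmaLen() += Len;
}
```

In this model, `growBoth` on `SgAliased` with a length of 4096 leaves `length` at 8192, while `growOnce` leaves it at 4096 for both layouts.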
[RISCV][P-ext] packed exchanged add/sub codegen (#203473)
Wire up the already-defined exchanged add/sub instructions
pas/psa/psas/pssa/paas/pasa with llvm.riscv.* intrinsics and isel
patterns.
[Instrumentor] Add runtime examples: [2/N] A FP precision analysis
Second example:
Check all floating point operations and track if they could be done at
lower precision.
Partially developed by Claude (AI), tested and verified by me.
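As a rough illustration of the idea of tracking whether an operation could be done at lower precision, the sketch below uses one possible criterion; it is an assumption for illustration and not necessarily what the example runtime in this patch does. `PrecisionStats`, `fitsFloat` and `trackedAdd` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical counters for the analysis.
struct PrecisionStats {
  unsigned long Total = 0;   // double additions observed
  unsigned long FloatOk = 0; // additions that float would reproduce exactly
};

// True if X survives a round trip through float unchanged.
inline bool fitsFloat(double X) {
  return static_cast<double>(static_cast<float>(X)) == X;
}

// Assumed criterion: both operands are exactly representable as float and
// the float sum equals the double sum.
inline double trackedAdd(double A, double B, PrecisionStats &S) {
  double R = A + B;
  ++S.Total;
  if (fitsFloat(A) && fitsFloat(B)) {
    float Narrow = static_cast<float>(A) + static_cast<float>(B);
    if (static_cast<double>(Narrow) == R)
      ++S.FloatOk;
  }
  return R;
}
```

For example, `1.5 + 2.25` would be counted as float-safe under this criterion, whereas `0.1 + 0.2` would not, because `0.1` is not exactly representable as a float with the same value as the double.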
nfscl: Add support for flexible file layout striping
Commit 72e57bc26417 added support for striping to the pNFS
server configuration. This patch adds support for striping
to the NFS client.
For striped flexible file layouts, an extra structure
must be malloc()d for each stripe, since the number
of stripe servers can vary from one mirror to another.
This new structure is called nfsffs and a single one
of these structures is in the nfsffm structure so that
the non-striped layouts can avoid the additional malloc()s.
This patch only affects NFSv4.1/4.2 mounts that use the
"pnfs" mount option against servers that support the
flexible file layout.
[BPF] Increase BPFMaxStoresPerMemFunc from 128 to 192 (#205222)
With commits [1] and [2], memory operations like memcpy/memmove lower to
a sequence of loads/stores whose width is the minimum of the source and
destination alignment, and the store count is bounded by
BPFMaxStoresPerMemFunc. For 1-byte alignment, the maximum copy length
that can be inlined is therefore 128 bytes.
This may regress cases that previously inlined. Consider a memcpy with
src alignment 8, dst alignment 1 and size 136. After [1]/[2], the store
width is the minimum alignment (1 byte), so the store count is 136,
which exceeds the 128 limit and the copy falls back. Before [1]/[2], the
store count was computed with a fixed 8-byte unit regardless of the
actual alignment (each unit expands to 8 one-byte stores when the
minimum alignment is 1), so the total count was only 17 (136/8 < 128)
and the copy was inlined.
Raise the limit from 128 to 192 to mitigate. Alternatively, users can
increase alignment to avoid the regression.
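The arithmetic above can be checked with a small model. This is a sketch only: `storesAtMinAlign`, `storesAtFixedUnit` and `fitsLimit` are hypothetical helpers, not functions from the BPF backend, and the cap of 8 bytes per store is an assumption of the model.

```cpp
#include <algorithm>

// Store count when the store width is the minimum of the two alignments.
// Assumption: a single store is at most 8 bytes wide.
constexpr unsigned storesAtMinAlign(unsigned Size, unsigned SrcAlign,
                                    unsigned DstAlign) {
  unsigned Width = std::min(std::min(SrcAlign, DstAlign), 8u);
  return (Size + Width - 1) / Width;
}

// Store count as previously computed: a fixed 8-byte unit.
constexpr unsigned storesAtFixedUnit(unsigned Size, unsigned Unit = 8) {
  return (Size + Unit - 1) / Unit;
}

// The copy is inlined only if the store count does not exceed the limit.
constexpr bool fitsLimit(unsigned Stores, unsigned Limit) {
  return Stores <= Limit;
}

// The case from the text: src alignment 8, dst alignment 1, size 136.
static_assert(storesAtMinAlign(136, 8, 1) == 136, "1-byte stores");
static_assert(storesAtFixedUnit(136) == 17, "fixed 8-byte units");
static_assert(!fitsLimit(136, 128), "exceeds the old limit");
static_assert(fitsLimit(136, 192), "fits the new limit");
```

Under this model, the largest 1-byte-aligned copy that can still be inlined grows from 128 to 192 bytes.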
[2 lines not shown]
[Instrumentor] Add runtime examples: [2/N] A FP precision analysis
Second example:
Check all floating point operations and track if they could be done at
lower precision.
Partially developed by Claude (AI), tested and verified by me.
[Instrumentor] Add runtime examples: [1/N] A flop counter
This adds an instrumentor-examples folder to compiler-rt to showcase
use cases of the instrumentor. The initial example is a program that,
via instrumentation, counts the number of flops performed.
Partially developed by Claude (AI), tested and verified by me.