[libc] Increase the maximum RPC port size for future hardware (#188756)
Summary:
We store the locks in local device memory for performance and
simplicity. The number here needs to correspond to the maximum occupancy
so that we never have a situation where a GPU thread is blocking another
GPU thread.
The current number is sufficient for most hardware, but modern compute chips
like the MI300x are already pushing ~12000 resident waves. This has ABI
implications, so I'd like to bump it up sooner rather than later. The
ABI change is within what OpenMP expects (stability within LLVM major
versions), and a size mismatch will be caught statically, so there's no
risk of silent corruption.
[compiler-rt] Rework profile data handling for GPU targets (#187136)
Summary:
Currently, the GPU iterates through all of the present symbols and
copies them by prefix. This is inefficient as it requires a lot of small
high-latency data transfers rather than a few large ones. Additionally,
we force every single profiling symbol to have protected visibility.
This means potentially hundreds of unnecessary symbols in the symbol
table.
This PR changes the interface to move towards start / stop section
handling. AMDGPU supports this natively as an ELF target, so only small
changes are needed. Instead of overriding visibility, we use a single table
to define the bounds, which we can obtain with one contiguous load.
Using a table interface should also work for the in-progress HIP
implementation for this, as it wraps the start / stop sections into
standard void pointers which will be inside of an already mapped region
of memory, so they should be accessible from the HIP API.
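A minimal host-side sketch of the start / stop section idea, assuming an ELF toolchain (GNU ld or lld) and GCC/Clang attributes. The section name `demo_counters`, the two counters, and the `SectionBounds` table are hypothetical and are not the names used by compiler-rt; the point is only that one small table describes the whole contiguous range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical counters placed in a custom section. Because the section name
// is a valid C identifier, ELF linkers define __start_demo_counters and
// __stop_demo_counters around the section contents.
__attribute__((section("demo_counters"), used)) static std::uint64_t counter_a = 1;
__attribute__((section("demo_counters"), used)) static std::uint64_t counter_b = 2;

extern "C" {
extern std::uint64_t __start_demo_counters[];
extern std::uint64_t __stop_demo_counters[];
}

// Hypothetical bounds table: one small object describing the whole range,
// instead of one exported symbol per counter.
struct SectionBounds {
  const void *begin;
  const void *end;
};

static const SectionBounds demo_bounds = {__start_demo_counters,
                                          __stop_demo_counters};

// Size in bytes of the contiguous region described by the table.
std::size_t demo_section_bytes() {
  return static_cast<std::size_t>(
      reinterpret_cast<std::uintptr_t>(demo_bounds.end) -
      reinterpret_cast<std::uintptr_t>(demo_bounds.begin));
}

// Small self-check: the region must hold at least the two counters.
void demo_section_self_check() {
  assert(demo_section_bytes() >= 2 * sizeof(std::uint64_t));
}
```

With the bounds in hand, a reader of the data could copy `demo_section_bytes()` bytes starting at `demo_bounds.begin` in a single transfer rather than walking individual symbols.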
[13 lines not shown]
NAS-140407 / 25.10.2.2 / Fix FC/iSCSI path availability during ALUA failover (#18568)
Fixes FC/iSCSI path availability during HA failover when ALUA is
enabled.
Four independent problems caused paths to drop or I/O to fail during the
`dev_disk` -> `dev_vdisk` LUN swap window:
- **FC path death**: HA iSCSI session logout cascaded through SCST and
removed LUN mappings before the LUN swap, destroying the ALUA tgt_dev
filter and causing LUN NOT SUPPORTED on FC. Fixed by deferring
`reset_active` to after `become_active` has replaced all LUN mappings.
- **90-second global drain**: `activate_extents` wrote `active=1` via
sysfs, triggering `scst_suspend_activity(90s)`. Fixed by removing the
job entirely - `bind_alua_state=1` already handles dev_vdisk file-open
drain-free via `blockio_on_alua_state_change_finish`.
- **LUN replace blocks on in-flight commands**: `scst_acg_repl_lun`
[10 lines not shown]
[AMDGPU] Remove AMDGPUISD::FFBH_I32 and add ISD::CTLS lowering (#187694)
This is a continuation of the previously reverted
https://github.com/llvm/llvm-project/pull/178420
The patch removes custom AMDGPUISD::FFBH_I32 SelectionDAG node. Call
sites that need raw hardware semantics (LowerINT_TO_FP32, legalizeITOFP)
now use amdgcn_sffbh intrinsic directly. ISD::CTLS is added as a Custom
operation for i32.
Previous attempt had an issue:
The hardware v_ffbh_i32 instruction (v_cls_i32 on newer targets) has
different semantics than ISD::CTLS:
- sffbh returns [1, BitWidth-1] for normal values, -1 for
all-same-bits
- CTLS returns [0, BitWidth-2] for normal values, BitWidth-1 for
all-same-bits
Now LowerCTLS handles this by: sffbh -> umin(sffbh, BitWidth) -> sub 1.
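The mapping can be checked with a small software model for i32. This is only an illustrative sketch: `model_sffbh` and `reference_ctls` are hypothetical helpers written from the semantics listed above, not LLVM code, and the -1 result is modelled as the unsigned value 0xFFFFFFFF so that the unsigned minimum clamps it to BitWidth.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model of sffbh: index (counted from the MSB) of the first bit
// that differs from the sign bit, or -1 (0xFFFFFFFF) if all bits are equal.
std::uint32_t model_sffbh(std::int32_t x) {
  const std::uint32_t u = static_cast<std::uint32_t>(x);
  const std::uint32_t sign = u >> 31;
  for (std::uint32_t i = 1; i < 32; ++i) {
    if (((u >> (31 - i)) & 1u) != sign)
      return i;
  }
  return 0xFFFFFFFFu;
}

// Reference CTLS: number of bits below the sign bit that equal the sign bit.
std::uint32_t reference_ctls(std::int32_t x) {
  const std::uint32_t u = static_cast<std::uint32_t>(x);
  const std::uint32_t sign = u >> 31;
  std::uint32_t count = 0;
  for (int bit = 30; bit >= 0; --bit) {
    if (((u >> bit) & 1u) != sign)
      break;
    ++count;
  }
  return count;
}

// The lowering described above: umin(sffbh, BitWidth) - 1.
std::uint32_t lowered_ctls(std::int32_t x) {
  return std::min(model_sffbh(x), 32u) - 1u;
}

// Small self-check over the boundary cases.
void ctls_self_check() {
  const std::int32_t samples[] = {0, -1, 1, -2, INT32_MIN, INT32_MAX, 0x00010000};
  for (std::int32_t s : samples)
    assert(lowered_ctls(s) == reference_ctls(s));
}
```

For the all-same-bits inputs 0 and -1, the clamp turns 0xFFFFFFFF into 32, and subtracting 1 gives BitWidth-1 = 31, matching CTLS.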
[6 lines not shown]
[TargetLowering] In prepareUREMEqFold/prepareSREMEqFold, fix K=-1 for i64 elements. (#188600)
K is an unsigned, so it will be zero-extended to uint64_t for
the APInt constructor. If the ShSVT has more than 32 bits, we won't
create an all-ones ConstantSDNode.
To fix this, explicitly push an all ones constant to KAmts. This
also fixes an APInt ImplicitTrunc.
This allows turnVectorIntoSplatVector to work for this case.
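The underlying pitfall can be reproduced without any LLVM types. The sketch below assumes a 32-bit `unsigned` (true on the common LLVM host platforms); `widen_unsigned` and `is_all_ones` are hypothetical helpers used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Converting an unsigned to uint64_t zero-extends: the upper bits become 0.
std::uint64_t widen_unsigned(unsigned k) { return k; }

// True if the low `bits` bits of v are all set and no higher bit is set.
bool is_all_ones(std::uint64_t v, unsigned bits) {
  const std::uint64_t mask =
      bits >= 64 ? ~std::uint64_t{0} : ((std::uint64_t{1} << bits) - 1);
  return v == mask;
}

// Small self-check, assuming unsigned is 32 bits wide.
void widen_self_check() {
  const unsigned k = static_cast<unsigned>(-1);
  const std::uint64_t wide = widen_unsigned(k);
  assert(is_all_ones(wide, 32));   // all ones as a 32-bit value
  assert(!is_all_ones(wide, 64));  // but not as a 64-bit value
}
```

This is why a value that looks like -1 at 32 bits does not become an all-ones constant for a 64-bit element type unless the all-ones value is constructed explicitly at the target width.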
fusefs: redo vnode attribute locking
Previously, most fields in fuse_vnode_data were protected by the vnode
lock. But because DEBUG_VFS_LOCKS was never enabled by default until
stable/15, the assertions were never checked, and many were wrong.
Others were missing. This led to panics in stable/15 and 16.0-CURRENT,
when a vnode was expected to be exclusively locked but wasn't, for fuse
file systems that mount with "-o async".
In some places it isn't possible to exclusively lock the vnode when
accessing these fields. So protect them with a new mutex instead. This
fixes panics and unprotected field accesses in VOP_READ,
VOP_COPY_FILE_RANGE, VOP_GETATTR, VOP_BMAP, and FUSE_NOTIFY_INVAL_ENTRY.
Add assertions everywhere the protected fields are accessed.
Lock the vnode exclusively when handling FUSE_NOTIFY_INVAL_INODE.
During fuse_vnode_setsize, if the vnode isn't already exclusively
locked, use the vn_delayed_setsize mechanism. This fixes panics during
[14 lines not shown]
fusefs: add a regression test for a cluster_read bug
VOP_BMAP is purely advisory. If VOP_BMAP returns an error during
readahead, cluster_read should still succeed, because the actual data
was still read just fine.
Add a regression test for PR 264196, wherein cluster_read would fail if
VOP_BMAP did.
PR: 264196
Reported by: danfe
Reviewed by: arrowd
Differential Revision: https://reviews.freebsd.org/D51316
(cherry picked from commit 6d408ac490730614b3ed0ebd3caffcd23f303fb4)
vfs_cluster.c: Do not propagate VOP_BMAP errors to the caller
The code that makes this VOP_BMAP call tries to perform a read-ahead I/O
operation. Failing to do that for any reason isn't fatal for `cluster_read()`,
because we can still return some data to the caller. This change is consistent
with other places within `cluster_read()` where the error returned by VOP_BMAP
is not returned to the caller - see the `if (nblks > 1)` block above the changed
lines and `if (reqbp)` at the end of the function.
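The pattern can be sketched in user space as follows. This is not the kernel code: `advisory_bmap`, `ReadResult`, and `read_with_readahead` are hypothetical stand-ins that only illustrate treating a failed advisory lookup as "skip read-ahead" instead of "fail the read".

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>

struct ReadResult {
  int error;                    // 0 on success
  std::size_t bytes_read;       // data returned to the caller
  std::size_t readahead_bytes;  // extra data prefetched, if any
};

// Hypothetical stand-in for an advisory block-mapping call.
int advisory_bmap(bool fail) { return fail ? EIO : 0; }

// The requested block is assumed to have been read already; a mapping
// failure only disables the optional read-ahead.
ReadResult read_with_readahead(bool bmap_fails) {
  ReadResult result{0, 4096, 0};
  const int bmap_error = advisory_bmap(bmap_fails);
  if (bmap_error != 0)
    return result;  // do not propagate: the caller still gets its data
  result.readahead_bytes = 4096;
  return result;
}

// Small self-check of both paths.
void readahead_self_check() {
  assert(read_with_readahead(true).error == 0);
  assert(read_with_readahead(false).readahead_bytes == 4096);
}
```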
PR: 264196
Approved by: markj, kib
Differential Revision: https://reviews.freebsd.org/D51254
(cherry picked from commit 62aef3f73f38db9fb68bffc12cc8900fecd58f0e)
fusefs: remove the obsolete rename_lock
This lock was included in the original GSoC submission. Its purpose
seems to have been to prevent concurrent FUSE_RENAME operations for the
current mountpoint, as well as to synchronize FUSE_RENAME with
fuse_vnode_setparent. But it's obsolete, now that ef6ea91593e added
mnt_renamelock.
Sponsored by: ConnectWise
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D55231
(cherry picked from commit 7755a406a6ae3801e885a79f714155f97c4d2bc6)
[OpenMP][flang] Fix crash in host offload (#187847)
Guard `getGridValue` in `OMPIRBuilder` to avoid reaching the
`unreachable` in `getGridValue` when offloading to host device without
an explicit num_threads clause.
[CIR] Add support for __atomic_fetch_uinc and __atomic_fetch_udec (#188050)
This patch adds CIRGen and LLVM lowering support for the
`__atomic_fetch_uinc` and the `__atomic_fetch_udec` built-in functions.
Assisted-by: Claude Opus 4.6
[Driver][HIP] Bundle AMDGPU -S output under the new offload driver (#188262)
The old offload driver emits bundled assembly code for -S in textual
clang-offload-bundler format. This allows a single .s file to contain
assembly code for both host and devices, which can be consumed by clang.
This eases manual optimization of assembly code for host and device.
There are existing HIP tests and examples depending on this feature. The
new offload driver does not support it, causing regressions. This patch
adds support for this feature with minor changes to the job action
creations.
Fixes: LCOMPILER-553
[OpenMP] Fix non-contiguous array omp target update (#156889)
The existing implementation has three issues which this patch addresses.
1. The last dimension, which represents the bytes in the type, has the
wrong stride and count. For example, for a 4-byte int, count=1 and
stride=4. The correct representation here is count=4 and stride=1
because there are 4 bytes (count=4) that we need to copy and we do not
skip any bytes (stride=1).
2. The size of the data copy was computed using the last dimension.
However, this is incorrect in cases where some of the final dimensions
get merged into one. In this case we need to take the combined size of
the merged dimensions, which is (Count * Stride) of the first merged
dimension.
3. The Offset into a dimension was computed as a multiple of its Stride.
However, this Stride, which is in bytes, already includes the stride
multiplier given by the user. This means that when the user specified
[3 lines not shown]
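A rough sketch of points 1 and 2, under stated assumptions: the `Dim` struct and `contiguous_chunk_bytes` are hypothetical and are not the OpenMP runtime's data structures. The sketch further assumes that dimensions are listed from outermost to innermost, that strides are in bytes, and that a dimension merges with the ones inside it when its stride equals the span (Count * Stride) of the next inner dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor: `count` elements, `stride` bytes apart.
struct Dim {
  std::uint64_t count;
  std::uint64_t stride;
};

// Bytes in one contiguous copy: Count * Stride of the first (outermost)
// dimension that merges with everything inside it.
std::uint64_t contiguous_chunk_bytes(const std::vector<Dim> &dims) {
  if (dims.empty())
    return 0;
  std::size_t first = dims.size() - 1;
  while (first > 0 &&
         dims[first - 1].stride == dims[first].count * dims[first].stride)
    --first;
  return dims[first].count * dims[first].stride;
}

// Small self-check: int a[4][8], updating a[0:3][0:5].
void chunk_self_check() {
  // Rows are 32 bytes apart, 5 ints per row, and the byte dimension of a
  // 4-byte int is count=4, stride=1 as described in point 1.
  const std::vector<Dim> dims = {{3, 32}, {5, 4}, {4, 1}};
  assert(contiguous_chunk_bytes(dims) == 20);
}
```

In this example the two innermost dimensions merge into one 20-byte chunk (5 * 4), while the row dimension does not merge because its 32-byte stride leaves a gap between chunks.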