Fix snapshot automount deadlock during concurrent zfs recv
zfsctl_snapshot_mount() holds z_teardown_lock(R) across
call_usermodehelper(), which spawns a mount process that needs
namespace_sem(W) via move_mount. Reading /proc/self/mountinfo holds
namespace_sem(R) and needs z_teardown_lock(R) via zpl_show_devname.
When zfs_suspend_fs (from zfs recv or zfs rollback) queues
z_teardown_lock(W), the rrwlock blocks new readers, completing the
deadlock cycle.
Fix by releasing z_teardown_lock(R) after gathering the dataset name
and mount path, before any blocking operation. Everything after the
release operates on local string copies or uses its own
synchronization. The parent zfsvfs pointer remains valid because the
caller holds a path reference to the automount trigger dentry.
Releasing the lock allows zfs_suspend_fs to proceed concurrently
with the mount helper, so dmu_objset_hold in zpl_get_tree can
transiently fail with ENOENT during the clone swap. The mount
[8 lines not shown]
Fix options memory leak in zfsctl_snapshot_mount
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #18415
zvol: Fix uses of uninitialized variables in zvol_rename_minors_impl()
Reported-by: GitHub Copilot
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Mark Johnston <markj at FreeBSD.org>
Closes #18191
zvol: Hold the zvol state writer lock when renaming
Otherwise nothing serializes updates to the global zvol hash table.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Mark Johnston <markj at FreeBSD.org>
Closes #18191
Make zvol_set_common() block until the operation has completed
This is motivated by a FreeBSD AIO test case which creates a zvol with -o
volmode=dev, then immediately tries to open the zvol device file. The
open occasionally fails with ENOENT.
When a zvol is created without the volmode setting, zvol_create_minors()
blocks until the task is finished, at which point OS-dependent code will
have created a device file. However, zvol_set_common() may cause the
device file to be destroyed and re-created, at least on FreeBSD, if the
volmode switches from GEOM to DEV. In this case, we do not block
waiting for the operation to finish, causing the test failure.
Fix the problem by making zvol_set_common() block until the operation
has finished. In FreeBSD zvol code, use g_waitidle() to block until
asynchronous GEOM operations are done. This fixes a secondary race
where zvol_os_remove_minor() does not block until the zvol device file
is removed, and the subsequent zvol_os_create_minor() fails because the
(to-be-destroyed) device file already exists.
[5 lines not shown]
FreeBSD: Fix zvol teardown races
zvol_geom_open() may be called to taste an orphaned provider. The test
for pp->private == NULL there is racy, since no lock synchronizes the
test.
Use the GEOM topology lock to interlock the pp->private == NULL test
with the zvol state checks. This establishes a new lock order but I
believe this is necessary. Set pp->private = NULL under the GEOM
topology lock instead of the per-zvol state lock. Modify
zvol_os_rename_minor() to drop the zvol state lock to avoid a lock order
reversal with the topology lock.
Also reverse the order of tests in zvol_geom_open() and zvol_cdev_open()
as at least zvol_geom_open() may race with zvol_os_remove_minor(), which
sets zv->zv_zso = NULL. Testing for ZVOL_REMOVING first avoids a race
which can lead to a NULL pointer dereference.
Add a new OS-specific flag to handle the case where zvol_geom_open()
[8 lines not shown]
draid: add failure domains support
Currently, the only way to tolerate the failure of a whole
enclosure is to configure several draid vdevs in the pool, each
vdev having disks from different enclosures. But this essentially
degrades draid to raidz and defeats the purpose of having fast
sequential resilvering on wide pools with draid.
This patch allows configuring several child groups in the
same row of one draid vdev. In each such group, call it a
failure group, the user can place disks belonging to different
enclosures (failure domains). For example, given 10 such
enclosures with 10 disks each, the user can put the 1st disk from
each enclosure into the 1st group, the 2nd disk from each
enclosure into the 2nd group, and so on. If one enclosure fails,
only one disk in each group fails, which does not affect draid
operation, and each group retains enough redundancy to recover
the stored data. Of course, in the case of draid2 two enclosures
can fail at a time, in the case of draid3 three enclosures
(provided there are no other
[52 lines not shown]
CI: set /etc/hostid in zloop runner
ztest can enable and disable the multihost property when testing.
This can result in a failure when attempting to import an existing
pool when multihost=on but no /etc/hostid file exists. Update the
workflow to use zgenhostid to create /etc/hostid when not present.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #18413
Add ability to set user properties while changing encryption key
`zfs change-key` changes the key used to encrypt a ZFS dataset. When
used programmatically, it may be useful to track some external state
related to the key in a user property, e.g. a generation number,
expiration date, or application-specific source of the key.
This can be done today by running `zfs set user:prop=value` before or
after running `zfs change-key`. However, this introduces a race
condition where the property may not be set even though the key has
changed, or vice versa (depending on the order the commands are
executed).
This can be addressed by using a channel program (`zfs program`) which
calls both `zfs.sync.change_key()` and `zfs.sync.set_prop()`, changing
the property and key atomically. However, it is nontrivial to write such
a channel program to handle error cases, and provide the new key
securely (e.g. without logging it).
[14 lines not shown]
Remove forced zfs_umount() from zfs_resume_fs() bail path
When zfsvfs_init() fails during zfs_resume_fs(), the bail
path called zfs_umount() directly. All three callers
(zfs_ioc_rollback, zfs_ioc_recv_impl, and
zfs_ioc_userspace_upgrade) hold an s_active reference
via getzfsvfs() at entry.
This creates two bugs:
1. Deadlock: zfs_umount() -> zfsvfs_teardown() ->
txg_wait_synced() blocks in uninterruptible D state.
The superblock cannot tear down because s_active is
pinned by the calling thread itself. Survives SIGKILL.
Blocks clean reboot. Requires hard power cycle.
2. Use-after-free: if txg_wait_synced() returns,
zfs_umount() calls zfsvfs_free(). The caller then
dereferences the freed zfsvfs via zfs_vfs_rele().
[12 lines not shown]
Fix s_active leak in zfsvfs_hold() when z_unmounted is true
When getzfsvfs() succeeds (incrementing s_active via
zfs_vfs_ref()), but z_unmounted is subsequently found to
be B_TRUE, zfsvfs_hold() returns EBUSY without calling
zfs_vfs_rele(). This permanently leaks the VFS superblock
s_active reference, preventing generic_shutdown_super()
from ever firing, which blocks dmu_objset_disown() and
makes the pool permanently unexportable (EBUSY).
Add the missing zfs_vfs_rele() call, guarded by
zfs_vfs_held() to handle the zfsvfs_create() fallback
path where no VFS reference exists. This matches the
existing cleanup pattern in zfsvfs_rele().
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: mischivus <1205832+mischivus at users.noreply.github.com>
Closes #18309
Closes #18310
draid: allow seq resilver reads from degraded vdevs
When sequentially resilvering, allow a dRAID child to be read
as long as the DTLs indicate it should have a good copy of the
data and the leaf isn't being rebuilt. The previous check was
slightly too broad and would skip dRAID spare and replacing
vdevs if one of their children was being replaced. As long
as there exists enough additional redundancy this is fine, but
when there isn't this vdev must be read in order to correctly
reconstruct the missing data.
A new test case has been added which exhausts the available
redundancy, faults another device causing it to be degraded,
and then performs a sequential resilver for the degraded device.
In such a situation enough redundancy exists to perform the
replacement and a scrub should detect no checksum errors.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Andriy Tkachuk <andriy.tkachuk at seagate.com>
[2 lines not shown]
Add support for POSIX_FADV_DONTNEED
For now, make it only evict the specified data from the dbuf cache.
Even though the dbuf cache is small, this may still reduce eviction of
more useful data from it, and slightly accelerate ARC evictions
by making the blocks there evictable a bit sooner.
On FreeBSD this also adds support for POSIX_FADV_NOREUSE, since the
kernel translates it into POSIX_FADV_DONTNEED after every read/write.
This is not as efficient as it could be for ZFS, but it is the only
way the FreeBSD kernel currently allows POSIX_FADV_NOREUSE to be
handled.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18399
fix memleak in spa_errlog.c
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Alan Somers <asomers at freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk at axcient.com>
Closes #18403
Linux: Refactor zpl_fadvise()
Similar to FreeBSD, stop issuing prefetches on POSIX_FADV_SEQUENTIAL.
It should not have that semantic; it should only hint the speculative
prefetcher in case accesses happen later. Instead, after handling
POSIX_FADV_WILLNEED, call generic_fadvise(), if available, to do all
the generic work, including setting f_mode in struct file, which we
can later use to control the prefetcher as part of read/write
operations.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18395
vdevprops: remove unused slow_io defaults, fix documentation
Remove the unused DEFAULT_SLOW_IO_N and DEFAULT_SLOW_IO_T defines
from zfs_diagnosis.c. Unlike the checksum and I/O thresholds, the
slow_io_n and slow_io_t properties must be manually opted in and
have no built-in defaults. The defines were misleading.
Update the vdevprops man page to clarify that slow_io_n and
slow_io_t must be manually set, and that the documented defaults
(10 errors in 600 seconds) apply only to checksum and I/O events.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes #18359
CI: Free 35GB of unused files on the runner
Free 35GB of unused files, mostly from unused development environments.
This helps with the out-of-disk-space problems we were seeing on
FreeBSD runners.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #18400
zpool create: report which device caused failure
When zpool create fails because a vdev is already in use, the
error message now identifies the problematic device and the pool
it belongs to, e.g.:
cannot create 'tank': device '/dev/sdb1' is part of
active pool 'rpool'
Implementation follows the ZPOOL_CONFIG_LOAD_INFO pattern used
by zpool import:
- Add spa_create_info to spa_t to capture error info during
vdev_label_init(), before vdev_close() resets vdev state
- When vdev_inuse() detects a conflict, read the on-disk
label to extract the pool name and store it with the
device path
- Return the info wrapped under ZPOOL_CONFIG_CREATE_INFO
through the ioctl zc_nvlist_dst to userspace
[10 lines not shown]
linux/vfsops: remove zfs_mnt_t, pass directly
A cleanup of opportunity. Since we already are modifying the contents of
zfs_mnt_t, we've broken any API guarantee, so we might as well go the
rest of the way and get rid of it, and just pass the osname and/or the
vfs_t directly.
It seems like zfs_mnt_t was never really needed anyway; it was added in
1c2555ef92 (March 2017) to minimise the difference to illumos, but
zfs_vfsops was made platform-specific anyway in 7b4e27232d.
We also remove setting SB_RDONLY on the caller's flags when failing a
read-write remount on a read-only snapshot or pool. Since 0f608aa6ca
the caller's flags have been a pointer back to fc->sb_flags, which are
discarded without further ceremony when the operation fails, so the
change is unnecessary and we can simplify the call further.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
[2 lines not shown]
linux/super: work around kernels that enforce "forbidden" mount options
Before Linux 5.8 (including RHEL 8), a fixed set of "forbidden"
options would be rejected outright. For those kernels, we work around
it by providing our own option parser, avoiding the codepath in the
kernel that would trigger the rejection.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18377
linux/super: implement new mount params parser
Adds zpl_parse_param and wires it up to the fs_context. This uses the
kernel's standard mount option parsing infrastructure to keep the work
we need to do to a minimum. We simply fill in the vfs_t we attached to
the fs_context in the previous commit, ready to go for the mount/remount
call.
Here we also document all the options we need to support, and why. It's
a lot of history but in the end the implementation is straightforward.
Finally, if we get SB_RDONLY on the proposed superblock flags, we record
that as the readonly mount option, because we haven't necessarily seen a
"ro" param and we still need to know for remount, the `readonly` dataset
property, etc.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18377
linux/super: match vfs_t lifetime to fs_context
vfs_t is initially just parameters for the mount or remount operation,
so match them to the lifetime of the fs_context that represents that
operation.
When we actually execute the operation (calling .get_tree or .reconfigure),
transfer ownership of those options to the associated zfsvfs_t.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18377
linux/super: remove zpl_parse_monolithic
Final bit of cleanup of the old method.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18377
linux/vfsops: remove old options parser
We're working to replace this, and it's easier to drop it outright
while we get set up.
To keep things compiling, the calls to zfsvfs_parse_options() are
replaced with zfsvfs_vfs_alloc(), though without any option parsing
nothing will work yet. That's OK; the next commits work towards it.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18377
linux/vfsops: add vfs_t allocator, make public
In a few commits, we're going to need to allocate and free vfs_t from
zpl_super.c as well, so let's keep them uniform.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18377
Add zoned_uid property with additive least privilege authorization
This implements zoned_uid - a ZFS property that delegates dataset
visibility and administration to user namespaces owned by a specific
UID, enabling rootless Podman/Docker with native ZFS storage.
Usage: zfs set zoned_uid=1000 pool/dataset
Problem solved:
- zfs zone requires an existing namespace PID
- Podman creates a new namespace on each container start
- Solution: delegate to UID, any namespace owned by that UID is
authorized
Authorization model — three-layer additive (all must pass):
L0 (auth): Namespace owner UID matches zoned_uid property
L1 (dsl_deleg): Per-operation grants via `zfs allow` (when pool
delegation is ON — the default)
[126 lines not shown]
zinject: add numeric suffix support for -r range
Parse range values with zfs_nicestrtonum() instead of strtoull()
so that -r accepts human-readable suffixes (K, M, G, T, P, E).
For example: zinject -r 1G,2G /pool/file
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes #18374
FreeBSD: Implement relatime property
While FreeBSD does not support relatime natively, it seems trivial
to implement it as a dataset property for consistency. To preserve
the status quo, its default is changed to off on FreeBSD. Now,
if explicitly enabled, it should actually work.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18385
ZTS: re-enable send_raw_ashift on FreeBSD
The test has been skipped on FreeBSD since 2023 (#14961) due to exceeding
the 10-minute CI timeout on FreeBSD 14. CI runs on the fork now show
the test completes well within limits:
FreeBSD 14.3-RELEASE: 10 seconds
FreeBSD 15.0-STABLE: 11 seconds
FreeBSD 16.0-CURRENT: 14 seconds
Remove the FreeBSD skip and the corresponding known skip entry.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes #18389