Linux: annotate nested xattr setattr znode locks
zfs_setattr() updates both the target znode and its hidden xattr
directory when ownership, mode, or project ID changes. The xattr
directory uses the same z_acl_lock and z_lock classes as the
parent znode, so lockdep reports recursive locking when the
second znode's mutexes are acquired.
This is a lockdep false positive rather than a real deadlock.
attrzp is the target file's hidden xattr directory, and the code
does not acquire these znode mutexes in the reverse order.
Acquire the attrzp mutexes with mutex_enter_nested() so lockdep
treats them as nested.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: ZhengYuan Huang <gality369 at gmail.com>
Co-authored-by: gality369 <gality369 at example.com>
Closes #18506
zarcstat: detect attached L2ARC device with no data
zarcstat and zarcsummary detected L2ARC presence using the l2_size
kstat, which reports the amount of data held in L2ARC, not whether a
cache device is attached. When a cache device was attached but empty
(freshly added, or fully evicted):
- zarcstat rejected "-f l2*" with "Incompatible field specified!"
- zarcsummary printed "L2ARC not detected, skipping section",
hiding cumulative I/O history and health counters
Expose the existing l2arc_ndev counter as a new kstat l2_dev_count.
It is maintained by l2arc_add_vdev() and l2arc_remove_vdev(), so it
tracks attachment in real time. Use it in both tools, falling back to
l2_size for compatibility with older kernel modules.
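
The fallback described above can be sketched roughly as follows. This is
an illustration only, not the code from the tools: the function name
l2arc_present() and the plain-dict input are assumptions made for the
example; only the kstat names l2_dev_count and l2_size come from this
commit message.

```python
def l2arc_present(kstats):
    """Return True if a cache device appears to be attached.

    `kstats` is a plain dict of arcstats names to integers
    (a hypothetical input shape chosen for this sketch).
    """
    if "l2_dev_count" in kstats:
        # Newer modules: number of attached cache devices.
        return kstats["l2_dev_count"] > 0
    # Older modules lack l2_dev_count: fall back to the amount of
    # data held in L2ARC, which misses attached-but-empty devices.
    return kstats.get("l2_size", 0) > 0
```

With this shape, an attached but empty device ({"l2_dev_count": 1,
"l2_size": 0}) is detected, while an older module reporting only
l2_size keeps the previous behavior.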
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #18499
zdb: detect BRT and DDT leaks during block traversal
During -b traversal, track BRT and DDT reference counts and report
blocks claimed more times than their reference tables account for
if it causes claim errors, instead of just asserting it. Also
report entries with references not fully consumed by the traversal.
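
As a rough model of the two checks, the sketch below compares how often
a traversal claimed each block with the count a reference table expects.
It is deliberately simplified and does not reflect zdb's data
structures: check_refs(), the dict-based table, and the rule that a
block absent from the table may be claimed once are all assumptions
made for the example.

```python
from collections import Counter


def check_refs(table_refs, claims):
    """Compare traversal claims with expected reference counts.

    table_refs: dict of block key -> expected number of claims.
    claims: iterable of block keys, one per claim by the traversal.
    Returns (over_claimed, leaked), each a dict of key -> difference.
    """
    seen = Counter(claims)
    over_claimed = {}
    for key, count in seen.items():
        # Assumption: a block not in the table may be claimed once.
        expected = table_refs.get(key, 1)
        if count > expected:
            over_claimed[key] = count - expected
    leaked = {}
    for key, expected in table_refs.items():
        # References the traversal never consumed.
        if seen.get(key, 0) < expected:
            leaked[key] = expected - seen.get(key, 0)
    return over_claimed, leaked
```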
Add zdb leak checks to the cloning and dedup tests. This should make
sure the pools are in a sane state after completing the functional
tests.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18494
ZTS: redundancy_draid_spare1
Preserve the 'zpool status' output used to calculate the number of
checksum errors so it can be logged on failure. Several instances have
been observed in the CI where cksum was set to a non-zero value, yet a
subsequent run of 'zpool status' on failure showed no checksum errors.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #18500
Add some more file layout output, triggered by -v
With one -v, the block type (parity or data) is printed (matching
the ASCII-art version); with two -v, the offset into the file is
also printed.
This also updates the man page, and adds some simple
test scripts.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Sean Fagan <sean.fagan at klarasystems.com>
Signed-off-by: Sean Fagan <sean.fagan at klarasystems.com>
Closes #18470
sa: fix sa_add_projid lock ordering
sa_add_projid() currently acquires hdl->sa_lock before zp->z_lock.
Several same-znode update paths take zp->z_lock and then call
sa_update() or sa_bulk_update() on the same SA handle.
On Linux, FS_IOC_FSSETXATTR reaches zfs_setattr() through
zpl_ioctl_setxattr() without outer inode serialization. This makes
the reversed lock order a real ABBA deadlock rather than a lockdep
false positive when projid is added to an old-format inode while
another thread updates the same znode.
Acquire zp->z_lock before hdl->sa_lock in sa_add_projid() to match
the existing znode update ordering.
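
The ordering rule can be illustrated with a small user-space sketch. It
is not kernel code: the two threading.Lock objects merely stand in for
zp->z_lock and hdl->sa_lock, and the function names are hypothetical.
The point is that both paths take the locks in the same order, so
neither thread can hold one lock while waiting for the other in
reverse.

```python
import threading

z_lock = threading.Lock()   # stands in for zp->z_lock
sa_lock = threading.Lock()  # stands in for hdl->sa_lock


def update_path(log):
    # Existing update paths: z_lock first, then sa_lock.
    with z_lock:
        with sa_lock:
            log.append("update")


def add_projid_path(log):
    # After the fix: the same order as the update paths.
    with z_lock:
        with sa_lock:
            log.append("add_projid")


def run_both(rounds=50):
    """Run both paths concurrently and return the combined log."""
    log = []
    threads = []
    for _ in range(rounds):
        threads.append(threading.Thread(target=update_path, args=(log,)))
        threads.append(threading.Thread(target=add_projid_path, args=(log,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

If add_projid_path() took sa_lock before z_lock instead, two threads
could each hold one lock and wait forever for the other, which is the
ABBA pattern described above.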
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: ZhengYuan Huang <gality369 at gmail.com>
Co-authored-by: gality369 <gality369 at example.com>
Closes #18503
ZTS: zpool_iostat_002_pos remove sleep
In the CI environment, commands may occasionally take longer than
expected. For zpool_iostat_002_pos this can cause a failure if fewer
than the expected number of lines are logged in time. To prevent
this issue, relax the time constraint and simply verify that the
command ran to completion and generated the correct number of lines.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #18501
Vdev allocation bias/class change
Normal, special and dedup vdevs differ only by space allocation
bias. Normal and special vdevs might even legally store blocks
targeted to other classes. Dedup vdevs don't normally do it, but
there is no real reason why they can't. Considering this, it is
not impossible to change the allocation bias for those vdevs.
This change introduces a new top-level vdev property, alloc_bias,
which reports the current bias for the vdev and allows it to be
changed. This makes it easy to change a vdev's role in a pool,
especially if vdev removal is impossible. To avoid complicating the
code, changes take effect only on the next pool import.
Changes to/from log vdev could also be theoretically possible, but
they are artificially blocked for now, partially due to additional
complications, and partially due to potential danger of placing
other blocks on log vdevs, that would otherwise be non-fatal.
[3 lines not shown]
ZTS: removal_with_export.ksh busy export
If the pool is active 'zpool export' will fail resulting in
a test failure. Swap log_must with log_must_busy so the export
is retried when reported as busy before failing the test.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #18498
zfs: annotate nested dd_lock in reservation sync accounting
When reservation sync updates a child's reserved space, it rolls the
delta into ancestor space accounting while still holding the child's
dd_lock. That locking order is intentional, but Linux lockdep sees
the ancestor acquisition as recursive because it lacks a nested lock
subclass annotation.
Teach the reservation-sync space-accounting path to acquire ancestor
dd_lock instances with a nested subclass. Keep the existing public
interfaces and accounting behavior unchanged by routing only the
ancestor rollup through local helpers.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: ZhengYuan Huang <gality369 at gmail.com>
Signed-off-by: gality369 <gality369 at example.com>
Closes #18497
ZTS: use 'zpool trim -w' in zpool_trim_partial.ksh
Don't use trim_progress(), which is racy, to wait for the pool trim
to complete. Instead use the wait (-w) option, which is intended
for this.
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #18496
ZTS: Remove threadsappend_001_pos exception
Commit f828a80c may have resolved the underlying cause for
the occasional CI failures observed for this test. Remove
the exception to ensure any new occurrences are noticed.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #6136
Closes #18495
Zstd: rework ZSTD_isError symbol renaming
The import of Zstd v1.5.7 in a2ac9cd606ce2428c23cc89cec6f0392424e82c9
added an unconditional renaming of ZSTD_isError to zfs_ZSTD_isError
with an asm directive. Instead, do it with a define that is conditioned
on whether zstd_compat_wrapper.h is actually in use. Also add a define
to that header so that it can be detected. This allows the build to
work without using the compat wrapper.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Ryan Libby <rlibby at FreeBSD.org>
Closes #18483
linux: verify stale znodes in legacy fallocate
The mode=0 and FALLOC_FL_KEEP_SIZE preallocation path can reach
zfs_freesp() directly and call zfs_statvfs() before going through the
normal zpl_enter_verify_zp() boundary.
When zfs_rezget() tears down a failed SA reload, a stale inode may
remain alive in the VFS with z_sa_hdl cleared. The unchecked
fallocate path can then reach sa_lookup(zp->z_sa_hdl, ...) through
zfs_statvfs() or zfs_freesp() and crash on a NULL SA handle.
Use zfs_enter_verify_zp() in zfs_statvfs() so stale znodes are
rejected under the teardown lock for both fallocate and statfs.
Also wrap the direct zfs_freesp() call in
zpl_enter_verify_zp()/zfs_exit() so this path follows the same
validation rules as the other Linux ZPL file operations.
Fixes: f734301d2267 ("linux: add basic fallocate(mode=0/2) compatibility")
[4 lines not shown]
Update description of spl_schedule_hrtimeout_slack_us
Clarify the effect of the non-zero value on wakeup coalescing.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes #18467
man: document three missing properties and tunables
Add manpage entries for parameters and properties that exist in
source but were not previously described:
- spl.4: spl_schedule_hrtimeout_slack_us
- zfsprops.7: longname
- vdevprops.7: raidz_expanding
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes #18467
CI: FreeBSD 15.1 PRERELEASE (#18490)
Update freebsd15-0s builder to freebsd15-1s and point it at the
15.1-PRERELEASE tag. The previous freebsd-15.0-STABLE images are
no longer available.
Additionally, add a freebsd15-0r stanza for the RELEASE.
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Fix long POSIX_FADV_DONTNEED for single block files
dbuf_whichblock() is not made to handle offsets beyond the block
end for single-block objects. Handle it in dmu_evict_range(),
similar to dmu_prefetch_by_dnode().
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18399
Closes #18489
CI/GCC: Add Fedora 44, fix build errors and threadsappend
- Add Fedora 44 to CI tests
- Fix build issues from the newer compiler. These are mostly 'char *'
to 'const char *' conversions.
- Fix threadsappend.c test waiting for the same thread TID twice.
This caused the test to hang on F44 (but strangely not other OSs?)
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #18478
Initialize vr_last_txg for rebuild
Only call txg_wait_synced() when rebuild IOs were issued for this
metaslab. This is a small optimization since in practice the first
metaslab is very likely to have allocations and cause vr_last_txg
to be initialized. After this point when processing empty metaslabs
txg_wait_synced() is called but with an already committed txg so it
will not wait. Still it's better not to call txg_wait_synced() at
all when it's not needed.
Reviewed-by: Andriy Tkachuk <atkachuk at wasabi.com>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #18482
Avoid flushing unrelated NFS exports on snapshot unmount
zfsctl_snapshot_unmount() called exportfs_flush() before every umount
attempt to drop NFS export cache references that pin the snapshot
mountpoint. The flush has global effect on the host's NFS exports and
clients, so paying it on every snapshot unmount (including auto-expire
rounds for snapshots that were never NFS-accessed) impacts unrelated
snapshots and clients.
ZFS cannot invalidate individual export cache entries because the
relevant sunrpc cache APIs are exported GPL-only. Defer the global
flush so it runs only when the umount has actually failed, then retry
once. Snapshots that are not NFS-pinned succeed on the first attempt
and never trigger the flush.
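
The control flow described above amounts to "try, flush only on
failure, retry once". A minimal sketch follows, assuming two
hypothetical callables: try_umount() returns True on success and
flush_exports() performs the global flush. Neither is the kernel
interface.

```python
def unmount_snapshot(try_umount, flush_exports):
    """Unmount, paying for the global flush only if needed.

    try_umount:    hypothetical callable returning True on success.
    flush_exports: hypothetical callable doing the global flush.
    """
    if try_umount():
        # Not NFS-pinned: no flush, no effect on other exports.
        return True
    # The first attempt failed, possibly due to export cache
    # references; flush once and retry exactly once.
    flush_exports()
    return try_umount()
```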
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Youzhong Yang <yyang at mathworks.com>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #18476
Fix rare cksum errors after rebuild
Currently, after a rebuild (aka sequential resilver), checksum
errors can sometimes be seen on the spare vdev or draid spare.
On my laptop, it happens 2 to 4 times when running the
redundancy_draid_spare1 test in a loop 100 times.
It looks like there's a race in vdev_rebuild_thread() when the
rebuild of space map ranges is finished and we re-enable
allocations from the metaslab too soon: new allocations may
happen from that metaslab before the txg with the rebuilt ranges
is synced, causing undesirable interference.
Solution: wait for the txg to be synced before enabling the metaslab.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Akash B <akash-b at hpe.com>
Signed-off-by: Andriy Tkachuk <atkachuk at wasabi.com>
Closes #18307
Closes #18319
Closes #18473
Fix off-by-one in PREVIOUSLY_REDACTED handler that drops last block
In send_reader_thread(), the PREVIOUSLY_REDACTED handler computed
file_max as MIN(dn->dn_maxblkid, range->end_blkid). dn_maxblkid is
an inclusive maximum block ID while range->end_blkid is exclusive (one
past the last block). The resulting file_max was then used as an
exclusive loop bound, causing the last block of any file (at index
dn_maxblkid) to be silently skipped when a PREVIOUSLY_REDACTED range
covered the end of the file.
The block was never written to the send stream so the receiver kept
zeros there. ZFS reported no error because the stream itself was
valid; the data was simply absent.
Fix: use dn_maxblkid + 1 so file_max is consistently exclusive.
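
The bound mismatch is easy to reproduce outside ZFS. The sketch below
is a simplified model, not the send code: blocks_to_send() and its
`fixed` flag are hypothetical, and only the names dn_maxblkid and
end_blkid mirror the description above.

```python
def blocks_to_send(dn_maxblkid, start_blkid, end_blkid, fixed=True):
    """Return the block IDs visited by an exclusive-bound loop.

    dn_maxblkid: inclusive maximum block ID of the file.
    end_blkid:   exclusive end of the range (one past the last block).
    fixed:       use dn_maxblkid + 1 (new) or dn_maxblkid (old).
    """
    last = dn_maxblkid + 1 if fixed else dn_maxblkid
    file_max = min(last, end_blkid)
    # file_max is used as an exclusive upper bound.
    return list(range(start_blkid, file_max))
```

For a four-block file (dn_maxblkid = 3) and a range that extends past
the end of the file, the old bound visits blocks 0 through 2 and skips
block 3, while the corrected bound visits all four.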
Add a regression test (redacted_max_blkid.ksh) that modifies only the
last block of a file in one clone, creates a redaction bookmark from
it, then sends an unmodified clone incrementally from that bookmark.
[7 lines not shown]
Linux 7.1: access dentry d_alias directly
The d_u union introduced in 3.18 is now anonymous, so we need to detect
it and decide the right way to name d_alias.
Note that we used to have support for both names to support kernels
before 3.18, so this commit is effectively reverting the commit that
removed that support, efc293e371.
Sponsored-by: TrueNAS
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18471
ZTS: add libzfs_mnttab_cache test
This is the repro test from #18464, and confirms that when disabled, the
libzfs_mnttab_cache is discarded and reloaded on every lookup.
Sponsored-by: TrueNAS
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Co-authored-by: Prakash Surya <prakash.surya at perforce.com>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18466
Closes #18464
libzfs/mnttab: restore ability to enable/disable cache
In #18296 we made the cache "always on", with the justification that our
internal tools always enable the cache anyway. This allowed removing the
entire alternate implementation of libzfs_mnttab_find().
Unfortunately, it appears that there are still libzfs consumers out
there that were expecting to be able to disable the cache entirely, and
this broke some behaviour for them.
This commit restores the ability to enable or disable the cache (and
returns to "disabled" as the default, to preserve existing behaviour).
Fortunately there is no need for a whole second codepath; just a small
reorganisation to drop all cached entries each time.
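
The single-codepath idea can be sketched as a cache that throws its
entries away before every lookup while disabled. This is a toy model
with assumed names (MnttabCache, loader, loads), not the libzfs
implementation.

```python
class MnttabCache:
    """Toy mount-table cache with an enable/disable switch."""

    def __init__(self, loader, enabled=False):
        self.loader = loader    # hypothetical: returns {name: entry}
        self.enabled = enabled  # "disabled" is the default
        self.entries = None
        self.loads = 0          # reloads, counted for illustration

    def find(self, name):
        if not self.enabled:
            # Disabled: drop all cached entries on every lookup.
            self.entries = None
        if self.entries is None:
            self.entries = self.loader()
            self.loads += 1
        return self.entries.get(name)
```

Both modes share the same lookup code; the only difference is whether
the entries survive between calls.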
Sponsored-by: TrueNAS
Reviewed-by: Prakash Surya <prakash.surya at perforce.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
[4 lines not shown]
AUTHORS: add names of recent new contributors
"Speak, friend, and enter."
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #18475
libspl/mnttab: follow symlinks when resolving path via statx (#18469)
When the path argument to "zfs list -Ho name <path>" (or any caller of
zfs_path_to_zhandle()) is a symlink that crosses a mount boundary, the
wrong dataset is returned. Instead of returning the dataset that owns
the symlink's target, getextmntent() matches the dataset containing the
symlink itself.
For example, given two ZFS datasets "tank/ds1" and "tank/ds2", and a
symlink "/tank/ds1/link" pointing into "/tank/ds2":
$ sudo zfs list -Ho name /tank/ds1/link
tank/ds1
The expected (and previous) behavior is to return "tank/ds2", since the
symlink's target resides in that dataset.
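
The difference between inspecting a symlink and inspecting its target
can be shown in user space. The sketch below uses Python's os.stat()
and its follow_symlinks parameter on a Unix-like system, rather than
statx() and mount IDs. The helper names are made up for the example,
and it compares device and inode numbers instead of datasets.

```python
import os
import tempfile


def identity(path, follow):
    """Return (st_dev, st_ino), optionally following symlinks."""
    st = os.stat(path, follow_symlinks=follow)
    return (st.st_dev, st.st_ino)


def demo():
    """Return (matches_when_following, matches_when_not_following)."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        link = os.path.join(d, "link")
        os.mkdir(target)
        os.symlink(target, link)
        want = identity(target, follow=True)
        return (identity(link, follow=True) == want,
                identity(link, follow=False) == want)
```

Only the lookup that follows the symlink reports the target's
identity; the other one describes the link itself, which is analogous
to the wrong dataset being matched.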
The problem is in getextmntent(), in lib/libspl/os/linux/mnttab.c. That
function calls statx() on the caller-supplied path to obtain its mnt_id
[41 lines not shown]
build: use pax tar format for make dist
Automake's default tar formats (v7 pre-1.18, ustar since) impose path
length limits that drop several long test filenames from the release
tarball when `make dist` runs. Pax format has no such limit and is
read by GNU tar 1.14+ and libarchive/bsdtar.
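
The kind of limit involved can be demonstrated with Python's tarfile
module, which refuses an over-long member name in ustar format but
stores it in pax format. This only illustrates the format difference;
it is not how Automake builds the tarball, and can_store() is a helper
invented for the example.

```python
import io
import tarfile


def can_store(name, fmt):
    """Return True if `name` round-trips through an archive of `fmt`."""
    buf = io.BytesIO()
    info = tarfile.TarInfo(name=name)  # empty regular file
    try:
        with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
            tar.addfile(info)
    except ValueError:
        # tarfile raises ValueError when the name does not fit.
        return False
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getnames() == [name]


long_name = "x" * 150 + ".ksh"  # one component over 100 bytes
```

Here can_store(long_name, tarfile.USTAR_FORMAT) fails while
can_store(long_name, tarfile.PAX_FORMAT) succeeds, because pax records
the full path in an extended header.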
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes: #17276
Closes: #18465
include: Remove duplicate lzc_send_space prototype
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Ryan Moeller <ryan.moeller at klarasystems.com>
Closes #18463