Implement allocation size ranges and use for gang leaves (#17111)
When forced to resort to ganging, ZFS currently allocates three child
blocks, each one third of the size of the original. This is true
regardless of whether larger allocations could be made, which would
allow us to have fewer gang leaves. This change improves performance when
fragmentation is high enough to require ganging, but not so high that
all the free ranges are only just big enough to hold a third of the
recordsize. This is also useful for improving the behavior of a future
change to allow larger gang headers.
We add the ability for the allocation codepath to allocate a range of
sizes instead of a single fixed size. We then use this to pre-allocate
the DVAs for the gang children. If those allocations fail, we fall back
to the normal write path, which will likely re-gang.
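As a rough illustration of why a size range helps, here is a toy model rather than the ZFS allocator. Every name in it (`toy_pool_t`, `toy_alloc_range`, `toy_gang_leaves`) is hypothetical, and free space is modelled as a plain array of contiguous segment sizes. Each leaf takes as much as the largest suitable segment allows, up to the bytes still outstanding:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model: each entry is the size of one free segment. */
typedef struct {
	uint64_t *segs;
	size_t nsegs;
} toy_pool_t;

/*
 * Allocate between min and max bytes from a single free segment.
 * Returns the size actually allocated, or 0 if no segment can hold min.
 */
static uint64_t
toy_alloc_range(toy_pool_t *p, uint64_t min, uint64_t max)
{
	size_t best = p->nsegs;

	/* Pick the largest segment that satisfies the minimum. */
	for (size_t i = 0; i < p->nsegs; i++) {
		if (p->segs[i] >= min &&
		    (best == p->nsegs || p->segs[i] > p->segs[best]))
			best = i;
	}
	if (best == p->nsegs)
		return (0);

	uint64_t got = p->segs[best] < max ? p->segs[best] : max;
	p->segs[best] -= got;
	return (got);
}

/*
 * Count the leaves needed to store size bytes when each leaf may be
 * anywhere from min up to the bytes still outstanding.
 * Returns -1 if an allocation fails (the caller would fall back).
 */
static int
toy_gang_leaves(toy_pool_t *p, uint64_t size, uint64_t min)
{
	int leaves = 0;

	while (size > 0) {
		uint64_t lo = size < min ? size : min;
		uint64_t got = toy_alloc_range(p, lo, size);
		if (got == 0)
			return (-1);
		size -= got;
		leaves++;
	}
	return (leaves);
}
```

With free segments of 96K, 40K, 40K and 40K, a 128K block needs two leaves in this model (96K + 32K), where fixed thirds would always use three.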
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
txg: generalise txg_wait_synced_sig() to txg_wait_synced_flags() (#17284)
txg_wait_synced_sig() is "wait for txg, unless a signal arrives". We
expect that future development will require similar "wait unless X"
behaviour.
This generalises the API as txg_wait_synced_flags(), where the provided
flags describe the events that should cause the call to return.
Instead of a boolean, the return is now an error code, which the caller
can use to know which event caused the call to return.
The existing call to txg_wait_synced_sig() is now
txg_wait_synced_flags(TXG_WAIT_SIGNAL).
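The "wait unless X" shape can be sketched in plain C as below. This is a userland mock, not the kernel code: `toy_txg_state_t`, `toy_wait_flag_t` and `toy_wait_synced_flags()` are hypothetical names, the "sleep" is simulated by advancing a counter, and returning EINTR for the signal case is this sketch's assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag set: each bit names an event that ends the wait early. */
typedef enum {
	TOY_WAIT_NONE	= 0,
	TOY_WAIT_SIGNAL	= 1 << 0
} toy_wait_flag_t;

/* Hypothetical stand-in for the pool's sync state. */
typedef struct {
	uint64_t synced_txg;	/* last txg known to be synced */
	bool signal_pending;	/* pretend a signal has arrived */
} toy_txg_state_t;

/*
 * Wait until txg is synced, unless an event selected by flags occurs.
 * Returns 0 once synced, or an error code naming the event that
 * interrupted the wait.
 */
static int
toy_wait_synced_flags(toy_txg_state_t *st, uint64_t txg,
    toy_wait_flag_t flags)
{
	while (st->synced_txg < txg) {
		if ((flags & TOY_WAIT_SIGNAL) && st->signal_pending)
			return (EINTR);
		/* Stand-in for sleeping until the next txg syncs. */
		st->synced_txg++;
	}
	return (0);
}
```

A caller passing `TOY_WAIT_SIGNAL` checks the return value to tell "synced" from "interrupted"; a caller passing `TOY_WAIT_NONE` always waits for the sync. New "unless X" conditions become new flag bits and error codes rather than new functions.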
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Allan Jude <allan at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
ZTS: Restore some delays in online_offline tests
After more CI runs and more code reading following #17259, I've found
that online starts the resilver via an async mechanism, which does not
provide wait primitives at this time. Restore some delays to get CI
passing again until this is properly fixed.
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Fix race between resilver wait and offline/detach
We should not clear scn_state and notify waiters until we call
vdev_dtl_reassess(); otherwise a subsequent offline/detach request
may fail with "no valid replicas".
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
tests: fix `S_IFMT` undeclared at `statx.c`
`S_IFMT` is declared in `sys/stat.h`, but we cannot include this header
because it redeclares the `statx` function with different argument
types. Therefore, we define `S_IFMT` ourselves, in the same way as the
other definitions.
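A minimal sketch of a guarded local definition follows. The octal values are the conventional Linux ones and are an assumption here; the actual patch may spell its definitions differently. `toy_mode_is_dir()` is a hypothetical helper added only to show the mask in use.

```c
/*
 * Local fallbacks for when <sys/stat.h> cannot be included.
 * Assumption: the conventional Linux octal values.
 */
#ifndef S_IFMT
#define	S_IFMT	00170000	/* mask selecting the file-type bits */
#endif
#ifndef S_IFDIR
#define	S_IFDIR	0040000		/* directory */
#endif
#ifndef S_IFREG
#define	S_IFREG	0100000		/* regular file */
#endif

/* Hypothetical helper: nonzero if mode describes a directory. */
static int
toy_mode_is_dir(unsigned int mode)
{
	return ((mode & S_IFMT) == S_IFDIR);
}
```

The `#ifndef` guards keep the fallbacks from clashing if some other header already provides the macros.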
Reviewed-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: José Luis Salvador Rufo <salvador.joseluis at gmail.com>
Closes #17293
Closes #17294
ZTS: Stop zpool_status tests from spamming stdout (#17292)
zpool_status_003 and zpool_status_004_pos use 'dd' to trigger a read of
a file without specifying 'of=/dev/null'. This spams the ZTS logs
with ~20MB of garbage data. This commit adds 'of=/dev/null'.
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Fix double spares for failed vdev
It's possible for two spares to get attached to a single failed vdev.
This happens when you have a failed disk that is spared, and then you
replace the failed disk with a new disk, but during the resilver
the new disk fails, and ZED kicks in a spare for the failed new
disk. This commit checks for that condition and disallows it.
Reviewed-by: Akash B <akash-b at hpe.com>
Reviewed-by: Ameer Hamza <ahamza at ixsystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes: #16547
Closes: #17231
ZTS: Fix replacement/resilver_restart_001 on FreeBSD
Decrease the RESILVER_MIN_TIME_MS variable from 50 to 20, so that the
test, which expects two resilver starts, will see them.
Logfile of the seen failures before this fix:
log: NOTE: expected 2 resilver start(s) after offline/online, found 1
log: expected 2 resilver start(s) after offline/online, found 1
The test time also decreases, from around 00:42 to 00:24.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs at mcmilk.de>
Closes #16822
Closes #17279
Sort the blocking snapshots list #12751 (#17264)
When multiple snapshots prevent the destruction/rollback of the
respective dataset/snapshot/volume via zfs destroy or zfs rollback,
the error message does not list the blocking snapshots in their order
of creation. This is inconvenient, can lead to confusion, and is
inconsistent with the output of zfs list -t snap.
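Ordering by creation can be sketched with `qsort()` as below. The `toy_snap_t` record and its `createtxg` key are hypothetical and stand in for whatever creation-order property the real code compares.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one blocking snapshot. */
typedef struct {
	const char *name;
	uint64_t createtxg;	/* assumed creation-order key */
} toy_snap_t;

/* qsort() comparator: oldest snapshot first. */
static int
toy_snap_cmp(const void *a, const void *b)
{
	const toy_snap_t *sa = a;
	const toy_snap_t *sb = b;

	/* Compare explicitly; subtracting uint64_t values could overflow. */
	if (sa->createtxg < sb->createtxg)
		return (-1);
	if (sa->createtxg > sb->createtxg)
		return (1);
	return (0);
}

/* Sort the list in place before it is printed in the error message. */
static void
toy_sort_snaps(toy_snap_t *snaps, size_t n)
{
	qsort(snaps, n, sizeof (snaps[0]), toy_snap_cmp);
}
```

Sorting on the creation key rather than the name keeps the list in the same order a reader would see from a creation-ordered snapshot listing.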
Closes: #12751
Signed-off-by: Artem-OSSRevival <artem.vlasenko at ossrevival.org>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Double quote variables to prevent globbing and word splitting
This change goes through and quotes variables where appropriate to
avoid issues with incorrect splitting. The performance tests ran into
an issue with $SUDO_COMMAND splitting incorrectly because it was not
quoted. This change fixes that issue and hopefully gets ahead of any
other similar problems.
Reviewed by: John Wren Kennedy <jwk404 at gmail.com>
Reviewed-by: Tony Nguyen <tony.nguyen at delphix.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Aleksandr Liber <aleksandr.liber at perforce.com>
Closes #17235
RPM: Hold back incompatible kernel packages on Fedora
A user reported that when you upgrade your kernel packages on Fedora
with ZFS installed, only the kernel-devel package gets held back to the
ZFS-supported version, but not the other kernel packages. So if ZFS only
supports the 6.13 kernel, Fedora will still happily upgrade the kernel
RPM to 6.14, but hold back kernel-devel at 6.13, for example.
This commit includes version checks for the 'kernel-uname-r' dependency,
typically provided by the 'kernel-core' package.
Original-patch-by: @jkool702
Reviewed-by: @jkool702
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #17265
Closes #17271
cred: properly pass and test creds on other threads (#17273)
### Background
Various admin operations will be invoked by some userspace task, but the
work will be done on a separate kernel thread at a later time. Snapshots
are an example, which are triggered through zfs_ioc_snapshot() ->
dsl_dataset_snapshot(), but the actual work is from a task dispatched to
dp_sync_taskq.
Many such tasks end up in dsl_enforce_ds_ss_limits(), where various
limits and permissions are enforced. Among other things, it is necessary
to ensure that the invoking task (that is, the user) has permission to
do things. We can't simply check if the running task has permission; it
is a privileged kernel thread, which can do anything.
However, in the general case it's not safe to simply query the task for
its permissions at the check time, as the task may not exist any more,
or its permissions may have changed since it was first invoked. So
[38 lines not shown]
ZTS: Optimize KSM on Linux and remove it for FreeBSD
Don't use KSM on the FreeBSD VMs and optimize KSM settings for
Linux to have faster run times.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs at mcmilk.de>
Closes #17247
zfs-rollback.8: fix typo in example number
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Alexander Ziaee <ziaee at FreeBSD.org>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Quentin Thébault <quentin.thebault at defenso.fr>
Closes #17282
ZTS: Use Ubuntu default url for cloud-image
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs at mcmilk.de>
Closes #17278
ZTS: Make zvol_stress write some more
Sometimes it fails because it sees no injected write errors.
I guess writing 25KB of zeroes might not be enough to trigger
errors with the probability set to 10%. Let's try to write more.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17270
ZTS: Reduce extra caching in pool_checkpoint (#17268)
These tests are write-mostly on the nested pool. Considering we have
3 more layers of caching underneath, we can hint ZFS how to use the
memory better by setting primarycache=metadata.
While there, add a missing zpool sync after the rm in
checkpoint_capacity. Without the sync we could not yet see the freed
space, even if there were no pool checkpoint.
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Support using llvm-libunwind
This commit adds support for using llvm-libunwind for kernels built
using llvm and clang. The differences are that the largest register
index is given by _LIBUNWIND_HIGHEST_DWARF_REGISTER, that we need to
check whether the register is a floating point register, and that the
prototype for unw_regname takes the unwind cursor as the first argument.
Reviewed-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Sebastian Pauka <me at spauka.se>
Closes #17230
Export correct symbols for Lustre Direct I/O
Originally the Lustre ZFS OSD code was going to use zfs_uio_t structs
for supporting Direct I/O with ZFS. However, this has changed to using
abd_t structs instead. This exports the proper symbols that will be used
by the Lustre ZFS OSD code.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Brian Atkinson <batkinson at lanl.gov>
Closes #17256
Add more descriptive destroy error message
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Reviewed by: Attila Fülöp <attila at fueloep.org>
Signed-off-by: Artem-OSSRevival <artem.vlasenko at ossrevival.org>
Fixes: #14538
Closes: #17234
ZTS: Fix 256MB file leak in zed_cksum_reported
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes: #17267
ZTS: Remove fixed sleeps from slog_006_pos
Replace `sleep 15` with `zpool wait`, which should take much less
than 15 seconds. Considering it is called 16 times, this should
save us up to 4 minutes in total.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes: #17257
ZTS: Polish online_offline tests
- Kill workload first for faster cleanup.
- Use `zpool wait` for resilver instead of `sleep`.
- Remove irrelevant workload from `online_offline_003_neg`.
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes: #17259
CI: Add Fedora 42 runner (#17249)
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Handle interaction between gang blocks, copies, and FDT.
With the advent of fast dedup, there are no longer separate dedup tables
for different copies values. There is now logic that will add DVAs to
the dedup table entry if more copies are needed for new writes. However,
this interacts poorly with ganging. There are two different cases that
can result in mixed gang/non-gang BPs, which are illegal in ZFS.
This change modifies how existing FDT entries are updated. If there are
already gang DVAs in the FDT, we prevent the new write from extending
the DDT entry, because we cannot safely mix different gang trees in one
block pointer. If there are non-gang DVAs in the FDT, then this
allocation may not gang. If it would gang, we have to redo the whole
write as a non-dedup write.
This change also fixes a refcount leak that could occur if the lead DDT
write failed.
Sponsored by: iXsystems, Inc.
[4 lines not shown]
ZTS: Remove ashift setting from dedup_quota test (#17250)
The test writes 1M of 1KB blocks, which may produce up to 1GB of
dirty data. On top of that ashift=12 likely produces additional
4GB of ZIO buffers during sync process. On top of that we likely
need some page cache since the pool resides on files. And finally
we need to cache the DDT. It is not surprising that the test regularly
ends up in OOMs, possibly depending on TXG size variations.
Also replace fio, with its pretty strange parameter set, with a set of
dd writes and TXG commits, which is all we need here.
While here, remove compression. It does nothing useful here except
waste CI CPU time.
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
nvlist: Add nvlist_snprintf() and zfs_dbgmsg_nvlist()
Add nvlist_snprintf() to print a nvlist to a buffer. This is basically
the snprintf() version of dump_nvlist(). Along with that, add a
zfs_dbgmsg_nvlist() to print out an nvlist to dbgmsg. This will aid in
debugging.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #17215