Fix nonrot property being incorrectly unset (#17206)
When opening a vdev and setting the nonrot property, we used to wait for
each child to be opened before examining its nonrot property. When the
change was made to open vdevs asynchronously, we didn't move the nonrot
check out of the main loop. As a result, the nonrot property is almost
always set to false, regardless of the actual type of the underlying
disks. The fix is simply to move the nonrot check to a separate loop
after the taskq has been waited for.
Sponsored-by: Klara, Inc.
Sponsored-by: Eshtek, Inc.
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Fix dspace underflow bug
Since spa_dspace accounts only normal allocation class space,
spa_nonallocating_dspace should do the same. Otherwise we may get
negative overflow or respective assertion spa_update_dspace() if
removed special/dedup vdev is bigger than all normal class space.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Allan Jude <allan at klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17183
OpenZFS/src 9250403 — module/zfs zio.c dmu.c, tests/zfs-tests/tests/functional/gang_blocks gang_blocks.kshlib gang_blocks_redundant.ksh
Make ganging redundancy respect redundant_metadata property (#17073)
The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings. This applies to all data and metadata
blocks in zfs, with one exception: gang blocks. Gang blocks currently
just take the copies property of the IO being ganged and, if it's 1,
sets it to 2. This means that we always make at least two copies of a
gang header, which is good for resilience. However, if the users care
more about performance than resilience, their gang blocks will be even
more of a penalty than usual.
We add logic to calculate the number of gang headers copies directly,
and store it as a separate IO property. This is stored in the IO
properties and not calculated when we decide to gang because by that
point we may not have easy access to the relevant information about what
kind of block is being stored. We also check the redundant_metadata
property when doing so, and use that to decide whether to store an extra
copy of the gang headers, compared to the underlying blocks.
[6 lines not shown]
FDT dedup log sync -- remove incremental
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Authored-by: Don Brady <don.brady at klarasystems.com>
Authored-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17038
Add more DDT tests
The new Fast Dedup feature has a lot of moving parts, and only some of
them have tests. We have some tests for prefetch and quota, and a
generic ZAP shrinking test, but we don't have anything for the pruning
command or specific to DDT zap shrinking. Here we add a couple small new
tests for zpool ddtprune and DDT-specific ZAP shrinking.
Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Allan Jude <allan at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17049
Don't try to get mg of hole vdev in removal
Don't try to get mg of hole vdev in removal
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17080
Add kstats tracking gang allocations
Gang blocks have a significant impact on the long and short term
performance of a zpool, but there is not a lot of observability into
whether they're being used. This change adds gang-specific kstats to
ZFS, to better allow users to see whether ganging is happening.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17003
Expand fragmentation table to reflect larger possibile allocation sizes
When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.
We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan at klarasystems.com>
[2 lines not shown]
Don't try to get mg of hole vdev in removal
Don't try to get mg of hole vdev in removal
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17080
Add kstats tracking gang allocations
Gang blocks have a significant impact on the long and short term
performance of a zpool, but there is not a lot of observability into
whether they're being used. This change adds gang-specific kstats to
ZFS, to better allow users to see whether ganging is happening.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17003
Expand fragmentation table to reflect larger possibile allocation sizes
When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.
We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan at klarasystems.com>
[2 lines not shown]
Add missing kstats to dataset kstats
Reviewed-by: Tony Nguyen <tony.nguyen at delphix.com>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #16431
Give a better message from 'zpool get' with invalid pool name
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Don Brady <don.brady at klarasystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen at delphix.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15942
Give a better message from 'zpool get' with invalid pool name
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Don Brady <don.brady at klarasystems.com>
Reviewed-by: Tony Nguyen <tony.nguyen at delphix.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15942
Don't assert mg_initialized due to device addition race
During device removal stress tests, we noticed that we were tripping
the assertion that mg_initialized was true. After investigation, it was
determined that the mg in question was the embedded log metaslab
group for a newly added vdev; the normal mg had been initialized (by
metaslab_sync_reassess, via vdev_sync_done). However, because the spa
config alloc lock is not held as writer across both calls to
metaslab_sync_reassess, it is possible for an allocation to happen
between the two metaslab_groups being initialized. Because the metaslab
code doesn't check the group in question, just the vdev's main mg, it
is possible to get past the initial check in vdev_allocatable and
later fail due to the assertion.
We simply remove the assertions. We could also consider locking the
ALLOC lock around the reassess calls in vdev_sync_done, but that risks
deadlocks. We could check the actual target mg in vdev_allocatable,
but that risks racing with a passivation that comes in after that
check but before the assertion. We still won't be able to actually
[6 lines not shown]
Don't assert mg_initialized due to device addition race
During device removal stress tests, we noticed that we were tripping
the assertion that mg_initialized was true. After investigation, it was
determined that the mg in question was the embedded log metaslab
group for a newly added vdev; the normal mg had been initialized (by
metaslab_sync_reassess, via vdev_sync_done). However, because the spa
config alloc lock is not held as writer across both calls to
metaslab_sync_reassess, it is possible for an allocation to happen
between the two metaslab_groups being initialized. Because the metaslab
code doesn't check the group in question, just the vdev's main mg, it
is possible to get past the initial check in vdev_allocatable and
later fail due to the assertion.
We simply remove the assertions. We could also consider locking the
ALLOC lock around the reassess calls in vdev_sync_done, but that risks
deadlocks. We could check the actual target mg in vdev_allocatable,
but that risks racing with a passivation that comes in after that
check but before the assertion. We still won't be able to actually
[6 lines not shown]
Fix memory leak in zfs_setprocinit code
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain at gmail.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15508
OpenZFS/src d38f466 — tests/zfs-tests/tests/functional/cli_root/zpool_trim zpool_trim_partial.ksh zpool_trim_verify_trimmed.ksh, tests/zfs-tests/tests/functional/trim autotrim_config.ksh autotrim_integrity.ksh
Reduce trim min size even lower for tests to reduce flakiness
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15315
OpenZFS/src 99dc1fc — tests/zfs-tests/tests/functional/block_cloning block_cloning_copyfilerange_fallback_same_txg.ksh
ZTS: Fix introduced test bug in block_cloning_copyfilerange
Reviewed-by: John Wren Kennedy <john.kennedy at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15316
OpenZFS/src 8526b12 — tests/zfs-tests/tests/functional/block_cloning block_cloning_copyfilerange_fallback_same_txg.ksh
Set timeout before creating pool in test
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <rob.norris at klarasystems.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15309
Don't allocate from new metaslabs
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15307
Closes #15308
Invoke zdb by guid to avoid import errors
The problem that was occurring is basically that a device was removed
by ztest and replaced with another device. It was then reguided. The
import then failed because there were two possible imports with the
same name; one with the new guid, and one with the old. This can
happen because the label writes from the device removal/replacement
can be subject to ztest's error injection.
The other ways to fix this would be to change the error injection to
not trigger on removals (which may not be technically feasible), or
to change the import code to not report configurations that are so
short on devices (which would potentially have unpleasant end-user
effects when trying to recover from data losses/device configuration
issues).
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens at delphix.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15298
Don't allocate from new metaslabs
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15307
Closes #15308
OpenZFS/src ec99448 — tests/zfs-tests/tests/functional/cli_root/zpool_trim zpool_trim_partial.ksh zpool_trim_verify_trimmed.ksh, tests/zfs-tests/tests/functional/trim autotrim_config.ksh autotrim_integrity.ksh
Reduce trim min size even lower for tests to reduce flakiness
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15315
OpenZFS/src a59e294 — tests/zfs-tests/tests/functional/block_cloning block_cloning_copyfilerange_fallback_same_txg.ksh
ZTS: Fix introduced test bug in block_cloning_copyfilerange
Reviewed-by: John Wren Kennedy <john.kennedy at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15316
OpenZFS/src 8a9ba76 — tests/zfs-tests/tests/functional/block_cloning block_cloning_copyfilerange_fallback_same_txg.ksh
Set timeout before creating pool in test
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <rob.norris at klarasystems.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15309
Fix occasional rsend test crashes
We have occasional crashes in the rsend tests. Debugging revealed
that this is because the send_worker thread is getting EINTR from
splice(). This happens when a non-fatal signal is received during
the syscall. We should retry the syscall, rather than exiting failure.
Tweak the loop to only break if the splice is finished or we receive
a non-EINTR error.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli at nabijaczleweli.xyz>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15273
Fix incorrect expected error in ztest
There is an occasional ztest failure that looks like ztest: attach
(/var/tmp/zloop-run/ztest.13a 570425344, draid1-1-0 532152320, 1)
returned 22, expected 95. This is because the value that we return
is EINVAL, but expected_error is set incorrectly.
Change the expected_error value to match both the comment and the
actual error value.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15295
Fix l2arc_apply_transforms ztest crash
In #13375 we modified the allocation size of the buffer that we use
to apply l2arc transforms to be the size of the arc hdr we're using,
rather than the allocation size that will be in place on the disk,
because sometimes the hdr size is larger. Unfortunately, sometimes
the allocation size is larger, which means that we overflow the buffer
in that case. This change modifies the allocation to be the max of
the two values
Reviewed-by: Mark Maybee <mark.maybee at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15177
Closes #15248
Invoke zdb by guid to avoid import errors
The problem that was occurring is basically that a device was removed
by ztest and replaced with another device. It was then reguided. The
import then failed because there were two possible imports with the
same name; one with the new guid, and one with the old. This can
happen because the label writes from the device removal/replacement
can be subject to ztest's error injection.
The other ways to fix this would be to change the error injection to
not trigger on removals (which may not be technically feasible), or
to change the import code to not report configurations that are so
short on devices (which would potentially have unpleasant end-user
effects when trying to recover from data losses/device configuration
issues).
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens at delphix.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #15298