vdev_file: unify FreeBSD and Linux implementations (#17046)
Kernel & userspace specifics are in zfs_file_os.c, so there's no
particular reason these have to be separate.
The one platform-specific part is in the Linux kernel part, to offload
flushes to a taskq if we're already inside a filesystem transaction.
This would be normally be an unsatisfying wart, but I'm intending to
remove this shortly, so I'm content to leave it gated for the moment.
Reviewed-by: Allan Jude <allan at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Fix metaslab group fragmentation math (#17037)
Since we are calculating a free space fragmentation, we should
weight metaslabs by the amount of their free space, not a full
size. Fragmentation of full metaslabs may not matter in presence
empty ones. The old algorithm did not differentiate metaslabs
having only one free 4KB block from metaslabs having 50% of space
free in 4KB blocks, reporting higher fragmentation.
While there, move metaslab_group_alloc_update() call after setting
mg_fragmentation, otherwise the effect may be delayed by one TXG.
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Reviewed-by: Tony Nguyen <tony.nguyen at delphix.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
range_tree: convert remaining range_* defs to zfs_range_*
Signed-off-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Linux 6.12 compat: Rename range_tree_* to zfs_range_tree_*
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols.
Signed-off-by: Ivan Volosyuk <Ivan.Volosyuk at gmail.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Free memory in an error path in spl-kmem-cache.c
skc->skc_name also needs to be freed in an error path.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Signed-off-by: Vandana Rungta <vrungta at amazon.com>
Closes #17041
Update the dataset name in handle after zfs_rename (#17040)
For zfs_rename, after the dataset name is successfully updated,
the dataset handle that was passed to zfs_rename, still contains
the old name, due to which, the dataset handle becomes invalid.
The following operations performed using this handle result in
error since the dataset with old name cannot be found anymore.
changelist_rename does update the names in dataset handles,
but those are temporary handles that were created during
changelist_gather. The original handle that was used to call
zfs_rename is not updated.
We should update the name in original ZFS handle after the IOCTL
for rename returns success for the operation.
Signed-off-by: Umer Saleem <usaleem at ixsystems.com>
Reviewed-by: Ameer Hamza <ahamza at ixsystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
zio: do no-op injections just before handing off to vdevs
The purpose of no-op is to simulate a failure between a device cache and
its permanent store. We still want it to go through the queue and
respond in the same way to everything else.
So, inject "success" as the very last thing, and then move on to
VDEV_IO_DONE to be dequeued and so any followup work can occur.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Closes #17029
Fix "make install" with DESTDIR set (#16995)
"DESTDIR=/path/to/target/root/ make install" may fail when installing to
a root that contains an existing lib/modules structure. When run as root
we may even affect the wrong kernel (the build system's one, or, if
running a different version, some other directory in /lib/modules, but
not the desired one installed in DESTDIR).
Add a missing reference to the INSTALL_MOD_PATH root when calling
"depmod" during "make install"
Also add a switch "DONT_DELETE_MODULES_FILES=1" that skips the removal
of files named "modules.*" prior to running depmod.
Signed-off-by: Christian Kohlschütter <christian at kohlschutter.com>
Closes #16994
Reviewed-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
optimize recv_fix_encryption_hierarchy()
recv_fix_encryption_hierarchy() in its present state goes through all
stream filesystems, and for each one traverses the snapshots in order to
find one that exists locally. This happens by calling guid_to_name() for
each snapshot, which iterates through all children of the filesystem.
This results in CPU utilization of 100% for several minutes (for ~1000
filesystems on a Ryzen 4350G) for 1 thread at the end of a raw receive
(-w, regardless whether encrypted or not, dryrun or not).
Fix this by following a different logic: using the top_fs name, call
gather_nvlist() to gather the nvlists for all local filesystems. For
each one filesystem, go through the snapshots to find the corresponding
stream's filesystem (since we know the snapshots guid and can search
with it in stream_avl for the stream's fs). Then go on to fix the
encryption roots and locations as in its present state.
Avoiding guid_to_name() iteratively makes
recv_fix_encryption_hierarchy() significantly faster (from several
[14 lines not shown]
Add kstats tracking gang allocations
Gang blocks have a significant impact on the long and short term
performance of a zpool, but there is not a lot of observability into
whether they're being used. This change adds gang-specific kstats to
ZFS, to better allow users to see whether ganging is happening.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17003
Expand fragmentation table to reflect larger possibile allocation sizes
When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.
We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan at klarasystems.com>
[2 lines not shown]
Fix typos in zpool_do_scrub() error messages (#17028)
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Mateusz Piotrowski <0mp at FreeBSD.org>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: George Amanakis <gamanakis at gmail.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Clarify documentation of `zfs destroy` on snapshots (#17021)
The current documentation of `zfs destroy` in application to snapshots
is particularly difficult to understand. The following changes are made:
- Remove circular reference to `zfs destroy` in the documentation of
that command.
- Remove use of "for example", which implies there are more,
undocumented reasons that ZFS may fail to destroy a snapshot
immediately.
- Mention properties `defer_destroy` and `userrefs`.
- Add `zfsprops(8)` to "SEE ALSO" list.
- Clarify meaning of `-d` option.
Requires-builders: none
Signed-off-by: mnrx <83848843+mnrx at users.noreply.github.com>
[3 lines not shown]
Linux 6.14: BLK_MQ_F_SHOULD_MERGE was removed
According to the upstream change, all callers set it, and all block
devices either honoured it or ignored it, so removing it entirely allows
a bunch of handling for the "unset" case to be removed, and it becomes
effectively implied.
We follow suit, and keep setting it for older kernels.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Linux 6.14: dops->d_revalidate now takes four args
This is a convenience for filesystems that need the inode of their
parent or their own name, as its often complicated to get that
information. We don't need those things, so this is just detecting which
prototype is expected and adjusting our callback to match.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
zio: lock parent zios when updating wait counts on reexecute
As zios are reexecuted after resume from suspension, their ready and
wait states need to be propagated to wait counts on all their parents.
It's possible for those parents to have active children passing through
READY or DONE, which then end up in zio_notify_parent(), take their
parent's lock, and decrement the wait count. Without also taking a lock
here, it's possible for an increment race to occur, which leads to
either there being no references left (tripping the assert in
zio_notify_parent()), or a parent waiting forever for a nonexistent
child to complete.
To protect against this, we simply take the appropriate zio locks in
zio_reexecute() before updating the wait counts.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
[3 lines not shown]
Avoid ARC buffer transfrom operations in prefetch
This change will prevent prefetch to perform unnecessary ARC buffer
fill when reading from disk.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Jaydeep Kshirsagar <jkshirsagar at maxlinear.com>
Co-authored-by: Alexander Motin <mav at FreeBSD.org>
Closes #17013
Add recursive dataset mounting and unmounting support to pam_zfs_key (#16857)
Introduced functionality to recursively mount datasets with a new
config option `mount_recursively`. Adjusted existing functions to
handle the recursive behavior and added tests to validate the feature.
This enhances support for managing hierarchical ZFS datasets within
a PAM context.
Signed-off-by: Jerzy Kołosowski <jerzy at kolosowscy.pl>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Update pin_user_pages() calls for Direct I/O
Originally #16856 updated Linux Direct I/O requests to use the new
pin_user_pages API. However, it was an oversight that this PR only
handled iov_iter's of type ITER_IOVEC and ITER_UBUF. Other iov_iter
types may try and use the pin_user_pages API if it is available. This
can lead to panics as the iov_iter is not being iterated over correctly
in zfs_uio_pin_user_pages().
Unfortunately, generic iov_iter API's that call pin_user_page_fast() are
protected as GPL only. Rather than update zfs_uio_pin_user_pages() to
account for all iov_iter types, we can simply just call
zfs_uio_get_dio_page_iov_iter() if the iov_iter type is not ITER_IOVEC
or ITER_UBUF. zfs_uio_get_dio_page_iov_iter() calls the
iov_iter_get_pages() calls that can handle any iov_iter type.
In the future it might be worth using the exposed iov_iter iterator
functions that are included in the header iov_iter.h since v6.7. These
functions allow for any iov_iter type to be iterated over and advanced
[12 lines not shown]
Make the vfs.zfs.vdev.raidz_impl sysctl cross-platform
Reviewed-by: Allan Jude <allan at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Alan Somers <asomers at gmail.com>
Sponsored by: ConnectWise
Closes #16980
FreeBSD: Add setting of the VFCF_FILEREV flag
The flag VFCF_FILEREV was recently defined in FreeBSD
so that a file system could indicate that it increments
va_filerev by one for each change.
Since ZFS does do this, set the flag if defined for the
kernel being built. This allows the NFSv4.2 server to
reply with the correct change_attr_type attribute value.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rick Macklem <rmacklem at uoguelph.ca>
Closed #16976
zinject: add "probe" device injection type
Injecting a device probe failure is not possible by matching IO types,
because probe IO goes to the label regions, which is explicitly excluded
from injection. Even if it were possible, it would be awkward to do,
because a probe is sequence of reads and writes.
This commit adds a new IO "type" to match for injection, which looks for
the ZIO_FLAG_PROBE flag instead. Any probe IO will be match the
injection record and recieve the wanted error.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Closes #16947
zinject: make iotype extendable
I'm about to add a new "type", and I need somewhere to put it!
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Closes #16947
ZTS: remove get_arcstat
It's now a simple wrapper, so lets just call kstat direct.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
ZTS: update existing kstat users to new helper
Removes other custom helpers and direct accesses to /proc.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
ZTS: reimplement kstat helper function
The old kstat helper function was barely used, I suspect in part because
it was very limited in the kinds of kstats it could gather.
This adds new functions to replace it, for each kind of thing that can
have stats: global, pool and dataset. There's options in there to get a
single stat value, or all values within a group.
Most importantly, the interface is the same for both platforms.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Fix several typos in the man pages
Reviewed-by: George Amanakis <gamanakis at gmail.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Tim Smith <tsmith84 at gmail.com>
Closes #16965
zfs-destroy.8: Fix minor formatting typo
The warning at the end of the second example in the description section
was actually inside the options table. Move the El macro to match what
is done in the first section for improved readability.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Alexander Ziaee <ziaee at FreeBSD.org>
Closes #16962
Update RELEASES.md LTS release to 2.2
2.3.0 is out now, so make 2.2.x the LTS release.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Closes #16945
Closes #16948
style: remove unnecessary spaces in sa.h
Removed three unnecessary spaces in the definition of the
sa_attr_reg_t structure to improve code style consistency
and adhere to OpenZFS coding standards.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Peng Liu <littlenewton6 at gmail.com>
Closes #16955