Add Quality Assurance to pull request template
PRs like #17352 have no applicable checkbox, so let us add one.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Richard Yao <richard at ryao.dev>
Closes #17354
dmu_objset_hold_flags() should call dsl_dataset_rele_flags() on error
This was caught when doing a manual check to see if #17352 needed to be
improved to catch mismatches across stack frames of the kind that were
first found in #17340.
Reviewed-by: George Amanakis <gamanakis at gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Richard Yao <richard at ryao.dev>
Closes #17353
arcstat: prevent ZeroDivisionError when L2ARC becomes empty
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Richard Yao <richard at ryao.dev>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #17348
Linux 6.2/6.15: del_timer_sync() renamed to timer_delete_sync()
Renamed in 6.2, and the compat wrapper removed in 6.15. No signature or
functional change apart from that, so a very minimal update for us.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa at snajpa.net>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #17229
Linux 6.15: mkdir now returns struct dentry *
The intent is that the filesystem may have a reference to an "old"
version of the new directory, eg if it was keeping it alive because a
remote NFS client still had it open.
We don't need anything like that, so this really just changes things so
we return error codes encoded in pointers.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa at snajpa.net>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #17229
icp: Use explicit_memset() exclusively in gcm_clear_ctx()
d634d20d1be31dfa8cf06ef2dc96285baf81a2fb had been intended to fix a
potential information leak issue where the compiler's optimization
passes appeared to remove `memset()` operations that sanitize sensitive
data before memory is freed for use by the rest of the kernel.
When I wrote it, I had assumed that the compiler would not remove the
other `memset()` operations, but upon reflection, I have realized that
this was a bad assumption to make. I would rather have a very slight
amount of additional overhead when calling `gcm_clear_ctx()` than risk a
future compiler remove `memset()` calls. This is likely to happen if
someone decides to try doing link time optimization and the person will
not think to audit the assembly output for issues like this, so it is
best to preempt the possibility before it happens.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
[2 lines not shown]
Fix 2 bugs in non-raw send with encryption
Bisecting identified the redacted send/receive as the source of the bug
for issue #12014. Specifically the call to
dsl_dataset_hold_obj(&fromds) has been replaced by
dsl_dataset_hold_obj_flags() which passes a DECRYPT flag and creates
a key mapping. A subsequent dsl_dataset_rele_flag(&fromds) is missing
and the key mapping is not cleared. This may be inadvertedly used, which
results in arc_untransform failing with ECKSUM in:
arc_untransform+0x96/0xb0 [zfs]
dbuf_read_verify_dnode_crypt+0x196/0x350 [zfs]
dbuf_read+0x56/0x770 [zfs]
dmu_buf_hold_by_dnode+0x4a/0x80 [zfs]
zap_lockdir+0x87/0xf0 [zfs]
zap_lookup_norm+0x5c/0xd0 [zfs]
zap_lookup+0x16/0x20 [zfs]
zfs_get_zplprop+0x8d/0x1d0 [zfs]
setup_featureflags+0x267/0x2e0 [zfs]
dmu_send_impl+0xe7/0xcb0 [zfs]
[20 lines not shown]
Add explicit DMU_DIRECTIO checks
UIO_DIRECT means we can do Direct I/O, while DMU_DIRECTIO we want
to do it. First does not automatically means second. Add few
checks to not use Direct I/O in few cases we don't want it.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17342
Increase meta-dnode redundancy in "some" mode
Loss of one indirect block of the meta dnode likely means loss of
the whole dataset. It is worse than one file that the man page
promises, and in my opinion is not much better than "none" mode.
This change restores redundancy of the meta-dnode indirect blocks,
while same time still corrects expectations in the man page.
Reviewed-by: Akash B <akash-b at hpe.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17339
Cause zpool scan resume commands to get logged in history
Currently, commands that resume a scrub/errorscrub from a paused state
don't get logged in the pool history. This is because resumes actually
return ECANCELED, instead of 0. This causes the tsd code in the common
ioctl logic to not think the ioctl succeeded, which causes the
log_history ioctl to fail with EPERM. However, for resuming a scrub from
a paused state, ECANCELED is success.
There are two options for how to deal with this. The first is the one
that I implemented here; I can't find a good reason for dmu_scan to
return ECANCELED on resume instead of 0, so let's just not. The only
place we check for the ECANCELED value is in zpool_scan, where we just
convert it back to zero. However, I am aware that this is changing an
ioctl interface, which I believe is a breaking change. I don't think
it's an important change, but maybe there is someone who relies on it.
The other option that could be implemented is to either allow ECANCELED
specifically from dsl_scan in the common ioctl code, or add a generic
[10 lines not shown]
ARC: parallel eviction
On systems with enormous amounts of memory, the single arc_evict thread
can become a bottleneck if reads and writes are stuck behind it, waiting
for old data to be evicted before new data can take its place.
This commit adds support for evicting from multiple ARC lists in
parallel, by farming the evict work out to some number of threads and
then accumulating their results.
A new tuneable, zfs_arc_evict_threads, sets the number of threads. By
default, it will scale based on the number of CPUs.
Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Youzhong Yang <youzhong at gmail.com>
Signed-off-by: Allan Jude <allan at klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski at klarasystems.com>
[5 lines not shown]
ARC: Notify dbuf cache about target size reduction
ARC target size might drop significantly under memory pressure,
especially if current ARC size was much smaller than the target.
Since dbuf cache size is a fraction of the target ARC size, it
might need eviction too. Aside of memory from the dbuf eviction
itself, it might help ARC by making more buffers evictable.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Ameer Hamza <ahamza at ixsystems.com>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17314
Linux: Stop using NR_FILE_PAGES for ARC scaling
I've found that QEMU/KVM guest memory accounted as shared also
included into NR_FILE_PAGES. But it is actually a non-evictable
anonymous memory. Using it as a base for zfs_arc_pc_percent
parameter makes ARC to ignore shrinker requests while page cache
does not really have anything to evict, ending up in OOM killer
killing the QEMU process.
Instead use of NR_ACTIVE_FILE + NR_INACTIVE_FILE should represent
the part of a page cache that is actually evictable, which should
be safer to use as a reference for ARC scaling.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Ameer Hamza <ahamza at ixsystems.com>
Reviewed-by: Pavel Snajdr <snajpa at snajpa.net>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17334
runners: Add option to install custom kernel on Fedora
Allow installing a custom kernel version from the Fedora experimental
kernel repos onto the github runners. This is useful for testing if
ZFS works against a newer kernel.
Fedora has a number of repos with experimental kernel packages. This
PR allows installs from kernels in these repos:
@kernel-vanilla/stable
@kernel-vanilla/mainline
(https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories)
You will need to manually kick of a github runner to test with a custom
kernel version. To do that, go to the github actions tab under
'zfs-qemu' and click the drop-down for 'run workflow'. In there you
will see a text box to specify the version (like '6.14'). The scripts
will do their best to match the version to the newest matching version
that the repos support (since they're may be multiple nightly versions
[6 lines not shown]
Wire O_DIRECT also to Uncached I/O (#17218)
Before Direct I/O was implemented, I've implemented lighter version
I called Uncached I/O. It uses normal DMU/ARC data path with some
optimizations, but evicts data from caches as soon as possible and
reasonable. Originally I wired it only to a primarycache property,
but now completing the integration all the way up to the VFS.
While Direct I/O has the lowest possible memory bandwidth usage,
it also has a significant number of limitations. It require I/Os
to be page aligned, does not allow speculative prefetch, etc. The
Uncached I/O does not have those limitations, but instead require
additional memory copy, though still one less than regular cached
I/O. As such it should fill the gap in between. Considering this
I've disabled annoying EINVAL errors on misaligned requests, adding
a tunable for those who wants to test their applications.
To pass the information between the layers I had to change a number
of APIs. But as side effect upper layers can now control not only
[9 lines not shown]
vdev_id: symlinks creation for multipath disk partitions (#17331)
It has been observed that the symlinks are not being created
for the disk partitions on multipath enabled systems.
This fix addresses the issue.
Signed-off-by: Diwakar Kristappagari <diwakar-k at hpe.com>
Reviewed-by: Akash B <akash-b at hpe.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Ameer Hamza <ahamza at ixsystems.com>
update_authors: output possible mailmap additions
Once we've selected a best ident for the AUTHORS file, it makes sense to
set up a corresponding mailmap entry for any other ident for that
committer, to ensure the git history also reflects this into the future.
So, here we output potential mailmap updates for a human to consider.
For the moment, this needs to be done by a human, because update_authors
uses git to get the author names, and thus is reliant on the mailmap
contents to generate its output, so having it update mailmap directly
would introduce a circular dependency that I'm not totally sure about.
It's definitely better than having to go back through the history and
check each commit by hand though.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn at despairlabs.com>
update_authors: consider Signed-off-by trailers for committer idents
I've increasingly found that commits from new contributors have the
author set in the "Github noreply" obfuscated style. If they do have a
better canonical choice, it's usually in the Signed-off-by: trailer in
the commit message.
I had avoided using these in the first version of this program because
they aren't always present, aren't always correct, and some commits have
multiple signoffs. It seems however that requiring either the name or
the email address to match the commit author sufficiently narrows the
scope to be useful for the "Github noreply" situation, which is really
the main sticking point. And of course, if it gets it wrong, overriding
in .mailmap or AUTHORS is always an option.
Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn at despairlabs.com>
metaslab_alloc: make hint BP and DVA const (#17324)
Nothing modifies them, and nothing should, so lets try to enforce that.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Introduce zfs rewrite subcommand (#17246)
This allows to rewrite content of specified file(s) as-is without
modifications, but at a different location, compression, checksum,
dedup, copies and other parameter values. It is faster than read
plus write, since it does not require data copying to user-space.
It is also faster for sync=always datasets, since without data
modification it does not require ZIL writing. Also since it is
protected by normal range range locks, it can be done under any
other load. Also it does not affect file's modification time or
other properties.
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
test-runner: rework output dir construction
The old code would compare all the test group names to work out some
sort of common path, but it didn't appear to work consistently,
sometimes placing output in a top-level dir, other times in one or more
subdirs. (I confess, I do not quite understand what it's supposed to
do).
This is a very simple rework that simply looks at all the test group
paths, removes common leading components, and uses the remainder as the
output directory. This should work because groups paths are unique, and
means we get a output dir tree of roughly the same shape as the test
groups in the runfiles and the test source dirs themselves.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: @ImAwsumm
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Closes #17167
spa: clear checkpoint information during retry
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Mariusz Zaborski <oshogbo at FreeBSD.org>
Closes #17319
linux/uio: remove "skip" offset for UIO_ITER
For UIO_ITER, we are just wrapping a kernel iterator. It will take care
of its own offsets if necessary. We don't need to do anything, and if we
do try to do anything with it (like advancing the iterator by the skip
in zfs_uio_advance) we're just confusing the kernel iterator, ending up
at the wrong position or worse, off the end of the memory region.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson at lanl.gov>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #17298
More aggressively assert that db_mtx protects db.db_data
db.db_mtx must be held any time that db.db_data is accessed. All of
these functions do have the lock held by a parent; add assertions to
ensure that it stays that way.
See https://github.com/openzfs/zfs/discussions/17118
* Refactor dbuf_read_bonus to make it obvious why db_rwlock isn't
required.
* Refactor dbuf_hold_copy to eliminate the db_rwlock
Copy data into the newly allocated buffer before assigning it to the db.
That way, there will be no need to take db->db_rwlock.
* Refactor dbuf_read_hole
In the case of an indirect hole, initialize the newly allocated buffer
before assigning it to the dmu_buf_impl_t.
[2 lines not shown]
zvol: Enable zvol threading functionality on FreeBSD
Make zvol I/O requests processing asynchronous on FreeBSD side in some
cases. Clone zvol threading logic and required module parameters from
Linux side. Make zvol threadpool creation/destruction logic shared for
both Linux and FreeBSD.
The IO requests are processed asynchronously in next cases:
- volmode=geom: if IO request thread is geom thread or cannot sleep.
- volmode=cdev: if IO request passed thru struct cdevsw .d_strategy
routine, mean is AIO request.
In all other cases the IO requests are processed synchronously. The
volthreading zvol property is ignored on FreeBSD side.
Sponsored-by: vStack, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: @ImAwsumm
Signed-off-by: Fedor Uporov <fuporov.vstack at gmail.com>
Closes #17169
Delete dead code: dbuf_loan_arcbuf
It's been dead ever since 5fa356ea44240c188ad0e4f7b6af20c3f1b99257
Sponsored by: ConnectWise
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Rob Norris <robn at despairlabs.com>
Signed-off-by: Alan Somers <asomers at gmail.com>
Closes #17119
ioctl: remove FICLONE/FICLONERANGE/FIDEDUPERANGE compat
These are only required to support these ioctls on Linux <4.5. Since
4.18 is our cutoff, we don't need this code anymore.
Also removing related test things that will never match again.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #17308
FreeBSD: Use new SYSCTL_SIZEOF()
SYSCTL_SIZEOF() has been introduced in FreeBSD by commit "sysctl(9):
Ease exporting struct sizes; Discourage doing that" (713abc9880aa) in
branch 'main'. It will soon be backported to 'stable/14'. We will thus
be able to remove the old, alternate version left in the '#else' branch
as soon as 'stable/13' goes out of support (April 30, 2026).
Sponsored-by: The FreeBSD Foundation
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Signed-off-by: Olivier Certner <olce at FreeBSD.org>
Closes #17309
ARC: Avoid overflows in arc_evict_adj() (#17255)
With certain combinations of target ARC states balance and ghost
hit rates it was possible to get the fractions outside of allowed
range. This patch limits maximum balance adjustment speed, which
should make it impossible, and also asserts it.
Fixes #17210
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Reviewed-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>