Add support for FreeBSD's Solaris style extended attribute interface
FreeBSD commit 2ec2ba7e232d added the Solaris style syscall interface
for extended attributes. This patch wires this interface into the
FreeBSD ZFS port, since this style of extended attributes is supported
by OpenZFS internally when the "xattr" property is set to "dir".
Some specific changes:
LOOKUP_NAMED_ATTR is defined to indicate the need to set V_NAMEDATTR
for calls to zfs_zaccess().
V_NAMEDATTR indicates that the access checking does need to be done
for FreeBSD.
The access checking code for extended attributes was copy/pasted from
the Linux port into zfs_zaccess() in the FreeBSD port.
Most of the changes are in zfs_freebsd_lookup() and
zfs_freebsd_create().
The semantics of these functions should remain unchanged unless named
[8 lines not shown]
ZVOL: Return early, if volmode is ZFS_VOLMODE_NONE on FreeBSD side
Return from zvol_os_create_minor() function immediately after
dsl_prop_get_integer() call if volmode property value is set to
'none', like it is doing on Linux side.
Sponsored-by: vStack, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Fedor Uporov <fuporov.vstack at gmail.com>
Closes #17405
Add CodeQL mismatched dsl_dataset_hold/_rele pairs check
This check is currently limited to checking mismatches that occur in the
same stack frame. It does not detect across stack frames.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Richard Yao <richard at ryao.dev>
Closes #17352
BRT: Fix ZAP entry endianness
During original block cloning implementation a mistake was made,
making BRT ZAP entries an array of 8 1-byte entries instead of 1
entry of 8 bytes. This makes the pools non-endian-safe.
This commit introduces a new read-compatible pool feature
"com.truenas:block_cloning_endian", fixing the endianness issue
for new pools while maintaining compatibility with existing ones.
The feature is automatically activated when creating the first BRT
ZAP (ensuring we don't activate it on pools that already have BRT
entries in the old format). When active, BRT entries are stored
as single 8-byte values.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #17572
Faster checksum benchmark on system boot
While booting, only the needed 256KiB benchmarks are done now.
The delay for checking all checksums occurs when requested via:
- Linux: cat /proc/spl/kstat/zfs/chksum_bench
- FreeBSD: sysctl kstat.zfs.misc.chksum_bench
Reported by: Lahiru Gunathilake <gunathilakebllg at gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs at mcmilk.de>
Co-authored-by: Colin Percival <cperciva at tarsnap.com>
Closes #17563
Closes #17560
zpool/zfs: Add '-a|--all' option to scrub, trim, initialize
Add support for the '-a | --all' option to perform trim,
scrub, and initialize operations on all pools.
Previously, specifying a pool name was mandatory for
these operations. With this enhancement, users can now
execute these operations across all pools at once,
without needing to manually iterate over each pool
from the command line.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Signed-off-by: Akash B <akash-b at hpe.com>
Closes #17524
Don't use wrong weight when passivating group
When we're passivating a metaslab group we start by passivating the
metaslabs that have been activated for each of the allocators. To do
that, we need to provide a weight. However, currently this erroneously
always uses a segment-based weight, even if segment-based weighting is
disabled.
Use the normal weight function, which will decide which type of weight
to use.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17566
CI: Remove Debian backports
The latest Debian 11 image includes bullseye-backports as a default
repository in the /etc/apt/sources.list. However, this repository
has gone end of life which effectively breaks the default install.
We shouldn't need anything in backports so lets unconditionally
remove backports on all Debian builders to resolve the issue.
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #17569
Default to zfs_bclone_wait_dirty=1
Update the default FICLONE and FICLONERANGE ioctl behavior to wait
on dirty blocks. While this does remove some control from the
application, in practice ZFS is better positioned to the optimial
thing and immediately force a TXG sync.
Reviewed-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #17455
zdb: fix checksum calculation for decompressed blocks
Currently, when reading compressed blocks with -R and decompressing
them with :d option and specifying lsize, which is normally bigger
than psize for compressed blocks, the checksum is calculated on
decompressed data. But it makes no sense since zfs always calculates
checksum on physical, i.e. compressed data. So reading the same block
produces different checksum results depending on how we read it,
whether we decompress it or not, which, again, makes no sense.
Fix: use psize instead of lsize when calculating the checksum so that
it is always calculated on the physical block size, no matter was it
compressed or not.
Signed-off-by: Andriy Tkachuk <andriy.tkachuk at seagate.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #17547
ZED: Fix device type detection and pool iteration logic
During hotplug REMOVED events, devid matching fails for partition-based
spares because devid information is not stored in pool config for
partitioned devices. However, when devid is populated by the hotplug
event, the original code skipped the search logic entirely, skipping
vdev_guid matching and resulting in wrong device type detection that
caused spares to be incorrectly identified as l2arc devices.
Additionally, fix zfs_agent_iter_pool() to use the return value from
zfs_agent_iter_vdev() instead of relying on search parameters, which
was previously ignored. Also add pool_guid optimization to enable
targeted pool searching when pool_guid is available.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #17545
linux: Fix out-of-src builds
The linux kernel modules haven't been building successfully when the
build occurs in a separate directory than the source code, which is a
common build pattern in Linux. Was not able to determine the root cause,
but the %.o targets in subdirectories are no longer being matched by the
pattern targets in the Linux Kbuild system. This change fixes the issue
by dynamically creating the missing ones inside our Kbuild.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Coleman Kane <ckane at colemankane.org>
Closes #17517
spa: update blkptr diagram to include vdev padding on encrypted blocks
Probably just an oversight in 4d044c4c1d. SPA_VDEVBITS is always 24,
regardless of whether or not the bp is for an encrypted block, and it
wouldn't make sense for it to be different anyway.
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #17564
everywhere: misc unnecessary var init/update
These are all cases where we initialise or update a variable, and then
never use it. None of them particularly matter, as the compiler should
optimise them all away during dead store elimination, but some static
analysers complain about them and they are extra work for casual readers
to follow, so worth removing.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
vdev_raidz: asize/psize: remove unnecessary var initialisation
It would have been optimised away anyway so it doesn't matter, but it
does make things a little tougher to read.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
test/draid: fix error return
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
spa_activity_check: narrow scope of MMP vars
They aren't used outside these very small blocks, and their initial
values are never used at all.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
linux/kmem: remove HAVE_ATOMIC64_T and kmem_alloc_used wrappers
Seems like we haven't set it since the SPL was pulled into the main ZFS
tree. In removing the define, I've taken the 64-bit version (ie the one
that _hasn't_ been running since back then) because it looks like its
closer to the intended width by the way its used.
Since the macros ar eno longer needed as a selector, pull those too.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
linux/kmem: remove long-obsolete __GFP compat flags
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
linux/kmem: remove PF_FSTRANS and PF_MEMALLOC_NOIO compat
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Sponsored-by: https://despairlabs.com/sponsor/
Closes #17551
add uncompressed_size to arc_summary
Add uncompressed ARC size to statistics reported by arc_summary.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #17556
enforce arc_dnode_limit
Linux kernel shrinker in the context of null/root memcg does not scan
dentry and inode caches added by a task running in non-root memcg. For
ZFS this means that dnode cache routinely overflows, evicting valuable
meta/data and putting additional memory pressure on the system.
This patch restores zfs_prune_aliases as fallback when the kernel
shrinker does nothing, enabling zfs to actually free dnodes. Moreover,
it (indirectly) calls arc_evict when dnode_size > dnode_limit.
Reviewed-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #17487
Closes #17542
Allow and prefer special vdevs as ZIL
Before this change ZIL blocks were allocated only from normal or
SLOG vdevs. In typical situation when special vdevs are SSDs and
normal are HDDs it could cause weird inversions when data blocks
are written to SSDs, but ZIL referencing them to HDDs.
This change assumes that special vdevs typically have much better
(or at least not worse) latency than normal, and so in absence of
SLOGs should store ZIL blocks. It means similar to normal vdevs
introduction of special embedded log allocation class and updating
the allocation fallback order to: SLOG -> special embedded log ->
special -> normal embedded log -> normal.
The code tries to guess whether data block is going to be written
to normal or special vdev (it can not be done precisely before
compression) and prefer indirect writes for blocks written to a
special vdev to avoid double-write. For blocks that are going to
be written to normal vdev, special vdev by default plays as SLOG,
[12 lines not shown]
Define sops->free_inode() to prevent use-after-free during lookup
On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can
be dereferenced without refcounts and locks. For this reason, dentry and
inode must only be freed after RCU grace period.
However, zfs currently frees inode in zfs_inode_destroy synchronously
and we can't use GPL-only call_rcu() in zfs directly. Fortunately, on
Linux 5.2 and after, if we define sops->free_inode(), the kernel will do
call_rcu() for us.
This issue may be triggered more easily with init_on_free=1 boot
parameter:
BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:selinux_inode_permission+0x10e/0x1c0
Call Trace:
? show_trace_log_lvl+0x1be/0x2d9
? show_trace_log_lvl+0x1be/0x2d9
[27 lines not shown]
ZIL: Force writing of open LWB on suspend
Under parallel workloads ZIL may delay writes of open LWBs that
are not full enough. On suspend we do not expect anything new to
appear since zil_get_commit_list() will not let it pass, only
returning TXG number to wait for. But I suspect that waiting for
the TXG commit without having the last LWB issued may not wait for
its completion, resulting in panic described in #17509.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris at klarasystems.com>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #17521
Correct weight recalculation of space-based metaslabs
Currently, after a failed allocation, the metaslab code recalculates the
weight for a metaslab. However, for space-based metaslabs, it uses the
maximum free segment size instead of the normal weighting
algorithm. This is presumably because the normal metaslab weight is
(roughly) intended to estimate the size of the largest free segment, but
it doesn't do that reliably at most fragmentation levels. This means
that recalculated metaslabs are forced to a weight that isn't really
using the same units as the rest of them, resulting in undesirable
behaviors. We switch this to use the normal space-weighting function.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Closes #17531
Fix zdb pool/ with -k
When examining the root dataset with zdb -k, we get into a mismatched
state. main() knows we are not examining the whole pool, but it strips
off the trailing slash. import_checkpointed_state() then thinks we are
examining the whole pool, and does not update the target path
appropriately. The fix is to directly inform import_checkpointed_state
that we are examining a filesystem, and not the whole pool.
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Rob Norris <rob.norris at klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Co-authored-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17536
FreeBSD: zfs_putpages: don't undirty pages until after write completes
In syncing mode, zfs_putpages() would put the entire range of pages onto
the ZIL, then return VM_PAGER_OK for each page to the kernel. However,
an associated zil_commit() or txg sync had not happened at this point,
so the write may not actually be on disk.
So, we rework that case to use a ZIL commit callback, and do the
post-write work of undirtying the page and signaling completion there.
We return VM_PAGER_PEND to the kernel instead so it knows that we will
take care of it.
The original version of this (238eab7dc1) copied the Linux model and did
the cleanup in a ZIL callback for both async and sync. This was a
mistake, as FreeBSD does not have a separate "busy for writeback" flag
like Linux which keeps the page usable. The full sbusy flag locks the
entire page out until the itx callback fires, which for async is after
txg sync, which could be literal seconds in the future.
[12 lines not shown]
Revert "FreeBSD: zfs_putpages: don't undirty pages until after write completes"
This causes async putpages to leave the pages sbusied for a long time,
which hurts concurrency. Revert for now until we have a better
approach.
This reverts commit 238eab7dc16932edbe9bcc990e8e5376bfe5b2ba.
Reported by: Ihor Antonov <ngor at hugpoint.tech>
Discussed with: Rob Norris <rob.norris at klarasystems.com>
References: freebsd/freebsd-src at 738a9a7
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Mark Johnston <markj at FreeBSD.org>
Ported-by: Rob Norris <rob.norris at klarasystems.com>
Signed-off-by: Rob Norris <rob.norris at klarasystems.com>
Closes #17533
ZTS: test that zdb can work with libzpool tunables
Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #17537