zio_ddt_write: compute have_dvas after taking dde_io_lock
In zio_ddt_write(), have_dvas and is_ganged were computed before
dde_io_lock was taken. A concurrent zio_ddt_child_write_done() error
path calls ddt_phys_unextend() under dde_io_lock, which can zero
DVA[0] while another thread is between computing have_dvas and taking
dde_io_lock. That thread then uses the stale have_dvas=1 to call
ddt_bp_fill(), copying the zeroed DVA into the BP. A zero DVA resolves
as a hole, producing blocks that read back as zeros with no checksum
error (silent data corruption).
Fix by moving have_dvas and is_ganged computation to after dde_io_lock
is taken, so they always reflect the current state of dde->dde_phys.
Regression introduced by a41ef36858 ("DDT: Reduce global DDT lock
scope during writes").
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
[6 lines not shown]
Fix race between device removal completion and pool export
vdev_remove_complete() finalizes a device removal in two phases under
the spa lock framework. Between the two phases it called
spa_vdev_exit(), which drops both the config locks (SCL_ALL) and
spa_namespace_lock and blocks on a txg sync. By that point
vdev_remove_replace_with_indirect() has already set svr->svr_thread =
NULL, and that is the only thing the export path (spa_export_common()
-> spa_async_suspend() -> spa_vdev_remove_suspend()) waits on. Once the
namespace lock is dropped, a concurrent export or destroy can acquire
it and set spa->spa_export_thread. When the removal thread re-enters
for its second phase via spa_vdev_enter(), it trips the
ASSERT0P(spa->spa_export_thread) assertion.
Hold spa_namespace_lock across both phases instead of dropping and
re-taking it: the intermediate spa_vdev_exit() becomes
spa_vdev_config_exit(), which drops only SCL_ALL and syncs the txg
while keeping the namespace lock held, and the second spa_vdev_enter()
becomes spa_vdev_config_enter(). Because the namespace lock is never
[7 lines not shown]
CI: Increase default watchdog NMI timeout on Linux
When the watchdog driver is configured and enabled an NMI will be
generated when the watchdogd process fails to regularly reset the
watchdog timer. Given the heavily virtualized and potentially
over-subscribed nature of the CI environment increase the default
timeout to 120 seconds (normally defaults to 30 seconds).
Reviewed-by: Christos Longros <chris.longros at gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #18704
ddt_log: Fix refcount tagging for begin/commit
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Rob Norris <rob.norris at truenas.com>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Igor Ostapenko <igor.ostapenko at klarasystems.com>
Closes #18706
Assert SA bulk array bounds before sa_bulk_update
The SA bulk arrays are filled by series of conditional SA_ADD_BULK_ATTR
calls whose worst-case count is easy to miscount, so a future attribute
add could silently overrun the fixed-size array.
Add ASSERT3S(count, <=, ARRAY_SIZE(bulk)) before each sa_bulk_update so
an overrun trips in debug builds, and use ARRAY_SIZE so the bound stays
tied to the declaration. Linux zfs_setattr allocates its arrays on the
heap sized by 'bulks', so it asserts against that instead.
Also tighten FreeBSD zfs_setattr's xattr_bulk from [7] to [6], its
actual worst case.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #18573
Update mtime/ctime when fallocate grows a file
Growing a file with fallocate updated its size but left mtime/ctime
unchanged and didn't log the change. A fallocate that changes the file
size should update mtime/ctime, and the change should be logged so it
survives a crash.
Pass log=TRUE to zfs_freesp() on the extend path so it updates the
timestamps and logs the size change, matching zfs_space(). Punch-hole
and zero-range already use this path and are unaffected.
Reviewed-by: Rob Norris <rob.norris at truenas.com>
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Ameer Hamza <ahamza at ixsystems.com>
Closes #18573
Persist z_seq across znode eviction
Commit 312bdab0f5 advertises STATX_ATTR_CHANGE_MONOTONIC and builds
the NFSv4 change_cookie from (ctime.tv_sec << 32) | zp->z_seq.
zp->z_seq is reset to a magic constant in zfs_znode_alloc(), so any
event that drops the znode from cache (memory pressure, remount,
reboot) regresses the lower bits of the cookie, a backward step
within the same second.
NFSv4 clients that trust this contract treat a regressed cookie as
evidence that the file's metadata cannot be relied on. VMware ESXi
over NFSv4.1 surfaces this as "The file specified is not a virtual
disk", and a VM stored on the affected NFS-exported ZFS dataset
fails to power on.
Widen z_seq to 64 bit and present it directly as the change_cookie,
dropping the ctime packing, so the cookie is a single monotonic
counter that no longer depends on the clock. FreeBSD's va_filerev
consumer also takes the wider value.
[29 lines not shown]
zbookmark_compare: handle "marker" bookmarks with negative levels
"Marker" bookmarks (those with zb_level == ZB_ROOT_LEVEL, ZB_ZIL_LEVEL
or ZB_DNODE_LEVEL) represent valid blocks, but are associated with a
dataset directly rather than with a specific object within it. They end
up on bookmark lists during scan prefetch, and so need to be sorted
ahead of any "true" object blocks.
The problem is that for negative levels, BP_SPANB produces a negative
shift, which is not legal C. Fortunately the results are used only for
comparison, so the worst possible behaviour in a forgiving compilation
environment is a mis-sort, which for the scan/traverse cases, means that
we haven't prefetched certain metadata before we actually need it. But
there _is_ UB in there, and UBSAN does rightly complain.
Here we fix all this by handling these bookmarks directly - sorting them
ahead of "true" object blocks, which is usually what scan/traverse will
prefer. And we don't do any interesting math on these bookmarks, so we
sidestep the whole UB thing.
[6 lines not shown]
Constify some rrd_*() functions
These don't modify the db, so just constify them while we're in the
area.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Chris Longros <chris.longros at gmail.com>
Signed-off-by: Kyle Evans <kevans at FreeBSD.org>
Add dbrrd_latest_time() to grab the latest timestamp in the db
Returns 0 if the database is empty, otherwise it returns the highest
value of the minutely db. dbrrd_add() will already enforce the property
that these are monotonically increasing, so we won't try to second-guess
it.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Chris Longros <chris.longros at gmail.com>
Signed-off-by: Kyle Evans <kevans at FreeBSD.org>
freebsd: set mnt_time on the rootfs at mountroot time
FreeBSD's vfs_mountroot() will collect `mnt_time` from every filesystem
that we mounted and use the highest timestamp as a source for the system
time if we didn't get anything from an attached RTC.
Use the rrd mechanism added to gather up a notion of the latest time
and set it on mnt_time. If the timestamp db is empty, we just fallback
to the uberblock timestamp and hope that that is in the right ballpark.
Relevant: FreeBSD PR254058[0] reporting the problem downstream
[0] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254058
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Chris Longros <chris.longros at gmail.com>
Signed-off-by: Kyle Evans <kevans at FreeBSD.org>
Fix aarch64 build failure by removing earlyclobber (#18532)
The UVR macros used "+&w" (read-write + earlyclobber) as the
constraint for NEON register operands that are declared as explicit
hard-register variables via:
register unsigned char wN asm("vN") __attribute__((vector_size(16)));
The + modifier implicitly makes the operand also an input (reading the
register before the asm runs). The & (earlyclobber) modifier says "this
output may be written before all inputs are consumed." Having an
earlyclobber output on the same hard-register that is simultaneously
an input is a contradiction — GCC 16 now strictly diagnoses this.
The fix removes the & from "+&w", yielding "+w". The earlyclobber
was both incorrect (contradicts the implicit input) and unnecessary
(the physical registers are already hard-bound, so the compiler has no
freedom to assign conflicting registers anyway).
[7 lines not shown]
unit/zap: uint64 keys
Add coverage for binary uint64-array keys (ZAP_FLAG_UINT64_KEY), the key
type used by the dedup table and block reference table. Covers the
by-dnode operations on such a ZAP: add, lookup, length, lookup_length,
update and remove.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes #18639
Avoid more abd_t allocations in RAIDZ/dRAID
RAIDZ/dRAID allocate rc_abdstruct for every column, but used it only
for data columns, while parity columns allocated additional abd_t's
on top of that. This change introduces abd_alloc_linear_struct(),
abd_alloc_gang_struct() and abd_get_zeros_struct() to the ABD API,
mirroring the existing abd_get_offset_struct() pattern, and uses
them to use rc_abdstruct in remaining column cases.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <rob.norris at truenas.com>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18668
unit/zap: add MT_MATCH_CASE mixed-case test
New case-normalization test to cover the exact-case path. On a TOUPPER
ZAP, MT_NORMALIZE | MT_MATCH_CASE matches only the stored casing, while
an MT_NORMALIZE lookup matches any case.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <rob.norris at truenas.com>
Signed-off-by: Christos Longros <chris.longros at gmail.com>
Closes #18638
zstream: refactor common functions
### Motivation
In the current version of `zstream`, each subcommand is independent and
is responsible for implementing its own stream-processing pipeline. It
started as a stream dumper, but as additional subcommands were added,
contributors typically copied an existing subcommand's pipeline and
adapted it for different purposes.
This pattern has led to quite a bit of duplicated code and has also led
to some functional nonuniformities. For example, some subcommands
support opposite-endian streams and others don't.
### Overview
This PR segregates functions that most subcommands need into
free-standing modules and reimplements the existing subcommands in
terms of those modules. The current modules are:
[100 lines not shown]
freebsd: set mnt_time on the rootfs at mountroot time
FreeBSD's vfs_mountroot() will collect `mnt_time` from every filesystem
that we mounted and use the highest timestamp as a source for the system
time if we didn't get anything from an attached RTC.
Use the rrd mechanism added to gather up a notion of the latest time
and set it on mnt_time. If the timestamp db is empty, we just fallback
to the uberblock timestamp and hope that that is in the right ballpark.
Relevant: FreeBSD PR254058[0] reporting the problem downstream
[0] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=254058
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Chris Longros <chris.longros at gmail.com>
Signed-off-by: Kyle Evans <kevans at FreeBSD.org>
Closes #18645
Add dbrrd_latest_time() to grab the latest timestamp in the db
Returns 0 if the database is empty, otherwise it returns the highest
value of the minutely db. dbrrd_add() will already enforce the property
that these are monotonically increasing, so we won't try to second-guess
it.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Chris Longros <chris.longros at gmail.com>
Signed-off-by: Kyle Evans <kevans at FreeBSD.org>
Closes #18645
Constify some rrd_*() functions
These don't modify the db, so just constify them while we're in the
area.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Chris Longros <chris.longros at gmail.com>
Signed-off-by: Kyle Evans <kevans at FreeBSD.org>
Closes #18645
RAIDZ: Optimize single data column writes
When a row contains only a single data column (one ashift-sized
block or 2-wide RAIDZ), P = Q = R = data mathematically. In this
case point all parity column ABDs at the data column ABD, skipping
both buffer allocation and parity generation.
It might be not very efficient to write so small blocks on RAIDZ,
but it is allowed and does happen. Skipping this allocation and
memory copy saves several percents of CPU time.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <rob.norris at truenas.com>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18695
Optimize metaslab_set_selected_txg()
I don't think it makes much sense to choose for eviction between
metaslabs selected in the same TXG. Considering that we also don't
evict them for at least 32 TXG, the difference should be in a noise.
Just skip the metaslab bumping if we already done it in this TXG.
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Rob Norris <rob.norris at truenas.com>
Signed-off-by: Alexander Motin <alexander.motin at TrueNAS.com>
Closes #18669
FreeBSD: avoid lookup overhead for nonexistent xattr directories
Port the z_xattr_dir_absent cache to FreeBSD. As on Linux, a
getextattr that misses in the SA otherwise falls through to
zfs_get_xattrdir(), which takes the "" ZXATTR dirlock and reads
SA_ZPL_XATTR only to find the file has no xattr directory. The
in-core znode now caches that: after an SA miss zfs_getextattr_impl()
skips the directory lookup when the flag is set, zfs_get_xattrdir()
sets it when no directory is found and clears it when one is found,
and zfs_make_xattrdir() clears it on creation, which also covers the
TX_MKXATTR ZIL replay path.
The flag is serialized by the base file's vnode lock.
zfs_make_xattrdir(), the only path that creates the directory and
clears the flag, runs with the vnode held exclusive, while every
reader that sets the flag holds it shared, so a set can never race
the clear. ASSERT_VOP_ELOCKED() in zfs_make_xattrdir() and
ASSERT_VOP_LOCKED() in zfs_get_xattrdir() enforce this, both skipped
during ZIL replay since it is single threaded with no locked vnode.
[8 lines not shown]
Avoid lookup overhead for nonexistent xattr directories
A getxattr that misses in the file's SA falls through to
zfs_get_xattrdir(), which takes the "" ZXATTR dirlock and issues an
sa_lookup(SA_ZPL_XATTR), only to find the file has no xattr directory at
all. security.capability is the common trigger: the kernel probes it on
file access (get_vfs_caps_from_disk()), so for the many files that carry
no extended attributes the same fruitless lookup repeats constantly.
Profiling an SMB metadata workload showed roughly 6% of CPU spent in
zfs_get_xattrdir(), every call missing and returning ENOENT.
Cache the result in the in-core znode: a new boolean marks a file as
having no xattr directory. When it is set, a getxattr that misses in the
SA returns ENODATA from __zpl_xattr_get() without the zfs_lookup into
zfs_get_xattrdir, so neither the "" ZXATTR dirlock nor the SA_ZPL_XATTR
lookup runs. The flag is set when the directory lookup finds nothing and
cleared in zfs_make_xattrdir() whenever a directory is created, so the
setxattr and TX_MKXATTR ZIL replay paths are both covered. It is updated
under the existing z_xattr_lock and defaults to the real lookup, so
[9 lines not shown]
Improve performance of "zpool offline" for log devices
When offlining a log device, if it's part of a mirror that would still
be available after the offline operation, skip replaying the ZIL for
every dataset. This drastically improves the performance of "zpool
offline" for one log device of a mirrored pair.
Sponsored by: ConnectWise
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alek Pinchuk <alek.pinchuk at connectwise.com>
Signed-off-by: Alan Somers <asomers at gmail.com>
Closes #18664
honor file argument in file_wait_event
grep the log path passed by the caller instead of always using
ZED_DEBUG_LOG.
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Alek Pinchuk <Alek.Pinchuk at connectwise.com>
Closes #18700
delegate: add 'send:encrypted' permission
send:encrypted is like send:raw, but only permits encrypted datasets to
be sent - raw send is not permitted for unencrypted datasets.
This commit creates the permission, wires it up, and adds the check for
it in zfs_secpolicy_send_impl(), if it is the last send permission
standing, the dataset is checked for its encryption state.
Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18673
zfs_secpolicy_send: lift checks to common function for both
The permissions checks for send are a little involved because different
permissions grant different abilities, and there's two ways to initiate
a send.
This lifts the common permissions checks into a single function, and
ensures that we maintain a single dataset hold across all checks. This
will become important in the next commit when we need to check a
specific dataset property as part of the permission check.
Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18673
ZTS: delegate: test send:encrypted
Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18673
ZTS: delegate: check send permissions on encrypted datasets
Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18673
ZTS: delegate: add encryption option for test fixture datasets
The delegate test framework doesn't care about the encryption status of
the dataset under test, so by adding an option to create with encryption
the framework can be used to check encryption-related permissions
without any further fanfare.
Sponsored-by: TrueNAS
Reviewed-by: Alexander Motin <alexander.motin at TrueNAS.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Rob Norris <rob.norris at truenas.com>
Closes #18673