OpenZFS/src ef81812 man/man4 zfs.4, man/man7 vdevprops.7

Fix spelling errors

Unlike some of my other fixes, which are more subtle, these are
unambiguously spelling errors.

Signed-off-by: Simon Howard <fraggle at gmail.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
DeltaFile
+4 -4 man/man4/zfs.4
+2 -2 man/man7/vdevprops.7
+1 -1 man/man8/zpool-events.8
+7 -7 3 files

OpenZFS/src 1d4505d man/man4 spl.4 zfs.4, man/man7 zfsprops.7

Capitalize in various places where appropriate

These are mostly acronyms (CPUs; ZILs) but also proper nouns such as
"Unix" and "Unicode" which should also be capitalized.

Signed-off-by: Simon Howard <fraggle at gmail.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
DeltaFile
+3 -3 man/man4/spl.4
+3 -3 man/man7/zfsprops.7
+2 -2 man/man4/zfs.4
+1 -1 man/man8/zpool.8
+1 -1 man/man8/zfs-hold.8
+10 -10 5 files

OpenZFS/src 73494f3 man/man4 spl.4 zfs.4, man/man7 zpool-features.7

Make use of "i.e." (id est) consistent

This is the most common way it is written throughout the manpages, but
there are a few cases where it is written slightly differently.

Signed-off-by: Simon Howard <fraggle at gmail.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
DeltaFile
+1 -1 man/man4/spl.4
+1 -1 man/man4/zfs.4
+1 -1 man/man7/zpool-features.7
+3 -3 3 files

OpenZFS/src 530ddcd man/man1 ztest.1, man/man4 zfs.4

Harmonize on American spelling in several places

Most of the documentation is written in American English, so it makes
sense to be consistent.

Signed-off-by: Simon Howard <fraggle at gmail.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
DeltaFile
+3 -3 man/man4/zfs.4
+1 -1 man/man7/zpoolconcepts.7
+1 -1 man/man1/ztest.1
+1 -1 man/man8/zed.8.in
+1 -1 man/man8/zpool-attach.8
+1 -1 man/man8/zpool-remove.8
+8 -8 1 files not shown
+9 -9 7 files

OpenZFS/src 94a3fab include/sys metaslab_impl.h, man/man4 zfs.4

Unified allocation throttling (#17020)

Existing allocation throttling had a goal to improve write speed
by allocating more data to vdevs that are able to write it faster.
But in the process it completely broke the original mechanism,
designed to balance vdev space usage.  With severe vdev space use
imbalance it is possible that some vdevs with higher use start growing
fragmentation sooner than others and, after getting full, will stop
any writes at all.  Also, after vdev addition it might take a very
long time for the pool to restore the balance, since the new vdev does
not have any real preference, unless the old one is already much
slower due to fragmentation.  The old throttling was also request-
based, which was unpredictable with block sizes varying from 512B
to 16MB, nor did it make much sense in case of I/O aggregation,
when its 32-100 requests could be aggregated into a few, leaving the
device underutilized, submitting fewer and/or shorter requests,
or in the opposite case trying to queue up to 1.6GB of writes per device.

This change presents a completely new throttling algorithm. Unlike

    [28 lines not shown]
DeltaFile
+363 -404 module/zfs/metaslab.c
+110 -155 module/zfs/zio.c
+8 -71 module/zfs/spa.c
+16 -45 include/sys/metaslab_impl.h
+17 -27 man/man4/zfs.4
+1 -32 module/zfs/vdev_queue.c
+515 -734 6 files not shown
+536 -786 12 files

OpenZFS/src eb9098e tests/zfs-tests/tests/functional/inheritance inherit_001_pos.ksh state001.cfg

SPDX: license tags: CDDL-1.0

Sponsored-by: https://despairlabs.com/sponsor/
Signed-off-by: Rob Norris <robn at despairlabs.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
DeltaFile
+1 -0 tests/zfs-tests/tests/functional/inheritance/inherit_001_pos.ksh
+1 -0 tests/zfs-tests/tests/functional/inheritance/state001.cfg
+1 -0 tests/zfs-tests/tests/functional/inheritance/state002.cfg
+1 -0 tests/zfs-tests/tests/functional/inheritance/state003.cfg
+1 -0 tests/zfs-tests/tests/functional/inheritance/state004.cfg
+1 -0 tests/zfs-tests/tests/functional/inheritance/state005.cfg
+6 -0 2,910 files not shown
+2,916 -0 2,916 files

OpenZFS/src 1b495ee include/sys ddt.h, man/man4 zfs.4

FDT dedup log sync  -- remove incremental

This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.
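The pacing idea above can be sketched in a few lines (illustrative Python only; the names, constants, and formula are invented here, not taken from module/zfs/ddt.c):

```python
def entries_to_sync(incoming, backlog, clear_txgs=10):
    """Toy pacing model: each txg, sync at least the newly logged
    entries (so the backlog stops growing), plus a slice of the
    existing backlog sized to clear it within ~clear_txgs txgs
    (so it also drains on an idle system)."""
    backlog_slice = -(-backlog // clear_txgs) if backlog else 0  # ceil division
    return incoming + backlog_slice

# A busy system keeps pace with incoming entries while chipping at the backlog:
assert entries_to_sync(1000, 5000) == 1500
# An idle system (no incoming entries) still drains a lingering backlog:
assert entries_to_sync(0, 37) == 4
```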

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Authored-by: Don Brady <don.brady at klarasystems.com>
Authored-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Signed-off-by: Paul Dagnelie <paul.dagnelie at klarasystems.com>
Closes #17038
DeltaFile
+179 -163 module/zfs/ddt.c
+109 -0 tests/zfs-tests/tests/functional/dedup/dedup_fdt_pacing.ksh
+50 -31 man/man4/zfs.4
+10 -0 module/zfs/vdev_queue.c
+2 -5 include/sys/ddt.h
+2 -2 tests/runfiles/common.run
+352 -201 7 files not shown
+366 -202 13 files

OpenZFS/src 2adca17 man/man4 zfs.4, module/zfs metaslab.c

Expand fragmentation table to reflect larger possible allocation sizes

When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.

We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.
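The saturation problem can be shown with a toy model of the metric (illustrative only; the real stepped lookup table and its values live in module/zfs/metaslab.c):

```python
OLD_TABLE_MAX = 16 << 20   # old table: a 16 MiB free segment == 0% fragmented
NEW_TABLE_MAX = 512 << 20  # expanded table extends the axis to 512 MiB

def fragmentation_pct(largest_free_segment, table_max):
    """Toy model: fragmentation falls linearly to 0% as the largest free
    segment approaches the table's top bucket (ZFS really uses a stepped
    lookup table, but the saturation effect is the same)."""
    if largest_free_segment >= table_max:
        return 0
    return round(100 * (1 - largest_free_segment / table_max))

# A metaslab whose largest free chunk is 16 MiB reports 0% fragmentation
# under the old table, yet cannot satisfy a 21 MiB raidz allocation:
assert fragmentation_pct(16 << 20, OLD_TABLE_MAX) == 0
# With the expanded axis, the same metaslab reports real fragmentation:
assert fragmentation_pct(16 << 20, NEW_TABLE_MAX) == 97
```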

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan at klarasystems.com>

    [2 lines not shown]
DeltaFile
+31 -26 module/zfs/metaslab.c
+1 -1 man/man4/zfs.4
+32 -27 2 files

OpenZFS/src e6c98d1 man/man4 zfs.4, man/man7 zpool-features.7 vdevprops.7

Fix several typos in the man pages

Reviewed-by: George Amanakis <gamanakis at gmail.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Tim Smith <tsmith84 at gmail.com>
Closes #16965
DeltaFile
+4 -4 man/man4/zfs.4
+2 -2 man/man7/zpool-features.7
+1 -1 man/man7/vdevprops.7
+1 -1 man/man8/zfs.8
+1 -1 man/man8/zpool-initialize.8
+1 -1 man/man8/zpool-status.8
+10 -10 6 files

OpenZFS/src 4049651 man/man4 zfs.4, module/zfs metaslab.c

Expand fragmentation table to reflect larger possible allocation sizes

When you are using large recordsizes in conjunction with raidz, with
incompressible data, you can pretty reliably be making 21 MB
allocations. Unfortunately, the fragmentation metric in ZFS considers
any metaslabs with 16 MB free chunks completely unfragmented, so you can
have a metaslab report 0% fragmented and be unable to satisfy an
allocation. When using the segment-based metaslab weight, this is
inconvenient; when using the space-based one, it can seriously degrade
performance.

We expand the fragmentation table to extend up to 512MB, and redefine
the table size based on the actual table, rather than having a static
define. We also tweak the one variable that depends on fragmentation
directly.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <allan at klarasystems.com>

    [2 lines not shown]
DeltaFile
+31 -26 module/zfs/metaslab.c
+1 -1 man/man4/zfs.4
+32 -27 2 files

OpenZFS/src b8c0c15 man/man4 zfs.4, man/man7 zpool-features.7 vdevprops.7

Fix several typos in the man pages

Reviewed-by: George Amanakis <gamanakis at gmail.com>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Tim Smith <tsmith84 at gmail.com>
Closes #16965
DeltaFile
+4 -4 man/man4/zfs.4
+2 -2 man/man7/zpool-features.7
+1 -1 man/man8/zfs.8
+1 -1 man/man7/vdevprops.7
+1 -1 man/man8/zpool-initialize.8
+1 -1 man/man8/zpool-status.8
+10 -10 6 files

OpenZFS/src c2d9494 man/man4 zfs.4, module/os/linux/zfs arc_os.c

set zfs_arc_shrinker_limit to 0 by default

zfs_arc_shrinker_limit was introduced to avoid ARC collapse due to
aggressive kernel reclaim. While useful, the current default (10000) is
too prone to OOM especially when MGLRU-enabled kernels with default
min_ttl_ms are used. Even when no OOM happens, it often causes too much
swap usage.

This patch sets zfs_arc_shrinker_limit=0 to not ignore kernel reclaim
requests. ARC now plays better with both kernel shrinker and pagecache
but, should ARC collapse happen again, MGLRU behavior can be tuned or
even disabled.

Anyway, zfs should not cause OOM when ARC can be released.
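The new default can also be applied to a running system, or the old one restored, through the module parameter; a sketch, assuming a Linux system with the zfs module loaded:

```shell
# Follow kernel reclaim requests without a cap (the new default):
echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit

# Revert to the previous default if ARC collapse is observed:
echo 10000 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
```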

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #16909
DeltaFile
+2 -2 man/man4/zfs.4
+2 -2 module/os/linux/zfs/arc_os.c
+4 -4 2 files

OpenZFS/src 54126fd man/man4 zfs.4, module/os/linux/zfs arc_os.c

set zfs_arc_shrinker_limit to 0 by default

zfs_arc_shrinker_limit was introduced to avoid ARC collapse due to
aggressive kernel reclaim. While useful, the current default (10000) is
too prone to OOM especially when MGLRU-enabled kernels with default
min_ttl_ms are used. Even when no OOM happens, it often causes too much
swap usage.

This patch sets zfs_arc_shrinker_limit=0 to not ignore kernel reclaim
requests. ARC now plays better with both kernel shrinker and pagecache
but, should ARC collapse happen again, MGLRU behavior can be tuned or
even disabled.

Anyway, zfs should not cause OOM when ARC can be released.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #16909
DeltaFile
+2 -2 man/man4/zfs.4
+2 -2 module/os/linux/zfs/arc_os.c
+4 -4 2 files

OpenZFS/src 022bf86 man/man4 zfs.4, module/zfs arc.c

Increase L2ARC write rate and headroom

Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (i.e. a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).

These values were selected 15+ years ago based on then-current SSD
size, performance, and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.

For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).
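The new rates can be tried on a live system before upgrading by setting the tunables at runtime; a sketch, assuming Linux module parameter paths (l2arc_write_max is in bytes):

```shell
echo $((32 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_max
echo 8 > /sys/module/zfs/parameters/l2arc_headroom
```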

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #15457 
DeltaFile
+3 -3 man/man4/zfs.4
+2 -2 module/zfs/arc.c
+5 -5 2 files

OpenZFS/src 2bd540d man/man4 zfs.4, man/man7 zfsprops.7

man: update recordsize max size info

Reflect https://github.com/openzfs/zfs/commit/f2330bd1568489ae1fb16d975a5a9bcfe12ed219
change in our man pages and add some context.

Wording is primarily copy-pasted from code comments.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Signed-off-by: George Melikov <mail at gmelikov.ru>
Closes #16581 
DeltaFile
+12 -1 man/man7/zfsprops.7
+5 -0 man/man4/zfs.4
+17 -1 2 files

OpenZFS/src 880b739 man/man4 zfs.4, module/zfs zfs_vnops.c

zfs(4): remove "experimental" from zfs_bclone_enabled

I think we've done enough experiments.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #16189 
Closes #16712 
DeltaFile
+4 -3 man/man4/zfs.4
+3 -3 module/zfs/zfs_vnops.c
+7 -6 2 files

OpenZFS/src b0cfb48 cmd/zed/zed.d deadman-slot_off.sh zed.rc, man/man4 zfs.4

zed: Add deadman-slot_off.sh zedlet

Optionally turn off a disk's enclosure slot if a hung I/O
triggers the deadman.

It's possible for outstanding I/O to a misbehaving SCSI disk to
neither promptly complete nor return an error.  This can occur due
to retry and recovery actions taken by the SCSI layer, driver, or
disk.  When this occurs the pool will be unresponsive even though
there may be sufficient redundancy configured to proceed without
this single disk.

When a hung I/O is detected by the kmods it will be posted as a
deadman event.  By default an I/O is considered to be hung after
5 minutes.  This value can be changed with the zfs_deadman_ziotime_ms
module parameter.  If ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is set
the disk's enclosure slot will be powered off causing the outstanding
I/O to fail.  The ZED will then handle this like a normal disk failure.
By default ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN is not set.
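Opting in is then a matter of setting the variable in zed.rc (the value 1 shown here is an assumption; the commit only says the variable must be set):

```shell
# /etc/zfs/zed.d/zed.rc
ZED_POWER_OFF_ENCLOSURE_SLOT_ON_DEADMAN=1
```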

    [10 lines not shown]
DeltaFile
+71 -0 cmd/zed/zed.d/deadman-slot_off.sh
+12 -9 man/man4/zfs.4
+9 -1 module/zfs/vdev.c
+4 -4 tests/zfs-tests/tests/functional/deadman/deadman_ratelimit.ksh
+7 -0 cmd/zed/zed.d/zed.rc
+2 -0 cmd/zed/zed.d/Makefile.am
+105 -14 1 files not shown
+106 -14 7 files

OpenZFS/src 91bd12d man/man4 zfs.4, module/zfs zfs_vnops.c

zfs(4): remove "experimental" from zfs_bclone_enabled

I think we've done enough experiments.

Sponsored-by: https://despairlabs.com/sponsor/
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Rob Norris <robn at despairlabs.com>
Closes #16189 
Closes #16712 
DeltaFile
+4 -3 man/man4/zfs.4
+3 -3 module/zfs/zfs_vnops.c
+7 -6 2 files

OpenZFS/src 26ecd8b include/sys zio.h, module/os/freebsd/zfs abd_os.c

Always validate checksums for Direct I/O reads

This fixes an oversight in the Direct I/O PR. There is nothing that
stops a process from manipulating the contents of a buffer for a
Direct I/O read while the I/O is in flight. This can lead to checksum
verification failures. However, the disk contents are still correct, and this
would lead to false reporting of checksum validation failures.

To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, then the I/O request will be
reissued through the ARC. This allows for actual validation to happen and
removes any possibility of the buffer being manipulated after the I/O
has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported through zpool status -d in
the DIO column. The zevent has also been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes

    [35 lines not shown]
DeltaFile
+121 -59 tests/zfs-tests/cmd/manipulate_user_buffer.c
+85 -35 module/zfs/zio.c
+107 -0 tests/zfs-tests/tests/functional/direct/dio_read_verify.ksh
+39 -5 module/zfs/vdev_raidz.c
+37 -4 module/os/freebsd/zfs/abd_os.c
+15 -14 include/sys/zio.h
+404 -117 18 files not shown
+510 -146 24 files

OpenZFS/src b4e4cbe include/sys zio.h, module/os/freebsd/zfs abd_os.c

Always validate checksums for Direct I/O reads

This fixes an oversight in the Direct I/O PR. There is nothing that
stops a process from manipulating the contents of a buffer for a
Direct I/O read while the I/O is in flight. This can lead to checksum
verification failures. However, the disk contents are still correct, and this
would lead to false reporting of checksum validation failures.

To remedy this, all Direct I/O reads that have a checksum verification
failure are treated as suspicious. In the event a checksum validation
failure occurs for a Direct I/O read, then the I/O request will be
reissued through the ARC. This allows for actual validation to happen and
removes any possibility of the buffer being manipulated after the I/O
has been issued.

Just as with Direct I/O write checksum validation failures, Direct I/O
read checksum validation failures are reported through zpool status -d in
the DIO column. The zevent has also been updated to have both:
1. dio_verify_wr -> Checksum verification failure for writes

    [35 lines not shown]
DeltaFile
+121 -59 tests/zfs-tests/cmd/manipulate_user_buffer.c
+85 -35 module/zfs/zio.c
+107 -0 tests/zfs-tests/tests/functional/direct/dio_read_verify.ksh
+39 -5 module/zfs/vdev_raidz.c
+37 -4 module/os/freebsd/zfs/abd_os.c
+15 -14 include/sys/zio.h
+404 -117 18 files not shown
+510 -146 24 files

OpenZFS/src 0d77e73 man/man4 zfs.4, module/zfs dsl_scan.c

Defer resilver only when progress is above a threshold

Restart a resilver from scratch if the current one's progress is
below a new tunable, zfs_resilver_defer_percent (defaulting to 10%).

The original rationale for deferring additional resilvers, when there is
already one in progress, was to help achieve data redundancy sooner
for the data that gets scanned at the end of the resilver.

But when an admin wants to attach multiple disks to a single vdev,
it wasn't immediately obvious that they are supposed to run
`zpool resilver` afterwards to reset the deferred resilvers and start
a new one from scratch.
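Both knobs involved can be exercised from the shell; a sketch, assuming a Linux module parameter path and an illustrative pool name:

```shell
# Threshold (in percent) below which an in-progress resilver is
# restarted rather than a new one being deferred:
echo 10 > /sys/module/zfs/parameters/zfs_resilver_defer_percent

# Manually restart deferred resilvers from scratch, as before:
zpool resilver tank
```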

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa at snajpa.net>
Closes #15810 
DeltaFile
+44 -19 module/zfs/dsl_scan.c
+10 -1 tests/zfs-tests/tests/functional/replacement/resilver_restart_001.ksh
+7 -0 man/man4/zfs.4
+1 -0 tests/zfs-tests/include/tunables.cfg
+62 -20 4 files

OpenZFS/src 224393a lib/libzfs libzfs.abi, man/man7 zpool-features.7

feature: large_microzap

In a4b21eadec we added the zap_micro_max_size tuneable to raise the size
at which "micro" (single-block) ZAPs are upgraded to "fat" (multi-block)
ZAPs. Before this, a microZAP was limited to 128KiB, which was the old
largest block size. The side effect of raising the max size past 128KiB
is that it must be stored in a large block, requiring the large_blocks
feature.

Unfortunately, this means that a backup stream created without the
--large-block (-L) flag to zfs send would split the microZAP block into
smaller blocks and send those, as is normal behaviour for large blocks.
This would be received correctly, but since microZAPs are limited to the
first block in the object by definition, the entries in the later blocks
would be inaccessible. For directory ZAPs, this gives the appearance of
files being lost.

This commit adds a feature flag, large_microzap, that must be enabled
for microZAPs to grow beyond 128KiB, and which will be activated the

    [38 lines not shown]
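As with other feature flags, the new feature is enabled per pool with the standard zpool syntax (pool name is illustrative):

```shell
zpool set feature@large_microzap=enabled tank
zpool get feature@large_microzap tank
```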
DeltaFile
+53 -2 module/zfs/zap_micro.c
+20 -3 man/man7/zpool-features.7
+22 -1 module/zfs/dmu_recv.c
+14 -1 module/zcommon/zfeature_common.c
+12 -1 module/zfs/dmu_send.c
+6 -5 lib/libzfs/libzfs.abi
+127 -13 10 files not shown
+162 -22 16 files

OpenZFS/src d34d4f9 man/man4 zfs.4, man/man7 zfsprops.7

snapdir: add 'disabled' value to make .zfs inaccessible

In some environments, just making the .zfs control dir hidden from sight
might not be enough. In particular, the following scenarios might
warrant not allowing access at all:
- old snapshots with wrong permissions/ownership
- old snapshots with exploitable setuid/setgid binaries
- old snapshots with sensitive contents

Introducing a new 'disabled' value that not only hides the control dir,
but prevents access to its contents by returning ENOENT solves all of
the above.

The new property value takes advantage of 'iuv' semantics ("ignore
unknown value") to automatically fall back to the old default value when
a pool is accessed by an older version of ZFS that doesn't yet know
about 'disabled' semantics.
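The new value slots in next to the existing ones (dataset name is illustrative):

```shell
# Existing values are hidden (the default) and visible; the new one
# additionally returns ENOENT for the control dir's contents:
zfs set snapdir=disabled tank/home
zfs get snapdir tank/home
```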

I think that technically the zfs_dirlook change is enough to prevent

    [16 lines not shown]
DeltaFile
+17 -5 module/os/linux/zfs/zfs_ctldir.c
+9 -0 man/man4/zfs.4
+3 -3 man/man7/zfsprops.7
+5 -0 module/os/linux/zfs/zfs_vfsops.c
+4 -0 module/os/linux/zfs/zpl_ctldir.c
+4 -0 module/zfs/dsl_prop.c
+42 -8 10 files not shown
+56 -15 16 files

OpenZFS/src 5591505 man/man4 zfs.4, man/man7 zfsprops.7

man: update recordsize max size info

Reflect https://github.com/openzfs/zfs/commit/f2330bd1568489ae1fb16d975a5a9bcfe12ed219
change in our man pages and add some context.

Wording is primarily copy-pasted from code comments.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs at mcmilk.de>
Signed-off-by: George Melikov <mail at gmelikov.ru>
Closes #16581 
DeltaFile
+12 -1 man/man7/zfsprops.7
+5 -0 man/man4/zfs.4
+17 -1 2 files

OpenZFS/src a10e552 lib/libzfs libzfs.abi, module/os/linux/zfs zfs_uio.c

Adding Direct IO Support

Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
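The property mentioned above is set with the usual zfs syntax (dataset name is illustrative):

```shell
# standard: honor O_DIRECT and its alignment requirements (default)
# always:   treat eligible I/O as direct even without O_DIRECT
# disabled: silently ignore O_DIRECT
zfs set direct=always tank/scratch
```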

For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,

    [66 lines not shown]
DeltaFile
+351 -289 lib/libzfs/libzfs.abi
+395 -0 module/zfs/dmu_direct.c
+331 -0 tests/zfs-tests/tests/functional/direct/dio.kshlib
+231 -88 module/zfs/dbuf.c
+293 -2 module/os/linux/zfs/zfs_uio.c
+274 -20 module/zfs/zfs_vnops.c
+1,875 -399 105 files not shown
+5,989 -726 111 files

OpenZFS/src cd42e99 man/man4 zfs.4, module/zfs arc.c

Enable L2 cache of all (MRU+MFU) metadata but MFU data only

`l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU
data and metadata. However it can be useful to cache as much
metadata as possible while, at the same time, restricting data
cache to MFU buffers only.

This patch allows for such behavior by setting `l2arc_mfuonly` to 2
(or higher). The list of possible values is the following:
0: cache both MRU and MFU for both data and metadata;
1: cache only MFU for both data and metadata;
2: cache both MRU and MFU for metadata, but only MFU for data.
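The new value is selected the same way as the existing ones; a sketch, assuming the Linux module parameter path:

```shell
# Cache MRU+MFU metadata, but only MFU data, in L2ARC:
echo 2 > /sys/module/zfs/parameters/l2arc_mfuonly
```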

Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #16343 
Closes #16402 
DeltaFile
+10 -4 man/man4/zfs.4
+8 -3 module/zfs/arc.c
+18 -7 2 files

OpenZFS/src bbe8512 man/man4 zfs.4, module/os/linux/zfs arc_os.c

Ignore zfs_arc_shrinker_limit in direct reclaim mode

zfs_arc_shrinker_limit (default: 10000) avoids ARC collapse
due to excessive memory reclaim. However, when the kernel is
in direct reclaim mode (i.e. low on memory), limiting ARC reclaim
increases OOM risk. This is especially true on systems without
(or with inadequate) swap.

This patch ignores zfs_arc_shrinker_limit when the kernel is in
direct reclaim mode, avoiding most OOM. It also restores
"echo 3 > /proc/sys/vm/drop_caches" ability to correctly drop
(almost) all ARC.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Adam Moss <c at yotes.com>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #16313 
DeltaFile
+3 -3 module/os/linux/zfs/arc_os.c
+1 -0 man/man4/zfs.4
+4 -3 2 files

OpenZFS/src 77a797a man/man4 zfs.4, module/zfs arc.c

Enable L2 cache of all (MRU+MFU) metadata but MFU data only

`l2arc_mfuonly` was added to avoid wasting L2 ARC on read-once MRU
data and metadata. However it can be useful to cache as much
metadata as possible while, at the same time, restricting data
cache to MFU buffers only.

This patch allows for such behavior by setting `l2arc_mfuonly` to 2
(or higher). The list of possible values is the following:
0: cache both MRU and MFU for both data and metadata;
1: cache only MFU for both data and metadata;
2: cache both MRU and MFU for metadata, but only MFU for data.

Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Gionatan Danti <g.danti at assyoma.it>
Closes #16343 
Closes #16402 
DeltaFile
+10 -4 man/man4/zfs.4
+8 -3 module/zfs/arc.c
+18 -7 2 files

OpenZFS/src a60e15d man/man4 zfs.4

Man page updates for dmu_ddt_copies

Reviewed-by: Alexander Motin <mav at FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Allan Jude <allan at klarasystems.com>
Closes #15895
DeltaFile
+11 -0 man/man4/zfs.4
+11 -0 1 files

OpenZFS/src cd69ba3 cmd/zdb zdb.c, include/sys ddt_impl.h ddt.h

ddt: dedup log

Adds a log/journal to dedup. At the end of txg, instead of writing
entries directly to the ZAP, they are added to an in-memory tree and
appended to an on-disk object. The on-disk object is only read at
import, to reload the in-memory tree.

Lookups first go to the log tree before going to the ZAP, so
recently-used entries will remain close by in memory. This vastly
reduces overhead from dedup IO, as it will not have to do so many
read/update/write cycles on ZAP leaf nodes.

A flushing facility is added at end of txg, to push logged entries out
to the ZAP. There are actually two separate "logs" (in-memory tree and
on-disk object), one active (receiving updated entries) and one flushing
(writing out to disk). These are swapped (i.e. flushing begins) based on
memory used by the in-memory log trees and time since we last flushed
something.
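The lookup/flush flow described above can be sketched in a few lines (illustrative Python; the names are invented, not ZFS APIs):

```python
class DedupTableSketch:
    """Toy model of the two-tier dedup lookup: an in-memory log tree in
    front of the on-disk ZAP, flushed in bulk at end of txg."""

    def __init__(self):
        self.log_tree = {}  # recently logged entries, kept in memory
        self.zap = {}       # stands in for the on-disk ZAP

    def update(self, key, entry):
        # Updates land in the log, not the ZAP, avoiding a
        # read/update/write cycle on ZAP leaf blocks per entry.
        self.log_tree[key] = entry

    def lookup(self, key):
        # Recently-used entries hit the in-memory tree first.
        if key in self.log_tree:
            return self.log_tree[key]
        return self.zap.get(key)

    def flush(self):
        # Push logged entries out to the ZAP in bulk.
        self.zap.update(self.log_tree)
        self.log_tree.clear()

ddt = DedupTableSketch()
ddt.update(("pool", 0xDEADBEEF), {"refcnt": 2})
assert ddt.lookup(("pool", 0xDEADBEEF)) == {"refcnt": 2}  # from the log tree
ddt.flush()
assert ddt.lookup(("pool", 0xDEADBEEF)) == {"refcnt": 2}  # now from the ZAP
assert not ddt.log_tree
```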


    [16 lines not shown]
DeltaFile
+760 -0 module/zfs/ddt_log.c
+524 -122 module/zfs/ddt.c
+130 -1 include/sys/ddt_impl.h
+82 -0 man/man4/zfs.4
+36 -3 include/sys/ddt.h
+32 -1 cmd/zdb/zdb.c
+1,564 -127 11 files not shown
+1,621 -131 17 files