ZFS on Linux/src 679b0f2 module/zfs metaslab.c

Concurrent small allocation defeats large allocation

With the new parallel allocators scheme, there is a possibility for 
a problem where two threads, allocating from the same allocator at 
the same time, conflict with each other. There are two primary cases 
to worry about. First, another thread working on another allocator
activates the same metaslab that the first thread was trying to
activate. This results in the first thread needing to go back and
reselect a new metaslab, even though it may have waited a long time
for this metaslab to load. Second, another thread working on the same
allocator may have activated a different metaslab while the first
thread was waiting for its metaslab to load. Both of these cases
can cause the first thread to be significantly delayed in issuing 
its IOs. The second case can also cause metaslab load/unload churn; 
because the metaslab is loaded but not fully activated, we never set 
the selected_txg, which results in the metaslab being immediately 
unloaded again. This process can repeat many times, wasting disk and 
CPU resources. This is more likely to happen when the IO of the first
thread is a larger one (like a ZIL write) and the other thread is
doing a smaller write, because the smaller write is more likely to find
an acceptable metaslab quickly.

There are two primary changes. The first is to always proceed with 
the allocation when returning from metaslab_activate if we were 
preempted in either of the ways described in the previous section. 

    [11 lines not shown]
Delta     File
+233 -51  module/zfs/metaslab.c
+233 -51  1 files

ZFS on Linux/src 3fab4d9 cmd/zdb zdb.c

zdb -vvvvv on ztest pool dies with "out of memory"

ztest creates some extremely large files as part of its 
operation. When zdb tries to dump a large enough file, it 
can run out of memory or spend an extremely long time 
attempting to print millions or billions of uint64_ts.

We cap the amount of data from a uint64 object that we 
are willing to read and print.
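
The capping idea can be sketched as follows. This is only an
illustration, not the actual zdb change: the helper name
dump_uint64_capped() and its max_entries parameter are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: print at most max_entries of the n values in vals
 * and note how many were skipped. Returns the number of values printed.
 */
static size_t
dump_uint64_capped(FILE *out, const uint64_t *vals, size_t n,
    size_t max_entries)
{
	size_t limit = (n < max_entries) ? n : max_entries;

	assert(out != NULL && (vals != NULL || n == 0));
	for (size_t i = 0; i < limit; i++) {
		(void) fprintf(out, "%s0x%" PRIx64,
		    i == 0 ? "" : ", ", vals[i]);
	}
	if (limit < n)
		(void) fprintf(out, " ... (%zu more not shown)", n - limit);
	(void) fprintf(out, "\n");
	return (limit);
}
```

With a cap in place, both memory use and output length are bounded by
max_entries rather than by the size of the object.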

Reviewed-by: Don Brady <don.brady at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
External-issue: DLPX-53814
Closes #8947 
Delta    File
+20 -6   cmd/zdb/zdb.c
+20 -6   1 files

ZFS on Linux/src fc75467 include/sys multilist.h, module/zfs multilist.c dmu_objset.c

Avoid extra taskq_dispatch() calls by DMU

The DMU sync code calls taskq_dispatch() for each sublist of
os_dirty_dnodes and os_synced_dnodes.  Since the number of sublists is
by default equal to the number of CPUs, it dispatches an equal,
potentially large, number of tasks, waking up many CPUs to handle them
even if only one or a few of the sublists actually have any work to do.

This change adds a check for empty sublists to avoid this.
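
A minimal sketch of the idea, assuming a simplified model in which a
sublist is represented only by its element count. The sublist_t type,
dispatch_nonempty() and count_items() are hypothetical names and do not
reflect the multilist or taskq APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one sublist: only its length matters here. */
typedef struct sublist {
	size_t sl_count;
} sublist_t;

typedef void (*dispatch_fn_t)(sublist_t *, void *);

/*
 * Invoke dispatch() only for sublists that have work to do.
 * Returns the number of dispatches actually made.
 */
static size_t
dispatch_nonempty(sublist_t *lists, size_t nlists, dispatch_fn_t dispatch,
    void *arg)
{
	size_t dispatched = 0;

	assert(dispatch != NULL);
	for (size_t i = 0; i < nlists; i++) {
		if (lists[i].sl_count == 0)
			continue;	/* nothing to do, skip the wakeup */
		dispatch(&lists[i], arg);
		dispatched++;
	}
	return (dispatched);
}

/* Example callback: add the sublist's element count to a running total. */
static void
count_items(sublist_t *sl, void *arg)
{
	*(size_t *)arg += sl->sl_count;
}
```

With four sublists of which two are empty, only two dispatches happen
instead of four.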

Reviewed by: Sean Eric Fagan <sef at ixsystems.com>
Reviewed by: Matt Ahrens <matt at delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by:  Alexander Motin <mav at FreeBSD.org>
Closes #8909 

ZFS on Linux/src 5279ae9 cmd/zdb zdb.c

Redacted Send/Receive causes zdb to dump core

When used with verbosity >= 4, zdb fails an assertion in dump_bookmarks()
because it expects snprintf() to return 0 on success.
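
For reference, snprintf() returns the number of characters that would
have been written (excluding the terminating NUL), so success means
"0 <= n < buflen" rather than "n == 0". A small sketch of a correct
check follows; format_bookmark_name() is a hypothetical helper, not the
code in dump_bookmarks().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "fs#mark" into buf.
 * Returns 0 if the name fit, -1 on an output error or truncation.
 */
static int
format_bookmark_name(char *buf, size_t buflen, const char *fs,
    const char *mark)
{
	int n;

	assert(buf != NULL && buflen > 0);
	n = snprintf(buf, buflen, "%s#%s", fs, mark);
	/* n is the full formatted length, not a zero-on-success status. */
	if (n < 0 || (size_t)n >= buflen)
		return (-1);
	return (0);
}
```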

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Signed-off-by: loli10K <ezomori.nozomu at gmail.com>
Closes #8948 
Delta    File
+1 -1    cmd/zdb/zdb.c
+1 -1    1 files

ZFS on Linux/src 746d4a4 include/sys spa.h, module/zfs zio.c

Fix bp_embedded_type enum definition

With the addition of BP_EMBEDDED_TYPE_REDACTED in 30af21b0 a couple of
codepaths make wrong assumptions and could potentially result in errors.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Chris Dunlop <chris at onthe.net.au>
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Signed-off-by: loli10K <ezomori.nozomu at gmail.com>
Closes #8951 

ZFS on Linux/src 13d454c tests/zfs-tests/tests/functional/cli_root/zdb zdb_001_neg.ksh

-Y option for zdb is valid

The -Y option was added for ztest to test split block reconstruction.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling at RichardElling.com>
Signed-off-by: Igor Kozhukhov <igor at dilos.org>
Closes #8926 

ZFS on Linux/src 59ec30a cmd/zfs zfs_main.c, module/zfs dmu.c dmu_objset.c

Remove code for zfs remap

The "zfs remap" command was disabled by
6e91a72fe3ff8bb282490773bd687632f3e8c79d, because it has little utility
and introduced some tricky bugs.  This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.

Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.

Reviewed-by: Tom Caputi <tcaputi at datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
Closes #8944 

ZFS on Linux/src 5386480 lib/libzfs libzfs_dataset.c, module/zfs dsl_crypt.c

Fix error message on promoting encrypted dataset

This patch corrects the error message reported when attempting
to promote a dataset outside of its encryption root.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tom Caputi <tcaputi at datto.com>
Closes #8905 
Closes #8935 

ZFS on Linux/src 8f12a4f cmd/zed Makefile.am, cmd/zed/zed.d Makefile.am

Fix out-of-tree build failures

Resolve the incorrect use of srcdir and builddir references for
various files in the build system.  These crept in over time and
went unnoticed because, when building in the top-level directory,
srcdir and builddir are identical.

With this change it's again possible to build in a subdirectory.

    $ mkdir obj
    $ cd obj
    $ ../configure
    $ make

Reviewed-by: loli10K <ezomori.nozomu at gmail.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Don Brady <don.brady at delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #8921 
Closes #8943 

ZFS on Linux/src cc9625c tests/zfs-tests/cmd/get_diff .gitignore

Add missing .gitignore from "Implement Redacted Send/Receive"

30af21b025 needed to ignore get_diff executable.

Reviewed-by: loli10K <ezomori.nozomu at gmail.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #8950 

ZFS on Linux/src 186898b include/sys zcp.h, module/spl spl-condvar.c

OpenZFS 9425 - channel programs can be interrupted

Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.

Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal. In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete.  Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
EINTR to indicate that the channel program run was canceled.

Porting notes: Added missing return value from cv_wait_sig()

Authored by: Don Brady <don.brady at delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy at delphix.com>

    [11 lines not shown]

ZFS on Linux/src cb9e5b7 module/zfs dmu_tx.c

dn_struct_rwlock can not be held in dmu_tx_try_assign()

The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock
while assigning the tx, because this can lead to deadlock. Specifically,
if this dnode is already assigned to an earlier txg, this thread may
need to wait for that txg to sync (the ERESTART case below).  The other
thread that has assigned this dnode to an earlier txg prevents this txg
from syncing until its tx can complete (calling dmu_tx_commit()), but it
may need to acquire the dn_struct_rwlock to do so (e.g. via
dmu_buf_hold*()).

This commit adds an assertion to dmu_tx_try_assign() to ensure that this
deadlock is not inadvertently introduced.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
Closes #8929 
Delta    File
+19 -0   module/zfs/dmu_tx.c
+19 -0   1 files

ZFS on Linux/src ca4e5a7 rpm/generic zfs.spec.in

Remove arch and relax version dependency

Remove arch and relax version dependency for zfs-dracut
package.

Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Gordan Bobic <gordan at redsleeve.org>
Issue #8913 
Closes #8914 

ZFS on Linux/src 8b14cb4 lib/libzfs libzfs.pc.in

Add libnvpair to libzfs pkg-config

Functions such as `fnvlist_lookup_nvlist` need libnvpair to be linked.
The default pkg-config file did not include it.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Harry Mallon <hjmallon at gmail.com>
Closes #8919 

ZFS on Linux/src a9cd8bf cmd/zfs zfs_main.c

Let zfs mount all tolerate in-progress mounts

The zfs-mount service can unexpectedly fail to start when zfs 
encounters a mount that is in progress. This service uses 
zfs mount -a, which has a window between the time it checks if 
the dataset was mounted and when the actual mount (via mount.zfs 
binary) occurs.

The reason for the racing mounts is that both zfs-mount.target 
and zfs-share.target are allowed to execute concurrently after 
the import.  This is more of an issue with the relatively recent 
addition of parallel mounting, and we should consider serializing
the mount and share targets.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed by: John Kennedy <john.kennedy at delphix.com>
Reviewed-by: Allan Jude <allanjude at freebsd.org>
Signed-off-by: Don Brady <don.brady at delphix.com>
Closes #8881
Delta    File
+18 -1   cmd/zfs/zfs_main.c
+18 -1   1 files

ZFS on Linux/src fb6e6f1 cmd/zstreamdump zstreamdump.c, tests/zfs-tests/tests/functional/rsend rsend.kshlib

zstreamdump: add per-record-type counters and an overhead counter

Count the bytes of payload for each replication record type

Count the bytes of overhead (replication records themselves)

Include these counters in the output summary at the end of the run.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Signed-off-by: Allan Jude <allanjude at freebsd.org>
Sponsored-By: Klara Systems and Catalogic
Closes #8432 

ZFS on Linux/src 2b09628 include/sys dsl_bookmark.h

Fix comments on zfs_bookmark_phys

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #8945 

ZFS on Linux/src d5bf1cf tests/zfs-tests/cmd/libzfs_input_check libzfs_input_check.c

Fix build break by "Implement Redacted Send/Receive"

30af21b025 broke the build on Fedora, where gcc detects the potential
overflow at compile time. Take the length of the already-copied string
into account when bounding the append.

Also change strn to strl variants per suggestion from @behlendorf
and @ofaaland.

--
libzfs_input_check.c: In function 'test_redact':
libzfs_input_check.c:711:2: error: 'strncat' specified bound 288 equals
 destination size [-Werror=stringop-overflow=]
  strncat(bookmark, "#testbookmark", sizeof (bookmark));
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
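
The warning arises because the bound passed to strncat() limits how many
characters are appended, not the total size of the destination. A sketch
of a correctly bounded append using only standard C follows. The helper
append_bounded() is hypothetical; the commit itself switched to strl
variants instead.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: append src to the NUL-terminated string in dst,
 * whose total capacity is dstsize bytes, truncating if necessary.
 * Returns 0 if all of src fit, -1 if it was truncated.
 */
static int
append_bounded(char *dst, size_t dstsize, const char *src)
{
	size_t used = strlen(dst);
	size_t room;

	assert(used < dstsize);
	/* Space left after the existing string, minus one for the NUL. */
	room = dstsize - used - 1;
	strncat(dst, src, room);
	return (strlen(src) <= room ? 0 : -1);
}
```

Passing sizeof (dst) as the bound, as in the failing line above, would
let strncat() write past the end whenever dst already holds data.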

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Olaf Faaland <faaland1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #8939 

ZFS on Linux/src a370182 module/zfs zvol.c

Add SCSI_PASSTHROUGH to zvols to enable UNMAP support

When exporting ZVOLs as SCSI LUNs, by default Windows will not
issue them UNMAP commands. This reduces storage efficiency in
many cases.

We add the SCSI_PASSTHROUGH flag to the zvol's device queue,
which lets the SCSI target logic know that it can handle SCSI
commands.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: John Gallagher <john.gallagher at delphix.com>
Signed-off-by: Paul Dagnelie <pcd at delphix.com>
Closes #8933 
Delta    File
+4 -0    module/zfs/zvol.c
+4 -0    1 files

ZFS on Linux/src 3976fd6 cmd/zfs zfs_main.c

Redacted Send/Receive broke zfs(8) help message

Since 30af21b0 was merged, the 'zfs send' help message has been
misformatted and lists "-r" as a valid option. This commit corrects
both issues.

Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Signed-off-by: loli10K <ezomori.nozomu at gmail.com>
Closes #8942 
Delta    File
+1 -2    cmd/zfs/zfs_main.c
+1 -2    1 files

ZFS on Linux/src 9585497 module/zfs zfs_sysfs.c

Prevent pointer to an out-of-scope local variable

`show_str` could be a pointer to a local variable on the stack
that is already out of scope by the time
`return (snprintf(buf, buflen, "%s\n", show_str));`
is called.
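
The pattern, and one way to avoid it, can be sketched as follows. This
is an illustration only: show_value() and its parameters are
hypothetical and are not the zfs_sysfs.c code.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical example: format either a number or a fixed string.
 * Returns what snprintf() returns for the final "%s\n" formatting.
 */
static int
show_value(char *buf, size_t buflen, int is_number, long number,
    const char *text)
{
	/* Declared at function scope so it outlives the branch below. */
	char tmp[32];
	const char *show_str;

	assert(buf != NULL);
	if (is_number) {
		/*
		 * The buggy variant declares tmp inside this block; its
		 * lifetime then ends at the closing brace while show_str
		 * still points at it.
		 */
		(void) snprintf(tmp, sizeof (tmp), "%ld", number);
		show_str = tmp;
	} else {
		show_str = text;
	}
	return (snprintf(buf, buflen, "%s\n", show_str));
}
```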

Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #8924 
Closes #8940 

ZFS on Linux/src accd6d9 module/zfs zio.c

dedup=verify doesn't clear the blkptr's dedup flag

The logic to handle strong checksum collisions where the data doesn't
match is incorrect. It is not clearing the dedup bit of the blkptr,
which can cause a panic later in zio_ddt_free() due to the dedup table
not matching what is in the blkptr.

Reviewed-by: Tom Caputi <tcaputi at datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
External-issue: DLPX-48097
Closes #8936 
Delta    File
+2 -0    module/zfs/zio.c
+2 -0    1 files

ZFS on Linux/src a64f827 module/zfs vdev_mirror.c vdev_file.c

Update vdev_ops_t from illumos

Align vdev_ops_t from illumos for better compatibility.

Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Igor Kozhukhov <igor at dilos.org>
Closes #8925 

ZFS on Linux/src da68988 lib/libzfs libzfs_sendrecv.c libzfs_crypto.c, module/zfs dsl_crypt.c dmu_recv.c

Allow unencrypted children of encrypted datasets

When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.

Reviewed-by: Jason King <jason.king at joyent.com>
Reviewed-by: Sean Eric Fagan <sef at ixsystems.com>
Reviewed-by: Richard Laager <rlaager at wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Tom Caputi <tcaputi at datto.com>
Closes #8737 
Closes #8870 

ZFS on Linux/src 84b4201 contrib/dracut/90zfs zfs-lib.sh.in

Replace whereis with type in zfs-lib.sh

The whereis command should not be used since it may not exist 
in the initramfs.  The dracut plymouth module also uses the type
command instead of whereis.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Garrett Fields <ghfields at gmail.com>
Signed-off-by: Dacian Reece-Stremtan <dacianstremtan at gmail.com>
Closes #8920 
Closes #8938 

ZFS on Linux/src 050d720 cmd/ztest ztest.c, lib/libzfs libzfs_pool.c

Remove dedupditto functionality

If dedup is in use, the `dedupditto` property can be set, causing ZFS to
keep an extra copy of data that is referenced many times (>100x).  The
idea was that this data is more important than other data and thus we
want to be really sure that it is not lost if the disk experiences a
small amount of random corruption.

ZFS (and system administrators) rely on the pool-level redundancy to
protect their data (e.g. mirroring or RAIDZ).  Since the user/sysadmin
doesn't have control over what data will be offered extra redundancy by
dedupditto, this extra redundancy is not very useful.  The bulk of the
data is still vulnerable to loss based on the pool-level redundancy.
For example, if particle strikes corrupt 0.1% of blocks, you will either
be saved by mirror/raidz, or you will be sad.  This is true even if
dedupditto saved another 0.01% of blocks from being corrupted.

Therefore, the dedupditto functionality is rarely enabled (i.e. the
property is rarely set), and it fulfills its promise of increased
redundancy even more rarely.

Additionally, this feature does not work as advertised (on existing
releases), because scrub/resilver did not repair the extra (dedupditto)
copy (see https://github.com/zfsonlinux/zfs/pull/8270).


    [17 lines not shown]

ZFS on Linux/src fb0be12 lib/libzfs_core libzfs_core.c, lib/libzpool util.c

Use ZFS_DEV macro instead of literals

The rest of the code/comments use ZFS_DEV, so sync with that.

Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling at RichardElling.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #8912 

ZFS on Linux/src 0b755ec cmd/zpool zpool_vdev.c

Fix memory leak in check_disk()

Reviewed-by: Allan Jude <allanjude at freebsd.org>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling at RichardElling.com>
Signed-off-by: Michael Niewöhner <foss at mniewoehner.de>
Closes #8897  
Closes #8911 

ZFS on Linux/src c308b1d rpm/redhat zfs-kmod.spec.in

kmod-zfs-devel rpm should provide kmod-spl-devel

When configure is run with --with-spec=redhat, and rpms are built, the
kmod-zfs-devel package is missing

Provides: kmod-spl-devel = %{version}

which is required by software such as Lustre which builds against zfs
kmods.  Adding it makes it easier for such software to build against
both zfs-0.7 (where SPL is separate and may be missing) and zfs-0.8.

Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Olaf Faaland <faaland1 at llnl.gov>
Closes #8930 

ZFS on Linux/src 4ca457b tests/zfs-tests/include libtest.shlib

ZTS: Fix mmp_interval failure

The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80 at gmail.com>
Reviewed-by: Tony Hutter <hutter2 at llnl.gov>
Reviewed-by: Olaf Faaland <faaland1 at llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling at RichardElling.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Closes #8906 

ZFS on Linux/src 30af21b lib/libzfs libzfs_sendrecv.c, module/zfs dmu_send.c dmu_recv.c

Implement Redacted Send/Receive

Redacted send/receive allows users to send subsets of their data to 
a target system. One possible use case for this feature is to not 
transmit sensitive information to a data warehousing, test/dev, or 
analytics environment. Another is to save space by not replicating 
unimportant data within a given dataset, for example in backup tools 
like zrepl.

Redacted send/receive is a three-stage process. First, a clone (or 
clones) is made of the snapshot to be sent to the target. In this 
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction 
snapshot" (or snapshots). Second, the new zfs redact command is used 
to create a redaction bookmark. The redaction bookmark stores the 
list of blocks in a snapshot that were modified by the redaction 
snapshot(s). Finally, the redaction bookmark is passed as a parameter 
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive 
or unwanted information, and those blocks are not included in the send 
stream.  When sending from the redaction bookmark, the blocks it 
contains are considered as candidate blocks in addition to those 
blocks in the destination snapshot that were modified since the 
creation_txg of the redaction bookmark.  This step is necessary to 
allow the target to rehydrate data in the case where some blocks are 

    [23 lines not shown]

ZFS on Linux/src c1b5801 module/zfs arc.c

Minimize aggsum_compare(&arc_size, arc_c) calls.

On a busy ARC, having arc_size close to arc_c is the desired state.  In
that state, however, aggsum_compare(&arc_size, arc_c) is quite likely to
need to flush the per-CPU buckets to get an exact comparison result.
Doing that often in a hot path defeats the purpose of using aggsum
there, since it replaces a few simple atomic additions with dozens of
lock acquisitions.

Replacing aggsum_compare() with aggsum_upper_bound() in the code that
increases arc_p while the ARC is growing (arc_size < arc_c) saves,
according to PMC profiles, ~5% of CPU time in aggsum code during
sequential writes to 12 ZVOLs with a 16KB block size on a large
dual-socket system.

I suppose there is some minor change in arc_p behavior due to the lower
precision of the new code, but I don't think it is a big deal, since it
should affect only a very small window in time (aggsum buckets are
flushed every second) and in ARC size (buckets are limited to 10 average
ARC blocks per CPU).

Reviewed-by: Chris Dunlop <chris at onthe.net.au>
Reviewed-by: Richard Elling <Richard.Elling at RichardElling.com>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Reviewed-by: Allan Jude <allanjude at freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by:  Alexander Motin <mav at FreeBSD.org>
Closes #8901 
Delta    File
+1 -1    module/zfs/arc.c
+1 -1    1 files

ZFS on Linux/src 63b88f7 . META

Tag zfs-0.8.1

META file and changelog updated.

Signed-off-by: Tony Hutter <hutter2 at llnl.gov>
Delta    File
+1 -1    META
+1 -1    1 files

ZFS on Linux/src b1b4ac2 config always-python.m4 always-pyzfs.m4

Python config cleanup

Don't require Python at configure/build unless building pyzfs.
Move ZFS_AC_PYTHON_MODULE to always-pyzfs.m4 where it is used.
Make test syntax more consistent.

Sponsored by: iXsystems, Inc.
Reviewed-by: Neal Gompa <ngompa at datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Ryan Moeller <ryan at ixsystems.com>
Closes #8895 

ZFS on Linux/src 7218b29 include/sys zio_compress.h

lz4_decompress_abd declared but not defined

`lz4_decompress_abd` is declared in zio_compress.h but it is not defined
anywhere. The declaration should be removed.

Reviewed by: Dan Kimmel <dan.kimmel at delphix.com>
Reviewed-by: Allan Jude <allanjude at freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
External-issue: DLPX-47477
Closes #8894 

ZFS on Linux/src 53dce5a include/sys vdev_removal.h, man/man5 zfs-module-parameters.5

panic in removal_remap test on 4K devices

If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.

On debug bits this assertion will fail in spa_vdev_copy_segment():
    ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);

On nondebug, the system panics with a stack like:
    metaslab_free_concrete()
    metaslab_free_impl()
    metaslab_free_impl_cb()
    vdev_indirect_remap()
    free_from_removing_vdev()
    metaslab_free_impl()
    metaslab_free_dva()
    metaslab_free()

Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default.  We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.

This change makes the zfs_remove_max_segment tunable more robust,

    [13 lines not shown]

ZFS on Linux/src be89734 man/man5 zfs-module-parameters.5, module/zfs zio.c

compress metadata in later sync passes

Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable
compression (including of metadata).  Ostensibly this helps the sync
passes to converge (i.e. for a sync pass to not need to allocate
anything because it is 100% overwrites).

However, in practice it increases the average number of sync passes,
because when we turn compression off, the sizes of many blocks change
and thus we have to re-allocate (not overwrite) them.  It also increases
the number of 128KB allocations (e.g. for indirect blocks and spacemaps)
because these will not be compressed.  The 128K allocations are
especially detrimental to performance on highly fragmented systems,
which may have very few free segments of this size, and may need to load
new metaslabs to satisfy 128K allocations.

We should increase zfs_sync_pass_dont_compress.  In practice on a highly
fragmented system we see a few 5-pass txg's, a tiny number of 6-pass
txg's, and no txg's with more than 6 passes.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling at RichardElling.com>
Reviewed by: Pavel Zakharov <pavel.zakharov at delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim at delphix.com>
Reviewed-by: George Wilson <george.wilson at delphix.com>

    [3 lines not shown]

ZFS on Linux/src ae5c78e module/zfs vdev_queue.c

Move write aggregation memory copy out of vq_lock

A memory copy is too heavy an operation to perform under the congested
lock.  Moving it out reduces the congestion many times over, to the
point of being almost invisible.  Since the original zio has been
removed from the queue and the child zio has not executed yet, I don't
see why the copy would need protection.  My guess is that it simply
remained this way from the time when the lock was not dropped here; the
drop was added later to fix a lock ordering issue.

Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show a major
reduction in lock congestion, saving from 15% to 35% of CPU time
and increasing throughput by 10% to 40%.

Reviewed-by: Richard Yao <ryao at gentoo.org>
Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by:  Alexander Motin <mav at FreeBSD.org>
Closes #8890 
Delta     File
+12 -10   module/zfs/vdev_queue.c
+12 -10   1 files

ZFS on Linux/src d3230d7 man/man5 zfs-module-parameters.5, module/zfs metaslab.c

looping in metaslab_block_picker impacts performance on fragmented pools

On fragmented pools with high-performance storage, the looping in
metaslab_block_picker() can become the performance-limiting bottleneck.
When looking for a larger block (e.g. a 128K block for the ZIL), we may
search through many free segments (up to hundreds of thousands) to find
one that is large enough to satisfy the allocation. This can take a long
time (up to dozens of ms), and is done while holding the ms_lock, which
other threads may spin waiting for.

When this performance problem is encountered, profiling will show
high CPU time in metaslab_block_picker, as well as in mutex_enter from
various callers.

The problem is very evident on a test system with a sync write workload
with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage,
84% full and 88% fragmented. It has also been observed on production
systems with 90TB of storage, 76% full and 87% fragmented.

The fix is to change metaslab_df_alloc() to search only up to 16MB from
the previous allocation (of this alignment). After that, we will pick a
segment that is of the exact size requested (or larger). This reduces
the number of iterations to a few hundred on fragmented pools (a ~100x
improvement).
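
The two-phase search described above can be sketched roughly as below.
This is a simplified model, not metaslab_df_alloc(): free segments are
held in a plain array sorted by offset, and the names seg_t,
pick_segment(), NO_SEG and max_search are hypothetical. A real allocator
would presumably use a size-sorted structure for the fallback instead of
the linear scan used here for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free segment: [start, start + size). */
typedef struct seg {
	uint64_t start;
	uint64_t size;
} seg_t;

#define	NO_SEG	((size_t)-1)

/*
 * segs[] is sorted by start offset. Returns the index of the chosen
 * segment, or NO_SEG if no segment is large enough.
 */
static size_t
pick_segment(const seg_t *segs, size_t nsegs, uint64_t cursor,
    uint64_t size, uint64_t max_search)
{
	size_t best = NO_SEG;

	assert(segs != NULL || nsegs == 0);

	/* Phase 1: first fit, but only within max_search bytes of cursor. */
	for (size_t i = 0; i < nsegs; i++) {
		if (segs[i].start < cursor)
			continue;
		if (segs[i].start - cursor > max_search)
			break;		/* give up on the linear walk */
		if (segs[i].size >= size)
			return (i);
	}

	/* Phase 2: smallest segment that is large enough, anywhere. */
	for (size_t i = 0; i < nsegs; i++) {
		if (segs[i].size >= size &&
		    (best == NO_SEG || segs[i].size < segs[best].size))
			best = i;
	}
	return (best);
}
```

Bounding phase 1 is what limits the time spent walking small free
segments while the lock is held.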


    [8 lines not shown]

ZFS on Linux/src 9c7da9a include zfs_namecheck.h, lib/libzfs libzfs_dataset.c

Restrict filesystem creation if name referred either '.' or '..'

This change rejects filesystem creation when the given name
refers to either '.' or '..'.
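
A minimal sketch of such a check, assuming the restriction applies to
whole '/'-separated components of the name (so "pool/fs.1" stays valid).
The function has_dot_component() is hypothetical and is not the
zfs_namecheck code.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical check: returns 1 if any '/'-separated component of name
 * is exactly "." or "..", and 0 otherwise.
 */
static int
has_dot_component(const char *name)
{
	const char *p = name;

	assert(name != NULL);
	while (*p != '\0') {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 1 && p[0] == '.')
			return (1);
		if (len == 2 && p[0] == '.' && p[1] == '.')
			return (1);
		p += len;
		if (*p == '/')
			p++;
	}
	return (0);
}
```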

Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling at RichardElling.com>
Signed-off-by: TulsiJain <tulsi.jain at delphix.com>
Closes #8842 
Closes #8564 

ZFS on Linux/src 3475724 module/zfs vdev_removal.c

ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread()

When running zloop, we occasionally see the following crash:

    dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0)
    ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()
    /sbin/ztest(+0x89c3)[0x55faf567b9c3]

The error value 0x1c is ENOSPC.

The transaction used by spa_vdev_remove_thread() should not be able to
fail due to being out of space, i.e. we should not call
dmu_tx_hold_space().  This will allow the removal thread to schedule its
work even when the pool is low on space.  The "slop space" will provide
enough free space to sync out the txg.

Reviewed-by: Igor Kozhukhov <igor at dilos.org>
Reviewed-by: Paul Dagnelie <pcd at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
External-issue: DLPX-37853
Closes #8889 

ZFS on Linux/src daddbdc module/zfs zfs_sysfs.c

Fix lockdep warning on insmod

sysfs_attr_init() is required to make lockdep happy for dynamically
allocated sysfs attributes. This fixed #8868 on Fedora 29 running
kernel-debug.

This requirement was introduced in 2.6.34.
See include/linux/sysfs.h for what it actually does.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Olaf Faaland <faaland1 at llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro at gmail.com>
Closes #8868 
Closes #8884 

ZFS on Linux/src d9b4bf0 include/sys zap.h, man/man5 zfs-module-parameters.5

fat zap should prefetch when iterating

When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.

For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.

Reviewed-by: Andreas Dilger <andreas.dilger at whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
External-issue: DLPX-58347
Closes #8862 

ZFS on Linux/src d9cd66e module/zfs arc.c

Target ARC size can get reduced to arc_c_min

Sometimes the target ARC size is reduced to arc_c_min, which impacts
performance.  We've seen this happen as part of the random_reads
performance regression test, where the ARC size is reduced before the
reads test starts, which increases the time it takes for the system to
reach good IOPS performance.

We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE,
and arc_available_memory() is less than arc_c>>arc_shrink_shift.

However, arc_available_memory() could easily be low, even when arc_c is
low, because we can have tons of unused bufs in the abd kmem cache. This
would be especially true just after the DMU requests a bunch of stuff be
evicted from the ARC (e.g. due to "zpool export").

To fix this, the ARC should reduce arc_c by the requested amount, not
all the way down to arc_size (or arc_c_min), which can be very small.
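
As a rough illustration of the difference described above, the sketch
below models the two policies in user space. The struct and function
names are hypothetical stand-ins, not the actual arc.c code; only
arc_c, arc_c_min and arc_size come from the commit message.

```c
#include <stdint.h>

/* Hypothetical model of the ARC sizing state (not the real arc.c globals). */
struct arc_model {
	uint64_t c;	/* target size, stands in for arc_c */
	uint64_t c_min;	/* lower bound, stands in for arc_c_min */
	uint64_t size;	/* current size, stands in for arc_size */
};

/*
 * Assumed model of the behavior the message calls problematic: the
 * target collapses to the current size, floored at c_min.
 */
static uint64_t
reduce_to_size(const struct arc_model *a)
{
	return (a->size > a->c_min ? a->size : a->c_min);
}

/*
 * Model of the described fix: lower the target only by the requested
 * amount, never going below c_min.
 */
static uint64_t
reduce_by_amount(const struct arc_model *a, uint64_t to_free)
{
	if (a->c > to_free && a->c - to_free > a->c_min)
		return (a->c - to_free);
	return (a->c_min);
}
```

With a target of 1000, a minimum of 100 and a nearly empty cache
(size 50), a request to free 200 leaves the target at 800 under the
second policy, while the first policy drops it straight to 100.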

Reviewed-by: Tim Chase <tim at chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens at delphix.com>
External-issue: DLPX-59431
Closes #8864 
Delta    File
+0 -2    module/zfs/arc.c
+0 -2    1 files

ZFS on Linux/src 10269e0module/zfs vdev_raidz_math.c

Fix typo in vdev_raidz_math.c

Fix typo in vdev_raidz_math.c

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Brad Forschinger <github at bnjf.id.au>
Closes #8875 
Closes #8880 

ZFS on Linux/src 7288881module/zfs arc.c

Fix comparison signedness in arc_is_overflowing()

When the ARC size is very small, aggsum_lower_bound(&arc_size) may
return a negative value.  Because the comparison was unsigned, that
value was treated as very large, which caused delays while waiting for
arc_adjust() to "fix" it by calling aggsum_value(&arc_size).  Using a
signed comparison there fixes the problem.
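
The effect can be reproduced in a small user-space sketch.  The
function and parameter names below are hypothetical; the real check in
arc_is_overflowing() compares aggsum_lower_bound(&arc_size) against a
threshold derived from the ARC target, which is modeled here as
target + slack.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Unsigned comparison: a negative lower bound is converted to a huge
 * unsigned value, so the check wrongly reports an overflow.
 */
static bool
overflowing_unsigned(int64_t lower_bound, uint64_t target, uint64_t slack)
{
	return ((uint64_t)lower_bound >= target + slack);
}

/*
 * Signed comparison: a negative lower bound stays below the
 * threshold, so no overflow is reported.
 */
static bool
overflowing_signed(int64_t lower_bound, uint64_t target, uint64_t slack)
{
	return (lower_bound >= (int64_t)(target + slack));
}
```

For a lower bound of -4096 and a 1 MiB target, the unsigned variant
reports an overflow and the signed variant does not; both agree once
the lower bound is a genuinely large positive value.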

Reviewed-by: Matt Ahrens <mahrens at delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: George Melikov <mail at gmelikov.ru>
Signed-off-by: Alexander Motin <mav at FreeBSD.org>
Closes #8873
Delta    File
+2 -2    module/zfs/arc.c
+2 -2    1 files

ZFS on Linux/src 581c77emodule/zfs dmu_recv.c

Fix incorrect error message for raw receive

This patch fixes an incorrect error message that comes up when
doing a non-forcing, raw, incremental receive into a dataset
that has a newer snapshot than the "from" snapshot. In this
case, the current code prints a confusing message about an IVset
guid mismatch.

This functionality is supported by non-raw receives as an
undocumented feature, but was never supported by the raw receive
code. If this is desired in the future, we can probably figure
out a way to make it work.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens at delphix.com>
Signed-off-by: Tom Caputi <tcaputi at datto.com>
Issue #8758
Closes #8863

ZFS on Linux/src ba505f9cmd/arc_summary Makefile.am

arc_summary: prefer python3 version and install when there is no python

This matches the behavior of other python scripts, such as arcstat and
dbufstat, which are always installed but whose install-exec-hook actions
will simply touch up the shebang if a python interpreter was configured
*and* that interpreter is a python2 interpreter.

Fixes installation in a minimal build chroot without python available.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Ryan Moeller <ryan at freqlabs.com>
Signed-off-by: Eli Schwartz <eschwartz at archlinux.org>
Closes #8851

ZFS on Linux/src eaa21b2scripts kmodtool

Fix %post and %postun generation in kmodtool

During zfs-kmod RPM build, $(uname -r) gets unintentionally evaluated on
the build host, once and for all. It should be evaluated during the
execution of the scriptlets on the installation host. Escaping the $
character avoids evaluating it during build.

Reviewed-by: Brian Behlendorf <behlendorf1 at llnl.gov>
Reviewed-by: Olaf Faaland <faaland1 at llnl.gov>
Reviewed-by: Neal Gompa <ngompa at datto.com>
Signed-off-by: Samuel Verschelde <stormi-xcp at ylix.fr>
Closes #8866
Delta    File
+2 -2    scripts/kmodtool
+2 -2    1 files

ZFS on Linux/src 5662fd5include/sys abd.h, module/zfs abd.c arc.c

single-chunk scatter ABDs can be treated as linear

Scatter ABDs are allocated from a number of pages.  In contrast to
linear ABDs, these pages are disjoint in the kernel's virtual address
space, so they can't be accessed as a contiguous buffer.  Therefore
routines that need a linear buffer (e.g. abd_borrow_buf() and friends)
must allocate a separate linear buffer (with zio_buf_alloc()), and copy
the contents of the pages to/from the linear buffer.  This can have a
measurable performance overhead on some workloads.

https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e
("abd_alloc should use scatter for >1K allocations") increased the use
of scatter ABDs, specifically switching 1.5K through 4K (inclusive)
buffers from linear to scatter.  For workloads that access blocks whose
compressed sizes are in this range, that commit introduced an additional
copy into the read code path.  For example, performance of the
sequential_reads_arc_cached tests in the test suite dropped by around
5% (these tests read 8K-logical blocks, compressed to 3K, which are
cached in the ARC).

This commit treats single-chunk scattered buffers as linear buffers,
because they are contiguous in the kernel's virtual address space.
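
A minimal sketch of the idea, using a hypothetical user-space model
rather than the real abd_t: a buffer can be handed out directly,
without the extra allocation and copy, whenever it is either linear or
a scatter buffer made of exactly one chunk.  The struct fields and
function names are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical model of an ABD (not the real abd_t layout). */
struct abd_model {
	bool linear;	/* allocated as one contiguous buffer */
	size_t nchunks;	/* number of separate chunks if scatter */
};

/* A single-chunk scatter buffer is contiguous, so treat it as linear. */
static bool
can_treat_as_linear(const struct abd_model *abd)
{
	return (abd->linear || abd->nchunks == 1);
}

/*
 * Borrowing a linear view needs a separate buffer and a copy only
 * when the data really is spread over several chunks.
 */
static bool
borrow_needs_copy(const struct abd_model *abd)
{
	return (!can_treat_as_linear(abd));
}
```

In this model a one-chunk scatter buffer takes the same no-copy path
as a linear buffer, while a two-chunk scatter buffer still requires
the copy.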

All single-page (4K) ABDs can be represented this way.  Some multi-page
ABDs can also be represented this way, if we were able to allocate a

    [20 lines not shown]