kernel - VM rework part 20 - Fix vmmeter_neg_slop_cnt
* Fix some serious issues with the vmmeter_neg_slop_cnt calculation.
The main problem is that this calculation was causing
vmstats.v_free_min to be recalculated to a much higher value
than it should have been, resulting in systems starting
to page far earlier than they should.
For example, the 128G TR started paging tmpfs data with 25GB of
free memory, which was not intended. The correct target for that
amount of memory is more around 3GB.
* Remove vmmeter_neg_slop_cnt entirely and refactor the synchronization
code to be smarter. It will now synchronize vmstats fields whose
adjustments exceed -1024, but only if paging would actually be
needed in the worst-case scenario.
* This algorithm needs low-memory testing and might require more
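The refactored rule can be mocked up in userland, much as the author does for other algorithms in this series. The sketch below is purely illustrative; the slop threshold, cpu count, and variable names are assumptions, not the kernel's actual code:

```c
#include <assert.h>

/*
 * Illustrative model: each cpu accumulates a local adjustment and only
 * folds it into the global count when the adjustment exceeds -1024
 * AND the worst case (every cpu holding a similarly negative delta)
 * would actually dip below the paging target.
 */
#define SLOP	1024
#define NCPUS	4

static long global_free = 1000000;	/* stand-in for vmstats.v_free_count */
static long paging_target = 3000;	/* stand-in for the pageout threshold */
static long cpu_delta[NCPUS];		/* per-cpu unsynchronized deltas */

static void
vmstats_adjust(int cpu, long n)
{
	long worst;

	cpu_delta[cpu] += n;

	/* Worst case: every cpu holds a delta as negative as this one. */
	worst = global_free + cpu_delta[cpu] * NCPUS;

	if (cpu_delta[cpu] <= -SLOP && worst < paging_target) {
		global_free += cpu_delta[cpu];
		cpu_delta[cpu] = 0;
	}
}
```

With plenty of free memory a large negative delta stays per-cpu; only when the worst case approaches the paging target is it rolled into the global counter, which keeps global cache-line traffic low without mis-triggering the pageout daemon.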
kernel - Reduce/refactor nbuf and maxvnodes calculations.
* The prime motivation for this commit is to target about 1/20
(5%) of physical memory for use by the kernel. These changes
significantly reduce kernel memory usage on systems with less
than 4GB of ram (and more specifically on systems with less
than 1GB of ram), and also emplace more reasonable caps on
systems with 128GB+ of ram.
These changes return 100-200MB of ram to userland on systems
with 1GB of ram, and return around 6.5GB of ram on systems
with 128G of ram.
* The nbuf calculation and related code documentation were a bit
crufty, still somewhat designed for an earlier era, and were
calculating about twice the stated 5% target. For systems with
128GB of ram or less the calculation was simply creating too many
filesystem buffers, allowing as much as 10% of physical memory to
be locked up by the buffer cache.
Particularly on small systems, this 10% plus other kernel overheads
left a lot less memory available for user programs than we would
have liked. This work gets us closer to the 5% target.
* Change the base calculation from 1/10 of physical memory to 1/20
[49 lines not shown]
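The 1/20 base sizing described above reduces to simple arithmetic. The model below is hypothetical (the real calculation applies further clamps and per-range scaling not shown here):

```c
#include <assert.h>
#include <stdint.h>

/* The usual BSD maximum filesystem buffer size. */
#define MAXBSIZE	65536UL

/*
 * Hypothetical model of the new base calculation: target 1/20 (5%)
 * of physical memory for the buffer cache instead of the old 1/10.
 */
static unsigned long
calc_nbuf(uint64_t physmem_bytes)
{
	return (unsigned long)((physmem_bytes / 20) / MAXBSIZE);
}
```

Doubling physical memory doubles the buffer count until the large-memory caps mentioned above kick in.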
kernel - VM rework part 19 - Cleanup
* vmpageinfo breaks down the kernel load size, vm_page_array
size, and buffer headers for the buffer cache, all of which
are major boot-time wired kernel memory.
Note that the vm_page_array uses 3.1% of physical memory.
It's a lot, but there is no convenient way to make it less.
kernel - VM rework part 18 - Cleanup
* Significantly reduce the zone limit for pvzone (for pmap
pv_entry structures). pv_entry's are no longer allocated
on a per-page basis so the limit can be made much smaller.
This also has the effect of reducing the per-cpu cache limit
which ultimately stabilizes wired memory use for the zone.
* Also reduce the generic per-cpu cache limit for zones.
This only really affects the pvzone.
* Make pvzone, mapentzone, and swap_zone __read_mostly.
* Enhance vmstat -z, report current structural use and actual
total memory use.
* Also cleanup the copyright statement for vm/vm_zone.c. John Dyson's
original copyright was slightly different than the BSD copyright and
stipulated no changes, so separate out the DragonFly addendum.
kernel - VM rework part 17 - Cleanup
* Adjust kmapinfo and vmpageinfo in /usr/src/test/debug.
Enhance the code to display more useful information.
* Get pmap_page_stats_*() working again.
* Change systat -vm's 'VM' reporting. Replace VM-rss with PMAP and
VMRSS. Relabel VM-swp to SWAP and SWTOT.
PMAP - Amount of real memory faulted into user pmaps.
VMRSS - Sum of all process RSS's in the system. This is
the 'virtual' memory faulted into user pmaps and
includes shared pages.
SWAP - Amount of swap space currently in use.
SWTOT - Total amount of swap installed.
* Redocument vm_page.h.
* Remove dead code from pmap.c (some left over cruft from the
days when pv_entry's were used for PTEs).
kernel - VM rework part 15 - Core pmap work, refactor PG_*
* Augment PG_FICTITIOUS. This takes over some of PG_UNMANAGED's previous
capabilities. In addition, the pmap_*() API will work with fictitious
pages, making mmap() operations (e.g. of the GPU) more consistent.
* Add PG_UNQUEUED. This prevents a vm_page from being manipulated in
the vm_page_queues in any way. This takes over another feature
of the old PG_UNMANAGED flag.
* Remove PG_UNMANAGED.
* Remove PG_DEVICE_IDX. This is no longer relevant. We use PG_FICTITIOUS
for all device pages.
* Refactor vm_contig_pg_alloc(), vm_contig_pg_free(),
vm_page_alloc_contig(), and vm_page_free_contig().
These functions now set PG_FICTITIOUS | PG_UNQUEUED on the returned
pages, and properly clear the bits upon free or if/when a regular
(but special contig-managed) page is handed over to the normal paging
system.
This, combined with making the pmap_*() functions work better with
PG_FICTITIOUS, is the primary 'fix' for some of DRM's hacks.
kernel - VM rework part 16 - Optimization & cleanup pass
* Adjust __exclusive_cache_line to use 128-byte alignment as
per suggestion by mjg. Use this for the global vmstats.
* Add the vmmeter_neg_slop_cnt global, which is a more generous
dynamic calculation versus -VMMETER_SLOP_COUNT. The idea is to
reduce how often vm_page_alloc() synchronizes its per-cpu statistics
with the global vmstats.
kernel - VM rework part 10 - Precursor work for terminal pv_entry removal
* Effectively remove pmap_track_modified(). Turn it into an assertion.
The normal pmap code should NEVER EVER be called with any range inside
the clean map.
This assertion, and the routine in its entirety, will be removed in a
later commit.
* The purpose of the original code was to prevent buffer cache kvm mappings
from being misinterpreted as contributing to the underlying vm_page's
modified state. Normal paging operation synchronizes the modified bit and
then transfers responsibility to the buffer cache. We didn't want
manipulation of the buffer cache to further affect the modified bit for
the underlying pages.
In modern times, the buffer cache does NOT use a kernel_object based
mapping for anything and there should be no chance of any kernel related
pmap_enter() (entering a managed page into the kernel_pmap) from messing
with the space.
kernel - VM rework part 9 - Precursor work for terminal pv_entry removal
* Cleanup the API a bit
* Get rid of pmap_enter_quick()
* Remove unused procedures.
* Document that vm_page_protect() (and thus the related
pmap_page_protect()) must be called with a hard-busied page. This
ensures that the operation does not race a new pmap_enter() of the page.
kernel - VM rework part 14 - Core pmap work, stabilize for X/drm
* Don't gratuitously change the vm_page flags in the drm code.
The vm_phys_fictitious_reg_range() code in drm_vm.c was clearing
PG_UNMANAGED. It was only luck that this worked before;
because these are faked pages, PG_UNMANAGED must be set or the
system will implode trying to convert the physical address back
to a vm_page in certain routines.
The ttm code was setting PG_FICTITIOUS in order to prevent the
page from getting into the active or inactive queues (they had
a conditional test for PG_FICTITIOUS). But ttm never cleared
the bit before freeing the page. Remove the hack and instead
fix it in vm_page.c.
* in vm_object_terminate(), allow the case where there are still
wired pages in an OBJT_MGTDEVICE object that has wound up on a
queue (don't complain about it). This situation arises because the
ttm code uses the contig malloc API which returns wired pages.
NOTE: vm_page_activate()/vm_page_deactivate() are allowed to mess
with wired pages. Wired pages are not anything 'special' to
the queues, which allows us to avoid messing with the queues
when pages are assigned to the buffer cache.
kernel - VM rework part 11 - Core pmap work to remove terminal PVs
* Remove pv_entry_t belonging to terminal PTEs. The pv_entry's for
PT, PD, PDP, and PML4 remain. This reduces kernel memory use for
pv_entry's by 99%.
The pmap code now iterates vm_object->backing_list (of vm_map_backing
structures) to run-down pages for various operations.
* Remove vm_page->pv_list. This was one of the biggest sources of
contention for shared faults. However, in this first attempt I
am leaving all sorts of ref-counting intact so the contention has
not been entirely removed yet.
* Current hacks:
- Dynamic page table page removal currently disabled because the
vm_map_backing scan needs to be able to deterministically
run-down PTE pointers. Removal only occurs at program exit.
- PG_DEVICE_IDX probably isn't being handled properly yet.
- Shared page faults not yet optimized.
* So far minor improvements in performance across the board.
[4 lines not shown]
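The vm_object->backing_list iteration described in part 11 above boils down to a per-object TAILQ walk. A userland sketch follows; the structure and field names are illustrative stand-ins for the kernel's, not its actual definitions:

```c
#include <assert.h>
#include <sys/queue.h>

/* Simplified stand-ins for the kernel structures. */
struct vm_map_backing {
	TAILQ_ENTRY(vm_map_backing) entry;	/* linkage on the object */
	unsigned long start, end;		/* mapped address range */
};

struct vm_object {
	TAILQ_HEAD(, vm_map_backing) backing_list;
};

/*
 * Run down every mapping of the object covering address va, the way
 * the pmap code now runs down pages without terminal pv_entry's.
 */
static int
count_mappings(struct vm_object *obj, unsigned long va)
{
	struct vm_map_backing *ba;
	int n = 0;

	TAILQ_FOREACH(ba, &obj->backing_list, entry) {
		if (va >= ba->start && va < ba->end)
			n++;
	}
	return n;
}
```

In the real code this walk is protected by the vm_object spinlock (see part 8), which is why the backing ranges must be modified atomically with respect to that lock.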
kernel - VM rework part 13 - Core pmap work, stabilize & optimize
* Refactor the vm_page_hash hash again to get a better distribution.
* I tried to only hash shared objects but this resulted in a number of
edge cases where program re-use could miss the optimization.
* Add a sysctl vm.page_hash_vnode_only (default off). If turned on,
only vm_page's associated with vnodes will be hashed. This should
generally not be necessary.
* Refactor vm_page_list_find2() again to avoid all duplicate queue
checks. This time I mocked the algorithm up in userland and twisted
it until it did what I wanted.
* VM_FAULT_QUICK_DEBUG was accidentally left on, turn it off.
* Do not remove the original page from the pmap when vm_fault_object()
must do a COW. And just in case this is ever added back in later,
don't do it using pmap_remove_specific() !!! Use pmap_remove_pages()
to avoid the backing scan lock.
vm_fault_page() will now do this removal (for procfs rwmem), the normal
vm_fault will of course replace the page anyway, and the umtx code
uses different recovery mechanisms now and should be ok.
[4 lines not shown]
kernel - VM rework part 12 - Core pmap work, stabilize & optimize
* Add tracking for the number of PTEs mapped writeable in md_page.
Change how PG_WRITEABLE and PG_MAPPED is cleared in the vm_page
to avoid clear/set races. This problem occurs because we would
have otherwise tried to clear the bits without hard-busying the
page. This allows the bits to be set with only an atomic op.
Procedures which test these bits universally do so while holding
the page hard-busied, and now call pmap_mapped_sync() beforehand to
properly synchronize the bits.
* Fix bugs related to various counters: pm_stats.resident_count,
wiring counts, vm_page->md.writeable_count, and
* Fix bugs related to synchronizing removed pte's with the vm_page.
Fix one case where we were improperly updating (m)'s state based
on a lost race against a pte swap-to-0 (pulling the pte).
* Fix a bug related to the page soft-busying code when the
m->object/m->pindex race is lost.
* Implement a heuristic version of vm_page_active() which just
updates act_count unlocked if the page is already in the
[92 lines not shown]
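The set/clear discipline for PG_MAPPED and PG_WRITEABLE described in part 12 can be sketched in userland with C11 atomics. Flag names mirror the kernel's but the values and busy check are purely illustrative:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative flag bits; the values are arbitrary stand-ins. */
#define PG_MAPPED	0x0001u
#define PG_WRITEABLE	0x0002u
#define PG_BUSY		0x8000u

static _Atomic unsigned int pg_flags;

/* Setting only needs an atomic OR; no hard-busy requirement. */
static void
page_set_mapped(void)
{
	atomic_fetch_or(&pg_flags, PG_MAPPED);
}

/*
 * Clearing is only legal while the page is hard-busied, which is what
 * removes the clear/set race; the assert stands in for the kernel's
 * busy check done before pmap_mapped_sync().
 */
static void
page_sync_clear(void)
{
	assert(atomic_load(&pg_flags) & PG_BUSY);
	atomic_fetch_and(&pg_flags, ~(PG_MAPPED | PG_WRITEABLE));
}
```

Because setters never clear and clearers hold the busy "lock", a concurrent set can never be lost to a racing clear.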
<assert.h>: add missing __dead2 to __assert().
__assert() is called when an assertion fails. After printing an error
message, it will call abort(). abort() never returns, hence it has the
__dead2 attribute. Also add this attribute to __assert().
Taken-from: FreeBSD (r217207)
Submitted-by: Jan Beich
kernel - Remove improper direct user-space access
* chroot_kernel() (a privileged system call) was improperly
calling kprintf() with a direct user address. Just remove it.
kernel: Don't include <sys/user.h> in kernel code.
There is really no point in doing that because its main purpose is to
expose kernel structures to userland. In the majority of cases it wasn't
needed at all, and the rest required only a couple of other includes.
Fix building release on master.
* <histedit.h> was moved to /usr/include/priv on master, so add that
to the include search path when building sh(1) as a bootstrap tool.
* Fix the apropos(1) database generation (used for 'make distribution').
If the system doesn't have the makewhatis(8) for a compatible
database, just don't build one.
makedb: Fix apropos database generation better across release/master.
The apropos database format used by our new man(1) is different from and
incompatible with that used by our old man(1). The files are also named
differently: mandoc.db (new) and whatis (old).
So it makes no sense to use the old makewhatis on new systems or the
new makewhatis on old systems. If the desired makewhatis does not
exist, then we just don't generate the db, because the building system
doesn't have the makewhatis needed to generate it.
Once installed, the database will be updated regularly as per weekly
Revert "kernel - Clean up direction flag on syscall entry"
Actually not needed, the D flag is cleared via the mask
set in MSR_SF_MASK. Revert.
This reverts commit cea0e49dc0b2e5aea1b929d02f12d00df66528e2.
kernel - Implement support for SMAP and SMEP security
* Implement support for SMAP security. This prevents accidental
accesses to user address space from the kernel. When available,
we wrap intentional user-space accesses from the kernel with
the 'stac' and 'clac' instructions.
We use a NOP replacement policy to implement the feature. The wrapper
is initially a 'nop %eax' (3-byte NOP), and is replaced by 'stac' and
'clac' via a .section iteration when the feature is supported.
* Implement support for SMEP security. This prevents accidental
execution of user code from the kernel and simply requires
turning the bit on in CR4.
* Reports support in dmesg via the 'CPU Special Features Installed:'
line.
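While SMAP needs the stac/clac patching described above, the SMEP side really is just a CR4 bit (and SMAP's global enable is the adjacent bit). A trivial illustration of the bit arithmetic, using the Intel SDM bit positions; the real kernel writes CR4 via its load_cr4()-style helpers:

```c
#include <assert.h>

/* CR4 control bits (Intel SDM bit positions). */
#define CR4_SMEP	(1UL << 20)	/* no instruction fetch from user pages */
#define CR4_SMAP	(1UL << 21)	/* no data access to user pages */

/* Return the CR4 value with both protections enabled. */
static unsigned long
cr4_enable_smep_smap(unsigned long cr4)
{
	return cr4 | CR4_SMEP | CR4_SMAP;
}
```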
kernel - Implement retpoline for kernel
* Now that we have gcc-8 operational, we can turn on retpoline (software
spectre protection against the return stack buffer). Turn it on via
* No discernible performance loss with a generic buildkernel test:
Xeon e5-2620v4 x 2
time make -j 32 nativekernel (all tmpfs)
BEFORE 1717.427u 323.662s 2:28.49 1374.5% 9582+721k 200842+0io 4870pf+0w
BEFORE 1720.130u 338.635s 2:30.21 1370.5% 9555+720k 199720+0io 4804pf+0w
BEFORE 1722.395u 341.508s 2:30.71 1369.4% 9559+720k 199720+0io 4804pf+0w
AFTER 1720.271u 329.492s 2:28.27 1382.4% 9578+721k 200842+0io 4870pf+0w
AFTER 1736.268u 344.874s 2:30.90 1379.1% 9555+720k 199720+0io 4804pf+0w
AFTER 1726.056u 348.324s 2:31.14 1372.4% 9543+719k 199720+0io 4804pf+0w
Don't include "internal" headers outside of regular headers.
Include files like <sys/_timespec.h> and so on contain small parts
such as struct timespec that are supposed to be provided by multiple
regular headers. They should only be included by other headers, not
by *.c files.
None of these was actually needed except for the libtelnet one
(replaced with <stddef.h>).
kernel - VM rework part 8 - Precursor work for terminal pv_entry removal
* Adjust structures so the pmap code can iterate backing_ba's with
just the vm_object spinlock.
Add a ba.pmap back-pointer.
Move entry->start and entry->end into the ba (ba.start, ba.end).
This replicates the base entry->ba.start and entry->ba.end,
but local modifications are locked by individual objects to allow
pmap ops to just look at backing ba's iterated via the object.
Remove the entry->map back-pointer.
Remove the ba.entry_base back-pointer.
* ba.offset is now an absolute offset and not additive. Adjust all code
that calculates and uses ba.offset (fortunately it is all concentrated
in vm_map.c and vm_fault.c).
* Refactor ba.start/offset/end modifications to be atomic with
the necessary spin-locks to allow the pmap code to safely iterate
the vm_map_backing list for a vm_object.
* Test VM system with full synth run.
kernel - Add MDS mitigation support for Intel side-channel attack
* Add MDS (Microarchitectural Data Sampling) attack mitigation to
the kernel. This is an attack against Intel CPUs made from 2011
to date. The attack is not currently known to work against AMD CPUs.
With an Intel microcode update the mitigation can be enabled with
* Without the Intel microcode update, only disabling hyper-threading
gives you any protection. Older architectures might not get
support. If sysctl machdep.mds_support does not show support,
then the currently loaded microcode does not have support for the
* DragonFlyBSD only supports the MD_CLEAR mode, and it will only
be available with a microcode update from Intel.
Updating the microcode alone does not protect against the attack.
The microcode must be updated AND the mode must be turned on in
DragonFlyBSD to protect against the attack.
This mitigation burns around 250ns of additional latency on kernel->user
transitions (system calls and interrupts primarily). The additional
[10 lines not shown]
kernel - VM rework part 7 - Initial vm_map_backing index
* Implement a TAILQ and hang vm_map_backing structures off
of the related object. This feature is still in progress
and will eventually be used to allow pmaps to manipulate
vm_page's without pv_entry's.
At the same time, remove all sharing of vm_map_backing.
For example, clips no longer share the vm_map_backing. We
can't share the structures if they are being used to
itemize areas for pmap management.
TODO - reoptimize this at some point.
TODO - not yet quite deterministic enough for pmap
searches (due to clips).
* Refactor vm_object_reference_quick() to again allow
operation on any vm_object whose ref_count is already
at least 1, or which belongs to a vnode. The ref_count
is no longer being used for complex vm_object collapse,
shadowing, or migration code.
This allows us to avoid a number of unnecessary token
grabs on objects during clips, shadowing, and forks.
[7 lines not shown]
rtld-elf - Notify thread state to optimize relocations
* Add shims to allow libthread_xu to notify rtld when threading
is being used.
* Requires weak symbols in libc which are overridden by rtld-elf.
* Implement the feature in rtld-elf and use it to avoid making calls
to lwp_gettid(). When threaded, use tls_get_tcb() (which does not
require a system call) instead of lwp_gettid(). When not threaded,
just use a constant.
NOTE: We cannot use tls_get_tcb() unconditionally because the tcb
is not setup during early relocations. So do this whack-a-mole
to make it work.
* This leaves just the sigprocmask wrappers around rtld-elf (which
are needed to prevent stacked relocations from signal handlers).
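The weak-symbol notification pattern above can be sketched in a few lines. `_thr_is_threaded` here is a hypothetical hook, not the actual libc/rtld symbol: nothing in this sketch defines it, so the weak reference resolves to NULL and the single-threaded fast path is taken, just as an unthreaded process avoids the lwp_gettid() syscall:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Weak reference: resolves to NULL unless some other object (in the
 * real system, the threading library) supplies a definition.
 */
__attribute__((weak)) extern int _thr_is_threaded(void);

static long
fast_self_id(void)
{
	if (_thr_is_threaded != NULL && _thr_is_threaded())
		return 2;	/* threaded: would consult the tcb, no syscall */
	return 1;		/* not threaded: a constant suffices */
}
```

The same mechanism lets rtld-elf learn about threading without a hard dependency on libthread_xu.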
kernel - Restore kern.cam.da.X.trim_enabled sysctl
* This sysctl was not always being properly installed due to an
ordering and timing issue.
* The code was not setting the trim flag in the correct structure.