DragonFlyBSD/src e3c330f  sys/kern kern_synch.c vfs_bio.c, sys/platform/pc64/include pmap.h

kernel - VM rework part 12 - Core pmap work, stabilize & optimize

* Add tracking for the number of PTEs mapped writeable in md_page.
  Change how PG_WRITEABLE and PG_MAPPED are cleared in the vm_page
  to avoid clear/set races.  The races occurred because we would
  otherwise have tried to clear the bits without hard-busying the
  page.  This change allows the bits to be set with only an atomic op.

  Procedures which test these bits universally do so while holding
  the page hard-busied, and now call pmap_mapped_sync() beforehand
  to properly synchronize the bits.
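
  A minimal sketch of the intended flow, assuming the md_page counters
  and flag names referenced in this commit (the helpers below are
  illustrative, not the actual kernel code):

      /*
       * Setting the bits needs only atomic ops; pmap insertion paths
       * bump the counters and set the flags without busying the page.
       */
      static void
      page_track_mapping(vm_page_t m, int writeable)  /* illustrative */
      {
              atomic_add_long(&m->md.pmap_count, 1);
              atomic_set_int(&m->flags, PG_MAPPED);
              if (writeable) {
                      atomic_add_long(&m->md.writeable_count, 1);
                      atomic_set_int(&m->flags, PG_WRITEABLE);
              }
      }

      /*
       * Clearing is deferred to pmap_mapped_sync(), called only with
       * the page hard-busied, so a concurrent set cannot race a clear.
       */
      static void
      pmap_mapped_sync(vm_page_t m)                   /* sketch only */
      {
              if (m->md.pmap_count == 0)
                      atomic_clear_int(&m->flags, PG_MAPPED | PG_WRITEABLE);
              else if (m->md.writeable_count == 0)
                      atomic_clear_int(&m->flags, PG_WRITEABLE);
      }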

* Fix bugs related to various counters: pm_stats.resident_count,
  wiring counts, vm_page->md.writeable_count, and
  vm_page->md.pmap_count.

* Fix bugs related to synchronizing removed PTEs with the vm_page.
  Fix one case where we improperly updated (m)'s state after losing
  a race against a pte swap-to-0 (pulling the pte).
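
  The rule being enforced is roughly: only the thread whose atomic swap
  actually pulled a valid pte may fold that pte's state back into (m).
  A hedged sketch in the DragonFly pmap_bits[] style, not the literal
  code:

      opte = atomic_swap_long(ptep, 0);
      if (opte & pmap->pmap_bits[PG_V_IDX]) {
              /* we pulled the pte; we own the vm_page update */
              if (opte & pmap->pmap_bits[PG_M_IDX])
                      vm_page_dirty(m);
              /* adjust md.pmap_count / md.writeable_count here */
      } else {
              /* lost the race against another swap-to-0; do nothing */
      }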

* Fix a bug related to the page soft-busying code when the
  m->object/m->pindex race is lost.

* Implement a heuristic version of vm_page_active() which just
  updates act_count, unlocked, if the page is already in the
  PQ_ACTIVE queue or is fictitious.
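
  A minimal sketch of the heuristic, with the fast path unlocked (the
  queue test and slow-path call follow existing vm_page idioms but are
  illustrative here):

      void
      vm_page_active(vm_page_t m)             /* sketch only */
      {
              if ((m->queue - m->pc) == PQ_ACTIVE ||
                  (m->flags & PG_FICTITIOUS)) {
                      if (m->act_count < ACT_MAX)
                              ++m->act_count; /* unlocked, heuristic */
                      return;
              }
              vm_page_activate(m);            /* normal locked requeue */
      }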

* Allow races against the backing scan for pmap_remove_all() and
  pmap_page_protect(VM_PROT_READ).  Callers of these routines expect
  full synchronization of the page dirty state in these cases.
  We can identify when a page has not been fully cleaned out by
  checking vm_page->md.pmap_count and vm_page->md.writeable_count.
  In the rare situation where this happens, simply retry.
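
  Conceptually the retry looks like this (the scan helper and its
  placement are hypothetical; only the md counters come from this
  commit):

      /* inside pmap_remove_all(m), roughly: */
      do {
              pmap_remove_all_scan(m);        /* backing scan, may race */
      } while (m->md.pmap_count != 0);

      /* pmap_page_protect(m, VM_PROT_READ) retries on md.writeable_count */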

* Assert that the PTE pindex is properly interlocked in pmap_enter().
  We still allow PTEs to be pulled by other routines without the
  interlock, but multiple pmap_enter()s of the same page will be
  interlocked.

* Assert additional wiring count failure cases.

* (UNTESTED) Flag DEVICE pages (dev_pager_getfake()) as being
  PG_UNMANAGED.  This essentially prevents all the various
  reference counters (e.g. vm_page->md.pmap_count and
  vm_page->md.writeable_count), PG_M, PG_A, etc. from being
  updated.

  The vm_pages aren't tracked in the pmap at all because there is
  no way to find them: they are 'fake', so without a pv_entry we
  can't track them.  Instead we simply rely on the vm_map_backing
  scan to manipulate the PTEs.
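
  The practical effect in the pmap is a guard of roughly this shape
  (sketch, not the literal test sites):

      if (m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) {
              /*
               * Fake device page: no pv_entry, no md.pmap_count /
               * md.writeable_count updates, no PG_M / PG_A updates.
               */
              return;
      }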

* Optimize the new vm_map_entry_shadow() to use a shared object
  token instead of an exclusive one.  OBJ_ONEMAPPING will be cleared
  with the shared token.
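
  Roughly, the object now only needs to be held shared for this path;
  a hedged sketch using the existing shared-hold primitives (the
  surrounding shadow setup is elided):

      vm_object_hold_shared(object);                  /* was exclusive */
      vm_object_clear_flag(object, OBJ_ONEMAPPING);   /* atomic, ok shared */
      /* ... set up the shadow chain ... */
      vm_object_drop(object);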

* Optimize single-threaded access to pmaps to avoid pmap_inval_*()
  complexities.

* Apply __read_mostly to more globals.

* Optimize pmap_testbit(), pmap_clearbit(), and pmap_page_protect().
  Pre-check vm_page->md.writeable_count and vm_page->md.pmap_count
  for an easy degenerate return before doing any real work.
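
  The degenerate test is just the counters, e.g. (sketch; the
  write-access predicate is illustrative):

      if (m->md.pmap_count == 0)
              return;                 /* not mapped anywhere, nothing to do */
      if (clearing_write_access && m->md.writeable_count == 0)
              return;                 /* already read-only in all pmaps */
      /* ...otherwise fall through to the backing/pv scan... */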

* Optimize pmap_inval_smp() and pmap_inval_smp_cmpset() for the
  single-threaded pmap case, when called on the same CPU the pmap
  is associated with.  This allows us to use simple atomics and
  cpu_*() instructions and avoid the complexities of the
  pmap_inval_*() infrastructure.
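
  The fast-path check is roughly of this shape (the cpumask macros and
  the swap follow existing DragonFly style; treat the details as
  illustrative):

      if (pmap && CPUMASK_CMPMASKEQ(pmap->pm_active, mycpu->gd_cpumask)) {
              opte = atomic_swap_long(ptep, npte);    /* plain atomic swap */
              cpu_invlpg((void *)va);                 /* local TLB only */
              return opte;
      }
      /* otherwise fall through to the IPI-synchronized path */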

* Randomize the page queue used in bio_page_alloc().  This does not
  appear to hurt performance (e.g. under heavy tmpfs use) on large
  many-core NUMA machines, and it makes vm_page_alloc()'s job easier.

  This change might have a downside for temporary files, but for more
  long-lasting files there's no point allocating pages localized to a
  particular cpu.
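
  The randomization only needs to perturb the page color / queue index,
  e.g. (the counter name is hypothetical):

      /* bio_pgalloc_iter is a hypothetical global counter */
      static u_int bio_pgalloc_iter;

      u_int pg_color = atomic_fetchadd_int(&bio_pgalloc_iter, PQ_PRIME2) &
                       PQ_L2_MASK;
      /* hand pg_color to the page allocator instead of a cpu-local color */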

* Optimize vm_page_alloc().

  (1) Refactor the _vm_page_list_find*() routines to avoid re-scanning
      the same array indices over and over again when trying to find
      a page.

  (2) Add a heuristic, vpq.lastq, for each queue, which we set if a
      _vm_page_list_find*() operation had to go far afield to find its
      page.  Subsequent finds will skip to the far-afield position until
      the current CPU's queues have pages again (see the sketch after
      this list).

  (3) Reduce PQ_L2_SIZE from an extravagant 2048 entries per queue down
      to 1024.  The original 2048 was meant to provide 8-way
      set-associativity for 256 cores but wound up reducing performance
      due to longer index iterations.
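
  Sketch of the lastq heuristic from item (2); only vpq.lastq comes
  from this commit, the helper and its out-parameter are hypothetical:

      vpq = &vm_page_queues[basequeue + index];
      start = vpq->lastq ? vpq->lastq : index;        /* skip ahead if set */
      m = _vm_page_list_find_wide(basequeue, start, &found_at);
      if (m) {
              if (found_at == index)
                      vpq->lastq = 0;         /* local queues have pages again */
              else
                      vpq->lastq = found_at;  /* stay far afield next time */
      }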

* Refactor the vm_page_hash[] array.  This array is used to shortcut
  vm_object locks and locate VM pages more quickly, without locks.
  The new code limits the size of the array to something more reasonable,
  implements a 4-way set-associative replacement policy using 'ticks',
  and rewrites the hashing math.
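
  A hedged sketch of the 4-way set-associative replacement using
  'ticks' as the recency stamp (the struct layout and names are
  illustrative):

      struct vm_page_hash_elm {
              vm_page_t       m;
              int             ticks;          /* last-use stamp */
      };

      struct vm_page_hash_elm *set, *best;    /* 4-entry set for this hash */
      int i;

      /* reuse an empty slot, or evict the entry with the oldest stamp */
      best = &set[0];
      for (i = 0; i < 4; ++i) {
              if (set[i].m == NULL) {
                      best = &set[i];
                      break;
              }
              if (set[i].ticks - best->ticks < 0)
                      best = &set[i];
      }
      best->m = m;
      best->ticks = ticks;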

* Effectively remove pmap_object_init_pt() for now.  In current tests
  it does not improve performance, probably because it maps pages that
  the program never actually uses.

* Remove vm_map_backing->refs.  This field is no longer used.

* Remove more of the old now-stale code related to use of pv_entry's
  for terminal PTEs.

* Remove more of the old shared page-table-page code.  This worked but
  could never be fully validated and was prone to bugs.  So remove it.
  In the future we will likely use larger 2MB and 1GB pages anyway.

* Remove pmap_softwait()/pmap_softhold()/pmap_softdone().

* Remove more #if 0'd code.
Delta       File
+258 -102   sys/platform/pc64/x86_64/pmap.c
+140 -69    sys/vm/vm_page.c
+18  -5     sys/vm/vm_page.h
+1   -21    sys/vm/vm_map.c
+19  -0     sys/platform/vkernel64/platform/pmap.c
+12  -5     sys/platform/pc64/x86_64/pmap_inval.c
+9   -1     sys/vm/vm_object.c
+4   -4     sys/kern/kern_synch.c
+4   -2     sys/vm/vm_page2.h
+5   -0     sys/kern/vfs_bio.c
+4   -1     sys/vm/vm_fault.c
+3   -1     sys/vm/vm_pageout.c
+3   -1     sys/vm/swap_pager.c
+2   -1     sys/kern/kern_fork.c
+2   -0     sys/vm/pmap.h
+1   -1     sys/vm/device_pager.c
+2   -0     sys/platform/pc64/include/pmap.h
+0   -1     sys/vm/vm_map.h
+487 -215   18 files
