FreeBSD/src 013c58c sys/kern sched_ule.c

sched_ule: 32-bit platforms: Fix runq_print() after runq changes

The compiler would report a mismatch between the format and the actual
type of the runqueue status word because the latter is now
unconditionally defined as an 'unsigned long' (which has the "natural"
platform size) and the format expects a 'size_t', which expands to an
'unsigned int' on 32-bit platforms (although they are both of the same
actual size).

This worked before as the C type used depended on the architecture and
was set to 'uint32_t' aka 'unsigned int' on these 32-bit platforms.

Just fix the format (use 'l').  While here, remove outputting '0x' by
hand, instead relying on '#' (only difference is for 0, and is fine).
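
A minimal userland sketch of the format change, assuming a status word held in an 'unsigned long'. The helper name format_status_word() is hypothetical and is not the kernel's runq_print(); it only shows the 'l' length modifier combined with the '#' flag.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not the kernel's runq_print()): formats a status
 * word stored in an 'unsigned long' with the 'l' length modifier, and
 * lets the '#' flag emit the "0x" prefix instead of writing it by hand.
 */
static int
format_status_word(char *buf, size_t len, unsigned long word)
{
    assert(buf != NULL && len > 0);
    return (snprintf(buf, len, "%#lx", word));
}
```

With '#', a zero word is printed as "0" rather than "0x0", which is the only output difference the message mentions.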

runq_print() should be moved out of 'sched_ule.c' in a subsequent
commit.

Reported by:    Jenkins

    [4 lines not shown]
DeltaFile
+1 -1  sys/kern/sched_ule.c
+1 -1  1 files

FreeBSD/src 63c9b01 include/arm Makefile

arm64: lib32: Don't try to install removed <machine/runq.h>

Reported by:    Herbert J. Skuhra (herbert gojira.at)
Fixes:          79d8a99ee583 ("runq: Deduce most parameters, remove machine headers")
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
DeltaFile
+0 -1  include/arm/Makefile
+0 -1  1 files

FreeBSD/src abdbd85 stand/lua drawer.lua

lualoader: adapt builtin brand/logo definitions as well

While these should be moved to the new format, it wasn't my intention
to force them over immediately.  Downstreams may embed their own brands
in drawer.lua, and we shouldn't break them for something like this.

Move adapt_fb_shim() up and use it for preloaded definitions to avoid
forcing the matter for now.  Perhaps in the future we'll start writing
out warnings for those that do need to be adapted.

Reported by:    0x1eef on IRC
DeltaFile
+37 -31  stand/lua/drawer.lua
+37 -31  1 files

FreeBSD/src 9d0d55e sys/dev/ufshci ufshci_private.h

ufshci: Remove an unneeded variable definition

Reported by:    gcc
Fixes:          1349a733cf28 ("ufshci: Introduce the ufshci(4) driver")
DeltaFile
+0 -2  sys/dev/ufshci/ufshci_private.h
+0 -2  1 files

FreeBSD/src 359f590 sys/netinet/tcp_stacks rack.c

Fix a warning in the rack stack.

There is an initialization warning where 'error' may not be set when
logging extended BBlogs.  Fix this by initializing 'error' to zero so
that the warning goes away.
DeltaFile
+17 -11  sys/netinet/tcp_stacks/rack.c
+17 -11  1 files

FreeBSD/src 690f642 sbin/growfs/tests legacy_test.pl

growfs(8): use gpart(8) instead of bsdlabel(8) in test

bsdlabel(8) is deprecated

Reviewed by:    emaste
Differential Revision:  https://reviews.freebsd.org/D50865
DeltaFile
+3 -3  sbin/growfs/tests/legacy_test.pl
+3 -3  1 files

FreeBSD/src 46023d5 sys/netinet tcp_var.h

tcp: fixup wording in comment

Submitted by:   Steffen Nurpmeso <steffen sdaoden.eu>
Fixes:          b59753f1d55da6c6d4b73252444212e6895ce913
DeltaFile
+3 -3  sys/netinet/tcp_var.h
+3 -3  1 files

FreeBSD/src 1d8f8f3 bin/ps print.c, usr.bin/top machine.c

ps(1), top(1): Priority: Let 0 be the first timesharing level

Change the origin from PZERO to PUSER.

Doing so allows users to immediately detect if some thread is running
under a high priority (kernel or realtime) or under a low one
(timesharing or idle).
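
A rough sketch of the new origin, assuming the displayed value is simply the internal priority minus the first timesharing level. EX_FIRST_TIMESHARE and its value are illustrative only; the real PZERO and PUSER constants are defined in <sys/priority.h> and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for PUSER; the value is made up for this sketch. */
#define EX_FIRST_TIMESHARE 100

/* Displayed priority: 0 corresponds to the first timesharing level. */
static int
displayed_priority(unsigned char internal_pri)
{
    return ((int)internal_pri - EX_FIRST_TIMESHARE);
}

/* Negative displayed values mean a kernel or realtime priority. */
static bool
is_high_priority(unsigned char internal_pri)
{
    assert(EX_FIRST_TIMESHARE > 0);
    return (displayed_priority(internal_pri) < 0);
}
```

Under that assumption, the sign of the displayed number alone separates the high range (kernel, realtime) from the low one (timesharing, idle).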

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
DeltaFile
+1 -1  usr.bin/top/machine.c
+1 -1  bin/ps/print.c
+2 -2  2 files

FreeBSD/src eebc148 sys/kern sched_4bsd.c

sched_4bsd: ESTCPULIM(): Allow any value in the timeshare range

The current formula wastes queues and degrades usage estimation
precision, since any increase of ticks that goes over 40 priorities (so,
8 * 40) is clamped to the last of these 40 levels (the nice value is
subsequently added to that number to get the final priority level).

Allow 'ts_estcpu' to grow up to a value corresponding to the greatest
(i.e., lowest) priority of the timeshare range.
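
The effect of the clamp can be illustrated with a simplified model, assuming 8 ticks per priority level as in the "8 * 40" figure above. This is not the kernel's ESTCPULIM() macro: it leaves out the nice contribution and every other term of the real formula, and the 100-level range used in the note below is a made-up stand-in for the timeshare range.

```c
#include <assert.h>

/* Ticks per priority level, taken from the "8 * 40" figure (simplified). */
#define EX_WEIGHT 8

/* Clamp a tick estimate so that it maps to at most 'levels' levels. */
static unsigned int
clamp_estcpu(unsigned int estcpu, unsigned int levels)
{
    unsigned int limit;

    assert(levels > 0);
    limit = EX_WEIGHT * levels - 1;
    return (estcpu < limit ? estcpu : limit);
}

/* Priority level (offset within the range) for a given tick estimate. */
static unsigned int
estcpu_to_level(unsigned int estcpu, unsigned int levels)
{
    return (clamp_estcpu(estcpu, levels) / EX_WEIGHT);
}
```

With 40 levels, estimates of 400 and 700 ticks both land on level 39 and become indistinguishable; with a hypothetical 100 levels they map to levels 50 and 87.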

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45392
DeltaFile
+3 -1  sys/kern/sched_4bsd.c
+3 -1  1 files

FreeBSD/src 51a4ae0 sys/kern sched_4bsd.c

sched_4bsd: Remove RQ_PPQ from ESTCPULIM()'s formula

Subtracting RQ_PPQ from the maximum number of allowed priority values
(the factor to INVERSE_ESTCPU_WEIGHT) has the effect of pessimizing the
number of processes assigned to the last priority bucket.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45392
DeltaFile
+3 -3  sys/kern/sched_4bsd.c
+3 -3  1 files

FreeBSD/src a454ff6 sys/kern sched_4bsd.c

sched_4bsd: Move ESTCPULIM() after its macro dependencies

No functional change (intended).

Also makes the comment about INVERSE_ESTCPU_WEIGHT() adjacent to its
definition.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45392
DeltaFile
+3 -3  sys/kern/sched_4bsd.c
+3 -3  1 files

FreeBSD/src a33225e sys/kern sched_ule.c

sched_ule: Sanitize CPU's use and priority computations, and ticks storage

Computation of %CPU in sched_pctcpu() was overly complicated, wrong in
the case of a non-maximal window (10 seconds span; this is always the
case in practice as the window would oscillate between 10 and 11 seconds
for continuously running processes) and performed unshifted for the
first part, essentially losing precision (up to 9% for SCHED_TICK_SECS
being 10), and with some ineffective shift for the second part.
Conserve maximum precision by only shifting by the required amount to
attain FSHIFT before dividing.  Apply classical rounding to nearest
instead of rounding down.
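
The "shift first, then divide with rounding to nearest" idea can be sketched as follows. This is not the code of sched_pctcpu(): the function name, the 64-bit intermediate and the EX_FSHIFT value of 11 are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift for this sketch (stand-in for FSHIFT). */
#define EX_FSHIFT 11
#define EX_FSCALE (1u << EX_FSHIFT)

/*
 * Fraction of the window spent running, as a fixed-point value where
 * EX_FSCALE means 100%.  The numerator is shifted before the division
 * so no precision is lost early, and half the divisor is added so the
 * result is rounded to nearest instead of down.
 */
static uint32_t
pctcpu_fixed(uint32_t run_ticks, uint32_t window_ticks)
{
    uint64_t scaled;

    assert(window_ticks > 0 && run_ticks <= window_ticks);
    scaled = (uint64_t)run_ticks << EX_FSHIFT;
    return ((uint32_t)((scaled + window_ticks / 2) / window_ticks));
}
```

For one tick out of three, rounding down would give 682, whereas rounding to nearest gives 683 (the exact value being about 682.67).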

To generally avoid wraparound problems with tick fields in 'struct
td_sched' (as already happened once in sched_pctcpu_update()), make them
all unsigned, and ensure 'ticks' is always converted to some 'u_int'.
While here, fix SCHED_AFFINITY().
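
Why unsigned tick fields help with wraparound can be shown with a small sketch. The function names are hypothetical; the point is only that unsigned subtraction in C is defined modulo 2^N, so the difference between two tick samples stays correct when the counter wraps in between.

```c
#include <assert.h>
#include <limits.h>

/* Ticks elapsed between two samples of a free-running unsigned counter. */
static unsigned int
ticks_elapsed(unsigned int now, unsigned int then)
{
    /* Unsigned arithmetic wraps modulo UINT_MAX + 1, by definition. */
    return (now - then);
}

/* Self-check: the counter wraps between the two samples. */
static void
wraparound_example(void)
{
    unsigned int then = UINT_MAX - 4;
    unsigned int now = then + 10;    /* wraps around to 5 */

    assert(now == 5);
    assert(ticks_elapsed(now, then) == 10);
}
```

The same subtraction on signed integers would overflow, which is undefined behavior in C.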

Rewrite sched_pctcpu_update() while keeping the existing formulas:
- Fix the hole in the cliff case that in theory 'ts_ticks' can become

    [34 lines not shown]
DeltaFile
+115 -75  sys/kern/sched_ule.c
+115 -75  1 files

FreeBSD/src 6792f34 sys/kern sched_ule.c

sched_ule: Recover previous nice and anti-starvation behaviors

Justification for this change is to avoid disturbing ULE's behavior too
much at this time.  We however acknowledge that the effect of "nice"
values is extremely weak and will most probably change it going forward.

Tuning makes it possible to mostly recover ULE's behavior prior to the
switch to a single 256-queue runqueue and the increase of the
timesharing priority levels' range.

After this change, in a series of tests involving two long-running
processes with varying nice values competing for the same CPU, we
observe that the used CPU time ratios of the highest-priority process
change by at most 1.15% and on average by 0.46% (absolute differences).
In relative differences, they change by at most 2% and on average by
0.78%.

In order to preserve these ratios, as the number of priority levels
allotted to timesharing has been raised from 136 to 168 (and the subsets

    [24 lines not shown]
DeltaFile
+58 -18  sys/kern/sched_ule.c
+58 -18  1 files

FreeBSD/src dee257c sys/sys priority.h

sched: Internal priority ranges: Reduce kernel, increase timeshare

Now that a difference of 1 in priority level is significant, we can
shrink the priority range reserved for kernel threads.

Only four distinct levels are necessary for the bottom half (3 base
levels and arguably an additional one for demoted interrupt threads that
run for full time slices so that they finally don't compete with other
ones).  To leave room for other possible uses, we settle on 8 levels.

Given the symbolic constants for the top half, 10 levels are currently
necessary.  We settle on 16 levels.

This allows enlarging the timesharing range, which covers both ULE's
interactive and batch ranges, to 168 distinct levels from fewer than 64
for ULE (as of before the changes to make it use a single runqueue
and have 256 distinct levels per runqueue) and 34 for 4BSD.

While here, note that the realtime range is required to have at least 32

    [17 lines not shown]
DeltaFile
+28 -22  sys/sys/priority.h
+28 -22  1 files

FreeBSD/src d710ace sys/kern kern_switch.c, sys/sys runq.h

runq: Add copyright

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
DeltaFile
+5 -0  sys/sys/runq.h
+5 -0  sys/kern/kern_switch.c
+10 -0  2 files

FreeBSD/src 055b5b5 sys/sys runq.h

runq: Restrict <sys/runq.h> to kernel only

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
DeltaFile
+9 -13  sys/sys/runq.h
+9 -13  1 files

FreeBSD/src a2d1c3b sys/tests/epoch epoch_test.c

epoch_test: Assign different priorities using offset 1

Replace the hardcoded 4 (old RQ_PPQ) by 1 (new RQ_PPQ), as all priority
levels are now treated differently.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
DeltaFile
+1 -1  sys/tests/epoch/epoch_test.c
+1 -1  1 files

FreeBSD/src b2a9ee2 tests/sys/kern ptrace_test.c

runq: Remove userland references to RQ_PPQ in rtprio contexts

Concerns only a single test (ptrace_test.c).

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
DeltaFile
+2 -3  tests/sys/kern/ptrace_test.c
+2 -3  1 files

FreeBSD/src e3a4b98 sys/sys param.h

runq: Bump __FreeBSD_version after switching to 256 levels

Corresponding to changing RQ_PPQ to 1.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
DeltaFile
+1 -1  sys/sys/param.h
+1 -1  1 files

FreeBSD/src af8de65 sys/sys runq.h

runq: Switch to 256 levels

This increases the number of levels from 64 to 256, which coincides with
the distinct internal priority values (priority is currently encoded in
a 'u_char', whose range is entirely used).

With this change, we become POSIX-compliant for SCHED_FIFO/SCHED_RR in
that we really provide 32 distinct priority levels for these policies.
Previously, threads in the same "priority group", with priority groups
defined as the threads in consecutive spans of 4 priority levels
starting with level 0 up to 31 (so there are 8 groups), could not
preempt or be preempted by each other even if they were assigned
different priority levels.
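
The "priority group" effect can be sketched by mapping a priority to a queue index through a priorities-per-queue divisor, 4 before this change and 1 after it. The function names are hypothetical and the code is a simplified model, not the runqueue implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Queue index for a priority, given the number of priorities per queue. */
static unsigned int
queue_index(unsigned char pri, unsigned int ppq)
{
    assert(ppq > 0);
    return (pri / ppq);
}

/* Threads sharing a queue cannot preempt each other based on priority. */
static bool
same_queue(unsigned char a, unsigned char b, unsigned int ppq)
{
    return (queue_index(a, ppq) == queue_index(b, ppq));
}
```

With 4 priorities per queue, levels 4 and 7 share queue 1 and the highest index is 63 (64 queues); with 1 per queue, every level has its own queue and the highest index is 255.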

See also commit "sched_ule: Use a single runqueue per CPU" for all the
drawbacks that this change also removes.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506

    [2 lines not shown]
DeltaFile
+1 -1  sys/sys/runq.h
+1 -1  1 files

FreeBSD/src fd14158 sys/contrib/openzfs/include/os/freebsd/spl/sys proc.h, sys/contrib/openzfs/include/os/linux/spl/sys sysmacros.h

zfs: spa: ZIO_TASKQ_ISSUE: Use symbolic priority

This allows changing the meaning of priority differences in FreeBSD
without requiring code changes in ZFS.

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45390
DeltaFile
+3 -18  sys/contrib/openzfs/module/zfs/spa.c
+3 -1  sys/contrib/openzfs/include/os/linux/spl/sys/sysmacros.h
+3 -1  sys/contrib/openzfs/include/os/freebsd/spl/sys/proc.h
+3 -1  sys/contrib/openzfs/include/sys/zfs_context.h
+12 -21  4 files

FreeBSD/src 8ecc419 sys/dev/vkbd vkbd.c, sys/kern vfs_vnops.c kern_rmlock.c

Internal scheduling priorities: Always use symbolic ones

Replace priorities specified by a base priority and some hardcoded
offset value by symbolic constants.  Hardcoded offsets prevent changing
the difference between priorities without changing their relative
ordering, and are generally a dangerous practice since the resulting
priority may inadvertently belong to a different selection policy's
range.

Since RQ_PPQ is 4, differences of less than 4 are insignificant, so just
remove them.  These small differences have not been changed for years,
so it is likely they have no real meaning (besides having no practical
effect).  One can still consult the change history to recover them if
ever needed.

No functional change (intended).

MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506

    [2 lines not shown]
DeltaFile
+11 -11  sys/ufs/ufs/ufs_quota.c
+7 -7  sys/dev/vkbd/vkbd.c
+6 -6  sys/kern/vfs_vnops.c
+4 -4  sys/net/if_tuntap.c
+4 -3  sys/kern/kern_rmlock.c
+3 -4  sys/kern/vfs_bio.c
+35 -35  21 files not shown
+65 -66  27 files

FreeBSD/src baecdea sys/kern sched_ule.c kern_switch.c, sys/sys runq.h

sched_ule: Use a single runqueue per CPU

Previously, ULE would use 3 separate runqueues per CPU to store threads,
one for each of its selection policies, which are realtime, timesharing
and idle.  They would be examined in this order, and the first thread
found would be the one selected.

This choice indeed appears as the easiest evolution from the single
runqueue used by sched_4bsd (4BSD): It allows sharing most of the same
runqueue code, which currently defines 64 levels per runqueue, while
multiplying the number of levels (by 3).  However, it has several
important drawbacks:

1. The number of levels is the same for each selection policy.  64 is
unnecessarily large for the idle policy (only 32 distinct levels would
be necessary, given the 32 levels of our RTP_PRIO_IDLE and their future
aliases in the to-be-introduced SCHED_IDLE POSIX scheduling policy) and
unnecessarily restrictive both for the realtime policy (which should
include 32 distinct levels for PRI_REALTIME, given our implementation of

    [34 lines not shown]
DeltaFile
+190 -121  sys/kern/sched_ule.c
+0 -11  sys/kern/kern_switch.c
+0 -1  sys/sys/runq.h
+190 -133  3 files

FreeBSD/src f4be333 sys/kern sched_ule.c

sched_ule: Re-implement stealing on top of runq common-code

Stop using internal knowledge of runqueues.  Remove duplicate
boilerplate parts.

Concretely, runq_steal() and runq_steal_from() are now implemented on
top of runq_findq().

Besides considerably simplifying the code, this change also brings an
algorithmic improvement since, previously, set bits in the runqueue's
status words were found by testing each bit individually in a loop
instead of using ffsl()/bsfl() (except for the first set bit per status
word).
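
The difference between testing every bit and jumping from one set bit to the next can be sketched as below. This example uses the GCC/Clang builtin __builtin_ctzl() in place of ffsl()/bsfl(), which is a substitution made for portability of the sketch; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collect the indices of the set bits of 'word' in ascending order.
 * Each iteration jumps directly to the lowest set bit rather than
 * testing all bit positions one by one.  Returns the number of indices
 * stored in 'out' (at most 'cap').
 */
static size_t
set_bits(unsigned long word, int *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    while (word != 0 && n < cap) {
        out[n++] = __builtin_ctzl(word);  /* index of lowest set bit */
        word &= word - 1;                 /* clear that bit */
    }
    return (n);
}
```

The loop runs once per set bit, so a sparse status word costs far fewer iterations than a scan over every bit position.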

This change also makes it more apparent that runq_steal_from() treats
the first thread with highest priority specifically (which runq_steal()
does not).

MFC after:      1 month

    [3 lines not shown]
DeltaFile
+66 -54  sys/kern/sched_ule.c
+66 -54  1 files

FreeBSD/src fdf31d2 sys/kern sched_ule.c

sched_ule: runq_steal_from(): Suppress first thread special case

This special case was introduced as early as commit "ULE 3.0"
(ae7a6b38d53f, r171482, from July 2007).  It caused runq_steal_from() to
ignore the highest-priority thread while stealing.

Its functionality was changed in commit "Rework CPU load balancing in
SCHED_ULE" (36acfc6507aa, r232207, from February 2012), where the intent
was to keep track of that first thread and return it if no other one was
stealable, instead of returning NULL (no steal).  Some bug prevented it
from working in loaded cases (more than one thread, and all threads but
the first one not stealable), which was subsequently fixed in commit
"sched_ule(4): Fix interactive threads stealing." (bd84094a51c4, from
September 2021).

All the reasons for this mechanism we could second-guess were dubious at
best.  Jeff Roberson, ULE's main author, says in the differential
revision that "The point was to move threads that are least likely to
benefit from affinity because they are unlikely to run soon enough to

    [13 lines not shown]
DeltaFile
+0 -14  sys/kern/sched_ule.c
+0 -14  1 files

FreeBSD/src 757bab0 sys/kern kern_switch.c

runq: Tidy up and rename runq_setbit() and runq_clrbit()

Factor out common sub-expressions into a separate helper
(runq_sw_apply()) for better readability.

Rename these functions so that the names refer to the use cases rather
than the implementations.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
DeltaFile
+76 -28  sys/kern/kern_switch.c
+76 -28  1 files

FreeBSD/src a311931 sys/kern kern_switch.c sched_ule.c, sys/sys runq.h

runq: New function runq_is_queue_empty(); Use it in ULE

Indicates if some particular queue of the runqueue is empty.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
DeltaFile
+28 -0  sys/kern/kern_switch.c
+1 -1  sys/kern/sched_ule.c
+1 -0  sys/sys/runq.h
+30 -1  3 files

FreeBSD/src 9c3f468 sys/kern kern_switch.c, sys/sys runq.h

runq: New runq_findq(), common low-level search implementation

That new runq_findq(), based on the implementation of the former
runq_findq_range(), is intended to become the foundation and unique
low-level implementation for all searches in a runqueue.  In addition to
a range of queue indices, it takes a predicate function, which allows it to:
- Possibly skip a non-empty queue with higher priority (numerically
  lower index) on some criteria.  This is not yet used but will be in
  a subsequent commit revising ULE's stealing machinery.
- Choose a specific thread in the queue, not necessarily the first.
- Return whatever information is deemed necessary.
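
The three capabilities listed above can be sketched with a toy model of a predicate-driven search. Everything here is hypothetical (types, names, the one-thread-per-queue simplification and the linear scan); it is not the signature or implementation of the kernel's runq_findq().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_NQUEUES 8

/* Toy runqueue: one slot per queue, 0 meaning "queue is empty". */
struct ex_runq {
    int first_thread[EX_NQUEUES];
};

/* Predicate: return true to stop the search at this queue. */
typedef bool ex_pred_t(int queue_idx, int thread_id, void *data);

/*
 * Search queues lo..hi (bounds included), lowest index first, and
 * return the index of the first non-empty queue accepted by 'pred',
 * or -1 if there is none.
 */
static int
ex_findq(const struct ex_runq *rq, int lo, int hi, ex_pred_t *pred,
    void *data)
{
    assert(rq != NULL && pred != NULL);
    assert(0 <= lo && lo <= hi && hi < EX_NQUEUES);
    for (int idx = lo; idx <= hi; idx++) {
        if (rq->first_thread[idx] == 0)
            continue;
        if (pred(idx, rq->first_thread[idx], data))
            return (idx);
    }
    return (-1);
}

/* Example predicate: accept only even thread IDs and report the pick. */
static bool
ex_pick_even(int queue_idx, int thread_id, void *data)
{
    (void)queue_idx;
    if (thread_id % 2 != 0)
        return (false);
    *(int *)data = thread_id;
    return (true);
}
```

In this model the predicate may reject a non-empty queue with a numerically lower index, and it passes the selected thread back through 'data'.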

It helps to remove duplicated boilerplate code, including redundant
assertions, and generally makes things much clearer.  These effects will
be even greater in a subsequent commit modifying ULE to use it.

runq_first_thread_range() replaces the old runq_findq_range() (returns
the first thread of the highest priority queue in the requested range),
and runq_first_thread() the old runq_findq() (same, but considering all

    [7 lines not shown]
DeltaFile
+124 -81  sys/kern/kern_switch.c
+12 -1  sys/sys/runq.h
+136 -82  2 files

FreeBSD/src 439dc92 sys/kern kern_switch.c

runq: Revamp runq_find*(), new runq_findq_range()

Rename existing functions to use the simpler prefix 'runq_findq' instead
of 'runq_findbit' (that they work on top of bit runs is an
implementation detail).

Add runq_findq_range(), which takes a range of indices to operate on
(bounds included).  This is in preparation for changing ULE to use
a single runqueue, since it needs to treat the timesharing range
differently.

Rename runq_findbit_from() to runq_findq_circular(), which is more
descriptive.

To reduce code duplication, have runq_findq() and runq_findq_circular()
leverage runq_findq_range() internally.  For the latter, this also
brings a small algorithmic improvement, since previously the second pass
(from queue 0) would cover the whole runqueue if it was completely
empty, scanning again empty queues after the start index.
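
A sketch of a circular search built on a range search, under the assumption of a plain boolean array in place of the status bitmap. The names are hypothetical. The second pass stops just before the start index, so queues already seen in the first pass are not scanned again.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_NQUEUES 8

/* Index of the first non-empty queue in lo..hi (inclusive), or -1. */
static int
ex_find_range(const bool nonempty[EX_NQUEUES], int lo, int hi)
{
    assert(0 <= lo && lo <= hi && hi < EX_NQUEUES);
    for (int idx = lo; idx <= hi; idx++)
        if (nonempty[idx])
            return (idx);
    return (-1);
}

/*
 * Circular search starting at 'start': first start..last, then
 * 0..start-1 only.
 */
static int
ex_find_circular(const bool nonempty[EX_NQUEUES], int start)
{
    int idx;

    assert(0 <= start && start < EX_NQUEUES);
    idx = ex_find_range(nonempty, start, EX_NQUEUES - 1);
    if (idx != -1 || start == 0)
        return (idx);
    return (ex_find_range(nonempty, 0, start - 1));
}
```

On a completely empty runqueue, each queue is therefore examined exactly once.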

    [6 lines not shown]
DeltaFile
+58 -43  sys/kern/kern_switch.c
+58 -43  1 files

FreeBSD/src 200fc93 sys/kern kern_switch.c, sys/sys runq.h

runq: Re-order functions more logically

No code change in moved functions.

Reviewed by:    kib
MFC after:      1 month
Event:          Kitchener-Waterloo Hackathon 202506
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D45387
DeltaFile
+101 -100  sys/kern/kern_switch.c
+4 -3  sys/sys/runq.h
+105 -103  2 files