ena: Budget rx descriptors, not packets
We had ENA_RX_BUDGET = 256 in order to allow up to 256 received
packets to be processed before we do other cleanups (handling tx
packets and, critically, refilling the rx buffer ring). Since the
ring holds 1024 buffers by default, this was fine for normal packets:
We refill the ring when it falls below 7/8 full, and even with a large
burst of incoming packets allowing it to fall by another 1/4 before we
consider refilling the ring still leaves it at 7/8 - 1/4 = 5/8 full.
With jumbos, the story is different: A 9k jumbo (as is used by default
within the EC2 network) consumes 3 descriptors, so a single rx cleanup
pass can consume 3/4 of the default-sized rx ring; if the rx buffer
ring wasn't completely full before a packet burst arrives, this puts
us perilously close to running out of rx buffers.
This precise failure mode has been observed on some EC2 instance types
within a Cluster Placement Group, resulting in the nominal 10 Gbps
single-flow throughput between instances dropping to ~100 Mbps as a
[19 lines not shown]
ena: Adjust ena_[rt]x_cleanup to return bool
The ena_[rt]x_cleanup functions are limited internally to a maximum
number of packets; this ensures that TX doesn't starve RX (or vice
versa) and also attempts to ensure that we get a chance to refill
the RX buffer ring before the device runs out of buffers and starts
dropping packets.
Historically these functions have returned the number of packets which
they processed which ena_cleanup compares to their respective budgets
to decide whether to reinvoke them. This is unnecessary complication;
since the precise number of packets processed is never used, adjust
the APIs of those functions to return a bool indicating if they want
to be reinvoked (aka if they hit their limits).
Since ena_tx_cleanup now only uses work_done if diagnostics are
enabled (ena_log_io macros to nothing otherwise) eliminate that
variable and pass its value (ENA_TX_BUDGET - budget) to ena_log_io
directly.
[7 lines not shown]
speaker(4): move static data to text
Make this data const (it doesn't change) which will also move it to
a text section.
Signed-off-by: Raphael Poss <knz at thaumogen.net>
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/1922
Fix xhci detection on Raspberry Pi 400
If you use the FreeBSD pre-build Raspberry Pi image, it does not include
the specific .dtb file for the Raspberry Pi 400. On this hardware, it
will fall back to attempting to load the Raspberry Pi 4 .dtb file
instead.
The Pi 4 .dtb file reports the board compatible name as
"raspberrypi,4-model-b" The Pi 400 .dtb file reports the board
compatible name as "raspberrypi,400" However, it's even better to
use the generic name.
When using the official Pi 400 .dtb file from the Raspberry Pi Firmware
collection, the FreeBSD xhci driver currently fails to recognize this,
and thus fails to initialize the xhci device. This means no external
USB, or internal USB (which feeds the build-in keyboard)
The official Raspberry Pi FreeBSD image has been working on the Pi 400
"on accident" simply because it didn't include the Pi 400 .dtb file
[11 lines not shown]
hosts.equiv.5: correct nits to fix `mandoc -T lint` issues
- Rename `.Nm .rhosts` to `.Nm rhosts` to match the MLINK for the
manpage.
- Use `.Pa` instead of `.Nm` when discussing the paths for `.rhosts` and
`hosts.equiv.5` for explicitness and clarity.
Bump .Dd for the change.
MFC after: 1 week
security(7): fix `mandoc -T lint` complaints
- Add `.Nm` section for securelevel(7) to match corresponding MLINKS entry.
- Fix the spelling for mac(4) (the actual subsystem manpage is spelled out in
lowercase.
MFC after: 1 week
Remove -fms-extensions throughout the tree
During a discussion about using -fms-extensions jhb pointed out that
we have them enabled in the kernel for gcc by default (even multiple
times in one part). I had missed all that and clang still failed on
my use case (needing another option).
The original cause for enabling them for our tree back then was that
we needed to support C11 anonymous struct/unions.
Our in-tree gcc 4.2.1, despite later patches, needed the
-fms-extensions to support these even though this was not the expected
use case for that option ( cc4a90c445aa0 enabled it globally for the
kernel).
clang at that time (or at least when it became default for 10.0)
already was fine (with C11).
Any later gcc (4.6.0 onwards) did not need that option anymore, even
when compiled for -std=iso9899:1990 (which does not support anonymous
structs/unions) unless one would add -pedantic (see gcc git 4bdd0a60b27a).
[16 lines not shown]
net: Fix collision between SIOCGI2CPB and IPSECGREQID
It turns out interface ioctls are defined not just in sockio.h, but are
spread among many files. When I added SIOCGI2CPB at the bottom of the
file, the next number (160) collided with an ioctl (IPSECGREQID) that
I was unaware of in another file. Fix this by moving to a number that
is unclaimed.
Fixes: cf1f21572897 (net: Add SIOCGI2CPB ioctl & add page/bank fields to ifi2creq)
Reported by: dhw
Reviewed by: imp
nda: Filter non-storage nvme drives
Non-stroage drives have namespaces, but no storage attached. These
drives have a different interface type than storage drives, so ignore
them for the nvme_sim, which just handles storage.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D56461
cam: kern.cam.max_high_power tuneable / sysctl
Create a tunable for the maxinum number of 'high power' commands to
schedule, kern.cam.max_high_power. Default remains at 4.
Differential Revision: https://reviews.freebsd.org/D56462
cam: Set ccb_h.status on XPT_GDEVLIST early-return paths
XPT_GDEVLIST in xpt_action_default has two early-return paths (list
changed and index not found) that set cgdl->status but not ccb_h.status.
Since xpt_action sets ccb_h.status to CAM_REQ_INPROG before dispatching,
and XPT_GDEVLIST is an non-queued CCB, cam_periph_ccbwait skips the
sleep loop and immediately hits the KASSERT checking that status !=
CAM_REQ_INPROG, causing a panic.
Set ccb_h.status = CAM_REQ_CMP at the top of the code rather than the
bottom. Any future error paths will be right (since this command can't
fail at the command level, just in the status of the data level).
PR: 293899
Assisted-By: Claude Opus 4.6 (1M context)
Sponsored by: Netflix
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D56487
pass(4): Allowlist CCB func_codes to harden passthrough ioctls
The pass(4) driver's CAMIOCOMMAND and CAMIOQUEUE ioctls accept arbitrary
CCBs from userland. This device requires root to open, and thus send
these commands. Previously, the only func_code filter was a blocklist
check against the XPT_FC_XPT_ONLY flag. This missed several dangerous
func_codes that lack that flag:
- XPT_ABORT: the abort_ccb field is a raw kernel pointer from the
user CCB payload. xpt_action_default() dereferences it without
validation, leading to kernel crashes or worse.
- XPT_SASYNC_CB: the callback and callback_arg fields come directly
from the user CCB payload and get registered as a kernel async
callback, allowing arbitrary kernel code execution.
- Target mode CCBs (XPT_EN_LUN, XPT_TARGET_IO, etc.) fall through
directly to the SIM with user-controlled payloads.
[23 lines not shown]
acpi_apm: Narrow scope of ACPI_LOCK
This lock doesn't need to be held across seldrain/knlist_destroy. It
is also redundant (and a bug) to hold it across knlist_add and
knlist_remove since it is the mutex for the knlist.
PR: 293901
Reported by: Jiaming Zhang <r772577952 at gmail.com>
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D55994
(cherry picked from commit cc2715cf1f864345ab175db691d4e152d5fb84af)
acpi_apm: Don't recurse on ACPI_LOCK in apmreadfilt
The lock is already held by the caller since it is used as the knlist
lock.
PR: 293901
Reported by: Jiaming Zhang <r772577952 at gmail.com>
Fixes: cc2715cf1f86 ("acpi_apm: Narrow scope of ACPI_LOCK")
(cherry picked from commit 8c941e313e3925b17e49b093244c159db7a112f8)
LinuxKPI: Fix simple_read_from_buffer for zero-size and off-the-end reads
I noticed that the buf_size < 0 check can never be true (it's a
size_t) and decided to check for this condition by an alternate
expression, and I also noticed that a read_size of 0 would incorrectly
return -EFAULT. Instead, return success for both of these cases as
reading beyond the EOF of a normal file also returns EOF, not EINVAL.
Reviewed by: bz
Sponsored by: AFRL, DARPA
Differential Revision: https://reviews.freebsd.org/D55845
(cherry picked from commit 2353fa1aca553883141a7b5d0aa54312a4610412)
lindebugfs: Pass user buffer pointers to the read/write file operations
The Linux file_operations API expects the read and write operations
to take a single user buffer pointer (along with the length and the
file offset as an in/out parameter).
However, the debugfs_fill function was violating this part of the
contract as it was passing down kernel pointers instead. An earlier
commit (5668c22a13c6befa9b8486387d38457c40ce7af4) hacked around this
by modifying simple_read_from_buffer() to treat its user pointer
argument as a kernel pointer instead. However, other commits keep
tripping over this same API mismatch
(e.g. 78e25e65bf381303c8bdac9a713ab7b26a854b8c passes a kernel pointer
to copy_from_user in fops_str_write).
Instead, change debugfs_fill to use the "raw" pseudofs mode where the
uio is passed down to directly to the fill callback rather than an
sbuf. debufs_fill now iterates over the iovec in the uio similar to
the implementation of uiomove invoking the read or write operation on
[26 lines not shown]
LinuxKPI: Clear the sbuf at the start of each call to seq_read
Each invocation of seq_read invokes the seq_file.show callback which
writes into the sbuf. Then it invokes sbuf_finish before copying the
data into the caller's buffer. Without this, a second call to
seq_read on the same file would try to append data to a finished sbuf.
Reviewed by: bz
Sponsored by: AFRL, DARPA
(cherry picked from commit c181c8f5ca707962359e636ca5aa536e60147eee)
pciconf: Add a tree mode
This lists PCI devices in a hierarchy showing the parent/child
relationship of PCI devices and bridges. While this is inspired by
lspci -t output, the format is closer to ps -d and also prefers using
new-bus device names when possible. If a device does not have a
driver, the PCI selector is output in place of the device name.
When the -v flag is given, the vendor and device ID strings are output
after the device name. If a string for an ID isn't found, the hex ID
values are output instead.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D55774
(cherry picked from commit 14b8a27883c15d3add3114f855eff7c6bda1b015)
pciconf.8: Reorganize slightly to handle additional modes
Move the description of the optional device argument earlier before
describing individual command modes.
Add a subsection for list mode and a second subsection for the other
modes that work with a single device.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D55773
(cherry picked from commit 98a0d2283701e08353ce670c8023803c58a4994c)
pci: Export bus numbers for bridge devices in struct pci_conf
This exports bus information about bridges to userspace via the
less-privileged PCIOCGETCONF ioctl. Previously if userspace wished to
query this information, it had to use direct PCI config register
access which requires higher privilege.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D55771
(cherry picked from commit 7e7a1b61531a29b4a0a5cdac66b96f420e6c66e4)
pci.4: Quote argument to -width for a list block
This fixes an mdoc warning and also properly indents this list. While
here, update the quoted argument to be the longest tag in the list.
Also while here, correct the description of pd_numa_domain. NUMA
domains are a property of the device, not of the driver.
Reviewed by: ziaee, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D55770
(cherry picked from commit c3ac5f14c8b330c036149d1d24cd3369d1418de2)
pciconf: Factor out fetching of matching devices from list_devs
The new fetch_devs function fetches the entire list of PCI devices
into a single list, retrying if the list changes while it is being
fetched.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D55768
(cherry picked from commit 9eb035ff8439195f565b9e3180b727333a4e7170)
sys: Don't pass RF_ALLOCATED to bus_alloc_resource*
This is a nop as eventually these flags are passed to rman_reserve_resource
which unconditionally sets RF_ALLOCATED in the new flags for a region.
However, it's really a layering violation to use RF_ALLOCATED in relation
to struct resource objects outside of subr_rman.c as subr_rman.c uses
this flag to manage it's internal tracking of allocated vs free regions.
In addition, don't document this as a valid flag in the manual. I
think the intention here was that if a caller didn't want to pass
RF_ACTIVE or RF_SHAREABLE, they could pass RF_ALLOCATED instead of 0,
but given the layering violation, I think it's best to just pass 0
instead in that case.
NB: The bhnd bus uses RF_ALLOCATED (along with RF_ACTIVE) in a
separate API to manage resource regions that are not struct resource
objects (but a separate wrapper object). It would perhaps be cleaner
if the chipc_retain_region and chipc_release_region functions used
their own flag constants instead of reusing the rman(9) flags.
[5 lines not shown]