ifconfig: fix reporting optics on most 100G interfaces
This fixes a bug where optics on 100G and faster NICs
were not properly reported.
(cherry picked from commit 709348c21351a783ff0025519d1f7cf884771077)
acpi_ged: Handle events directly
Handle GED interrupts directly from the interrupt handler,
while the interrupt source is masked, so as to conform
with the ACPI spec, and avoid spurious interrupts and
lockups on boot.
When an ACPI GED interrupt is encountered, the spec requires
the OS (as stated in 5.6.4: General Purpose Event Handling)
to leave the interrupt source masked until it runs the
EOI handler. This is not a good fit for our method of
queuing the work (including the EOI ack of the interrupt)
via the AcpiOsExecute() taskqueue mechanism.
Note this fixes a bug where an arm64 server could lock up if
it encountered a GED interrupt at boot. The lockup was
due to running on a single core (due to arm64 not using
EARLY_AP_STARTUP), and due to that core encountering a
new interrupt each time the interrupt handler unmasked
[27 lines not shown]
Revert "When stopping powerd, set the CPU frequency back to its maximum value"
This reverts commit 1dcb6ad173e57b489a859ea59ed6eaa733bdb5bc.
As of "8cb16fdbea6b Restore original frequency on exit.", powerd
restores the original frequency itself.
Further, if the original frequency is not the same as the
first frequency found in the frequency list, then the restoration
done by the powerd_poststop will restore the wrong frequency.
This can happen on Intel machines where Turbo is not enabled,
but the turbo frequency is first in the list of frequencies.
In this case, turbo will be enabled when the user did not want
it to be.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D40197
Reviewed by: imp, mav
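Illustration of the problem described in the revert above. This is a minimal sketch, not powerd or rc.d code: the struct, the function names, and the frequency values are all hypothetical. It only shows why "restore the first frequency in the list" and "restore the frequency saved at startup" can disagree when the turbo entry is listed first.

```c
#include <stddef.h>

/* Hypothetical model of a CPU frequency list, highest entry first. */
struct freq_state {
	const int *freqs;	/* available frequencies in MHz */
	size_t     nfreqs;
	int        current;	/* frequency selected right now */
};

/* Roughly what a "set it back to maximum" hook does: take entry 0. */
static int
restore_first_listed(const struct freq_state *st)
{
	return (st->nfreqs > 0 ? st->freqs[0] : st->current);
}

/* What a daemon that saved the original value at startup does. */
static int
restore_saved(int saved_at_start)
{
	return (saved_at_start);
}
```

With a list such as {3401, 3400, 2000}, where 3401 stands for the turbo entry and the machine started at 3400, restore_first_listed() yields 3401 while restore_saved(3400) yields 3400.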
ktls: re-work alloc thread
When the ktls_buffer zone needs to expand, it may fail due
to a lack of physically contiguous memory. We tried to rectify
that by introducing an alloc thread to provide a context where
it is harmless to sleep, and letting that thread repopulate
the ktls_buffer zone.
However, it turns out that M_WAITOK is not enough, and we
must call vm_page_reclaim_contig_domain() to reclaim contig
memory. Worse, M_WAITOK results in the allocation essentially
busy-looping around vm_domain_alloc_fail() returning EAGAIN,
causing vm_page_alloc_noobj_contig_domain() to loop and resulting
in the alloc thread consuming 100% CPU.
To fix this, we change the alloc thread to call
vm_page_reclaim_contig_domain_ext().
In order to prevent the busy loop around vm_domain_alloc_fail(), we
[12 lines not shown]
vm: implement vm_page_reclaim_contig_domain_ext()
Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple
contiguous regions at once. This makes reclamation more efficient
for users that need multiple contiguous regions.
This is needed because callers like ktls may need to reclaim many
contiguous regions, and each scan of physical memory can take
multiple seconds on a large memory machine (order of 100GB of
RAM). Rather than modifying the core algorithm, I extended
vm_page_reclaim_contig_domain() to take a "desired_runs" argument to
allow the caller to request that it reclaim more than just a single
run. There is no functional change intended for all existing
callers.
The first user for this interface is the ktls code
(https://reviews.freebsd.org/D39421). By reclaiming multiple runs,
ktls goes from consuming hours of CPU to refill its buffer zone to
[5 lines not shown]
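Illustration of the "desired_runs" idea from the commit above. This is a toy model and not the kernel's reclaim algorithm: it scans a hypothetical array of free/used flags once and records several non-overlapping runs, to show why asking for many runs per scan amortizes the cost of the scan. The function name and its arguments are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model: page_free[i] says whether page i is free.  A single pass
 * records up to desired_runs starting indices of runs of run_len
 * consecutive free pages.  Returns the number of runs found.
 */
static size_t
find_free_runs(const bool *page_free, size_t npages, size_t run_len,
    size_t desired_runs, size_t *starts)
{
	size_t found = 0, streak = 0;

	if (run_len == 0)
		return (0);
	for (size_t i = 0; i < npages && found < desired_runs; i++) {
		streak = page_free[i] ? streak + 1 : 0;
		if (streak == run_len) {
			starts[found++] = i + 1 - run_len;
			streak = 0;	/* runs must not overlap */
		}
	}
	return (found);
}
```

A caller that needs N runs makes one call with desired_runs = N instead of N calls that each restart the scan from the beginning.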
bectl: Improve error message when ZFS root is not found.
When recovering a system that is unbootable due to some
problem with the active BE, it is likely you'll be booted
from a rescue image running UFS. In this case, bectl
needs help finding the zpool root that you want to operate
on. Improve the error message to suggest
specifying a root, rather than just emitting a generic
error message that might imply, to the naive user, that
there is a ZFS compatibility issue between the rescue
image and the on-disk ZFS pool.
Reviewed by: imp, kevans
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D39346
LRO: Add missing checks for invalid IP addresses
LRO bypasses normal ip_input()/tcp_input() and lacks several checks
that are present in the normal path. Without these checks, it
is possible to trigger assertions added in b0ccf53f2455.
Reviewed by: glebius, rrs
Sponsored by: Netflix
ktls: Fix comments & whitespace issues with c0e4090e3d43
Address some last minute review feedback on c0e4090e3d43
by fixing spacing around comments, and clarifying that the
newly added destroy_task is not related to tls 1.0.
No functional change intended.
Pointed out by: jhb
Sponsored by: Netflix
ktls: Accurately track if ifnet ktls is enabled
This allows us to avoid spurious calls to ktls_disable_ifnet().
When we implemented ifnet kTLS, we set a flag in the tx socket
buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that
now, or in the past, ifnet ktls was active on a socket. Later,
I added code to switch ifnet ktls sessions to software in the case
of lossy TCP connections that have a high retransmit rate.
Because TCP was using SB_TLS_IFNET to know if it needed to do math
to calculate the retransmit ratio and potentially call into
ktls_disable_ifnet(), it was doing unneeded work long after
a session was moved to software.
This patch carefully tracks whether or not ifnet ktls is still enabled
on a TCP connection. Because the inp is now embedded in the tcpcb, and
because TCP is the most frequent accessor of this state, it made sense to
move this from the socket buffer flags to the tcpcb. Because we now need
reliable access to the tcpcb, we take a ref on the inp when creating a tx
[12 lines not shown]
ixgbe: Do not count L3/L4 checksum errors as input errors
NIC input errors have traditionally indicated problems at the link
level (crc errors, runts, etc). People tend to build monitoring
infrastructure around such errors in order to monitor for bad network
hardware. When L3/L4 checksum errors are included in the category of
input errors, it breaks such monitoring, as these errors can originate
anywhere on the internet, and do not necessarily indicate faulty
local network hardware.
Reviewed by: erj, glebius
Differential Revision: https://reviews.freebsd.org/D38346
Sponsored by: Netflix
dtrace: conditionally load the systrace_linux klds when loading dtrace.
When dtrace starts, it tries to detect if the dtrace klds are loaded,
and if not, it loads them by loading the dtraceall kld. This module
depends on most dtrace modules, including systrace for the native
freebsd and freebsd32 ABIs. However, it does not depend on the
systrace_linux klds, as they in turn depend on the linux ABI klds, and
we don't want to load an ABI module that the user has not explicitly
requested. This can leave a naive user in a state where they think all
syscall providers have been loaded, yet linux ABI syscalls are
"invisible" to dtrace.
To fix this, check to see if the linux ABI modules are loaded. If they
are, then load their systrace klds.
Reviewed by: markj, (emaste & jhb, earlier versions)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37986
vm: centralize VM_BATCHQUEUE_SIZE definition
Remove the platform-specific definitions of VM_BATCHQUEUE_SIZE
for amd64 and powerpc64, and instead treat all 64-bit platforms
identically. This has the effect of increasing the arm64
and riscv VM_BATCHQUEUE_SIZE to match that of other platforms.
Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37707
tcp: Build RACK and BBR stacks as a part of LINT
When RACK and BBR were added to the kernel, they were put
behind 'WITH_EXTRA_TCP_STACKS=1'. Unfortunately that was
never added to any NOTES file, so RACK & BBR were not compiled
with the various LINT-NOINET, LINT-NOINET6, and LINT-NOIP kernels.
This led to the stacks sometimes being broken.
This change:
- Fixes RACK so that it compiles with the various LINT-NO* kernels
- Adds WITH_EXTRA_TCP_STACKS=1 to all NOTES kernels so that
RACK and BBR are compile tested regularly
Sponsored by: Netflix
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D37903
Unbreak the build when MAC is not defined
7a2c93b86ef7 removed the use of "error" when MAC was not
defined, resulting in an unused variable error.
Sponsored by: Netflix
Reviewed by: jhb
vm: reduce lock contention when processing vm batchqueues
Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for our busy sendfile() based web workload.
It also greatly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.
So that the system does not lose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
This has been run for several months on a busy Netflix server, as well
as on my personal desktop.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305
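Illustration of the half-full trylock policy described in the batchqueue commit above. This is a userspace sketch, not the vm code: the structures, BQ_SIZE, and the atomic_flag spin lock standing in for the pagequeue mutex are all assumptions made to keep the example self-contained.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BQ_SIZE 8	/* hypothetical; stands in for the doubled batch size */

struct batchq {
	int    items[BQ_SIZE];
	size_t count;
};

struct pagequeue {
	atomic_flag lock;	/* stand-in for the pagequeue mutex */
	long        processed;
};

static bool
pq_trylock(struct pagequeue *pq)
{
	return (!atomic_flag_test_and_set_explicit(&pq->lock,
	    memory_order_acquire));
}

static void
pq_lock(struct pagequeue *pq)
{
	while (!pq_trylock(pq))
		;	/* spin; a real mutex would block here */
}

static void
pq_unlock(struct pagequeue *pq)
{
	atomic_flag_clear_explicit(&pq->lock, memory_order_release);
}

/*
 * Add an item.  From half full onward, try the lock opportunistically;
 * only wait for it once the batch is completely full.
 * Returns true if the batch was drained on this call.
 */
static bool
bq_add(struct pagequeue *pq, struct batchq *bq, int item)
{
	bq->items[bq->count++] = item;
	if (bq->count < BQ_SIZE / 2)
		return (false);
	if (bq->count < BQ_SIZE) {
		if (!pq_trylock(pq))
			return (false);	/* contended: keep batching */
	} else {
		pq_lock(pq);
	}
	pq->processed += (long)bq->count;
	bq->count = 0;
	pq_unlock(pq);
	return (true);
}
```

Without contention the batch drains at BQ_SIZE / 2 items, which is why doubling the queue size keeps the uncontended batch the same size as before.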
allocate inpcb aligned to cachelines
The inpcb struct is one of the most heavily utilized in the kernel
on a busy network server. By aligning it to a cacheline
boundary, we can ensure that closely related fields in the inpcb
and tcpcb can be predictably located on the same cacheline. rrs
has already done a lot of this work to put related fields on the
same line for the tcpcb.
In combination with a forthcoming patch to align the start of the tcpcb,
we see a roughly 3% reduction in CPU use on a busy web server serving
traffic over roughly 50,000 TCP connections.
Reviewed by: glebius, markj, tuexen
Differential Revision: https://reviews.freebsd.org/D37687
Sponsored by: Netflix
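Illustration of cacheline-aligned allocation as discussed in the inpcb commit above. This is a userspace sketch under stated assumptions: a 64-byte line size, a hypothetical struct conn_cb in place of the inpcb, and C11 aligned_alloc() in place of the kernel's zone allocator.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64	/* assumed line size for this sketch */

/* Hypothetical stand-in for a hot per-connection control block. */
struct conn_cb {
	_Alignas(CACHE_LINE) uint32_t hot_flags;	/* starts a cache line */
	uint32_t hot_seq;	/* shares that line with hot_flags */
	char     cold[120];	/* rarely touched fields */
};

static struct conn_cb *
conn_cb_alloc(void)
{
	/*
	 * aligned_alloc() wants a size that is a multiple of the
	 * alignment; sizeof(struct conn_cb) already is, because the
	 * struct itself is aligned to CACHE_LINE.
	 */
	return (aligned_alloc(CACHE_LINE, sizeof(struct conn_cb)));
}
```

Once the start of the object is pinned to a line boundary, the offset of each field decides which line it lands on, so related fields can be grouped deliberately.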
ixl: silence runtime warning when PCI_IOV is not enabled
When PCI_IOV is not enabled, do not attempt to call
iflib_softirq_alloc_generic(...IFLIB_INTR_IOV), as it results
in boot-time warnings similar to:
taskqgroup_attach_cpu: qid not found for iov cpu=2
ixl2: taskqgroup_attach_cpu failed 22
Instead, make it conditional on PCI_IOV like the other
SR-IOV related code.
Reviewed by: erj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37609
Fix a panic on boot introduced by 555a861d6826
First, an sbuf_new() in device_get_path() shadows the sb
passed in by dev_wired_cache_add(), leaving its sb in an
unfinished state, leading to a failed KASSERT(). Fixing this
is as simple as removing the sbuf_new() from device_get_path().
Second, we cannot simply take a pointer to the sbuf memory and
store it in the device location cache, because that sbuf
is freed immediately after we add data to the cache, leading
to a use-after-free and eventually a double-free. Fixing this
requires allocating memory for the path.
After a discussion with jhb, we decided that one malloc was
better than two in dev_wired_cache_add, which is why it changed
so much.
Reviewed by: jhb
Sponsored by: Netflix
MFC after: 14 days
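Illustration of the second problem in the panic fix above: a cache entry must own a copy of the path rather than point into a temporary buffer that is freed right after the add. This is a generic userspace sketch; struct cache_entry, cache_add() and cache_free() are hypothetical names and not the functions touched by the commit.

```c
#include <stdlib.h>
#include <string.h>

struct cache_entry {
	char *path;	/* owned by the cache entry */
};

/* Copy the path so the entry stays valid after the caller frees its buffer. */
static int
cache_add(struct cache_entry *e, const char *tmp_path)
{
	size_t len = strlen(tmp_path) + 1;

	e->path = malloc(len);
	if (e->path == NULL)
		return (-1);
	memcpy(e->path, tmp_path, len);
	return (0);
}

/* Release the entry's private copy exactly once. */
static void
cache_free(struct cache_entry *e)
{
	free(e->path);
	e->path = NULL;
}
```

Storing tmp_path itself would leave the entry dangling once the temporary buffer is released, and a later cache_free() would then free the same memory a second time.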
LRO: fix BPF filters for lagg in the hpts path
When in the hpts path, we need to handle BPF filters since aggregated
packets do not pass up the stack in the normal way. This is already
done for most interfaces, but lagg needs special handling. This is
because packets received via a lagg are passed up the stack with
the leaf interface's ifp stored in m_pkthdr.rcvif.
To handle lagg packets, we must identify that the passed rcvif is
currently a lagg port by checking for IFT_IEEE8023ADLAG or
IFT_INFINIBANDLAG (since lagg changes the lagg port's type to that
when an interface becomes a lagg member). Then we need to find the
lagg's ifp, and handle any BPF listeners on the lagg.
Note: It is possible to have multiple BPF filters, one on a member
port and one on the lagg itself. That is why we have to have 2
checks and 2 ETHER_BPF_MTAPs.
Reviewed by: jhb, rrs
[2 lines not shown]
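Illustration of the two-tap logic described in the lagg BPF commit above. This is a sketch with hypothetical types: struct ifnet_s, the enum values, and tap() are stand-ins, not the kernel's struct ifnet, IFT_* constants, or ETHER_BPF_MTAP().

```c
#include <stddef.h>

/* Hypothetical interface types; the kernel uses IFT_* constants. */
enum if_type_s { T_ETHER, T_IEEE8023ADLAG, T_INFINIBANDLAG };

struct ifnet_s {
	enum if_type_s  type;
	struct ifnet_s *lagg_parent;	/* set while the port is a lagg member */
	int             bpf_listeners;
	int             tapped;		/* packets delivered to listeners */
};

/* Stand-in for handing a packet to an interface's BPF listeners. */
static void
tap(struct ifnet_s *ifp)
{
	if (ifp->bpf_listeners > 0)
		ifp->tapped++;
}

/* Tap the receiving port, then the lagg above it if the port is a member. */
static void
lro_tap(struct ifnet_s *rcvif)
{
	tap(rcvif);
	if ((rcvif->type == T_IEEE8023ADLAG ||
	    rcvif->type == T_INFINIBANDLAG) && rcvif->lagg_parent != NULL)
		tap(rcvif->lagg_parent);
}
```

Both checks are needed because a listener may be attached to the member port, to the lagg, or to both.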
lagg: fix lagg ifioctl after SIOCSIFCAPNV
Lagg was broken by SIOCSIFCAPNV when all underlying devices
support SIOCSIFCAPNV. This change updates lagg to work with
SIOCSIFCAPNV and if_capabilities2.
Reviewed by: kib, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35865
pmcstat: fix log analysis
pmcstat has been broken for analyzing logs since D35342 / b6e28991bf3aadb.
This is because the pmc for the first CPU is not added when reading logs
because unlike its clones, its event id is not invalid. That causes us
to fail the assertion at lib/libpmcstat/libpmcstat_logging.c:293
when encountering samples from cpu0.
Fix this by removing the check that the PMC is invalid.
Reviewed by: tsoome
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35709
lacp: Remove racy kassert
In lacp_select_tx_port_by_hash(), we assert that the selected port is
DISTRIBUTING. However, the port state is protected by the LACP_LOCK(),
which is not held around lacp_select_tx_port_by_hash(). So this
assertion is racy, and can result in a spurious panic when links
are flapping.
It is certainly possible to fix it by acquiring LACP_LOCK(),
but this seems like an early development assert, and it seems best
to just remove it, rather than add complexity inside an ifdef
INVARIANTS.
Sponsored by: Netflix
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D35396
cxgbe: fix enabling lro & rxtimestamps
A recent change caused iq flags, like LRO, to be set before
init_iq(). However, init_iq() clears those flags, so they
became effectively impossible to set. This change moves
the initialization of these flags to after the call to init_iq().
This fixes LRO.
Differential Revision: https://reviews.freebsd.org/D30460
Reviewed by: np, rrs
Sponsored by: Netflix
Fixes: 43bbae19483fbde0a91e61acad8a6e71e334c8b8
(cherry picked from commit df8437a93dd5268e5bfd06411c01a5cbdb38c6ac)
(cherry picked from commit 392d7f026962b273cdcd3b230403efaa05f29efe)
Approved by: re@ (gjb@)
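Illustration of the ordering bug described in the cxgbe commit above. This is a sketch, not driver code: struct iq, iq_init(), and the flag values are hypothetical, and iq_init() is assumed to reset the whole structure the way the commit says init_iq() clears the flags.

```c
#include <stdint.h>
#include <string.h>

#define IQ_LRO_ENABLED	0x1u	/* hypothetical flag values */
#define IQ_RX_TIMESTAMP	0x2u

struct iq {
	uint32_t flags;
	int      qsize;
};

/* Resets the whole structure, flags included. */
static void
iq_init(struct iq *iq, int qsize)
{
	memset(iq, 0, sizeof(*iq));
	iq->qsize = qsize;
}

/* Broken order: the flag is set, then wiped by the init call. */
static void
setup_wrong(struct iq *iq)
{
	iq->flags |= IQ_LRO_ENABLED;
	iq_init(iq, 1024);
}

/* Fixed order: initialize first, then set the flags. */
static void
setup_fixed(struct iq *iq)
{
	iq_init(iq, 1024);
	iq->flags |= IQ_LRO_ENABLED | IQ_RX_TIMESTAMP;
}
```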
namei: Treat non-tied KLDs as if they had INVARIANTS enabled
When working with a vendor to debug their kernel module,
I found that a non-tied kld which uses NDINIT will panic
due to "namei: bad debugflags " on a kernel compiled with
INVARIANTS because non-tied KLDs do not pick up the
initialization that is done in NDINIT_DBG/NDREINIT_DBG().
Fix this by making this initialization happen for non-KLD_TIED
as well as INVARIANTS.
Reviewed by: mjg
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34588
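Illustration of the conditional used in the namei fix above. This is a sketch of the preprocessor pattern only: LOOKUP_INIT, LOOKUP_INIT_DBG, DBG_INITED and struct lookup_args are hypothetical names, not the real macros from the kernel headers.

```c
#include <stddef.h>

#define DBG_INITED 0x1	/* hypothetical "was initialized" marker */

struct lookup_args {
	int         debugflags;
	const char *path;
};

/*
 * Do the debug initialization when the kernel has INVARIANTS, and also
 * whenever the module is not tied to a specific kernel build, since such
 * a module may be loaded into an INVARIANTS kernel later.
 */
#if defined(INVARIANTS) || !defined(KLD_TIED)
#define LOOKUP_INIT_DBG(a)	((a)->debugflags = DBG_INITED)
#else
#define LOOKUP_INIT_DBG(a)	((void)0)
#endif

#define LOOKUP_INIT(a, p) do {		\
	(a)->path = (p);		\
	LOOKUP_INIT_DBG(a);		\
} while (0)

/* What a checking kernel would verify on entry. */
static int
lookup_check(const struct lookup_args *a)
{
	return ((a->debugflags & DBG_INITED) != 0);
}
```

Compiled with neither INVARIANTS nor KLD_TIED defined, which is the non-tied module case, LOOKUP_INIT() still sets the marker, so lookup_check() passes.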
Fix a memory leak when ip_output_send() returns EAGAIN due to send tag issues
When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and led to leaked mbufs.
This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When lots of NIC output drops
triggered ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due to ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.
Many thanks to jhb, rrs, hselasky and markj for their help in
debugging this.
[7 lines not shown]
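Illustration of the ownership contract described in the ip_output_send() commit above: once called, the output routine owns the buffer on every path, including early error returns. This is a userspace sketch; struct buf, the allocation counter, dev_output(), and output_send() are hypothetical and not the network stack's functions.

```c
#include <errno.h>
#include <stdlib.h>

struct buf {
	char data[64];
};

static int live_bufs;	/* outstanding allocations, for leak checking */

static struct buf *
buf_alloc(void)
{
	struct buf *b = calloc(1, sizeof(*b));

	if (b != NULL)
		live_bufs++;
	return (b);
}

static void
buf_free(struct buf *b)
{
	if (b != NULL) {
		free(b);
		live_bufs--;
	}
}

/* Hypothetical driver output routine: always consumes b. */
static int
dev_output(struct buf *b)
{
	buf_free(b);
	return (0);
}

/* Wrapper with the same contract: b is consumed on every return path. */
static int
output_send(struct buf *b, int tag_valid)
{
	if (!tag_valid) {
		buf_free(b);	/* without this, the early return leaks b */
		return (EAGAIN);
	}
	return (dev_output(b));
}
```

Callers never free the buffer after calling output_send(), so a wrapper that returns early without freeing it has no one left to clean up.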
tcp: fix leaks in tcp_chg_pacing_rate error paths
tcp_chg_pacing_rate() is expected to release the hw rate limit table,
but failed to do so in several error cases, leading to ever
increasing counts of flows using the rate.
This patch was mostly done by rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34058
Reviewed by: hselasky, rrs, jhb (initial version, outside of Differential)
Make hwpmc work for userspace binaries again
hwpmc has been utterly broken for userspace binaries, and has been
labeling all samples from userspace binaries as dubious frames. The
issues are that:
- The check for ph.p_offset & (-ph.p_align) == 0 was mostly bogus. The
intent was to ignore all executable segments other than the first,
which when using BFD appeared in the first page, but with current LLD
a read-only data segment appears before the executable segment,
pushing the latter into the second page or later. This meant no
executable segment was ever found, and thus pi_vaddr remained
0. Instead of relying on BFD's layout, track whether we've seen an
executable segment explicitly with a local bool.
- Shared libraries were not parsing the segments to calculate pi_vaddr,
resulting in it always being 0. Again, when using BFD, the executable
segment started at the first page, and so pi_vaddr was genuinely
meant to be 0, but not with LLD's current layout. This meant that
[19 lines not shown]
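Illustration of the "track the first executable segment with a local bool" approach from the hwpmc commit above. This is a sketch over a hypothetical, simplified segment table: struct seg and first_exec_vaddr() are not hwpmc code, and recording the raw segment address is a simplification of however pi_vaddr is really derived.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_LOAD 1u	/* same value as ELF PT_LOAD */
#define SEG_EXEC 1u	/* same value as ELF PF_X */

/* Simplified program header for this sketch. */
struct seg {
	uint32_t type;
	uint32_t flags;
	uint64_t vaddr;
	uint64_t offset;
};

/*
 * Store the address of the first executable loadable segment, wherever
 * it sits in the file, instead of assuming it lives in the first page.
 * Returns false if no executable segment exists.
 */
static bool
first_exec_vaddr(const struct seg *segs, size_t n, uint64_t *vaddr)
{
	bool seen_exec = false;

	for (size_t i = 0; i < n; i++) {
		if (segs[i].type != SEG_LOAD ||
		    (segs[i].flags & SEG_EXEC) == 0)
			continue;
		if (!seen_exec) {
			*vaddr = segs[i].vaddr;
			seen_exec = true;
		}
	}
	return (seen_exec);
}
```

With an LLD-style layout, where a read-only segment comes first, the function skips that segment and reports the executable one that follows it.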
ktls: Init reset tag task for cloned sessions
When cloning a ktls session (which is needed when we need to
switch output NICs for a NIC TLS session), we need to also
init the reset task, like we do when creating a new tls session.
Reviewed by: jhb
Sponsored by: Netflix
(cherry picked from commit 95c51fafa40d56d0a32aff857261097acc65ec92)