Merge tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter.
Previous releases - regressions:
- ethtool: fix NULL pointer dereference in phy_reply_size
- netfilter:
- allocate hook ops while under mutex
- close dangling table module init race
- restore nf_conntrack helper propagation via expectation
- tcp:
- fix potential UAF in reqsk_timer_handler().
- fix out-of-bounds access for twsk in tcp_ao_established_key().
- vsock: fix empty payload in tap skb for non-linear buffers
[58 lines not shown]
Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore:
- Correctly log the inheritable capabilities
- Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands
* tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
audit: fix incorrect inheritable capability in CAPSET records
ptrace: slightly saner 'get_dumpable()' logic
The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.
And almost all users do in fact use it only for the case where the task
has a mm pointer.
But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).
It's not what this flag was designed for, but it is what it is.
The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
[12 lines not shown]
net: phy: DP83TC811: add reading of abilities
At this time the driver is not listing any speeds
it supports. This should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT
for DP83TC811. Add the missing call for phylib to read the abilities.
Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy")
Suggested-by: Andrew Lunn <andrew at lunn.ch>
Signed-off-by: Sven Schuchmann <schuchmann at schleissheimer.de>
Reviewed-by: Andrew Lunn <andrew at lunn.ch>
Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni at redhat.com: dropped revision history]
Signed-off-by: Paolo Abeni <pabeni at redhat.com>
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:
struct sk_msg_sg {
[...]
/* The extra two elements:
* 1) used for chaining the front and sections when the list becomes
* partitioned (e.g. end < start). The crypto APIs require the
* chaining;
* 2) to chain tailer SG entries after the message.
*/
struct scatterlist data[MAX_MSG_FRAGS + 2];
The current code uses MAX_SKB_FRAGS + 1 as the ring size:
[24 lines not shown]
net: tls: prevent chain-after-chain in plain text SG
Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to content type right after
the wrap link:
This would create a chain where the wrap link points directly
to another chain link. The scatterlist API sg_next iterator
does not recursively resolve consecutive chain links.
meaning this is illegal input to crypto.
The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:
end start i
v v v
[ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap]
[15 lines not shown]
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.
smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:
BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
Write of size 4 at addr 0000000000000068 by task exploit/144
Call Trace:
_raw_spin_lock_bh
[15 lines not shown]
Merge branch 'macsec-use-rcu_work-to-fix-crypto-cleanup-in-softirq-context'
Jinliang Zheng says:
====================
macsec: use rcu_work to fix crypto cleanup in softirq context
From: Jinliang Zheng <alexjlzheng at tencent.com>
crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs()
in hardware crypto drivers like hisi_sec2), which must not be invoked from
softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that
run in softirq, causing a kernel crash on affected hardware.
This series fixes the issue by deferring the actual cleanup to a workqueue
using rcu_work, which combines the RCU grace period and workqueue dispatch
into a single primitive.
Two design decisions worth noting:
[31 lines not shown]
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
free_txsa() is an RCU callback running in softirq context, but calls
crypto_free_aead() which can invoke vunmap() internally on hardware
crypto drivers (e.g. hisi_sec2), triggering a kernel crash.
Use rcu_work to defer the cleanup to a workqueue, for the same reasons
as the analogous fix to free_rxsa() in the previous patch.
Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Jinliang Zheng <alexjlzheng at tencent.com>
Reviewed-by: Sabrina Dubroca <sd at queasysnail.net>
Link: https://patch.msgid.link/20260511153102.2640368-4-alexjlzheng@tencent.com
Signed-off-by: Jakub Kicinski <kuba at kernel.org>
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
crypto_free_aead() can internally invoke vunmap() (e.g. via
dma_free_attrs() in hardware crypto drivers such as hisi_sec2).
vunmap() must not be called from softirq context, but free_rxsa()
is an RCU callback that runs in softirq, leading to a kernel crash:
vunmap+0x4c/0x70
__iommu_dma_free+0xd0/0x138
dma_free_attrs+0xf4/0x100
sec_aead_exit+0x64/0xb8 [hisi_sec2]
crypto_destroy_tfm+0x98/0x110
free_rxsa+0x28/0x50 [macsec]
rcu_do_batch+0x184/0x460
rcu_core+0xf4/0x1f8
handle_softirqs+0x118/0x330
Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches
the worker asynchronously after the RCU grace period, so no thread
[8 lines not shown]
macsec: introduce dedicated workqueue for SA crypto cleanup
Introduce a dedicated ordered workqueue, macsec_wq, which will be used
by subsequent patches to defer SA crypto cleanup (crypto_free_aead and
related teardown) out of softirq context.
Using a dedicated workqueue instead of system_wq allows macsec_exit()
to drain exactly the work items belonging to this module via
destroy_workqueue(), without interfering with unrelated work items on
system_wq or causing unexpected delays elsewhere.
rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks
have enqueued their work items before destroy_workqueue() drains and
destroys the queue, making the two-step teardown correct and complete.
The same sequence is kept in the error path of macsec_init() as a
precaution, to mirror macsec_exit() and stay safe if work ever becomes
queueable before this point in the future.
While at it, rename the error labels in macsec_init() from the
[8 lines not shown]
net: net_failover: Fix the deadlock in slave register
There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().
failover_slave_register() in failover_existing_slave_register() adds lock
and unlock ops too.
Call Trace:
<TASK>
__schedule+0x30d/0x7a0
schedule+0x27/0x90
schedule_preempt_disabled+0x15/0x30
__mutex_lock.constprop.0+0x538/0x9e0
__mutex_lock_slowpath+0x13/0x20
mutex_lock+0x3b/0x50
dev_set_mtu+0x40/0xe0
net_failover_slave_register+0x24/0x280
failover_slave_register+0x103/0x1b0
[14 lines not shown]
MAINTAINERS: update atlantic driver maintainer
Igor Russkikh and Egor Pomozov have left Marvell.
Take over maintenance of the atlantic driver and its PTP subsystem.
Signed-off-by: Sukhdeep Singh <sukhdeeps at marvell.com>
Signed-off-by: Jakub Kicinski <kuba at kernel.org>
selftests/tc-testing: Add QFQ/CBS qlen underflow test
Since CBS was not calling reset for its child qdisc, there are scenarios
where it could cause an underflow on its parent's qlen/backlog. When the
parent is QFQ, a null-ptr deref could occur.
Add a test case that reproduces the underflow followed by a null-ptr
deref scenario.
Acked-by: Jamal Hadi Salim <jhs at mojatatu.com>
Signed-off-by: Victor Nogueira <victor at mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba at kernel.org>
FDDI: defza: Sanitise the reset safety timer
The reset actions of the DEFZA adapters are exceedingly slow, taking up
to 30 seconds to complete by the device spec and typically in the range
of 10 seconds in reality, as required for the device RTOS to boot, still
quite a lot. Therefore a state machine is used that's interrupt driven,
however a safety mechanism is required in case of adapter malfunction,
so that if no state change interrupt has arrived in time, then the
situation is taken care of.
The safety mechanism depends on the origin of the reset. For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event. In that case no sleep is possible as a device halt is reported
at the hardirq level.
A timer is therefore set up to ensure progress in case no adapter state
[118 lines not shown]
Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"The bulk of this is hardening of the new sub-scheduler infrastructure.
- UAFs and lifecycle bugs on the sub-sched attach/detach paths:
parent sub_kset freed under a racing child, list_del_rcu on an
uninitialized list head, ops->priv stomped by concurrent
attach/detach, and a UAF in the init-failure error path
- Task state-machine reorg closing concurrent enable-vs-dead races: a
task exiting during the unlocked init window could trip NULL ops
derefs or skip exit_task() cleanup
- A scx_link_sched() self-deadlock on scx_sched_lock
- isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
cpumask without RCU, and stop rejecting BPF schedulers when only
cpuset isolated partitions are active
[26 lines not shown]
Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cpuset fixes:
- Partition invalidation could return CPUs still in use by sibling
partitions, producing overlapping effective_cpus
- cpuset_can_attach() over-reserved DL bandwidth on moves that
stayed within the same root domain
- Pending DL migration state leaked into later attaches when a
later can_attach() check failed
- Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
allocate from any node and exit quickly
- dmem: propagate -ENOMEM instead of spinning forever when the fallback
pool allocation also fails
- selftests/cgroup: percpu test error-path leak, bogus numeric
comparison of cpuset strings, and a zero-length read() that silently
[11 lines not shown]
Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path
- Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING
set with no owner
* tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path
workqueue: Release PENDING in __queue_work() drain/destroy reject path
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.
Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:
=============================
WARNING: suspicious RCU usage
-----------------------------
kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage!
1 lock held by scx_flash/281:
[33 lines not shown]
cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs
to return to the parent are computed as:
adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus);
where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().
Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.
Reproducer (on a 4-CPU machine):
[30 lines not shown]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Add the pKVM side of the workaround for ARM's erratum 4193714,
provided that the EL3 firmware does its part of the job. KVM will
refuse to initialise otherwise
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0
- Correctly deal with permission faults in guest_memfd memslots
- Fix the steal-time selftest after the infrastructure was reworked
- Make sure the host cannot pass a non-sensical clock update to the
EL2 tracing infrastructure
[72 lines not shown]
selftests/cgroup: Fix error path leaks in test_percpu_basic
When cg_name_indexed() returns NULL partway through the child creation
loop, the code returned -1 without running cleanup_children and cleanup.
That left the `parent` pathname allocation unreleased and did not remove
child cgroup directories already created under the parent. Fix by jumping
to cleanup_children instead of returning.
When cg_create() fails, `child` (the pathname from cg_name_indexed())
was not freed before cleanup_children. Fix by freeing `child` before
branching to cleanup_children.
Fixes: 90631e1dea55 ("kselftests: cgroup: add perpcu memory accounting test")
Signed-off-by: Yu Miao <yumiao at kylinos.cn>
Signed-off-by: Tejun Heo <tj at kernel.org>
Merge tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux
Pull liveupdate fixes from Mike Rapoport:
"A few fixes for kexec handover and liveupdate:
- make sure KHO is skipped for crash kernel
- fix error reporting in memfd preservation if it fails mid-loop
- don't allow preserving memfds whose page count exceeds UINT_MAX
- fix documentation of memfd seals preservation to match the code"
* tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux:
mm/memfd_luo: document preservation of file seals
mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX
mm/memfd_luo: report error when restoring a folio fails mid-loop
kho: skip KHO for crash kernel
KVM: VMX: introduce module parameter to disable CET
There have been reports of host hangs caused by CET virtualization.
Until these are analyzed further, introduce a module parameter that
makes it possible to easily disable it.
Link: https://lore.kernel.org/all/85548beb-1486-40f9-beb4-632c78e3360b@proxmox.com/
Cc: David Riley <d.riley at proxmox.com>
Signed-off-by: Paolo Bonzini <pbonzini at redhat.com>
net: ethernet: ravb: Do not check URAM suspension when WoL is active
When updating the driver to match latest datasheet to suspend access to
URAM when suspending DMA transfers a corner-case was missed, URAM access
will not be suspended if WoL is enabled. This lead to the error message
(correctly) being triggered as URAM access is not suspended even tho
it's requested as part of stopping DMA.
Avoid checking if URAM access is suspended and printing the error
message if WoL is enabled when we suspend the system, as we know it will
not be.
Reported-by: Geert Uytterhoeven <geert at linux-m68k.org>
Closes: https://lore.kernel.org/all/CAMuHMdWnjV%3DHGE1o08zLhUfTgOSene5fYx1J5GG10mB%2BToq8qg@mail.gmail.com/
Fixes: 353d8e7989b6 ("net: ethernet: ravb: Suspend and resume the transmission flow")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas at ragnatech.se>
Reviewed-by: Sai Krishna <saikrishnag at marvell.com>
Tested-by: Geert Uytterhoeven <geert+renesas at glider.be>
Signed-off-by: Jakub Kicinski <kuba at kernel.org>
ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
ethnl_bitmap32_not_zero() should return true if some bit in [start, end)
is set:
- Fix inverted memchr_inv() sense: return true when the scan finds a
non-zero byte, not when the middle words are all zero.
- Return false for an empty interval (end <= start).
- When end is 32-bit aligned, indices in [start, end) do not include any
bits from map[end_word]; return false after earlier checks found no
non-zero data.
Fixes: 10b518d4e6dd ("ethtool: netlink bitset handling")
Signed-off-by: Chenguang Zhao <zhaochenguang at kylinos.cn>
Signed-off-by: Jakub Kicinski <kuba at kernel.org>
net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
The smc_msg_event tracepoint class, shared by smc_tx_sendmsg and
smc_rx_recvmsg, unconditionally dereferences smc->conn.lnk:
__string(name, smc->conn.lnk->ibname)
conn->lnk is only set for SMC-R; for SMC-D it is NULL. Other code on
these paths already handles this (e.g. !conn->lnk in
SMC_STAT_RMB_TX_SIZE_SMALL()). With the tracepoint enabled, the first
sendmsg()/recvmsg() on an SMC-D socket crashes:
Oops: general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [...]
RIP: 0010:strlen+0x1e/0xa0
Call Trace:
trace_event_raw_event_smc_msg_event (net/smc/smc_tracepoint.h:44)
smc_rx_recvmsg (net/smc/smc_rx.c:515)
smc_recvmsg (net/smc/af_smc.c:2859)
[18 lines not shown]
net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
A logic flaw in __smc_setsockopt() allows a local unprivileged user to
cause a Denial of Service (DoS) by holding the socket lock indefinitely.
The function __smc_setsockopt() calls copy_from_sockptr() while holding
lock_sock(sk). By passing a userfaultfd-monitored memory page (or
FUSE-backed memory on systems where unprivileged userfaultfd is disabled)
as the optval, an attacker can halt execution during the copy operation,
keeping the lock held.
Combined with asynchronous tear-down operations like shutdown(), this
exhausts the kernel wq (kworkers) and triggers the hung task watchdog.
[ 240.123456] INFO: task kworker/u8:2 blocked for more than 120 seconds.
[ 240.123489] Call Trace:
[ 240.123501] smc_shutdown+...
[ 240.123512] lock_sock_nested+...
[8 lines not shown]
net: atm: fix skb leak in sigd_send() default branch
The default branch in sigd_send() calls sock_put() and returns -EINVAL
without freeing the skb, while all other exit paths do so. Add the
missing dev_kfree_skb() before sock_put() to fix the leak.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Wei Yang <albinwyang at tencent.com>
Link: https://patch.msgid.link/20260509122358.1102997-1-albin_yang@163.com
Signed-off-by: Jakub Kicinski <kuba at kernel.org>