Merge tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Fix devm_alloc_workqueue() passing a va_list as a positional arg to
the variadic alloc_workqueue() macro, which garbled wq->name and
skipped lockdep init on the devm path. Fold both noprof entry points
onto a va_list helper.
Also, annotate it using __printf(1, 0)
* tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Annotate alloc_workqueue_va() with __printf(1, 0)
workqueue: fix devm_alloc_workqueue() va_list misuse
Merge tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- During v6.19, cgroup task unlink was moved from do_exit() to after the
final task switch to satisfy a controller invariant. That left the kernel
seeing tasks past exit_signals() longer than userspace expected, and
several v7.0 follow-ups tried to bridge the gap by making rmdir wait for
the kernel side. None held up.
The latest is an A-A deadlock when rmdir is invoked by the reaper of
zombies whose pidns teardown the rmdir itself is waiting on, which
points at the synchronizing approach being fundamentally wrong.
Take a different approach: drop the wait, leave rmdir's user-visible
side returning as soon as cgroup.procs is empty, and defer the css
percpu_ref kill that drives ->css_offline() until the cgroup is fully
depopulated.
[16 lines not shown]
Merge tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr
when the BPF caller's allowed mask was wider. Stable backport.
- Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode
versus the global mode:
- Tasks past exit_signals() are filtered by the cgroup walk but kept
by global. Sub-scheduler enable abort leaked __scx_init_task()
state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task
iterator (scx_task_iter is its only user) and use it.
- Tasks past sched_ext_dead() are still returned, tripping
WARN_ON_ONCE() in callers or making them touch torn-down state.
Mark and skip under the per-task rq lock.
[4 lines not shown]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"All in drivers.
The largest change is the ufs one which has to introduce a new
function to check the power state before doing the update and the most
widely encountered one is the obvious change to sg to not use
GFP_ATOMIC"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: iscsi: reject invalid size Extended CDB AHS
scsi: ufs: core: Fix bRefClkFreq write failure in HS-LSS mode
scsi: hisi_sas: Fix sparse warnings in prep_ata_v3_hw()
scsi: pmcraid: Fix typo in comments
scsi: scsi_dh_alua: Increase default ALUA timeout to maximum spec value
scsi: smartpqi: Silence a recursive lock warning
scsi: mpt3sas: Limit NVMe request size to 2 MiB
scsi: sg: Don't use GFP_ATOMIC in sg_start_req()
scsi: target: configfs: Bound snprintf() return in tg_pt_gp_members_show()
Merge tag 'fbdev-for-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev fixes from Helge Deller:
"Four small patches for fbdev, of which two are important: One fixes
the bitmap font generation and the other prevents a possible
use-after-free in udlfb:
- Fix rotating fonts by 180 degrees (Thomas Zimmermann)
- Drop duplicate include of linux/module.h in fb_defio (Chen Ni)
- Add vm_ops in udlfb to prevent use-after-free (Rajat Gupta)
- ipu-v3: clean up kernel-doc warnings (Randy Dunlap)"
* tag 'fbdev-for-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: udlfb: add vm_ops to dlfb_ops_mmap to prevent use-after-free
lib/fonts: Fix bit position when rotating by 180 degrees
fbdev: defio: Remove duplicate include of linux/module.h
fbdev: ipu-v3: clean up kernel-doc warnings
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
- Several error unwind misses on system calls in mlx5, mana, ocrdma,
vmw_pvrdma, mlx4, and hns
- More rxe bugs processing network packets
- User triggerable races in mlx5 when destroying and creating the same
same object when the FW returns the same object ID
- Incorrect passing of an IPv6 address through netlink
RDMA_NL_LS_OP_IP_RESOLVE
- Add memory ordering for mlx5's lock avoidance pattenr
- Protect mana from kernel memory overflow
[24 lines not shown]
Merge tag 'media/v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- rc: ttusbir: fix inverted error logic
- Venus/Iris fixes:
- Kconfig cross compile build testing for x86
- Use-after-free fix for internal buffers
- dma_free_attrs size fix
- Switch to hardware mode clocks
- Use-after-free fix for a concurrency path
- Fix H265D_MAX_SLICE size for sc7280 devices
- camoss: fix some clock-related issues
* tag 'media/v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: qcom: camss: avoid format string warning
media: qcom: camss: Add missing clocks for VFE lite on sa8775p
[9 lines not shown]
Merge tag 'linux_kselftest-fixes-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
- Fix extra test number increment in ksft_exit_skip() that results in
incorrect KTAP result
- Fix regression introduced by addition of explicit constructor orders
for fixture tests. This addition broke the ordering of those relative
to non-fixture tests and the reverse-constructor-order detection
* tag 'linux_kselftest-fixes-7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: harness: Restore order of test functions
selftests: kselftest: fix wrong test number in ksft_exit_skip
docs: cgroup-v1: Update charge-commit section
Commit 1d8f136a421f ("memcg/hugetlb: remove memcg hugetlb
try-commit-cancel protocol") removed mem_cgroup_commit_charge() and
mem_cgroup_cancel_charge(), but the docs still refer to those functions.
There is no longer any charge cancellation.
Update the docs to match the code.
Signed-off-by: T.J. Mercier <tjmercier at google.com>
Signed-off-by: Tejun Heo <tj at kernel.org>
sched_ext: idle: Recheck prev_cpu after narrowing allowed mask
scx_select_cpu_dfl() narrows @allowed to @cpus_allowed & @p->cpus_ptr
when the BPF caller supplies a @cpus_allowed that differs from
@p->cpus_ptr and @p doesn't have full affinity. However,
@is_prev_allowed was computed against the original (wider)
@cpus_allowed, so the prev_cpu fast paths could pick a @prev_cpu that
is in @cpus_allowed but not in @p->cpus_ptr, violating the intended
invariant that the returned CPU is always usable by @p. The kernel
masks this via the SCX_EV_SELECT_CPU_FALLBACK fallback, but the
behavior contradicts the documented contract.
Move the @is_prev_allowed evaluation past the narrowing block so it
tests against the final @allowed mask.
Fixes: ee9a4e92799d ("sched_ext: idle: Properly handle invalid prev_cpu during idle selection")
Cc: stable at vger.kernel.org # v6.16+
Assisted-by: Claude <noreply at anthropic.com>
Signed-off-by: David Carlier <devnexen at gmail.com>
[2 lines not shown]
Merge tag 'for-linus-7.1-2' of https://github.com/cminyard/linux-ipmi
Pull IPMI fixes from Corey Minyard:
"Fix a number of issues that came up recently
The first two fixes are workarounds for buggy IPMI hardware. The
hardware says it has data for the IPMI driver to read constantly, so
the driver reads the data constantly, causing any new requests to be
blocked.
The first fix was to check for invalid data right when the data was
read from the device and stop the operation there (there was a later
check for invalid data, but it could not stop the operation at that
point). It turned out the device was providing good data, so that
didn't fix the issue, but it's still a good check.
The second fix stops fetching this data after a few fetches and allows
other operations to occur. The driver won't work very well, but at
least it won't wedge. This seems to fix the issue.
[13 lines not shown]
sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked()
scx_task_iter's cgroup-scoped mode can return tasks whose
sched_ext_dead() has already completed: cgroup_task_dead() removes
from cset->tasks after sched_ext_dead() in finish_task_switch() and is
irq-work deferred on PREEMPT_RT. The global mode is fine -
sched_ext_dead() removes from scx_tasks via list_del_init() first.
Callers (sub-sched enable prep/abort/apply, scx_sub_disable(),
scx_fail_parent()) assume returned tasks are still on @sch and trip
WARN_ON_ONCE() or operate on torn-down state otherwise.
Set %SCX_TASK_OFF_TASKS in sched_ext_dead() under @p's rq lock and
have scx_task_iter_next_locked() skip flagged tasks under the same
lock. Setter and reader serialize on the per-task rq lock - no race.
Signed-off-by: Tejun Heo <tj at kernel.org>
cgroup, sched_ext: Include exiting tasks in cgroup iter
a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") made
css_task_iter_advance() skip exiting tasks so cgroup.procs stays consistent
with waitpid() visibility. Unfortunately, this broke scx_task_iter.
scx_task_iter walks either scx_tasks (global) or a cgroup subtree via
css_task_iter() and the two modes are expected to cover the same set of
tasks. After the above change the cgroup-scoped mode silently skips tasks
past exit_signals() that are still on scx_tasks.
scx_sub_enable_workfn()'s abort path is one of the symptoms: an exiting
SCX_TASK_SUB_INIT task can race past the cgroup iter leaking
__scx_init_task() state. Other iterations share the same gap.
Add CSS_TASK_ITER_WITH_DEAD to opt out of the skip and use it from
scx_task_iter().
Fixes: b0e4c2f8a0f0 ("sched_ext: Implement cgroup subtree iteration for scx_task_iter")
[2 lines not shown]
cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
A chain of commits going back to v7.0 reworked rmdir to satisfy the
controller invariant that a subsystem's ->css_offline() must not run while
tasks are still doing kernel-side work in the cgroup.
[1] d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out")
[2] a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup")
[3] 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir")
[4] 4c56a8ac6869 ("cgroup: Fix cgroup_drain_dying() testing the wrong condition")
[5] 13e786b64bd3 ("cgroup: Increment nr_dying_subsys_* from rmdir context")
[1] moved task cset unlink from do_exit() to finish_task_switch() so a
task's cset link drops only after the task has fully stopped scheduling.
That made tasks past exit_signals() linger on cset->tasks until their final
context switch, which led to a series of problems as what userspace expected
to see after rmdir diverged from what the kernel needs to wait for. [2]-[5]
tried to bridge that divergence: [2] filtered the exiting tasks from
cgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]
[54 lines not shown]
fbdev: udlfb: add vm_ops to dlfb_ops_mmap to prevent use-after-free
dlfb_ops_mmap() uses remap_pfn_range() to map vmalloc framebuffer pages
to userspace but sets no vm_ops on the VMA. This means the kernel cannot
track active mmaps. When dlfb_realloc_framebuffer() replaces the backing
buffer via FBIOPUT_VSCREENINFO, existing mmap PTEs are not invalidated.
On USB disconnect, dlfb_ops_destroy() calls vfree() on the old pages
while userspace PTEs still reference them, resulting in a use-after-free:
the process retains read/write access to freed kernel pages.
Add vm_operations_struct with open/close callbacks that maintain an
atomic mmap_count on struct dlfb_data. In dlfb_realloc_framebuffer(),
check mmap_count and return -EBUSY if the buffer is currently mapped,
preventing buffer replacement while userspace holds stale PTEs.
Tested with PoC using dummy_hcd + raw_gadget USB device emulation.
Signed-off-by: Rajat Gupta <rajgupt at qti.qualcomm.com>
Acked-by: Greg Kroah-Hartman <gregkh at linuxfoundation.org>
[2 lines not shown]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Three bug fixes for x86:
- Check that nEPT/nNPT is enabled in slow flush hypercalls. If it is
not, the hypercalls can be processed as usual even while running a
nested guest
- Fix shadow paging use-after-free due to page tables changing
outside execution of the guest. A bug that is 16 years old and
stems from an imprecision in the very first KVM series
- Scan IRR whenever PID.ON is true, even if PIR is empty, which
avoids a somewhat rare WARN"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Fix shadow paging use-after-free due to unexpected GFN
KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow
[2 lines not shown]
KVM: x86: Fix shadow paging use-after-free due to unexpected GFN
The shadow MMU computes GFNs for direct shadow pages using sp->gfn plus
the SPTE index. This assumption breaks for shadow paging if the guest
page tables are modified between VM entries (similar to commit
aad885e77496, "KVM: x86/mmu: Drop/zap existing present SPTE even
when creating an MMIO SPTE", 2026-03-27). The flow is as follows:
- a PDE is installed for a 2MB mapping, and a page in that area is
accessed. KVM creates a kvm_mmu_page consisting of 512 4KB pages;
the kvm_mmu_page is marked by FNAME(fetch) as direct-mapped because
the guest's mapping is a huge page (and thus contiguous).
- the PDE mapping is changed from outside the guest.
- the guest accesses another page in the same 2MB area. KVM installs
a new leaf SPTE and rmap entry; the SPTE uses the "correct" GFN
(i.e. based on the new mapping, as changed in the previous step) but
that GFN is outside of the [sp->gfn, sp->gfn + 511] range; therefore
[37 lines not shown]
KVM: x86: Fix misleading variable names and add more comments for PIR=>IRR flow
Rename kvm_apic_update_irr()'s "irr_updated" and vmx_sync_pir_to_irr()'s
"got_posted_interrupt" to a more accurate "max_irr_is_from_pir", as neither
"irr_updated" nor "got_posted_interrupt" is accurate.
__kvm_apic_update_irr() and thus kvm_apic_update_irr() specifically return
true if and only if the highest priority IRQ, i.e. max_irr, is a "new"
pending IRQ from the PIR. I.e. it's possible for the IRR to be updated,
i.e. for a posted IRQ to be "got", *without* the APIs returning true.
Expand vmx_sync_pir_to_irr()'s comment to explain why it's necessary to
set KVM_REQ_EVENT only if a "new" IRQ was found, and to explain why it's
safe to do so only if a new IRQ is also the highest priority pending IRQ.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc at google.com>
Link: https://patch.msgid.link/20260503201703.108231-3-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini at redhat.com>
KVM: x86: Do IRR scan in __kvm_apic_update_irr even if PIR is empty
Fall back to apic_find_highest_vector() when PID.ON is set but PIR
turns out to be empty, to correctly report the highest pending interrupt
from the existing IRR.
In a nested VM stress test, the following WARNING fires in
vmx_check_nested_events() when kvm_cpu_has_interrupt() reports a pending
interrupt but the subsequent kvm_apic_has_interrupt() (which invokes
vmx_sync_pir_to_irr() again) returns -1:
WARNING: CPU: 99 PID: 57767 at arch/x86/kvm/vmx/nested.c:4449 vmx_check_nested_events+0x6bf/0x6e0 [kvm_intel]
Call Trace:
kvm_check_and_inject_events
vcpu_enter_guest.constprop.0
vcpu_run
kvm_arch_vcpu_ioctl_run
kvm_vcpu_ioctl
__x64_sys_ioctl
[40 lines not shown]
KVM: x86: check for nEPT/nNPT in slow flush hypercalls
Checking is_guest_mode(vcpu) is incorrect, because translate_nested_gpa()
is only valid if an L2 guest is running *with nested EPT/NPT enabled*.
Instead use the same condition as translate_nested_gpa() itself.
Cc: stable at vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc at google.com>
Fixes: aee738236dca ("KVM: x86: Prepare kvm_hv_flush_tlb() to handle L2's GPAs", 2022-11-18)
Link: https://patch.msgid.link/20260503200905.106077-1-pbonzini@redhat.com/
Signed-off-by: Paolo Bonzini <pbonzini at redhat.com>
Merge tag 'sh-for-v7.1-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux
Pull sh fix from John Paul Adrian Glaubitz:
"The ZERO_PAGE consolidation in v7.1, introduced a regression on sh
which made these systems unbootable.
The problem was that on sh, the initial boot parameters were
previously referenced as an array and after 6215d9f4470f ("arch, mm:
consolidate empty_zero_page"), they were referenced as a pointer which
caused wrong code generation and boot hang.
This changes the declaration back to being an array which fixes the
boot hang"
* tag 'sh-for-v7.1-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/glaubitz/sh-linux:
sh: Fix fallout from ZERO_PAGE consolidation
Merge tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- Stable fixes for CONFIG_SMP=n where _nolock() allocations in NMI both
at kmalloc and page allocator levels are not properly protected by
the spin_trylock() semantics on !SMP (Harry Yoo)
* tag 'slab-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: return NULL early from kmalloc_nolock() in NMI on UP
mm/page_alloc: return NULL early from alloc_frozen_pages_nolock() in NMI on UP
Merge tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix lockup in requeue-PI during signal/timeout wakeups, by Sebastian
Andrzej Siewior"
* tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Prevent lockup in requeue-PI during signal/ timeout wakeup
Merge tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix the delayed dequeue negative lag increase fix in the
fair scheduler (Peter Zijlstra)
- Fix wakeup_preempt_fair() to do proper delayed dequeue
(Vincent Guittot)
- Clear sched_entity::rel_deadline when initializing
forked entities, which bug can cause all tasks to be
EEVDF-ineligible, causing a NULL pointer dereference
crash in pick_next_entity() (Zicheng Qu)
* tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Clear rel_deadline when initializing forked entities
sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
sched/fair: Fix the negative lag increase fix
sh: Fix fallout from ZERO_PAGE consolidation
Consolidation of empty_zero_page declarations broke boot on sh.
sh stores its initial boot parameters in a page reserved in
arch/sh/kernel/head_32.S. Before commit 6215d9f4470f ("arch, mm:
consolidate empty_zero_page") this page was referenced in C code
as an array and after that commit it is referenced as a pointer.
This causes wrong code generation and boot hang.
Declare boot_params_page as an array to fix the issue.
Reported-by: Thomas Weißschuh <thomas.weissschuh at linutronix.de>
Tested-by: Thomas Weißschuh <thomas.weissschuh at linutronix.de>
Fixes: 6215d9f4470f ("arch, mm: consolidate empty_zero_page")
Signed-off-by: Mike Rapoport (Microsoft) <rppt at kernel.org>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz at physik.fu-berlin.de>
Tested-by: Geert Uytterhoeven <geert+renesas at glider.be>
[2 lines not shown]
Merge tag 'v7.1-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
- Reject algorithms with authsizes that are too short in authencesn
* tag 'v7.1-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: authencesn - reject short ahash digests during instance creation
Merge tag 'ntfs-for-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/ntfs
Pull ntfs fixes from Namjae Jeon:
- Fix a NULL pointer dereference in ntfs_index_walk_down() by
validating index block allocation
- Fix a memory leak of the symlink target string in
ntfs_reparse_set_wsl_symlink() during error paths
- Prevent VCN overflow and validate lowest_vcn in
ntfs_mapping_pairs_decompress() to avoid runlist corruption
- Fix a page reference leak in ntfs_write_iomap_end_resident()
when attribute search context allocation fails
- Fix an invalid PTR_ERR() usage on a valid folio pointer in
__ntfs_bitmap_set_bits_in_run()
[14 lines not shown]
RDMA/hns: Fix xarray race in hns_roce_create_qp_common()
Similar to the SRQ case the hr_qp is stored in the xarray before it is
fully initialized. Unlike the SRQ case the error unwinds do not wait for
the completion so keep the refcount 0 until the function succeeds.
Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Link: https://patch.msgid.link/r/14-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Suggested-by: Junxian Huang <huangjunxian6 at hisilicon.com>
Reviewed-by: Junxian Huang <huangjunxian6 at hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg at nvidia.com>
RDMA/mlx5: Restore zero-init to mlx5_ib_modify_qp() ucmd
Sashiko points out the check for inlen==0 got missed, the ={} was not
redundant, put it back.
Fixes: a9cd442a5347 ("RDMA: Remove redundant = {} for udata req structs")
Link: https://patch.msgid.link/r/2-v1-41f3135e5565+9d2-rdma_ai_fixes1_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg at nvidia.com>