[Flang][Driver] NFC: Move Flang/Fortran only options to a separate file (#194398)
This PR factors out the Flang/Fortran-only options from Options.td
into a separate file (FlangOptions.td).
Assisted-by: codex
ena: Budget rx descriptors, not packets
We had ENA_RX_BUDGET = 256 in order to allow up to 256 received
packets to be processed before we do other cleanups (handling tx
packets and, critically, refilling the rx buffer ring). Since the
ring holds 1024 buffers by default, this was fine for normal packets:
We refill the ring when it falls below 7/8 full, and even with a large
burst of incoming packets allowing it to fall by another 1/4 before we
consider refilling the ring still leaves it at 7/8 - 1/4 = 5/8 full.
With jumbos, the story is different: A 9k jumbo (as is used by default
within the EC2 network) consumes 3 descriptors, so a single rx cleanup
pass can consume 3/4 of the default-sized rx ring; if the rx buffer
ring wasn't completely full before a packet burst arrives, this puts
us perilously close to running out of rx buffers.
This precise failure mode has been observed on some EC2 instance types
within a Cluster Placement Group, resulting in the nominal 10 Gbps
single-flow throughput between instances dropping to ~100 Mbps as a
[21 lines not shown]
ena: Report RX overrun errors
Extract rx_overruns from the keep alive descriptor reported by
the device and expose it via sysctl hw stats.
RX overrun errors occur when a packet arrives but there are not
enough free buffers in the RX ring to receive it.
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D56640
(cherry picked from commit e3f4a63af63bea70bc86b6c790b14aa5ee99fcd0)
ena: Adjust ena_[rt]x_cleanup to return bool
The ena_[rt]x_cleanup functions are limited internally to a maximum
number of packets; this ensures that TX doesn't starve RX (or vice
versa) and also attempts to ensure that we get a chance to refill
the RX buffer ring before the device runs out of buffers and starts
dropping packets.
Historically these functions have returned the number of packets they
processed, which ena_cleanup compared to their respective budgets
to decide whether to reinvoke them. This is unnecessary complication;
since the precise number of packets processed is never used, adjust
the APIs of those functions to return a bool indicating whether they
want to be reinvoked (i.e., whether they hit their limits).
Since ena_tx_cleanup now only uses work_done if diagnostics are
enabled (the ena_log_io macros expand to nothing otherwise), eliminate that
variable and pass its value (ENA_TX_BUDGET - budget) to ena_log_io
directly.
[9 lines not shown]
[HLSL] Add type traits for ConstantBuffer templates
This commit adds the type traits to restrict the template type in a
ConstantBuffer to structs or classes that do not contain a resource
type.
Assisted-by: Gemini
arm64: Ditch arm64-specific unsound PCPU optimisation
The current arm64 PCPU implementation uses a global register asm
variable to use x18, which we reserve with -ffixed-x18, from C. Inside a
critical_enter() or sched_pin(), it is vital that any PCPU reads use the
right PCPU pointer, as often the whole point of the critical_enter() or
sched_pin() is to ensure consistent PCPU use (e.g. for SMR it relies on
zpcpu giving the same SMR state). critical_enter() and sched_pin() both
include atomic_interrupt_fence(), i.e. asm volatile("" ::: "memory"),
barriers to ensure that memory accesses don't get moved by the compiler
outside the critical section, which on most architectures will also
order the read of the PCPU pointer itself (whether due to the read being
another asm volatile statement, or due to using a segment-relative
memory access as on x86). However, this approach on arm64 is in no sense
a memory access, and therefore the register access is not ordered with
respect to the critical_enter() or sched_pin(), or more specifically
the curthread->td_critnest++ / curthread->td_pinned++ within.
In practice upstream today this works out ok because the read of x18 is
[115 lines not shown]
[AMDGPU] Add readfirstlane for inline asm SGPR with VGPR input (#176330)
SIFixSGPRCopies was incorrectly handling inline assembly operands with
SGPR ("s") constraints when the value came from a memory load (which
produces a VGPR). The pass would fail to insert the necessary
v_readfirstlane instruction and instead passed the VGPR value directly.
Example:
asm sideeffect buffer_load_dwordx4 $0, $1, $2, 0 =v,v,s,n
Previously it generated:
buffer_load_dwordx4 v[0:3], v0, v[8:11] (but an SGPR is expected), 0 offen
The fix adds readfirstlanes during lowering when there is a copy from a
divergent register to an SGPR.
---------
Co-authored-by: Matt Arsenault <arsenm2 at gmail.com>
loader.efi: Defer efi_translate(e_entry) until after bi_load
bi_load itself loads various things into the staging area which can
cause it to grow, which may result in the staging area moving, including
the kernel. Therefore the address we get for the kernel entry point
prior to bi_load may not be correct afterwards when we actually call it,
and so we must defer the translation.
On arm and riscv (but not arm64, which predates both of them in
loader.efi and did not gain a copy of arm's added printf when arm
support was added) we also printf this entry point to the console, which
we can no longer do since bi_load calls ExitBootServices, so remove this
printf that, in practice, seems to not be so useful, given nobody ever
felt the need to add it to arm64. If anyone really feels this is an
important printf to have then bi_load will need to be split so we can
call printf after all the loading and potential reallocation of the
staging area, but before ExitBootServices is called.
We may also want to make this code more uniform and shared between the
[12 lines not shown]
arm64/vmm: Enable 16-bit VMIDs when in use by pmap
pmap_init always uses 16-bit VMIDs when supported, but we never enable
them in VTCR_EL2 (for ASIDs, locore enables them in TCR_EL1 and
pmap_init keys off whether they've been enabled, but the order in which
pmap_init and vmmops_modinit run is reversed). As a result, although the
full 16-bit value can be stored to VTTBR_EL2 and read back, the upper 8
bits are treated as 0, and so VMIDs that our VMID allocation believes
are distinct end up aliasing.
In future this interface may change such that vmm decides on the VMID
width and tells the pmap to use that, with appropriate support for
unloading and reloading vmm, but that can come as a follow-up change, as
this is a more minimal bug fix.
Reviewed by: markj
Obtained from: CheriBSD
Fixes: 47e073941f4e ("Import the kernel parts of bhyve/arm64")
MFC after: 1 week
[3 lines not shown]
[Inliner] Use store-to-load forwarding to resolve call arguments (#190607)
Uses `FindAvailableLoadedValue` to resolve load instructions in call
arguments to constants before inline cost analysis. This gives the
inliner a more precise cost estimate and the option to inline functions
that would not be inlined otherwise.
At `-O3`, the empty `std::set` and `std::map` cases are not inlined
because node deletion is recursive: the inliner doesn't know that
`nullptr` is passed in, as it is a `load` from a member.
This addresses both `libstdc++` and `libc++`:
- `libstdc++` - `FindAvailableLoadedValue` requires `MaxInstsToScan=0`
(scan the whole basic block), because the relevant store is 7
instructions away and `DefMaxInstsToScan = 6`. Benchmarking on large
LLVM TUs showed no measurable compile-time difference between limit=6
and whole basic block
- `libc++` - uses `memset` to zero all members in the ctor; this patch
handles only `memset` to zero (the type mismatch case), which could be
generalized but seems very rare
[20 lines not shown]
[libc][docs] Fix docgen macro lookup for underscored headers (#194367)
While adding implementation status for nl_types.h, I noticed docgen
resolves it to nl-types.h instead of nl_types.h. As a result, headers
with underscores are not matched correctly and their implementation
status is not marked.
This patch fixes the handling of underscored header names in docgen so
they are processed consistently.
Add middleware support for LIO ALUA HA
Wire up the middleware side of LIO ALUA high-availability: load
lio_ha.ko with per-node addresses on service start, manage the
4-row ALUA state table (MASTER/BACKUP × synced/not-synced) across
failover events, clean up STANDBY configfs on pool export, and
add pre-flight validation that targets have static initiator ACLs
before ALUA can be enabled.
[libc] Add sys/mman syscall wrappers (#195103)
Added ErrorOr-returning syscall wrappers for mmap, munmap, mprotect, and
pkey_mprotect in src/__support/OSUtil/linux/syscall_wrappers/. Migrated
the sys/mman Linux entrypoint implementations to use them, following the
design in libc/docs/dev/syscall_wrapper_refactor.rst.
Removed the shared mprotect_common.h in favour of per-syscall wrapper
headers. Added hdr/sys_mman_macros.h proxy header.
[libc][docs][NFC] Remove dead files and consolidate check.rst (#194442)
Deleted check.rst, Helpers/Styles.rst, dev/cmake_build_rules.rst, and
dev/clang_tidy_checks.rst. Moved the |check| substitution into
rst_prolog in conf.py so it is available globally without per-file
include directives.
Removed all '.. include:: check.rst' lines from hand-written header docs
and from the docgen.py generator that emits them for auto-generated
header pages.
Merged the clang-tidy checks documentation into code_style.rst under a
new 'Static Analysis & Clang-Tidy' section, preserving the
_clang_tidy_checks label for existing cross-references.
Updated code examples in both libc docs and the upstream clang-tidy
check docs to replace the stale LLVM_LIBC_ENTRYPOINT macro with the
current LLVM_LIBC_FUNCTION macro.
Updated dev/index.rst to drop the two deleted toctree entries.
[WebAssembly][GlobalISel] Implement basic floating point instructions (#194798)
Adds `RegBankSelect` support for floats, and adds legalization of most
basic FP instructions.
Split from #157161