[AMDGPU] Codegen for min/max instructions for gfx1170 (#185625)
gfx1170 does not have s_minimum/maximum_f16/f32 instructions, so a new
feature `SALUMinimumMaximumInsts` is added for gfx12+ subtargets.
[HLSL] Implement Texture2D::Load methods and builtin (#185708)
Implements the Texture2D::Load methods. A new HLSL builtin is added to
implement the method. The HLSL builtin is lowered to the
resource_load_level intrinsic.
We chose to have a single operand hold both the coordinate and the
level in the builtin, as is done in the Load method itself. This was to
make the external sema source easier. It is easier to split the vector
during codegen than in sema.
Assisted-by: Gemini
[mlir][LLVM] Add support for `ptrtoaddr` (#185104)
The `ptrtoaddr` op is akin to `ptrtoint` with some important
differences:
* It does not capture the provenance of the pointer, meaning the pointer
does not escape and a subsequent `inttoptr` does not yield a legal pointer.
LLVM can then assume the pointer never escaped, which helps alias
analysis.
* It does not support arbitrary integer types, but only exactly the
integer type that is equal in width to the pointer type as specified by
the data layout.
This PR adds the op to the MLIR LLVM dialect and adds the corresponding
verification for the datalayout property.
[BOLT][AArch64] Support block reordering beyond 1KB for FEAT_CMPBR. (#185443)
Currently LongJmpPass::relaxLocalBranches bails out early if the estimated
size of a binary function is less than 32KB, assuming that the shortest
branches are 16 bits. FEAT_CMPBR branches are only 11 bits, so the fixup
value for the cold branch target may go out of range once the function is
larger than 1KB.
Decrease ShortestJumpSpan from 32KB to 1KB to account for this.
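The 32KB and 1KB figures follow from the branch offset widths. A minimal
sketch of that arithmetic, assuming the quoted widths are signed byte
offsets; `branch_span_bytes` is a hypothetical helper, not a BOLT function:

```c
#include <stdint.h>

/* Hypothetical helper, not part of BOLT.
 * Reach in bytes, in one direction, of a branch whose signed byte offset
 * occupies offset_bits bits: one bit is the sign, the rest is magnitude. */
static uint64_t branch_span_bytes(unsigned offset_bits) {
    return (uint64_t)1 << (offset_bits - 1);
}
```

Under that assumption, branch_span_bytes(16) is 32768 (32KB) and
branch_span_bytes(11) is 1024 (1KB), which matches the old and new
ShortestJumpSpan values.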
[libc] Fix hdrgen test test_small_proxy.h (#185890)
The expected output was outdated as it did not contain the macro
definitions.
This patch fixes the issue.
libclc: Improve fdim handling (#186085)
The maxnum is somewhat overconstraining. This gives slightly
better codegen and avoids the noise from the select and convert,
and saves the cost of materializing the nan literal.
libclc: Replace nextafter implementation (#186082)
Use a more straightforward version which allows
optimizations to delete the edge case checks, and also
codegens better. Implement in terms of new nextup and nextdown
helper functions, which are IEEE functions, and usable in other
functions.
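A minimal sketch of the decomposition for binary32, written in plain C rather
than libclc's OpenCL C. The sketch_* and bit-cast helper names are
hypothetical and this is not the libclc code; it only illustrates how
nextafter reduces to the IEEE 754 nextUp and nextDown operations:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bit-cast helpers. */
static uint32_t float_to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_to_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* IEEE 754 nextUp: the least value that compares greater than x. */
static float sketch_nextup(float x) {
    if (isnan(x) || x == INFINITY)
        return x;                          /* NaN and +inf map to themselves */
    if (x == 0.0f)
        return bits_to_float(0x00000001u); /* +0 and -0: smallest subnormal */
    uint32_t u = float_to_bits(x);
    /* Positive values grow with their bit pattern, negative values shrink. */
    u = (x > 0.0f) ? u + 1 : u - 1;
    return bits_to_float(u);
}

/* IEEE 754 nextDown, defined by symmetry. */
static float sketch_nextdown(float x) { return -sketch_nextup(-x); }

static float sketch_nextafter(float x, float y) {
    if (isnan(x) || isnan(y))
        return x + y;                      /* propagate NaN */
    if (x == y)
        return y;
    return (x < y) ? sketch_nextup(x) : sketch_nextdown(x);
}
```

With this shape the only edge cases live in sketch_nextup, so a caller that
already knows its input is finite and nonzero leaves the optimizer free to
drop them.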
libclc: Replace fmod implementation with elementwise builtin (#186083)
This corresponds to frem, which for whatever reason is a first
class IR instruction. The backend has a heroic freestanding
implementation that should be nearly identical to what was here.
py-scikit-build: updated to 0.19.0
0.19.0
This release updates for changes in setuptools and CMake 4, and drops Python 3.7.
Features
* Drop Python 3.7 in :pr:`1134`
Bug fixes
* Update for newer setuptools in :pr:`1120`
* ``setuptools_wrap.py``: parse ``CMAKE_ARGS`` with ``shlex.split`` like elsewhere by :user:`haampie` in :pr:`1126`
* Drop ``dry-run`` (removed in setuptools) in :pr:`1166`
* Ensure generic f2py executable is looked up first by :user:`smiet` in :pr:`1111`
Testing
[6 lines not shown]
[Clang][AArch64] Remove duplicate CodeGen test for bf16 get/set intrinsics
The following test files contain identical test bodies (aside from the
RUN lines):
* clang/test/CodeGen/AArch64/bf16-getset-intrinsics.c
* clang/test/CodeGen/arm-bf16-getset-intrinsics.c
The differences in the RUN lines do not appear to be relevant for the
tested functionality. This change keeps a single test file and
simplifies its RUN lines to match the generic style used in
clang/test/CodeGen/AArch64/neon.
This also moves toward unifying and reusing RUN lines across tests.
compiler-rt/arm: Check for overflow when adding float denorms (#185245)
When the sum of two subnormal values is not itself subnormal, we need to
set the exponent to one.
Test case:
#include <stdio.h>

static volatile float x = 0x1.362b4p-127;
static volatile float x2 = 0x1.362b4p-127 * 2;

int
main (void)
{
    /* x is subnormal; x + x is the smallest case that becomes normal. */
    printf("x %a x2 %a x + x %a\n", x, x2, x + x);
    return x2 == x + x ? 0 : 1;
}
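A minimal sketch of why the exponent field ends up as one, assuming positive
IEEE 754 binary32 inputs. The helper names are hypothetical and this is not
the compiler-rt assembly, only the bit-level arithmetic behind the fix:

```c
#include <stdint.h>

/* Hypothetical helpers, not part of compiler-rt. */

/* Exponent field (bits 30..23) of a binary32 bit pattern. */
static uint32_t exponent_field(uint32_t bits) {
    return (bits >> 23) & 0xFFu;
}

/* Add two positive subnormals given as bit patterns. Both have an exponent
 * field of 0, so each value is mantissa * 2^-149 and the sum is plain
 * integer addition. A carry out of the 23-bit mantissa lands in bit 23,
 * which is an exponent field of 1, the smallest normal exponent. */
static uint32_t add_positive_subnormal_bits(uint32_t a, uint32_t b) {
    return a + b;
}
```

The value x in the test case has the bit pattern 0x004D8AD0. Adding it to
itself gives 0x009B15A0, whose exponent field is 1.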
Signed-off-by: Keith Packard <keithp at keithp.com>
Revert "[SDAG] (abs (add nsw a, -b)) -> (abds a, b)" (#175801) (#186068)
Reverts llvm/llvm-project#175801 while the miscompilation reported in #185467 is being investigated.
Add support for BCM575xx devices, variously known as Thor or P5.
There are a few significant differences to earlier devices.
The nic now requires some host memory to use as backing store for its queues,
and for now we're overallocating to some extent. It's not a noticeable amount
of memory for a system with one of these nics in it, so this isn't a huge
concern.
P5 devices have notification queues to act as an indirection between tx/rx
completion rings and msi-x vectors. We set up one per queue and statically
map them to msi-x vectors in turn according to the intrmap.
The doorbell structures are now 64 bits, and are all written through the same
memory address.
Ring groups are not used, so the functions to allocate and free ring groups
don't do anything for P5 devices; instead, rings are directly associated
with each other on creation, and aggregation rings are identified by a
different ring type.
[3 lines not shown]