[X86][NewPM] Mark X86AsmPrinter isRequired (#188278)
Otherwise the pass does not run when a function has the optnone
attribute, which means we get no assembly out for functions marked
optnone.
[CIR][NFC] Mark invalid-linkage.cir as XFAIL (#188279)
The invalid-linkage.cir test is currently failing as a result of a
recent change to the MLIR attribute parser. I am temporarily marking
this test as XFAIL while that problem is being worked on to unblock CIR
development. I added a check that will force the test to fail even after
the problem is fixed so that we don't start getting unexpected passes
when the fix is merged. (CIR testing isn't run during CI for MLIR
changes.) I will reenable the test after the problem has been fixed.
NAS-140394 / 26.0.0-BETA.2 / use new pylibsed library (by yocalebo) (#18548)
Integrate and use the new and much more efficient SED library that I
wrote. This is a C Python module that uses ioctl to interact with the
disks, so it removes all fork+exec possibilities when dealing with SED
drives.
Do note that these changes could be much more polished, but this needs to
be backported to BETA.2, so we'll start with this approach, and a
subsequent PR could be made to 27 to make it even more efficient.
Build and API run
[here](http://jenkins.eng.ixsystems.net:8080/job/master/job/custom/2187)
Original PR: https://github.com/truenas/middleware/pull/18547
Co-authored-by: caleb <yocalebo at gmail.com>
NAS-140394 / 27.0.0-BETA.1 / use new pylibsed library (#18547)
Integrate and use the new and much more efficient SED library that I
wrote. This is a C Python module that uses ioctl to interact with the
disks, so it removes all fork+exec possibilities when dealing with SED
drives.
Do note that these changes could be much more polished, but this needs to
be backported to BETA.2, so we'll start with this approach, and a
subsequent PR could be made to 27 to make it even more efficient.
Build and API run
[here](http://jenkins.eng.ixsystems.net:8080/job/master/job/custom/2187)
[AMDGPU][Uniformity][TTI] Make Uniformity Analysis Operand-Aware via Custom Uniformity Checks (#137639)
See: https://github.com/llvm/llvm-project/issues/131779
Extends uniformity analysis to support instructions whose uniformity
depends on which specific operands are uniform. Introduces
`InstructionUniformity::Custom` and a target hook `TTI::isUniform(I,
UniformArgs)` that allows targets to define custom uniformity rules.
During propagation, custom candidates are checked via the target hook.
If we can prove they are uniform, we skip marking them divergent and let
iterative propagation re-evaluate as operands change.
Implements AMDGPU's `llvm.amdgcn.wave.shuffle` rules (uniform when
either operand is uniform, divergent only when both are divergent) as
the motivating example.
This inverted-logic approach is critical for correctness: proving
uniformity early during propagation would be unsafe, as operands can
transition from uniform to divergent during divergence propagation.
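As a rough illustration of the wave-shuffle rule described above, here is a minimal standalone sketch. It does not use LLVM's types: the real hook is `TTI::isUniform(I, UniformArgs)`, whose parameter types are not given here, so `isWaveShuffleUniform`, `shouldMarkDivergent`, and the two-element `std::array<bool, 2>` are hypothetical stand-ins chosen for this example.

```cpp
#include <array>

// Hypothetical stand-in for the target hook's answer for a two-operand
// wave shuffle. UniformArgs[i] is true when operand i is currently known
// to be uniform. Per the rule above, the result is uniform when either
// operand is uniform.
bool isWaveShuffleUniform(const std::array<bool, 2> &UniformArgs) {
  return UniformArgs[0] || UniformArgs[1];
}

// Inverted-logic view used during propagation (assumed shape): the
// candidate is marked divergent only once uniformity can no longer be
// proven, which for this rule means both operands are divergent.
bool shouldMarkDivergent(const std::array<bool, 2> &UniformArgs) {
  return !isWaveShuffleUniform(UniformArgs);
}
```

Because operands only move from uniform to divergent, re-running `shouldMarkDivergent` each time an operand changes can flip the answer from false to true but never back, which is why the check is deferred rather than settled early.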
[3 lines not shown]
AMDGPU: Codegen for v_dual_dot2acc_f32_f16/bf16 from VOP3
For V_DOT2_F32_F16 and V_DOT2_F32_BF16, add their VOPDName and mark
them with usesCustomInserter, which will be used to add pre-RA register
allocation hints to preferably assign dst and src2 to the same physical
register. When the hint is satisfied, canMapVOP3PToVOPD recognises the
instruction as eligible for VOPD pairing by checking whether it is
VOP2-like: dst==src2, no source modifiers, no clamp, and src1 is a register.
Mark both instructions as commutable to allow a literal in src1 to be
moved to src0, since VOPD only permits a literal in src0.
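The VOP2-like conditions listed above can be sketched as a single predicate. This is only an illustration under stated assumptions: `Dot2Operands` and `looksLikeVOP2` are hypothetical names, and the struct is a simplified summary of an instruction rather than anything taken from the AMDGPU backend or from canMapVOP3PToVOPD itself.

```cpp
// Hypothetical, simplified summary of a V_DOT2 instruction after register
// allocation. Field names are assumptions made for this sketch.
struct Dot2Operands {
  unsigned DstReg;      // physical register assigned to dst
  unsigned Src2Reg;     // physical register assigned to src2
  bool Src1IsReg;       // true when src1 is a register operand
  bool HasSrcModifiers; // true when any source modifier is set
  bool HasClamp;        // true when the clamp bit is set
};

// Returns true when all four conditions from the description hold:
// dst==src2, no source modifiers, no clamp, and src1 is a register.
bool looksLikeVOP2(const Dot2Operands &Op) {
  return Op.DstReg == Op.Src2Reg && !Op.HasSrcModifiers && !Op.HasClamp &&
         Op.Src1IsReg;
}
```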
[Support] Use atomic counter in parallelFor instead of per-task spawning (#187989)
This function is primarily used by lld and debug info tools.
Instead of pre-splitting work into up to MaxTasksPerGroup (1024) tasks
and spawning each through the Executor's mutex+condvar, use an atomic
counter for work distribution. Only ThreadCount workers are spawned;
each grabs the next chunk via atomic fetch_add.
This reduces futex calls from ~31K (glibc, release+assertions build) to
~1.4K when linking clang-14 (191MB PIE with --export-dynamic) with
`ld.lld --threads=8`. Previously, each parallelFor spawned up to 1024
tasks, each requiring a mutex lock + condvar signal.
```
                            Wall   System  futex
glibc (assertions) before:  927ms  897ms   31K
glibc (assertions) after:   879ms  765ms   1.4K
mimalloc before:            872ms  694ms   25K
mimalloc after:             830ms  661ms   1K
```
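The work-distribution idea above can be sketched in standalone C++. This is not LLVM's implementation: `atomicParallelFor` is a hypothetical helper, it uses plain `std::thread` instead of the Executor, and the `ChunkSize` parameter is an assumption made for the example.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs Fn(I) for every I in [Begin, End) using
// ThreadCount workers that claim chunks from one shared atomic counter.
void atomicParallelFor(std::size_t Begin, std::size_t End,
                       unsigned ThreadCount, std::size_t ChunkSize,
                       const std::function<void(std::size_t)> &Fn) {
  if (Begin >= End)
    return;
  ThreadCount = std::max(1u, ThreadCount);
  ChunkSize = std::max<std::size_t>(1, ChunkSize);

  std::atomic<std::size_t> Next{Begin};
  auto Worker = [&] {
    for (;;) {
      // fetch_add returns the previous value, so every chunk start is
      // handed to exactly one worker without taking a lock.
      std::size_t Start = Next.fetch_add(ChunkSize, std::memory_order_relaxed);
      if (Start >= End)
        return;
      std::size_t Stop = std::min(End, Start + ChunkSize);
      for (std::size_t I = Start; I < Stop; ++I)
        Fn(I);
    }
  };

  // Spawn only ThreadCount workers, independent of how many chunks exist.
  std::vector<std::thread> Workers;
  Workers.reserve(ThreadCount);
  for (unsigned T = 0; T < ThreadCount; ++T)
    Workers.emplace_back(Worker);
  for (std::thread &W : Workers)
    W.join();
}
```

The number of synchronization points in this sketch scales with the number of workers rather than the number of chunks, which is the property the futex numbers above reflect.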
[compiler-rt] Support unit tests for the GPU build (#187895)
Summary:
This PR enables the basic unit tests for builtins to be run on the GPU
architectures. Other targets like profiling are supported, but their
host-device nature will make them more difficult to adequately unit
test. It may be possible to do basic tests there, simply to verify that
counters are present and in the proper format for when they are copied
to the host.
[clang-tidy] Do not provide diagnostics for cert-dcl58-cpp on implicit declarations (#188152)
Do not provide cert-dcl58-cpp diagnostics for compiler-generated
intrinsics, as they would be false positives.
In the provided tests, the compiler generates align_val_t, which ends up
inside the std namespace, resulting in the symbol std::align_val_t. This
symbol is compiler-generated and has no location, causing a compiler crash.
There is also no point in notifying users about violations they have no
control over.
Resolution: Diagnostics suppressed.
Co-authored-by: Vladislav Aranov <vladislav.aranov at ericsson.com>