[UniformityAnalysis] Skip CycleAnalysis on targets without branch divergence (#189948)
UniformityAnalysis unconditionally computes CycleAnalysis even on
targets that don't care about divergence, causing measurable
compile-time overhead (see [#99878
(comment)](https://github.com/llvm/llvm-project/pull/175167#issuecomment-4156230947)).
---------
Co-authored-by: padivedi <padivedi at amd.com>
[orc-rt] Remove Session::waitForShutdown. (#191124)
The existing implementation triggered Session shutdown and then blocked
on a std::future that would be unblocked by an on-shutdown callback that
waitForShutdown had installed. Since there is no guarantee that this
callback would be the last one run, the result was that waitForShutdown
only guaranteed that it would not return until the shutdown sequence had
started (rather than completed).
This could have been fixed, but the Session destructor is already
supposed to block until the Session can be safely destroyed, so a
"working" waitForShutdown would be effectively redundant. Since it was
also a potential footgun (calling it from an on-detach or on-shutdown
callback could deadlock) it was safer to just remove it entirely.
Some Session unit tests do rely on testing properties of the Session
after the shutdown sequence has started, so a new utility has been added
to SessionTests.cpp to support this.
[AMDGPUUsage] Specify what one-as syncscopes do
This matches the currently implemented and (as far as I could determine)
intended semantics of these syncscopes.
The sync scope table is unchanged except for removing its indentation;
otherwise it would be rendered as part of the preceding note.
[LangRef] Specify that syncscopes can affect the monotonic modification order
If a target specifies that atomics with mismatching syncscopes appear
non-atomic to each other, there is no point in requiring them to be ordered in
the monotonic modification order. Notably, the [AMDGPU target user
guide](https://llvm.org/docs/AMDGPUUsage.html#memory-scopes) has specified
syncscopes to relax the modification order for years.
So far, I haven't found an example where this less constrained ordering would
be observable (at least with the AMDGPU inclusive scope rules). Whenever a load
would be able to see two monotonic stores with non-inclusive scope, that's
considered a data race (i.e., the load would return `undef`), so it cannot be
used to observe the order of the stores.
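To make the unobservability argument concrete, here is a minimal sketch, assuming AMDGPU's inclusive scope rules (the scope names and thread placement are chosen purely for illustration):
```
; workgroup 0:
store atomic i32 1, ptr %p syncscope("workgroup") monotonic, align 4
; workgroup 1:
store atomic i32 2, ptr %p syncscope("workgroup") monotonic, align 4
; workgroup 2:
%v = load atomic i32, ptr %p syncscope("agent") monotonic, align 4
```
Neither workgroup-scoped store is inclusive of the load in workgroup 2, so the load races with both stores and returns `undef`; it therefore cannot be used to observe which store came first in the modification order.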
[LangRef][AMDGPU] Specify that syncscope can cause atomic operations to race
Targets should be able to specify that the syncscope of atomic operations
influences whether they participate in data races with each other.
For example, on AMDGPU we want (and already implement) the load in the
following case to participate in a data race (i.e., to return `undef` under
the current definition), because there is an atomic store with workgroup
syncscope executing in a different workgroup:
```
; workgroup 0:
store atomic i32 1, ptr %p syncscope("workgroup") monotonic, align 4
; workgroup 1:
store atomic i32 2, ptr %p syncscope("workgroup") monotonic, align 4
load atomic i32, ptr %p syncscope("workgroup") monotonic, align 4
```
[3 lines not shown]
[IR] Add llvm.masked.{udiv,sdiv,urem,srem} intrinsics (#189705)
Because division by zero is undefined behaviour, when the loop
vectorizer encounters a div that's not unconditionally executed, it needs
to replace its divisor with a non-zero value on any lane that wouldn't
have been executed in the scalar loop:
```
%safedivisor = select <vscale x 2 x i1> %mask, <vscale x 2 x i64> %divisor, <vscale x 2 x i64> splat (i64 1)
%res = udiv <vscale x 2 x i64> %dividend, %safedivisor
```
https://godbolt.org/z/jczc3ovbr
We need this for architectures like x86 where division by zero (or
overflow for sdiv/srem) can trap. But on AArch64 and RISC-V division
doesn't trap so we don't actually need to mask off any divisors. Not
only that, but there are also dedicated vector division instructions
that can be predicated.
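With dedicated masked intrinsics, the select-based expansion above could be expressed directly. A hedged sketch of what the masked form might look like (the exact mangled name, operand order, and semantics here are assumptions for illustration, not taken from the patch):
```
; Hypothetical masked form; signature is an assumption:
%res = call <vscale x 2 x i64> @llvm.masked.udiv.nxv2i64(<vscale x 2 x i64> %dividend, <vscale x 2 x i64> %divisor, <vscale x 2 x i1> %mask)
```
On a target with predicated vector division, such a call could lower directly to the predicated instruction instead of materialising a safe divisor with a select.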
Originally we tried to optimize this on RISC-V by transforming `udiv x,
[11 lines not shown]
[Passes] Enable vectorizers at Oz (#190182)
The way this is handled right now is very inconsistent. When using
`-passes="default<Oz>"` (the code modified here), both vectorizers were
disabled. The clang frontend enables SLP at Oz but not LoopVectorize.
All the LTO backends enable both vectorizers at Oz.
I'm proposing here that `default<Oz>` should enable both vectorizers by
default. There seems to be a consensus that this is the right thing to
do for SLP (as both Clang and LTO backends do this). It's a bit less
clear for LoopVectorize, but given that the implementation already has
special handling for minsize functions (such as switching to code-size
cost modelling and disabling various size-increasing transforms), I'm
inclined to think we should also enable it at minsize.
This is part of trying to make optsize/minsize purely attribute based
and independent of the pipeline optimization level.
[LangRef] Allow monotonic & seq_cst accesses to inter-operate with other accesses
Currently, the LangRef says that atomic operations (which includes `unordered`
operations, which don't participate in the monotonic modification order) must
read a value from the modification order of monotonic operations.
In the following example, this means that the load does not have a store it
could read from, because all stores it may see do not participate in the
monotonic modification order:
```
; thread 0:
store atomic i32 1, ptr %p unordered, align 4
; thread 1:
store atomic i32 2, ptr %p unordered, align 4
load atomic i32, ptr %p unordered, align 4
```
[18 lines not shown]
[CodeGen] Preserve big-endian trunc in concat_vectors (#190701)
A transform from `concat_vectors(trunc(scalar), undef)` to
`scalar_to_vector(scalar)` is only equivalent for little-endian targets.
On big-endian, that would put the extra upper bytes ahead of the desired
truncated bytes. This problem was seen on Rust s390x in [RHEL-147748].
[RHEL-147748]: https://redhat.atlassian.net/browse/RHEL-147748
Assisted-by: Claude Code
(cherry picked from commit 5df89ae3da8b24804c17479ce74a930783db045e)
[LLD] [COFF] Fix crashes for conflicting exports with -export-all-symbols (#190492)
Commit adcdc9cc3740adba3577b328fa3ba492cbccd3a5 (since LLD 17) added a
warning message if there are conflicting attempts to export a specific
symbol.
That commit missed one source of exports, from the LLD specific
-export-all-symbols flag (which only has an effect in mingw mode).
To trigger this case, one needs an export set by a def file, combined
with the -export-all-symbols flag (which attempts to export all global
symbols, even when there are explicit exports through embedded
directives or a def file).
To trigger the warning (and the previous crash), there must be some
difference between the export produced by -export-all-symbols and the
one from the def file. That difference could be, e.g., that the def
file contained an explicit ordinal, or that the def file lacked a DATA
marking for a symbol that the automatic export of all symbols decides to
[7 lines not shown]
[MC] Track per-section inner relaxation iterations and add convergence test (#191121)
Count inner iterations (max across sections) instead of outer relaxOnce
calls. This more accurately reflects the work done during relaxation.
Add a test that verifies boundary alignment convergence may require
O(N) iterations where N is the number of BoundaryAlign fragments.
This will be fixed by #190318
[AArch64] Avoid expensive getStrictlyReservedRegs calls in isAnyArgRegReserved (#190957)
`AArch64RegisterInfo::isAnyArgRegReserved` is used during call lowering
across all instruction selectors (SDAG, GISel, FastISel) to emit an
error if any of the arg registers (x0-x7) are reserved. This puts
`AArch64RegisterInfo::getStrictlyReservedRegs` which computes this in
the hot-path and it shows up in compile-time profiles since it's
computed for every call.
As the intent was to guard against using +reserve-x{0-7} with function
calls we can instead call `isXRegisterReserved` which is faster since
it's a simple BitVector lookup.
Compile-time improves across all instruction selectors on CTMark:
geomean:
- SDAG: ~ -0.14%
- GISel: ~ -0.6%
- FastISel: ~ -0.7% (measured locally)
[12 lines not shown]
[test] Add MC relaxation stretch tests (#191118)
Verify:
- ARM tLDRpci instructions don't spuriously widen to t2LDRpci when
upstream branches relax, which would push cbz targets out of range.
This would catch the #184544 regression.
- CSKY lrw16 instructions don't spuriously widen to lrw32 when
upstream branches relax. Similar to ARM.
[orc-rt] Simplify notification service construction in Session. NFC. (#191113)
We can replace the addNotificationService method with a call to the
generic createService method that was introduced in 98ccac607a9ff.
[Clang] [MinGW] Handle `-nolibc` argument (#182062)
This implementation differs from GCC's, but is arguably more in line
with Unix systems, because it stops linking the default Win32 system
libraries.
On GCC it works like this:
```
❯ /ucrt64/bin/gcc -### /dev/null -nolibc 2>&1 | tr ' ' '\n' | rg '^\-l' | sort -u
-lgcc
-lgcc_eh
-lkernel32
-lmingw32
-lmingwex
-lmsvcrt
❯ /ucrt64/bin/gcc -### /dev/null 2>&1 | tr ' ' '\n' | rg '^\-l' | sort -u
-ladvapi32
-lgcc
[21 lines not shown]