Reland [C++20] [Modules] Don't profiling the callee of CXXFoldExpr (#190732) (#195983)
Close https://github.com/llvm/llvm-project/issues/190333
For the test case, the root cause of the problem is that the compiler
thought the declaration of `operator &&` in consumer.cpp might change the
meaning of '&&' in the requires clause of `F::operator()`. That does not
make sense. Here we skip profiling the callee to solve the problem. Note
that we already record the kind of the operator, so '&&' and '||'
won't be confused.
---
See the discussion in https://github.com/llvm/llvm-project/pull/194283
For the newly found pattern where other binary operators (such as
operator +) may appear in the requires clause, e.g.,
```C++
[5 lines not shown]
[Instrumentor] Add a global function regexp to limit the instrumentation
Only functions that match the "function_regex", or that have the
instrumentation attribute, will be instrumented.
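A minimal sketch of the filtering rule, assuming the regex must match the entire function name (`std::regex_match`). Whether the Instrumentor uses a full or a partial match, and the helper name `shouldInstrument`, are assumptions for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper mirroring the rule described above. Assumption: the
// regex must match the whole name (std::regex_match, not std::regex_search).
bool shouldInstrument(const std::string &FunctionName,
                      bool HasInstrumentationAttr,
                      const std::regex &FunctionRegex) {
  // The attribute opts a function in regardless of the regex.
  return HasInstrumentationAttr ||
         std::regex_match(FunctionName, FunctionRegex);
}
```

With a regex of `foo.*`, `foo_bar` is instrumented, `baz` is not, and `baz` carrying the attribute is.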
[Instrumentor] Add unreachable support; unreachable stack trace printing
Allow instrumenting `unreachable` and provide stack trace printing as a
use case.
[LoongArch] Custom lowering for LSX vector sign extensions (#194325)
Custom-lower LSX sign extensions to combinations of `SLTI` + `VILVL` + `VILVH`
when possible.
For example, we can lower a vector sext to the following instructions:
```
%B = sext <4 x i16> %A to <4 x i32>
vslti.h v2, v1, 0
vilvl.h v1, v2, v1
%B = sext <4 x i32> %A to <4 x i64>
vslti.w v3, v1, 0
vilvh.w v2, v3, v1
vilvl.w v1, v3, v1
```
When these combinations are worse than converting the sext to a shuffle,
we simply use the shuffle instead.
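To see why `vslti` + `vilvl` yields a sign extension, here is a scalar C++17 model of the first sequence (`sext <4 x i16>` to `<4 x i32>`). It assumes our reading of the LSX semantics: `vslti.h` with immediate 0 produces an all-ones lane for negative inputs, and `vilvl.h vd, vj, vk` places the low `vk` lanes at even positions and the low `vj` lanes at odd positions. It also assumes little-endian lane layout. All type and function names are hypothetical.

```cpp
#include <array>
#include <cstdint>

// One 128-bit LSX register viewed as eight i16 lanes (hypothetical model).
using V8i16 = std::array<int16_t, 8>;

// vslti.h vd, vj, 0: lane is all-ones (-1) if vj[i] < 0, else 0.
constexpr V8i16 vslti_h_zero(const V8i16 &vj) {
  V8i16 vd{};
  for (int i = 0; i < 8; ++i)
    vd[i] = vj[i] < 0 ? int16_t(-1) : int16_t(0);
  return vd;
}

// vilvl.h vd, vj, vk: interleave low halves; even lanes from vk, odd from vj.
constexpr V8i16 vilvl_h(const V8i16 &vj, const V8i16 &vk) {
  V8i16 vd{};
  for (int i = 0; i < 4; ++i) {
    vd[2 * i] = vk[i];
    vd[2 * i + 1] = vj[i];
  }
  return vd;
}

// Reads 32-bit lane i little-endian: low half is lane 2i, high half 2i+1.
constexpr int32_t lane_i32(const V8i16 &v, int i) {
  return int32_t(v[2 * i + 1]) * 65536 + int32_t(uint16_t(v[2 * i]));
}

// sext <4 x i16> -> <4 x i32>:  vslti.h v2, v1, 0 ; vilvl.h v1, v2, v1
constexpr std::array<int32_t, 4> sext_v4i16(const V8i16 &v1) {
  const V8i16 v2 = vslti_h_zero(v1);
  const V8i16 r = vilvl_h(v2, v1);
  std::array<int32_t, 4> out{};
  for (int i = 0; i < 4; ++i)
    out[i] = lane_i32(r, i);
  return out;
}
```

The interleave pairs each source lane with its sign mask, so every 32-bit lane reads back as the sign-extended value. The `<4 x i32>` to `<4 x i64>` case works the same way, but the 256-bit result spans two registers, hence both `vilvl.w` and `vilvh.w`.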
[flang][cuda][openacc] Reject UseDevice actual against managed/unified dummy (#196428)
After #195182 introduced the `UseDevice` attribute, a `use_device(...)`
actual was treated as compatible with **any** dummy attribute. Combined
with the matching distance returning ∞ for `UseDevice →
managed/unified`, this caused generic resolution to misreport a clean
"no match" as an **ambiguity** when only managed/unified specifics
existed.
This PR tightens `AreCompatibleCUDADataAttrs`: a `UseDevice` actual is
only compatible with a `Device` dummy or a host (no-attribute) dummy.
Other attributes (`Managed`, `Unified`, `Pinned`, ...) require their
actual to live in that specific kind of memory.
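A simplified sketch of the tightened rule. It uses a hypothetical `DataAttr` enum rather than flang's real types or the actual `AreCompatibleCUDADataAttrs` signature. Only the `UseDevice` branch reflects the change described here; reducing every other pair to an exact match is an assumption for brevity and is not flang's full compatibility logic.

```cpp
// Hypothetical attribute set; not flang's actual enumeration.
enum class DataAttr { None, Device, Managed, Unified, Pinned, UseDevice };

constexpr bool isCompatibleSketch(DataAttr dummy, DataAttr actual) {
  // The tightened rule: a use_device actual only matches a device dummy
  // or a host (no-attribute) dummy.
  if (actual == DataAttr::UseDevice)
    return dummy == DataAttr::Device || dummy == DataAttr::None;
  // Simplification (assumption): everything else must match exactly.
  return dummy == actual;
}
```

Under this rule, a call whose only specifics take managed or unified dummies simply finds no match for a `use_device` actual, instead of reporting an ambiguity.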
[msan] Handle fpto[us]i_sat (#196429)
This adds explicit handling for fpto[us]i_sat, similar to how the
non-saturating versions are handled.
N.B. PR #191365 lowered NEON fcvtz[us] intrinsics into fpto[us]i.sat.
There is a slight inconsistency in MSan insofar as fcvtz[us] were
handled by handleNEONVectorConvertIntrinsic(), which takes an
all-or-nothing propagation approach to the shadows (i.e., even a single
uninitialized bit will result in the corresponding integer being fully
uninitialized), while fpto[us]i were handled by propagating the shadow
unchanged. For now, we choose to have fpto[us]i_sat follow the laxer
behavior of fpto[us]i. Future work may consider changing the behavior of
fpto[us]i and fpto[us]i_sat to use the all-or-nothing approach.
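A toy model of the two shadow-propagation policies discussed above. It assumes same-width conversions (e.g., f32 to i32), so each result lane's shadow is one 32-bit word in which set bits mean uninitialized. The names are hypothetical and this is not MSan's code.

```cpp
#include <array>
#include <cstdint>

// Shadow for a 4-lane vector; a set bit marks an uninitialized bit.
using Shadow4 = std::array<uint32_t, 4>;

// Laxer policy (fpto[us]i, and now fpto[us]i_sat): shadow passes through.
constexpr Shadow4 propagatePassThrough(const Shadow4 &S) { return S; }

// All-or-nothing policy (handleNEONVectorConvertIntrinsic): any poisoned
// bit in a lane poisons the whole result lane.
constexpr Shadow4 propagateAllOrNothing(const Shadow4 &S) {
  Shadow4 R{};
  for (int i = 0; i < 4; ++i)
    R[i] = S[i] != 0 ? 0xFFFFFFFFu : 0u;
  return R;
}
```

A lane with a single uninitialized bit (`0x1`) stays `0x1` under the pass-through policy but becomes `0xFFFFFFFF` under the all-or-nothing policy.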
[mlir][reducer] Change mlir-reducer apply pattern logic (#195997)
This PR aligns the pattern application logic with the operation deletion
strategy; it indirectly achieves the separation of operation deletion
and pattern application. It also fixes a bug where trivially dead ops
within `opsInRange` were being incorrectly deleted when applying patterns.
While `opsNotInRange` grows from zero (via binary search), `opsInRange`
shrinks from the entire module down to zero. This fixes a crash caused by
patterns initially being applied to the whole module: if the module in
the current iteration is 'uninteresting', it gets erased. Consequently,
when the iterator increments, it fails to clone the parent iteration's
module, leading to a crash.
[Instrumentor] Add Alloca and Function support; stack usage example
This adds support for alloca instrumentation and function pre/post
instrumentation. Alloca support follows load/store support directly.
Functions require special care to determine the insertion points.
Together, these let us showcase how the stack high-water mark can be
profiled; see InstrumentorStackUsage.cpp.
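A sketch of how runtime hooks could compute the stack high-water mark from function-entry and alloca callbacks. It assumes a downward-growing stack and that each hook receives an address. The class and hook names are hypothetical; they are not the callbacks used in InstrumentorStackUsage.cpp.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical runtime state for a downward-growing stack. The hook names are
// illustrative only; they are not the Instrumentor's callback names.
class StackHighWaterMark {
public:
  // Function-entry hook; the first call records the base of the stack.
  void onFunctionEntry(std::uintptr_t StackAddr) {
    if (Base == 0)
      Base = StackAddr;
    record(StackAddr);
  }

  // Alloca hook; Addr is the lowest address of the new stack object.
  void onAlloca(std::uintptr_t Addr) { record(Addr); }

  // Largest distance observed below the base, in bytes.
  std::uintptr_t highWaterMark() const { return MaxUsage; }

private:
  void record(std::uintptr_t Addr) {
    if (Base != 0 && Addr < Base)
      MaxUsage = std::max(MaxUsage, Base - Addr);
  }

  std::uintptr_t Base = 0; // 0 means "not yet initialized".
  std::uintptr_t MaxUsage = 0;
};
```

Because it only sees addresses reported by the hooks, this approximates the true peak; stack space that never shows up as an alloca or an entry address (e.g., spill slots) is not counted.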
[clang-format] Align stuff containing multi-line comment (#195398)
Fixes #194717.
Previously, the information about the comment's scope could get lost,
and the program would then fail to align the code containing it.
new
```C++
foo fooNode(ConvertStdStringToUString(fieldNames[chIdx]),
// asdf
// foo1 foo2 foo12345
SomeFunctionAB(a123456789012345));
const size_t v1234567890123456789012345678901234;
```
old
[6 lines not shown]
[MLIR][XeGPU] Fix layout inference issues blocking MXFP_GEMM test (#196243)
This branch fixes layout inference issues in XeGPU passes that were
blocking MXFP (microscaled floating point) GEMM workloads:
- Fix bitcast layout adjustment to use the result shape instead of the source
shape. The setupBitCastResultLayout function was incorrectly bounding
the layout adjustment loop by the source shape. Added tests.
- Fix blocking pass to drop inst_data from anchor operations. Operations
whose shape already matches inst_data don't get unrolled, so their
layout attributes retained stale inst_data that broke downstream passes.
Now inst_data is unconditionally stripped from all op attributes after
blocking.
- Propagate layout to both results of vector.deinterleave. The layout
recovery pass was only setting the layout on result 0, leaving result 1
without a layout.
Test plan
[9 lines not shown]
[NFC][AMDGPU] Use a worklist and remember results in AMDGPUAttributor
This was a recursive function with a cache map that was never populated.
It now uses a worklist, and the map is actually used.
Co-authored-by: Johannes Doerfert <johannes at jdoerfert.de>
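A generic sketch of replacing a memoized recursion with a worklist whose cache is actually filled. The commit does not say which property AMDGPUAttributor computes, so this example assumes a stand-in query: whether any function reachable from a root is "flagged". The call graph and class are hypothetical. A failed search caches `false` for every visited node, which is sound because everything those nodes reach was explored without a hit.

```cpp
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the attributor query: does Root, or any function
// it transitively calls, carry some property ("flagged")?
class FlagReachability {
public:
  FlagReachability(std::vector<std::vector<int>> Callees,
                   std::vector<bool> Flagged)
      : Callees(std::move(Callees)), Flagged(std::move(Flagged)) {}

  bool reachesFlagged(int Root) {
    if (auto It = Cache.find(Root); It != Cache.end())
      return It->second;

    // Explicit worklist instead of recursion; Visited guards against cycles.
    std::vector<int> Worklist{Root};
    std::unordered_set<int> Visited{Root};
    bool Found = false;
    while (!Worklist.empty()) {
      int N = Worklist.back();
      Worklist.pop_back();
      if (Flagged[N]) {
        Found = true;
        break;
      }
      if (auto It = Cache.find(N); It != Cache.end()) {
        if (It->second) {
          Found = true;
          break;
        }
        continue; // Cached false: nothing flagged is reachable from N.
      }
      for (int C : Callees[N])
        if (Visited.insert(C).second)
          Worklist.push_back(C);
    }

    if (Found) {
      Cache[Root] = true; // Only Root's answer is known for certain.
    } else {
      // The search finished without a hit, so every visited node is false.
      for (int N : Visited)
        Cache[N] = false;
    }
    return Found;
  }

  std::size_t cacheSize() const { return Cache.size(); }

private:
  std::vector<std::vector<int>> Callees;
  std::vector<bool> Flagged;
  std::unordered_map<int, bool> Cache;
};
```

Unlike the recursive version with an unused map, repeated queries for nodes already explored are answered directly from the cache.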
[CodeGen][RISCV] Inline stack probes immediately after `allocateStack` in `eliminateCallFramePseudoInstr` (#195456)
This PR adds a call to `inlineStackProbe` immediately after
`allocateStack` in `eliminateCallFramePseudoInstr`. This allows code
generation for stack probe pseudoinstructions in non-entry BBs.
Fixes #195454.
[SLP]Bail out on non-schedulable expanded binop with stale operand deps
In tryScheduleBundle's DoesNotRequireScheduling path, an expanded binop
(shl X, 1 modeled as add X, X) doubles the dependency count of the
duplicated operand. If the operand has a single IR use, yet its
ScheduleData already has Dependencies populated by an earlier calculation
that did not see the expanded duplicate use, the double decrement exceeds
calculateDependencies' single increment and UnscheduledDeps goes negative.
Fixes #196281.
Pull Request: https://github.com/llvm/llvm-project/pull/196449
[Clang][HLSL] Fix -Wunused-variable (#196445)
LookupSucceeded is only used in an assertion. Mark it [[maybe_unused]]
so we do not get -Wunused-variable in non-assertions builds.
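The pattern in a minimal, hypothetical C++17 form: a variable that is read only inside `assert` becomes unused when `NDEBUG` is defined, and `[[maybe_unused]]` silences `-Wunused-variable` in that configuration. The helper below is illustrative, not the HLSL code.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: callers guarantee that Key is present.
int lookupExisting(const std::map<std::string, int> &Table,
                   const std::string &Key) {
  auto It = Table.find(Key);
  // Only read by assert(); without [[maybe_unused]], builds with NDEBUG
  // (assertions disabled) would warn under -Wunused-variable.
  [[maybe_unused]] bool LookupSucceeded = It != Table.end();
  assert(LookupSucceeded && "Key must be present");
  return It->second;
}
```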
[AMDGPU][True16] relax d16-write-vgpr32 condition (#194477)
Patch https://github.com/llvm/llvm-project/pull/157795 worked around a D16
load HW issue.
We found that the condition for this workaround can be relaxed for
instructions from the same order group. Downstream testing looks OK.
[DebugInfo] Remove old decls when converting DI (#194964)
We were trying to remove declarations of old debug intrinsics whenever
printing modules or writing them to a file. This is no longer necessary, as
we now use the new-style debug values exclusively, except when a
target pass specifically converts back to the old style. If a target
pass does that, removing the intrinsic declarations is wrong, because the
intrinsics' users still linger.
This change should be NFC except for the experimental DirectX target
where we do exactly that.
Fixes #194884
Fix metadirective loop variant lowering
Preserve the associated DO evaluation when a dynamic metadirective can
select either a loop-associated directive or a standalone fallback, so
the fallback still lowers the original loop body.
Scope temporary loop-IV data-sharing attributes to the selected variant.
Use the selected variant's collapse clause to determine how many loop IVs
to mark, avoiding DSA state leaking between alternatives.