Expand description

Propagates assignment destinations backwards in the CFG to eliminate redundant assignments.

Motivation

MIR building can insert a lot of redundant copies, and Rust code in general often tends to move values around a lot. The result is a lot of assignments of the form dest = {move} src; in MIR. MIR building for constants in particular tends to create additional locals that are only used inside a single block to shuffle a value around unnecessarily.

LLVM by itself is not good enough at eliminating these redundant copies (eg. see https://github.com/rust-lang/rust/issues/32966), so this leaves some performance on the table that we can regain by implementing an optimization for removing these assign statements in rustc itself. When this optimization runs fast enough, it can also speed up the constant evaluation and code generation phases of rustc due to the reduced number of statements and locals.

The Optimization

Conceptually, this optimization is “destination propagation”. It is similar to the Named Return Value Optimization, or NRVO, known from the C++ world, except that it isn’t limited to return values or the return place _0. On a very high level, independent of the actual implementation details, it does the following:

  1. Identify dest = src; statements with values for dest and src whose storage can soundly be merged.
  2. Replace all mentions of src with dest (“unifying” them and propagating the destination backwards).
  3. Delete the dest = src; statement (by making it a nop).

Step 1) is by far the hardest, so it is explained in more detail below.

Soundness

We have a pair of places p and q, whose memory we would like to merge. In order for this to be sound, we need to check a number of conditions:

  • p and q must both be constant - it does not make much sense to talk about merging them if they do not consistently refer to the same place in memory. This is satisfied if they do not contain any indirection through a pointer or any indexing projections.

  • We need to make sure that the goal of “merging the memory” is actually structurally possible in MIR. For example, even if all the other conditions are satisfied, there is no way to “merge” _5.foo and _6.bar. For now, we ensure this by requiring that both p and q are locals with no further projections. Future iterations of this pass should improve on this.

  • Finally, we want p and q to use the same memory - however, we still need to make sure that each of them has enough “ownership” of that memory to continue “doing its job.” More precisely, what we will check is that whenever the program performs a write to p, then it does not currently care about what the value in q is (and vice versa). We formalize the notion of “does not care what the value in q is” by checking the liveness of q.

    Because of the difficulty of computing liveness of places that have their address taken, we do not even attempt to do it. Any places that are in a local that has its address taken is excluded from the optimization.

The first two conditions are simple structural requirements on the Assign statements that can be trivially checked. The third requirement however is more difficult and costly to check.

Future Improvements

There are a number of ways in which this pass could be improved in the future:

  • Merging storage liveness ranges instead of removing storage statements completely. This may improve stack usage.

  • Allow merging locals into places with projections, eg _5 into _6.foo.

  • Liveness analysis with more precision than whole locals at a time. The smaller benefit of this is that it would allow us to dest prop at “sub-local” levels in some cases. The bigger benefit of this is that such liveness analysis can report more accurate results about whole locals at a time. For example, consider:

    _1 = u;
    // unrelated code
    _1.f1 = v;
    _2 = _1.f1;

    Because the current analysis only thinks in terms of locals, it does not have enough information to report that _1 is dead in the “unrelated code” section.

  • Liveness analysis enabled by alias analysis. This would allow us to not just bail on locals that ever have their address taken. Of course that requires actually having alias analysis (and a model to build it on), so this might be a bit of a ways off.

  • Various perf improvents. There are a bunch of comments in here marked PERF with ideas for how to do things more efficiently. However, the complexity of the pass as a whole should be kept in mind.

Previous Work

A previous attempt at implementing an optimization like this turned out to be a significant regression in compiler performance. Fixing the regressions introduced a lot of undesirable complexity to the implementation.

A subsequent approach tried to avoid the costly computation by limiting itself to acyclic CFGs, but still turned out to be far too costly to run due to suboptimal performance within individual basic blocks, requiring a walk across the entire block for every assignment found within the block. For the tuple-stress benchmark, which has 458745 statements in a single block, this proved to be far too costly.

Another approach after that was much closer to correct, but had some soundness issues - it was failing to consider stores outside live ranges, and failed to uphold some of the requirements that MIR has for non-overlapping places within statements. However, it also had performance issues caused by O(l² * s) runtime, where l is the number of locals and s is the number of statements and terminators.

Since the first attempt at this, the compiler has improved dramatically, and new analysis frameworks have been added that should make this approach viable without requiring a limited approach that only works for some classes of CFGs:

  • rustc now has a powerful dataflow analysis framework that can handle forwards and backwards analyses efficiently.
  • Layout optimizations for generators have been added to improve code generation for async/await, which are very similar in spirit to what this optimization does.

Also, rustc now has a simple NRVO pass (see nrvo.rs), which handles a subset of the cases that this destination propagation pass handles, proving that similar optimizations can be performed on MIR.

Pre/Post Optimization

It is recommended to run SimplifyCfg and then SimplifyLocals some time after this pass, as it replaces the eliminated assign statements with nops and leaves unused locals behind.

Structs

Container for the various allocations that we need.
Candidates 🔒
Merger 🔒
WriteInfo 🔒
Describes where a statement/terminator writes to

Enums

Functions

Collects the candidates for merging
Some locals are part of the function’s interface and can not be removed.
If the pair of places is being considered for merging, returns the candidate which would be merged in order to accomplish this.