
This pretty-printer is a direct reimplementation of Philip Karlton’s Mesa pretty-printer, as described in the appendix to Derek C. Oppen, “Pretty Printing” (1979), Stanford Computer Science Department STAN-CS-79-770, http://i.stanford.edu/pub/cstr/reports/cs/tr/79/770/CS-TR-79-770.pdf.

The algorithm’s aim is to break a stream into as few lines as possible while respecting the indentation-consistency requirements of the enclosing block, and avoiding breaking at silly places on block boundaries, for example, between “x” and “)” in “x)”.

I am implementing this algorithm because it comes with 20 pages of documentation explaining its theory, and because it addresses the set of concerns I’ve seen other pretty-printers fall down on. Weirdly. Even though it’s 32 years old. What can I say?

Despite some redundancies and quirks in the way it’s implemented in that paper, I’ve opted to keep the implementation here as similar as I can, changing only what was blatantly wrong, a typo, or sufficiently non-idiomatic Rust that it really stuck out.

In particular you’ll see a certain amount of churn related to INTEGER vs. CARDINAL in the Mesa implementation. Mesa apparently interconverts the two somewhat readily? In any case, I’ve used usize for indices-in-buffers and signed integers for character-sizes-and-indentation-offsets. This respects the need for the signed values to “go negative” while carrying a pending-calculation balance, and helps differentiate all the numbers flying around internally (slightly).

I also inverted the indentation arithmetic used in the print stack, since the Mesa implementation (somewhat randomly) stores the offset on the print stack in terms of margin-col rather than col itself. I store col.

I also implemented a small change in the String token, in that I store an explicit length for the string. For most tokens this is just the length of the accompanying string. But it’s necessary to permit it to differ, for encoding things that are supposed to “go on their own line” (certain classes of comment and blank-line), where relying on adjacent hardbreak-like Break tokens with long blankness indication doesn’t actually work.

To see why, consider when there is a “thing that should be on its own line” between two long blocks, say functions. If you put a hardbreak after each function (or before each) and the breaking algorithm decides to break there anyway (because the functions themselves are long), you wind up with extra blank lines. If you don’t put hardbreaks, you can wind up with the “thing which should be on its own line” not getting its own line in the rare case of “really small functions” or such. This re-occurs with comments and explicit blank lines.

So in those cases we use a string with a payload we want isolated to a line and an explicit length that’s huge, surrounded by two zero-length breaks. The algorithm will try its best to fit it on a line (which it can’t) and so naturally place the content on its own line to avoid combining it with other lines and making matters even worse.
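The explicit-length idea can be sketched in a few lines. This is only an illustration: `StrToken`, `word`, `own_line`, `fits` and the value chosen for `SIZE_INFINITY` are hypothetical names and numbers for this sketch, not necessarily what the printer itself uses.

```rust
/// Stand-in for "too wide for any line"; the exact value is an assumption.
const SIZE_INFINITY: isize = 0xffff;

/// A string token that carries its own notion of width (hypothetical type).
struct StrToken {
    text: String,
    len: isize,
}

/// Ordinary token: the declared length is the real length of the text.
fn word(text: &str) -> StrToken {
    StrToken { text: text.to_string(), len: text.len() as isize }
}

/// Token that should sit on its own line: the declared length is huge,
/// so it can never be judged to fit next to anything else.
fn own_line(text: &str) -> StrToken {
    StrToken { text: text.to_string(), len: SIZE_INFINITY }
}

/// The fit test only looks at the declared length, never at the text.
fn fits(token: &StrToken, space: isize) -> bool {
    token.len <= space
}
```

With 78 columns of space left, `fits(&word("foo"), 78)` holds while `fits(&own_line("// note"), 78)` does not, so a line breaker driven by `fits` would take the zero-length breaks on either side of the comment.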

Explanation

In case you do not have the paper, here is an explanation of what’s going on.

There is a stream of input tokens flowing through this printer.

The printer buffers up to 3N tokens inside itself, where N is linewidth. Yes, linewidth is chars and tokens are multi-char, but in the worst case every token worth buffering is 1 char long, so it’s ok.

Tokens are String, Break, and Begin/End to delimit blocks.

Begin tokens can carry an offset, saying “how far to indent when you break inside here”, as well as a flag indicating “consistent” or “inconsistent” breaking. Consistent breaking means that after the first break, no attempt will be made to flow subsequent breaks together onto lines: once one break in the block is taken, they all are. Inconsistent is the opposite: each break is decided on its own, so as much as fits is packed onto each line. An example of inconsistent breaking would be, say:

foo(hello, there, good, friends);

breaking inconsistently to become

foo(hello, there,
    good, friends);

whereas a consistent breaking would yield:

foo(hello,
    there,
    good,
    friends);

That is, in the consistent-break blocks we value vertical alignment more than the ability to cram stuff onto a line. But in all cases if it can make a block a one-liner, it’ll do so.
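To make the difference concrete, here is a minimal sketch of the two policies for a single flat block. It is not the printer’s algorithm (there is no token stream, no nesting and no lookahead); the function `layout` and its parameters are hypothetical, and widths are counted in bytes, so it assumes ASCII input.

```rust
/// Lays out `prefix item, item, ... suffix` within `width` columns.
/// Continuation lines are indented to the column just after `prefix`.
fn layout(prefix: &str, items: &[&str], suffix: &str, width: usize, consistent: bool) -> String {
    // In all cases, prefer the one-liner when it fits.
    let one_line = format!("{}{}{}", prefix, items.join(", "), suffix);
    if one_line.len() <= width {
        return one_line;
    }
    let indent = " ".repeat(prefix.len());
    if consistent {
        // Consistent: every break in the block is taken.
        let sep = format!(",\n{}", indent);
        return format!("{}{}{}", prefix, items.join(sep.as_str()), suffix);
    }
    // Inconsistent: greedily pack as many items as fit on each line.
    let mut out = String::from(prefix);
    let mut col = prefix.len();
    for (i, item) in items.iter().enumerate() {
        // Room needed after the item: a comma, or the suffix for the last item.
        let trailer = if i + 1 == items.len() { suffix.len() } else { 1 };
        if i > 0 {
            if col + 2 + item.len() + trailer <= width {
                out.push_str(", ");
                col += 2;
            } else {
                out.push_str(",\n");
                out.push_str(&indent);
                col = indent.len();
            }
        }
        out.push_str(item);
        col += item.len();
    }
    out.push_str(suffix);
    out
}
```

Calling `layout("foo(", &["hello", "there", "good", "friends"], ");", 20, false)` reproduces the inconsistent layout shown above, passing `true` reproduces the consistent one, and a width of 40 yields the one-liner in both modes.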

Carrying on with high-level logic:

The buffered tokens go through a ring-buffer, ‘tokens’. The ‘left’ and ‘right’ indices denote the active portion of the ring buffer as well as describing hypothetical points-in-the-infinite-stream at most 3N tokens apart (i.e., “not wrapped to ring-buffer boundaries”). The paper will switch between using ‘left’ and ‘right’ terms to denote the wrapped-to-ring-buffer and point-in-infinite-stream senses freely.
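The two senses of ‘left’ and ‘right’ can be kept apart by storing the stream positions and wrapping only when a slot is touched. The sketch below is one way to do that; the type `Ring`, its methods, and the choice of a half-open `left..right` range are assumptions made for illustration and may differ from the private `ring` module.

```rust
/// Fixed-capacity ring buffer addressed by position in the infinite stream.
struct Ring<T> {
    slots: Vec<Option<T>>,
    left: usize,  // stream index of the oldest buffered entry
    right: usize, // stream index one past the newest buffered entry
}

impl<T> Ring<T> {
    fn with_capacity(cap: usize) -> Self {
        Ring { slots: (0..cap).map(|_| None).collect(), left: 0, right: 0 }
    }

    /// Number of entries currently buffered.
    fn len(&self) -> usize {
        self.right - self.left
    }

    /// Maps a stream index to a slot: the "wrapped-to-ring-buffer" sense.
    fn wrap(&self, stream_index: usize) -> usize {
        stream_index % self.slots.len()
    }

    /// Appends at `right`; returns the stream index, or None when full.
    fn push(&mut self, value: T) -> Option<usize> {
        if self.len() == self.slots.len() {
            return None;
        }
        let index = self.right;
        let slot = self.wrap(index);
        self.slots[slot] = Some(value);
        self.right += 1;
        Some(index)
    }

    /// Removes and returns the entry at `left`, if any.
    fn pop_left(&mut self) -> Option<T> {
        if self.len() == 0 {
            return None;
        }
        let slot = self.wrap(self.left);
        self.left += 1;
        self.slots[slot].take()
    }

    /// Looks up an entry by stream index, if it is still buffered.
    fn get(&self, stream_index: usize) -> Option<&T> {
        if stream_index < self.left || stream_index >= self.right {
            return None;
        }
        self.slots[self.wrap(stream_index)].as_ref()
    }
}
```

With a capacity of 3, the fourth pushed entry has stream index 3 but lands in slot 0, which is exactly the distinction the paper glosses over when it reuses the two terms.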

There is a parallel ring buffer, size, that holds the calculated size of each token. Why calculated? Because for Begin/End pairs, the “size” includes everything between the pair. That is, the “size” of Begin is actually the sum of the sizes of everything between Begin and the paired End that follows. Since that is arbitrarily far in the future, size is being rewritten regularly while the printer runs; in fact most of the machinery is here to work out size entries on the fly (and give up when they’re so obviously over-long that “infinity” is a good enough approximation for purposes of line breaking).
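The bookkeeping behind those size entries can be sketched with a running total: a Begin records the negated total seen so far, and its matching End adds the total back, leaving the width of everything in between. This is a simplified illustration, not the printer’s code: `Token` and `sizes` are hypothetical, a Break’s size is taken to be just its blank width, and nothing here gives up with “infinity”.

```rust
/// Simplified token type for this sketch only.
enum Token {
    Str(&'static str),
    Break(isize), // number of blanks if the break is not taken
    Begin,
    End,
}

/// Computes a size entry for each token in one left-to-right pass.
fn sizes(tokens: &[Token]) -> Vec<isize> {
    let mut size = vec![0isize; tokens.len()];
    let mut open: Vec<usize> = Vec::new(); // indices of still-unmatched Begins
    // Starts at 1 so that a pending Begin is always strictly negative.
    let mut running: isize = 1;
    for (i, token) in tokens.iter().enumerate() {
        match token {
            Token::Str(s) => {
                size[i] = s.len() as isize;
                running += size[i];
            }
            Token::Break(blank) => {
                size[i] = *blank;
                running += *blank;
            }
            Token::Begin => {
                // Pending: negative until the matching End arrives.
                size[i] = -running;
                open.push(i);
            }
            Token::End => {
                if let Some(begin) = open.pop() {
                    // -(total at Begin) + (total at End) = width in between.
                    size[begin] += running;
                }
            }
        }
    }
    size
}
```

For the stream Begin, “foo(”, Begin, “a,”, Break(1), “b”, End, “)”, End this gives 4 for the inner Begin and 9 for the outer one. A Begin whose End has not arrived yet keeps a negative entry.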

The “input side” of the printer is managed as an abstract process called SCAN, which uses scan_stack to manage calculating size. SCAN is, in other words, the process of calculating ‘size’ entries.

The “output side” of the printer is managed by an abstract process called PRINT, which uses print_stack, margin and space to figure out what to do with each token/size pair it consumes as it goes. It’s trying to consume the entire buffered window, but can’t output anything until the size is >= 0 (sizes are set to negative while they’re pending calculation).

So SCAN takes input and buffers tokens and pending calculations, while PRINT gobbles up completed calculations and tokens from the buffer. The theory is that the two can never get more than 3N tokens apart, because once there’s “obviously” too much data to fit on a line, in a size calculation, SCAN will write “infinity” to the size and let PRINT consume it.
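The hand-off between the two processes can be sketched as follows. The buffer layout (a `VecDeque` of text/size pairs), the helper names `drain_ready` and `give_up_on_front`, and the value of `SIZE_INFINITY` are all assumptions made for illustration; the real printer keeps tokens and sizes in its own structures and prints rather than collecting strings.

```rust
use std::collections::VecDeque;

/// Stand-in for "obviously too long"; the exact value is an assumption.
const SIZE_INFINITY: isize = 0xffff;

/// PRINT side: consume entries from the left while their size is known.
/// Stops at the first entry whose size is still negative (pending).
fn drain_ready(buf: &mut VecDeque<(String, isize)>) -> Vec<String> {
    let mut out = Vec::new();
    while buf.front().map_or(false, |entry| entry.1 >= 0) {
        if let Some((text, _size)) = buf.pop_front() {
            out.push(text);
        }
    }
    out
}

/// SCAN side: when the pending entry at the left clearly cannot fit,
/// stop calculating and mark it as infinitely wide so PRINT can proceed.
fn give_up_on_front(buf: &mut VecDeque<(String, isize)>) {
    if let Some(entry) = buf.front_mut() {
        if entry.1 < 0 {
            entry.1 = SIZE_INFINITY;
        }
    }
}
```

If the buffer holds a finished entry followed by a pending one, `drain_ready` returns only the first; after `give_up_on_front` the pending entry counts as complete and the rest of the window drains too.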

In this implementation (following the paper, again) the SCAN process is implemented by the methods named Printer::scan_*, and the PRINT process by the method Printer::print.

Modules

ring 🔒

Structs

Enums

How to break. Described in more detail in the module docs.
PrintFrame 🔒

Constants

MARGIN 🔒
Target line width.
MIN_SPACE 🔒
Every line is allowed at least this much space, even if highly indented.