Sunday, June 19, 2022

Type checking, bidirectional typing, and search

Thanks to some excellent recent posts by Gabriella Gonzalez I've been taking another look at bidirectional type checking, this time with the help of Dunfield & Krishnaswamy's well-written introductory and survey papers on the subject. To all of the above authors, my most hearty thanks, as I had a chance to discuss using these techniques in anger this week with folks who know a lot more about the subject than I do. If you're curious how modern type inference works, I recommend these introductions unreservedly.

When I first learned about Hindley-Milner type inference, I was struck by two things: first, I'd just spent a semester learning first-order mathematical logic, and the inference rules were identical to the ones I'd just learned. Second, the type inference algorithm looked just like logic programming, only without backtracking search. This bugged me for the longest time, but of course it turns out that for this style of polymorphic type checking you simply don't need search – you always get a most general type or you get a type error.

Bidirectional Typing: Dealing with more complicated type systems

As soon as the type system gets a little more complicated, though, it's easy to lose that "always the right answer" guarantee, and these days relatively few general-purpose programming languages try to keep it. Instead, when the going gets tough, we ask programmers to add explicit type annotations (which pure Hindley-Milner typing lacks) to their program. And it turns out that even in languages that do allow variants of Hindley-Milner inference (like 90's versions of Haskell), well-placed type annotations provide useful signposts: even when they aren't necessary for type inference, they are incredibly useful for readers.

Dunfield & Krishnaswamy describe bidirectional type checking as a way for the language designer to systematically decide where type annotations should be required. Broadly, types for some program constructs should be inferred by looking at the code itself; when the going gets tough, types can be checked based on contextual information and explicit type annotations. In their survey, they lay out some criteria for making those choices. The first three criteria together basically say that the rules should fit together to check all the programs you want to check, without requiring unnecessary typing rules.

The fourth and last criterion is the tricky one: You shouldn't need too many or overly verbose annotations, and you should be able to predict where the annotations will belong (which shouldn't change wildly as the program evolves). They give some rules of thumb:

  • Introduction rules should check against an input type.
  • Elimination rules should synthesize an output type.
  • For other rules, make use of the available information as much as possible. If the preconditions / subexpressions of a rule can be formed by pulling apart an input type, make it a checking rule. If there's a subexpression whose type provides complete information about all the types in play, infer that subexpression's type and use it to check everything else and to infer the result type.

This all comes together in the let rule (the rule for let x = a in b), which is the most interesting rule in Hindley-Milner type inference, and frustratingly isn't treated in detail in the D&K introductory paper. In Hindley-Milner, the let rule performs generalization, allowing us to write expressions like (let id = λx. x in ...) and assign id its most general type, ∀α. α→α. This generalization step is only performed for expressions bound in let bindings. In the lambda calculus (let x = a in b) is exactly the function application ((λx.b) a), but Hindley-Milner type inference will generalize a in the former but not in the latter.

How does bidirectional typechecking handle ((λx.b) a)? For a function application the output type and the argument type are fully determined by the type of the function being applied, so we try to infer the type of (λx.b). In the simple system in the tutorial, we require a type annotation on this function, since we can only check its type (indeed, the only places we ever need annotations are immediately surrounding a lambda like this). With a few more extensions, we can infer a type for this function without an annotation. But we can't infer a polymorphic type for a. Indeed, if b requires x to be polymorphic, type checking b will fail – we'll use x at two different types (for example, if a were (λy. y), and we called (x 5) and (x true) in b, we'd be told that x can't take both an int and a bool as an input).

So instead we treat (let x = a in b) differently, attempting to infer a type for a, then using that type for x in b. But that still leaves open a question: are we checking or inferring the type of b? If we want the minimum number of typing rules, we should pick one or the other. But neither choice is perfect: If b is something we could easily infer, and we require it to be checked, we'll be adding obvious type annotations to our program. If b can't be inferred, but we had context we could have checked against, requiring an extra annotation in our program is unfortunate.

Part of the problem here, I think, is conceiving of bidirectional type checking as two functions:

    infer :: TEnv → Expr → Either Error Type
    check :: TEnv → Expr → Type → Maybe Error

I find it clearer to think of it as a single function:

    checkInfer :: TEnv → Expr → Maybe Type → Either Error Type

That is, if the last argument supplies a type, we're checking against it; otherwise we're inferring. In either case we return the expression's type, or we fail.
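
To make the shape concrete, here's a rough sketch of such a function in Rust for a tiny expression language (integer literals, variables, lambdas, application, and type annotation), using the standard bidirectional rules from the tutorial; the little language, the names, and the Rust rendering are all mine rather than D&K's.

use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Type {
    Int,
    Arrow(Box<Type>, Box<Type>),
}

enum Expr {
    Lit(i64),
    Var(String),
    Lam(String, Box<Expr>),
    App(Box<Expr>, Box<Expr>),
    Ann(Box<Expr>, Type),
}

type TEnv = HashMap<String, Type>;

// `expected == None` means "infer"; `Some(ty)` means "check against ty".
// Either way we return the expression's type on success.
fn check_infer(env: &TEnv, e: &Expr, expected: Option<&Type>) -> Result<Type, String> {
    match (e, expected) {
        // Literals and variables always synthesize.
        (Expr::Lit(_), None) => Ok(Type::Int),
        (Expr::Var(x), None) => env
            .get(x)
            .cloned()
            .ok_or_else(|| format!("unbound variable {}", x)),
        // An annotation switches us from inference to checking.
        (Expr::Ann(inner, ty), None) => {
            check_infer(env, inner, Some(ty))?;
            Ok(ty.clone())
        }
        // Lambdas only check: pull the argument and result types out of the input type.
        (Expr::Lam(x, body), Some(Type::Arrow(a, b))) => {
            let mut inner_env = env.clone();
            inner_env.insert(x.clone(), (**a).clone());
            check_infer(&inner_env, body, Some(&**b))?;
            Ok(Type::Arrow(a.clone(), b.clone()))
        }
        // Application synthesizes: infer the function, check the argument against its
        // domain, and synthesize the codomain as the result.
        (Expr::App(f, arg), None) => match check_infer(env, f, None)? {
            Type::Arrow(a, b) => {
                check_infer(env, arg, Some(&*a))?;
                Ok(*b)
            }
            t => Err(format!("applied non-function of type {:?}", t)),
        },
        // Checking an expression that only synthesizes: infer it, then compare.
        (_, Some(ty)) => {
            let got = check_infer(env, e, None)?;
            if got == *ty {
                Ok(got)
            } else {
                Err(format!("expected {:?}, found {:?}", ty, got))
            }
        }
        (Expr::Lam(..), None) => Err("cannot infer an unannotated lambda".to_string()),
    }
}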

With this signature, I propose a slightly different set of rules of thumb from those proposed by D&K:

  • Rules with no preconditions (such as the rule for integer constants), or with only contextual sanity checks (such as the variable rule, which just looks up a variable in the type environment), always synthesize.
  • Rules for which one or more preconditions mention exactly the output type should work as both inference and checking rules if the remaining preconditions make that possible (it should be easy to write this down as a single bidirectional rule, where the presence of an input type says we're checking, and its absence says we're inferring).
  • Otherwise, follow the other heuristics.

The addition of bidirectional rules is a way to keep the system compact while minimizing the annotation burden, and it should still be obvious where to annotate: we shouldn't need to annotate such a bidirectional expression, but we might choose to do so. For example, we can imagine adding if to our language and propagating types bidirectionally to both the then and else branches; if both branches would otherwise require annotation, we might prefer to annotate the if expression as a whole.

Or use search?

We could imagine an alternate universe where we can treat ((λx.b) a) similarly to (let x = a in b). Imagine we start by trying to infer a type for (λx.b), but fail due to the need for a type annotation. At this point we could try to make progress by instead inferring a type T for a. Now we can infer a type for b under the assumption that x:T. The tutorial introduction to bidirectional typing with inference essentially does the same thing in reverse for application, propagating the function type to operands using special rules.

But notice that we're now into full logic programming: we try one typing rule, and if that fails we backtrack and try again with a different rule whose structure of inference and checking is completely different. Is this a good idea?

  • It lets us successfully type check more programs, or equivalently, type check some programs with strictly fewer type annotations than originally required.
  • But it makes every function application a two-way branch point, so that the time to type check our program is exponential in the depth of function applications. We can work around this somewhat by making repeated application (f a b c d) pick a direction, but in general this concern has kept search from wide use for common language constructs like function application and let binding.
  • Search is also viewed as leading to surprising behavior, where seemingly innocuous program changes (like adding an additional line of code) lead a previously checkable program to no longer type check.

That said, there may still be a useful role for search in providing better type errors. Imagine that, having encountered an error on our primary type checking path, we try some of the alternative paths. If any of these successfully type the program, they allow us to tell the user what type annotations they can add to their program to make it type check.

Unspeakable types

It's worth mentioning one big problem with type annotations, and that's what I like to refer to as unspeakable types (as Nick Benton pointed out to me, they're actually called non-denotable types; I'll stick with unspeakable here as it's evocative). In a polymorphic type system like Hindley-Milner, if we're inside a polymorphic function we can't actually write down the type of any value that mentions one of the polymorphic types, unless we have some way of naming those types and talking about them. These types are unspeakable. To work around unspeakability we need a form of type annotation that not only says "This expression has thus-and-such a polymorphic type", but also brings the quantified type variables into scope in its body. So, for example, I'd like to be able to write ((λx. (x : α)) : ∀α. α→α) and have the inner α refer to the outer polymorphic type. In most presentations (including the D&K papers) this is not the case, and it looks like the typing rules to allow it can get hairy pretty fast: you need to start distinguishing syntactic types from the actual types you're inferring, and you need to translate between the two.

Unspeakable types don't just come up due to polymorphism – in Rust, for example, the type of a particular lambda expression is unspeakable, and the best you can do is some supertype of that lambda (this is a big problem if the type of your lambda expression ends up instantiating a polymorphic type variable of another type); this is one of the big places the impl Trait notation gets used a lot, meaning roughly "a specific type I can't or don't want to speak, but it's a subtype of this type". The documentation link above talks about "types that only the compiler knows" – unspeakable.

The lesson I'd like to take away from this is "don't make types in your system unspeakable", or perhaps "never require a type annotation in a context where there might be an unspeakable type".  But in practice it isn't necessarily that easy, as Rust's example shows.

Monday, January 3, 2022

Mucking About with Linear Probing

Back in November, my colleague Bradley Kuszmaul sent out a link to some work he did with his son William and with Michael Bender.  The paper is titled "Linear Probing Revisited: Tombstones Mark the Death of Primary Clustering," and still seems to be in preprint / under submission.  I'll refer to the authors and their paper as BKK here.  The basic idea is classic Mike Bender stuff – handle quadratic insertion bottlenecks by adding fake "tombstone" entries that get skipped over during lookups but make space for new entries to be inserted.  I was interested in what this looked like in practice and wrote some code to check it out.

A Bit of Background

I won't attempt to answer the question "What is a hash table" here, except to highlight the stuff most intro to computer science students know – it allows you to keep a key/value dictionary and expect to look up a key (or find it's missing) in constant time (O(1)), by computing a hash of the key (which turns the key into a random-looking integer) and using that hash to look up the entry in an array.  There are a lot of ways to handle the details here; the most important problem is that keys can end up with the same hash value, which will happen if you have more possible keys than table entries, and we need a strategy for handling the resulting collisions.  Linear probing means we store all the elements of the table in the table itself (this is called open addressing, as opposed to chaining) and that when a collision occurs we search for the next available empty slot and put data there.

Open addressing schemes require keeping some empty table entries around; if you don't have enough of them, you end up searching through most of the table to find the next available entry, and that takes O(n) time rather than O(1) time.  Worse still, if you don't have some clever way of cutting off searches, looking up a missing key can also take O(n) time!  Because of its simplicity, linear probing is more sensitive to collisions than most competing schemes.  Even if your hash function is good, you can end up with collisions on nearby buckets piling up and forming clusters.  We like linear probing anyway because it's simple to implement and it makes it easy for the memory system in a modern computer to predict what memory we're likely to touch when we perform hashtable operations.  It turns out we can predict how bad clusters are expected to get.  Unsurprisingly it depends on how much empty space is left in the table.

We usually state this in terms of the table's load factor α.  Suppose our n-slot table contains k full entries and j empty slots, so n = k + j.  Then the load factor is the proportion of full slots: α = k / n.  BKK instead talk in terms of x = 1 / (1 - α).  That means x = n / (n - k) = n / j.  So 1/x is the proportion of empty slots, and x starts at 1 and rises rapidly as the table gets full.

This reflects the fact that it gets harder and harder to find empty space as the amount of empty space drops to nothing.

It's not too hard to resolve clustering for lookups – simply make sure that keys are stored in the table in hash order, so that we can stop searching for a key the moment we find a key with a bigger hash (and we know we can skip an entry if the hash is smaller).  That makes all lookups, including for missing keys, take expected time linear in x (O(1+x)).  Note "expected" – we can get unlucky (or choose a bad hash function) and do much worse than this, but if our hash function is any good we'll usually be OK.

The key ordering trick doesn't work for insertion – we can find the place where a new key belongs exactly as described above for lookup, but we still need to move existing entries over until we reach an empty slot.  We instead expect it to take O(1 + x²), which is quite a bit worse – see the red line in the picture above, which eventually saturates saying, roughly, "at this point we can expect to have to look through a large proportion of the table to find an empty entry".
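
To make both operations concrete, here's a toy sketch of ordered linear probing over u64 keys.  This is my own simplification rather than the bkk-hash code: the home slot is just key % n instead of the top bits, entries are kept in (home slot, key) order, and wraparound at the end of the table is ignored.

fn home(key: u64, n: usize) -> usize {
    (key % n as u64) as usize
}

// Order entries by home slot first, then by the key itself as a tie-break.
fn rank(key: u64, n: usize) -> (usize, u64) {
    (home(key, n), key)
}

// Lookup stops at the first empty slot *or* the first entry that sorts after the
// key we want, so misses cost about the same as hits.
fn lookup(table: &[Option<u64>], key: u64) -> Option<usize> {
    let n = table.len();
    for i in home(key, n)..n {
        match table[i] {
            None => return None,
            Some(k) if k == key => return Some(i),
            Some(k) if rank(k, n) > rank(key, n) => return None,
            Some(_) => {} // smaller entry: keep probing
        }
    }
    None
}

// Insertion finds the key's place the same way, then shifts the rest of the
// cluster right by one slot until an empty slot absorbs the displacement.
fn insert(table: &mut [Option<u64>], key: u64) {
    let n = table.len();
    let mut carry = key;
    for i in home(key, n)..n {
        let slot = table[i]; // Option<u64> is Copy
        match slot {
            None => {
                table[i] = Some(carry);
                return;
            }
            Some(k) if rank(k, n) <= rank(carry, n) => {} // carry belongs further right
            Some(k) => {
                table[i] = Some(carry); // carry belongs here; push k along
                carry = k;
            }
        }
    }
    panic!("cluster ran off the end (wraparound isn't handled in this sketch)");
}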

We care about this because those j unused entries are wasted space.  The higher the load factor we can support, especially as tables grow ever larger, the more economically we can use memory.  This has knock-on advantages, such as better use of CPU caches.


Deletion


We didn't talk about deletion above, but there are two approaches in common use.  First is simply to remove the entry and shuffle later entries backwards to fill the hole until we find a place where we can leave a hole (an entry whose hash already matches the slot in the table where it resides, or an entry that's already empty).  That takes about as long as insertion.

We could instead just mark the entry deleted, leaving a so-called tombstone.  We'll skip it when we do lookups (slowing down lookups a teensy bit); it can be treated as empty space during subsequent insertions, or we can periodically sweep the table and remove all the tombstones if too many of them build up.

Key Insight: To Make Insertion Fast, Make Sure Every Cluster Contains Enough Regularly-Spaced Tombstones


A tombstone sitting in the middle of a cluster creates space for insertions earlier in that cluster.  We can insert a tombstone directly into the table, just as if we'd inserted an element and then immediately deleted it.  It obeys the same key ordering constraints, and will be ignored by lookups (as a result, it'll make lookups a little bit more expensive).  Now there's somewhere nearby for insertions to expand into.  The goal is to trade off a little bit of lookup time for a lot of insertion time.

Graveyard Hashing


BKK suggest a strategy (given a fixed target x) that they call graveyard hashing:
  • Rebuild the table every n / 4x updates (insertions or deletions).
  • To rebuild, first remove all existing tombstones.  Then add n / 2x regularly-spaced tombstones.
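
As a rough sketch of the bookkeeping (my own rendering, not BKK's pseudocode or the bkk-hash code), a rebuild might look something like this.  Re-inserting the live keys into a fresh table to get rid of the old tombstones is just the ordered-probing insertion from before, so only the tombstone side is shown.

#[derive(PartialEq)]
enum Slot {
    Empty,
    Full(u64),
    Tombstone,
}

// Number of updates (insertions or deletions) allowed between rebuilds: n / 4x.
fn rebuild_interval(n: usize, x: usize) -> usize {
    (n / (4 * x)).max(1)
}

// Place one tombstone at (or just past) `pos` by shifting the cluster right by one
// slot, exactly as if we had inserted an element there and immediately deleted it.
// If `pos` is already empty there's nothing to break up, so we leave it alone.
// Assumes the table has at least one empty slot (graveyard tables always do).
fn place_tombstone(table: &mut [Slot], pos: usize) {
    let n = table.len();
    if table[pos] == Slot::Empty {
        return;
    }
    let mut carry = Slot::Tombstone;
    let mut i = pos;
    while carry != Slot::Empty {
        carry = std::mem::replace(&mut table[i], carry);
        i = (i + 1) % n;
    }
}

// After the live keys have been re-inserted into a fresh table (removing the old
// tombstones), scatter n / 2x fresh tombstones: roughly one every 2x slots.
fn scatter_tombstones(table: &mut [Slot], x: usize) {
    let spacing = 2 * x.max(1);
    let mut pos = 0;
    while pos < table.len() {
        place_tombstone(table, pos);
        pos += spacing;
    }
}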

Checking it out


What's this look like in practice?  I wrote some Rust code (https://github.com/jmaessen/bkk-hash) to check it out.  I use a fixed-size table with 1024 entries, and into it I insert random 64-bit integers.  I use the uppermost 10 bits as the hash index.  The remaining bits are used to arbitrarily order entries within a bucket (this isn't strictly necessary, but you do want keys with more than 10 bits so that collision behavior is interesting).  Since I only care about collision behavior, I don't need to store values.  What I do store is a histogram of the lookup cost (number of elements I probe minus 1) of all elements as the table fills.  I also store the insertion cost (again, number of elements touched minus 1) for each element I insert.  I repeat this 10,000 times, each time starting with an empty table and a fresh random number sequence.  I then aggregate the histograms and dump them along with average and standard deviation of insertion and lookup costs.  Note that I never actually delete elements!

Graveyard Hashing talks about steady-state behavior.  To get a feel for behavior as the table fills, I rebuild whenever more than n / 4x updates have happened since the last rebuild (recomputing x as the table fills).  I remove all previously-added tombstones and insert n / 2x new, regularly spaced tombstones (one every 2x slots).  If a tombstone would fall in an empty slot, I simply leave the slot empty (this may reduce lookup cost and has no effect on insertion cost).

Lookup (probing) behavior


So, what does this look like in practice?  Here's average lookup probe length versus k:

Note the stairstep behavior for BKK: when we rebuild and insert tombstones, the probe cost goes up a bit compared to the standard (std) linear probing strategy.  Our average lookup cost is pretty reasonable, even as the table becomes completely full.  Note that the standard deviation is really large in both cases – this is a heavy-tailed distribution.  This is unsurprising – we can't look an element up in -1 probes, so all the growth is at the high end of the curve.

This is the same graph as above, scaled by x.  We can see that the expected case for standard probing is bounded by x / 2, which is a tighter bound from Knuth, and that BKK roughly doubles this probe cost in the worst case.  Note how the curve falls off quickly near n: x grows quickly, while the observed average probe length doesn't actually grow to O(n).




If we look at the distribution of probe lengths on a log scale, we can see the effect of the heavy-tailed distribution, which shows up as almost-straight lines in log space.  Here a line like "std probe_len 301" corresponds to the distribution of probe sizes for the standard algorithm in a hash table with 301 elements.  The corresponding distributions for BKK and standard algorithms have similar colors (apologies if Google Sheets doesn't do a great job of color-blind accessibility with this many lines).  By far the bulk of data ends up over on the left for all distributions, indicating most elements can be found with a short probe of 0 or 1 elements.  Note that BKK and standard probe length histograms overlap in the middle of the distribution.  This is because of the 2x probe length overhead we saw in the previous graph.  Graveyard hashing probably only makes sense if our load is insertion / deletion-heavy.  If we're going to fill a table and then start doing lookups, it may make sense to squeeze out all the tombstones.  Interestingly, hash table APIs rarely provide an operation to do this, and APIs such as unordered_map in C++ and HashMap in Rust promise not to mutate the table during a lookup operation (so we can't remove tombstones lazily in response to a burst of lookups).

Insertion behavior


By contrast, insertion behavior is quite a bit worse (as we'd expect from the quadratic big-O factor):

 It's hard to see the difference between this graph and the one above until you look at the scale on the y-axis – lines for lookup went up to about 21 whereas lines for insertion reach all the way to 512 or so!  On average we're looking at half the elements when the hash table is practically full, which is what we'd expect.  But note that the jagged bkk line is below the line for the standard algorithm.  It looks like bkk is doing its job and keeping insertion cost lower.  But is it doing so in an amortized (big O) way?

It is.  Note how insertion time scaled by x starts to grow quickly as things fill up, whereas we see steady growth to about 2 when we use bkk.  Note that the bkk insertion cost slips a tiny bit above the standard insertion cost around k = 625, but is otherwise reliably lower.  We see that in order to get constant behavior from the standard algorithm we need to scale insertion cost by x².  Note how the standard deviations are much larger for insertion.  Even with bkk we can end up with a far more difficult worst case.

The insertion histograms are far less pretty than the corresponding histograms for lookup.  This is in part because they're noisier – we only get a single datapoint per trial for each k, rather than being able to test the k already-inserted points as we did for lookup.  So each point corresponds to 10,000 trials rather than millions of datapoints.  Nonetheless, we can also see how rapidly insertion length spirals out of control (here the horizontal axis cuts off at an insertion probe length of 200, versus 65 or so for the lookup histograms).  We can also see how much worse things get for large set sizes (even bkk does badly at 960 elements).  Finally we can see there's quite a long smattering of outliers even for the smaller curves; we can get ourselves into trouble no matter what we do.

Is Graveyard Hashing Worth It?  Probably Not.


So based on my mucking about, is graveyard hashing worth it?  Note that I didn't do any head-to-head timing in any of the above tests – that'd be invalidated by the enormous amount of recordkeeping I do to maintain the histograms.  Instead, let's consider the additional runtime complexity it adds.

For BKK my code ends up paying roughly 5 times the work for each insertion.  That's because for every i operations we perform 2i tombstone deletions followed by 2i tombstone insertions!  If we fix a particular target load factor x we will be inserting and deleting the same fixed set of tombstones (though we'll still need to handle cleanup of tombstones created by deletions).  Note also that naive cleanup of randomly-placed tombstones will cost O(n) – we'll need to scan the entire table for deleted entries if we're not careful.

Using Graveyard Hashing as Inspiration


Instead, it's probably worth taking graveyard hashing as a jumping-off point for other techniques to improve the performance of linear probing.  For example, a lot of modern hash table implementations use multi-entry buckets to speed up probing (for example Swiss Table, F14, and the Swiss Table-inspired Rust HashMap).  These use SIMD operations to quickly check 8-16 entries at a time for a matching hash.  Load factor can be quite high, since only a single slot in a bucket needs to be empty to stop probing or create space for insertion.  That still leaves open the collision strategy when a bucket does fill.  We can view BKK as one possibility in the space of collision strategies, one with the potential to run at high load factors while keeping insertion costs low.  We shouldn't need to insert more than a tombstone per filled bucket.

We can go in a different direction and consider that graveyard hashing is just trying to solve the problem of breaking up large clusters into smaller clusters to reduce insertion cost.  We can imagine doing a lazy version of this by stopping insertion after O(x) steps; at this point we instead search backwards for the start of a cluster (the first empty or tombstone element before the insertion point), and systematically insert tombstones every O(x) elements until such a tombstone coincides with an existing tombstone or an empty slot.  We can then re-try insertion secure in the knowledge that it will succeed.


Side Note on Sorting with Gaps


I mentioned that this paper was "classic Mike Bender stuff".  One example of Mike's past work is library sort, which insertion sorts in O(n lg n) rather than O(n²) by leaving gaps in the set to be sorted.  If we end up with a run without enough gaps, we insert gaps until there are enough gaps again.  If you squint at my hash table code, you realize it's actually sorting 64-bit numbers in O(n) time, again by leaving gaps!  This is actually better than radix sort can manage (since radix sort cares about the number of bits in my keys as well).  How can this be?  Well, it's because we know the numbers we're sorting are uniformly distributed, and so the hash code (the upper bits) provide a good estimate of where each will end up in the sorted sequence.  So we can basically bucket sort them in a single pass, using insertion sort within each bucket.  That's exactly what the hash table implementation ends up doing.
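
To make the squint concrete, here's a self-contained sketch of the idea (my own rendering, deliberately simpler than what the bkk-hash code does): scatter the keys into a half-empty table at a slot proportional to their value, resolve collisions with ordered linear probing, then read the non-empty slots back in order.

fn sort_uniform(keys: &[u64]) -> Vec<u64> {
    let m = keys.len();
    // Leave gaps: a table twice as big as the input, so the load factor is ~50%
    // and there's slack at the end (home slots only ever land in the first half).
    let mut table: Vec<Option<u64>> = vec![None; 2 * m];
    for &k in keys {
        // Home slot in [0, m), proportional to the key's position in [0, 2^64).
        let mut i = ((k as u128 * m as u128) >> 64) as usize;
        let mut carry = k;
        loop {
            let slot = table[i];
            match slot {
                None => {
                    table[i] = Some(carry);
                    break;
                }
                Some(e) if e <= carry => i += 1, // keep probing
                Some(e) => {
                    table[i] = Some(carry);      // carry belongs here
                    carry = e;                   // push the larger key along
                    i += 1;
                }
            }
        }
    }
    // Because home slots are monotone in the key and probing keeps each cluster
    // ordered, reading the non-empty slots back gives the keys in sorted order.
    table.into_iter().flatten().collect()
}

For uniformly distributed keys each insertion touches O(1) slots in expectation, so the whole thing runs in expected linear time; for badly skewed input it degrades toward plain insertion sort.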






Tuesday, December 21, 2021

Ideas you should understand: Using Second-Order Types for Safe Mutation

Hiding state with Phantom Type Parameters


In this post we'll take a brief look at new work on hiding state manipulations using phantom type parameters and second-order quantification.  This has become a workhorse of type-based safety over the last 30 years or so, and it's worth understanding at least the basics if you want to use a safe language like Rust or Haskell.

Haskell (which I think of as my Home Programming Language) is purely functional, but sometimes it's convenient (or even asymptotically faster) to use an imperative algorithm with mutable state.  How can we delimit side effects safely?  There are two problems: First, making sure we don't store objects of different types into the same mutable storage.  For this we simply make sure our mutable storage is typed.  More difficult is preventing data races and delimiting the scope of side effects (so that we can, for example, use mutation to compute a result and then access it in a read-only fashion when we're done).  For this we can use local universal quantification over a phantom type.  The basic idea is most popularly embodied in the Haskell ST library which allows us to (for example) create an array and fill it in using side effects, then obtain an immutable result:

import Control.Monad (forM_)
import Data.Array.ST (newArray_, readArray, runSTArray, writeArray)

fibArray n = runSTArray $ do
  r <- newArray_ (0, n)
  writeArray r 0 0
  writeArray r 1 1
  forM_ [2..n] $ \i -> do
    i1 <- readArray r (i - 1)
    i2 <- readArray r (i - 2)
    writeArray r i (i1 + i2)
  return r

This relies on an array type STArray s i v, where i is the index type and v is the value type, but s is a phantom type variable that is never visibly instantiated with anything.  To provide isolation, runSTArray has the type (∀s. ST s (STArray s i v)) → Array i v which we can read as saying "Takes a (monadic) computation manipulating mutable state that results in a mutable array, and returns an immutable array".  The key idea is that ∀s at the front, which states that the computation must work for any s, and in particular we don't get to know and aren't allowed to depend upon what particular type s might get plumbed in under the covers.  For example, imagine we pass back an array of pointers to a mutable reference:

broken :: Array Int (STRef s Int)
broken = runSTArray (newArray (1, 10) =<< newSTRef 0)

This gets rejected by the compiler, because the s that escapes is no longer arbitrary – it can be chosen by code elsewhere in the program.

This idea was floating in the air in the early 90s; Shail Aditya used hidden type variables in the Id compiler to accomplish a similar thing in his Ph.D. thesis, and the FX region system had similar ideas, but the presentation here is from Launchbury and Peyton Jones 93 and is the first time I recall seeing the idea represented at the language level using explicit second-order quantification.  It was also the first deep use of second-order quantified types that I (and many others) ever saw, and is at the heart of a lot of subsequent work that plays variations upon the same theme.


GhostCell: Separating Permissions from Data


This leads me to the work the RustBelt project has been doing, and their recent GhostCell paper.  In Rust, we have a system of lifetime parameters that control object lifespan and allow safe mutation without interference.  The problem is that, in order to get permission to mutate an object, we need to get our hands on a unique pointer to that object; no other references to it may exist anywhere (we can borrow this reference from another unique pointer so long as we return it when we're done).  This means every object must have a single unique owner.  This classically runs into trouble the moment we try to implement (say) a doubly-linked list: A cell at rest in the middle of a list has at least two references to it, the prev pointer of the next element and the next pointer of the previous element.  How can we safely create and manipulate a doubly linked list in Rust?  The RustBelt folks use references to a special kind of cell, a GhostCell:

pub struct Node<'brand, T> {
    data: T,
    prev: Option<NodeRef<'brand, T>>,
    next: Option<NodeRef<'brand, T>>,
}
pub type NodeRef<'brand, T> = &GhostCell<'brand, Node<'brand, T>>;

pub fn insert_next(
    node1: NodeRef<'brand, T>,
    node2: NodeRef<'brand, T>,
    token: &mut GhostToken<'brand>,
) {
    Self::remove(node2, token);
    let node1_old_next : Option<NodeRef<_>> = node1.borrow(token).next;
    if let Some(node1_old_next) = node1_old_next {
        node1_old_next.borrow_mut(token).prev = Some(node2);
    }
    node1.borrow_mut(token).next = Some(node2);
    let node2: &mut Node<_> = node2.borrow_mut(token);
    node2.prev = Some(node1);
    node2.next = node1_old_next;
}

Now the only way to access the contents of a GhostCell is using a GhostToken, and that's where the safety comes from – the GhostToken obeys the same restrictive lifetime and sharing rules as regular Rust objects, but if we get our hands on a mutable GhostToken<'brand> we can mutate GhostCell<'brand, T>s to our heart's content.  How do we make sure that there's one and only one token with the right brand?  It's the same second-order universal type quantification, updated for Rust:

impl<'brand> GhostToken<'brand> {
    pub fn new<R>(f: impl for<'new_brand> FnOnce(GhostToken<'new_brand>) -> R) -> R
}

So: f receives a GhostToken<'new_brand>, and within f we can create and use GhostCell<'new_brand, T>s, so long as we throw them away (or freeze them into regular Rust objects) before we leave f.

The token is a 0-sized object that gets inlined away, so in practice this costs us (almost) nothing.
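
For concreteness, here's roughly what using this looks like.  I'm assuming the crate accompanying the paper is importable as ghost_cell and exposes exactly the API sketched above; the import path and the &token / &mut token reborrows are my guesses, so adjust them to match the real crate.

// A tiny usage sketch (not from the paper): shared aliases to the cell are fine,
// because mutable access is granted by the token, not by the reference.
use ghost_cell::{GhostCell, GhostToken};

fn demo() -> i32 {
    GhostToken::new(|mut token| {
        let cell = GhostCell::new(1);       // brand tied to `token` by inference
        let alias = &cell;                  // share as many plain & references as we like
        *cell.borrow_mut(&mut token) += 41; // mutation goes through the unique token
        *alias.borrow(&token)               // reads go through a shared borrow of the token
    })
}

Nothing branded with 'new_brand can escape the closure, so by the time new returns, the token and every cell tied to it are gone.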


The Hitch: Object Lifetimes


The hitch with a GhostCell is this: how do we manage the lifetimes of our linked list elements?  For an ordinary single-linked list, Rust lets us splice elements in and out and handles the deallocation for us.  But in this world of GhostCells, the compiler can't tell how many references exist to a given cell, it just knows that they've all got the same 'brand and will be dead by the time the GhostToken is dead.

As a result, we've got to manage GhostCell storage differently from the storage of other Rust objects.  One simple thing we can do (and this is done in the actual example code accompanying the paper) is to use arena allocation – just let all the objects sit around until the GhostToken dies, then clean them all up in one go.  This works well if we don't allocate a lot of junk along the way, but it may work badly for a long-lived data structure with a lot of churn (like, say, an LRU cache).

Otherwise, we're going to have to use some kind of garbage collection.  The repo for the paper shows an example using Arc (reference counting), where the prev pointers are weak (don't count for object lifetime, to avoid reference cycles).  But Arc has a lot of space overhead – two 64-bit counters for every linked list cell.

Another intriguing possibility is to manage the pair of references using static reference counting.  This uses numeric type parameters to describe fractions of an object reference; you can mutate an object if you have the whole thing, or you can split it into immutable pieces and put those pieces back together again later.  Combined with GhostCell you gain the ability to mutate fractional pieces as well, and the system is simply managing reference lifetimes on your behalf.  You still need to be very disciplined, though – it's possible to drop one half of a reference and end up with something you can't clean up any more.


More Work is Needed!


So there we go.  You can get unique references combined with lifetime management, or you can get arbitrary side effects within a limited scope, but we don't have a good way to combine them (yet?).  I'm hopeful that we can find a solution that gives us the best of both worlds, but it's not obvious and certainly not easy.

One thing it'd be nice to see by the by is extending Rust to permit existentially-quantified lifetimes in results of functions, not just local universal quantification in function arguments.  This would actually behave pretty nicely from Rust's point of view: you get a lifetime that's distinct from all the others in scope, but still obeys the usual Rust rules.  Not only would this let us write code in direct style:

let token = GhostToken::new();
... use token...

It would also allow us to write functions that themselves return lifetime-scoped objects.  Finally, it would allow us to create multiple tokens with interleaved lifetimes that control disjoint pieces of memory.  This is something regular Rust lifetimes can do, but right now GhostToken::new takes a function argument and those functions have to nest neatly.  We can work around it by (say) dropping a GhostToken early, but that requires a lot of thought on our part – thought the Rust borrow checker can and should put in safely and automatically on our behalf.


(picture from Launchbury and Peyton Jones 93)

Sunday, February 16, 2020

Ideas You Should Understand: Call-by-Need is Clairvoyant Call-by-Value

I quite enjoyed this paper by Jennifer Hackett and Graham Hutton.  It's interesting that it's taken us more than 50 years of lazy programming to come to this simple idea — on reading, I immediately kicked myself and wondered why I hadn't thought of that.

The difference between call-by-need and call-by-value is conceptually simple.  In call-by-value (cbv), we evaluate function arguments before we call the function.  In call-by-need (cbn, implemented using lazy evaluation) we evaluate them when we discover they're actually required, then we remember the results in case we need them again.  What it means for a computation to be "required" can get fiddly, but it's easy to think about a function that's returning a single value.  We only compute the things we need to obtain that value.  So if we're adding together two numbers to obtain the result, we need to evaluate both numbers.  If we are doing some conditional control flow, we need to evaluate the value of the condition.  And if we're calling another function, we need to know what function to call and we need to compute the value of that function call in turn.

The key idea here isn't just what is being evaluated, but more importantly what isn't being evaluated.  If we discover we don't need something, we don't evaluate it.  As a result, cbn programs terminate more often than cbv programs – to see this, imagine calling a function that throws its argument away with an argument that loops forever.  In fact, we can prove that cbn represents a normalizing strategy for evaluation — if there's a way to obtain a result, cbn will do so.

Efficiency

This gives us an intuition that cbn is in some sense more efficient than cbv, in that it can take a finite number of computational steps in a situation where cbv requires an infinite number.  Indeed if we just count computational steps then cbn takes the minimum possible number of computational steps of any strategy we might pick, and there are terminating computations that will always run asymptotically (big-O) faster under cbn than under cbv.

The dark downside to all this is that an implementation of cbn requires a bunch of extra mechanisms to keep track of unevaluated computations, remember results once evaluation is complete, and check whether something has been evaluated or not.  So in practice cbn evaluation imposes a heavy (but constant factor) overhead to all computations.  It turns out that there's a similar conceptual overhead when we're doing abstract reasoning about cbn programs as well — we end up having to replicate the record keeping we do during evaluation.

Clairvoyance

The simple insight of the paper is this: every time we bind an expression, we can non-deterministically decide to just throw it away.  Otherwise we can evaluate it immediately as we would with cbv.  If we never throw anything away we get cbv.  If we throw away too much, our program gets stuck and can't find a variable binding it needs.  If we throw away exactly the computations we didn't need, then our program does exactly the same work it would have done under cbn.

We can look at this another way — if we have an oracle that tells us whether we're going to need an expression somewhere down the line, we can just throw away expressions the oracle says are unnecessary, and immediately evaluate expressions the oracle thinks we'll want.

Now if we're just trying to count evaluation steps, we can do it easily in this framework – the cbn evaluation steps are the minimum number we need for any evaluation that doesn't get stuck.  No need to simulate all the mechanisms of laziness.  The authors show, in particular, that it makes it much easier to show that Common Subexpression Elimination is actually an optimization.

Non-strictness, or why I care

One obvious follow-on to all this is to observe that there's a whole set of strategies in the middle, where our oracle gives more or less accurate approximations to whether we'll need the value of an expression.  This is where I lived for much of grad school — the Id and pH languages evaluated expressions in parallel, and Eager Haskell used cbv techniques for a while but would occasionally give up and fall back to lazy evaluation.  Eager Haskell lived in a world that was semantically equivalent to cbn, but which would do "too much" work (a very large constant factor too much in the worst case).

Interestingly, while I thought this would perfectly capture the non-strict eager semantics of Id and pH as other points on the oracle's continuum, that isn't obviously true – non-strict parallel execution lets us express some cyclic dependencies that we can exploit in a lazy program, but that would deadlock if we used cbv-style execution.  In that respect, the ordering constraints in the face of mutual recursion covered later in the paper are relevant.  In Id and pH it looks like we might have to interleave partial reduction in those mutual recursions, which is exactly what we'd like to avoid.

Is there a similarly simple semantics that captures the computational complexity of non-strict eagerness without simulating its run-time behavior?  I'd love to see one if it exists.

Strictness Analysis

One of the most important optimizations in a compiler for a lazy language like Haskell is strictness analysis.  The job of the strictness analyzer is to identify a subset of expressions in the program that are guaranteed to be evaluated.  These are then transformed to use cbv rather than cbn.  So we can think of the strictness analyzer as an approximation of our oracle.  There are a couple of key differences:
  • The strictness analyzer operates on the source program.  If a function f is called 100 times, and it uses its first argument in only 99 of the calls, we have to conservatively say the argument is not guaranteed to be evaluated.  By contrast our oracle gives its answers as we evaluate, so it answers the question correctly all 100 times.
  • Even given that constraint, giving an accurate approximation is equivalent to the halting problem (as with most static analysis) – so a strictness analyzer also ends up being conservative because it can't find an accurate answer.
  • As a result of being conservative, strictness analysis can never throw out expressions; the compiler has to use cbn evaluation on the expressions it can't figure out.
We can imagine a dual analysis that identifies when an expression is never used.  The compiler could replace the computation with any other computation it wanted to (for example, undefined in Haskell).  This is closely related to dead-code elimination; as it turns out, just getting rid of every trace of the expression (as dead-code elimination does) is far more effective and there just aren't very many cases left after that, so my understanding is it's generally not thought to be worth it.

Wednesday, January 8, 2020

You Can Sort Dancing Links

From Knuth, Fig 4
I've long been fascinated by the Dancing Links algorithm, popularized by Donald Knuth (who's careful as always to attribute it to Hitotumatu and Noshita). It organizes a search problem into a 0/1 matrix with the goal of selecting rows that cover each column exactly once. The key insight is to represent each 1 in the matrix by a node that's part of two circular doubly linked lists – one a list of all the 1 nodes in its row, the other of all the 1 nodes in its column. To add a row to a potential solution, remove that row along with every other row that has a 1 in any column where the chosen row has a 1. This just requires splicing the nodes out of the horizontal and vertical doubly linked lists. When backtracking, splice the nodes back in again in the reverse order they were removed. If you leave the next and prev pointers in a spliced-out node alone, it's easy and fast to splice it back exactly where it came from.
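
The core "dancing" step is easiest to see in code. Here's a small sketch using indices into a node arena instead of raw pointers (the names are mine, not Knuth's); the vertical up/down lists work exactly the same way.

// Each 1 in the matrix is a node on a circular row list and a circular column list.
struct Node {
    left: usize,
    right: usize,
    up: usize,
    down: usize,
}

// Splice node `i` out of its horizontal list. Node `i` keeps its own left/right
// fields, so it remembers exactly where it came from.
fn unlink_lr(nodes: &mut [Node], i: usize) {
    let (l, r) = (nodes[i].left, nodes[i].right);
    nodes[l].right = r;
    nodes[r].left = l;
}

// Undo `unlink_lr`. Because node `i` still points at its old neighbours, splicing
// it back in is just two writes; restores must happen in reverse order of removal.
fn relink_lr(nodes: &mut [Node], i: usize) {
    let (l, r) = (nodes[i].left, nodes[i].right);
    nodes[l].right = i;
    nodes[r].left = i;
}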

I'm interested in a slight variation of the problem, where we have a set of n candidates that we want to assign to m roles and n ≤ m. We assign a goodness score to each role for each candidate. We'd like to find an assignment of candidates to roles that maximizes the sum of the goodness scores.

We can use a simplified variant of the dancing links representation, where rows are roles, columns are candidates, and we assign a candidate to a role by splicing out the candidate column and the role row. We label each node with its goodness score. We also label each row and column with its maximum goodness, updating this as we splice out nodes. This gives us an instantaneous upper bound on the possible goodness score of a solution being tried. We discard a solution if this bound falls below the best solution seen so far.

How do we decide what row and column to choose, and how do we quickly update goodness scores? This is the key idea:
        Sort the doubly linked lists for each row and each column.
It turns out we don't actually care what order the list elements occur in when we dance the links – we only care that entry i,j belongs to row list i and column list j, but we don't care (for example) if its predecessors in the row list belong to entries <j.

We can start our search with a greedy approach to assigning candidates to roles, by simply assigning each candidate in turn to the first (highest scoring) role remaining in its column. Search can then backtrack and try to improve on this solution. Note that we should always assign the last candidate to the best available role, but otherwise we can choose to assign candidates to gradually less ideal roles in the hopes of finding a better overall match. We can keep the candidates in a doubly linked list ordered by their best role score, and update this over time if their best role is taken by someone else.

Some candidates are not suited to a given role. We can imagine their goodness score is -∞, but it's easier just to leave nodes out of the matrix. Per Knuth, we should attempt to assign a candidate with the minimal number of available roles at each step to keep the search tree from getting too bushy – so really we should be ordering candidates first by number of available roles, then by maximum role value.

If n = m (which can happen along the way even if it's not initially true) the problem becomes symmetrical and we can instead choose to assign a role to a candidate. In particular, if a role has fewer candidates available than any candidate has roles, we should assign a candidate to that role rather than vice versa. Again we can sort the roles by available candidates and maximum score.

Finally, we can prune entries from the matrix early. Every time the upper bound shrinks, we can remove entries that, if chosen, would drive the upper bound below the best solution found so far. We can do this by walking the sorted doubly linked lists in reverse, removing entries until we reach an entry that's still viable.

We're done as soon as every candidate has a role. If we ever find that a candidate has no available roles, we immediately backtrack. Similarly if n > m we know that some candidate will eventually run out of roles and we can immediately backtrack.

It's not hard to generalize this to a version where roles can accept between p and q candidates. It's trickier for me to see how to do this in the original dancing links without redundancy, though I'm sure it's equally easy in practice.