Sunday, June 19, 2022

Type checking, bidirectional typing, and search

Thanks to some excellent recent posts by Gabriella Gonzalez I've been taking another look at bidirectional type checking, this time with the help of Dunfield & Krishnaswamy's well-written introductory and survey papers on the subject. To all of the above authors, my most hearty thanks, as I had a chance to discuss using these techniques in anger this week with folks who know a lot more about the subject than I do. If you're curious how modern type inference works, I recommend these introductions unreservedly.

When I first learned about Hindley-Milner type inference, I was struck by a few things: first, I'd just spent a semester learning first-order mathematical logic, and the inference rules were identical to the ones I'd just learned. Second, the type inference algorithm looked just like logic programming, only without backtracking search. This bugged me for the longest time, but of course it turns out that for this style of polymorphic type checking you simply don't need search – you always get a most general type or you get a type error.

Bidirectional Typing: Dealing with more complicated type systems

As soon as the type system gets a little more complicated, though, it's easy to lose that "always the right answer" guarantee, and these days relatively few general-purpose programming languages even try to keep it. Instead, when the going gets tough, we ask programmers to add explicit type annotations (which pure Hindley-Milner typing lacks) to their programs. And it turns out that even in languages that do allow variants of Hindley-Milner inference (like 90's versions of Haskell), well-placed type annotations provide useful signposts: even when they aren't necessary for type inference, they are incredibly useful for readers.

Dunfield & Krishnaswamy describe bidirectional type checking as a way for the language designer to systematically decide where type annotations should be required. Broadly, types for some program constructs should be inferred by looking at the code itself; when the going gets tough, types can be checked based on contextual information and explicit type annotations. In their survey, they lay out some criteria for making those choices. The first three criteria together basically say that the rules should fit together to check all the programs you want to check, without requiring unnecessary typing rules.

The fourth and last criterion is the tricky one: You shouldn't need too many or overly verbose annotations, and you should be able to predict where the annotations will belong (which shouldn't change wildly as the program evolves). They give some rules of thumb:

  • Introduction rules should check against an input type.
  • Elimination rules should synthesize an output type.
  • For other rules, make use of available information as much as possible. If the preconditions / subexpressions of a rule can be formed by pulling apart an input type, make it a checking rule. If there's a subexpression whose type provides complete information about all the types in play, infer that subexpression's type and use it to check everything else and to synthesize the result type.

This all comes together in the let rule (the rule for let x = a in b), which is the most interesting rule in Hindley-Milner type inference, and frustratingly isn't treated in detail in the D&K introductory paper. In Hindley-Milner, the let rule performs generalization, allowing us to write expressions like (let id = λx. x in ...) and assign id its most general type, ∀α. α→α. This generalization step is only performed for expressions bound in let bindings. In the lambda calculus (let x = a in b) is exactly the function application ((λx.b) a), but Hindley-Milner type inference will generalize a in the former but not in the latter.
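
Here's a tiny Haskell illustration of that asymmetry (Haskell's inference is a Hindley-Milner descendant: its let generalizes, while lambda-bound variables stay monomorphic). The second definition below is the application form of the same program, left commented out because GHC rejects it:

    -- The let-bound identity is generalized to forall a. a -> a, so we can use
    -- it at two different types in the body:
    okLet :: (Bool, Int)
    okLet = let idf = \y -> y in (idf True, idf (5 :: Int))

    -- The "same" program written as a direct application is rejected, because
    -- the lambda-bound x is never generalized (uncomment to see the error):
    -- badApp = (\x -> (x True, x (5 :: Int))) (\y -> y)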

How does bidirectional typechecking handle ((λx.b) a)? For a function application, the output type and the argument type are fully determined by the type of the function being applied, so we try to infer the type of (λx.b). In the simple system in the tutorial, we require a type annotation on this function, since we can only check its type (indeed, the only places we ever need annotations are immediately surrounding a lambda like this). With a bit more extension, we can infer a type for this function without an annotation. But we can't infer a polymorphic type for a. Indeed, if b requires x to be polymorphic, type checking b will fail – we'll be using x at two different types (for example, if a were (λy. y), and we called (x 5) and (x true) in b, we'd be told that x can't take both an int and a bool as input).

So instead we treat (let x = a in b) differently, attempting to infer a type for a, then using that type for x in b. But that still leaves open a question: are we checking or inferring the type of b? If we want the minimum number of typing rules, we should pick one or the other. But neither choice is perfect: If b is something we could easily infer, and we require it to be checked, we'll be adding obvious type annotations to our program. If b can't be inferred, but we had context we could have checked against, requiring an extra annotation in our program is unfortunate.

Part of the problem here, I think, is conceiving of bidirectional type checking as two functions:

    infer :: TEnv → Expr → Either Error Type
    check :: TEnv → Expr → Type → Maybe Error

I find it clearer to think of it as a single function:

    checkInfer :: TEnv → Expr → Maybe Type → Either Error Type

That is, if the last argument is a type, we're checking; otherwise we're inferring. In either case we return the expression type, or we fail.
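
To make that concrete, here's a minimal sketch of what checkInfer might look like for a tiny monomorphic language – my own names and constructors, not D&K's system, with annotations as the only way to switch a lambda into checking mode:

    import qualified Data.Map as M

    data Type = TInt | TBool | TFun Type Type
      deriving (Eq, Show)

    data Expr
      = Var String
      | IntLit Integer
      | Lam String Expr
      | App Expr Expr
      | Ann Expr Type                 -- explicit annotation, (e : T)
      deriving Show

    type TEnv  = M.Map String Type
    type Error = String

    -- One function for both directions: Nothing means infer, Just t means check.
    checkInfer :: TEnv -> Expr -> Maybe Type -> Either Error Type
    checkInfer env expr mty = case (expr, mty) of
        -- literals and variables always synthesize
        (IntLit _, _) -> matches TInt
        (Var x, _)    -> maybe (Left ("unbound variable " ++ x)) matches (M.lookup x env)
        -- an annotation switches the annotated expression into checking mode
        (Ann e t, _)  -> checkInfer env e (Just t) >> matches t
        -- a bare lambda can only be checked, and only against a function type
        (Lam x b, Just (TFun dom cod)) ->
          checkInfer (M.insert x dom env) b (Just cod) >> Right (TFun dom cod)
        (Lam _ _, _)  -> Left "lambda needs a type annotation here"
        -- application: infer the function, then check the argument against its domain
        (App f arg, _) -> do
          tf <- checkInfer env f Nothing
          case tf of
            TFun dom cod -> checkInfer env arg (Just dom) >> matches cod
            _            -> Left "applying something that is not a function"
      where
        -- we synthesized a type; if we were asked to check, compare it first
        matches t = case mty of
          Nothing -> Right t
          Just t'
            | t == t'   -> Right t
            | otherwise -> Left ("expected " ++ show t' ++ ", got " ++ show t)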

With this signature, I propose a slightly different set of rules of thumb from those proposed by D&K:

  • Rules with no precondition (such as the rule for integer constants), or with only contextual sanity checks (such as the variable rule, which just looks up a variable in the type environment), always synthesize.
  • Rules for which one or more preconditions mention exactly the output type should work as both inference and checking rules if the remaining preconditions make that possible (it should be easy to write this down as a single bidirectional rule, where the presence of an input type says that we're checking, and otherwise we're inferring).
  • Otherwise, follow the other heuristics.

The addition of bidirectional rules is a way to keep the system compact while minimizing annotation burden, and it should still be obvious where to annotate (we shouldn't need to annotate such a bidirectional expression, but we might choose to do so – for example, we can imagine adding if to our language and propagating types bidirectionally to both the then and else branches, but if both branches would require annotation we might prefer to annotate the if expression as a whole).
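
For example, an if rule along those lines drops straight into the checkInfer sketch above (this reuses its types and is, again, only my own illustration): the condition is always checked against TBool, and the branches reuse whichever mode we were called in.

    -- Plugs into the checkInfer sketch above; `If c th el` would be a new Expr
    -- constructor, but the rule itself only needs the three subexpressions.
    checkInferIf :: TEnv -> Expr -> Expr -> Expr -> Maybe Type -> Either Error Type
    checkInferIf env c th el mty = do
      _  <- checkInfer env c (Just TBool)   -- the condition is always checked
      t1 <- checkInfer env th mty           -- then-branch: whatever mode we were called in
      -- When inferring (mty = Nothing), checking the else-branch against the
      -- then-branch's type also enforces that the two branches agree.
      checkInfer env el (Just t1)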

Or use search?

We could imagine an alternate universe where we can treat ((λx.b) a) similarly to (let x = a in b). Imagine we start by trying to infer a type for (λx.b), but fail due to the need for a type annotation. At this point we could try to make progress by instead inferring a type T for a. Now we can infer a type for b under the assumption that x:T. The tutorial introduction to bidirectional typing with inference essentially does the same thing in reverse for application, propagating the function type to operands using special rules.
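
Concretely, a backtracking version of the application case from the checkInfer sketch above might look like this (again reusing those types; this is my own illustration of the idea, not a rule from the D&K papers):

    -- A drop-in alternative for the App case above: if we can't infer a type for
    -- the function, fall back to inferring the argument and pushing its type
    -- into a literal lambda.
    appWithFallback :: TEnv -> Expr -> Expr -> Maybe Type -> Either Error Type
    appWithFallback env f arg mty =
      case checkInfer env f Nothing of
        Right (TFun dom cod) ->
          checkInfer env arg (Just dom) >> matches cod
        Right t ->
          Left ("applying something that is not a function: " ++ show t)
        Left _ -> case f of                -- backtrack: try the other direction
          Lam x b -> do
            ta <- checkInfer env arg Nothing          -- infer the argument first...
            checkInfer (M.insert x ta env) b mty      -- ...then type the body under x : ta
          _ -> Left "cannot infer a type for the function in this application"
      where
        matches t = case mty of
          Nothing -> Right t
          Just t'
            | t == t'   -> Right t
            | otherwise -> Left ("expected " ++ show t' ++ ", got " ++ show t)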

But notice that we're now into full logic programming: we try one typing rule, and if that fails we backtrack and try again with a different rule whose structure of inference and checking is completely different. Is this a good idea?

  • It lets us successfully type check more programs – or, equivalently, type check some programs with strictly fewer type annotations than originally required.
  • But it makes every function application a two-way branch point, so that the time to type check our program is exponential in the depth of function applications. We can work around this somewhat by making repeated application (f a b c d) pick a direction, but in general this concern has kept search from wide use for common language constructs like function application and let binding.
  • Search is also viewed as leading to surprising behavior, where seemingly innocuous program changes (like adding an additional line of code) lead a previously checkable program to no longer type check.

That said, there may still be a useful role for search in providing better type errors. Imagine that, having encountered an error on our primary type checking path, we try some of the alternative paths. If any of these successfully types the program, it allows us to tell the user what type annotations they can add to make their program type check.

Unspeakable types

It's worth mentioning one big problem with type annotations, and that's what I like to refer to as unspeakable types (as Nick Benton pointed out to me, they're actually called non-denotable types; I'll stick with unspeakable here as it's evocative). In a polymorphic type system like Hindley-Milner, if we're inside a polymorphic function we can't actually write down the type of any value that mentions one of the polymorphic types, unless we have some way of naming those types and talking about them. These types are unspeakable. To work around unspeakability we need a form of type annotation that not only says "This expression has thus-and-such a polymorphic type", but introduces a bunch of new types in its body. So, for example, I'd like to be able to write ((λx. (x : α)) : ∀α. α→α) and have the inner α refer to the outer polymorphic type. In most presentations (including the D&K papers) this is not the case, and it looks like the typing rules to allow it can get hairy pretty fast: you need to start distinguishing syntactic types from the actual types you're inferring, and you need to translate between the two.

Unspeakable types don't just come up due to polymorphism – in Rust, for example, the type of a particular lambda expression is unspeakable, and the best you can do is some supertype of that lambda (this is a big problem if the type of your lambda expression ends up instantiating a polymorphic type variable of another type); this is one of the big places the impl Trait notation gets used a lot, meaning roughly "a specific type I can't or don't want to speak, but it's a subtype of this type". The documentation link above talks about "types that only the compiler knows" – unspeakable.

The lesson I'd like to take away from this is "don't make types in your system unspeakable", or perhaps "never require a type annotation in a context where there might be an unspeakable type".  But in practice it isn't necessarily that easy, as Rust's example shows.

Monday, January 3, 2022

Mucking About with Linear Probing

Back in November, my colleague Bradley Kuszmaul sent out a link to some work he did with his son William and with Michael Bender.  The paper is titled "Linear Probing Revisited: Tombstones Mark the Death of Primary Clustering," and still seems to be in preprint / under submission.  I'll refer to the authors and their paper as BKK here.  The basic idea is classic Mike Bender stuff – handle quadratic insertion bottlenecks by adding fake "tombstone" entries that get skipped over during lookups but make space for new entries to be inserted.  I was interested in what this looked like in practice and wrote some code to check it out.

A Bit of Background

I won't attempt to answer the question "What is a hash table" here, except to highlight the stuff most intro to computer science students know – it allows you to keep a key/value dictionary and expect to look up a key (or find it's missing) in constant time (O(1)), by computing a hash of the key (which turns the key into a random-looking integer) and using that hash to look up the entry in an array.  There are a lot of ways to handle the details here; the most important problem is that keys can end up with the same hash value, which will happen if you have more possible keys than table entries, and we need a strategy for handling the resulting collisions.

Linear probing means we store all the elements in the table itself (this is called open addressing, as opposed to chaining) and that when a collision occurs we search forward for the next available empty slot and put the data there.

Open addressing schemes require keeping some empty table entries around; if you don't have enough of them, you end up searching through most of the table to find the next available entry, and that takes O(n) time rather than O(1) time.  Worse still, if you don't have some clever way of cutting off searches, looking up a missing key can also take O(n) time!  Because of its simplicity, linear probing is more sensitive to collisions than most competing schemes.  Even if your hash function is good, you can end up with collisions on nearby buckets piling up and forming clusters.  We like linear probing anyway because it's simple to implement and it makes it easy for the memory system in a modern computer to predict what memory we're likely to touch when we perform hashtable operations.  It turns out we can predict how bad clusters are expected to get.  Unsurprisingly it depends on how much empty space is left in the table.

We usually state this in terms of the table's load factor α.  Suppose our n-slot table contains k full entries and j empty slots, so n = k + j.  Then the load factor is the proportion of full slots: α = k / n.  BKK instead talk in terms of x = 1 / (1 - α), which means x = n / (n - k) = n / j.  So 1/x is the proportion of empty slots, and x starts at 1 and rises rapidly as the table gets full: at α = 0.9 (90% full) x is 10, and at α = 0.99 it is 100.

This reflects the fact that it gets harder and harder to find empty space as the amount of empty space drops to nothing.

It's not too hard to keep clustering from hurting lookups – simply make sure that keys are stored in the table in hash order, so that we can stop searching for a key the moment we find a stored key with a bigger hash (and we know we can skip an entry if its hash is smaller).  That makes all lookups, including lookups of missing keys, take expected time linear in x (O(1+x)).  Note "expected" – we can get unlucky (or choose a bad hash function) and do much worse than this, but if our hash function is any good we'll usually be OK.
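
Here's a small Haskell sketch of that lookup trick (a toy model of my own, not the Rust code linked below).  Each slot optionally holds a 64-bit key, and a key's top bits are its hash index – as in the experiment later in this post – so keeping keys in key order is the same as keeping them in hash order:

    import Data.Bits (shiftR)
    import Data.Word (Word64)

    tableBits :: Int
    tableBits = 10                    -- 2^10 = 1024 slots

    homeSlot :: Word64 -> Int         -- the top bits of the key are its hash index
    homeSlot k = fromIntegral (k `shiftR` (64 - tableBits))

    -- Probe forward from the home slot, counting the extra slots we touch.
    -- Because keys are kept in sorted (= hash) order, we can stop as soon as we
    -- hit an empty slot or a stored key bigger than the one we're looking for.
    -- (Wraparound at the end of the table is ignored to keep the sketch short.)
    lookupProbes :: [Maybe Word64] -> Word64 -> (Bool, Int)
    lookupProbes table k = go (drop (homeSlot k) table) 0
      where
        go [] probes = (False, probes)
        go (slot : rest) probes = case slot of
          Nothing -> (False, probes)
          Just k'
            | k' == k   -> (True, probes)
            | k' > k    -> (False, probes)      -- we've passed where it would live
            | otherwise -> go rest (probes + 1)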

The key ordering trick doesn't work for insertion – we can find the place where a new key belongs exactly as described above for lookup, but we still need to move existing entries over until we reach an empty slot.  We instead expect it to take O(1 + x²), which is quite a bit worse – see the red line in the picture above, which eventually saturates saying, roughly, "at this point we can expect to have to look through a large proportion of the table to find an empty entry".

We care about this because those j unused entries are wasted space.  The higher the load factor we can support, especially as tables grow ever larger, the more economically we can use memory.  This has knock-on advantages, such as better use of CPU caches.


Deletion


We didn't talk about deletion above, but there are two approaches in common use.  First is simply to remove the entry and shuffle later entries backwards to fill the hole until we find a place where we can leave a hole (an entry whose hash already matches the slot in the table where it resides, or an entry that's already empty).  That takes about as long as insertion.

We could instead just mark the entry deleted, leaving a so-called tombstone.  We'll skip it when we do lookups (slowing down lookups a teensy bit); it can be treated as empty space during subsequent insertions, or we can periodically sweep the table and remove all the tombstones if too many of them build up.

Key Insight: To Make Insertion Fast, Make Sure Every Cluster Contains Enough Regularly-Spaced Tombstones


A tombstone sitting in the middle of a cluster creates space for insertions earlier in that cluster.  We can insert a tombstone directly into the table, just as if we'd inserted an element and then immediately deleted it.  It obeys the same key ordering constraints, and will be ignored by lookups (as a result, it'll make lookups a little bit more expensive).  Now there's somewhere nearby for insertions to expand into.  The goal is to trade off a little bit of lookup time for a lot of insertion time.

Graveyard Hashing


BKK suggest a strategy (given a fixed target x) that they call graveyard hashing (a small sketch of the schedule follows the list):
  • Rebuild the table every n / 4x updates (insertions or deletions).
  • To rebuild, first remove all existing tombstones.  Then add n / 2x regularly-spaced tombstones.
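
Here's a minimal sketch of that schedule for an n-slot table and a fixed target x (the function names are mine, not BKK's; the actual experiment code is the Rust linked below):

    -- Rebuild once every n/(4x) insertions or deletions.
    rebuildInterval :: Int -> Double -> Int
    rebuildInterval n x = max 1 (floor (fromIntegral n / (4 * x)))

    -- Each rebuild plants n/(2x) tombstones...
    tombstoneCount :: Int -> Double -> Int
    tombstoneCount n x = max 1 (floor (fromIntegral n / (2 * x)))

    -- ...spread evenly across the table, i.e. roughly one every 2x slots.
    tombstoneSlots :: Int -> Double -> [Int]
    tombstoneSlots n x = [ (i * n) `div` c | i <- [0 .. c - 1] ]
      where c = tombstoneCount n x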

Checking it out


What's this look like in practice?  I wrote some Rust code (https://github.com/jmaessen/bkk-hash) to check it out.  I use a fixed-size table with 1024 entries, and into it I insert random 64-bit integers.  I use the uppermost 10 bits as the hash index.  The remaining bits are used to arbitrarily order entries within a bucket (this isn't strictly necessary, but you do want keys with more than 10 bits so that collision behavior is interesting).  Since I only care about collision behavior, I don't need to store values.  What I do store is a histogram of the lookup cost (number of elements I probe minus 1) of all elements as the table fills.  I also store the insertion cost (again, number of elements touched minus 1) for each element I insert.  I repeat this 10,000 times, each time starting with an empty table and a fresh random number sequence.  I then aggregate the histograms and dump them along with average and standard deviation of insertion and lookup costs.  Note that I never actually delete elements!

Graveyard Hashing talks about steady-state behavior.  To get a feel for behavior as the table fills, I mark the thresholds where more than n / 4x insertions have occurred since the last rebuild, and rebuild there.  I remove all previously-added tombstones and insert new tombstones every n / 2x elements.  If a tombstone would fall in an empty slot, I simply leave the slot empty (this may reduce lookup cost and has no effect on insertion cost).

Lookup (probing) behavior


So, what does this look like in practice?  Here's average lookup probe length versus k:

Note the stairstep behavior for BKK: when we rebuild and insert tombstones, the probe cost goes up a bit compared to the standard (std) linear probing strategy.  Our average lookup cost is pretty reasonable, even as the table becomes completely full.  Note that the standard deviation is really large in both cases – this is a heavy-tailed distribution.  This is unsurprising – we can't look an element up in -1 probes, so all the growth is at the high end of the curve.

This is the same graph as above, scaled by x.  We can see that the expected case for standard probing is bounded by x / 2, which is a tighter bound from Knuth, and that BKK roughly doubles this probe cost in the worst case.  Note how the curve falls off quickly near n: x grows quickly, while the observed average probe length doesn't actually grow to O(n).




If we look at the distribution of probe lengths on a log scale, we can see the effect of the heavy-tailed distribution, which shows up as almost-straight lines in log space.  Here a line like "std probe_len 301" corresponds to the distribution of probe sizes for the standard algorithm in a hash table with 301 elements.  The corresponding distributions for BKK and standard algorithms have similar colors (apologies if Google Sheets doesn't do a great job of color-blind accessibility with this many lines).  By far the bulk of data ends up over on the left for all distributions, indicating most elements can be found with a short probe of 0 or 1 elements.  Note that BKK and standard probe length histograms overlap in the middle of the distribution.  This is because of the 2x probe length overhead we saw in the previous graph.

Graveyard hashing probably only makes sense if our load is insertion / deletion-heavy.  If we're going to fill a table and then start doing lookups, it may make sense to squeeze out all the tombstones.  Interestingly, hash table APIs rarely provide an operation to do this, and APIs such as unordered_map in C++ and HashMap in Rust promise not to mutate the table during a lookup operation (so we can't remove tombstones lazily in response to a burst of lookups).

Insertion behavior


By contrast, insertion behavior is quite a bit worse (as we'd expect from the quadratic big-O factor):

It's hard to see the difference between this graph and the one above until you look at the scale on the y-axis – the lines for lookup went up to about 21, whereas the lines for insertion reach all the way to 512 or so!  On average we're looking at half the elements when the hash table is practically full, which is what we'd expect.  But note that the jagged bkk line is below the line for the standard algorithm.  It looks like bkk is doing its job and keeping insertion cost lower.  But is it doing so in an amortized (big O) way?

It is.  Note how insertion time scaled by x starts to grow quickly as things fill up, whereas we see steady growth to about 2 when we use bkk.  Note that the bkk insertion cost slips a tiny bit above the standard insertion cost around k = 625, but is otherwise reliably lower.  We see that in order to get constant behavior from the standard algorithm we need to scale insertion cost by x².  Note how the standard deviations are much larger for insertion.  Even with bkk we can end up with a far more difficult worst case.

The insertion histograms are far less pretty than the corresponding histograms for lookup.  This is in part because they're noisier – we only get a single datapoint per trial for each k, rather than being able to test the k already-inserted points as we did for lookup.  So each point corresponds to 10,000 trials rather than millions of datapoints.  Nonetheless, we can also see how rapidly insertion length spirals out of control (here the horizontal axis cuts off at an insertion probe length of 200, versus 65 or so for the lookup histograms).  We can also see how much worse things get for large set sizes (even bkk does badly at 960 elements).  Finally we can see there's quite a long smattering of outliers even for the smaller curves; we can get ourselves into trouble no matter what we do.

Is Graveyard Hashing Worth It?  Probably Not.


So based on my mucking about, is graveyard hashing worth it?  Note that I didn't do any head-to-head timing in any of the above tests – that'd be invalidated by the enormous amount of recordkeeping I do to maintain the histograms.  Instead, let's consider the additional runtime complexity it adds.

For BKK my code ends up paying roughly 5x the cost for each insertion.  That's because for every i operations we perform 2i tombstone deletions followed by 2i tombstone insertions!  If we fix a particular target x (i.e., a target load factor), we will be inserting and deleting the same fixed set of tombstones (though we'll still need to handle cleanup of tombstones created by deletions).  Note also that naive cleanup of randomly-placed tombstones will cost O(n) – we'll need to scan the entire table for deleted entries if we're not careful.

Using Graveyard Hashing as Inspiration


Instead, it's probably worth taking graveyard hashing as a jumping-off point for other techniques to improve the performance of linear probing.  For example, a lot of modern hash table implementations use multi-entry buckets to speed up probing (for example Swiss Table, F14, and the Swiss Table-inspired Rust HashMap).  These use SIMD operations to quickly check 8-16 entries at a time for a matching hash.  Load factor can be quite high, since only a single slot in a bucket needs to be empty to stop probing or create space for insertion.  That still leaves open the collision strategy when a bucket does fill.  We can view BKK as one possibility in the space of collision strategies, one with the potential to run at high load factors while keeping insertion costs low.  We shouldn't need to insert more than a tombstone per filled bucket.

We can go in a different direction and consider that graveyard hashing is just trying to solve the problem of breaking up large clusters into smaller clusters to reduce insertion cost.  We can imagine doing a lazy version of this by stopping insertion after O(x) steps; at this point we instead search backwards for the start of a cluster (the first empty or tombstone element before the insertion point), and systematically insert tombstones every O(x) elements until such a tombstone coincides with an existing tombstone or an empty slot.  We can then re-try insertion secure in the knowledge that it will succeed.


Side Note on Sorting with Gaps


I mentioned that this paper was "classic Mike Bender stuff".  One example of Mike's past work is library sort, which insertion sorts in O(n lg n) rather than O(n²) by leaving gaps in the set to be sorted.  If we end up with a run without enough gaps, we insert gaps until there are enough gaps again.  If you squint at my hash table code, you realize it's actually sorting 64-bit numbers in O(n) time, again by leaving gaps!  This is actually better than radix sort can manage (since radix sort cares about the number of bits in my keys as well).  How can this be?  Well, it's because we know the numbers we're sorting are uniformly distributed, and so the hash code (the upper bits) provides a good estimate of where each will end up in the sorted sequence.  So we can basically bucket sort them in a single pass, using insertion sort within each bucket.  That's exactly what the hash table implementation ends up doing.
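
To make that concrete, here's a toy Haskell version of the same idea (my own sketch, not the Rust code in the repo): group uniformly distributed 64-bit keys by their top bits and insertion-sort within each group.

    import Data.Bits (shiftR)
    import Data.List (insert, sort)
    import Data.Word (Word64)
    import qualified Data.Map.Strict as M

    -- Group keys by their top bits (the "hash index"), keeping each small group
    -- sorted via insertion sort; concatenating the groups in index order then
    -- yields the fully sorted list, because the top bits are the most
    -- significant bits of the key.
    bucketSortTopBits :: Int -> [Word64] -> [Word64]
    bucketSortTopBits bits ks = concatMap snd (M.toAscList buckets)
      where
        buckets = M.fromListWith combine [ (k `shiftR` (64 - bits), [k]) | k <- ks ]
        combine new old = foldr insert old new   -- insertion sort within a bucket

    -- Quick sanity check (not a proof of the O(n) claim): should print True.
    main :: IO ()
    main = print (bucketSortTopBits 10 sample == sort sample)
      where
        sample = [0xdeadbeefcafebabe, 42, 0x8000000000000000, 7, 42]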