Monday, January 3, 2022

Mucking About with Linear Probing

Back in November, my colleague Bradley Kuszmaul sent out a link to some work he did with his son William and with Michael Bender.  The paper is titled "Linear Probing Revisited: Tombstones Mark the Death of Primary Clustering," and still seems to be in preprint / under submission.  I'll refer to the authors and their paper as BKK here.  The basic idea is classic Mike Bender stuff – handle quadratic insertion bottlenecks by adding fake "tombstone" entries that get skipped over during lookups but make space for new entries to be inserted.  I was interested in what this looked like in practice and wrote some code to check it out.

A Bit of Background

I won't attempt to answer the question "What is a hash table" here, except to highlight the stuff most intro to computer science students know – it allows you to keep a key/value dictionary and expect to look up a key (or find it's missing) in constant time (O(1)), by computing a hash of the key (which turns the key into a random-looking integer) and using that hash to look up the entry in an array.  There are a lot of ways to handle the details here; the most important problem is that keys can end up with the same hash value, which will happen if you have more possible keys than table entries, and we need a strategy for handling the resulting collisions.  Linear probing means we store all the elements of the table in the table itself (this is called open addressing, as opposed to chaining) and that when a collision occurs we search for the next available empty slot and put data there.
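
To make that concrete, here's a minimal sketch of plain linear probing in Rust (illustrative code for this post, not taken from any real implementation), with a toy hash function, no resizing, and no deletion:

    const TABLE_SIZE: usize = 16;

    // A tiny open-addressing table of u64 keys; None marks an empty slot.
    struct Table {
        slots: [Option<u64>; TABLE_SIZE],
    }

    impl Table {
        fn new() -> Self {
            Table { slots: [None; TABLE_SIZE] }
        }

        // A stand-in "hash": real code would mix the bits properly.
        fn bucket(key: u64) -> usize {
            (key as usize) % TABLE_SIZE
        }

        // Probe forward (wrapping) from the home bucket until we find the key
        // or an empty slot.  Assumes the table never becomes completely full.
        fn insert(&mut self, key: u64) {
            let mut i = Self::bucket(key);
            loop {
                match self.slots[i] {
                    None => { self.slots[i] = Some(key); return; }
                    Some(k) if k == key => return,        // already present
                    Some(_) => i = (i + 1) % TABLE_SIZE,  // collision: keep probing
                }
            }
        }

        // Lookup follows the same probe sequence and stops at the first empty slot.
        fn contains(&self, key: u64) -> bool {
            let mut i = Self::bucket(key);
            loop {
                match self.slots[i] {
                    None => return false,
                    Some(k) if k == key => return true,
                    Some(_) => i = (i + 1) % TABLE_SIZE,
                }
            }
        }
    }

    fn main() {
        let mut t = Table::new();
        t.insert(42);
        t.insert(42 + 16);       // collides with 42 under this toy hash
        assert!(t.contains(42) && t.contains(58));
        assert!(!t.contains(7));
    }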

Open addressing schemes require keeping some empty table entries around; if you don't have enough of them, you end up searching through most of the table to find the next available entry, and that takes O(n) time rather than O(1) time.  Worse still, if you don't have some clever way of cutting off searches, looking up a missing key can also take O(n) time!  Because of its simplicity, linear probing is more sensitive to collisions than most competing schemes.  Even if your hash function is good, you can end up with collisions on nearby buckets piling up and forming clusters.  We like linear probing anyway because it's simple to implement and it makes it easy for the memory system in a modern computer to predict what memory we're likely to touch when we perform hashtable operations.  It turns out we can predict how bad clusters are expected to get.  Unsurprisingly it depends on how much empty space is left in the table.

We usually state this in terms of the table's load factor 𝛼.  Suppose our n-slot table contains k full entries and j empty slots, so n = k + j.  Then the load factor is the proportion of full slots: 𝛼 = k / n.  BKK instead talk in terms of x = 1 / (1 - 𝛼).  That means x = n / (n - k) = n / j.  So 1/x is the proportion of empty slots, and x starts at 1 and rises rapidly as the table gets full.
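
For example, at 𝛼 = 0.9 the table is 90% full, so x = 1 / (1 - 0.9) = 10 and one slot in ten is empty; push the load factor to 𝛼 = 0.99 and x jumps to 100.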

This reflects the fact that it gets harder and harder to find empty space as the amount of empty space drops to nothing.

It's not too hard to resolve clustering for lookups – simply make sure that keys are stored in the table in hash order, so that we can stop searching for a key the moment we find a key with a bigger hash (and we know we can skip an entry if the hash is smaller).  That makes all lookups, including for missing keys, take expected time linear in x (O(1+x)).  Note "expected" – we can get unlucky (or choose a bad hash function) and do much worse than this, but if our hash function is any good we'll usually be OK.

The key ordering trick doesn't work for insertion – we can find the place where a new key belongs exactly as described above for lookup, but we still need to move existing entries over until we reach an empty slot.  Insertion instead takes expected time O(1 + x²), which is quite a bit worse – see the red line in the picture above, which eventually saturates, saying roughly "at this point we can expect to have to look through a large proportion of the table to find an empty slot".
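
Here's a rough Rust sketch of the ordered scheme from the last two paragraphs (again just an illustration, not the BKK code).  The home bucket is the top few bits of the key, so keeping each run in increasing key order also keeps it in hash order; lookups stop early at the first larger key, and insertion shifts the tail of the run over by one slot.  For simplicity it pads the end of the table rather than wrapping around, and assumes runs never walk off that padding:

    const TABLE_BITS: u32 = 4;              // 16 home buckets is plenty for a sketch
    const TABLE_SIZE: usize = 1 << TABLE_BITS;
    const PAD: usize = 8;                   // slack at the end instead of wrapping around

    struct OrderedTable {
        slots: Vec<Option<u64>>,            // keys kept in increasing order along each run
    }

    impl OrderedTable {
        fn new() -> Self {
            OrderedTable { slots: vec![None; TABLE_SIZE + PAD] }
        }

        // The home bucket is the top TABLE_BITS of the key, so sorting a run by key
        // automatically sorts it by home bucket too.
        fn bucket(key: u64) -> usize {
            (key >> (64 - TABLE_BITS)) as usize
        }

        // Lookup stops as soon as it sees an empty slot *or* a key bigger than the
        // one it wants -- that's what keeps missing-key lookups cheap.
        fn contains(&self, key: u64) -> bool {
            let mut i = Self::bucket(key);
            loop {
                match self.slots[i] {
                    None => return false,
                    Some(k) if k == key => return true,
                    Some(k) if k > key => return false, // passed where it would have to be
                    _ => i += 1,
                }
            }
        }

        // Insertion finds the same spot, but then shifts the rest of the run over by
        // one slot until it reaches an empty slot -- this is the expensive O(1 + x²) part.
        fn insert(&mut self, key: u64) {
            let mut i = Self::bucket(key);
            while let Some(k) = self.slots[i] {
                if k == key { return; }     // already present
                if k > key { break; }       // this is where the new key belongs
                i += 1;
            }
            let mut carried = Some(key);
            while carried.is_some() {
                carried = std::mem::replace(&mut self.slots[i], carried);
                i += 1;
            }
        }
    }

    fn main() {
        let mut t = OrderedTable::new();
        // All three keys share home bucket 3, so they form one ordered run.
        for key in [0x3000_0000_0000_0000u64, 0x3100_0000_0000_0000, 0x3080_0000_0000_0000] {
            t.insert(key);
        }
        assert!(t.contains(0x3080_0000_0000_0000));
        assert!(!t.contains(0x3040_0000_0000_0000)); // stops early at 0x3080...
    }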

We care about this because those j unused entries are wasted space.  The higher the load factor we can support, especially as tables grow ever larger, the more economically we can use memory.  This has knock-on advantages, such as better use of CPU caches.


Deletion


We didn't talk about deletion above, but there are two approaches in common use.  First is simply to remove the entry and shuffle later entries backwards to fill the hole until we find a place where we can leave a hole (an entry whose hash already matches the slot in the table where it resides, or an entry that's already empty).  That takes about as long as insertion.

We could instead just mark the entry deleted, leaving a so-called tombstone.  We'll skip it when we do lookups (slowing down lookups a teensy bit); it can be treated as empty space during subsequent insertions, or we can periodically sweep the table and remove all the tombstones if too many of them build up.
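
Here's a minimal sketch of the tombstone flavor (plain unordered probing this time, to keep it short, and again just an illustration): each slot is empty, full, or a tombstone; deleting overwrites the entry with a tombstone, and lookup steps over tombstones instead of stopping at them:

    const TABLE_SIZE: usize = 16;

    #[derive(Clone, Copy)]
    enum Slot {
        Empty,
        Full(u64),
        Tombstone,  // a deleted entry: skipped by lookups, reusable by insertions
    }

    struct TombstoneTable {
        slots: [Slot; TABLE_SIZE],
    }

    impl TombstoneTable {
        fn new() -> Self {
            TombstoneTable { slots: [Slot::Empty; TABLE_SIZE] }
        }

        fn bucket(key: u64) -> usize {
            (key as usize) % TABLE_SIZE
        }

        // Deletion is just a marker write: no shifting of later entries.
        // (Assumes the table always keeps at least one truly empty slot.)
        fn delete(&mut self, key: u64) {
            let mut i = Self::bucket(key);
            loop {
                match self.slots[i] {
                    Slot::Empty => return,  // key wasn't present
                    Slot::Full(k) if k == key => {
                        self.slots[i] = Slot::Tombstone;
                        return;
                    }
                    _ => i = (i + 1) % TABLE_SIZE,  // step past tombstones and other keys
                }
            }
        }

        // Lookup only stops at truly empty slots; tombstones are stepped over.
        fn contains(&self, key: u64) -> bool {
            let mut i = Self::bucket(key);
            loop {
                match self.slots[i] {
                    Slot::Empty => return false,
                    Slot::Full(k) if k == key => return true,
                    _ => i = (i + 1) % TABLE_SIZE,
                }
            }
        }
    }

An insertion can remember the first tombstone it passes and reuse that slot once it's sure the key isn't already present further along the run.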

Key Insight: To Make Insertion Fast, Make Sure Every Cluster Contains Enough Regularly-Spaced Tombstones


A tombstone sitting in the middle of a cluster creates space for insertions earlier in that cluster.  We can insert a tombstone directly into the table, just as if we'd inserted an element and then immediately deleted it.  It obeys the same key ordering constraints, and will be ignored by lookups (as a result, it'll make lookups a little bit more expensive).  Now there's somewhere nearby for insertions to expand into.  The goal is to trade off a little bit of lookup time for a lot of insertion time.

Graveyard Hashing


BKK suggest a strategy (given a fixed target x) that they call graveyard hashing (there's a small sketch of the bookkeeping just after this list):
  • Rebuild the table every n / 4x updates (insertions or deletions).
  • To rebuild, first remove all existing tombstones.  Then add n / 2x regularly-spaced tombstones.
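
Here's a tiny sketch of just the bookkeeping these two rules imply (illustrative Rust, not code from the paper): rebuild every n / 4x updates, and at each rebuild lay down n / 2x tombstones, which works out to one tombstone every 2x slots:

    // Rebuild cadence and tombstone placement implied by the graveyard rules,
    // for a table of n slots at a fixed target x = 1 / (1 - load factor).
    fn rebuild_interval(n: usize, x: usize) -> usize {
        n / (4 * x) // updates allowed between rebuilds
    }

    fn tombstone_count(n: usize, x: usize) -> usize {
        n / (2 * x) // tombstones laid down at each rebuild...
    }

    fn tombstone_spacing(x: usize) -> usize {
        2 * x // ...which works out to one tombstone every 2x slots
    }

    fn main() {
        let n = 1024;
        for x in [2usize, 4, 8, 16, 32] {
            println!(
                "x = {:2}: rebuild every {:3} updates, {:3} tombstones, one every {:2} slots",
                x,
                rebuild_interval(n, x),
                tombstone_count(n, x),
                tombstone_spacing(x)
            );
        }
    }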

Checking it out


What's this look like in practice?  I wrote some Rust code (https://github.com/jmaessen/bkk-hash) to check it out.  I use a fixed-size table with 1024 entries, and into it I insert random 64-bit integers.  I use the uppermost 10 bits as the hash index.  The remaining bits are used to arbitrarily order entries within a bucket (this isn't strictly necessary, but you do want keys with more than 10 bits so that collision behavior is interesting).  Since I only care about collision behavior, I don't need to store values.  What I do store is a histogram of the lookup cost (number of elements I probe minus 1) of all elements as the table fills.  I also store the insertion cost (again, number of elements touched minus 1) for each element I insert.  I repeat this 10,000 times, each time starting with an empty table and a fresh random number sequence.  I then aggregate the histograms and dump them along with average and standard deviation of insertion and lookup costs.  Note that I never actually delete elements!
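
For concreteness, the bucket extraction just described looks something like this (a sketch of the idea, not the actual code in the repo):

    // Take the top 10 bits of the 64-bit key as the home bucket in a 1024-slot table;
    // the remaining 54 bits only matter for ordering keys that share a bucket.
    const TABLE_BITS: u32 = 10;
    const TABLE_SIZE: usize = 1 << TABLE_BITS; // 1024

    fn bucket(key: u64) -> usize {
        (key >> (64 - TABLE_BITS)) as usize
    }

    fn main() {
        let key = 0xDEAD_BEEF_0000_0000u64;
        println!("key {:#018x} lands in bucket {}", key, bucket(key));
        assert!(bucket(key) < TABLE_SIZE);
    }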

Graveyard Hashing talks about steady-state behavior.  To get a feel for behavior as the table fills, I rebuild whenever more than n / 4x insertions have accumulated since the last rebuild.  At each rebuild I remove all previously-added tombstones and insert n / 2x new regularly-spaced tombstones (one every 2x slots).  If a tombstone would fall in an empty slot, I simply leave the slot empty (this may reduce lookup cost and has no effect on insertion cost).

Lookup (probing) behavior


So, what does this look like in practice?  Here's average lookup probe length versus k:

Note the stairstep behavior for BKK: when we rebuild and insert tombstones, the probe cost goes up a bit compared to the standard (std) linear probing strategy.  Our average lookup cost is pretty reasonable, even as the table becomes completely full.  Note that the standard deviation is really large in both cases – this is a heavy-tailed distribution.  This is unsurprising – we can't look an element up in -1 probes, so all the growth is at the high end of the curve.

This is the same graph as above, scaled by x.  We can see that the expected case for standard probing is bounded by x / 2, which is a tighter bound from Knuth, and that BKK roughly doubles this probe cost in the worst case.  Note how the curve falls off quickly near n: x grows quickly, while the observed average probe length doesn't actually grow to O(n).
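
(For reference, Knuth's classical estimates for linear probing are roughly (1 + x) / 2 probes for a successful lookup and (1 + x²) / 2 for an insertion; since the costs plotted here subtract off the first probe, x / 2 and x² / 2 are the shapes to expect.)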




If we look at the distribution of probe lengths on a log scale, we can see the effect of the heavy-tailed distribution, which shows up as almost-straight lines in log space.  Here a line like "std probe_len 301" corresponds to the distribution of probe sizes for the standard algorithm in a hash table with 301 elements.  The corresponding distributions for BKK and standard algorithms have similar colors (apologies if Google Sheets doesn't do a great job of color-blind accessibility with this many lines).  By far the bulk of data ends up over on the left for all distributions, indicating most elements can be found with a short probe of 0 or 1 elements.  Note that BKK and standard probe length histograms overlap in the middle of the distribution.  This is because of the 2x probe length overhead we saw in the previous graph.  Graveyard hashing probably only makes sense if our load is insertion / deletion-heavy.  If we're going to fill a table and then start doing lookups, it may make sense to squeeze out all the tombstones.  Interestingly, hash table APIs rarely provide an operation to do this, and APIs such as unordered_map in C++ and HashMap in Rust promise not to mutate the table during a lookup operation (so we can't remove tombstones lazily in response to a burst of lookups).

Insertion behavior


By contrast, insertion behavior is quite a bit worse (as we'd expect from the quadratic big-O factor):

It's hard to see the difference between this graph and the one above until you look at the scale on the y-axis – the lines for lookup topped out around 21, whereas the lines for insertion reach all the way to 512 or so!  On average we're looking at half the elements when the hash table is practically full, which is what we'd expect.  But note that the jagged bkk line is below the line for the standard algorithm.  It looks like bkk is doing its job and keeping insertion cost lower.  But is it doing so in an amortized (big O) way?

It is.  Note how insertion time scaled by x starts to grow quickly as things fill up, whereas we see steady growth to about 2 when we use bkk.  Note that the bkk insertion cost slips a tiny bit above the standard insertion cost around k = 625, but is otherwise reliably lower.  We see that in order to get constant behavior from the standard algorithm we need to scale insertion cost by x².  Note how the standard deviations are much larger for insertion.  Even with bkk we can end up with a far more difficult worst case.

The insertion histograms are far less pretty than the corresponding histograms for lookup.  This is in part because they're noisier – we only get a single datapoint per trial for each k, rather than being able to test the k already-inserted points as we did for lookup.  So each point corresponds to 10,000 trials rather than millions of datapoints.  Nonetheless, we can also see how rapidly insertion length spirals out of control (here the horizontal axis cuts off at an insertion probe length of 200, versus 65 or so for the lookup histograms).  We can also see how much worse things get for large set sizes (even bkk does badly at 960 elements).  Finally we can see there's quite a long smattering of outliers even for the smaller curves; we can get ourselves into trouble no matter what we do.

Is Graveyard Hashing Worth It?  Probably Not.


So based on my mucking about, is graveyard hashing worth it?  Note that I didn't do any head-to-head timing in any of the above tests – that'd be invalidated by the enormous amount of recordkeeping I do to maintain the histograms.  Instead, let's consider the additional runtime complexity it adds.

For BKK my code ends up doing roughly 5 times the work for each insertion.  That's because for every i insertions we also perform 2i tombstone deletions followed by 2i tombstone insertions – i + 2i + 2i = 5i operations in total!  If we fix a particular target load factor x we will be inserting and deleting the same fixed set of tombstones (though we'll still need to handle cleanup of tombstones created by deletions).  Note also that naive cleanup of randomly-placed tombstones will cost O(n) – we'll need to scan the entire table for deleted entries if we're not careful.

Using Graveyard Hashing as Inspiration


Instead, it's probably worth taking graveyard hashing as a jumping-off point for other techniques to improve the performance of linear probing.  For example, a lot of modern hash table implementations use multi-entry buckets to speed up probing (e.g. Swiss Table, F14, and the Swiss Table-inspired Rust HashMap).  These use SIMD operations to quickly check 8-16 entries at a time for a matching hash.  Load factor can be quite high, since only a single slot in a bucket needs to be empty to stop probing or create space for insertion.  That still leaves open the collision strategy when a bucket does fill.  We can view BKK as one possibility in the space of collision strategies, one with the potential to run at high load factors while keeping insertion costs low.  We shouldn't need to insert more than a tombstone per filled bucket.

We can go in a different direction and consider that graveyard hashing is just trying to solve the problem of breaking up large clusters into smaller clusters to reduce insertion cost.  We can imagine doing a lazy version of this by stopping insertion after O(x) steps; at this point we instead search backwards for the start of a cluster (the first empty or tombstone element before the insertion point), and systematically insert tombstones every O(x) elements until such a tombstone coincides with an existing tombstone or an empty slot.  We can then re-try insertion secure in the knowledge that it will succeed.


Side Note on Sorting with Gaps


I mentioned that this paper was "classic Mike Bender stuff".  One example of Mike's past work is library sort, which insertion sorts in O(n lg n) rather than O(n²) by leaving gaps in the set to be sorted.  If we end up with a run without enough gaps, we insert gaps until there are enough gaps again.  If you squint at my hash table code, you realize it's actually sorting 64-bit numbers in O(n) time, again by leaving gaps!  This is actually better than radix sort can manage (since radix sort cares about the number of bits in my keys as well).  How can this be?  Well, it's because we know the numbers we're sorting are uniformly distributed, and so the hash code (the upper bits) provide a good estimate of where each will end up in the sorted sequence.  So we can basically bucket sort them in a single pass, using insertion sort within each bucket.  That's exactly what the hash table implementation ends up doing.