Sunday, February 16, 2020

Ideas You Should Understand: Call-by-Need is Clairvoyant Call-by-Value

I quite enjoyed this paper by Jennifer Hackett and Graham Hutton.  It's interesting that it's taken us more than 50 years of lazy programming to come to this simple idea — on reading, I immediately kicked myself and wondered why I hadn't thought of that.

The difference between call-by-need and call-by-value is conceptually simple.  In call-by-value (cbv), we evaluate function arguments before we call the function.  In call-by-need (cbn, implemented using lazy evaluation) we evaluate them when we discover they're actually required, then we remember the results in case we need them again.  What it means for a computation to be "required" can get fiddly, but it's easy to think about a function that's returning a single value.  We only compute the things we need to obtain that value.  So if we're adding together two numbers to obtain the result, we need to evaluate both numbers.  If we are doing some conditional control flow, we need to evaluate the value of the condition.  And if we're calling another function, we need to know what function to call and we need to compute the value of that function call in turn.

The key idea here isn't just what is being evaluated, but more importantly what isn't being evaluated.  If we discover we don't need something, we don't evaluate it.  As a result, cbn programs terminate more often than cbv programs – to see this, imagine calling a function that throws its argument away with an argument that loops forever.  In fact, we can prove that cbn represents a normalizing strategy for evaluation — if there's a way to obtain a result, cbn will do so.
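
To make that concrete, here's the standard two-line illustration in Haskell (the names are mine): const ignores its second argument, so under lazy evaluation the call returns immediately, while a cbv evaluator would diverge computing the argument first.

    -- 'loop' diverges if you ever evaluate it.
    loop :: Int
    loop = loop

    -- Under cbn this is just 42: 'const' never demands its second argument.
    -- Under cbv we'd evaluate 'loop' before calling 'const' and never return.
    example :: Int
    example = const 42 loop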

Efficiency

This gives us an intuition that cbn is in some sense more efficient than cbv, in that it can take a finite number of computational steps in a situation where cbv requires an infinite number.  Indeed, if we just count computational steps, cbn takes the minimum possible number of any strategy we might pick, and there are terminating computations that will always run asymptotically (big-O) faster under cbn than under cbv.
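
Here's a small Haskell sketch of the kind of asymptotic gap I mean (the function names are made up for illustration): under cbv both branches are computed at every step, giving exponential work, while cbn only forces the branch that's actually chosen, giving linear work.

    -- 'pick' chooses one of its two arguments.
    pick :: Bool -> a -> a -> a
    pick b t f = if b then t else f

    -- Under cbv, both recursive calls are evaluated before 'pick' gets to
    -- choose, so the work doubles at each list element: O(2^n).
    -- Under cbn, only the chosen branch is ever forced: O(n).
    walk :: [Bool] -> Int
    walk []       = 0
    walk (b : bs) = pick b (1 + walk bs) (walk bs)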

The dark downside to all this is that an implementation of cbn requires a bunch of extra machinery to keep track of unevaluated computations, remember results once evaluation is complete, and check whether something has been evaluated yet.  So in practice cbn evaluation imposes a heavy (but constant-factor) overhead on every computation.  It turns out that there's a similar conceptual overhead when we reason abstractly about cbn programs as well: we end up having to replicate the record keeping we do during evaluation.
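
As a rough picture of that record keeping, here's a minimal thunk simulated in IO.  This is a sketch of the idea only, not how GHC actually represents closures: every delayed computation drags around a mutable cell saying whether it has been run yet, and every use has to check it.

    import Data.IORef

    -- Sketch: either a suspended computation or its remembered result.
    newtype Thunk a = Thunk (IORef (Either (IO a) a))

    delay :: IO a -> IO (Thunk a)
    delay act = Thunk <$> newIORef (Left act)

    force :: Thunk a -> IO a
    force (Thunk ref) = do
      state <- readIORef ref
      case state of
        Right v  -> pure v                 -- already evaluated: reuse the result
        Left act -> do
          v <- act                         -- first demand: do the work...
          writeIORef ref (Right v)         -- ...and remember the answer
          pure v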

Clairvoyance

The simple insight of the paper is this: every time we bind an expression, we can non-deterministically decide to just throw it away.  Otherwise we can evaluate it immediately as we would with cbv.  If we never throw anything away we get cbv.  If we throw away too much, our program gets stuck and can't find a variable binding it needs.  If we throw away exactly the computations we didn't need, then our program does exactly the same work it would have done under cbn.

We can look at this another way — if we have an oracle that tells us whether we're going to need an expression somewhere down the line, we can just throw away expressions the oracle says are unnecessary, and immediately evaluate expressions the oracle thinks we'll want.
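
Here's a toy evaluator in that spirit, with all the names (Expr, Value, eval, the Oracle type) invented for illustration.  The paper's semantics makes the drop-or-evaluate choice nondeterministically at each let and then considers the cheapest run that doesn't get stuck; this sketch cheats by threading the oracle in as a predicate on binder names.

    -- A minimal, hypothetical sketch of clairvoyant call-by-value.
    data Expr
      = Var String
      | Lit Int
      | Add Expr Expr
      | Lam String Expr
      | App Expr Expr
      | Let String Expr Expr

    data Value = VInt Int | VClo Env String Expr
    type Env   = [(String, Value)]

    -- True: the oracle says this binding will be needed, so evaluate it now.
    -- False: throw it away.
    type Oracle = String -> Bool

    eval :: Oracle -> Env -> Expr -> Maybe Value
    eval _ env (Var x)   = lookup x env       -- Nothing = stuck: we dropped something we needed
    eval _ _   (Lit n)   = Just (VInt n)
    eval o env (Add a b) = do
      VInt m <- eval o env a
      VInt n <- eval o env b
      Just (VInt (m + n))
    eval _ env (Lam x b) = Just (VClo env x b)
    eval o env (App f a) = do
      VClo cenv x b <- eval o env f
      v <- eval o env a                       -- arguments are evaluated eagerly, cbv-style
      eval o ((x, v) : cenv) b
    eval o env (Let x rhs body)
      | o x       = do v <- eval o env rhs    -- "needed": evaluate immediately
                       eval o ((x, v) : env) body
      | otherwise = eval o env body           -- "unneeded": discard the binding entirely

An oracle that always answers True gives plain cbv; an oracle that answers correctly for the actual run does exactly the work cbn would have done, with no thunks anywhere.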

Now if we're just trying to count evaluation steps, we can do it easily in this framework – the cbn evaluation steps are the minimum number we need for any evaluation that doesn't get stuck.  There's no need to simulate all the mechanisms of laziness.  The authors show, in particular, that this makes it much easier to prove that Common Subexpression Elimination is actually an optimization.
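
For example (with expensive standing in for any costly shared subexpression; the names are mine), CSE rewrites the first definition into the second, and in the clairvoyant setting the argument that this never costs extra steps is plain step counting rather than reasoning about heaps of thunks.

    -- Stand-in for a costly shared subexpression.
    expensive :: Int -> Int
    expensive y = y * y + y

    -- Before CSE: both bindings are needed, so any non-stuck run
    -- evaluates 'expensive y' twice.
    before :: Int -> Int
    before y = let a = expensive y
                   b = expensive y
               in  a + b

    -- After CSE: one binding, one evaluation.
    after :: Int -> Int
    after y = let a = expensive y
              in  a + a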

Non-strictness, or why I care

One obvious follow-on to all this is to observe that there's a whole set of strategies in the middle, where our oracle gives more or less accurate approximations to whether we'll need the value of an expression.  This is where I lived for much of grad school — the Id and pH languages evaluated expressions in parallel, and Eager Haskell used cbv techniques for a while but would occasionally give up and fall back to lazy evaluation.  Eager Haskell lived in a world that was semantically equivalent to cbn, but which would do "too much" work (a very large constant factor too much in the worst case).

Interestingly, while I thought this would perfectly capture the non-strict eager semantics of Id and pH as other points on the oracle's continuum, that isn't obviously true – non-strict parallel execution lets us express some cyclic dependencies that we can exploit in a lazy program, but that would deadlock if we used cbv-style execution.  In that respect, the ordering constraints in the face of mutual recursion covered later in the paper are relevant.  In Id and pH it looks like we might have to interleave partial reduction in those mutual recursions, which is exactly what we'd like to avoid.

Is there a similarly simple semantics that captures the computational complexity of non-strict eagerness without simulating its run-time behavior?  I'd love to see one if it exists.

Strictness Analysis

One of the most important optimizations in a compiler for a lazy language like Haskell is strictness analysis.  The job of the strictness analyzer is to identify a subset of expressions in the program that are guaranteed to be evaluated.  These are then transformed to use cbv rather than cbn.  So we can think of the strictness analyzer as an approximation of our oracle.  There are a few key differences:
  • The strictness analyzer operates on the source program.  If a function f is called 100 times, and it uses its first argument in only 99 of the calls, we have to conservatively say the argument is not guaranteed to be evaluated.  By contrast our oracle gives its answers as we evaluate, so it answers the question correctly all 100 times.
  • Even given that constraint, computing an exact answer is equivalent to the halting problem (as with most static analysis), so a strictness analyzer also ends up being conservative in the cases it can't decide.
  • As a result of being conservative, strictness analysis can never throw out expressions; the compiler has to use cbn evaluation on the expressions it can't figure out.
We can imagine a dual analysis that identifies when an expression is never used.  The compiler could then replace the computation with anything it liked (for example, undefined in Haskell).  This is closely related to dead-code elimination; as it turns out, simply removing every trace of the expression (as dead-code elimination does) is far more effective, and there aren't many cases left after that, so my understanding is that the dual analysis is generally not thought to be worth it.
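
As a small, hypothetical example of the kind of fact a strictness analyzer can establish (f is made up, not from the paper): f forces x on every path, so x can safely be passed cbv; y is used on only one branch, so it has to stay lazy; and z is never used at all, which is the "dual" absence information described above.

    f :: Int -> Int -> Int -> Int
    f x y z
      | x > 0     = x + y    -- x is forced here (and below), y only here
      | otherwise = x        -- z is never touched on any path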

Wednesday, January 8, 2020

You Can Sort Dancing Links

[Figure: from Knuth, Fig. 4]
I've long been fascinated by the Dancing Links algorithm, popularized by Donald Knuth (who's careful as always to attribute it to Hitotumatu and Noshita). It organizes a search problem into a 0/1 matrix with the goal of selecting rows that cover each column exactly once. The key insight is to represent each 1 in the matrix by a node that's part of two circular doubly linked lists – one a list of all the 1 nodes in its row, the other of all the 1 nodes in its column. To add a row to a potential solution, remove that row along with every other row that has a 1 in a column where the chosen row also has a 1. This just requires splicing the nodes out of the horizontal and vertical doubly linked lists. When backtracking, splice the nodes back in, in the reverse of the order they were removed. If you leave the next and prev pointers in a spliced-out node alone, it's easy and fast to splice it back exactly where it came from.
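
Here's the core splice trick as a sketch in Haskell with IORefs.  Knuth's real structure threads every node through both a row list and a column list and keeps column headers with counts; this shows just one circular doubly linked list, and the names are mine.

    import Data.IORef

    -- Sketch: one node of a circular doubly linked list.
    data Node a = Node
      { payload :: a
      , prev    :: IORef (Node a)
      , next    :: IORef (Node a)
      }

    -- Splice a node out: rewire the neighbours, but deliberately leave
    -- the node's own prev/next pointers alone.
    remove :: Node a -> IO ()
    remove n = do
      p <- readIORef (prev n)
      q <- readIORef (next n)
      writeIORef (next p) q
      writeIORef (prev q) p

    -- Undo the splice.  Because n still remembers its old neighbours, this
    -- puts it back exactly where it was, as long as restores happen in
    -- reverse order of the removals.
    restore :: Node a -> IO ()
    restore n = do
      p <- readIORef (prev n)
      q <- readIORef (next n)
      writeIORef (next p) n
      writeIORef (prev q) n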

I'm interested in a slight variation of the problem, where we have a set of n candidates that we want to assign to m roles and n ≤ m. We assign a goodness score to each role for each candidate. We'd like to find an assignment of candidates to roles that maximizes the sum of the goodness scores.

We can use a simplified variant of the dancing links representation, where rows are roles, columns are candidates, and we assign a candidate to a role by splicing out the candidate's column and the role's row. We label each node with its goodness score. We also label each row and column with its maximum goodness, updating these labels as we splice out nodes. Summing the column maxima of the still-unassigned candidates (plus the scores already locked in) gives us an instantaneous upper bound on the possible goodness score of the solution being tried. We discard a partial solution if this bound falls below the score of the best solution seen so far.
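
In code the pruning test is just a comparison; here's a sketch with invented names, where columnMaxima holds the current maximum-goodness label of each unassigned candidate's column.

    -- Sketch: 'current' is the goodness accumulated by the partial
    -- assignment, 'columnMaxima' the per-column upper bounds for the
    -- candidates not yet assigned, and 'best' the score of the best
    -- complete solution found so far.
    promising :: Int -> [Int] -> Int -> Bool
    promising current columnMaxima best =
      current + sum columnMaxima > best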

How do we decide what row and column to choose, and how do we quickly update goodness scores? This is the key idea:
        Sort the doubly linked lists for each row and each column.
It turns out we don't actually care what order the list elements occur in when we dance the links – we only care that entry (i, j) belongs to row list i and column list j; we don't care, for example, whether its predecessors in the row list belong to columns less than j.

We can start our search with a greedy approach to assigning candidates to roles, by simply assigning each candidate in turn to the first (highest scoring) role remaining in its column. Search can then backtrack and try to improve on this solution. Note that we should always assign the last candidate to the best available role, but otherwise we can choose to assign candidates to gradually less ideal roles in the hopes of finding a better overall match. We can keep the candidates in a doubly linked list ordered by their best role score, and update this over time if their best role is taken by someone else.
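
A rough sketch of that greedy seed over a plain list-of-lists view of the matrix (the linked-list version walks each sorted column instead; all names here are made up):

    import Data.List (sortOn)
    import Data.Ord  (Down (..))

    type Candidate = String
    type Role      = String

    -- Give each candidate, in turn, the highest-scoring role that is
    -- still free; skip candidates with no suitable role left.
    greedySeed :: [(Candidate, [(Role, Int)])] -> [(Candidate, Role)]
    greedySeed = go []
      where
        go _     []                       = []
        go taken ((cand, options) : rest) =
          case [ r | (r, _) <- sortOn (Down . snd) options
                   , r `notElem` taken ] of
            []      -> go taken rest                    -- no suitable role left
            (r : _) -> (cand, r) : go (r : taken) rest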

Some candidates are not suited to a given role. We can imagine their goodness score is -∞, but it's easier just to leave those nodes out of the matrix. Per Knuth, we should attempt to assign the candidate with the fewest available roles at each step to keep the search tree from getting too bushy, so really we should be ordering candidates first by number of available roles, then by maximum role value.

If n = m (which can happen along the way even if it's not initially true) the problem becomes symmetrical and we can instead choose to assign a role to a candidate. In particular, if a role has fewer candidates available than any candidate has roles, we should assign a candidate to that role rather than vice versa. Again we can sort the roles by available candidates and maximum score.

Finally, we can prune entries from the matrix early. Every time the upper bound shrinks, we can remove entries that, if chosen, would drive the upper bound below the best solution found so far. We can do this by walking the sorted doubly linked lists in reverse, removing entries until we reach an entry that's still viable.

We're done as soon as every candidate has a role. If we ever find that a candidate has no available roles, we immediately backtrack. Similarly if n > m we know that some candidate will eventually run out of roles and we can immediately backtrack.

It's not hard to generalize this to a version where roles can accept between p and q candidates. It's trickier for me to see how to do this in the original dancing links without redundancy, though I'm sure it's equally easy in practice.