Tuesday, December 21, 2021

Ideas you should understand: Using Second-Order Types for Safe Mutation

Hiding state with Phantom Type Parameters


In this post we'll take a brief look at new work on hiding state manipulations using phantom type parameters and second-order quantification.  This has become a workhorse of type-based safety over the last 30 years or so, and it's worth understanding at least the basics if you want to use a safe language like Rust or Haskell.

Haskell (which I think of as my Home Programming Language) is purely functional, but sometimes it's convenient (or even asymptotically faster) to use an imperative algorithm with mutable state.  How can we delimit side effects safely?  There are two problems: First, making sure we don't store objects of different types into the same mutable storage.  For this we simply make sure our mutable storage is typed.  More difficult is preventing data races and delimiting the scope of side effects (so that we can, for example, use mutation to compute a result and then access it in a read-only fashion when we're done).  For this we can use local universal quantification over a phantom type.  The basic idea is most popularly embodied in the Haskell ST library which allows us to (for example) create an array and fill it in using side effects, then obtain an immutable result:

fibArray n = runSTArray $ do
  r <- newArray_ (0, n)
  writeArray r 0 0
  writeArray r 1 1
  forM_ [2..n] $ \i -> do
    i1 <- readArray r (i - 1)
    i2 <- readArray r (i - 2)
    writeArray r i (i1 + i2)
  return r

This relies on an array type STArray s i v, where i is the index type and v is the value type, but s is a phantom type variable that is never visibly instantiated with anything.  To provide isolation, runSTArray has the type (∀s. ST s (STArray s i v)) → Array i v which we can read as say "Takes a (monadic) computation manipulating mutable state that results in a mutable array, and returns an immutable array".  The key idea is that ∀s at the front, which states that the computation must work for any s, and in particular we don't get to know and aren't allowed to depend upon what particular type s might get plumbed in under the covers.  For example, imagine we pass back an array of pointers to a mutable reference:

broken :: Array Int (STRef s Int)
broken = runSTArray (newArray (1, 10) =<< newSTRef 0)

This gets rejected by the compiler, because the s that escapes is no longer arbitrary – it can be chosen by code elsewhere in the program.

This idea was floating in the air in the early 90s; Shail Aditya used hidden type variables in the Id compiler to accomplish a similar thing in his Ph.D. thesis, and the fx region system had similar ideas, but the presentation here is from Launchbury and Peyton Jones 93 and is the first time I recall seeing the idea represented at the language level using explicit second-order quantification.  It was also the first deep use of second-order quantified types that I (and many others) ever saw, and is at the heart of a lot of subsequent work that plays variations upon the same theme.


GhostCell: Separating Permissions from Data


This leads me up to the work the Rust Belt project has been doing, and their recent GhostCell paper.  In Rust, we have a system of lifetime parameters that control object lifespan and allow safe mutation without interference.  The problem is that, in order to get permission to mutate an object, we need to get our hands on a unique pointer to that object; no other references to it may exist anywhere (we can borrow this reference from another unique pointer so long as we return it when we're done).  This means every object must have a single unique owner.  This classically runs into trouble the moment we try to implement (say) a doubly-linked list: A cell at rest in the middle of a list has at least two references to it, the prev pointer of the next element and the next pointer of the previous element.  How can we safely create and manipulate a doubly linked list in Rust?  The RustBelt folks use references to a special kind of cell, a GhostCell:

pub struct Node<'brand, T> {
    data: T,
    prev: Option<NodeRef<'brand, T>>,
    next: Option<NodeRef<'brand, T>>,
}
pub type NodeRef<'brand, T> = &GhostCell<'brand, Node<'brand, T>>;

pub fn insert_next(
    node1: NodeRef<'brand, T>,
    node2: NodeRef<'brand, T>,
    token: &mut GhostToken<'brand>,
) {
    Self::remove(node2, token);
    let node1_old_next : Option<NodeRef<_>> = node1.borrow(token).next;
    if let Some(node1_old_next) = node1_old_next {
        node1_old_next.borrow_mut(token).prev = Some(node2);
    }
    node1.borrow_mut(token).next = Some(node2);
    let node2: &mut Node<_> = node2.borrow_mut(token);
    node2.prev = Some(node1);
    node2.next = node1_old_next;
}

Now the only way to access the contents of a GhostCell is using a GhostToken, and that's where the safety comes from – the GhostToken obeys the same restrictive lifetime and sharing rules as regular Rust objects, but if we get our hands on a mutable GhostToken<'brand> we can mutate GhostCell<'brand, T>s to our heart's content.  How do we make sure that there's one and only one token with the right brand?  It's the same second-order universal type quantification, updated for Rust:

impl<'brand> GhostToken<'brand> {
    pub fn new<R>(f: for<'new_brand> FnOnce(GhostToken<'new_brand>) -> R) -> R
}

So: f receives a GhostToken<'new_brand>, and within f we can create and use GhostCell<'new_brand, T>s, so long as we throw them away (or freeze them into regular Rust objects) before we leave f.

The token is a 0-sized object that gets inlined away, so in practice this costs us (almost) nothing.


The Hitch: Object Lifetimes


The hitch with a GhostCell is this: how do we manage the lifetimes of our linked list elements?  For an ordinary single-linked list, Rust lets us splice elements in and out and handles the deallocation for us.  But in this world of GhostCells, the compiler can't tell how many references exist to a given cell, it just knows that they've all got the same 'brand and will be dead by the time the GhostToken is dead.

As a result, we've got to manage GhostCell storage differently from the storage of other Rust objects.  One simple thing we can do (and this is done in the actual example code accompanying the paper) is to use arena allocation – just let all the objects sit around until the GhostToken dies, then clean them all up in one go.  This works well if we don't allocate a lot of junk along the way, but it may work badly for a long-lived data structure with a lot of churn (like, say, an LRU cache).

Otherwise, we're going to have to use some kind of garbage collection.  The repo for the paper shows an example using Arc (reference counting), where the prev pointers are weak (don't count for object lifetime, to avoid reference cycles).  But Arc has a lot of space overhead – two 64-bit counters for every linked list cell.

Another intriguing possibility is to manage the pair of references using static reference counting.  This uses numeric type parameters to describe fractions of an object reference; you can mutate an object if you have the whole thing, or you can split it into immutable pieces and put those pieces back together again later.  Combined with GhostCell you gain the ability to mutate fractional pieces as well, and the system is simply managing reference lifetimes on your behalf.  You still need to be very disciplined, though – it's possible to drop one half of a reference and end up with something you can't clean up any more.


More Work is Needed!


So there we go.  You can get unique reference combined with lifetime management, or you can get arbitrary side effects within a limited scope, but we don't have a good way to combine them (yet?).  I'm hopeful that we can find a solution that gives us the best of both worlds, but it's not obvious and certainly not easy.

One thing it'd be nice to see by the by is extending Rust to permit existentially-quantified lifetimes in results of functions, not just local universal quantification in function arguments.  This would actually behave pretty nicely from Rust's point of view: you get a lifetime that's distinct from all the others in scope, but still obeys the usual Rust rules.  Not only would this let us write code in direct style:

let token = GhostToken::new();
... use token...

It would also allow us to write functions that themselves return lifetime-scoped objects.  Finally, it would allow us to create multiple tokens with interleaved lifetimes that control disjoint pieces of memory.  This is something regular Rust lifetimes can do, but right now GhostToken::new takes a function argument and those functions have to nest neatly.  We can work around it by (say) dropping a GhostToken early, but that requires a lot of thought on our part – thought the Rust borrow checker can and should put in safely and automatically on our behalf.


(picture from Launchbury and Peyton Jones 93)