Wednesday, October 19, 2011

Garbage collection for an SD/MMC-based persistent store

My Lisp/Scheme based OS is moving on apace. I've actually got a basic OS running on STM32, but it's feeling a bit cramped, and there's a new game in town - Raspberry Pi. Raspberry Pi (or, to shorten it a bit, RasPi) is a credit-card sized ARM11 based machine intended for educational use. And it's cheap. Really, really, cheap. there's a few oddities to it; it's actually based on a Broadcom SoC that's, in reality, a top whack GPU with a sorta old-hat ARM core bolted on as a bit of an afterthought. As such, it boots "oddly", GPU first and then ARM, but hey. The board itself has no onboard flash; persistent storage is SD card only (or, I guess, USB mass storage).

As could probably be guessed, it's coming with Linux on board, but hey, who cares about Linux?

Never one to make things easy on myself, I've expanded the scope of my OS. The Lisp/Scheme core is still there, the concept of syscalls as closures, all that good stuff. But I keep coming back to the Newton. Ah, my beloved NewtonOS. A totally persistent OS, one that takes power failures and years sitting in a drawer in its stride, a system where you whack new batteries in and you're back exactly where you were almost the moment you've let go of the power toggle switch. This is exactly what Stanislav is talking about in his third law of sane computing.

So. Persistence, then. How's that gonna work?

Before we start, I should note that the following technical stuff is nothing to do with wear levelling (which task is carried out quite adequately by the onboard controller of the SD card).

We have a quantity of RAM (256MB in the case of the Pi) which we want to use as a read / write cache for a larger quantity of persistent storage.  In this case, Flash memory accessed through an SD/MMC card interface.

We could take a na├»ve approach, and simply page stuff in and out, but there is a significant performance overhead, particularly with flash memory.  Not only are writes to flash slow, but you can only write to a block once, and then it needs to be erased before you write again.  And erasing is really slow.

Any given SD card is split into a number of sectors (typically between 512 bytes and 2K long).  A sector is the smallest unit we can read.  Those sectors are grouped into erase groups (typically in the 128KB - 2MB range).  An erase group is the smallest unit we can erase.  We can write to a sector, but if we try to write to it twice, the entire erase group has to be erased and the old contents (plus the new contents of our sector) written back.  This is a positively glacial operation.

If we take a sane approach, we block everything at the erase group size, and control when (and how) we write back to the card.  If we take the approach of writing data back to a different erase group, we can put off the pain of erasing the old one, and do it at our (or the card's) leisure.

Now, as we don't know exactly where an erase group will be paged into memory (and we can't restrict it to a specific location, as the card storage is potentially bigger than real memory), we have to use an indirect addressing method, using a "logical erase group number" (which can't be the actual erase group number, as that will change underneath us) and an index or address into the block of memory.  This adds some overhead to pointer dereferencing, of course, but it's mainly done using shifts and masks; dropping to assembler to do this should make it at least tolerable.

The bigger pain is garbage collection.  Lisp and Scheme are garbage collected languages, and we need a garbage collector that doesn't hurt us.  

At this point, I should probably set out some requirements for garbage collection:

  • It should reclaim as much memory as possible (duh)
  • It should not require paging in gigabytes of memory (that costs potentially enormous amounts of time).
  • It should never (if possible) force data to be written back to storage (which would be worse than paging in).
  • It should (if possible) be concurrent with application usage.
  • It should have a bounded worst case time per cycle (Java, I'm looking at you)

A traditional garbage collector is gonna hurt us plenty.  Why is that?

The way most garbage collectors work is by tracing from a group of accessible objects to find all objects in the heap that are addressable.  These addressable objects are then copied somewhere "safe", and the rest thrown away.  That's pretty much the gist of it (there are various complications, but the reader is referred to the relevant literature for further detail).

Now, that whole "copying" thing means that, for every garbage collection cycle, the erase group(s) containing the "live" objects will have to be written back to the SD card, whether the objects themselves have changed or not.  Ouch.

But surely, you say, there exists a way of collecting garbage without moving objects around? As it happens, dear reader, there does.  H.G. Baker's "Treadmill" non-copying garbage collector is a particularly good (and easy to code) example.  It comes with a certain amount of hurt itself, but it gets rid of the copying requirement, and that's a good start.

The main hurt of Baker's treadmill is the requirement to hold every object in a doubly-linked list.  For a lisp implementation where objects generally look like this:

struct cell { 
 uint32_t car;
 uint32_t cdr;

adding in the doubly-linked list produces this, which has twice the storage requirement:

struct cell { 
 struct cell * next; 
 struct cell * prev; 
 uint32_t car; 
 uint32_t cdr; 

Ouch, again.  But hey, memory's cheap, and we're doing this in order to use RAM as a window on the bigger, persistent store - not really an issue.  It should be noted that the structure above is only indicative; the links don't need to be persisted and can be kept separate of the actual cell contents.  The overhead, of course, remains the same.

The other hurt is not to do with Baker's algorithm, but is more general - GC tends to want to access all your heap.  And when your heap is feasibly orders of magnitude larger than real memory, that's gonna cause a load of thrashing.

So I started with the idea of not GCing non-swapped-in erase groups.  Do your garbage collection on what you have in memory at the time, don't try and pull in any non-swapped-in erase blocks, and leave it at that - the other erase groups will get GCed when they eventually get swapped back in.  Great, said I, coded about half of it up, and went to bed.  And then woke up in the middle of the night.
"What about objects in a swapped in erase group which are only pointed to by objects in erase groups that are not swapped in?  How the hell do I distinguish them from free objects if I don't know where they come from, or even that they are being pointed to?"
 Ah, tits!

Yep, if we're doing things in a naive way, but not swapping in every erase group as we traverse, we're gonna lose objects, and corrupt structures that are partially swapped in.

We need to get clever.

Firstly, let's stick with the idea of erase groups, and not garbage collecting the "world".  Indeed, let's not even try collecting our entire heap, we can restrict ourselves to a single erase group, or even a single sector if we like.  All we have to do is decide not to follow links outside the (relatively) little block of memory we're looking at.  We take our set of system roots, for every one that's in our block, we trace its links until we get to leaves or a link outside the block.  The rest is totally normal as per Baker's algorithm.

By restricting ourselves to a single block of memory, we can obviate the need to have a full doubly-linked list for our entire heap; all we need is a doubly linked list for the block we're currently collecting, and that can be generated trivially at the beginning of the GC.  This buggers the simplicity of allocating under Baker (which is a dereference and pointer increment), but rather than holding a full list all we need to hold is a bitmap.

Still, that hasn't solved the issue of links into our collected block.

As it turns out, the solution to that is incredibly simple, and falls back to one of the oldest forms of garbage collection - reference counting.  Along with our cell data, we carry a reference count variable, which counts the number of times a cell is referenced by cells outside the block.  This looks a bit like this:

struct cell { 
 uint32_t car; 
 uint32_t cdr; 
 uint16_t refcount;

If we create a cross-block link, we increment the reference count, if we destroy one (either by explicitly removing the link in code, or by freeing the referencing object during a garbage collection), we decrement it.  As part of building our list of root objects, we scan the block for those objects with a non-zero refcount, and push those direct into the list of roots.

Now, maintaining refcounts is a pain in the ass, but with the granularity of allocation blocks we're talking about here (between 128KB and 2MB, remember), the number of cross-block references are liable to be very small.  We can help this by preferring allocation of new objects in the block their "owner" inhabits (which will also help reduce object scattering, and thus paging issues).

So.  what does the solution look like?

Firstly, we have some basic types:

typedef struct cell {
  uint32_t car;
  uint32_t cdr;
} cell_t;

typedef struct reference_count {
  uint16_t colour:1; // indicator for gc, see Baker
  uint16_t count:15;
} reference_count_t;

Some calculations:

// Every erase group must store :
// n cells (8 bytes each), n refcounts (2 bytes each) 
// and n / 32 free cell bitmaps (4 bytes each)
#define CELLS_PER_BLOCK (erase_group_size * 8) / (64 + 16 + 1)

Therefore cell storage within an erase group looks like this (it's not actually defined like this, as CELLS_PER_BLOCK is dynamically calculated based on actual card characteristics):

cell_t            cells[CELLS_PER_BLOCK];
reference_count_t refcounts[CELLS_PER_BLOCK];
uint8_t           bitmaps[CELLS_PER_BLOCK];

The garbage collection algorithm is thus (using a single, transient, doubly linked list for CELLS_PER_BLOCK elements) :

- For each cell in the erase group
  - if cell is in set of root objects, or has a refcount != 0, add to "grey" objects
  - else add to "ecru" objects
- Do Baker's treadmill algorithm, restricting ourselves to intra-block links
- For each free object
  - set "free" indicator in bitmaps
  - if object has extra-group link(s)
    - decrement refcount(s) on externally referenced groups (may require paging)
    - reinitialize to NIL . NIL or similar
    - flag group as "dirty"

And that's about it.  Allocating involves finding a free cell in the relevant erase group (made relatively easy using ARM's CLZ operator), assignment has to check for cross-erase-group links.

Note that the only time garbage collection will require the erase group to be flushed back to storage is in the (rare) case where we free a cell that has extra-group links.  The "free" bitmap may be out of date on loading a group from flash, but it will never indicate that a non-free cell is free (allocation of a cell will update the bitmap to indicate the non-free status of that cell, and require a flush).  The bitmap will be updated on the first gc cycle for the group, at which point it should be up-to-date.

As an added bonus, as long as everything is persisted, suspending a process is simply a case of incrementing the reference counts of all its root objects and stopping the process.  Starting it again is decrementing those counts, and you're back to exactly where you were.

How d'ya like them apples?