Wednesday, December 14, 2011

Developing a multitasking OS for ARM part 3

Okay.  We're almost done with the big bits of scary assembler.  Indeed, this post is almost totally assembler free, and will deal with some C functions and definitions we need for later.

We want to do something useful with what we have at the moment.  We could simply implement "something useful" as a C language routine called c_entry(), which would run in SVC mode with interrupts off. In some cases, that would be sufficient.  But it would hardly count as an OS, let alone a multitasking one.

So, let's look at what we want to do, make some definitions.  We want tasks that run in an unprivileged mode (i.e user mode) and are either preemptively swapped out by the OS in order to run another task, or which periodically yield control of the processor to another task.  They must be able to terminate.  For the moment, we won't worry too much about protected memory spaces, IPC, or any of that jazz (which complicate matters, but which will come later).

A task, must have its own, inviolate, set of registers, and its own stack.  It must also have some other information - entry point, state, and potentially priority.  My implementation is based on scheme, so a task must also have an environment, but that's not absolutely necessary.

/* function pointer type returning void and taking a pointer to an environment */
typedef void(*task_entry_point_t)(void * environment);

/* potential task states */
typedef enum {
} task_state_t;

/* And the task itself */
typedef struct {
  void * stack_top;     /* limit of the task stack */
  void * stack_pointer; /* current stack pointer   */
  uint32_t priority:5;  /* priority, 0-31          */
  uint32_t state:1;     /* state, task_state_t     */
  uint32_t id:26;       /* task id                 */
} task_t;

Obviously, we need to know what the current task is, and have a list of other tasks that might want to run.  This is not identical to my code, as tasks are actually scheme objects, but you get the idea.

/* linked list structure */
typedef struct task_list {
  task_t * _car;
  struct task_list * _cdr;
} task_list_t;

/* some LISPy primitives */
#define CAR(x) (x)->_car
#define CDR(x) (x)->_cdr

#define SET_CAR(x,y) CAR(x)=(y)
#define SET_CDR(x,y) CDR(x)=(y)

/* We'll need these later */
#define nil  (task_list_t*)0
#define skip (task_t*)0

/* And the bits we need for the actual lists */
task_t * __current_task;
task_list_t * __priority_lists[31];

Now, the approach we'll be taking to multitasking is this:
Each task is created with a prority, and positioned as such in one of the priority lists.  Every time we need to find a task, we go through the priority lists, starting at zero, and ending at 31.  We look at each element in turn of the list by removing it from the head of the list and then grafting it onto the end of the list.  This way we round-robin schedule within each priority.  Only "runnable" tasks get scheduled, obviously.  If the task is actually the placeholder "skip", we will skip onto the next lowest priority. This way, all tasks eventually get a bite of CPU, with high priority tasks getting vastly more than the low priority ones.

Obviously, the initial setup of the lists is critical, and should be done in c_entry:

task_t * __sleep_task;

for (int i = 0; i < 31; i++) {
  __priority_lists[i] = (task_list_t *)malloc(sizeof(task_list_t));
  SET_CAR(__priority_lists[i], skip);
  SET_CDR(__priority_lists[i], nil);

__priority_lists[31] = (task_list_t *)malloc(sizeof(task_list_t));
SET_CAR(__priority_lists[31], __sleep_task);
SET_CDR(__priority_lists[31], nil);

The astute amongst you will notice the use of the C library function malloc() in there, despite not having a c library.  Don't worry about it.  It'll come later. Do worry about me not checking for errors :)

Note that priority 31 has *no* 'skip' entry, and points to a real task.  This task should not, under any circumstances, be missed out.

Now that's all done, we can find the next runnable task.

/* lispish function */
task_list_t * nconc(task_list_t * car, task_list_t * cdr) {
  if (car == nil) {
    return cdr;
  } else {
    task_list_t * x = car;
    while (CAR(x) != nil) x = CAR(x);
    SET_CDR(x, cdr);
    return car;

task_t * next_runnable_task() {
  for (int i = 0; i < 32; i++) {
    while(CAR(__priority_lists[i] != skip)) {
      /* rotate the list */
      task_list_t * car = CAR(__priority_lists[i]);
      task_list_t * cdr = CDR(__priority_lists[i]);
      SET_CDR(car, nil);
      __priority_lists[i] = nconc(cdr, car);
      /* check runnability */
      if (car->_car->state == TASK_RUNNABLE)
        return car->_car;
  /* we should never get here, but just in case, eh? */
  return __sleep_task;

Now, that's all fine and well, but what about setting up tasks and actually making them swap?  Ah.  That's a bit more complex, and we're gonna have to delve down into assembler again.  I'll get into that next time round.

Until then, though, here's a couple of little functions we needed before.

void * malloc(size_t size) {

  extern char * __heap_top;
  extern char * __memtop;
  char * prev_heap_top = __heap_top;
  if (__heap_top + size > __memtop) {
    return (void *)0;
  __heap_top += size;
  return (void *) prev_heap_top;


void sys_sleep() {
    // ARMv6 Wait For Interrupt (WFI)
    uint32_t * reg = 0;
    __asm__ __volatile__ ("MCR p15,0,%[t],c7,c0,4" :: [t] "r" (reg));

Sunday, December 11, 2011

Developing a multitasking OS for ARM part 2

Okay, kids, gather around and we'll carry on where we left off.

Now, we have the ARM booting, jumping to a reset handler, and dropping into an endless loop.  That's a pretty good start.  But really, we'd like to do something more - well - how to put this - "more".

In order to do this, we need to have a bit more understanding about how the ARM itself works.

If we go to the ARMv7AR Architecture Reference Manual (which can be had by registering at, or by downloading a hooky copy off the internets, either approach is feasible, and one at least of which is recommended), we see, in section B1 (the System Level Programmer's Model) a certain amount of interesting reading.  Forget the "privilege" aspect for the moment, and let's skip ahead to section B1.3.

We find that the processor has 8 separate operating modes.  These are:

User Mode, System Mode, Supervisor Mode, Monitor Mode, Abort Mode, Undefined Mode, IRQ Mode, and FIQ Mode

If we look back at the set of vectors we set up earlier, a lot of these "cross over".  So when we drop into the IRQ vector, we will be in IRQ mode.  FIQ, FIQ mode.  Either of the aborts, Abort mode.  Undefined instruction, undefined mode, and so on.  What's interesting is how the machine registers are shared between modes, and particularly the fact that all but system/user modes have their own stack pointers.

Now, when the ARM starts up, it is in SVC mode.  That's the way it is, and you can't change that.  And when it starts up, no stack has been defined.  So you need to be really damned careful in the first bits of the reset code.

Stacks on the ARM grow downwards, so the best thing to do generally is to put them at the top of memory.  As such, a typical reset routine will start by finding out how much memory is available, then setting up stack pointers for each of the operating modes.  We're nothing if not typical, so let's look at how we do that.

First thing - sizing memory.  On the versatile baseboard as emulated by qemu, this is easy.  We try writing to a bit of memory, then read back - if the value is set, there's memory there, if there's not, then we are above the top of memory.  It's not quite so simple on the Pi, as trying to write outside of physical RAM will cause an exception.  However, we're going to be a bit clever, and try to kill 2 birds with one stone.

Firstly, we need to set up some big fat global variables.

.global __memtop
__memtop: .word 0x00400000 /* Start checking memory from 4MB */
.global __system_ram
__system_ram: .word 0x00000000 /* System memory in MB */
.global __heap_start
__heap_start: .word __bss_end__ /* Start of the dynamic heap */
.global __heap_top
__heap_top: .word __bss_end__ /* Current end of dynamic heap */

__bss_end__ is set up by the linker, and it would be much better of me to use that for the initial value of __memtop (rounded up to the nearest megabyte) as well.  But hey, I'm lazy.  It'll come back to bite me later, I'm sure.

Now, as the Pi causes an exception on writes outside memory, we need to patch in a handler, temporarily.  Here's the handler:

/* temporary data abort handler that sets r4 to zero */
/* this will force the "normal" check to work in the */
/* case (as, I believe, on RasPi) where access 'out  */
/* of bounds' causes a page fault                    */
mov r4, #0x00000000
sub lr, lr, #0x08
movs pc, lr

Note how the comment indicates I'm not absolutely sure this will work.  This is, frankly, because I'm not sure if this will work on a real Pi, and nobody wants to let me get my hands on one.  Still, let's pretend, eh?

/* This tries to work out how much memory we have available */
/* Should work on both Pi and qemu targets */
FUNC _size_memory
/* patch in temporary fault handler */
ldr r5, =.Ldaha
ldr r5, [r5]
ldr r6, [r5]
ldr r7, =temp_abort_handler
str r7, [r5] 
DMB r12

/* Try and work out how much memory we have */
ldr r0, .Lmemtop
ldr r1, .Lmem_page_size
ldr r1, [r1]
ldr r2, .Lsystem_ram
ldr r3, [r0]
add r3, r3, #0x04
str r3, [r3] /* Try and store a value above current __memtop */
DMB r12 /* Data memory barrier, in case */
ldr r4, [r3] /* Test if it stored */
cmp r3, r4 /* Did it work? */
bne .Lmem_done
ldr r3, [r0]
add r3, r3, r1 /* Add block size onto __memtop and try again */
str r3, [r0]
b .Lmem_check
ldr r3, [r0] /* get final memory size */
lsr r3, #0x14 /* Get number of megabytes */
str r3, [r2] /* And store it */
/* unpatch handlers */
str r6, [r5]
DMB r12

bx lr

.extern __memtop
.word __memtop

.extern __mem_page_size
.word __mem_page_size

.extern __system_ram
.word __system_ram

.extern data_abort_handler_address
.word data_abort_handler_address

We see a few things here.  Firstly, how to patch in and out the handler.  Also, that I've got fed up with doing the whole .code 32; .global foo; foo: rigmarole and defined a macro called FUNC.  We also see a macro called DMB, which implements the ARMv6 Data Memory Barrier (ARMv7 has a 'dmb' instruction, to do that, we don't).  For what it's worth, these are the macros:

.macro FUNC name
.code 32
.global \name

/* Data memory barrier */
/* pass in a spare register */
.macro DMB reg
mov \reg, #0
mcr p15,0,\reg,c7,c10,5 /* Data memory barrier on ARMv6 */

So, we can hopefully now find out how much memory we have, with __memtop containing the actual top of memory and __system_ram containing the number of megabytes in case it's useful to know.

So let's look at the start of _reset...

.equ MODE_BITS,   0x1F /* Bit mask for mode bits in CPSR */
.equ USR_MODE,    0x10 /* User mode */
.equ FIQ_MODE,    0x11 /* Fast Interrupt Request mode */
.equ IRQ_MODE,    0x12 /* Interrupt Request mode */
.equ SVC_MODE,    0x13 /* Supervisor mode */
.equ ABT_MODE,    0x17 /* Abort mode */
.equ UND_MODE,    0x1B /* Undefined Instruction mode */
.equ SYS_MODE,    0x1F /* System mode */

FUNC _reset
/* Do any hardware intialisation that absolutely must be done first */
/* No stack set up at this point - be careful */
ldr r0, =.Lsize_memory
ldr r0, [r0]
cmp r0, #0
blxne r0

/* Assume that at this point, __memtop and __system_ram are populated
/* Let's get on with initialising our stacks */
mrs r0, cpsr /* Original PSR value */
ldr r1, __memtop /* Top of memory */

bic r0, r0, #MODE_BITS /* Clear the mode bits */
orr r0, r0, #IRQ_MODE /* Set IRQ mode bits */
msr cpsr_c, r0 /* Change the mode */
mov sp, r1 /* End of IRQ_STACK */
/* Subtract IRQ stack size */
ldr r2, __irq_stack_size
sbc r1, r1, r2

bic    r0, r0, #MODE_BITS /* Clear the mode bits */
orr    r0, r0, #SYS_MODE /* Set SYS mode bits */
msr    cpsr_c, r0 /* Change the mode   */
mov    sp, r1 /* End of SYS_STACK  */
/* Subtract SYS stack size */
ldr r2, __sys_stack_size
sbc r1, r1, r2

bic    r0, r0, #MODE_BITS /* Clear the mode bits */
orr    r0, r0, #FIQ_MODE /* Set FIQ mode bits */
msr    cpsr_c, r0 /* Change the mode   */
mov    sp, r1 /* End of FIQ_STACK  */
/* Subtract FIQ stack size */
ldr r2, __fiq_stack_size
sbc r1, r1, r2

bic    r0, r0, #MODE_BITS /* Clear the mode bits */
orr    r0, r0, #SVC_MODE /* Set Supervisor mode bits */
msr    cpsr_c, r0 /* Change the mode */
mov    sp, r1 /* End of stack */
/* And finally subtract Kernel stack size to get final __memtop */
ldr r2, __svc_stack_size
sbc r1, r1, r2
str r1, __memtop
/*-- Leave core in SVC mode ! */
/* Zero the memory in the .bss section.  */
mov a2, #0 /* Second arg: fill value */
mov fp, a2 /* Null frame pointer */
ldr a1, .Lbss_start /* First arg: start of memory block */
ldr a3, .Lbss_end
sub a3, a3, a1 /* Third arg: length of block */
bl memset

ldr r2, .Lc_entry /* Let C coder have at initialisation */
        mov     lr, pc
        bx      r2

cpsie i /* enable irq */
cpsie f /* and fiq */

/* Initialisation done, sleep */
ldr r2, .Lsleep
        mov     lr, pc
        bx      r2

.Lbss_start: .word __bss_start__
.Lbss_end: .word __bss_end__
.Lc_entry: .word c_entry
.Lsleep: .word sys_sleep

Note the use of msr cpsr_c, rx - this is how we change mode.  We can change mode this way from any mode except user mode.  Luckily, the user mode stack pointer is shared with system mode, so we don't need to drop into user mode at all.  So we go off, find how much memory we have, then for certain of the operating modes, we set up a stack pointer.  We then use a pre-written implementation of memset() to zero out the bss section, let the 'c' code have a go at initialising its stuff via c_entry(), turn on interrupts, and go to sleep via sys_sleep().

Next up, how we go about doing useful work...

Wednesday, December 7, 2011

Developing a multitasking OS for ARM part 1

My Scheme OS for the Raspberry Pi SBC is coming along nicely (code at, and a couple of comments on the Raspberry Pi forum kinda kicked me into actually documenting some of the process of what I've been doing.  So, here goes.

Firstly, the toolset.

The first thing we'll be needing is development tools.  Yeah, there's "off the peg" toolsets available, but I wanted to be up at the bleeding edge.  So, off to GNU's site, and let's get cracking.

I built and installed the latest versions of libtools (which includes the assembler and linker), gcc, g++, newlib and gdb, all for target arm-none-eabi.  If you want to know how to do this, googling "arm bare metal" should elucidate.  Otherwise, there's always codesourcery.

Now, booting.  Obviously, the first thing we need to do is boot the board.  In my case, it's very uncomplicated.  No first-stage booters, no relocating stuff from flash, just a bunch of RAM  that your binary gets loaded into, starting at address 0x00000000.  Easy peasy.

So.  How does ARM (specifically, the ARM1176jzf-s processor on the Raspberry Pi) boot?  Well, there's chapter and verse on the ARM site, but here's the TL;DR version.

When the ARM powers on, it executes ARM (32 bit) instructions starting from address 0x00000000.  

Simples, right?  Well, not quite.  Address 0x00000000 is the start of what's known as the exception vector table, which contains 8 bytes for each of 8 potential exceptions.  8 bytes (or 2 words) is enough to store an absolute jump instruction, or an instruction to move an address from memory into the program counter.  So the simplest vector table would look like this:

.section .reset
.code 32
.global __reset
  b _reset           @ Power on reset
  b _undef           @ Undefined instruction
  b _swi             @ Software interrupt
  b _prefetch_abort  @ Prefetch Abort
  b _data_abort      @ Data abort
  b .                @ Unused
  b _irq             @ IRQ
  b _fiq             @ "fast interrupt"

And that would be fine.  However, that's not how it's normally done, mainly because it's impossible, with this setup, to change the vectors on the fly.  So what we do is this:

.section .reset, "ax"
.code 32
.global __reset
  ldr  pc, _reset_address
  ldr  pc, _undef_address
  ldr  pc, _swi_address
  ldr  pc, _prefetch_abort_address
  ldr  pc, _data_abort_address
  b    .
  ldr  pc, _irq_address
.global _fiq
  @ Fast interrupt handler starts here
  b    _no_handler

.global _no_handler
  b    _no_handler

_reset_address:          .word _reset
_undef_address:          .word _undef
_swi_address:            .word _swi
_prefetch_abort_address: .word _prefetch_abort
_data_abort_address:     .word _data_abort
_irq_address:            .word _irq

.global _reset_address
.global _undef_address
.global _swi_address
.global _prefetch_abort_address
.global _data_abort_address
.global _irq_address

.weak _undef
.set _undef, _no_handler
.weak _prefetch_abort
.set _prefetch_abort, _no_handler
.weak _data_abort
.set _data_abort, _no_handler

.global _reset
  b     _no_handler

.global _swi
  b     _no_handler

.global _irq
  b     _no_handler

That's loads bigger, but what does it change, exactly?

The "fast interrupt" code gets to miss an indirection, so it's faster.  We simply start the interrupt handler directly at the end of the vector table.  I'm not actually doing this at the moment, but it's possible.

The other exceptions load their address from an indirection table, so we can repatch them on the fly.

We have a "generic" handler for unhandled exceptions.  The way that gets patched in is to do with the linker.  A .weak directive for a symbol will allow us to simply not define a symbol in our code, and the linker will replace it with zero instead of barfing.  The .set directive enables us to use a different default to zero.  Thus, any of the _undef, _prefetch_abort or _data_abort entry points (in the code above) will redirect to _no_handler unless we define those entry points elsewhere.  This is a trick we'll use again later.  Note _reset, _swi and _irq have no defaults, and thus must be defined elsewhere (I've defined them to simply jump to _no_handler for the moment.

All we need to do is assemble that and link it to load at 0x00000000, and we have a booter.  It will do bugger all, but it will work.

Wednesday, October 19, 2011

Garbage collection for an SD/MMC-based persistent store

My Lisp/Scheme based OS is moving on apace. I've actually got a basic OS running on STM32, but it's feeling a bit cramped, and there's a new game in town - Raspberry Pi. Raspberry Pi (or, to shorten it a bit, RasPi) is a credit-card sized ARM11 based machine intended for educational use. And it's cheap. Really, really, cheap. there's a few oddities to it; it's actually based on a Broadcom SoC that's, in reality, a top whack GPU with a sorta old-hat ARM core bolted on as a bit of an afterthought. As such, it boots "oddly", GPU first and then ARM, but hey. The board itself has no onboard flash; persistent storage is SD card only (or, I guess, USB mass storage).

As could probably be guessed, it's coming with Linux on board, but hey, who cares about Linux?

Never one to make things easy on myself, I've expanded the scope of my OS. The Lisp/Scheme core is still there, the concept of syscalls as closures, all that good stuff. But I keep coming back to the Newton. Ah, my beloved NewtonOS. A totally persistent OS, one that takes power failures and years sitting in a drawer in its stride, a system where you whack new batteries in and you're back exactly where you were almost the moment you've let go of the power toggle switch. This is exactly what Stanislav is talking about in his third law of sane computing.

So. Persistence, then. How's that gonna work?

Before we start, I should note that the following technical stuff is nothing to do with wear levelling (which task is carried out quite adequately by the onboard controller of the SD card).

We have a quantity of RAM (256MB in the case of the Pi) which we want to use as a read / write cache for a larger quantity of persistent storage.  In this case, Flash memory accessed through an SD/MMC card interface.

We could take a na├»ve approach, and simply page stuff in and out, but there is a significant performance overhead, particularly with flash memory.  Not only are writes to flash slow, but you can only write to a block once, and then it needs to be erased before you write again.  And erasing is really slow.

Any given SD card is split into a number of sectors (typically between 512 bytes and 2K long).  A sector is the smallest unit we can read.  Those sectors are grouped into erase groups (typically in the 128KB - 2MB range).  An erase group is the smallest unit we can erase.  We can write to a sector, but if we try to write to it twice, the entire erase group has to be erased and the old contents (plus the new contents of our sector) written back.  This is a positively glacial operation.

If we take a sane approach, we block everything at the erase group size, and control when (and how) we write back to the card.  If we take the approach of writing data back to a different erase group, we can put off the pain of erasing the old one, and do it at our (or the card's) leisure.

Now, as we don't know exactly where an erase group will be paged into memory (and we can't restrict it to a specific location, as the card storage is potentially bigger than real memory), we have to use an indirect addressing method, using a "logical erase group number" (which can't be the actual erase group number, as that will change underneath us) and an index or address into the block of memory.  This adds some overhead to pointer dereferencing, of course, but it's mainly done using shifts and masks; dropping to assembler to do this should make it at least tolerable.

The bigger pain is garbage collection.  Lisp and Scheme are garbage collected languages, and we need a garbage collector that doesn't hurt us.  

At this point, I should probably set out some requirements for garbage collection:

  • It should reclaim as much memory as possible (duh)
  • It should not require paging in gigabytes of memory (that costs potentially enormous amounts of time).
  • It should never (if possible) force data to be written back to storage (which would be worse than paging in).
  • It should (if possible) be concurrent with application usage.
  • It should have a bounded worst case time per cycle (Java, I'm looking at you)

A traditional garbage collector is gonna hurt us plenty.  Why is that?

The way most garbage collectors work is by tracing from a group of accessible objects to find all objects in the heap that are addressable.  These addressable objects are then copied somewhere "safe", and the rest thrown away.  That's pretty much the gist of it (there are various complications, but the reader is referred to the relevant literature for further detail).

Now, that whole "copying" thing means that, for every garbage collection cycle, the erase group(s) containing the "live" objects will have to be written back to the SD card, whether the objects themselves have changed or not.  Ouch.

But surely, you say, there exists a way of collecting garbage without moving objects around? As it happens, dear reader, there does.  H.G. Baker's "Treadmill" non-copying garbage collector is a particularly good (and easy to code) example.  It comes with a certain amount of hurt itself, but it gets rid of the copying requirement, and that's a good start.

The main hurt of Baker's treadmill is the requirement to hold every object in a doubly-linked list.  For a lisp implementation where objects generally look like this:

struct cell { 
 uint32_t car;
 uint32_t cdr;

adding in the doubly-linked list produces this, which has twice the storage requirement:

struct cell { 
 struct cell * next; 
 struct cell * prev; 
 uint32_t car; 
 uint32_t cdr; 

Ouch, again.  But hey, memory's cheap, and we're doing this in order to use RAM as a window on the bigger, persistent store - not really an issue.  It should be noted that the structure above is only indicative; the links don't need to be persisted and can be kept separate of the actual cell contents.  The overhead, of course, remains the same.

The other hurt is not to do with Baker's algorithm, but is more general - GC tends to want to access all your heap.  And when your heap is feasibly orders of magnitude larger than real memory, that's gonna cause a load of thrashing.

So I started with the idea of not GCing non-swapped-in erase groups.  Do your garbage collection on what you have in memory at the time, don't try and pull in any non-swapped-in erase blocks, and leave it at that - the other erase groups will get GCed when they eventually get swapped back in.  Great, said I, coded about half of it up, and went to bed.  And then woke up in the middle of the night.
"What about objects in a swapped in erase group which are only pointed to by objects in erase groups that are not swapped in?  How the hell do I distinguish them from free objects if I don't know where they come from, or even that they are being pointed to?"
 Ah, tits!

Yep, if we're doing things in a naive way, but not swapping in every erase group as we traverse, we're gonna lose objects, and corrupt structures that are partially swapped in.

We need to get clever.

Firstly, let's stick with the idea of erase groups, and not garbage collecting the "world".  Indeed, let's not even try collecting our entire heap, we can restrict ourselves to a single erase group, or even a single sector if we like.  All we have to do is decide not to follow links outside the (relatively) little block of memory we're looking at.  We take our set of system roots, for every one that's in our block, we trace its links until we get to leaves or a link outside the block.  The rest is totally normal as per Baker's algorithm.

By restricting ourselves to a single block of memory, we can obviate the need to have a full doubly-linked list for our entire heap; all we need is a doubly linked list for the block we're currently collecting, and that can be generated trivially at the beginning of the GC.  This buggers the simplicity of allocating under Baker (which is a dereference and pointer increment), but rather than holding a full list all we need to hold is a bitmap.

Still, that hasn't solved the issue of links into our collected block.

As it turns out, the solution to that is incredibly simple, and falls back to one of the oldest forms of garbage collection - reference counting.  Along with our cell data, we carry a reference count variable, which counts the number of times a cell is referenced by cells outside the block.  This looks a bit like this:

struct cell { 
 uint32_t car; 
 uint32_t cdr; 
 uint16_t refcount;

If we create a cross-block link, we increment the reference count, if we destroy one (either by explicitly removing the link in code, or by freeing the referencing object during a garbage collection), we decrement it.  As part of building our list of root objects, we scan the block for those objects with a non-zero refcount, and push those direct into the list of roots.

Now, maintaining refcounts is a pain in the ass, but with the granularity of allocation blocks we're talking about here (between 128KB and 2MB, remember), the number of cross-block references are liable to be very small.  We can help this by preferring allocation of new objects in the block their "owner" inhabits (which will also help reduce object scattering, and thus paging issues).

So.  what does the solution look like?

Firstly, we have some basic types:

typedef struct cell {
  uint32_t car;
  uint32_t cdr;
} cell_t;

typedef struct reference_count {
  uint16_t colour:1; // indicator for gc, see Baker
  uint16_t count:15;
} reference_count_t;

Some calculations:

// Every erase group must store :
// n cells (8 bytes each), n refcounts (2 bytes each) 
// and n / 32 free cell bitmaps (4 bytes each)
#define CELLS_PER_BLOCK (erase_group_size * 8) / (64 + 16 + 1)

Therefore cell storage within an erase group looks like this (it's not actually defined like this, as CELLS_PER_BLOCK is dynamically calculated based on actual card characteristics):

cell_t            cells[CELLS_PER_BLOCK];
reference_count_t refcounts[CELLS_PER_BLOCK];
uint8_t           bitmaps[CELLS_PER_BLOCK];

The garbage collection algorithm is thus (using a single, transient, doubly linked list for CELLS_PER_BLOCK elements) :

- For each cell in the erase group
  - if cell is in set of root objects, or has a refcount != 0, add to "grey" objects
  - else add to "ecru" objects
- Do Baker's treadmill algorithm, restricting ourselves to intra-block links
- For each free object
  - set "free" indicator in bitmaps
  - if object has extra-group link(s)
    - decrement refcount(s) on externally referenced groups (may require paging)
    - reinitialize to NIL . NIL or similar
    - flag group as "dirty"

And that's about it.  Allocating involves finding a free cell in the relevant erase group (made relatively easy using ARM's CLZ operator), assignment has to check for cross-erase-group links.

Note that the only time garbage collection will require the erase group to be flushed back to storage is in the (rare) case where we free a cell that has extra-group links.  The "free" bitmap may be out of date on loading a group from flash, but it will never indicate that a non-free cell is free (allocation of a cell will update the bitmap to indicate the non-free status of that cell, and require a flush).  The bitmap will be updated on the first gc cycle for the group, at which point it should be up-to-date.

As an added bonus, as long as everything is persisted, suspending a process is simply a case of incrementing the reference counts of all its root objects and stopping the process.  Starting it again is decrementing those counts, and you're back to exactly where you were.

How d'ya like them apples?