Monday, July 9, 2012

Atomic setting on ARM

Back to our scheduled programming.  Big gobs of ugly assembly code.

When implementing a multitasking operating system, there is a need for dealing with mutating shared state.  It's one of the unpleasant realities of the world of multitasking (and, indeed, multithreading), and it's fraught with danger.

One of the basic functions we need to implement is the "atomic set" operation - setting a variable to a given value.

Let's take the example of incrementing a variable.  Naively, we might do this:

    ldr r0, [r1]
    add r0, r0, #1
    str r0, [r1]

In most cases, this will work.  However, there exists a case where, in fact, this can fail - if two threads of execution are trying to increment the value, and a task swap happens whilst one is actually doing the increment, the possibility is that the value will be (incorrectly) incremented only once, as follows:

; Assume r1 points at a given address, holding the value 0
[thread 1]
    ldr r0, [r1]  ; r0 in thread 1 is now 0

[thread 2]
    ldr r0, [r1] ; r0 in thread 2 is now 0
    add r0, r0, #1 ; r0 in thread 2 is now 1
    str r0, [r1] ; memory is now 1

[thread 1]
    add r0, r0, #1 ; r0 in thread 1 is now 1
    str r0, [r1] ; memory is now 1

Obviously, we would expect memory to be set to 2, not to 1.  So somehow we need to either stop the interrupts happening (easy enough, turn interrupts off, but that has fairly big impacts elsewhere) or somehow deal with the case where we are interrupted mid-operation.

As luck would have it, ARMv6 provides us with 3 handy opcodes for this : ldrex, strex and clrex.  Basically, we use ldrex to signal that we want to have exclusive write access to a memory location, strex signals that we're going to write to a location and close that exclusive access, with a test for success, and clrex says "hey, we're no longer interested".  So, how do we use these to do what we want?

Let's go back to our example above - incrementing a value in memory.  Using ldrex / strex it would look like this:

    ldrex r0, [r1]
    add r0, r0, #1
    strex r2, r0, r1
    cmp r2, #0
    bne try_increment

What happens here is:

  • the initial ldrex loads the memory, and indicates that it wants an exclusive access to the memory itself.
  • We then increment our value, as usual.
  • We write the value back using strex - this will only succeed if:

  1. we still have an exclusive lock on the memory
  2. no newer writes to that memory have happened since we established our exclusive lock
  • success of strex is indicated by register r2 (the "extra" operand that strex uses) being set to 0.
  • If strex has failed, we go back and try again from the point where we loaded the initial value.

For our super-simple increment case, this will probably catch 99.99% of cases.  We add a "belt-and-braces" approach, however, by making our task scheduler explicitly invalidate all exclusive reads, using the clrex opcode.  This has the possibility of making any in-process ldrex-strex blocks restart (and thus take more time), but covers all the bases.

Now, that's all fine and well, but we may want to use this method in our C code, without resorting to subroutine calls (by their very nature, exclusive operations happen at a very low level, and probably want to be inlined).  So we're going to want to use some of that nastiest of nasties, inline assembler in gcc.  Believe me, it's vile.

Here's an implementation of an inline atomic increment using gcc inline assembler:

inline uint32_t atomic_increment(uint32_t * memory) {
  uint32_t temp1, temp2;
  __asm__ __volatile__ (
    : [t1] "=&r" (temp1), [t2] "=&r" (temp2)
    : [m] "r" (memory)
  return temp1;

Horrible, no?  Note the use of local labels ('1:' and then branching to '1b' to indicate the latest local label called '1'), having to use encoded tabs and newlines to stop the assembler itself barfing, and the horrible workaround of multiple names for the same variables because gcc is, quite simply, broken.  Still, it works, and the C optimiser can deal with it.

If you want to get more complex, I'd suggest looking at the ARM site for the example implementations of mutexes and condition variables using ldrex/strex.  You'll have to deal with converting from ARM assembler to GNU, but as long as you don't try inlining them, you should be fine.

Wednesday, May 9, 2012

An aside. Vector maths for games

Most people, when they start writing games, run up against having to rotate and move objects.  Generally speaking, they attack this problem "head on" using using trigonometry, specifically using sines, cosines and arctangents.  This is a direct application of the SOH/CAH/TOA trigonometry you learned (or are in the process of learning, or will be learning later on) at secondary school.

Although simple and easy to understand, this approach has a few drawbacks.

  • The first (and usually most noticeable) of these is that it's quite slow.  It's manageable for a small amount of objects, but if, for example, you're trying to do something like "Geometry Wars" with thousands of objects rotating and moving at the same time, you're soon going to find that all the trig rapidly becomes a bit of a bottleneck.  It gets even worse if you move into 3 dimensions.
  • The second is related to the first, and, in these days where GPU acceleration and multi-core processing are commonplace, can be a major sticking point - Trigonometry can be hard to efficiently parallelise.
  • The third, and perhaps most important despite being the least obvious, is accuracy.  Floating point calculations are never exact, and the errors in a purely trigonometric approach are usually cumulative.  Generally speaking, this means that if you rotate an object by 1° clockwise, 360 times, it will not end up where it started.

Luckily, there exists a way to almost totally bypass the trigonometry.  Let's recap what state we usually end up using for an object.

  • We keep track of its location in space.  We typically do this using a pair of co-ordinates.
  • We keep track of the direction it's facing.  Typically using an angle.
  • We keep track of the direction in which it is moving, again using an angle.
  • We keep track of its current speed.

We also know how to convert from cartesian co-ordinates (x and y) to polar coordinates (angle and size) and vice versa, we know that polar co-ordinates are equivalent to cartesian co-ordinates, and we've seen that, for some tasks, it's easier to use polar co-ordinates than cartesian ones.

What we're going to do is get rid of all the references to angles, and all the trigonometry.  We're going to do this by using vector mathematics.

So, we have location.  This is a set of co-ordinates, and thus a vector. So far, so good.

By combining direction of movement and speed into a single conceptual value, we get velocity, which is a set of (polar) co-ordinates, and also a vector.  Unfortunately, it has an angle embedded in it, but as we know that polar and cartesian co-ordinates are equivalent, we can represent it in cartesian form (velocity-x and velocity-y).

Which leaves us with that unfortunate direction we're facing, which is a single value indicating angle, and thus a scalar.  However, the only bit we're interested in is the angle, so we can convert it into a vector with a size of 1 (a unit vector), and then get rid of the 'angle' part by converting it to cartesian form (facing-x and facing-y).

So what does that get us?

Well, let's look at the tasks we have done so far.

Moving an object from its current location to the next location

This is the most basic task we want to do, and it's also the most simple.  It's a trivial vector addition of location and velocity, which is to say :

location-x = location-x + velocity-x
location-y = location-y + velocity-y

Accelerating an object in a particular direction

Again, this is pretty simple.  We represent the acceleration as a vector quantity, and add it to velocity :

velocity-x = velocity-x + acceleration-x
velocity-y = velocity-y + acceleration-y

That's the basics of acceleration, but how do we get the acceleration value?

I'm going to explain a basic model of Newtonian physics.  Newton's Second Law is most commonly stated as *F=ma*, where F is the force applied to an object, m is its mass, and a is the acceleration - it can be restated as *a=F/m*.  So to calculate the acceleration of each object, we need to know the force acting on it, and its mass.

We'll assume the mass of the object is constant, for the moment (no near-light-speed effects here, thank you), so what wee need to calculate is the force acting on the object.  Force is a vector value, and the overall force acting on the object is simply the addition of all the forces acting on it.  So, we'll start each "step" of the game with a clean slate, and add the forces up as we go:

force-x = 0
force-y = 0

... Accumulate forces

acceleration-x = force-x / mass
acceleration-y = force-y / mass

So, let's look at the common forces we might


To apply a "thrust" in the direction we're facing, we can use the unit vector *facing*, multiplied by some constant of "thrust strength", as follows (remember, *facing* is a unit vector, so it always has a length of 1) :

force-x += facing-x * thrust-strength
force-y += facing-y * thrust-strength

Acceleration towards the floor due to gravity

Assuming that gravity always works downwards, and that our co-ordinate system starts at the bottom left, we can make gravity work as follows (obviously, the value of gravity-constant should be chosen to work with whatever system of units we're using):

force-y -= gravity-constant

Acceleration towards (or away from) other objects

There's 2 cases here.

The first is a very simplistic "chasing" behaviour, where an object "chases" another one.  This is usually best done by rotating the object to point at its target (see later) and then "thrusting" as above, but it can also be useful to carry out a simple "go that way" approach.

The second is a more complex simulation of gravity - if we were implementing something like Geometry Wars, we might have "black holes" that pull objects towards them.  In this case, gravity is a bit more complex, as it not only works in a different direction for each pair of objects, but it gets stronger depending on how close the objects are to each other.

In both cases, the first thing we do is to create a vector that gives both the direction in which we are going to be pulled, and the distance between the objects:

vector-x = other.location-x - location-x
vector-y = other.location-y - location-y

In both cases, we need to know the distance between the two objects.  For the simple chasing case, we'll calculate it directly:

distance = size(vector)

Now, gravity usually works according to what's known as an inverse square law, which is to say that it's inversely proportional to the square of the distance between the two objects.

Remember, Pythagorus tells us that, for any right angled triangle, the sum of the squares on the short sides are equal to the square on the hypotenuse - the square on the hypotenuse is, in this case, equal to the distance squared, so rather that calculating the distance in a single step, we do it in 2 steps so that we can keep the "distance-squared" value :

distance-squared = (vector.x * vector.x) + (vector.y * vector.y)
distance = sqrt(distance)

Now, in both cases, we want to scale the vector by a factor.  To do this, we need to scale the vector to a unit size, and then multiply it by our factor.  To scale to a unit size, we divide each co-ordinate by "distance" :

vector-x = vector-x / distance
vector-y = vector-y / distance

For the simple chasing case, we now multiply by our "chasing strength", and we have our force :

force-x = vector-x * chase-strength
force-y = vector-y * chase-strength

For the attraction due to gravity case, we divide by the distance squared, and then multiply by some constant to get a usable value :

force-x = (vector-x * gravity-constant) / distance-squared
force-y = (vector-y * gravity-constant) / distance-squared

Braking / Friction

Braking is a special case.  We could implement it by making a force vector that points in the opposite direction to the current velocity, but it's simpler (and faster) to simply multiply the current velocity by a factor.

"Proportional" braking is very simple.  If we wanted, for example, to brake by 5% every "step", we would do this:

braking-factor = 0.95
velocity-x = velocity-x * 0.95
velocity-y = velocity-y * 0.95

This makes a very realistic simulation of friction.

"Fixed amount" braking is a tiny bit more tricky - we need to know the size of the current velocity (using Pythagoras' theorem) and work out the factor from there:

speed = size(velocity)
if speed < braking-amount
braking-factor = 0
braking-factor = (speed - braking-amount) / speed
velocity-x = (velocity-x * braking-factor)
velocity-y = (velocity-y * braking-factor)


Now, so far, everything we've done is incredibly simple - the most complex thing we've come across is a square root, and we haven't had to bother with angles *at all*.

Rotating objects is a bit more complex.  After all, how can we rotate something to a given angle if we don't know what angle it's already at?

The short answer to this is

 I lied.  It's not complex at all.

but the short answer isn't very interesting (and, above all, doesn't tell us anything useful).  The long answer follows, and is based on matrix multiplication.  As a reminder (or a first taste, if you've never seen this before), a matrix is a (conceptually) rectangular grid of values.

Given that we're dealing with 2 dimensions, one of the most useful matrices we can imagine is a 2 x 2 matrix; we can multiply a vector (x, y) by such a matrix as follows :

| a b | |x| = |(x * a) + (y * b)|
| c d | |y|   |(x * c) + (y * d)|

Now, that's pretty simple to understand, and it's trivially simple to implement in code. So, multiplying a matrix by a vector is going to return a vector with (potentially) different values, depending on what's in the matrix. So far, so good, right?

Now, let's go back to rotating stuff using SOHCAHTOA, keeping that in mind.  When we rotate a point (x, y) by an angle θ, we get a point whose x co-ordinate is given by (x * cos(θ)) - (y * sin(θ)), and whose y co-ordinate is given by (x * sin(θ)) + (y * cos(θ)).  Now that looks a lot like the matrix multiplication above, and indeed it is.  We can encode a rotation by some arbitrary angle θ as a matrix, as follows:

| cos θ -sin θ |
| sin θ  cos θ |

This is, in fact so useful that the majority of graphical toolkits (OpenGL, DirectX and so on) provide an API that uses matrices (actually, an extension of the rotation vector given above, but we'll see that later)

Now that in itself is pretty nifty, but here's where the real magick occurs.  Remember SOHCAHTOA?  Remember what it means?

  • Sin of angle = length of Opposite / length of Hypotenuse
  • Cos of angle = length of Adjacent / length of Hypotenuse
  • Tan of angle = length of Opposite / length of Adjacent

We've already used the TOA bit, when we used atan2 beforehand.  We're about to use the SOH and CAH bits to our profit.

Consider a unit vector. The length of its hypotenuse will always be 1.  And as any number divided by 1 is unchanged, we can see that the sin of the angle of that vector is given by the length of the opposite and the cos of the angle of that vector is given by the length of the adjacent.  And we know those values already - they are, quite simply, the co-ordinates of the vector.

So, rather than talking about angles, we can talk about unit vectors, and do away with all that messy trigonometry.  Well, at least when we're talking about taking something that's aligned at an angle of zero, and rotating it to a particular angle as described by a unit vector v - we use a rotation matrix like this:

| v.x -v.y |
| v.y  v.x |

giving us

new-x = v.x * x - v.y * y
new-y = v.y * x + v.x * y

What does this mean?

  • We can, using our handy unit "facing" vector, draw a graphic of our object, rotated to the correct angle, without having to call any trigonometric functions whatsoever.
  • We can make an object A point at another object B by calculating the vector A->B (trivial, this is given by B.location - A.location) and then scaling it to unit size (again, pretty easy - divide each coordinate by the length of the original vector), and using that as the new "facing" vector for object A.
  • If we need "increment" or "decrement" ability (for example, "rotate by 5° CCW") we can precalculate unit vectors describing the angles we need (obviously, this would need to be done using trigonometric functions, but it would only need to be done once), and then use those vectors to create our rotation matrices.

The only thing that's really left is rotating one object rotate towards another (think turret slowly rotating to face player) in increments.  To do this we need to know whether we are going to rotate clockwise or anticlockwise.  The easiest (although probably not the fastest) way to do this is to use atan2() to get the angles of the current "facing" vector and the vector A->B, to decide which way to rotate, and then use the incremental approach above.

Earlier, I touched on the fact that the various toolkits all provide matrix APIs.  How these work is by combining both a rotation and translation matrix (yes, we can deal with translations too) into one matrix.  For a 2-d case, we would do something like this:

|  cosθ -sinθ  t.x  |   | x |
|  sinθ  cosθ  t.y  | * | y |
|   0     0     1   |   | 1 |

Ignoring the pointless result given by the 1 in the vector, we would get :

x = (x * cosθ) - (y * sinθ) + t.x
y = (x * sinθ) + (y * cosθ) + t.y

This is, simply put, the value we had before, with the x and y values of t added on.  The matrix therefore rotates an object by *θ*, and then translates the result by *t*.

The matrices used by OpenGL and so on are not, in fact, 2D specific.  Everything is generalised to the 3D case, so we use :

|  cosθ -sinθ   0    t.x  |
|  sinθ  cosθ   0    t.y  |
|   0     0     1     0   |
|   0     0     0     1   |

This is actually a matrix for 3D translations and rotations, with all the z axis stuff left out.

Sunday, April 29, 2012

Back to Tech : Building for ARM "on the metal"

It's been a while since I posted anything technical, what with the winter season and moaning about the launch of the Pi, price gouging on ebay, playing hockey, teaching my 12 year old to program, and a whole  load of other stuff.  Fear not, though; LambdaPi is still moving onwards, I have a branch on my machine that separates everything that is Lisp from everything that is kernel, and is moving towards a more solidly real-time base.

Anyway.  I had a request from Angus Hammond asking about build scripts, as follows:

Would you be able to do a blog post at some point explaining how the build script you use in lambdapi works? The linking needed for an OS is obviously different to that needed normally if only because of the interrupts table, and that script is completely beyond my understanding. It would be greatly appreciated if you could explain any of the working behind it.
The answer to this is, of course, "of course".  So let's have a look at what's required.

Firstly, we need to understand what we want.  The way the Pi boots (and the way I'm using qemu) is to (eventually) load a binary image into memory at location 0x00000000.  This binary image is required to start with a standard ARM vector table, which we have already seen.  Execution actually starts by transferring control to the ARM with the PC set to 0x00000000, the standard "reset" vector.

So, what's going on in the build process?

Well, firstly, we compile and assemble everything we need into object files.  So far, so standard.  The tricky bits come with the following Makefile rules :

bin/kernel.img: bin/kernel.elf
  ${OBJCOPY} -O binary $< $@

bin/kernel.elf: lambdapi.ld $(OBJ) $(SYSLIBS)
  ${LD} ${LDFLAGS} -T lambdapi.ld $(OBJ) $(SYSLIBS) -o $@

The first of these to be executed is the one that builds bin/kernel.elf - this is dependent on the link script lambdapi.ld, all the object files $(OBJ), and all the assorted libraries we're using $(SYSLIBS).  We call the linker ${LD} with a set of flags ${LDFLAGS}, using a linker script lambdapi.ld, passing in all the object files and system libraries.  LDFLAGS are set to 

-nostdlib -static --error-unresolved-symbols

Which means "don't try and link any standard libraries, link everything statically, and throw an error if there's anything you can't find".  

So, at the end of that step, we should have a properly formatted elf-format file.  Unfortunately, that's not what we want, we need to lose all the elf stuff and end up with a raw image to be loaded at a particular location, hence the objcopy stage, which takes the binary bit of the elf image and spits it out, standalone.

Now, all we need to know is how to make the linker file.  A standard ld script takes a bunch of object files and munges them all together, keeping the .text (code) sections together, then appending a load of other stuff at the end, notably the bss and data sections.  Unfortunately, it doesn't guarantee any particular order of entry, and we *need* our reset table to be the very first entry.  So, we need a custom link script.  Let's look at it.

. = 0x0;
.text : {

__exidx_start = .;
.ARM.exidx   : { *(.ARM.exidx* .gnu.linkonce.armexidx.*) }
__exidx_end = .;

.data : { *(.data) }
__bss_start__ = .;
.bss : { *(.bss) }
__bss_end__ = .;

The first thing we do is tell the linker where the entry point is : symbol __reset.  In reality, we only need this if we'll ever be using the elf file directly, which we won't, but it doesn't hurt.

Now for the meat of the script.  We start by setting the link location to be 0x0, and then we create a text (code) section.  The first part of this will be the vector table, which we have handily tagged in vector_table.s with a .section directive as follows:

.section .reset, "ax"

This tells the assembler that the code assembled should be placed in a section of code with a name of  .reset (which I made up, I could equally well have used .simon, .foo, or anything else, as long as I was consistent between the ld script and the section name in the code itself).

In fact, we're linking all code labeled with a .reset section name first, but as there's only one of these, we can guarantee it will be first.

After that, and still within the .text section, comes anything linked as .text.  Basically, that's everything "code" spat out of the C compiler, along with everything else we've labelled as code in the assembler.

Then we have a .ARM.exidx section, that's required for C++ exception handling if we ever need it.  We don't, at the moment, but might do later.

Then we have the .data section - that's for any pre-defined, read-only values.

And finally, we have the .bss section - space allocated for unassigned data we might write to.

You'll also note that we set a couple of variables, __bss_start__ and __bss_end__ to the current link location "." - these are referred to in the code (so that we know where bss starts and ends, and thus where free memory starts) and will be replaced by the linker with the correct values.

And that's pretty much it.  It looks scary, but it's not.

If you want chapter and verse on this, check out this article which is much more explicit, and covers using system libraries and so on.  Also, it's worth looking at balau's blog entry on newlib (and all of balau's other bare metal articles, which are highly instructive).  And, of course, the gnu binutils documentation.


Saturday, March 3, 2012

Radio silence and gouging (A follow-up to Black Piday)

It seems that the forums are back up, and have been for 12 hours or so.  So far, there's only 2 people on the forums who seem to have got confirmed orders for Pis from the first batch.  Given that there were upwards of 12K people registered on the forum, and most of them were probably frantically trying to get orders in from 06:01 on the 1st, we can safely say that demand was "underestimated".

Oddly, there seems to be very little spleen being vented towards the Foundation, but some not inconsiderable anger directed at Farnell and RS, mainly over their incompetent and inconsistent handling of ordering and preordering.  This anger is, I believe, misplaced - there's far more wrong with what RS (to a lesser extent) and Farnell are doing than being simply incompetent.

I'll say it now.  I feel that handing over control to Farnell and RS was a massive mistake, and it may cost the foundation much of the goodwill they have accumulated even whilst they were missing their hoped-for delivery dates.

RS are seemingly not acknowledging anything.  Nobody is reporting any response at all from them.  "Register interest" and - nothing, nada, que chi.

Farnell are, logistically speaking, doing better - they (if you can find the link to order in your country) are accepting preorders.  However, they are:

  • Giving wildly optimistic delivery dates on order, which are then being modified in email
  • Advertising and confirming orders at one price, and then subsequently modifying the price (which, as far as I understand it, but IANAL, is contrary to the Sale of Goods Act, at least in the UK)
  • Selling the Pi at inflated prices.  This is the worst, as the price point was the one thing the Foundation have stuck to, and was a major selling point. The Model B (which is all that's on "sale" at the moment) was supposed to be 35 USD plus locally applicable taxes.  In France (and, I believe, the rest of Europe), Farnell are quoting 33.07 EUR before tax, which comes to 42.65 USD.  That's a 21% markup.  The UK gets an advertised 24.65 GBP, or 39.03 USD, but a reportedly subsequently modified price of 26.65 GBP, or 42.20 USD.  Again, a >20% markup.
I cannot understand why the Foundation are allowing this.  RS are advertising at 21.60 GBP, which is, at least, at the 35 USD price point.

Farnell are destroying the Foundation's year-long commitment to a given price point.  The Foundation should pull their licence NOW.

It has been pointed out that, in their defence, Farnell are merely rolling the shipping cost into the purchase price and shipping for free, but that's actually fairly dishonest - it's a "reverse eBay" (i.e. the reverse of the situation where an article is listed with a low price and then the rest is made up by gouging the shipping).  If you were (able) to order more Pis than one, you effectively pay for single item shipping on each and every one (or, put simply, overpay your shipping cost massively).

Black Piday

So, I didn't get a Raspberry Pi.  I'm not feeling overly bitter about it, as I wasn't actually expecting to, if nothing else because I was already at work before the 6AM GMT deadline. What I am mildly miffed about is that I can't currently do anything other than "express an interest in buying one" (or, in supplier-speak, "become a mailing list asset") from the major suppliers.  I've had an interest since May last year.

As I type this, the Raspberry Pi site is still showing a static page.  There's no official news about what happened, who got one, who didn't, what the details of the deal with Farnell and RS are, apart from Twitter.  The unofficial news (i.e. the stuff in print) is almost all wrong in fundamental details.

I wouldn't want to have been one of the inhabitants of #1 Fruity Loop recently. It was becoming increasingly evident that there was no way 10,000 Pis were going to suffice.  With every hour, new "news sites" were picking up on the Pi as a "$25 computer that does XBMC". None of them were picking up on the actual goal of the Pi itself, or mentioning the limitations of the Pi as an XBMC machine, or even any of the actual hardware features of the Pi beyond "it does XBMC". The wave upon wave of new users descending on the Pi forums and asking if they can install Windows on the Pi bore witness to this. What had started as a dream was rapidly becoming a monster. It was totally out of control, and there was no way the foundation would ever be able to meet the demand.  Hence, I guess, the throwing up of hands and handing control of the hardware delivery side of things to a pair of corporate rapists[1]

When people start realising what the limitations of the Pi as an XBMC machine is, the nerd rage will be spectacular.  We probably have a week or so before they start being delivered to homes around the globe, and the "this thing's a fucking pile of shit" threads on the already overloaded forums start sprouting up by the thousand.

The foundation themselves are largely to blame for this turn of events, as recent news has all been about multimedia and XBMC, and nothing to do with education at all.  Indeed, the "About Us" page on the Raspberry Pi site has been changed from a short and sweet statement of the goals of the foundation into a bunch of word salad.  From being about teaching programming, it's suddenly become
We want owning a truly personal computer to be normal for children.
So, when 6AM GMT ran around, Farnell, RS and Raspberry went down simultaneously. Not taken down by wave upon wave of pent up demand for educational computers, but by scalpers and people from These are not people who are going to do educational projects with the board, they are going to stick them to the back of the telly with blu-tac. the foundation wanted "the community" to take the board and supplied software, run with it, and create interesting projects.  Installing XBMC is no more interesting in terms of computer science than editing a document in Microsoft Word, the kind of crap that's being taught in ICT, the kind of thing the Raspberry Pi was suposed to fix.

None of the people I know who are planning educational projects managed to get one from the first batch. Only one of them managed to get a preorder rather than "expressing an interest".

There are probably less than a hundred people on the forums who are planning interesting stuff.  putting aside 1% of the first batch "for them", even if it was done behind the scenes, would have done far more for the foundation's erstwhile goals than selling a million Pis to people who want to make Mame machines and set top boxes.

So, like I said, I'm not bitter.  But I am fucking angry.  The emphasis has gone from producing something important, i.e. a machine that helps in teaching kids (and adults) how to actually control the hardware they own, on to simply producing another gadget.  That emphasis has been changing slowly since around December.  The sale itself was a total fucking mess, and, it appears, fucked up by both the foundation (setting a solid date and time for server meltdown, days in advance, and thus further engorging the until-then self-sustaining hype machine, was never gonna be a good idea), and RS/Farnell, by all accounts, don't appear to have been even close to ready (from denying the existence of the Pi through to stating that they will only be available to resellers).

I remain totally committed to the originally stated goals of the foundation.  The ones that don't appear on the site any more.  I truly believe the Pi could change the nature of computing in a way that no other computer has ever done, that it could make an understanding of computer science part of basic education.  Indeed, even if the foundation had never managed to get a single board out of the door, it has highlighted fundamental issues in education today, raised questions, and shown that, with a bit of will, something can be done.

But all that is being lost, being driven into obscurity by a botched sale of millions of cheap, credit card sized computers running XBMC.


I hope the foundation people at least get to pay off their personal loans, but I assume Farnell and RS are taking a substantial cut of the already slim margins.

[1] Arguably, the handing over of sales to people like RS and Farnell should have happened around December, when the beta boards landed.  At that point, the hype was containable. But that's with hindsight, and I guess at that point the foundation guys and girls still wanted to (and believe they could) keep control, which is understandable.

Tuesday, January 31, 2012

CAR, CDR, CONS - an aside

Commenter pcpete has been following the ARM OS stuff, but was having some trouble with my CAR/CDR/etc macros.  I figured I'd do a little writeup for those who aren't totally au fait with what that's all about.

CAR, CDR and CONS are terminology which has stuck around since the '50s.  It's not going away, so if you want to Lisp, you need to get used to it.

The original Lisp was implemented on the IBM 704 computer.  This was a machine which had a 36 bit word (peculiar, but useful) and a 15 bit address bus. 36 bit words could be split into 4 parts (and hardware support was provided for doing so) known as the address (a 15 bit value), the decrement (again, 15 bits), the tag (3 bits) and the flags (3 bits).  The very earliest work on Lisp provided operators to extract these 4 parts (Contents of Address part of Register or CAR,  and likewise for Decrement, Tag and Flags, providing CDR, CTR and CFR).  Later work (before the first implementation of Lisp) dropped direct manipulation of flags and tag, leaving CAR and CDR, and a "Construct" operator, CONS, which took values for the address and decrement parts respectively and stuffed them into a machine word.

The fact that a machine word was bigger than the size of the address of a machine word meant that it was possible to implement pairs of values and singly linked list cells within one machine word.  A pair of values (for example, 1 & 2) could be created as follows:

CONS(1, 2)

and singly linked lists by treating the 'address' part (paradoxically enough) as a value, and the 'decrement' part as the address of the next cell in the list.  By setting the last pointer to some easily-recognised value (known as 'nil'), it is possible to find the end of a list.  Thus, creating a list containing the numbers 1, 2 and 3, looks like this:

CONS(1, CONS(2, CONS(3, nil)))

Binary trees are also easy to construct using pairs - for example:

CONS(CONS(1, 2), CONS(3, 4))

I'm generally the last one to point people at Wikipedia, but there's a reasonable and rather more graphical explanation of what have become known as 'cons cells' over there

It's worth noting that Lisp also allows you to use the notation 'first' and 'rest' instead of 'car' and 'cdr', although this only really makes sense when in list context, and doesn't allow for composition (caar, cadr, cdar, etc).  Pretty much nobody uses first and rest.

So, Lisp uses what is, in effect, a pair (or list) oriented memory model as opposed to the more usual (these days) approach of addressing single memory cells individually.  And if one is implementing a Lisp (as I am), it makes a certain amount of sense for the low level memory allocation and manipulation functions to operate in this way.  Unfortunately, "modern" machines in general don't have words that are larger than their address bus size, and ARM is no exception to this.  ARM is a 32 bit machine - we could restrict addressing within our lisp to 16 bits (maximum 64Kwords or 256KB) and all values to 16 bits, or, more sensibly, use "virtual" words of 2 machine words (or maybe more) in size.  Usefully, ARM has LDRD (Load Register Dual) and STRD (Store Register Dual) operands that make this a snip to do.

So, we can define cons cells as 8-byte aligned pairs of 32-bit values, manipulate them as pairs using LDRD/STRD, and we're a long way towards implementing a Lisp.  

But what of the Flags and Tag stuff that was originally in Lisp?  Although, by the time Lisp came out, you had no way of directly manipulating them, they were still used to differentiate between pointers, direct values, and constants, amongst other things.  And this is something we need to be able to do as well, otherwise we can't tell if (for example) the value 0x00000000 in the CDR of a cell is a pointer to a cell at the address 0x00000000, the integer value zero, or some special value indicating nil/end of list.

As it happens, all is well. LDRD/STRD require an 8-byte aligned address, which means the low 3 bits of a pointer will *always* be 0.  We can, therefore, use those 3 bits as flags to indicate various other types, and, if we're careful, we can encode the majority of types in such a way as to keep their values "boxed" (i.e. containable in a 32 bit value).  Values requiring more than 32 bits to encode (strings, large numbers, rational numbers, etc) can either be encoded as lists of boxed values, pairs, or using any other encoding the programmer deems useful as long as they are easily recognisable.

So, back to the code I posted, which uses CAR, CDR, SET_CAR and SET_CDR macros.

/* linked list structure */
typedef struct task_list {
  task_t * _car;
  struct task_list * _cdr;
} task_list_t;

/* some LISPy primitives */
#define CAR(x) (x)->_car
#define CDR(x) (x)->_cdr

#define SET_CAR(x,y) CAR(x)=(y)
#define SET_CDR(x,y) CDR(x)=(y)

As we can see, task_list_t is an analogue for the cons cell, being simply 2 address-bus sized values held together.  Indeed, my code elsewhere looks like this:

typedef cons_t task_list_t;

but the original formulation is perhaps easier to grok at first.

So, given a pointer x to a cons cell c, we can see that CAR(x) expands to(x)->_car, in other words returning the _car element of the cell c pointed to by x.  Likewise SET_CAR(x, y) becomes (x)->_car = (y), assigning the value y to the _car element of the cell c pointed to by x. Doing this as macros is probably premature optimisation, but it's "the done thing".

Our task scheduler uses a list (actually, a plurality of lists) of tasks to execute, and considers them for execution in a "round robin" style.  So, assuming we have 4 tasks, a, b, c and d, we might have a list that looks, in lisp syntax, like this:

(a b c d)

or, in diagram form, where the top row is the CDR of the cell, and the bottom the CAR, and * indicates a pointer to the next cell

* - * - * - nil
|   |   |   |
a   b   c   d

We need to rotate the list as we consider each task, so that the next time through we consider the next task. So after considering task a for execution, we want the list to look like this:

(b c d a)

or, in diagram form

* - * - * - nil
|   |   |   |
b   c   d   a

In true lisp style, we'll use a singly linked list for this, with the CAR of each cell being a pointer to the task, and the CDR being a pointer to the next cell, or nil for end of list.  

So, we come in, and we find the CAR and CDR of the (original) list.  CAR is, obviously enough, task 'a', the task we want to consider for execution.  CDR, however, is not task b, but the rest of the list (hence the alternative 'rest' in Lisp) - i.e.

(b c d)

or, in diagram form

* - * - nil
|   |   |
b   c   d

So our next step is to put 'a' onto the end of the list.  This is not as simple as it seems - CONS can only be used to push items onto the front of a list - CONS(list, a) would result in this:

((bcd) . a)

or, in diagram form

* - * - nil
|   |   |
b   c   d

which is not a true list in Lisp terms (it's a "dotted pair" with a list as its first element). We need some sort of APPEND operator that adds an element to the end of a list, a sort of 'reverse cons'.  Lisp has one of those, it's (oddly enough) called APPEND, it's usually defined recursively, and it always creates new cons cells all over the place.  We really don't want to go allocating new memory in a function that's going to be called thousands of times per second simply to move a few values about - we want to keep the same cons cells and simply swap their CDR values around.  The algorithm thus becomes:

first := list
if CDR(first) != NIL
  list, end := CDR(first)
  end := CDR(end) while CDR(end) != NIL
  SET_CDR(first, NIL)
  SET_CDR(end, first)

With that in mind, have another read through


Monday, January 16, 2012

Developing a multitasking OS for ARM part 3 3/4

So, we looked at how to do the difficult part of task swapping last time, now let's look at how we go about handling interrupts and the syscall layer.

Firstly, interrupts.  We'll not bother with FIQs for the moment, we'll stick with IRQs.

Despite having an ARM1176 core, the Raspberry Pi (or, rather, its Broadcom SoC) doesn't have a vectored interrupt controller - instead it has 3 "standard" ARM interrupt controllers.  That makes things a bit more complex for interrupt handling, but it's not too arduous.  As a quick reminder, here's the IRQ code we had before, with a little bit of meat in the middle:

.global _irq_handler

sub lr, lr, #4 /* Save adjusted LR_IRQ */
srsdb sp!, #SYS_MODE /* save LR_irq and SPSR_irq to system mode stack */

cpsid i,#SYS_MODE /* Go to system mode */

push {r0-r12} /* Save registers */

and r0, sp, #4 /* align the stack and save adjustment with LR_user */
sub sp, sp, r0
push {r0, lr}

/* Identify and clear interrupt source */
/* Should return handler address in r0 */
bl identify_and_clear_irq

blxne r0 /* go handle our interrupt if we have a handler */
/* An interruptible handler should disable / enable irqs */

/* Exit is via context switcher */
b switch_context_do

and the context switcher looks like this:

.global switch_context_do
/* Do we need to switch context? */
mov r3, #0x0c /* offset to fourth word of task block */
ldr r0, =__current_task
ldr r1, [r0]
ldr r0, =__next_task
ldr r2, [r0]
cmp r2, #0 /* If there's no next task, we can't switch */
beq .Lswitch_context_exit
cmp r1, #0 /* In the normal case, we will have a __current_task */
bne .Lnormal_case

/* When we get here, we're either idling in system mode at startup, or we've */
/* just voluntarily terminated a task.  In either case, we need to remove the */
/* return information we just pushed onto the stack, as we're never, ever going */
/* back. */
pop {r0, r1} /* remove any potential stack alignment */
add sp, sp, r0
add sp, sp, #0x3c /* and the other registers that should be there */
/* r0-r12, interrupted pc & spsr */
/* Now we can do our first actual task swap */
ldr r0,  =__next_task /* swap out the task */
ldr r2,  [r0]
ldr r0,  =__current_task
str r2,  [r0]
ldr sp,  [r2, r3] /* and restore stack pointer */
b .Lswitch_context_exit /* bail */
cmp r1, r2 /* otherwise, compare current task to next */
beq .Lswitch_context_exit

clrex /* Clear all mutexes */

/* At this point we have everything we need on the sysmode (user) stack */
/* {stack adjust, lr}_user, {r0-r12}_user, {SPSR, LR}_irq */
/* Save our stack pointer, and swap in the new one before returning */

ldr r0, =__current_task /* save current stack pointer */
ldr r0, [r0]
str sp, [r0, r3] /* stack pointer is second word of task object */
ldr r0,  =__next_task /* swap out the task */
ldr r2,  [r0]
ldr r0,  =__current_task
str r2,  [r0]
ldr sp,  [r2, r3] /* and restore stack pointer */
pop {r0, lr} /* restore LR_user and readjust stack */
add sp, sp, r0
pop {r0-r12} /* and other registers */
rfeia sp! /* before returning */

The context switcher looks complex, but it isn't really.  There's 4 cases to cater for, viz:

  • No 'next' task, don't switch
  • No 'current' task, switch to 'next' task, cleanup stack and switch to 'next' task
  • 'next' task is the same as 'current' task, don't switch
  • 'next' task is different from 'current' task, switch
Obviously, the meat of the interrupt handler is held in 'identify_and_clear_interrupt', which does pretty much what it says on the tin.  In this article, I'll show the handler for the qemu platform, which is significantly simpler than that for the Pi, but the Pi handler looks largely the same modulo having to deal with 3 controllers.
.global identify_and_clear_irq

FUNC identify_and_clear_irq

ldr r4, =.Lirq_base

ldr r4, [r4]
/* read the vector address to indicate we're handling the interrupt */
ldr r0, [r4, #IRQ_HANDLER]
/* which IRQs are asserted? */
ldr r0, [r4, #IRQ_STATUS]
ldr r5, =__irq_handlers

clz r6, r0 /* which IRQ was asserted? */
mov r1, #1 /* make a mask */
bic r0, r0, r1, lsl r6 /* clear flag */
str r0, [r4, #IRQ_ACK] /* Now acknowledge the interrupt */
str r0, [r4, #IRQ_SOFTCLEAR] /* and make sure we clear software irqs too */
ldr r0, [r5, r6, lsl #2] /* load handler address */
.Lret: bx lr /* exit */
.word IRQ_BASE

.global __irq_handlers
__irq_handlers: .skip 32 * 4

and patching in a hander is as simple as setting the address for the interrupt handler into __irq_handlers at the appropriate place.  Simples, as they say in internet-land.

Syscalls are very similar.  Here's the syscall handler:

.global _svc_handler
srsdb sp!, #SYS_MODE /* save LR_svc and SPSR_svc to sys mode stack */
cpsid i,#SYS_MODE
push {r0-r12} /* Save registers */

and r0, sp, #4 /* align the stack and save adjustment with LR_user */
sub sp, sp, r0
push {r0, lr}
ldr r0,[lr,#-4] /* Calculate address of SVC instruction */
/* and load it into R0. */
and r0,r0,#0x000000ff /* Mask off top 24 bits of instruction */
/* to give SVC number. */
ldr r2, =__syscall_table /* get the syscall */
ldr r3, [r2, r0, lsl#2]
cmp r3, #0
beq _syscall_exit
tst r3, #0x01 /* what linkage are we using */
bxeq r3 /* ASM, just run away */
bic r3, r3, #0x01
blx r3 /* C, must come back here */

.global _syscall_exit
pop {r0, lr} /* restore LR_user and readjust stack */
add sp, sp, r0

pop {r0-r12} /* and other registers */
rfeia sp! /* before returning */

.section .bss
.global __syscall_table
/* leave space for 256 syscall addresses */
.skip 2048

The fun bit is how we go about getting the syscall number into the handler.  I've taken the "canonical" approach of using the svc operand, hence the bit where we get the instruction and extract the number.  Other ways include using a register, a global variable, pushing onto the stack, or some combination of these.

The other twist here is that I allow for both C and assembler syscall functions by setting (or not) bit 0 of the function address in the syscall table.  Assembler syscall handlers must, of course, exit via _syscall_exit or equivalent, but that's down to the programmer to get it right.

Saturday, January 7, 2012

Developing a multitasking OS for ARM part 3 1/2

Okay, so last time we looked at what we need to schedule and store top level information about tasks in our fledgling OS.  And it was pretty easy to do, because we could do all of it in C.  Unfortunately, things are about to get complicated again, because we're about to dive down into assembler again.

Whoopee!  I can almost hear the sound of "back" buttons being clicked as I type this.

So.  Multitasking.  How's that gonna work, then?  Largely speaking, there's 2 types of multitasking - preemptive multitasking, where tasks run for a specified amount of time then get forcibly swapped out, and cooperative multitasking, where tasks run until they decide to give some time to someone else.  We're going to implement (because there's precious little overhead in doing so) a hybrid where tasks can be "nice" and give time to others, but where the absolute maximum time they get is capped by a preemptive scheduler.

So.  Let's look at the sequence of events for a user-triggered task swap.  We want the user to use a line of code that looks like this (remember, I'm developing something that runs scheme...)


However, the user's code is running in user mode (remember the ARM processor states), and it can't directly access the task scheduler.  This is where software interrupts / SVC calls come in - the user's code *can* use the svc instruction.  This, in fact, is the beginning of what is known as a syscall interface, the way that unprivileged user code calls (or causes to happen) kernel functions.

Executing the svc instruction causes the following to happen:

  • CPSR_usr is transferred to SPSR_svc
  • PC is stored in LR_svc
  • Processor switches to SVC mode (SP_usr and LR_usr are now hidden by SP_svc and LR_svc, CPSR is identical except for processor state change)
  • PC is loaded with the address of the SVC exception handler

At this point, what we need to do is store R0-R12, LR_usr and PC (before entry to svc handler) into the user stack, in a known order, then save the user stack pointer to the task's "housekeeping" information, load the equivalents back in for the next task, and jump out of the handler back to user code.  We'll get onto that in a minute.

Preemptive task swapping will be done by using a timer interrupt.  Thus, for a preemptive task swap, the situation is quasi-identical, with _svc above replaced by _irq.  The only difference is that the PC stored to LR_svc is actually 4 bytes on from where we want to restart, so we need to remember to take that into account.

So.  How do we go about saving our information to the user stack?  This is made quite easy by the fact the user mode register are identical to the system mode registers.  So we need to save the svc/irq mode stuff (LR, SPSR) into the system mode stack, then swap into system mode and save the rest.  The first bit is covered by one instruction, which might have been made for the task:

srs (Store Return State) - store LR and SPSR of the current mode onto the stack of a specified mode

and, of course, its "twin"

rfe (Return From Exception) - load PC and CPSR from address

So, the preamble for the svc handler is as follows:

FUNC _svc_handler
srsdb sp!, #SYS_MODE /* save LR_svc and SPSR_svc to svc mode stack */
cpsid i,#SYS_MODE     /* go sys mode, interrupts disabled */

push {r0-r12} /* Save registers */

and r0, sp, #4 /* align the stack and save adjustment with LR_user */
sub sp, sp, r0
push {r0, lr}

and for an irq:

FUNC _irq_handler
sub lr, lr, #4 /* Save adjusted LR_IRQ */
srsdb sp!, #SYS_MODE /* save LR_irq and SPSR_irq to system mode stack */
cpsid i,#SYS_MODE /* Go to system mode */
push {r0-r12} /* Save registers */
and r0, sp, #4 /* align the stack and save adjustment with LR_user */
sub sp, sp, r0
push {r0, lr}

Note that the only difference is that the irq handler adjusts the return address (as we want to go back to the interrupted instruction, not the next one).

Given that we are now *always* in system mode, exiting from either handler is identical:

pop {r0, lr} /* restore LR_user and readjust stack */
add sp, sp, r0
pop {r0-r12} /* and other registers */
rfeia sp! /* before returning */

I'll move onto how to implement the guts of the two handlers in the next post.  But before we go, here's how we set up a task in "C" land.

Remember, when we switch into a task, we will be pulling a stored process state from the task's stack into the registers, and then restoring the PC and CPSR, also from the task's stack.  Setting up a task, then, involves "faking" the preamble of the exception handlers above.  Note the use of exit_fn as the value of LR_usr, this is where we go when a task dies, and is used to clean up after task exit, and the use of 'entry' (the address of the task's entry function) as the value of LR_svc/irq, which will be used as the "return" address from the exception handler.

The use of exit_fn means we can schedule "one-shot" tasks (which not all realtime OSs can do).  Hooray for us.

  // Allocate stack space and the actual object
  task_t * task = malloc( sizeof(task_t) );
  void * stack = malloc( stack_size * 4 );
  unsigned int * sp = stack + stack_size;
  task->stack_top = stack;
  task->priority = priority;
  task->state = TASK_SLEEP;
  task->id = (unsigned int)task & 0x3fffffff;
  *(--sp) = 0x00000010;             // CPSR (user mode with interrupts enabled)
  *(--sp) = (unsigned int)entry;  // 'return' address (i.e. where we come in)
  *(--sp) = 0x0c0c0c0c;             // r12
  *(--sp) = 0x0b0b0b0b;             // r11
  *(--sp) = 0x0a0a0a0a;             // r10
  *(--sp) = 0x09090909;             // r9
  *(--sp) = 0x08080808;             // r8
  *(--sp) = 0x07070707;             // r7
  *(--sp) = 0x06060606;             // r6
  *(--sp) = 0x05050505;             // r5
  *(--sp) = 0x04040404;             // r4
  *(--sp) = 0x03030303;             // r3
  *(--sp) = 0x02020202;             // r2
  *(--sp) = 0x01010101;             // r1
  *(--sp) = (unsigned int) env;     // r0, i.e. arg to entry function

  if ((unsigned int)sp & 0x07) {
    *(--sp) = 0xdeadc0de;           // Stack filler
    *(--sp) = (unsigned int)exit_fn;              // lr, where we go on exit
    *(--sp) = 0x00000004;           // Stack Adjust
  } else {
    *(--sp) = (unsigned int)exit_fn;              // lr, where we go on exit
    *(--sp) = 0x00000000;           // Stack Adjust
  task->stack_pointer = sp;