Monday, July 9, 2012

Atomic setting on ARM

Back to our scheduled programming.  Big gobs of ugly assembly code.

When implementing a multitasking operating system, there is a need for dealing with mutating shared state.  It's one of the unpleasant realities of the world of multitasking (and, indeed, multithreading), and it's fraught with danger.

One of the basic functions we need to implement is the "atomic set" operation - setting a variable to a given value.

Let's take the example of incrementing a variable.  Naively, we might do this:

    ldr r0, [r1]
    add r0, r0, #1
    str r0, [r1]


In most cases, this will work.  However, it can fail when two threads of execution try to increment the same value and a context switch happens while one of them is partway through the increment.  The value may then be (incorrectly) incremented only once, as follows:

; Assume r1 points at a given address, holding the value 0
[thread 1]
    ldr r0, [r1]  ; r0 in thread 1 is now 0

[thread 2]
    ldr r0, [r1] ; r0 in thread 2 is now 0
    add r0, r0, #1 ; r0 in thread 2 is now 1
    str r0, [r1] ; memory is now 1
...

[thread 1]
    add r0, r0, #1 ; r0 in thread 1 is now 1
    str r0, [r1] ; memory is now 1


Obviously, we would expect memory to be set to 2, not to 1.  So somehow we need to either stop the interrupts happening (easy enough, turn interrupts off, but that has fairly big impacts elsewhere) or somehow deal with the case where we are interrupted mid-operation.
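To make the lost update concrete, here is a minimal C sketch that replays the interleaving above in a single thread.  The function name and the two "register" variables are made up for illustration; each thread's private r0 is modelled as a local variable.

```c
#include <stdint.h>

/* Replays the interleaving from the trace above, one step at a time,
 * and returns the final value held in memory. */
static uint32_t replay_lost_update(void) {
    uint32_t memory = 0;
    uint32_t t1_r0, t2_r0;   /* each thread has its own copy of r0 */

    t1_r0 = memory;          /* thread 1: ldr r0, [r1]     -> 0 */

    /* context switch to thread 2 */
    t2_r0 = memory;          /* thread 2: ldr r0, [r1]     -> 0 */
    t2_r0 += 1;              /* thread 2: add r0, r0, #1   -> 1 */
    memory = t2_r0;          /* thread 2: str r0, [r1]     memory is 1 */

    /* context switch back to thread 1, which still holds the stale 0 */
    t1_r0 += 1;              /* thread 1: add r0, r0, #1   -> 1 */
    memory = t1_r0;          /* thread 1: str r0, [r1]     memory is 1 again */

    return memory;
}
```

Two increments ran, but replay_lost_update() returns 1: thread 1 computed its result from a value that was already out of date.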

As luck would have it, ARMv6 provides us with 3 handy opcodes for this: ldrex, strex and clrex (strictly speaking, clrex arrived with ARMv6K).  Basically, ldrex loads a value and signals that we want exclusive access to that memory location; strex tries to write to the location and closes that exclusive access, reporting whether the write succeeded; and clrex says "hey, we're no longer interested".  So, how do we use these to do what we want?

Let's go back to our example above - incrementing a value in memory.  Using ldrex / strex it would look like this:

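try_increment:
    ldrex r0, [r1]        ; load the value and mark the address for exclusive access
    add r0, r0, #1
    strex r2, r0, [r1]    ; try to store; r2 is 0 on success, 1 on failure
    cmp r2, #0
    bne try_increment     ; the store failed, so reload and try again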

What happens here is:


  • The initial ldrex loads the memory, and indicates that it wants exclusive access to that memory.
  • We then increment our value, as usual.
  • We write the value back using strex.  This will only succeed if:
      1. we still have an exclusive lock on the memory
      2. no newer writes to that memory have happened since we established our exclusive lock
  • Success of strex is indicated by register r2 (the "extra" operand that strex uses) being set to 0.
  • If strex has failed, we go back and try again from the point where we loaded the initial value.
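The same load, modify, try-to-store, retry shape can be sketched in portable C using a C11 compare-and-swap.  This is an analogy rather than the ldrex/strex mechanism itself, and increment_with_retry is a name invented for this example; on ARM targets that have the exclusive instructions, compilers will typically lower it to a very similar loop.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Increment *counter and return the new value, retrying until the
 * store goes through. */
static uint32_t increment_with_retry(_Atomic uint32_t *counter) {
    /* Plays the role of ldrex: take a snapshot of the current value. */
    uint32_t observed = atomic_load(counter);

    /* Plays the role of strex plus the branch: the store only happens if
     * the value is still what we observed.  On failure (which, like strex,
     * may occasionally be spurious) `observed` is refreshed with the
     * current value and we go round again. */
    while (!atomic_compare_exchange_weak(counter, &observed, observed + 1)) {
        /* nothing to do, just retry */
    }
    return observed + 1;
}
```

The point to notice is that nothing is ever locked: we work optimistically and simply redo the work if somebody else got there first.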

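For our super-simple increment case, this will catch the vast majority of cases: in particular, a write from another core between the ldrex and the strex is something the hardware can detect.  The case it may miss is a context switch on the same core in the middle of the sequence, with the other thread then touching the same location.  So we add a "belt-and-braces" approach by making our task scheduler explicitly invalidate all outstanding exclusive accesses, using the clrex opcode, on every task switch.  This can make any in-progress ldrex-strex block restart (and thus take more time), but covers all the bases.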
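To see why the scheduler's clrex matters, here is a toy C model of a single core's exclusive monitor.  Everything in it is an assumption made for illustration: the monitor is modelled as one flag with no address tracking (real monitors are implementation defined), and all the model_* names and the scenario function are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy local monitor: one flag, no address tracking. */
static bool monitor_armed;

static uint32_t model_ldrex(const uint32_t *addr) {
    monitor_armed = true;
    return *addr;
}

/* Returns 0 on success and 1 on failure, mirroring the strex status. */
static uint32_t model_strex(uint32_t value, uint32_t *addr) {
    if (!monitor_armed) {
        return 1;
    }
    *addr = value;
    monitor_armed = false;
    return 0;
}

static void model_clrex(void) {
    monitor_armed = false;
}

/* Hypothetical scheduler hook, run on every task switch. */
static void model_context_switch(bool scheduler_uses_clrex) {
    if (scheduler_uses_clrex) {
        model_clrex();
    }
}

/* Replays one unlucky interleaving and returns the status of thread 1's
 * stale strex: 0 means the stale value was accepted, 1 means rejected. */
static uint32_t stale_strex_status(bool scheduler_uses_clrex) {
    uint32_t memory = 0;
    monitor_armed = false;

    uint32_t t1 = model_ldrex(&memory);      /* thread 1 reads 0 */
    model_context_switch(scheduler_uses_clrex);

    uint32_t t2 = model_ldrex(&memory);      /* thread 2 reads 0 */
    (void)model_strex(t2 + 1, &memory);      /* thread 2 stores 1 */
    (void)model_ldrex(&memory);              /* thread 2 starts another increment */
    model_context_switch(scheduler_uses_clrex);

    /* Thread 1 resumes and stores a value computed from the stale 0. */
    return model_strex(t1 + 1, &memory);
}
```

In this model, stale_strex_status(false) returns 0: thread 1's store rides on the monitor state left behind by thread 2's ldrex, and an increment is lost.  With the scheduler clearing the monitor, stale_strex_status(true) returns 1, so thread 1 would loop back and redo the increment with a fresh value.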

Now, that's all fine and well, but we may want to use this method in our C code, without resorting to subroutine calls (by their very nature, exclusive operations happen at a very low level, and probably want to be inlined).  So we're going to want to use some of that nastiest of nasties, inline assembler in gcc.  Believe me, it's vile.

Here's an implementation of an inline atomic increment using gcc inline assembler:

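#include <stdint.h>

/* Atomically increment *memory and return the new value. */
static inline uint32_t atomic_increment(uint32_t * memory) {
  uint32_t temp1, temp2;
  __asm__ __volatile__ (
    "1:\n"
    "\tldrex\t%[t1],[%[m]]\n"
    "\tadd\t%[t1],%[t1],#1\n"
    "\tstrex\t%[t2],%[t1],[%[m]]\n"
    "\tcmp\t%[t2],#0\n"
    "\tbne\t1b"
    : [t1] "=&r" (temp1), [t2] "=&r" (temp2)
    : [m] "r" (memory)
    : "cc", "memory"  /* cmp changes the flags; the asm writes to *memory */
  );
  return temp1;
}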


Horrible, no?  Note the use of local labels ('1:' and then branching to '1b', meaning the nearest label '1' looking backwards), having to use encoded tabs and newlines to stop the assembler itself barfing, and the horrible workaround of multiple names for the same variables because gcc is, quite simply, broken.  Still, it works, and the C optimiser can deal with it.
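For comparison, here is a sketch of the same operation without any inline assembler, assuming a compiler that provides the __sync builtins (gcc and clang both do).  The wrapper name atomic_increment_builtin is made up for this example.

```c
#include <stdint.h>

/* Atomically increment *memory and return the new value, leaving the
 * choice of instructions to the compiler. */
static inline uint32_t atomic_increment_builtin(uint32_t *memory) {
    return __sync_add_and_fetch(memory, 1);
}
```

Usage is the same as for the hand-written version: with a counter holding 41, atomic_increment_builtin(&counter) returns 42 and leaves 42 in memory.  On ARM targets with ldrex/strex the compiler would typically emit a retry loop much like the one above, though the exact code depends on the architecture level and compiler flags.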


If you want to get more complex, I'd suggest looking at the ARM site for the example implementations of mutexes and condition variables using ldrex/strex.  You'll have to deal with converting from ARM assembler to GNU, but as long as you don't try inlining them, you should be fine.

8 comments:

  1. Hmm, it sounds so complex with the clrex command. What do you mean by clearing it in the task scheduler? If I use the atomic_add for semaphores, does the task scheduler check all the semaphores and clear them? That sounds like a bit of work to do. Or didn't I get it? :-)

  2. I agree with Claus, this post makes it seem complex. And buggy - when you say ldrex/strex covers 99.99% of cases in this super-simple example, what happens with the remaining .01%? How do they fail, and how does clrex fix the problem? And if ldrex/strex sometimes fails when the case is this simple, does it fail more often or more horribly with more complicated cases?

    Replies
    1. There are two reasons the exclusivity could fail: (1) another core (or bus agent) could write the location between the ldrex/strex. This is detected by the hardware monitor, since it knows which core is issuing each access. (2) An interrupt could occur between the ldrex/strex, and the same core could then modify the location, either in an ISR or another thread, before the strex can be done in the original thread. That can't be detected by the monitor. So, clrex is used to reset all the outstanding exclusive accesses whenever a context switch occurs. What this means is - if an interrupt occurs between your ldrex and strex, then the strex will always fail, even though it maybe didn't need to. It's no big deal since the event is (a) quite rare and (b) not a lot of extra work anyway.

      So, it should be completely reliable if used as described here. You don't need to do any 'clrex' in application code since the OS scheduler does that. (If you are writing a low-level ISR then maybe you might need it there, I don't know)

  3. The code would be easier to read if you replaced \t with spaces or real tabs. Also, you can reduce the number of double-quotes by continuing the line with a final backslash. At the assembly level, you should look at cbnz, this replaces your last two lines. Finally, I don't think you actually compiled this code: the extra set of square brackets around the memory operands are wrong.

    Replies
    1. One could also remove the label, cmp, bne, and wrap the asm in "do { ... }while( temp2 != 0);" so the compiler will use whatever branch is best on the specified machine. Also, I'd make it 'uint32_t volatile *mem' , since any variable used across threads (and which you might pass the address-of to this function) should be 'volatile' (it won't affect how this asm is generated, but it would save you having to 'cast away' volatile when invoking it).

  4. Konz

    The code compiles fine. The reason for not using cbnz, by the way, is that it's specific to the Thumb instruction set.

  5. Found a better way: the entire thing may be written as

    return __sync_add_and_fetch( memory, 1);

    This is a compiler builtin (supported on gcc and clang, at least) which on ARM will be the same "ldrex, add 1, strex, retry while the result != 0" construct. And it's portable; on x86 it's done with a locked 'xaddl' instruction.

    See https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

