Sunday, December 6, 2009

Why Assembler is better

Following on from my last post, I thought I'd expand a little on how assembler is better than C. On "real" computers, it's often said that a C or C++ compiler will, in the vast majority of cases, produce better code than you can write by hand in assembler. This is largely true, especially when you're dealing with multiple cores and the like.

On microcontrollers, however, and especially 8 / 16 bit ones, the C compilers aren't "all that", and the additional overhead imposed by a C compiler can kill your application stone dead.

Let's take a concrete example.

In my (very non-optimal) assembler example posted before, I need to clear the interrupt pending flag for the Timer interrupt I'm servicing. This is easily done in assembler with a one-line, one-cycle instruction:

bres        0x5255, #0x00   ; Clear TIM1 Interrupt pending bit


Simple, right? Now, let's look at what the C compiler gives us.

Here's the "C" code we use:

// Clear the interrupt pending bit for TIM1.
TIM1_ClearITPendingBit(TIM1_IT_UPDATE);


Simple enough, right? There's obviously the overhead of a function call, but we might expect the guts of the function to do a simple bit of inline assembler as above. Let's look.

void TIM1_ClearITPendingBit(TIM1_IT_TypeDef TIM1_IT)
{
    /* Check the parameters */
    assert_param(IS_TIM1_IT_OK(TIM1_IT));

    /* Clear the IT pending Bit */
    TIM1->SR1 = (u8)(~(u8)TIM1_IT);
}


So, let's look at what that C code produces.


main.c:64          TIM1_ClearITPendingBit(TIM1_IT_UPDATE);
0x91c3                          0xA601    LD   A,#0x01
0x91c5                          0xCD8C9C  CALL 0x8c9c   ; _TIM1_ClearITPendingBit

...

stm8s_tim1.c:2156  TIM1->SR1 = (u8)(~(u8)TIM1_IT);
0x8c9c <.ClearITPendingBit>     0x43      CPL  A
0x8c9d <.earITPendingBit+1>     0xC75255  LD   0x5255,A
stm8s_tim1.c:2157  }
0x8ca0 <.earITPendingBit+4>     0x81      RET


So, we load the accumulator with a value: 1 cycle. Call the function: 4 cycles. Complement the accumulator: 1 cycle. Store the accumulator to the status register at 0x5255: 1 cycle. Return from the function: 4 cycles.

In total, that's 11 cycles and 10 bytes to do the same thing we did in one cycle and 4 bytes (bres with a 16-bit address is a 4-byte instruction on the STM8). "Inlining" the bres doesn't make our code any fatter, either: its 4 bytes are fewer than the 5 bytes (2 for the LD, 3 for the CALL) that every call site spends just getting to the function.
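If you want to check the arithmetic, here's a small tally written as plain C you can compile on a PC. The cycle counts are the ones quoted above and the byte counts are read straight off the opcode column of the listing; the struct and function names are just made up for this sketch.

```c
#include <stddef.h>

/* One row per instruction the C version executes. Cycle counts are the
   ones quoted in the text; byte counts come from the opcode column. */
struct insn {
    const char *text;
    int cycles;
    int bytes;
};

static const struct insn c_path[] = {
    { "LD   A,#0x01",  1, 2 },  /* 0xA601   */
    { "CALL 0x8c9c",   4, 3 },  /* 0xCD8C9C */
    { "CPL  A",        1, 1 },  /* 0x43     */
    { "LD   0x5255,A", 1, 3 },  /* 0xC75255 */
    { "RET",           4, 1 },  /* 0x81     */
};

#define C_PATH_LEN (sizeof c_path / sizeof c_path[0])

/* Sum the cycle column of a listing. */
static int total_cycles(const struct insn *p, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].cycles;
    return sum;
}

/* Sum the byte column of a listing. */
static int total_bytes(const struct insn *p, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].bytes;
    return sum;
}
```

Calling total_cycles(c_path, C_PATH_LEN) and total_bytes(c_path, C_PATH_LEN) gives the 11 cycles and 10 bytes for the C route.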

What we do lose is readability, but that's easily enough got back by writing macros.
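As a rough idea of what such a macro could look like, here's a minimal sketch using C preprocessor macros rather than assembler macro syntax, so it can be tried out on a PC. All of the macro names, and the STM8_TARGET build flag, are invented for this example and don't come from the ST library; only the 0x5255 address and bit 0 come from the bres line earlier. Whether a particular compiler boils this down to a single bres is an assumption you'd want to confirm in the listing.

```c
#include <stdint.h>

/* Clear one bit of an 8-bit register with a read-modify-write. */
#define REG_CLEAR_BIT(reg, bit)  ((reg) &= (uint8_t)~(1u << (bit)))

/* Hypothetical name: bit 0 is the one the bres line clears. */
#define TIM1_SR1_UPDATE_BIT  0

#ifdef STM8_TARGET  /* hypothetical build flag for the real hardware */
/* Memory-mapped register at the address used in the bres line. */
#define TIM1_SR1  (*(volatile uint8_t *)0x5255)
#else
/* On a PC, an ordinary variable stands in for the register. */
static volatile uint8_t fake_tim1_sr1;
#define TIM1_SR1  fake_tim1_sr1
#endif

/* Readable name for the one-liner the interrupt handler needs. */
#define TIM1_CLEAR_UPDATE_FLAG()  REG_CLEAR_BIT(TIM1_SR1, TIM1_SR1_UPDATE_BIT)
```

The interrupt handler then just says TIM1_CLEAR_UPDATE_FLAG(); with no function call involved, since the macro expands in place.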
