Sunday, December 6, 2009

Performance, performance, performance

Let's look at improving the performance of the synthesis code I had written. It's pretty nippy, but we can shave some cycles off it, I'm sure. Remember, we're running an interrupt at 32kHz, and then using a divider per "virtual oscillator" to synthesize our waveform; given that we toggle the volume up or down per interrupt, that gives us a maximum frequency of 16kHz. On top of that, the high frequency range response is going to be pretty poor, as illustrated below

"Ticks"  Frequency  Note (approx)
0x0000 = 16kHz
0x0001 = 8kHz
0x0002 = 4kHz = C8
0x0003 = 2.6kHz = E7
0x0004 = 2 kHz = C7
0x0005 = 1.6kHz = G6
0x0006 = 1.3kHz = E6
0x0007 = 1.143kHz = D6
0x0008 = 1kHz = C6
0x0009 = 0.89kHz = A5


As can be seen, we don't even begin to hit (for varying values of "hit", those with perfect pitch need not apply) every note until we get down pretty low. If we up our interrupt rate to 64kHz, we get the following:

"Ticks"  Frequency  Note (approx)
0x0000 = 32kHz
0x0001 = 16kHz
0x0002 = 8kHz
0x0003 = 5.3kHz
0x0004 = 4 kHz = C8
0x0005 = 3.2kHz = G7
0x0006 = 2.6kHz = E7
0x0007 = 2.3kHz = D7
0x0008 = 2kHz = C7
0x0009 = 1.78kHz = A6
0x000a = 1.6kHz = G6


That gives us a much better spread, and still leaves us plenty of sub-audio / LFO range at the bottom. However, 64kHz only leaves us 256 cycles to play with, and our existing code only leaves us about 50 cycles of headroom for other code, like dealing with user interface and actually twiddling frequencies. Not good enough. Also, it would be nice to have per-channel fading.

So, first off, let's look at the original code.

f_timer_interrupt:
clr _volume ; 1 ; volume to zero
ld a, #0x07 ; 1 ; set initial channel flag
ldw x, #_channels ; 2 ; load y register with address of channel data
dochannel:
ldw y, x ; 1 ; Load y register with ticks_left
ldw y, (y) ; 2
jrmi br3 ; 1 / 2 ; skip if channel is off (bit 15 of ticks_left set)
decw y ; 2 ; decrement
jrpl br1 ; 1 / 2 ; if not negative, skip
ldw y, x ; 1 ; reset ticks_left
ldw y, (0x02, y) ; 2
ldw (x), y ; 2 ; store ticks left
jra br2 ; 2 ; and skip unnecessary work
br1: ldw (x), y ; 2 ; store ticks_left
ldw y, x ; 1 ; get number of ticks per cycle
ldw y, (0x02, y) ; 2
br2: srlw y ; 2 ; divide by 2
cpw y, (x) ; 2 ; compare with ticks_left
jrmi br3 ; 1 / 2
inc _volume ; 1 ; increment volume
br3: addw x, #0x0004 ; 2 ; go 4 bytes up the channel data list
dec a ; 1
jrpl dochannel ; 1 / 2 ; go around if we have more channels to do

bres 0x5255, #0x00 ; 1 ; Clear TIM1 Interrupt pending bit
iret ; 11 ; and return


Okay, it's pretty good, but an obvious optimisation would be to "unroll the loop", which saves us a few cycles per channel. Unfortunately, this is gonna make our code unreadable and unmaintainable, a great long string of assembler.

That's what macros are made for. Macros are a bit like the C preprocessor on steroids, you can define a bunch of "inlined" code as a macro, that will be substituted into the code.

So, what we're going to do is write a macro that deals with any particular channel. Having 8 instances of a macro avoids the loop, without rewriting the same code over and over again. Nice. On top of that, we can pass in the "channel base address" to the macro, and avoid all the "ld y, x; ld y,(y)" tango we were required to do. That's good for a load of cycles per iteration, and gives us enough headroom to do channel fading as well.

So, let's redefine our structures, adding a per-channel volume control. We'll use this, instead of the hokey "top bit" approach taken previously, to know if we can exit fast.

typedef struct {
u8 volume;
u16 ticks_left;
u16 ticks;
} channel;


Now, in our assembler code, we change the way the zero-page variables are laid out, to match.
switch .ubsct
_channels:
chan0: ds.b 5
chan1: ds.b 5
chan2: ds.b 5
chan3: ds.b 5
chan4: ds.b 5
chan5: ds.b 5
chan6: ds.b 5
chan7: ds.b 5
_volume: ds.b 1
; make them visible to C code
xdef _channels
xdef _volume

We keep the "top" label for C, and then we have individual channel labels for the assembler code (we could use simple math, but this makes things more explicit).

Now, the macro itself.
dochan: macro \chan
ld a, \chan ; 1 ; move volume to accumulator
jreq \@done ; 1 / 2 ; quit if zero volume
ldw x, \chan + 1 ; 2 ; get ticks_left
ldw y, \chan + 3 ; 2 ; load ticks
decw x ; 2 ; decrement ticks_left
jrpl \@br1 ; 1 / 2 ; branch if still ticking
ldw x, y ; 1 ; load ticks
\@br1: ldw \chan + 1, x ; 2 ; store new ticks_left
srlw y ; 2 ; divide ticks by 2
cpw y, \chan + 1 ; 2 ; compare with ticks_left
jrmi \@done ; 1 / 2 ;
add a, _volume ; 1 ; add volume
ld _volume, a ; 1 ; and store
\@done:
endm

Passing in a base address, this will handle the calculations for a single channel. Best case performance (channel muted) is a mere 3 cycles, best case (channel non muted) is 18 cycles, worst case is 19 cycles. Any given channel will spend half of its time at 18 cycles and half at 19 cycles, giving us an average of 18.5 cycles / non-muted channel. There's probably a few more cycles to shave off there, too.

Calling the macro is simple.
f_timer_interrupt:
clr _volume ; 1 ; volume to zero
dochan chan0 ; 18.5
dochan chan1 ; 18.5
dochan chan2 ; 18.5
dochan chan3 ; 18.5
dochan chan4 ; 18.5
dochan chan5 ; 18.5
dochan chan6 ; 18.5
dochan chan7 ; 18.5
bres 0x5255, #0x00 ; 1 ; Clear TIM1 Interrupt pending bit
iret ; 11 ; and return

As can be seen, the static overhead is now down to 13 cycles, so assuming all channels are "on", worst case time taken is going to be 13 + (19 * channels). For 8 channels, this is 165 cycles (161 cycles average). That fits quite neatly into the 256 cycles we have per interrupt, with 90-odd cycles left over. Not only that, but we can easily trim down the number of channels to fit if needs be - if we wanted, for example, 4 channels would take only 89 cycles, or 6 channels (for 4 drones and 2 LFOs) 127 cycles, or 50% CPU.

That's pretty good. But we can do better.

Let's think about how we are going to go about getting audio off the board. An R-2R network would work, but uses up to 8 pins and a bunch of additional hardware - why do extra work when we can let the processor do the hard lifting for us? What we need is to use hardware PWM, and all we need then is a single pin and a lowpass filter. As this is all going to be thrown out to a set of analog filters anyway, a single-pin approach seems more than reasonable. Now, we're already doing PWM for the LED brightness, so why not generalise that?

So. First of all, we set Timer 3 to a prescaler of one and period of 256. That gives us 8-bit PWM, straight off the bat. We can now do away with the 'volume' variable, and rather than storing stuff in and out of memory, do it all in the accumulator, dumping that direct to the timer at the end of the interrupt. We lose the extreme time saving of skipping when a channel is set to volume zero, but we gain speed and cycles overall.

Here's the new assembler code:


dochan: macro \chan
ldw x, \chan + 1 ; 2 ; get ticks_left
ldw y, \chan + 3 ; 2 ; load ticks
decw x ; 2 ; decrement ticks_left
jrpl \@br1 ; 1 / 2 ; branch if still ticking
ldw x, y ; 1 ; load ticks
\@br1: ldw \chan + 1, x ; 2 ; store new ticks_left
srlw y ; 2 ; divide ticks by 2
cpw y, \chan + 1 ; 2 ; compare with ticks_left
jrmi \@done ; 1 / 2 ;
add a, \chan ; 1 ; add volume
\@done:
endm

; 14 + (16 * 8) = 142 cycles
f_timer_interrupt:
clr a ; 1 ; initial volume to 0
dochan chan0 ;
dochan chan1 ;
dochan chan2 ;
dochan chan3 ;
dochan chan4 ;
dochan chan5 ;
dochan chan6 ;
dochan chan7 ;
ld 0x5330, a ; 1 ; Store volume in TIM3 PWM register
bres 0x5255, #0x00 ; 1 ; Clear TIM1 Interrupt pending bit
iret ; 11 ; and return


That gets us down to 16 cycles per channel in *all* cases, leaving us with a total of 142 cycles for the entire interrupt (just over 50% of CPU), with 8 channels.

Our main loop is, of course, now reduced to a simple

while(1);


Moving from flashy LED action to annoying drony noise action is a simple case of switching to TIM3_OC1, which outputs its PWM on Port D pin 2 - hooking a speaker across CN4 pin 7 (PD2) and CN1 Pin 4 (GND) indeed gives us an annoying drony beaty noise after pushing some of the virtual oscillator frequencies into the audio range.

Software Synthesis. Gotta love it.

Why Assembler is better

Following on from my last post, I thought I'd expand a little on how assembler is better than C. On "real" computers, it's often said that a C or C++ compiler will, in the vast majority of cases, produce better code than you can do by hand in assembler. This is largely true, especially when you're dealing with multiple cores and the like.

On microcontrollers, however, and especially 8 / 16 bit ones, the C compilers aren't "all that", and the additional overhead imposed by a C compiler can kill your application stone dead.

Let's take a concrete example.

In my (very non-optimal) assembler example posted before, I need to clear the interrupt pending flag for the Timer interrupt I'm servicing. This is easily done in assembler, it's a one-line, one clock cycle instruction, as follows:

bres        0x5255, #0x00   ; Clear TIM1 Interrupt pending bit


Simple, right? Now, let's look at what the C compiler gives us.

Here's the "C" code we use:

// Clear the interrupt pending bit for TIM1.
TIM1_ClearITPendingBit(TIM1_IT_UPDATE);


Simple enough, right? There's obviously the overhead of a function call, but we might expect the guts of the function to do a simple bit of inline assembler as above. Let's look.

void TIM1_ClearITPendingBit(TIM1_IT_TypeDef TIM1_IT)
{
/* Check the parameters */
assert_param(IS_TIM1_IT_OK(TIM1_IT));

/* Clear the IT pending Bit */
TIM1->SR1 = (u8)(~(u8)TIM1_IT);
}


So, let's look at what that C code produces.


main.c:64 TIM1_ClearITPendingBit(TIM1_IT_UPDATE);
0x91c3 0xA601 LD A,#0x01 LD A,#0x01
0x91c5 0xCD8C9C CALL 0x8c9c CALL _TIM1_ClearITPendingBit

...

stm8s_tim1.c:2156 TIM1->SR1 = (u8)(~(u8)TIM1_IT);
0x8c9c <.ClearITPendingBit> 0x43 CPL A CPL A
0x8c9d <.earITPendingBit+1> 0xC75255 LD 0x5255,A LD 0x5255,A
stm8s_tim1.c:2157 }
0x8ca0 <.earITPendingBit+4> 0x81 RET RET


So, we load the accumulator with a value, that's one cycle. Call a function, 4 cycles. Complement the accumulator, one cycle. Store the accumulator in the flag, 1 cycle. Return from function, 4 cycles.

In total, that's 11 cycles and 10 bytes to do the same thing we did in one cycle and 3 bytes. "inlining" this function doesn't make our code any fatter, either, as the 3 bytes we're using are the same as the 3 bytes we would have used to call the function.

What we do lose is readability, but that's easily enough got back by writing macros.

Saturday, December 5, 2009

Mixing Assembler and C on the STM8S-Discovery board

So, been playing with my STM8S-discovery boards, here's some stuff you might like. This builds on Ben's stuff here - http://www.benryves.com/journal/3567231

Now, one of my projects is to do with making a multi-channel drone synthesizer. I should really do it with discrete logic and amps, but hey, it's a quickie. What I want to do is 8 channels of audio-range square waves, at differing and varying frequencies. How we go about varying the frequencies is another matter, but the main problem is doing 8 channels - I could do 4 with just the timers, but 8 is pretty tricksy.

Basically, the approach I'm taking is to use a timer firing at a regular frequency, and do additive synthesis to make the final output volume (I'll initially be using a simple approach with no fading of the 8 pseudo-oscillators, so the maximum output volume is 8; that lets us get away with a 4 pin R-2R DAC (but more on that later).

Now, we need to work out how much we can do, and how long we have to do it. The clock of the disco board runs at up to 16 MHz, so we can work out easily enough what frequencies we can call our hypothtical interrupt at. The prescaler value gives us, direcly, the number of clocks we have to play with at that frequency

  1. 1024 = 16kHz
  2. 512 = 32kHz
  3. 256 = 64kHz
  4. 128 = 128kHz
and so on. Bearing in mind that human audio range peaks at ~20kHz, we'd probably like to be using a 32kHz or better clock. Indeed, due to the way my software works, I can only generate up to 16kHz with a 32kHz clock, and there's a really coarse granularity on the high end - it's basically not much cop above 1kHz or so, around 2 octaves up from middle C.

Anyway, 512 cycles is not much to play with. We're gonna need some assembler. C code is waaaay too fat. My first cut interrupt routine has a "worst case" of just over 200 cycles, so that cuts out 64kHz if we want to do anything clever elsewhere. And we do. So, 32kHz it is, and my interrupt is gonna use around 1/3 of total CPU all by itself. I should be able to streamline it a bit (or even a lot), but I doubt it's ever going to be able to run at 64kHz. As an idea of how fat C code is, the equivalent C routine had a worst case of 500 cycles. Yowza.

So, how do we make our assembler code work with existing "C"? Well, first thing to do is take a trivial program and compile it; in the "Debug" folder of the project you'll find the generated assembler code (for example, main.ls) - look at this and you can find out how the name mangling works.

In my case, I was interested in the names of functions and global variables, so I compiled up a quick program with one global variable and one function.

For a function name of "MyFunction", the actual label generated in the assembler (and thus the format of the name we need to use) is "f_MyFunction". Groovy. Variables have an underscore, so "my_var" becomes "_my_var". Now we know that, we can do some work.

First, we edit the interrupt table, adding an "extern" reference to our assembler routine. This stops the C compiler kicking up a fuss.
extern @far @interrupt void timer_interrupt(void);
Now, we add a reference to that routine to the table itself.
...
{0x82, timer_interrupt}, /* irq11 */
...
I'm using irq11 because that's the timer 1 overflow interrupt.

Now. I'm going to need access to the interrupt's data from elsewhere, so we need to add some references to them, too. I've shoved them in main.c.
// For our purposes
typedef struct {
u16 ticks_left;
u16 ticks;
} channel;

// Extern, these are defined in synth.asm
extern channel channels[8];
extern u8 volume;
So, we have an array of 8 "channel data" structures, with "ticks" defining the number of interrupt "ticks" before we roll over, and "ticks_left" being the number of interrupt ticks this channel has had since last rollover. Pretty simple, really. There's also a single 8-bit "volume" field. Note the "extern" on both of these - this tells the compiler they are actually defined elsewhere.

As I haven't bothered to wire up an R2R ladder on my board yet, I decided to test with a bit of PWM on the LED. this follows on directly from Ben's stuff.

main()
{
// Set the internal high-speed oscillator to 1 to run at 16/1=16MHz.
CLK_HSIPrescalerConfig(CLK_PRESCALER_HSIDIV1);

// Reset ("de-initialise") TIM1.
TIM1_DeInit();
// Set TIM1 to use a prescaler of 512 and to have a period of 1.
TIM1_TimeBaseInit(512, TIM1_COUNTERMODE_UP, 1, 0);
// Set TIM1 to generate interrupts every time the counter overflows. With
// prescaling of 512, this is a frequency of 32kHz
TIM1_ITConfig(TIM1_IT_UPDATE, ENABLE);
// Enable TIM1.
TIM1_Cmd(ENABLE);

// For visualisation only
// Reset ("de-initialise") TIM3.
TIM3_DeInit();
// Set TIM3 to use a prescaler of 1 and have a period of 999.
TIM3_TimeBaseInit(TIM3_PRESCALER_1, 999);
// Initialise output channel 2 of TIM3.
TIM3_OC2Init(TIM3_OCMODE_PWM1, TIM3_OUTPUTSTATE_ENABLE, 0, TIM3_OCPOLARITY_LOW);

// Enable TIM3.
TIM3_Cmd(ENABLE);

// Set up our channels
channels[0].ticks_left = 0;
channels[1].ticks_left = 0;
channels[2].ticks_left = 0;
channels[3].ticks_left = 0;
channels[4].ticks_left = 0;
channels[5].ticks_left = 0;
channels[6].ticks_left = 0;
channels[7].ticks_left = 0;
channels[0].ticks = 0x4000;
channels[1].ticks = 0x4010;
channels[2].ticks = 0x4020;
channels[3].ticks = 0x4040;
channels[4].ticks = 0x4080;
channels[5].ticks = 0x4100;
channels[6].ticks = 0x4200;
channels[7].ticks = 0x4400;

enableInterrupts();

while (1) {
TIM3_SetCompare2(volume * (1000 / 8));
}
}

What's interesting is the setting of the various channels to different (very low frequency) rollover periods - it starts off with everything together, and the channels slowly end up "beating" against one another, which does "interesting" stuff to the LED. The LED itself is PWM controlled to set its brightness to one of 8 levels.

Of course, none of this is any good without the interrupt routine itself. This goes in a totally separate file with a ".asm" extension. The toolkit knows how to deal with it.

; names mangled to C code specs
; uninitialised data in page zero
switch .ubsct
_channels:
ds.w 16
_volume:
ds.b 1

; make them visible to C code
xdef _channels
xdef _volume

; code
switch .text
f_timer_interrupt:
clr _volume ; 1 ; volume to zero
ld a, #0x07 ; 1 ; set initial channel flag
ldw x, #_channels ; 2 ; load y register with address of channel data
dochannel:
ldw y, x ; 1 ; Load y register with ticks_left
ldw y, (y) ; 2
jrmi br3 ; 1 / 2 ; skip if channel is off (bit 15 of ticks_left set)
decw y ; 2 ; decrement
jrpl br1 ; 1 / 2 ; if not negative, skip
ldw y, x ; 1 ; reset ticks_left
ldw y, (0x02, y) ; 2
ldw (x), y ; 2 ; store ticks left
jra br2 ; 2 ; and skip unnecessary work
br1: ldw (x), y ; 2 ; store ticks_left
ldw y, x ; 1 ; get number of ticks per cycle
ldw y, (0x02, y) ; 2
br2: srlw y ; 2 ; divide by 2
cpw y, (x) ; 2 ; compare with ticks_left
jrmi br3 ; 1 / 2
inc _volume ; 1 ; increment volume
br3: addw x, #0x0004 ; 2 ; go 4 bytes up the channel data list
dec a ; 1
jrpl dochannel ; 1 / 2 ; go around if we have more channels to do

bres 0x5255, #0x00 ; 1 ; Clear TIM1 Interrupt pending bit
iret ; 11 ; and return

; make function visible to C code
xdef f_timer_interrupt
You'll note the name mangling, and the "xdef" lines to make the labels externally visible. Also, that I've somewhat anally added the number of cycles per operation to the source.

This routine allows me to kill any one channel at the next call to the interrupt by setting its "ticks" value to 0x8000, and "ticks_left" to 0x0000.

Hope this is of some use to people.

Followers