Sunday, December 6, 2009

Performance, performance, performance

Let's look at improving the performance of the synthesis code I had written. It's pretty nippy, but we can shave some cycles off it, I'm sure. Remember, we're running an interrupt at 32kHz, and then using a divider per "virtual oscillator" to synthesize our waveform; given that we toggle the volume up or down per interrupt, that gives us a maximum frequency of 16kHz. On top of that, the high frequency range response is going to be pretty poor, as illustrated below:

"Ticks"  Frequency  Note (approx)
0x0000   16kHz
0x0001   8kHz
0x0002   4kHz       C8
0x0003   2.6kHz     E7
0x0004   2kHz       C7
0x0005   1.6kHz     G6
0x0006   1.3kHz     E6
0x0007   1.143kHz   D6
0x0008   1kHz       C6
0x0009   0.89kHz    A5

As can be seen, we don't even begin to hit (for varying values of "hit", those with perfect pitch need not apply) every note until we get down pretty low. If we up our interrupt rate to 64kHz, we get the following:

"Ticks"  Frequency  Note (approx)
0x0000   32kHz
0x0001   16kHz
0x0002   8kHz
0x0003   5.3kHz
0x0004   4kHz       C8
0x0005   3.2kHz     G7
0x0006   2.6kHz     E7
0x0007   2.3kHz     D7
0x0008   2kHz       C7
0x0009   1.78kHz    A6
0x000a   1.6kHz     G6

That gives us a much better spread, and still leaves us plenty of sub-audio / LFO range at the bottom. However, 64kHz only leaves us 256 cycles to play with, and our existing code only leaves us about 50 cycles of headroom for other code, like dealing with user interface and actually twiddling frequencies. Not good enough. Also, it would be nice to have per-channel fading.

So, first off, let's look at the original code.

clr _volume ; 1 ; volume to zero
ld a, #0x07 ; 1 ; set initial channel flag
ldw x, #_channels ; 2 ; load x register with address of channel data
dochannel: ldw y, x ; 1 ; Load y register with ticks_left
ldw y, (y) ; 2
jrmi br3 ; 1 / 2 ; skip if channel is off (bit 15 of ticks_left set)
decw y ; 2 ; decrement
jrpl br1 ; 1 / 2 ; if not negative, skip
ldw y, x ; 1 ; reset ticks_left
ldw y, (0x02, y) ; 2
ldw (x), y ; 2 ; store ticks left
jra br2 ; 2 ; and skip unnecessary work
br1: ldw (x), y ; 2 ; store ticks_left
ldw y, x ; 1 ; get number of ticks per cycle
ldw y, (0x02, y) ; 2
br2: srlw y ; 2 ; divide by 2
cpw y, (x) ; 2 ; compare with ticks_left
jrmi br3 ; 1 / 2
inc _volume ; 1 ; increment volume
br3: addw x, #0x0004 ; 2 ; go 4 bytes up the channel data list
dec a ; 1
jrpl dochannel ; 1 / 2 ; go around if we have more channels to do

bres 0x5255, #0x00 ; 1 ; Clear TIM1 Interrupt pending bit
iret ; 11 ; and return

Okay, it's pretty good, but an obvious optimisation would be to "unroll the loop", which saves us a few cycles per channel. Unfortunately, this is gonna make our code unreadable and unmaintainable, a great long string of assembler.

That's what macros are made for. Macros are a bit like the C preprocessor on steroids: you can define a bunch of "inlined" code as a macro, which will be substituted into the code wherever you use it.

So, what we're going to do is write a macro that deals with any particular channel. Having 8 instances of a macro avoids the loop, without rewriting the same code over and over again. Nice. On top of that, we can pass in the "channel base address" to the macro, and avoid all the "ldw y, x; ldw y, (y)" tango we were required to do. That's good for a load of cycles per iteration, and gives us enough headroom to do channel fading as well.
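For readers more at home in C, the same trick looks roughly like this with the preprocessor. This is only an analogy, not the synth code: the names level, DO_CHAN and sum_unrolled are made up for illustration. Because each macro instance names its channel with a literal, the compiler can use a fixed address rather than stepping a pointer through a loop.

```c
#include <assert.h>
#include <stdint.h>

#define CHANNELS 8
static_assert(CHANNELS == 8, "the unrolled body below assumes 8 channels");

/* Hypothetical per-channel data, standing in for the real channel state. */
static uint8_t level[CHANNELS];

/* The work for one channel. n is a literal at every use, so level[n]
 * resolves to a fixed address at compile time. */
#define DO_CHAN(n) do { total = (uint8_t)(total + level[n]); } while (0)

static uint8_t sum_unrolled(void)
{
    uint8_t total = 0;
    /* Eight instances of the macro instead of a loop. */
    DO_CHAN(0); DO_CHAN(1); DO_CHAN(2); DO_CHAN(3);
    DO_CHAN(4); DO_CHAN(5); DO_CHAN(6); DO_CHAN(7);
    return total;
}
```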

So, let's redefine our structures, adding a per-channel volume control. We'll use this, instead of the hokey "top bit" approach taken previously, to know if we can exit fast.

typedef struct {
u8 volume;
u16 ticks_left;
u16 ticks;
} channel;

Now, in our assembler code, we change the way the zero-page variables are laid out, to match.
switch .ubsct
_channels:
chan0: ds.b 5
chan1: ds.b 5
chan2: ds.b 5
chan3: ds.b 5
chan4: ds.b 5
chan5: ds.b 5
chan6: ds.b 5
chan7: ds.b 5
_volume: ds.b 1
; make them visible to C code
xdef _channels
xdef _volume

We keep the "top" label for C, and then we have individual channel labels for the assembler code (we could use simple math, but this makes things more explicit).
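The assembler relies on each channel occupying exactly 5 bytes, with ticks_left at offset 1 and ticks at offset 3. A quick way to sanity-check that layout on a desktop compiler is sketched below. The #pragma pack lines are an assumption for host compilers, which would otherwise pad the struct; on an 8-bit target with no alignment requirements they should not be needed, but that is worth checking against your own toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;

#pragma pack(push, 1)   /* host-only assumption: forbid padding */
typedef struct {
    u8  volume;         /* offset 0 */
    u16 ticks_left;     /* offset 1 */
    u16 ticks;          /* offset 3 */
} channel;
#pragma pack(pop)

/* Compile-time checks that the C view matches the "ds.b 5" layout. */
static_assert(offsetof(channel, volume) == 0, "volume must be first");
static_assert(offsetof(channel, ticks_left) == 1, "ticks_left at +1");
static_assert(offsetof(channel, ticks) == 3, "ticks at +3");
static_assert(sizeof(channel) == 5, "5 bytes per channel");
```

If any of these assertions fails, the C code and the assembler disagree about where the fields live.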

Now, the macro itself.
dochan: macro \chan
ld a, \chan ; 1 ; move volume to accumulator
jreq \@done ; 1 / 2 ; quit if zero volume
ldw x, \chan + 1 ; 2 ; get ticks_left
ldw y, \chan + 3 ; 2 ; load ticks
decw x ; 2 ; decrement ticks_left
jrpl \@br1 ; 1 / 2 ; branch if still ticking
ldw x, y ; 1 ; load ticks
\@br1: ldw \chan + 1, x ; 2 ; store new ticks_left
srlw y ; 2 ; divide ticks by 2
cpw y, \chan + 1 ; 2 ; compare with ticks_left
jrmi \@done ; 1 / 2 ;
add a, _volume ; 1 ; add volume
ld _volume, a ; 1 ; and store
\@done:
endm

Given a channel's base address, the macro handles the calculations for that one channel. Best case performance (channel muted) is a mere 3 cycles; for a non-muted channel the best case is 18 cycles and the worst case is 19 cycles. Any given channel will spend roughly half of its time at 18 cycles and half at 19 cycles, giving us an average of 18.5 cycles / non-muted channel. There are probably a few more cycles to shave off there, too.
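To make the logic easier to follow, here is a rough C model of what one instance of the macro does per interrupt. It is a sketch for reading and host-side experiments, not code from the project: the function name channel_step is made up, the fields are declared signed so that "went negative" is easy to express, and ticks is assumed to stay below 0x8000.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t volume;
    int16_t ticks_left;
    int16_t ticks;
} channel;

/* One interrupt's worth of work for one channel. Returns the amount
 * this channel adds to the output volume for this sample. */
static uint8_t channel_step(channel *c)
{
    assert(c != 0);
    if (c->volume == 0)              /* jreq: muted, quit without ticking */
        return 0;

    int16_t left = (int16_t)(c->ticks_left - 1);   /* decw x */
    if (left < 0)                    /* jrpl not taken: counter ran out */
        left = c->ticks;             /* reload from ticks */
    c->ticks_left = left;            /* store new ticks_left */

    /* cpw / jrmi: contribute nothing while ticks/2 < ticks_left */
    if ((c->ticks >> 1) < left)
        return 0;
    return c->volume;                /* add volume */
}
```

With volume 10, ticks 3 and ticks_left starting at 0, successive calls return 0, 0, 10, 10 and then repeat.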

Calling the macro is simple.
clr _volume ; 1 ; volume to zero
dochan chan0 ; 18.5
dochan chan1 ; 18.5
dochan chan2 ; 18.5
dochan chan3 ; 18.5
dochan chan4 ; 18.5
dochan chan5 ; 18.5
dochan chan6 ; 18.5
dochan chan7 ; 18.5
bres 0x5255, #0x00 ; 1 ; Clear TIM1 Interrupt pending bit
iret ; 11 ; and return

As can be seen, the static overhead is now down to 13 cycles, so assuming all channels are "on", worst case time taken is going to be 13 + (19 * channels). For 8 channels, this is 165 cycles (161 cycles average). That fits quite neatly into the 256 cycles we have per interrupt, with 90-odd cycles left over. Not only that, but we can easily trim down the number of channels to fit if needs be: 4 channels, for example, would take only 89 cycles, and 6 channels (for 4 drones and 2 LFOs) 127 cycles, or 50% CPU.
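The budget arithmetic is simple enough to put in a couple of helpers, which is handy when playing with channel counts. The constants come straight from the figures above; the function names are made up for illustration.

```c
#include <assert.h>

enum {
    CYCLES_PER_INTERRUPT = 256,  /* budget quoted above for the 64kHz interrupt */
    ISR_OVERHEAD         = 13,   /* clr + bres + iret */
    CHANNEL_WORST        = 19    /* worst case for one non-muted channel */
};

/* Worst-case cycles spent in the interrupt with this many channels on. */
static int isr_worst_case_cycles(int channels)
{
    assert(channels >= 0 && channels <= 8);
    return ISR_OVERHEAD + CHANNEL_WORST * channels;
}

/* Cycles left per interrupt for everything else. */
static int isr_headroom_cycles(int channels)
{
    return CYCLES_PER_INTERRUPT - isr_worst_case_cycles(channels);
}
```

For example, isr_worst_case_cycles(8) gives 165 and isr_headroom_cycles(8) gives 91, matching the numbers above.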

That's pretty good. But we can do better.

Let's think about how we are going to go about getting audio off the board. An R-2R network would work, but uses up to 8 pins and a bunch of additional hardware - why do extra work when we can let the processor do the heavy lifting for us? What we need is to use hardware PWM, and all we need then is a single pin and a lowpass filter. As this is all going to be thrown out to a set of analog filters anyway, a single-pin approach seems more than reasonable. Now, we're already doing PWM for the LED brightness, so why not generalise that?

So. First of all, we set Timer 3 to a prescaler of one and period of 256. That gives us 8-bit PWM, straight off the bat. We can now do away with the 'volume' variable, and rather than storing stuff in and out of memory, do it all in the accumulator, dumping that direct to the timer at the end of the interrupt. We lose the extreme time saving of skipping when a channel is set to volume zero, but we gain speed and cycles overall.

Here's the new assembler code:

dochan: macro \chan
ldw x, \chan + 1 ; 2 ; get ticks_left
ldw y, \chan + 3 ; 2 ; load ticks
decw x ; 2 ; decrement ticks_left
jrpl \@br1 ; 1 / 2 ; branch if still ticking
ldw x, y ; 1 ; load ticks
\@br1: ldw \chan + 1, x ; 2 ; store new ticks_left
srlw y ; 2 ; divide ticks by 2
cpw y, \chan + 1 ; 2 ; compare with ticks_left
jrmi \@done ; 1 / 2 ;
add a, \chan ; 1 ; add volume
\@done:
endm

; 14 + (16 * 8) = 142 cycles
clr a ; 1 ; initial volume to 0
dochan chan0 ;
dochan chan1 ;
dochan chan2 ;
dochan chan3 ;
dochan chan4 ;
dochan chan5 ;
dochan chan6 ;
dochan chan7 ;
ld 0x5330, a ; 1 ; Store volume in TIM3 PWM register
bres 0x5255, #0x00 ; 1 ; Clear TIM1 Interrupt pending bit
iret ; 11 ; and return

That gets us down to 16 cycles per channel in *all* cases, leaving us with a total of 142 cycles for the entire interrupt (just over 50% of CPU), with 8 channels.
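One thing to keep in mind with this version: the running total lives in an 8-bit accumulator, so the sum of the volumes of all channels that are "high" at the same moment wraps around past 255. The sketch below models just that mixing step in C. It is an illustration, not project code: the name mix_channels and the is_high flags are made up, standing in for the result of the ticks comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Sum the volumes of the channels currently in the "high" part of
 * their cycle, the way repeated 8-bit adds into the accumulator would. */
static uint8_t mix_channels(const uint8_t volume[],
                            const uint8_t is_high[],
                            int count)
{
    assert(count >= 0);
    uint8_t acc = 0;                              /* clr a */
    for (int i = 0; i < count; i++) {
        if (is_high[i])
            acc = (uint8_t)(acc + volume[i]);     /* wraps modulo 256 */
    }
    return acc;      /* the value that would be written to the PWM register */
}
```

Eight channels at volume 31 all high gives 248, which fits; two channels at 200 and 100 would come out as 44. Keeping the per-channel volumes small enough that their total never exceeds 255 avoids the wraparound.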

Our main loop is, of course, now reduced to a simple


Moving from flashy LED action to annoying drony noise action is a simple case of switching to TIM3_OC1, which outputs its PWM on Port D pin 2 - hooking a speaker across CN4 pin 7 (PD2) and CN1 Pin 4 (GND) indeed gives us an annoying drony beaty noise after pushing some of the virtual oscillator frequencies into the audio range.

Software Synthesis. Gotta love it.
