Mr Foo: Performance, performance, performance

Let's look at improving the performance of the synthesis code I had written. It's pretty nippy, but we can shave some cycles off it, I'm sure. Remember, we're running an interrupt at 32kHz, and then using a divider per "virtual oscillator" to synthesize our waveform; given that we toggle the volume up or down per interrupt, that gives us a maximum frequency of 16kHz. On top of that, the high frequency range response is going to be pretty poor, as illustrated below:

"Ticks"  Frequency  Note (approx)
0x0000 = 16kHz
0x0001 = 8kHz
0x0002 = 4kHz     = C8
0x0003 = 2.6kHz   = E7
0x0004 = 2kHz     = C7
0x0005 = 1.6kHz   = G6
0x0006 = 1.3kHz   = E6
0x0007 = 1.143kHz = D6
0x0008 = 1kHz     = C6
0x0009 = 0.89kHz  = A5

As can be seen, we don't even begin to hit every note (for varying values of "hit"; those with perfect pitch need not apply) until we get down pretty low. If we up our interrupt rate to 64kHz, we get the following:

"Ticks"  Frequency  Note (approx)
0x0000 = 32kHz
0x0001 = 16kHz
0x0002 = 8kHz
0x0003 = 5.3kHz
0x0004 = 4kHz     = C8
0x0005 = 3.2kHz   = G7
0x0006 = 2.6kHz   = E7
0x0007 = 2.3kHz   = D7
0x0008 = 2kHz     = C7
0x0009 = 1.78kHz  = A6
0x000a = 1.6kHz   = G6

That gives us a much better spread, and still leaves us plenty of sub-audio / LFO range at the bottom. However, 64kHz only leaves us 256 cycles to play with, and our existing code only leaves us about 50 cycles of headroom for other code, like dealing with the user interface and actually twiddling frequencies. Not good enough. Also, it would be nice to have per-channel fading.

So, first off, let's look at the original code.

f_timer_interrupt:
        clr     _volume         ; 1     ; volume to zero
        ld      a, #0x07        ; 1     ; set initial channel flag
        ldw     x, #_channels   ; 2     ; load x register with address of channel data
dochannel:
        ldw     y, x            ; 1     ; Load y register with ticks_left
        ldw     y, (y)          ; 2
        jrmi    br3             ; 1 / 2 ; skip if channel is off (bit 15 of ticks_left set)
        decw    y               ; 2     ; decrement
        jrpl    br1             ; 1 / 2 ; if not negative, skip
        ldw     y, x            ; 1     ; reset ticks_left
        ldw     y, (0x02, y)    ; 2
        ldw     (x), y          ; 2     ; store ticks left
        jra     br2             ; 2     ; and skip unnecessary work
br1:    ldw     (x), y          ; 2     ; store ticks_left
        ldw     y, x            ; 1     ; get number of ticks per cycle
        ldw     y, (0x02, y)    ; 2
br2:    srlw    y               ; 2     ; divide by 2
        cpw     y, (x)          ; 2     ; compare with ticks_left
        jrmi    br3             ; 1 / 2
        inc     _volume         ; 1     ; increment volume
br3:    addw    x, #0x0004      ; 2     ; go 4 bytes up the channel data list
        dec     a               ; 1
        jrpl    dochannel       ; 1 / 2 ; go around if we have more channels to do
               
        bres    0x5255, #0x00   ; 1     ; Clear TIM1 Interrupt pending bit
        iret                    ; 11    ; and return

Okay, it's pretty good, but an obvious optimisation would be to "unroll the loop", which saves us a few cycles per channel. Unfortunately, this is gonna make our code unreadable and unmaintainable, a great long string of assembler.

That's what macros are made for. Macros are a bit like the C preprocessor on steroids: you can define a bunch of "inlined" code as a macro, and it will be substituted into the code wherever you use it.

So, what we're going to do is write a macro that deals with any particular channel. Having 8 instances of a macro avoids the loop, without rewriting the same code over and over again. Nice. On top of that, we can pass in the "channel base address" to the macro, and avoid all the "ldw y, x; ldw y, (y)" tango we were required to do. That's good for a load of cycles per iteration, and gives us enough headroom to do channel fading as well.

So, let's redefine our structures, adding a per-channel volume control. We'll use this, instead of the hokey "top bit" approach taken previously, to know if we can exit fast.

typedef struct {
        u8      volume;
        u16     ticks_left;
        u16     ticks;
} channel;

Now, in our assembler code, we change the way the zero-page variables are laid out, to match.

switch .ubsct
_channels:
chan0:  ds.b    5
chan1:  ds.b    5
chan2:  ds.b    5
chan3:  ds.b    5
chan4:  ds.b    5
chan5:  ds.b    5
chan6:  ds.b    5
chan7:  ds.b    5
_volume: ds.b    1
; make them visible to C code
xdef _channels
xdef _volume

We keep the "top" label for C, and then we have individual channel labels for the assembler code (we could use simple math, but this makes things more explicit).
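The offsets the assembler code relies on (volume at +0, ticks_left at +1, ticks at +3) assume the C compiler packs the struct into 5 bytes with no padding, and that 16-bit values are stored high byte first, as the STM8 does. Here's a minimal C sketch of that layout, using a raw byte array so it behaves the same on any host. The helper names (chan_get16, chan_put16) and the offset constants are made up for illustration; they are not part of the original code.

```c
#include <stdint.h>

/* Hypothetical names for illustration; the offsets mirror the 5-byte record
   (u8 volume, u16 ticks_left, u16 ticks) reserved by "ds.b 5". */
enum { CH_VOLUME = 0, CH_TICKS_LEFT = 1, CH_TICKS = 3, CH_SIZE = 5 };

/* Read a 16-bit value stored high byte first (STM8 byte order). */
static uint16_t chan_get16(const uint8_t *chan, int offset)
{
    return (uint16_t)((chan[offset] << 8) | chan[offset + 1]);
}

/* Write a 16-bit value high byte first. */
static void chan_put16(uint8_t *chan, int offset, uint16_t value)
{
    chan[offset]     = (uint8_t)(value >> 8);
    chan[offset + 1] = (uint8_t)(value & 0xff);
}
```

If the C side and the assembler side ever disagree about padding or byte order, this is the first place to look.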

Now, the macro itself.

dochan: macro \chan
        ld      a, \chan        ; 1     ; move volume to accumulator
        jreq    \@done          ; 1 / 2 ; quit if zero volume
        ldw     x, \chan + 1    ; 2     ; get ticks_left
        ldw     y, \chan + 3    ; 2     ; load ticks
        decw    x               ; 2     ; decrement ticks_left
        jrpl    \@br1           ; 1 / 2 ; branch if still ticking
        ldw     x, y            ; 1     ; load ticks
\@br1:  ldw     \chan + 1, x    ; 2     ; store new ticks_left
        srlw    y               ; 2     ; divide ticks by 2
        cpw     y, \chan + 1    ; 2     ; compare with ticks_left
        jrmi    \@done          ; 1 / 2 ;
        add     a, _volume      ; 1     ; add volume
        ld      _volume, a      ; 1     ; and store
\@done:
endm

Passing in a base address, this will handle the calculations for a single channel. Best case (channel muted) is a mere 3 cycles; for a non-muted channel, best case is 18 cycles and worst case is 19 cycles. Any given channel will spend half of its time at 18 cycles and half at 19 cycles, giving us an average of 18.5 cycles / non-muted channel. There are probably a few more cycles to shave off there, too.
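For anyone who finds the assembler hard to follow, here's a rough C model of what one expansion of the macro does on each interrupt. It's a sketch for reasoning about the logic on a desktop machine, not code that runs on the chip; the function name dochan_model is made up, and the field types use stdint.h rather than the u8 / u16 typedefs.

```c
#include <stdint.h>

typedef struct {
    uint8_t  volume;      /* 0 = channel muted */
    uint16_t ticks_left;  /* counts down once per interrupt */
    uint16_t ticks;       /* reload value */
} channel;

/* One interrupt's worth of work for one channel. Returns the amount this
   channel adds to the mix: 0, or its volume. (Hypothetical helper.) */
static uint8_t dochan_model(channel *c)
{
    if (c->volume == 0)
        return 0;                                /* jreq: muted, quit early */

    uint16_t x = (uint16_t)(c->ticks_left - 1);  /* decw x */
    if (x & 0x8000)                              /* went negative: jrpl not taken */
        x = c->ticks;                            /* ldw x, y: reload */
    c->ticks_left = x;                           /* store new ticks_left */

    uint16_t y = c->ticks >> 1;                  /* srlw y: ticks / 2 */
    if ((uint16_t)(y - x) & 0x8000)              /* jrmi: ticks/2 - ticks_left < 0 */
        return 0;
    return c->volume;                            /* added into _volume */
}
```

Stepping a channel with volume 10 and ticks 3 through this model gives 0, 0, 10, 10 and then repeats, which is the square wave we're after.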

Calling the macro is simple.

f_timer_interrupt:
        clr     _volume         ; 1     ; volume to zero
        dochan  chan0           ; 18.5
        dochan  chan1           ; 18.5
        dochan  chan2           ; 18.5
        dochan  chan3           ; 18.5
        dochan  chan4           ; 18.5
        dochan  chan5           ; 18.5
        dochan  chan6           ; 18.5
        dochan  chan7           ; 18.5
        bres    0x5255, #0x00   ; 1     ; Clear TIM1 Interrupt pending bit
        iret                    ; 11    ; and return

As can be seen, the static overhead is now down to 13 cycles, so assuming all channels are "on", the worst-case time taken is 13 + (19 * channels). For 8 channels, this is 165 cycles (161 cycles on average). That fits quite neatly into the 256 cycles we have per interrupt, with 90-odd cycles left over. Not only that, but we can easily trim down the number of channels to fit if need be: 4 channels would take only 89 cycles, and 6 channels (for 4 drones and 2 LFOs) 127 cycles, or 50% CPU.
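The arithmetic above is easy to play with in a few lines of C. This is just the sums from the previous paragraph written out; the constant and function names are made up for illustration, and the 256-cycle budget is the figure quoted earlier, taken as given.

```c
#include <stdio.h>

enum {
    CYCLES_PER_INTERRUPT = 256, /* budget quoted in the post */
    STATIC_OVERHEAD      = 13,  /* clr + bres + iret */
    WORST_PER_CHANNEL    = 19   /* slow path through dochan */
};

/* Worst-case cycles spent in the interrupt for a given channel count. */
static int worst_case_cycles(int channels)
{
    return STATIC_OVERHEAD + WORST_PER_CHANNEL * channels;
}

/* Cycles left over for everything else, per interrupt. */
static int headroom(int channels)
{
    return CYCLES_PER_INTERRUPT - worst_case_cycles(channels);
}

/* Print the budget for 1 to 8 channels. */
static void print_budget(void)
{
    for (int n = 1; n <= 8; n++)
        printf("%d channels: %3d cycles, %3d spare\n",
               n, worst_case_cycles(n), headroom(n));
}
```

Calling print_budget() lists the trade-off for every channel count, which is handy when deciding how many oscillators to keep.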

That's pretty good. But we can do better.

Let's think about how we are going to go about getting audio off the board. An R-2R network would work, but uses up to 8 pins and a bunch of additional hardware - why do extra work when we can let the processor do the heavy lifting for us? What we need is to use hardware PWM, and all we need then is a single pin and a lowpass filter. As this is all going to be thrown out to a set of analog filters anyway, a single-pin approach seems more than reasonable. Now, we're already doing PWM for the LED brightness, so why not generalise that?

So. First of all, we set Timer 3 to a prescaler of one and period of 256. That gives us 8-bit PWM, straight off the bat. We can now do away with the 'volume' variable, and rather than storing stuff in and out of memory, do it all in the accumulator, dumping that direct to the timer at the end of the interrupt. We lose the extreme time saving of skipping when a channel is set to volume zero, but we gain speed and cycles overall.

Here's the new assembler code:


dochan: macro \chan
        ldw     x, \chan + 1    ; 2     ; get ticks_left
        ldw     y, \chan + 3    ; 2     ; load ticks
        decw    x               ; 2     ; decrement ticks_left
        jrpl    \@br1           ; 1 / 2 ; branch if still ticking
        ldw     x, y            ; 1     ; load ticks
\@br1:  ldw     \chan + 1, x    ; 2     ; store new ticks_left
        srlw    y               ; 2     ; divide ticks by 2
        cpw     y, \chan + 1    ; 2     ; compare with ticks_left
        jrmi    \@done          ; 1 / 2 ;
        add     a, \chan        ; 1     ; add volume
\@done:
endm

; 14 + (16 * 8) = 142 cycles
f_timer_interrupt:
        clr     a               ; 1     ; initial volume to 0
        dochan  chan0           ;
        dochan  chan1           ;
        dochan  chan2           ;
        dochan  chan3           ;
        dochan  chan4           ;
        dochan  chan5           ;
        dochan  chan6           ;
        dochan  chan7           ;
        ld      0x5330, a       ; 1     ; Store volume in TIM3 PWM register
        bres    0x5255, #0x00   ; 1     ; Clear TIM1 Interrupt pending bit
        iret                    ; 11    ; and return

That gets us down to 16 cycles per channel in *all* cases, leaving us with a total of 142 cycles for the entire interrupt (about 55% of CPU), with 8 channels.
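One thing to watch with this version: the add into the accumulator is an 8-bit add, so the running total wraps modulo 256 if the volumes of the channels that are "high" at the same moment sum past 255. Here's a small C sketch of that arithmetic. The helper names mix_model and volumes_fit are hypothetical, and checking the sum up front is a suggestion, not something the code above does.

```c
#include <stdint.h>

/* Sum the volumes of the channels that are currently "high", the way the
   8-bit accumulator would: the result wraps modulo 256. */
static uint8_t mix_model(const uint8_t *volume, const uint8_t *is_high, int n)
{
    uint8_t a = 0;                          /* clr a */
    for (int i = 0; i < n; i++)
        if (is_high[i])
            a = (uint8_t)(a + volume[i]);   /* add a, chan */
    return a;
}

/* True if all n volumes can be high at once without wrapping. */
static int volumes_fit(const uint8_t *volume, int n)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += volume[i];
    return sum <= 255;
}
```

With 8 channels, keeping each volume at 31 or below guarantees the total never exceeds 248, so the PWM value never wraps.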

Our main loop is, of course, now reduced to a simple

while(1);

Moving from flashy LED action to annoying drony noise action is a simple case of switching to TIM3_OC1, which outputs its PWM on Port D pin 2. Hooking a speaker across CN4 pin 7 (PD2) and CN1 pin 4 (GND) does indeed give us an annoying drony beaty noise, once some of the virtual oscillator frequencies have been pushed into the audio range.

Software Synthesis. Gotta love it.

Mr Foo

Sunday, December 6, 2009