Aviator on the BBC Micro

# Hard-coded division in the dashboard routines

## Using hand-tailored division code for fast calculations

Updating the dashboard in a timely manner is an essential task in a flight simulator as complex as Aviator, so it's perhaps no surprise that the update routines contain some beautifully efficient and hand-tailored code to perform the many calculations that power the dashboard's dials. Let's take a closer look at what's involved.

The long, multi-part UpdateIndicator routine performs calculations for the various indicators on the dashboard. This typically involves taking the relevant measurement, scaling it to fit on the appropriate indicator, and then drawing a suitable vector line (for dial hands), orthogonal line (for vertical bar indicators), joystick cross (for the joystick position display) or artificial horizon line.

A common feature of these calculations is the scaling step, which is done by multiplying the measurement by a specific scale factor. The shift-and-add multiplication algorithm is used, but instead of using a generic multiply routine, the relevant indicators contain hard-coded multiply routines that have their scale factors baked into the code.

Generally, if we want to calculate A * T / 256, then the traditional approach would be to use a shift-and-add loop like this:

``` LSR A               \ Set P = A >> 1 and C flag = bit 0 of A
STA P

LDX T               \ Set T1 = T - 1
DEX                 \
STX T1              \ We subtract 1 as the C flag will be set when we want
\ to do an addition in the loop below

\ We are now going to work our way through the bits of
\ P, and do a shift-add for any bits that are set,
\ keeping the running total in A. We already set up
\ the first shift at the start of this routine, as
\ P = |A| >> 1 and C = bit 0 of A, so we now need to set
\ up a loop to sift through the other 7 bits in P

LDA #0              \ Set A = 0 so we can start building the answer in A

LDX #7              \ Set up a counter in X to count the 7 bits remaining
\ in P

.MUL4

BCC P%+4            \ If C (i.e. the next bit from P) is set, do the
\
\   A = A + T1 + C
\     = A + T - 1 + 1
\     = A + T

ROR A               \ As mentioned above, this ROR shifts A right and
\ catches bit 0 in C - giving another digit for our
\ result - and the next ROR sticks that bit into the
\ left end of P while also extracting the next bit of P

ROR P               \ Add the overspill from shifting A to the right onto
\ the start of P, and shift P right to fetch the next
\ bit for the calculation

DEX                 \ Decrement the loop counter

BNE MUL4            \ Loop back for the next bit until P has been rotated
\ all the way

LSR A               \ Rotate (A P) once more to get the final result, as
ROR P               \ we only pushed 7 bits through the above process
```

You can read all about this algorithm in my Elite source code project, and you can see it in situ in the Elite source code in the MULT1 routine.

If we aren't worried about memory usage, then we can unroll the central loop of the algorithm like this, to save a bit of time going around the loop:

``` BCC P%+4            \ Shift-and-add the first bit of P
ROR A
ROR P

BCC P%+4            \ Repeat for the second bit
ROR A
ROR P

BCC P%+4            \ Repeat for the third bit
ROR A
ROR P
```

The above code does three bits, and we can repeat the same block for all eight bits to implement the full calculation. You can see this approach in the MULT1 routine in the 6502 Second Processor version of Elite.

Note that we still have to shift the result right by one place after the unrolled loop (i.e. the LSR A and ROR P after the end of the loop).

Aviator takes this to the next level by hard-coding a specific value of T into the code itself - let's refer to this fixed value of T as n, so our routine will now calculate A * n / 256, for a fixed n.

Given this, let's look at the above unrolled code. If a bit of n is 0 above, then the BCC P%+4 will skip the next instruction (ADC T), leaving just the ROR A and ROR P instructions. Also, in this case, we know the C flag is clear, as we took the BCC, so the ROR A is identical to LSR A.

We can therefore hard-code the value of n into the code by replacing each of these unrolled blocks:

``` BCC P%+4
ROR A
ROR P
```

with either this, when that bit of n is 0:

``` LSR A
ROR P
```

or this, when that bit of n is 1:

``` ADC T1
ROR A
ROR P
```

We can also ditch the need to calculate T1 = T - 1 by adding in a CLC, like this:

``` CLC
ROR A
ROR P
```

Finally, we only need to keep the ROR P instructions if we need to keep the overspill from the result, as we are no longer shifting P to see whether or not to branch over each bit's addition. If we don't need the overspill, we can replace the unrolled calculation with this, when that bit of n is 0:

``` LSR A
```

or this, when that bit of n is 1:

``` CLC
ROR A
```

Putting all of this together into a working example, the following code calculates A = T * 68 / 256, using a value of n = 34 (%00100010). The factor of 2 is because the following omits the final LSR A that normally appears after the loop, for optimal performance.

``` LDA T               \ Set A = T

LSR A               \ Bit 0 of n is 0

CLC                 \ Bit 1 of n is 1
ROR A

LSR A               \ Bit 2 of n is 0

LSR A               \ Bit 3 of n is 0

LSR A               \ Bit 4 of n is 0

CLC                 \ Bit 5 of n is 1