Times tables and nibble arithmetic

Inside Aviator's multiplication algorithms

Aviator contains a lot of calculations, and a fair number of them require multiplication. The 6502 CPU doesn't have a multiplication instruction, so 8-bit authors have to roll their own. There are lots of ways of doing this: Elite, for example, implements loop-based shift-and-add multiplication that's optimised for tight memory, though it uses faster logarithm-based multiplication in later versions of the game (see the deep dives on my Elite site for details).

Aviator, meanwhile, goes for speed from the get-go, and implements 4-bit multiplication using times tables. It then uses this as a building block to implement multiplication routines that can multiply all the way up to two 16-bit numbers.

Let's take a look at the building blocks before moving on to the algorithms.

Times tables
------------

The core of the multiplication process is the timesTable table. This table lets us quickly multiply two unsigned four-bit numbers, using a single lookup to get the result.

The way it works is simple. Say our two four-bit numbers are A and B, which are both in the range 0 to 15. If, in binary, they are written as A = %aaaa and B = %bbbb, then the location timesTable+%aaaabbbb contains A * B (and so does timesTable+%bbbbaaaa, incidentally).

So, to multiply A and B, we just have to set X = %aaaabbbb (or %bbbbaaaa), and then we can fetch the result of A * B using a simple instruction:

  LDA timesTable,X

We can, if course, use Y instead of X, and fetch the same result with the following:

  LDA timesTable,Y

We can encode the values of %aaaa and %bbbb into X or Y by using the nibble and shift tables, which we look at next.

Extracting nibbles and quick shifting
-------------------------------------

The most convenient way of setting X to %aaaabbbb, given two numbers %aaaa and %bbbb, is to shift one of them left by four places, and OR the results together into one combined byte. This is not that difficult to do, but as the bitwise and shift instructions are only available for the accumulator, it's convenient to have shortcuts for shifting the X and Y registers. It's also useful to have a shortcut for extracting, say, just the high nibble from a register, as we will use that a lot in the following algorithms.

To this end, Aviator has the highNibble and lowNibble tables for extracting the high and low nibble from an 8-bit value, and the shift4Right and shift4Left tables for shifting four places to the right or left.

If we consider the 8-bit value of X to be %XXXXxxxx in binary, then these lookups work like this:

highNibble,X contains the high nibble of X, i.e. %XXXX0000
lowNibble,X contains the low nibble of X, i.e. %0000xxxx
shift4Right,X contains X >> 4, i.e. %0000XXXX
shift4Left,X contains X << 4, i.e. %xxxx0000

Note that unlike the CPU's shift commands, these lookups don't affect or use the C flag.

Using these and the times table lookup above, we can write multiplication routines that quickly multiply various magnitudes of numbers. That's exactly what Aviator does, so let's take a look at the details.

Multiply8x8
-----------

The most basic multiplication routine in Aviator is Multiply8x8, which multiplies two unsigned 8-bit numbers, as follows:

  (A V) = X * Y

It uses the following algorithm, where X is %XXXXxxxx in binary and Y is %YYYYyyyy in binary:

  X * Y = %XXXXxxxx * %YYYYyyyy 

        = (%XXXX0000 + %0000xxxx) * (%YYYY0000 + %0000yyyy)

        = (%XXXX0000 * %YYYY0000) + (%XXXX0000 * %0000yyyy)
                                  + (%0000xxxx * %YYYY0000)
                                  + (%0000xxxx * %0000yyyy)

        = (%YYYY * %XXXX) << 8    + (%XXXX * %yyyy << 4)
                                  + (%xxxx * %YYYY << 4)
                                  + (%xxxx * %yyyy)

        = (%YYYY * %XXXX) << 8    + ((%XXXX * %yyyy) + (%xxxx * %YYYY)) << 4
                                  + (%xxxx * %yyyy)

All the constituent parts of the expanded equation can be calculated using the timesTable, shift and nibble tables, so that's what the routine does.

Multiply4x16
------------

The Multiply4x16 routine multiplies a 4-bit number by a 16-bit number, again using the lookup table at timesTable for fast results. It does the following calculation:

  (G W) = V * (S R) >> 8

The algorithm goes as follows. First, we perform a couple of lookups from timesTable to calculate the following:

  X = %vvvv * %RRRR

  Y = %vvvv * %SSSS

where V is %VVVVvvvv, S is %SSSSssss, and R is %RRRRrrrr.

If we write X as %XXXXxxxx in binary and Y as %YYYYyyyy, then we now calculate the following, which is again pretty easy using the timesTable, shift and nibble tables:

  %YYYYyyyyXXXX + (%vvvv * %ssss)

This is the answer we are looking for. To see why, we can expand it as follows:

    %YYYYyyyyXXXX + (%vvvv * %ssss)
  = %YYYYyyyy0000 + %XXXX + (%vvvv * %ssss)
  =   %vvvv * %SSSS << 4
    + %vvvv * %RRRR >> 4
    + %vvvv * %ssss
  = %vvvv * (%SSSS << 4 + %ssss + %RRRR >> 4)
  = %vvvv * (%SSSSssss + %RRRR >> 4)
  = %vvvv * (%SSSSssss + %RRRRrrrr >> 8)
  = %vvvv * (S R) >> 8
  = V * (S R) >> 8

Note, the step that goes from %RRRR >> 4 to %RRRRrrrr >> 8 is a bit of a fudge; this appears to be the algorithm ignoring the lowest nibble of R. I am not entirely sure this analysis is correct, but it can't be too far off...

Multiply8x16
------------

The Multiply8x16 routine multiplies an unsigned 8-bit and a signed 16-bit number, as follows:

  (G W V) = (Q P) * R

It uses the following algorithm to break it down into multiple multiplications of two 8-bit numbers, which can be done by Multiply8x8:

    (Q P) * R
  = (Q << 8 + P) * R
  = (Q << 8) * R + (P * R)
  = (Q * R) << 8 + (P * R)

Multiply16x16
-------------

The Multiply16x16 routine multiplies two unsigned 16-bit numbers:

  (H A) = (J I) * (S R) >> 16

It uses the following algorithm to break it down into multiple multiplications of two 8-bit numbers, which can be done by Multiply8x8:

  (J I) * (S R) = (J << 8 + I) * (S << 8 + R)
                = (J << 8 * S << 8) + (J << 8 * R)
                                    + (I * S << 8)
                                    + (I * R)
                = (J * S) << 16 + (J * R) << 8
                                + (I * S) << 8
                                + (I * R)

This returns a fractional result with the lowest byte containing the fraction. The routine doesn't care about the very lowest byte, so it actually calculates the following, effectively shifting the above to the right by 8 places and dropping the (I * R) part:

  (J I) * (S R) = (J * S) << 8 + (J * R) + (I * S)

before adding 128 to round the result up or down.

Inside Aviator's multiplication algorithms

Times tables ------------

Extracting nibbles and quick shifting -------------------------------------

Multiply8x8 -----------

Multiply4x16 ------------

Multiply8x16 ------------

Multiply16x16 -------------