Two systems were implemented: one with a 16-bit mantissa and one with an 8-bit mantissa. Overall, I would say that the 16-bit version is more useful: it is only slightly slower, but much more accurate.


16 bit mantissa float representation

The floating-point format with 16 bits of mantissa, 7 bits of exponent, and a sign bit is stored in the space of a 32-bit long integer. Compared with IEEE floating point, this format speeds up multiplication by a factor of 2.5-3 and addition by a factor of about 1.3-4.0. A multiply takes about 35 cycles; an add takes 35-106 cycles. My short float operations do not detect overflow, denorms, or infinity (but underflow is detected and the value set to zero).

This section will concentrate on numbers stored as 32-bit long ints. The lower 16 bits are the mantissa (more properly, significand). The mantissa value is considered a binary fraction with values 0.5<=mantissa<1.0. The top 8 bits are the exponent, but the top bit is used for overflow during the calculation, so the exponent range is 0x00 to 0x7f, or about 10^-18 to 10^18. The sign bit is stored in bit 23 (the high bit of the third byte). The high-order bit of the significand is always one (unless the actual value is zero), because no denorms are allowed. Typical numbers are shown below.

Examples:

Decimal Value    Short float Representation
0.0              0x0000_0000
1.0              0x3f00_8000
1.5              0x3f00_c000
10000            0x4c00_9c40
1.0001           0x3f00_8003
-1.0             0x3f80_8000
-1.5             0x3f80_c000
1e-18            0x0300_9392
-1e-18           0x0380_9392
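
The table entries are consistent with value = mantissa x 2^(exponent - 0x3e), where the mantissa is read as a 16-bit binary fraction. The following is a minimal C sketch of that reading, not the fp2sfp/sfp2fp routines from the test program. The names sfp32, SFP_BIAS, sfp_to_double, and sfp_from_double are hypothetical, the bias 0x3e is inferred from the table, and truncation (rather than rounding) of the mantissa is an assumption that happens to reproduce the 1e-18 entry.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

typedef uint32_t sfp32;      /* hypothetical name for a 32-bit short float */

#define SFP_BIAS 0x3e        /* inferred: 1.0 is stored as 0.5 * 2^(0x3f - 0x3e) */
#define SFP_SIGN 0x00800000u /* sign lives in bit 23 */

/* Decode: (+/-) (mantissa / 65536) * 2^(exponent - SFP_BIAS). */
static double sfp_to_double(sfp32 x)
{
    uint32_t mant = x & 0xffffu;
    int exp = (int)((x >> 24) & 0x7fu);
    if (mant == 0)
        return 0.0;                          /* zero mantissa means zero */
    double v = ldexp((double)mant, exp - SFP_BIAS - 16);
    return (x & SFP_SIGN) ? -v : v;
}

/* Encode, truncating the mantissa to 16 bits (assumed rounding mode). */
static sfp32 sfp_from_double(double v)
{
    if (v == 0.0)
        return 0;
    int e;
    double f = frexp(fabs(v), &e);           /* |v| = f * 2^e, 0.5 <= f < 1.0 */
    int exp = e + SFP_BIAS;
    if (exp < 0)
        return 0;                            /* underflow is set to zero */
    assert(exp <= 0x7f);                     /* the format has no overflow handling */
    uint32_t mant = (uint32_t)(f * 65536.0); /* top bit is always set */
    uint32_t sign = (v < 0.0) ? SFP_SIGN : 0u;
    return ((uint32_t)exp << 24) | sign | mant;
}
```

With these definitions, sfp_from_double(10000.0) gives 0x4c009c40 and sfp_to_double(0x3f80c000) gives -1.5, matching the table.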

16 bit float code

Test program: This program includes float to short-float (fp2sfp), short-float to float (sfp2fp), and negate (neg_sfp).
It is used to check accuracy and performance of the short-float operations.
The typedef and routine declarations:

typedef unsigned long sfp;
// declare mult routine
sfp mult_sfp(sfp, sfp);
// declare add routine
sfp add_sfp(sfp, sfp);
// sfp format is:
//   top byte is exponent, range +63/-64 (7 bits, offset binary)
//   third byte has sign bit in top bit
//   lower two bytes are mantissa fraction, normalized so that
//   the top mantissa bit is ALWAYS one, unless the value is zero
// A zero is represented by all zero mantissa

multiply routine: This assembler routine multiplies two short-floats.

  1. If the sum of the input exponents is less than 0x3f, the exponent will underflow and the product is zero.
  2. If the exponents don't underflow:
    1. The result, (mantissa_a)x(mantissa_b), must be 0.25<=product<1.0
    2. Then if the product has the high order-bit set, the output exponent is exp_input_a + exp_input_b - 0x3e.
    3. Otherwise the second bit of the product will be set, and the output mantissa is the product<<1
      and the output exponent is exp_input_a + exp_input_b - 0x3f.
    4. The sign of the product is (sign_a) xor (sign_b)
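
The multiply steps above can be sketched in portable C. This is only an illustration of the logic, not the assembler routine itself. The names sfp32 and mult_sfp_sketch are hypothetical, the early return for a zero input is an assumption (the steps above do not mention zero operands), and because overflow is not detected the exponent is simply masked to 7 bits here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t sfp32;   /* hypothetical name for a 32-bit short float */

static sfp32 mult_sfp_sketch(sfp32 a, sfp32 b)
{
    uint32_t ma = a & 0xffffu;
    uint32_t mb = b & 0xffffu;
    if (ma == 0 || mb == 0)
        return 0;                       /* assumed: zero times anything is zero */
    assert((ma & 0x8000u) && (mb & 0x8000u));  /* nonzero inputs are normalized */

    /* Step 1: underflow check on the sum of the exponents. */
    uint32_t esum = ((a >> 24) & 0x7fu) + ((b >> 24) & 0x7fu);
    if (esum < 0x3f)
        return 0;

    /* Step 2.1: 16x16 -> 32-bit product, 0.25 <= prod / 2^32 < 1.0. */
    uint32_t prod = ma * mb;
    uint32_t exp;
    if (prod & 0x80000000u) {
        exp = esum - 0x3e;              /* step 2.2: already normalized */
    } else {
        prod <<= 1;                     /* step 2.3: second bit is set, shift up */
        exp = esum - 0x3f;
    }

    /* Step 2.4: sign is the xor of the input sign bits (bit 23). */
    uint32_t sign = (a ^ b) & 0x00800000u;

    /* Keep the top 16 bits of the product; overflow is not detected. */
    return ((exp & 0x7fu) << 24) | sign | (prod >> 16);
}
```

For example, multiplying 1.5 (0x3f00_c000) by itself gives 0x4000_9000, which decodes to 0.5625 x 2^2 = 2.25.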

add routine: This assembler routine adds two short-floats.

  1. If either input is zero, the output is the other input.
  2. Determine which input is bigger, which smaller (absolute value) by first comparing the exponents, then the mantissas if necessary.
  3. Determine the difference in the exponents and shift the smaller input mantissa right by the difference.
    But if the exponent difference is greater than 15 then just output the bigger input.
  4. If the signs of the inputs are the same, add the bigger and (shifted) smaller mantissas.
    The result must be 0.5<sum<2.0.
    If the result is 1.0 or greater, shift the mantissa sum right one bit and increment the bigger exponent.
    The sign is the sign of either input.
  5. If the signs of the inputs are different, subtract the (shifted) smaller mantissa from the bigger one, so that the result is never negative.
    The result must be 0.0<=difference<1.0 (zero only when the two inputs have equal magnitude).
    Shift the mantissa left until the high bit is set, while decrementing the bigger exponent.
    The sign is the sign of the bigger input.
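
The add steps above can be sketched in portable C in the same way. Again this is an illustration under stated assumptions, not the assembler routine: the names sfp32 and add_sfp_sketch are hypothetical, returning zero on exact cancellation or on exponent underflow is an assumption, the shifted-out bits of the smaller mantissa are simply truncated, and the exponent is masked to 7 bits because overflow is not detected.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t sfp32;   /* hypothetical name for a 32-bit short float */

static sfp32 add_sfp_sketch(sfp32 a, sfp32 b)
{
    /* Step 1: a zero input returns the other input. */
    if ((a & 0xffffu) == 0) return b;
    if ((b & 0xffffu) == 0) return a;

    /* Step 2: order by magnitude, exponent first, then mantissa. */
    uint32_t mag_a = (((a >> 24) & 0x7fu) << 16) | (a & 0xffffu);
    uint32_t mag_b = (((b >> 24) & 0x7fu) << 16) | (b & 0xffffu);
    sfp32 big = (mag_a >= mag_b) ? a : b;
    sfp32 sml = (mag_a >= mag_b) ? b : a;

    uint32_t m_big = big & 0xffffu;
    uint32_t m_sml = sml & 0xffffu;
    int32_t  exp   = (int32_t)((big >> 24) & 0x7fu);
    uint32_t sign  = big & 0x00800000u;   /* result takes the bigger input's sign */
    assert(m_big & 0x8000u);              /* nonzero inputs are normalized */

    /* Step 3: align the smaller mantissa, or give up if it is negligible. */
    uint32_t diff = (uint32_t)exp - ((sml >> 24) & 0x7fu);
    if (diff > 15)
        return big;
    m_sml >>= diff;

    uint32_t m;
    if (((big ^ sml) & 0x00800000u) == 0) {
        /* Step 4: same signs, add; a carry into bit 16 means sum >= 1.0. */
        m = m_big + m_sml;
        if (m & 0x10000u) { m >>= 1; exp++; }
    } else {
        /* Step 5: different signs, subtract and renormalize. */
        m = m_big - m_sml;
        if (m == 0)
            return 0;                     /* assumed: exact cancellation gives zero */
        while (!(m & 0x8000u)) { m <<= 1; exp--; }
        if (exp < 0)
            return 0;                     /* assumed: underflow is set to zero */
    }
    return (((uint32_t)exp & 0x7fu) << 24) | sign | m;
}
```

For example, 1.0 + 1.5 gives 0x4000_a000 (0.625 x 2^2 = 2.5), and 1.5 + (-1.0) gives 0x3e00_8000 (0.5 x 2^0 = 0.5).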

8 bit mantissa float representation

The floating format with 8 bits of mantissa, 7 bits of exponent, and a sign bit, is stored in the space of a 16-bit unsigned integer....
