*Contributed by John Drew*

*Level: Intermediate*

The following description applies to all platforms, although the Proton Development System (PDS) is referred to where it is helpful to do so. Limitations in floating point arithmetic apply to ALL platforms; differences in the degree of accuracy are mostly a result of the number of bytes allocated for storage of the floating point number. Most microcontrollers, with their limited memory and speed, offer at most 32 bit storage for floating point, while a desktop commonly uses 64 bit storage, although this varies depending on the language. In PDS, floats are stored in 32 bits (4 bytes). Note: many desktop languages name their floating point types single precision (32 bit) and double precision (64 bit) rather than simply float.

Floating point numbers are used to approximate real numbers. These include rational numbers such as 1/10 or 2/3 (a ratio of two integers, with a numerator and a denominator) as well as irrational numbers such as pi. This page is about the storage and use of such numbers in floating point form and the limitations of doing this.

There are many ways to display a number using decimal notation. Floating point numbers for humans use a sequence of numerals with a decimal point to indicate place value. For many of us, the transition is shown with a “.” although some countries use a “,”. Some examples of floating point numbers are 3.14 or 0.017 or 234.0696 and so on. The position of the decimal point “floats” to inform the reader of the transition between units and tenths.

Computers need to store numbers in a standard way so that they may be read by different machines. To understand how this is done it is useful to show floating point numbers alongside their scientific notation equivalent.

**Floating point notation / Scientific notation**

3.14 becomes 3.14 * 10^0 (note 10^0 equates to 1)

31.4 becomes 3.14 * 10^1

3140 becomes 3.14 * 10^3

0.0314 becomes 3.14 * 10^-2

Picking up clues from the scientific notation way of doing things, it can be seen that it should be possible to store a number using just a series of numerals, e.g. 314 (the significand or mantissa), then a further number, e.g. -1 (the exponent including its sign), that tells where to place the decimal point to create the float of 31.4. To cater for negative numbers we would also need to store whether the number is - or +.

**Floating point notation / Possible computer storage**

3.14 becomes +314 (significand) and -2 (exponent)

31.4 becomes +314 and -1

3140 becomes +314 and +1

0.0314 becomes +314 and -4
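The storage idea in the table above can be tried out on a desktop. The following is a minimal Python sketch, not PDS code: it uses the standard `decimal` module to split a decimal string into a sign, an integer significand and a power-of-ten exponent. The function name `decimal_parts` is made up for this illustration.

```python
from decimal import Decimal

def decimal_parts(text):
    """Split a decimal string into (sign, integer significand, base-10 exponent)."""
    # normalize() strips trailing zeros, so "3140" becomes 314 with exponent +1
    sign, digits, exponent = Decimal(text).normalize().as_tuple()
    significand = int("".join(str(d) for d in digits))
    return ("-" if sign else "+", significand, exponent)

for text in ("3.14", "31.4", "3140", "0.0314"):
    sign, significand, exponent = decimal_parts(text)
    print(f"{text} becomes {sign}{significand} and {exponent:+d}")
```

Real floating point hardware and libraries use a base 2 exponent rather than base 10, but the principle of storing sign, significand and exponent separately is the same.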

In the PDS help file, Les shows us how this is done in the system we use. Just 4 bytes are used to store:

a) The sign (one bit of the 32 available)

b) The mantissa (or significand) without a decimal point (23 bits of the 32)

c) The exponent (8 bits that provide the information on where to put the decimal point)

For more detail read the Help file under Floating Point Numbers or refer to this excellent reference in Wikipedia (http://en.wikipedia.org/wiki/Floating_point).

**General comments**

As you can see from above there are just 23 binary bits to store the number. The maximum number that can be stored in 23 bits is 8,388,607 (2^23-1). In the IEEE754 Standard there is an implicit bit, so effectively the maximum number becomes 16,777,215 (2^24-1).

The IEEE754 single precision standard has the following structure:

1 sign bit, 8 exponent bits, 23 significand bits, for a total of 32 bits.

The exponent bias is 127, precision bits are 24 (because of the implicit bit) and number of decimal digits ~7.
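To make the 1 + 8 + 23 bit layout concrete, here is a small Python sketch that pulls the three fields out of a value stored in IEEE 754 single precision and rebuilds the value from them. It assumes a desktop Python where `struct` packs `f` as IEEE 754 single precision; it does not model the Microchip variant used by 8 bit Proton. The function names are hypothetical.

```python
import struct

def float32_fields(value):
    """Return (sign, biased exponent, 23 bit fraction) of value as an IEEE 754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 stored significand bits
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Rebuild a normal value (not zero, subnormal, infinity or NaN) from its fields."""
    significand = (1 << 23) | fraction          # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127 - 23)

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)
print(rebuild(*float32_fields(-2.5)))   # -2.5
```

The `(1 << 23) | fraction` step is the implicit bit mentioned above: it is never stored, yet it gives the significand 24 bits of precision.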

8 bit Proton uses a Microchip modified form of the IEEE754 standard and the implicit bit may not be implemented, therefore the maximum significand may be 8,388,607.

In summary, when using an 8 bit chip with Proton the accuracy of floating point should be considered <=7 digits.
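The 7 digit guideline can be checked on a desktop. This Python sketch rounds values to IEEE 754 single precision with a hypothetical helper `to_f32` and shows where whole numbers stop being exact. It assumes the full 24 bit significand, so an implementation without the implicit bit would reach its limit one bit earlier.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_f32(16777215.0))    # 16777215.0  (2^24 - 1 is stored exactly)
print(to_f32(16777217.0))    # 16777216.0  (2^24 + 1 cannot be stored)
print(to_f32(1234567.0))     # 1234567.0   (7 digits survive)
print(to_f32(123456789.0))   # 123456792.0 (9 digits do not)
```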

Proton24 single precision maths uses true IEEE754 with a full 24 bit significand (mantissa), so it will be slightly more accurate than the implementation in 8 bit Proton. I am unsure if the implicit bit is implemented.

Better still, use the double precision 64 bit maths in Proton24. The more efficient device and 64 bit floating point makes for a formidable solution when extra precision is required.

When using floating point, remember that many numbers do not have an exact binary equivalent. Well known examples include 1/3 or 0.1 or pi. What looks simple for humans may not be so for the machine, for example the square of 0.1 should calculate as 0.01 but instead results in 0.009999999776 in a 4 byte system. If you test for equality to 0.01 the test would fail.
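The failing equality test can be reproduced on a desktop. The sketch below is Python, not PDS code, and emulates 4 byte floats by rounding every result to IEEE 754 single precision with a hypothetical helper `to_f32`. The exact digits produced on a PIC may differ from these, but the failed comparison is the point.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

tenth = to_f32(0.1)               # 0.1 has no exact binary form
square = to_f32(tenth * tenth)    # squared, then stored in 4 bytes again
hundredth = to_f32(0.01)          # the closest 4 byte value to 0.01

print(tenth == 0.1)               # False
print(square == hundredth)        # False
print(abs(square - hundredth))    # a gap of roughly 1e-9
```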

In a 4 byte system, under most circumstances the result of a computation is rounded to about 7 significant digits, so consider the following example:

123456.7 (This number is stored as accurately as possible in 4 bytes)

+ 101.7654 (so is this one)

123558.4654 (which is rounded off to 123558.5 and so loses the last 3 digits)
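The same sum can be reproduced with emulated 4 byte floats. This is a Python sketch under the assumption of IEEE 754 single precision with round to nearest; `to_f32` is a hypothetical helper.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = to_f32(123456.7)
b = to_f32(101.7654)
total = to_f32(a + b)

print(total)             # 123558.46875
print(f"{total:.7g}")    # 123558.5, the 7 digits that can be trusted
```

The stored result differs from the exact answer 123558.4654 from the third decimal place onwards, which is why only about 7 digits should be displayed.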

Another problem is the difficulty in expressing some rational numbers as a float. Consider 1/3, which you would expect as 0.333333333 (recurring). In real life, 32 bit floats will tend to store a number closer to 0.333333310. This is because after about the 7th significant digit we run out of bits to put the threes in, so the result is abruptly cut off, leading to an inexact answer.
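To see exactly what a 4 byte float holds in place of 1/3, the Python sketch below converts the stored value back to an exact fraction. It assumes IEEE 754 single precision with round to nearest, so the trailing digits differ from the figure a PIC displays, and `to_f32` is a hypothetical helper.

```python
import struct
from fractions import Fraction

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

third = to_f32(1.0 / 3.0)
print(Fraction(third))      # 11184811/33554432, a ratio with a power of 2 below
print(f"{third:.9f}")       # 0.333333343
```

Whatever the platform, the stored value is always some integer divided by a power of two, so it can only ever be close to 1/3.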

Subtraction of close numbers can generate significant errors. Similarly, multiplication and division calculations that lead to very large or very small results MAY be unusable. Trigonometry functions of numbers that cannot be represented exactly, such as pi, will not be accurate: sine(pi), which should equal 0, will compute as a small negative number, and tan(pi/2) will compute in single precision C as -22877332.0 instead of infinity.
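These effects can be demonstrated with a short Python sketch. It stores the inputs as IEEE 754 single precision values using a hypothetical helper `to_f32` and then uses Python's double precision `math` functions, so the printed digits will not match a single precision library exactly, but the signs and magnitudes tell the same story.

```python
import math
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Subtraction of close numbers: the true difference is 0.000001
difference = to_f32(to_f32(1.000001) - to_f32(1.0))
relative_error = abs(difference - 0.000001) / 0.000001
print(difference, relative_error)     # over 4% of the answer is wrong

# pi stored in 4 bytes is slightly larger than the real pi
print(math.sin(to_f32(math.pi)))      # a small negative number, not 0
print(math.tan(to_f32(math.pi / 2)))  # a large negative number, not infinity
```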

Even rules that we take to be a fundamental truth such as (a + b) + c = a + (b + c) may not be true in floating point practice because of the rounding that occurs and the need to represent numbers in a practical number of bytes.
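A concrete case where the grouping matters, sketched in Python with every intermediate result rounded to IEEE 754 single precision. The helper names `to_f32` and `add32` and the chosen values are illustrative assumptions.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE 754 single precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def add32(x, y):
    """Add two values and store the result in 4 bytes, as a 32 bit float would."""
    return to_f32(x + y)

a, b, c = 100000000.0, -100000000.0, 1.0

left = add32(add32(a, b), c)     # (a + b) + c
right = add32(a, add32(b, c))    # a + (b + c)

print(left)     # 1.0
print(right)    # 0.0, because b + c rounds straight back to b
```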

**So what can we do about these inaccuracies?**

- Keep operations one to a line so that Z = Sin A * Sin B appears as

Code:

    X = Sin A
    Y = Sin B
    Z = X * Y

- Test for shortcuts. For example the sine of an angle approximates the angle (in radians) for small values.

Code:

    If A <= 0.1 Then
        X = A
    Else
        X = Sin A
    EndIf

- Tests on floats may be inaccurate, so prefer X <= Y or X >= Y over X = Y. To test for equality, test that the absolute gap between the two values is small. For example, If Abs(Y - X) <= 0.000001 Then do something is a reasonable test for equality.
- Understand that if you use floats the accuracy may be poor. You should especially check for values at the limits where one number is large and the other small. Make use of ISIS. Even the demo version is very useful.
- Never use floating point maths for financial calculations. Always use integer arithmetic.
- Significant figures should normally be taken into account when doing calculations. If data1 is accurate to 5 significant figures and you multiply it by data2 that is accurate to just 2 significant figures, then the result is only valid to 2 significant figures. For example 2.3 * 18.234 shows as 41.9382 on a calculator but should only be printed to 2 significant figures, that is, 42.
- If there is a possibility of dividing by zero, check the divisor first using something like If Abs(X) <= 0.000001 Then Result = 999999.9, or whatever is acceptable to your program.
- Don’t assume rounding works in a particular way. There are two common schemes of rounding, the major difference being for negative numbers. In Proton a float is rounded using fRound to the nearest integer, e.g. 144.3 becomes 144, 0.6 becomes 1, 1.1 becomes 1, -0.6 becomes -1, -0.3 becomes 0 and -2.3 becomes -2. On the other hand, if you assign a float value to an integer it is truncated, e.g. 3.9 becomes 3.
- If you need the fractional part of a floating point number in Proton, turn off rounding, assign the value of the float to an integer large enough to accommodate the likely range, and then subtract the integer from the float.

Code:

    _FP_FLAGS = 0                                   ' Disable Rounding
    WordVar = FloatVar
    _FP_FLAGS = 64                                  ' Enable Rounding
    Float_FractionalPart = FloatVar - WordVar
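Several of the tips above (testing for a small gap, guarding a division, and taking the fractional part by truncation) can be sketched on a desktop as follows. This is Python rather than PDS code; the function names, the tolerance and the fallback value are illustrative choices taken from the text, not part of any library.

```python
import math

def nearly_equal(x, y, tolerance=0.000001):
    """Treat two floats as equal when the absolute gap is small."""
    return abs(x - y) <= tolerance

def safe_divide(numerator, divisor, fallback=999999.9, tiny=0.000001):
    """Return a fallback value instead of dividing by (almost) zero."""
    if abs(divisor) <= tiny:
        return fallback
    return numerator / divisor

def fractional_part(x):
    """Truncate toward zero, then subtract the integer part from the float."""
    return x - math.trunc(x)

print(0.1 + 0.2 == 0.3)               # False, even in double precision
print(nearly_equal(0.1 + 0.2, 0.3))   # True
print(safe_divide(1.0, 0.0))          # 999999.9
print(math.trunc(3.9))                # 3, truncated rather than rounded
print(fractional_part(3.75))          # 0.75
```

The tolerance should be chosen to suit the size of the numbers involved: a fixed gap of 0.000001 is meaningless for 4 byte values in the millions, where adjacent floats are already further apart than that.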

**Whenever possible use integer arithmetic**

This is 100% accurate providing you work within the limits of the type. Choose an appropriate integer variable type that will accommodate the range you want to cover, e.g. a Byte for values from 0 to 255, a Word for 0 to 65535, a signed Dword for -2147483648 to +2147483647 or an unsigned Dword for 0 to 4294967295.

For example if you are sending data (with two numerals after the decimal point) over a serial link you might choose to first multiply each number by 100, convert it to an integer and send it. At the receiving end you may choose to manipulate the data as a Word type in the PIC® and then convert it to a float for display by dividing by 100 to print the result in its original 2 decimal places form.

Print At 1, 1, DEC2 Result

There are examples at the end of this document.

Alternatively, imagine the number to be sent is 28.32. First multiply it by 100; it becomes 2832 when assigned to a Word variable. It is sent over the serial link as an integer and may then be modified in the PIC®, perhaps averaged with a group of readings. If the result after integer arithmetic was 3851 you could send this to a display like this without ever converting it to a float:

Code:

    Dim PrintVar As Word
    Dim PrintVar1 As Word
    PrintVar = 3851 / 100                               ' PrintVar now has a value of 38
    PrintVar1 = 3851 // 100                             ' PrintVar1 contains the modulus value of 51
    Print At 1, 1, DEC PrintVar, ".", DEC2 PrintVar1    ' The display reads 38.51 (DEC2 keeps a leading zero, e.g. 38.05)
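The same split into whole and fractional parts can be tried on a desktop. This Python sketch uses a hypothetical helper `format_scaled` for non-negative scaled integers and shows why the fractional part needs zero padding.

```python
def format_scaled(value, decimals=2):
    """Format a non-negative integer that holds a value scaled by 10^decimals."""
    scale = 10 ** decimals
    whole, fraction = divmod(value, scale)    # integer divide and modulus
    return f"{whole}.{fraction:0{decimals}d}"

print(format_scaled(3851))    # 38.51
print(format_scaled(2832))    # 28.32
print(format_scaled(3805))    # 38.05, without padding this would read 38.5
```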

**Things to remember when using integer math:**

- Remember where your decimal place is!
- This is primary school mathematics – do the sum on paper the way you know, and then try to convert that to BASIC.
- Keep significant figures in mind. For example 13.3 / 8 = 1.6625, but the implied precision is useless as the input numbers are only to 3 and 1 significant figures.
- When adding and subtracting in integer math, multiply both values by the same amount so there is no truncation or rounding being performed, then add or subtract as normal.
- When multiplying in integer math, the output precision is equal to the sum of the two inputs' precision. Multiply both numbers by 10^Precision before executing the multiplication.
- When dividing in integer math, the output precision is equal to the difference between the two inputs' precision.
- Make sure you keep track of what is positive and what is negative. By default, DWords are signed, but this can be disabled using the code: Declare UNSIGNED_DWORDS = On
- You may be reading a temperature sensor. Do all your arithmetic in integer types. Leave the conversion to a float until the last moment or never convert it, just use the strategy above to print it to the display.
- Remember that a Byte rolls over to 0 when you exceed 255 and a Word rolls over to 0 when you exceed 65535. An unsigned Dword rolls over to 0 when you exceed 4294967295, while a signed Dword that exceeds 2147483647 wraps around to a negative value. With integer subtraction the results are accurate unless the number you are subtracting is larger than the one you started with. For example a Byte of value 2 which has 3 subtracted from it will give 255, not -1.
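The precision rules and the Byte and Word roll over behaviour in the list above can be explored with a short Python sketch. Python integers never overflow, so the hypothetical helpers `wrap_byte` and `wrap_word` mask the result to imitate 8 bit and 16 bit variables.

```python
def wrap_byte(value):
    """Imitate an 8 bit Byte variable (0 to 255)."""
    return value & 0xFF

def wrap_word(value):
    """Imitate a 16 bit Word variable (0 to 65535)."""
    return value & 0xFFFF

print(wrap_byte(255 + 1))     # 0
print(wrap_byte(2 - 3))       # 255, not -1
print(wrap_word(65535 + 1))   # 0

# Multiplication: 2 decimal places times 1 decimal place gives 3 decimal places
product = 8945 * 124          # 89.45 * 12.4
print(product)                # 1109180, read as 1109.180

# Division: 1 decimal place divided by 0 decimal places leaves 1 decimal place
print(133 // 8)               # 16, read as 1.6 (13.3 / 8)
# Scaling the dividend further beforehand keeps more digits
print(133000 // 8)            # 16625, read as 1.6625
```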

**Examples using integer maths:**

*Contributed by Wastrix*

__SUBTRACTION (SIMILAR FOR ADDITION): Take 87.9482135 from 112.1987345__

*With integer math:*

Code:

    Dim DWord1 As DWord
    Dim DWord2 As DWord
    Dim Result As DWord
    Dim Before As Byte
    Dim After As DWord
    DWord1 = 1121987345                     ' Multiply both numbers by 10^7
    DWord2 = 879482135
    Result = DWord1 - DWord2
    Before = Result / 10000000              ' Divide by 10^7 again
    After = Result // 10000000
    Print Dec Before, ".", DEC7 After       ' Result is 24.250521 (correct)

*With floating point:*

Code:

    Dim Float1 As Float
    Dim Float2 As Float
    Dim ResultF As Float
    Float1 = 112.1987345
    Float2 = 87.9482135
    ResultF = Float1 - Float2
    Print $FE, $C0, DEC7 ResultF            ' Result is 24.250564 (incorrect)

__DIVISION: Divide 1 by 3__

*With integer math:*

Code:

    Dim DWord1 As DWord
    Dim DWord2 As DWord
    Dim Result As DWord
    Dim Before As DWord
    Dim After As DWord
    DWord1 = 1000000000                     ' Set variables to correct initial values
    DWord2 = 3
    Result = DWord1 / DWord2                ' Do first operation (1/3)
    Before = Result / 1000000000            ' Get numbers before decimal
    After = Result // 1000000000            ' Get numbers after decimal place
    Print Dec Before, ".", DEC9 After       ' Result is 0.333333333

*With floating point:*

Code:

    Dim Float1 As Float
    Dim Float2 As Float
    Dim ResultF As Float
    Float1 = 1
    Float2 = 3
    ResultF = Float1 / Float2
    Print $FE, $C0, DEC8 ResultF            ' Result is 0.333333310
    End

__MULTIPLICATION: Multiply $89.45 by 12.4__

*With integer math:*

Code:

    Dim WordOne As Word                     ' We only need word size as we
    Dim WordTwo As Word                     ' are working with small numbers
    Dim Result As DWord
    Dim Before As Word                      ' Likewise here...
    Dim After As Byte
    WordOne = 8945                          ' Set the values
    WordTwo = 124
    Result = WordOne * WordTwo
    Before = Result / 1000                  ' 10^(2+1), because we multiplied
    After = Result // 1000                  ' the inputs by 10^2 and 10^1
    Print Dec Before, ".", DEC3 After       ' Result is 1109.180 (correct); DEC3 keeps leading zeros

*With floating point:*

Code:

    Dim Float1 As Float
    Dim Float2 As Float
    Dim ResultF As Float
    Float1 = 89.45
    Float2 = 12.4
    ResultF = Float1 * Float2
    Print $FE, $C0, DEC2 ResultF            ' Result is 1109.17 (incorrect)

__ALL OF THE ABOVE: Convert 34.5189 degrees Celsius to Fahrenheit__

*With integer math:*

Code:

    Dim Celsius As DWord
    Dim Fahrenheit As DWord
    Dim Before As DWord
    Dim After As DWord
    Celsius = 34518900                      ' Multiplied by 10^6
    Fahrenheit = Celsius * 9
    Fahrenheit = Fahrenheit / 5
    Fahrenheit = Fahrenheit + 32000000      ' Multiplied by 10^6
    Before = Fahrenheit / 1000000           ' Divide by 10^6
    After = Fahrenheit // 1000000
    Print Dec Before, ".", DEC6 After       ' Display to 6dp. Notice the 6?
                                            ' Result is 94.134020 (correct)

*With floating point:*

Code:

    Dim Celsius As Float
    Dim Fahrenheit As Float
    Celsius = 34.5189
    Fahrenheit = Celsius * 9 / 5
    Fahrenheit = Fahrenheit + 32
    Print $FE, $C0, DEC6 Fahrenheit         ' Result is 94.134017 (incorrect)
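The integer results in the four examples above can be cross-checked on a desktop before committing code to a PIC. The following Python sketch repeats the scaled integer arithmetic. The helper `show` is hypothetical, and Python's unlimited integers mean it does not check that the intermediate values fit in a DWord.

```python
def show(value, decimals):
    """Format a non-negative integer scaled by 10^decimals with a decimal point."""
    whole, fraction = divmod(value, 10 ** decimals)
    return f"{whole}.{fraction:0{decimals}d}"

subtraction = 1121987345 - 879482135         # both inputs scaled by 10^7
division = 1000000000 // 3                   # 1 scaled by 10^9
multiplication = 8945 * 124                  # scaled by 10^2 and 10^1
fahrenheit = 34518900 * 9 // 5 + 32000000    # scaled by 10^6

print(show(subtraction, 7))       # 24.2505210
print(show(division, 9))          # 0.333333333
print(show(multiplication, 3))    # 1109.180
print(show(fahrenheit, 6))        # 94.134020
```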