Why Are Floating Point Numbers Inaccurate?

Why do some numbers lose accuracy when stored as floating point numbers?

For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:

32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875

How can such an apparently simple number be "too big" to express in 64 bits of memory?


In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:

5179139571476070 × 2^-49

Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.

9.2 may be simply 92/10, but 10 cannot be expressed as 2^n if n is limited to integer values.
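
You can ask Python for this integer-times-power-of-two form directly; a quick check using only the built-in float method:

>>> (9.2).as_integer_ratio()   # numerator and power-of-two denominator
(5179139571476070, 562949953421312)
>>> 562949953421312 == 2**49
True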


Seeing the Data

First, a function to see the components that make up a 32- or 64-bit float. Gloss over it if you only care about the output (example in Python):

import struct
from itertools import islice

def float_to_bin_parts(number, bits=64):
    if bits == 32:          # single precision
        int_pack      = 'I'
        float_pack    = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    elif bits == 64:        # double precision; all Python floats are this
        int_pack      = 'Q'
        float_pack    = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    else:
        raise ValueError('bits argument must be 32 or 64')
    # Pack the float into raw bytes, reinterpret those bytes as an unsigned
    # integer, and render that integer as a zero-padded bit string.
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    # Slice the bit string into its three fields: sign, exponent, mantissa.
    return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]

There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
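
For a taste of what struct is doing for us here, you can pack the float into its raw bytes yourself ('>d' forces big-endian so the bytes read left to right):

>>> import struct
>>> struct.pack('>d', 9.2).hex()   # the 64 bits of the double, as hex
'4022666666666666'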

Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double precision is a separate type, double, which is often implemented as 64 bits.
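
You can confirm the layout from Python itself: sys.float_info describes the platform's float type (53 significand bits on any platform with IEEE 754 doubles):

>>> import sys
>>> sys.float_info.mant_dig   # 53 significand bits: 52 stored + 1 implicit
53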

When we call that function with our example, 9.2, here's what we get:

>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']

Interpreting the Data

You'll see I've split the return value into three components. These components are:

  • Sign
  • Exponent
  • Mantissa (also called Significand, or Fraction)
    Sign

    The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.

    Exponent

    The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2^(# of bits - 1) - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
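
    In Python, using the middle string returned above:

    >>> int('10000000010', 2)          # the raw exponent field
    1026
    >>> int('10000000010', 2) - 1023   # subtract the bias for the true exponent
    3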

    Mantissa

    The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:

    6.0221413 × 10^23

    The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:

    1.0010011001100110011001100110011001100110011001100110

    This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.

    When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 2^52 to move it 52 places to the right:

    0.0010011001100110011001100110011001100110011001100110

    In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see 675539944105574/4503599627370496.)

    Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
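
    The same steps in Python (note that the interpreter prints the shortest decimal string that round-trips, so 1.15 below is just a tidy display of the long, inexact fraction above):

    >>> m = int('0010011001100110011001100110011001100110011001100110', 2)
    >>> m
    675539944105574
    >>> m / 2**52       # shift the radix point 52 places left
    0.1499999999999999
    >>> 1 + m / 2**52   # restore the implicit leading 1
    1.15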

    Recapping the Components

  • Sign (first component): 0 for positive, 1 for negative
  • Exponent (middle component): Subtract 2^(# of bits - 1) - 1 to get the true exponent
  • Mantissa (last component): Divide by 2^(# of bits) and add 1 to get the true mantissa

    Calculating the Number

    Putting all three parts together, we're given this binary number:

    1.0010011001100110011001100110011001100110011001100110 × 10^11

    Which we can then convert from binary to decimal:

    1.1499999999999999 × 2^3 (inexact!)

    And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:

    9.1999999999999993
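
    We can check this whole calculation in Python, reusing the three bit strings from float_to_bin_parts (this is just the recipe above restated, with the 64-bit constants 52 and 1023 hard-coded):

    >>> s, e, m = float_to_bin_parts(9.2)
    >>> value = (-1) ** int(s, 2) * (1 + int(m, 2) / 2**52) * 2 ** (int(e, 2) - 1023)
    >>> value                  # Python prints the shortest string that round-trips
    9.2
    >>> format(value, '.17g')  # ask for 17 significant digits to see the truth
    '9.1999999999999993'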


    Representing as a Fraction

    9.2

    Now that we've built the number, it's possible to reconstruct it into a simple fraction:

    1.0010011001100110011001100110011001100110011001100110 × 10^11

    Shift mantissa to a whole number:

    10010011001100110011001100110011001100110011001100110 × 10^(11-110100)

    Convert to decimal:

    5179139571476070 × 2^(3-52)

    Subtract the exponent:

    5179139571476070 × 2^-49

    Turn negative exponent into division:

    5179139571476070 / 2^49

    Multiply exponent:

    5179139571476070 / 562949953421312

    Which equals:

    9.1999999999999993
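
    Python's fractions module will confirm the result (Fraction converts the float exactly, with no rounding of its own):

    >>> from fractions import Fraction
    >>> Fraction(9.2)                        # exactly the fraction we derived
    Fraction(5179139571476070, 562949953421312)
    >>> 5179139571476070 / 562949953421312   # and it prints back as 9.2
    9.2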

    9.5

    >>> float_to_bin_parts(9.5)
    ['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
    

    Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.

    Assemble the binary scientific notation:

    1.0011 × 10^11

    Shift the decimal point:

    10011 × 10^(11-100)

    Subtract the exponent:

    10011 × 10^-1

    Binary to decimal:

    19 × 2^-1

    Negative exponent to division:

    19 / 2^1

    Multiply exponent:

    19 / 2

    Equals:

    9.5
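
    Because 9.5 is exactly representable, the same checks come out clean:

    >>> (9.5).as_integer_ratio()
    (19, 2)
    >>> 19 / 2 == 9.5
    True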



    Further reading

  • The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don't my numbers add up? (floating-point-gui.de)
  • What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
  • IEEE Double-precision floating-point format (Wikipedia)
  • Floating Point Arithmetic: Issues and Limitations (docs.python.org)
  • Floating Point Binary

    This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.

    Consider the fraction 2/3

    In good-ol' base 10, we typically write it out as something like

  • 0.666...
  • 0.666
  • 0.667

    When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
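
    We can put numbers on that comparison with Python's fractions module (Fraction accepts a float and converts it exactly, so both errors below are computed without any rounding):

    >>> from fractions import Fraction
    >>> abs(Fraction(2, 3) - Fraction('0.667'))   # base-10 approximation error, about 3.3e-04
    Fraction(1, 3000)
    >>> abs(Fraction('9.2') - Fraction(9.2))      # base-2 approximation error, about 7.1e-16
    Fraction(1, 1407374883553280)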

    Number bases

    So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then

    (2/3)₁₀ = 0.2₃

    In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.

    To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in bases 10 and 2, it requires a repeating representation in base 3.

    (1/2)₁₀ = 0.5₁₀ = 0.1₂ = 0.111...₃
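
    You can experiment with this yourself; here's a small sketch (a hypothetical helper, not from any library) that computes the first fractional digits of numerator/denominator in any base by long division:

    def fractional_digits(numerator, denominator, base, count=10):
        # Long division: each step multiplies the remainder by the base
        # and peels off one fractional digit.
        digits = []
        remainder = numerator % denominator
        for _ in range(count):
            remainder *= base
            digits.append(remainder // denominator)
            remainder %= denominator
        return digits

    >>> fractional_digits(2, 3, 10)  # 2/3 in base 10: 0.666...
    [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
    >>> fractional_digits(2, 3, 3)   # 2/3 in base 3: exactly 0.2
    [2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    >>> fractional_digits(1, 2, 3)   # 1/2 in base 3: 0.111... repeating
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]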

    Why are floating point numbers inaccurate?

    Because oftentimes they are approximating rationals that cannot be represented finitely in base 2 (their digits repeat), and in general they are approximating real (possibly irrational) numbers that may not be representable in finitely many digits in any base.


    While all of the other answers are good, there is still one thing missing:

    It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!

    And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.

    If you limit your math needs to rational numbers only, though, the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions, just as in high school math (e.g. a/b * c/d = (a*c)/(b*d)).
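
    Python's standard library ships exactly this kind of fraction type; a brief demonstration:

    >>> from fractions import Fraction
    >>> Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)  # exact rational arithmetic
    True
    >>> 0.1 + 0.2 == 0.3                                      # the same sum in binary floats
    False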

    But of course you would still run into the same kind of trouble when pi , sqrt , log , sin , etc. are involved.

    TL;DR

    With hardware-accelerated arithmetic, only a limited set of rational numbers can be represented. Every non-representable number is approximated. Some numbers (i.e. irrational ones) can never be represented exactly, no matter the system.
