Is floating point precision mutable or invariant?

2018-06-28 03:22:55

I keep getting mixed answers about whether floating point numbers (i.e. float, double, or long double) have one and only one value of precision, or have a precision value which can vary.

One topic called float vs. double precision seems to imply that floating point precision is an absolute.

However, another topic called Difference between float and double says,

In general a double has 15 to 16 decimal digits of precision

Another source says,

Variables of type float typically have a precision of about 7 significant digits

Variables of type double typically have a precision of about 16 significant digits

I don't like to refer to approximations like the above if I'm working with sensitive code that can break easily when my values are not exact. So let's set the record straight. Is floating point precision mutable or invariant, and why?

The precision is fixed: exactly 53 binary digits for double-precision (or 52 if we exclude the implicit leading 1). This comes out to about 15 decimal digits.
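As a quick check, here is a minimal sketch in Python, assuming the interpreter's float is an IEEE 754 double (true on essentially all current platforms). The standard `sys.float_info` structure reports both figures; the names `MANT_BITS` and `DECIMAL_DIGITS` are only illustrative.

```python
import sys

# Binary digits in the significand of a float, counting the implicit leading 1.
MANT_BITS = sys.float_info.mant_dig

# Decimal digits guaranteed to survive a decimal -> float -> decimal round trip.
DECIMAL_DIGITS = sys.float_info.dig

print(MANT_BITS, DECIMAL_DIGITS)
```

The binary figure is the fixed property of the format; the decimal figure is a conservative summary of it.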

The OP asked me to elaborate on why having exactly 53 binary digits means "about" 15 decimal digits.

To understand this intuitively, let's consider a less-precise floating-point format: instead of a 52-bit mantissa like double-precision numbers have, we're just going to use a 4-bit mantissa.

So, each number will look like: (-1)^s × 2^yyy × 1.xxxx (where s is the sign bit, yyy is the exponent, and 1.xxxx is the normalised mantissa). For the immediate discussion, we'll focus only on the mantissa and not the sign or exponent.

Here's a table of what 1.xxxx looks like for all xxxx values (all rounding is half-to-even, just like how the default floating-point rounding mode works):

  xxxx  |  1.xxxx  |  value   |  2dd  |  3dd  
--------+----------+----------+-------+--------
  0000  |  1.0000  |  1.0     |  1.0  |  1.00
  0001  |  1.0001  |  1.0625  |  1.1  |  1.06
  0010  |  1.0010  |  1.125   |  1.1  |  1.12
  0011  |  1.0011  |  1.1875  |  1.2  |  1.19
  0100  |  1.0100  |  1.25    |  1.2  |  1.25
  0101  |  1.0101  |  1.3125  |  1.3  |  1.31
  0110  |  1.0110  |  1.375   |  1.4  |  1.38
  0111  |  1.0111  |  1.4375  |  1.4  |  1.44
  1000  |  1.1000  |  1.5     |  1.5  |  1.50
  1001  |  1.1001  |  1.5625  |  1.6  |  1.56
  1010  |  1.1010  |  1.625   |  1.6  |  1.62
  1011  |  1.1011  |  1.6875  |  1.7  |  1.69
  1100  |  1.1100  |  1.75    |  1.8  |  1.75
  1101  |  1.1101  |  1.8125  |  1.8  |  1.81
  1110  |  1.1110  |  1.875   |  1.9  |  1.88
  1111  |  1.1111  |  1.9375  |  1.9  |  1.94

How many decimal digits would you say that provides? You could say 2, in that each value in the two-decimal-digit range is covered, albeit not uniquely; or you could say 3, which covers all unique values, but does not provide coverage for all values in the three-decimal-digit range.

For the sake of argument, we'll say it has 2 decimal digits: the decimal precision will be the number of digits where all values of those decimal digits could be represented.
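The table above can be regenerated with a short sketch. This is only an illustration of the toy 4-bit format, not of any real hardware format; the function name `toy_table` is made up here, and Python's `decimal` module is used so that the half-to-even rounding is explicit.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def toy_table(mant_bits=4):
    """Rows (bits, exact value, 2-digit rounding, 3-digit rounding) for 1.xxxx."""
    rows = []
    for m in range(2 ** mant_bits):
        # 1 + m/16 is a dyadic fraction, so the float and its Decimal are exact.
        value = Decimal(1 + m / 2 ** mant_bits)
        two = value.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)
        three = value.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)
        rows.append((format(m, f"0{mant_bits}b"), value, two, three))
    return rows

rows = toy_table()
for bits, value, two, three in rows:
    print(bits, value, two, three)

# Distinct decimal values reached at each digit count.
distinct_two = {two for _, _, two, _ in rows}
distinct_three = {three for _, _, _, three in rows}
print(len(distinct_two), len(distinct_three))
```

The 16 bit patterns reach all 10 two-digit values from 1.0 to 1.9, but only 16 of the 100 three-digit values from 1.00 to 1.99.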

Okay, then, so what happens if we halve all the numbers (so we're using yyy = -1)?

  xxxx  |  1.xxxx  |  value    |  1dd  |  2dd  
--------+----------+-----------+-------+--------
  0000  |  1.0000  |  0.5      |  0.5  |  0.50
  0001  |  1.0001  |  0.53125  |  0.5  |  0.53
  0010  |  1.0010  |  0.5625   |  0.6  |  0.56
  0011  |  1.0011  |  0.59375  |  0.6  |  0.59
  0100  |  1.0100  |  0.625    |  0.6  |  0.62
  0101  |  1.0101  |  0.65625  |  0.7  |  0.66
  0110  |  1.0110  |  0.6875   |  0.7  |  0.69
  0111  |  1.0111  |  0.71875  |  0.7  |  0.72
  1000  |  1.1000  |  0.75     |  0.8  |  0.75
  1001  |  1.1001  |  0.78125  |  0.8  |  0.78
  1010  |  1.1010  |  0.8125   |  0.8  |  0.81
  1011  |  1.1011  |  0.84375  |  0.8  |  0.84
  1100  |  1.1100  |  0.875    |  0.9  |  0.88
  1101  |  1.1101  |  0.90625  |  0.9  |  0.91
  1110  |  1.1110  |  0.9375   |  0.9  |  0.94
  1111  |  1.1111  |  0.96875  |  1.   |  0.97

By the same criteria as before, we're now dealing with 1 decimal digit. So you can see how, depending on the exponent, you can have more or fewer decimal digits, because binary and decimal floating-point numbers do not map cleanly to each other.
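The criterion used above (the largest digit count for which every decimal value in the range is reachable) can be checked mechanically. The following is a rough sketch for the toy format only; `covered_digits` and its parameters are hypothetical names, and the search over powers of ten is deliberately limited to small exponents.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

def covered_digits(exponent, mant_bits=4, max_digits=4):
    """Largest d such that every d-significant-digit decimal in
    [2**exponent, 2**(exponent+1)) is the rounding of some toy-format value."""
    steps = 2 ** mant_bits
    # All values 1.xxxx * 2**exponent, converted exactly to Decimal.
    values = [Decimal((1 + m / steps) * 2.0 ** exponent) for m in range(steps)]
    lo, hi = Decimal(2.0 ** exponent), Decimal(2.0 ** (exponent + 1))
    best = 0
    for d in range(1, max_digits + 1):
        ctx = Context(prec=d, rounding=ROUND_HALF_EVEN)
        reachable = {ctx.create_decimal(v) for v in values}
        # Enumerate every decimal with exactly d significant digits in [lo, hi).
        targets = []
        for p in range(-6, 7):
            for n in range(10 ** (d - 1), 10 ** d):
                t = Decimal(n) * Decimal(10) ** p
                if lo <= t < hi:
                    targets.append(t)
        if not all(t in reachable for t in targets):
            break
        best = d
    return best

print(covered_digits(0), covered_digits(-1))
```

With exponent 0 this reports 2 digits and with exponent -1 it reports 1, matching the two tables.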

The same argument applies to double-precision floating point numbers (with the 52-bit mantissa), only in that case you're getting either 15 or 16 decimal digits depending on the exponent.
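One way to see this on real doubles is a round-trip test: convert a decimal string to a double and format it back with the same number of significant digits. This is a sketch under the assumption that Python's float is an IEEE 754 double; `survives_round_trip` is an illustrative helper that only handles plain positive literals without an exponent.

```python
from decimal import Decimal

def survives_round_trip(text):
    """True if decimal text -> double -> decimal (same digit count) is unchanged."""
    # Count significant digits: drop the decimal point and any leading zeros.
    digits = len(text.replace(".", "").lstrip("0"))
    back = format(float(text), f".{digits}g")
    return Decimal(back) == Decimal(text)

# A 16-digit value near 1 survives, because doubles are dense enough there.
print(survives_round_trip("1.234567890123456"))
# A 16-digit value just above 2**53 does not: odd integers are not representable there.
print(survives_round_trip("9007199254740993"))
# A 15-digit value in the same neighbourhood survives.
print(survives_round_trip("900719925474099"))
```

Whether the sixteenth digit is trustworthy therefore depends on where the value falls relative to the surrounding powers of two and ten.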

All modern computers use binary floating-point arithmetic. That means we have a binary mantissa, which typically has 24 bits for single precision, 53 bits for double precision and 64 bits for extended precision. (Extended precision is available on x86 processors, but not on ARM or possibly other types of processors.)

24-, 53-, and 64-bit mantissas mean that for a floating-point number between 2^k and 2^(k+1), the next larger number is 2^(k-23), 2^(k-52) and 2^(k-63) away respectively. That's the resolution. The rounding error of each floating-point operation is at most half of that.
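The resolution can be computed directly from the binary exponent. The sketch below uses only `math.frexp` and `math.ldexp`; the helper name `spacing` is an assumption for illustration, and it ignores zero, subnormals, infinities and NaN.

```python
import math

def spacing(x, mant_bits):
    """Gap from x to the next larger value in a binary format with mant_bits
    significand bits (counting the implicit leading 1). Assumes x > 0, normal."""
    # frexp gives x = m * 2**e with 0.5 <= m < 1, so 2**(e-1) <= x < 2**e.
    _, e = math.frexp(x)
    return math.ldexp(1.0, e - mant_bits)

for bits in (24, 53, 64):
    print(bits, spacing(1.0, bits), spacing(8.0, bits))

# For doubles, adding a full gap changes the value; adding half a gap to 1.0 does not.
gap = spacing(1.0, 53)
print(1.0 + gap > 1.0, 1.0 + gap / 2 == 1.0)
```

For k = 0 this gives 2^-23, 2^-52 and 2^-63, and for 8 ≤ x < 16 each gap is 8 times larger.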

So how does that translate into decimal numbers? It depends.

Take k = 0 and 1 ≤ x < 2. The resolution is 2^-23, 2^-52, and 2^-63, which is about 1.19×10^-7, 2.2×10^-16, and 1.08×10^-19 respectively. That's a bit less than 7, 16, and 19 decimals. Then take k = 3 and 8 ≤ x < 16. The difference between two floating-point numbers is now 8 times larger. For 8 ≤ x < 10 you get just over 6, less than 15, and just over 18 decimals respectively. But for 10 ≤ x < 16 you get one decimal more!

You get the highest number of decimal digits if x is only a bit less than 2^(k+1) and only a bit more than 10^n, for example 1000 ≤ x < 1024. You get the lowest number of decimal digits if x is just a bit higher than 2^k and a bit less than 10^n, for example 1⁄1024 ≤ x < 1⁄1000. The same binary precision can produce decimal precision that varies by up to 1.3 digits, or log10(2×10).
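To put numbers on this, here is a sketch that follows the counting used above: the decimal exponent of x minus log10 of the gap to the next value. The function name `significant_decimals` is hypothetical, and the decimal exponent is read from a formatted string to avoid rounding trouble in `log10` near powers of ten.

```python
import math

def significant_decimals(x, mant_bits=53):
    """Decimal digits of resolution at x, counted as in the text above.
    Assumes x > 0 and a normal value."""
    _, e = math.frexp(x)                      # 2**(e-1) <= x < 2**e
    gap = math.ldexp(1.0, e - mant_bits)      # spacing of the format at x
    decimal_exponent = int(f"{x:.17e}".split("e")[1])
    return decimal_exponent - math.log10(gap)

best = significant_decimals(1000.0)       # just above 10**3, just below 2**10
worst = significant_decimals(1 / 1024)    # just above 2**-10, just below 10**-3
print(best, worst, best - worst)
print(significant_decimals(8.0), significant_decimals(10.0))
```

For doubles the two extremes come out about 1.28 digits apart, just under the log10(20) ≈ 1.3 bound, and moving from 8 to 10 gains exactly one digit because the gap is unchanged while the decimal exponent goes up by one.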

Of course, you could just read the article "What every computer scientist should know about floating-point arithmetic."

80x86 code using its hardware coprocessor (originally the 8087) provides three levels of precision: 32-bit, 64-bit, and 80-bit. Those very closely follow the IEEE-754 standard of 1985. The more recent standard (IEEE 754-2008) specifies a 128-bit format. The floating point formats have 24, 53, 64, and 113 mantissa bits, which correspond to 7.22, 15.95, 19.27, and 34.02 decimal digits of precision.

The formula is mantissa_bits / log2(10), where the log base two of ten is approximately 3.321928095.
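As a small worked example of the formula, the sketch below evaluates it for the usual significand widths. It takes 64 bits for the 80-bit x87 format; the function name `decimal_digits` is just illustrative.

```python
import math

def decimal_digits(mant_bits):
    """Equivalent decimal digits for a significand of mant_bits binary digits."""
    return mant_bits / math.log2(10)

# Single, double, x87 extended, and 128-bit quadruple precision.
for bits in (24, 53, 64, 113):
    print(bits, round(decimal_digits(bits), 2))
```

These are averages over the whole range; the digits actually available at a particular value can be somewhat higher or lower.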

While the precision of any particular implementation does not vary, it may appear to when a floating point value is converted to decimal. Note that the value 0.1 does not have an exact binary representation. It is a repeating bit pattern (0.0001100110011001100110011001100...), just as 0.3333333333333... is the repeating decimal pattern we use to approximate 1/3.
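The value actually stored for 0.1 can be inspected exactly. This sketch assumes Python's float is an IEEE 754 double and uses only the standard `decimal` and `fractions` modules; the variable names are illustrative.

```python
from decimal import Decimal
from fractions import Fraction

# Exact decimal expansion of the double closest to 0.1.
stored = Decimal(0.1)
# The same value as an exact ratio of integers; the denominator is a power of two.
as_fraction = Fraction(0.1)
# Hexadecimal form, which shows the repeating 1001 pattern cut off and rounded.
as_hex = (0.1).hex()

print(stored)
print(as_fraction)
print(as_hex)
```

The stored number is fixed and exact; it simply is not one tenth, which is why decimal output can look as if the precision had changed.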

Many languages don't support the 80-bit format. Some C compilers may offer long double which uses either 80-bit floats or 128-bit floats. Alas, it might also use a 64-bit float, depending on the implementation.

The NPU has 80-bit registers and performs all operations using the full 80-bit result. Code which calculates within the NPU stack benefits from this extra precision. Unfortunately, poor code generation (or poorly written code) might truncate or round intermediate calculations by storing them in a 32-bit or 64-bit variable.
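Python has no 80-bit type, so the effect can only be shown by analogy: compute in 64-bit doubles and compare keeping the running total wide against storing it in a 32-bit float after every step. This is a sketch of the same kind of loss, not of x87 behaviour itself; `to_float32` and `accumulate` are hypothetical helpers built on the standard `struct` module.

```python
import struct

def to_float32(x):
    """Round a double to the nearest 32-bit float and widen it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, count, narrow):
    """Add step to a running total count times. If narrow is true, the total
    is stored in a 32-bit float after every addition."""
    total = 0.0
    for _ in range(count):
        total += step
        if narrow:
            total = to_float32(total)
    return total

wide = accumulate(0.1, 1000, narrow=False)
narrowed = accumulate(0.1, 1000, narrow=True)
print(wide, abs(wide - 100))
print(narrowed, abs(narrowed - 100))
```

The arithmetic is identical in both runs; only the width of the stored intermediate differs, and that alone makes the error several orders of magnitude larger.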
