Floating point arithmetic

2018-06-30 16:15:53

I am reading about floating point and the rounding errors that occur during floating-point arithmetic.

I have read a lot of articles on the IEEE 754 single-precision and double-precision formats. I understand that there is a sign bit, 8 (or 11) bits of exponent, and 23 (or 52) bits of significand, along with an implicit leading bit.

I also know that fractions whose denominator (in lowest terms) is not a power of 2 cannot be represented exactly. For example, 0.1 in binary is 0.0001100110011...
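As a quick check of this point, here is a minimal Python sketch (Python is just a convenient choice here, not something the question uses). It assumes Python floats are IEEE 754 doubles, which is the case on common platforms, and prints the exact value actually stored for 0.1 and 0.5.

```python
from decimal import Decimal
from fractions import Fraction

# Exact rational value of the double nearest to each literal.
tenth = Fraction(0.1)
half = Fraction(0.5)

print(tenth)                      # numerator/denominator; the denominator is a power of 2
print(Decimal(0.1))               # the same stored value written out in decimal
print(half)                       # 0.5 is stored exactly as 1/2
print(tenth == Fraction(1, 10))   # the stored value is not exactly 1/10
```

The denominator of the stored value is always a power of 2, which is why 1/10 can only be approximated while 1/2 is exact.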

I understand that 0.1+0.1+0.1 is not equal to 0.3 because of the accumulation of rounding error.

Also, 0.5 is exactly representable in binary because it is 1/2. But given the accumulation of rounding error described above, I don't understand why 0.1+0.1+0.1+0.1+0.1 == 0.5.

In IEEE 754 round-to-nearest-even mode, you have some nice properties.
First, for any finite float x and n < 54, (2^n-1)*x + x == 2^n*x (see "Is 3*x+x always exact?").

Then you also have (2^n+1)*x == 2^n*x + x
(as long as 2^n+1 is exactly representable, that is, n < 53).
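These two properties can be spot-checked empirically. The following is a minimal sketch, assuming Python floats are IEEE 754 doubles under the default rounding mode; `check_properties` is a hypothetical helper name, and random sampling over a moderate range is only a sanity check, not a proof.

```python
import random

def check_properties(samples=1000, seed=0):
    """Spot-check both identities on random doubles of moderate magnitude."""
    rng = random.Random(seed)
    for _ in range(samples):
        x = rng.uniform(-1e6, 1e6)
        # First property: (2^n - 1)*x + x == 2^n*x for n < 54.
        for n in range(1, 54):
            if (2**n - 1) * x + x != 2**n * x:
                return False
        # Second property: (2^n + 1)*x == 2^n*x + x for n < 53.
        for n in range(1, 53):
            if (2**n + 1) * x != 2**n * x + x:
                return False
    return True

print(check_properties())
```

The range of x is kept small on purpose so that neither overflow nor underflow can interfere with the comparison.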

With these properties, you have

0.1+0.1==2*0.1

0.1+0.1+0.1 == 3*0.1

0.1+0.1+0.1+0.1 == 4*0.1

0.1+0.1+0.1+0.1+0.1 == 5*0.1
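The four equalities above can be confirmed directly. This is a small sketch, again assuming IEEE 754 doubles; it adds 0.1 from left to right, as the expressions above do, and compares each running sum with the corresponding product.

```python
x = 0.1
total = x
results = {}

# Build 0.1+0.1, 0.1+0.1+0.1, ... left to right and compare with n*0.1.
for n in range(2, 6):
    total += x
    results[n] = (total == n * x)
    print(n, repr(total), repr(n * x), results[n])
```

Note that this only shows that repeated addition agrees with multiplication for n up to 5; it says nothing yet about whether n*0.1 equals n/10.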

This is not enough, because at this stage, 0.1 is not exactly 1/10, so nothing proves that 5*0.1 == 0.5.
For example, 3*0.1 != 0.3 (3*0.1 evaluates to 0.30000000000000004).

So here, it's just luck: the rounding errors cancelled out instead of accumulating.
(n*0.1 == n/10.0) is true for 65 of the 100 integers n from 1 to 100 (and always true for the 7 powers of two in this interval).
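The count quoted above can be reproduced with a few lines. This is a sketch assuming IEEE 754 doubles; `matches` is just an illustrative variable name.

```python
# Integers n in 1..100 for which multiplying by 0.1 agrees with dividing by 10.
matches = [n for n in range(1, 101) if n * 0.1 == n / 10.0]
mismatches = [n for n in range(1, 101) if n not in matches]

print(len(matches), "of 100 match")
print("mismatches:", mismatches)
```

Powers of two always match because multiplying by 2^k only changes the exponent, so no new rounding happens on either side.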

0.1 in double precision is 0.0001100110011001100110011001100110011001100110011001101 in binary. Let's step through the binary additions to see what's happening:

  0.0001100110011001100110011001100110011001100110011001101
+
  0.0001100110011001100110011001100110011001100110011001101
-----------------------------------------------------------
  0.001100110011001100110011001100110011001100110011001101   (52 sig bits -- OK)
+
  0.0001100110011001100110011001100110011001100110011001101
-----------------------------------------------------------
  0.0100110011001100110011001100110011001100110011001100111  (54 sig bits -- must round to 53)
  0.0100110011001100110011001100110011001100110011001101     (rounded up)
+
  0.0001100110011001100110011001100110011001100110011001101
-----------------------------------------------------------
  0.0110011001100110011001100110011001100110011001100110101  (54 sig bits -- must round to 53)
  0.01100110011001100110011001100110011001100110011001101    (rounded down)
+
  0.0001100110011001100110011001100110011001100110011001101
-----------------------------------------------------------
  0.1000000000000000000000000000000000000000000000000000001 (55 sig bits -- must round to 53)
  0.1                                                       (rounded down)

So just due to how the roundings accumulated, 0.1 added five times became 0.5.
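The same walk-through can be reproduced without converting by hand. The sketch below assumes IEEE 754 doubles and uses exact rational arithmetic from the standard library to compare each true sum with the double that is actually stored; `rounding_steps` is a hypothetical helper name.

```python
from fractions import Fraction

def rounding_steps(x, count):
    """Add x to itself left to right and report how each addition was rounded."""
    total = x
    steps = []
    for _ in range(count - 1):
        exact = Fraction(total) + Fraction(x)   # true sum of the two doubles
        total = total + x                       # sum as rounded to a double
        stored = Fraction(total)
        if stored == exact:
            steps.append("exact")
        elif stored > exact:
            steps.append("up")
        else:
            steps.append("down")
    return total, steps

total, steps = rounding_steps(0.1, 5)
print(total, steps)
```

The reported directions should line up with the annotations in the binary additions above: one exact sum, one rounding up, then two roundings down.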

(I got these values from my binary converter, binary calculator, and floating-point converter.)
