Floating point comparison revisited

2018-06-04 15:06:43

This topic has come up many times on StackOverflow, but I believe this is a new take. Yes, I have read Bruce Dawson's articles and What Every Computer Scientist Should Know About Floating-Point Arithmetic and this nice answer.

As I understand it, on a typical system there are four basic problems when comparing floating-point numbers for equality:

Floating point calculations are not exact

Whether ab is "small" depends on the scale of a and b

Whether ab is "small" depends on the type of a and b (eg float, double, long double)

Floating point typically has +-infinity, NaN, and denormalized representations, any of which can interfere with a naïve formulation

This answer -- aka. "the Google approach" -- seems to be popular. It does handle all of the tricky cases. And it does scale the comparison very precisely, checking whether two values are within a fixed number of ULPs of each other. Thus, for example, a very large number compares "almost equal" to infinity.

However:

It is very messy, in my opinion.

It is not particularly portable, relying heavily on internal representations, using a union to read the bits from a float, etc.

It only handles single-precision and double-precision IEEE 754 (in particular, no x86 long double)

I want something similar, but using standard C++ and handling long doubles. By "standard", I mean C++03 if possible and C++11 if necessary.

Here is my attempt.

#include <cmath>
#include <limits>
#include <algorithm>

namespace {
// Local version of frexp() that handles infinities specially.
template<typename T>
T my_frexp(const T num, int *exp)
{
    typedef std::numeric_limits<T> limits;

    // Treat +-infinity as +-(2^max_exponent).
    if (std::abs(num) > limits::max())
    {
        *exp = limits::max_exponent + 1;
        return std::copysign(0.5, num);
    }
    else return std::frexp(num, exp);
}
}

template<typename T>
bool almostEqual(const T a, const T b, const unsigned ulps=4)
{
    // Handle NaN.
    if (std::isnan(a) || std::isnan(b))
        return false;

    typedef std::numeric_limits<T> limits;

    // Handle very small and exactly equal values.
    if (std::abs(a-b) <= ulps * limits::denorm_min())
        return true;

    // frexp() does the wrong thing for zero.  But if we get this far
    // and either number is zero, then the other is too big, so just
    // handle that now.
    if (a == 0 || b == 0)
        return false;

    // Break the numbers into significand and exponent, sorting them by
    // exponent.
    int min_exp, max_exp;
    T min_frac = my_frexp(a, &min_exp);
    T max_frac = my_frexp(b, &max_exp);
    if (min_exp > max_exp)
    {
        std::swap(min_frac, max_frac);
        std::swap(min_exp, max_exp);
    }

    // Convert the smaller to the scale of the larger by adjusting its
    // significand.
    const T scaled_min_frac = std::ldexp(min_frac, min_exp-max_exp);

    // Since the significands are now in the same scale, and the larger
    // is in the range [0.5, 1), 1 ulp is just epsilon/2.
    return std::abs(max_frac-scaled_min_frac) <= ulps * limits::epsilon() / 2;
}

I claim that this code (a) handles all of the relevant cases, (b) does the same thing as the Google implementation for IEEE-754 single- and double-precision, and (c) is perfectly standard C++.

One or more of these claims is almost certainly wrong. I will accept any answer that demonstrates such, preferably with a fix. A good answer should include one or more of:

Specific inputs differing by more than ulps Units in Last Place, but for which this function returns true (the bigger the difference, the better)

Specific inputs differing by up to ulps Units in Last Place, but for which this function returns false (the smaller the difference, the better)

Any case(s) I have missed

Any way in which this code relies on undefined behavior or breaks depending on implementation-defined behavior. (If possible, please cite a relevant spec.)

Fixes for whatever problem(s) you identify

Any way to simplify the code without breaking it

I intend to place a non-trivial bounty on this question.

“Almost Equals” Is Not a Good Function

4 is not an appropriate value: The answer you point to states “Therefore, 4 should be enough for ordinary use” but contains no basis for that claim. In fact, there are ordinary situations in which numbers calculated in floating-point by different means may differ by many ULP even though they would be equal if calculated by exact mathematics. Therefore, there should be no default value for the tolerance; each user should be required to supply their own, hopefully based on thorough analysis of their code.

As an example of why a default of 4 ULP is bad, consider 1./49*49-1 . The mathematically exact result is 0, but the computed result (64-bit IEEE 754 binary) is -0x1p-53, an error exceeding 1e307 ULP of the exact result and almost 1e16 ULP of the computed result.

Sometimes, no value is appropriate: In some cases, the tolerance cannot be relative to the values being compared, neither a mathematically exact relative tolerance nor a quantized ULP tolerance. For example, nearly every output value in an FFT is affected by nearly every input value, and the error in any one element is related to the magnitude of other elements. An “almost equals” routine must be supplied additional context with information about the potential error.

“Almost Equals” has poor mathematical properties: This shows one of the shortcomings of “almost equals”: Scaling changes the results. The code below prints 1 and 0.

double x0 = 1.1;
double x1 = 1.1 + 3*0x1p-52;
std::cout << almostEqual(x0, x1) << "n";
x0 *= .8;
x1 *= .8;
std::cout << almostEqual(x0, x1) << "n";

Another failing is that it is not transitive; almostEqual(a, b) and almostEqual(b, c) does not imply almostEqual(a, c) .

A Bug in Extreme Cases

almostEqual(1.f, 1.f/11, 0x745d17) incorrectly returns 1.

1.f/11 is 0x1.745d18p-4. Subtracting this from 1 (which is 0x10p-4) yields 0xe.8ba2e8p-4. Since an ULP of 1 is 0x1p-23, that is 0xe.8ba2e8p19 ULP = 0xe8ba2e.8/2 ULP (shifted 20 bits and divided by 2, netting 19 bits) = 0x745d17.4 ULP. That exceeds the specified tolerance of 0x745d17, so the correct answer would be 0.

This error is caused by rounding in max_frac-scaled_min_frac .

An easy escape from this problem is to specify that ulps must be less than .5/limits::epsilon . Then rounding occurs in max_frac-scaled_min_frac only if the difference (even when rounded) exceeds ulps ; if the difference is less than that, the subtraction is exact, by Sterbenz' Lemma.

There was a suggestion about using long double to correct this. However, long double would not correct this. Consider comparing 1 and -0x1p-149f with ulps set to 1/limits::epsilon. Unless your significand has 149 bits, the subtraction result rounds to 1, which is less than or equal to 1/limits::epsilon ULP. Yet the mathematical difference clearly exceeds 1.

Minor Note

The expression factor * limits::epsilon / 2 converts factor to the floating-point type, which causes rounding errors for large values of factor that are not exactly representable. Likely, the routine is not intended to be used with such large values (millions of ULPs in float), so this ought to be specified as a limit on the routine rather than a bug.

Simplification: You could avoid my_frexp by discarding the non finite cases first all-together:

if( ! std::isfinite(a) || ! std::isfinite(b) )
    return a == b;

It seems that isfinite is in C++11 at least

EDIT However, if intention is to have limits::infinity() within 1 ulp of limits::max()
then above simplification does not hold, but shouldn't my_frexp() return limits::max_exponent+1 in *exp , rather than max_exponent+2 ?

FUTURE PROOFING : If you ever want to extend such comparison to decimal float http://en.wikipedia.org/wiki/Decimal64_floating-point_format in the future, and assuming that ldexp() and frexp() will handle such type with correct radix, then striclty speaking, 0.5 in return std::copysign(0.5, num); should be replaced by T(1)/limits::radix() - or std::ldexp(T(1),-1) or something... (I could not find a convenient constant in std::numeric_limits)

EDIT As Nemo commented, the assumptions that ldexp and frexp would use the correct FLOAT_RADIX are false, they stick with 2...

So a Future Proof portable version should also use:

std::scalbn(x,n) instead of std::ldexp(x,n)

exp=std::ilogb(std::abs(x)),y=std::scalbn(x,-exp) instead of y=frexp(x,&exp)

now that above y in is [1,FLOAT_RADIX) instead of [T(1)/Float_Radix,1), return copysign(T(1),num) instead of 0.5 for infinite case of my_frexp, and test for ulps*limits::epsilon() instead of ulps*epsilon()/2

That also require a standard >= C++11

链接地址: http://www.djcxy.com/p/15000.html

上一篇: 包装IEEE754单

下一篇: 浮点比较重新审视