Most accurate way to compute asinhf() from log1pf()?
The inverse hyperbolic sine function asinh() is closely related to the natural logarithm. I am trying to determine the most accurate way to compute asinh() from the C99 standard math function log1p(). For ease of experimentation, I am limiting myself to IEEE-754 single-precision computation right now, that is, I am looking at asinhf() and log1pf(). I intend to re-use the exact same algorithm for double-precision computation, i.e. asinh() and log1p(), later.
My primary goal is to minimize ulp error; the secondary goal is to minimize the number of incorrectly rounded results, under the constraint that the improved code would at most be minimally slower than the versions posted below. Any incremental improvement to accuracy, say 0.2 ulp, would be welcome. Adding a couple of FMAs (fused multiply-adds) would be fine; on the other hand, I am hoping someone could identify a solution which employs a fast rsqrtf() (reciprocal square root).
The resulting C99 code should lend itself to vectorization, possibly by some minor straightforward transformations. All intermediate computation must occur at the precision of the function argument and result, as any switch to higher precision may have a severe negative performance impact. The code must work correctly both with IEEE-754 denormal support and in FTZ (flush to zero) mode.
So far, I have identified the following two candidate implementations. Note that the code may easily be transformed into a branchless, vectorizable version with a single call to log1pf() (a sketch of this follows the code below), but I have not done so at this stage to avoid unnecessary obfuscation.
/* for a >= 0, asinh(a) = log (a + sqrt (a*a+1))
                        = log1p (a + (sqrt (a*a+1) - 1))
                        = log1p (a + sqrt1pm1 (a*a))
                        = log1p (a + (a*a / (1 + sqrt (a*a+1))))
                        = log1p (a + a * (a / (1 + sqrt (a*a+1))))
                        = log1p (fma (a / (1 + sqrt (a*a+1)), a, a))
                        = log1p (fma (1 / (1/a + sqrt (1/(a*a)+1)), a, a))
*/
float my_asinhf (float a)
{
    float fa, t;
    fa = fabsf (a);
#if !USE_RECIPROCAL
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
        t = log1pf (t);
    }
#else // USE_RECIPROCAL
    if (fa > 0x1.0p126f) { // prevent underflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = 1.0f / fa;
        t = fmaf (1.0f / (t + sqrtf (fmaf (t, t, 1.0f))), fa, fa);
        t = log1pf (t);
    }
#endif // USE_RECIPROCAL
    return copysignf (t, a); // restore sign
}
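For reference, here is one way the branchless transformation mentioned above might look. This is my own sketch, not one of the measured implementations, and assumes the compiler maps the comparison and ternary selects to vector compare and blend instructions. It exploits the fact that for finite fa >= 2^64, fmaf (fa, fa, 1.0f) overflows to infinity and fa / (1.0f + inf) is exactly zero, so the normal-path transformation already reproduces fa; only a select (to keep infinities out of the NaN-producing quotient) and a conditional log(2) offset are needed.
float my_asinhf_branchless (float a)
{
    const float LOG2_F = 0x1.62e430p-1f;              // log(2)
    float fa = fabsf (a);
    float big = (fa >= 0x1.0p64f) ? 1.0f : 0.0f;      // 1.0f for huge arguments, else 0.0f
    // normal-path transformation; for finite fa >= 2**64 this evaluates to fa exactly,
    // because fmaf(fa,fa,1.0f) overflows to +inf and fa / (1.0f + inf) is +0
    float u = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
    float t = (big != 0.0f) ? fa : u;                 // select keeps fa for inf (u would be NaN)
    t = log1pf (t) + big * LOG2_F;                    // add log(2) only for huge arguments
    return copysignf (t, a);                          // restore sign
}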
With a particular log1pf() implementation that is accurate to < 0.6 ulp, I am observing the following error statistics when testing exhaustively across all 2^32 possible IEEE-754 single-precision inputs. With USE_RECIPROCAL = 0, the maximum error is 1.49486 ulp, and there are 353,587,822 incorrectly rounded results. With USE_RECIPROCAL = 1, the maximum error is 1.50805 ulp, and there are only 77,569,390 incorrectly rounded results.
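A minimal exhaustive test loop might look like the sketch below. This is my own illustration, not the harness used to produce the numbers above; it merely counts results that differ from a double-precision-derived reference, and a proper ulp-error measurement would require a reference computed to higher than double precision.
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

float my_asinhf (float a);   // function under test, defined above

int main (void)
{
    uint64_t wrong = 0;
    uint32_t i = 0;
    do {
        float x;
        memcpy (&x, &i, sizeof x);                 // reinterpret bit pattern as float
        float res = my_asinhf (x);
        float ref = (float) asinh ((double) x);    // reference, rounded to float
        if (memcmp (&res, &ref, sizeof res) != 0 && !(isnan (res) && isnan (ref))) {
            wrong++;                               // treat any NaN as matching any NaN
        }
        i++;
    } while (i != 0);                              // wraps to zero after 2**32 iterations
    printf ("mismatches vs double-precision reference: %llu\n", (unsigned long long) wrong);
    return 0;
}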
In terms of performance, the variant USE_RECIPROCAL = 0 will be faster if reciprocals and full divisions take roughly the same amount of time, but the variant USE_RECIPROCAL = 1 could be faster if very fast reciprocal support is available.
Answers can assume that all basic arithmetic, including FMA (fused multiply-add), is correctly rounded according to IEEE-754 round-to-nearest-or-even mode. In addition, faster, nearly correctly rounded versions of reciprocal and rsqrtf() may be available, where "nearly correctly rounded" means the maximum ulp error will be limited to something like 0.53 ulp and the overwhelming majority of results, say > 95%, are correctly rounded. Basic arithmetic with directed roundings may be available at no additional cost to performance.
Firstly, you may want to look into the accuracy and speed of your log1pf function: these can vary a bit between libms (I've found the OS X math functions to be fast, the glibc ones to be slower but typically correctly rounded).
Openlibm, based on the BSD libm, which in turn is based on Sun's fdlibm, uses multiple approaches by range, but the main bit is the relation:
t = x*x;
w = log1pf(fabsf(x)+t/(one+sqrtf(one+t)));
You may also want to try compiling with the -fno-math-errno option, which disables the old System V error codes for sqrt (IEEE-754 exceptions will still work).
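For example, assuming GCC or Clang and a placeholder file name:
cc -O2 -fno-math-errno -c my_asinhf.c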
After various additional experiments, I have convinced myself that a simple argument transformation that does not use higher precision than the argument and result cannot achieve a tighter error bound than the one achieved by the first variant in the code I posted.
Since my question is about minimizing the error of the argument transformation, which is incurred in addition to the error in log1pf() itself, the most straightforward approach for experimentation is to utilize a correctly rounded implementation of that logarithm function. Note that a correctly rounded implementation is highly unlikely to exist in the context of a high-performance environment. According to the works of J.-M. Muller et al., x86 extended-precision computation should be sufficient to produce accurate single-precision results, for example
float accurate_log1pf (float a)
{
    float res;
    __asm fldln2;                // push log(2), the scale factor expected by fyl2xp1
    __asm fld dword ptr [a];     // push the argument a
    __asm fyl2xp1;               // compute log(2) * log2(1+a) = log(1+a); pops a
    __asm fstp dword ptr [res];  // store result rounded to float and pop, leaving the x87 stack clean
    return res;
}
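A portable stand-in for testing, in case x87 inline assembly is not available, could simply evaluate the logarithm in double precision and round back. This is my own suggestion, not part of the original measurement setup; because of double rounding it is not guaranteed to be correctly rounded for every input, although with an accurate double-precision log1p the overwhelming majority of results will be.
#include <math.h>

/* evaluate log1p in double precision, then round to float; close to, but not
   guaranteed, correctly rounded due to the possibility of double rounding */
static float accurate_log1pf_portable (float a)
{
    return (float) log1p ((double) a);
}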
An implementation of asinhf() using the first variant from my question then looks as follows:
float my_asinhf (float a)
{
    float fa, t;
    fa = fabsf (a);
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
        t = accurate_log1pf (t);
    }
    return copysignf (t, a); // restore sign
}
Testing with all 2^32 IEEE-754 single-precision operands shows that the maximum error of 1.49486070 ulp occurs at ±0x1.ff5022p-9, and there are 353,521,140 incorrectly rounded results. What happens if the entire argument transformation uses double-precision arithmetic? The code changes to
float my_asinhf (float a)
{
    float fa, t;
    fa = fabsf (a);
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        double tt = fa;
        tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
        t = (float)tt;
        t = accurate_log1pf (t);
    }
    return copysignf (t, a); // restore sign
}
However, the error bound does not improve with this change! The maximum error of 1.49486070 ulp still occurs at ±0x1.ff5022p-9, and there are now 350,971,046 incorrectly rounded results, slightly fewer than before. The issue seems to be that a float operand cannot convey enough information to log1pf() to produce more accurate results. A similar problem occurs when computing sinf() and cosf(): if the reduced argument, represented as a correctly rounded float operand, is passed to the core polynomials, the resulting error in sinf() and cosf() is just a tad under 1.5 ulp, just as we are observing here with my_asinhf().
One solution is to compute the transformed argument to higher than single precision, for example as a double-float operand pair (a useful brief overview of double-float techniques can be found in this paper by Andrew Thall). In this case, we can use the additional information to perform linear interpolation on the result, based on the knowledge that the derivative of the logarithm is the reciprocal, i.e. log1p (t + s) ≈ log1p (t) + s / (1 + t) for a small correction term s. This gives us:
float my_asinhf (float a)
{
    float fa, s, t;
    fa = fabsf (a);
    if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
        t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
    } else {
        double tt = fa;
        tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
        t = (float)tt;                 // "head" of double-float
        s = (float)(tt - (double)t);   // "tail" of double-float
        t = fmaf (s, 1.0f / (1.0f + t), accurate_log1pf (t)); // interpolate
    }
    return copysignf (t, a); // restore sign
}
Exhaustive testing of this version indicates that the maximum error has been reduced to 0.99999948 ulp; it occurs at ±0x1.deeea0p-22. There are 349,653,534 incorrectly rounded results. A faithfully rounded implementation of asinhf() has been achieved, that is, every result is one of the two floating-point numbers bracketing the mathematical result.
Unfortunately, the practical utility of this result is limited. Depending on the hardware platform, the throughput of arithmetic operations on double may only be 1/2 to 1/32 of the throughput of float operations. The double-precision computation can be replaced with double-float computation, but this would incur a very significant cost as well. Lastly, my approach here was to use the single-precision implementation as a proving ground for subsequent double-precision work, and many hardware platforms (certainly all the ones I am interested in) do not offer hardware support for a numeric format with higher precision than IEEE-754 binary64 (double precision). Therefore any solution should not require higher-precision arithmetic in intermediate computation.
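For context, the basic error-free transformations from which a double-float ("float-float") version of the argument transformation would be built look roughly as follows. This is a generic sketch of the standard Knuth/Dekker building blocks of the kind described in the Thall paper, not code from any of the implementations measured above; the type and function names are placeholders of my own.
#include <math.h>

typedef struct { float hi, lo; } ff_t; // double-float: value = hi + lo, |lo| <= ulp(hi)/2

/* error-free addition (Knuth two-sum): hi + lo == a + b exactly */
static ff_t ff_two_sum (float a, float b)
{
    ff_t r;
    r.hi = a + b;
    float v = r.hi - a;
    r.lo = (a - (r.hi - v)) + (b - v);
    return r;
}

/* error-free multiplication using FMA: hi + lo == a * b exactly */
static ff_t ff_two_prod (float a, float b)
{
    ff_t r;
    r.hi = a * b;
    r.lo = fmaf (a, b, -r.hi);
    return r;
}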
Since all the troublesome arguments in the case of asinhf() are small in magnitude, one could [partially?] address the accuracy issue by using a polynomial minimax approximation for the region around the origin. As this would create another code branch, it would likely make vectorization more difficult.
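To illustrate the shape such a branch might take, here is a sketch of the polynomial evaluation for small |a|. The coefficients shown are simply the leading Taylor coefficients of asinh(x) = x − x³/6 + 3x⁵/40 − 5x⁷/112 + …, used as placeholders; an actual implementation would substitute minimax coefficients (generated, e.g., with a Remez fit) and would choose the switch-over threshold by exhaustive testing.
#include <math.h>

/* placeholder polynomial for small |a|; coefficients are Taylor, not minimax */
static float asinhf_poly_small (float a)
{
    float s = a * a;
    float p =        -0x1.6db6dcp-5f;  // ~ -5/112  (coefficient of x**7)
    p = fmaf (p, s,   0x1.333334p-4f); // ~  3/40   (coefficient of x**5)
    p = fmaf (p, s,  -0x1.555556p-3f); // ~ -1/6    (coefficient of x**3)
    return fmaf (p * s, a, a);         // a * (1 + s*p), an odd polynomial in a
}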