Python: sort function breaks in the presence of nan
sorted([2, float('nan'), 1])
returns [2, nan, 1]
(At least on Activestate Python 3.1 implementation.)
I understand nan
is a weird object, so I wouldn't be surprised if it shows up in random places in the sort result. But it also messes up the sort for the non-nan numbers in the container, which is really unexpected.
I asked a related question about max
, and based on that I understand why sort
works like this. But should this be considered a bug?
Documentation just says "Return a new sorted list [...]" without specifying any details.
EDIT: I now agree that this isn't in violation of the IEEE standard. However, it's a bug from any common sense viewpoint, I think. Even Microsoft, which isn't known to admit their mistakes often, has recognized this one as a bug, and fixed it in the latest version: http://connect.microsoft.com/VisualStudio/feedback/details/363379/bug-in-list-double-sort-in-list-which-contains-double-nan.
Anyway, I ended up following @khachik's answer:
sorted(list_, key = lambda x : float('-inf') if math.isnan(x) else x)
I suspect it results in a performance hit compared to the language doing that by default, but at least it works (barring any bugs that I introduced).
The previous answers are useful, but perhaps not clear regarding the root of the problem.
In any language, sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values. For example, less-than, aka operator <,
could be used throughout if and only if less than defines a suitable ordering over the input values.
But this is specifically NOT true for floating point values and less-than: "NaN is unordered: it is not equal to, greater than, or less than anything, including itself." ( Clear prose from GNU C manual, but applies to all modern IEEE754
based floating point )
So the possible solutions are:
Either approach can be used, in any language.
Practically, considering python, I would prefer to remove the NaNs if you either don't care much about fastest performance or if removing NaNs is a desired behavior in context.
Otherwise you could use a suitable predicate function via "cmp" in older python versions, or via this and functools.cmp_to_key()
. The latter is a bit more awkward, naturally, than removing the NaNs first. And care will be required to avoid worse performance, when defining this predicate function.
The problem is that there's no correct order if the list contains a NAN, since a sequence a1, a2, a3, ..., an is sorted if a1 <= a2 <= a3 <= ... <= an. If any of these a values is a NAN then the sorted property breaks, since for all a, a <= NAN and NAN <= a are both false.
I'm not sure about the bug, but the workaround may be the following:
sorted(
(2, 1, float('nan')),
lambda x,y: x is float('nan') and -1
or (y is float('nan') and 1
or cmp(x,y)))
which results in:
('nan', 1, 2)
Or remove nan
s before sorting or anything else.
上一篇: 比较包含NaN的列表