How do I do a case
How can I do case insensitive string comparison in Python?
I would like to encapsulate comparison of a regular strings to a repository string using in a very simple and Pythonic way. I also would like to have ability to look up values in a dict hashed by strings using regular python strings.
假设ASCII字符串:
string1 = 'Hello'
string2 = 'hello'
if string1.lower() == string2.lower():
print "The strings are the same (case insensitive)"
else:
print "The strings are not the same (case insensitive)"
Comparing string in a case insensitive way seems like something that's trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.
The first thing to note it that case-removing conversions in unicode aren't trivial. There is text for which text.lower() != text.upper().lower()
, such as "ß"
:
"ß".lower()
#>>> 'ß'
"ß".upper().lower()
#>>> 'ss'
But let's say you wanted to caselessly compare "BUSSE"
and "Buße"
. Heck, you probably also want to compare "BUSSE"
and "BUẞE"
equal - that's the newer capital form. The recommended way is to use casefold
:
help(str.casefold)
#>>> Help on method_descriptor:
#>>>
#>>> casefold(...)
#>>> S.casefold() -> str
#>>>
#>>> Return a version of S suitable for caseless comparisons.
#>>>
Do not just use lower
. If casefold
is not available, doing .upper().lower()
helps (but only somewhat).
Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê"
- but it doesn't:
"ê" == "ê"
#>>> False
This is because they are actually
import unicodedata
[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']
[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
The simplest way to deal with this is unicodedata.normalize
. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does
unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
#>>> True
To finish up, here this is expressed in functions:
import unicodedata
def normalize_caseless(text):
return unicodedata.normalize("NFKD", text.casefold())
def caseless_equal(left, right):
return normalize_caseless(left) == normalize_caseless(right)
Using Python 2, calling .lower()
on each string or Unicode object...
string1.lower() == string2.lower()
...will work most of the time, but indeed doesn't work in the situations @tchrist has described.
Assume we have a file called unicode.txt
containing the two strings Σίσυφος
and ΣΊΣΥΦΟΣ
. With Python 2:
>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'xcexa3xcexafxcfx83xcfx85xcfx86xcexbfxcfx82nxcexa3xcex8axcexa3xcexa5xcexa6xcex9fxcexa3n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True
The Σ character has two lowercase forms, ς and σ, and .lower()
won't help compare them case-insensitively.
However, as of Python 3, all three forms will resolve to ς, and calling lower() on both strings will work correctly:
>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True
So if you care about edge-cases like the three sigmas in Greek, use Python 3.
(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
链接地址: http://www.djcxy.com/p/58528.html上一篇: 快速算法
下一篇: 我该如何做案件