Python unicode string with UTF-8?
I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. As far as I understand, a unicode string in Python should have the actual character, not the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e., both are valid characters in their own right, u'\xc3\xb3' == Ã³, but they're not what's supposed to be there).
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
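a) Try to put it through the method below: if the string is not mis-decoded UTF-8, the encode or decode step will typically raise an error.
b) Encode to Latin-1 to get the raw bytes back, then decode them as UTF-8: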
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
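To cover (a) and (b) in one step, that round trip can be wrapped in a small helper. This is only a sketch: the name `fix_mojibake` is made up for illustration, and it is written for Python 3, where `str` plays the role of Python 2's `unicode`. It assumes the damage came from UTF-8 bytes being decoded as Latin-1.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Returns the repaired string, or the input unchanged if it does not
    look like mis-decoded UTF-8. (Hypothetical helper, not a library API.)
    """
    try:
        # Latin-1 maps code points U+0000..U+00FF one-to-one onto bytes,
        # so this recovers the original byte sequence.
        raw = text.encode("latin-1")
        # Reinterpret those bytes as the UTF-8 they were meant to be.
        return raw.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character above U+00FF, or bytes that are not valid UTF-8:
        # in both cases the string was not mis-decoded this way.
        return text


print(fix_mojibake(u"Sopet\xc3\xb3n") == u"Sopet\xf3n")  # True
print(fix_mojibake(u"Sopet\xf3n") == u"Sopet\xf3n")      # True (already correct, left alone)
```

Note that this is a heuristic. A string that really is meant to contain something like Ã³ would be "repaired" too, so it is safer to apply it only to data you already suspect is broken.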
You should use:
title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
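For comparison with the 'latin-1' approach, here is a short sketch (Python 3, variable names are just for illustration) of how the two codecs behave. For code points up to U+00FF they give the same bytes. They differ only on characters above that: 'latin-1' raises, while 'raw_unicode_escape' writes a literal `\uXXXX` escape into the output.

```python
# UTF-8 bytes of three Cyrillic letters, wrongly decoded as Latin-1.
garbled = u"\xd0\xbf\xd1\x80\xd0\xb8"

# Up to U+00FF both codecs yield identical bytes.
same_bytes = garbled.encode("raw_unicode_escape") == garbled.encode("latin-1")
repaired = garbled.encode("raw_unicode_escape").decode("utf-8")
print(same_bytes)                         # True
print(repaired == u"\u043f\u0440\u0438")  # True

# Above U+00FF the behaviour differs.
already_ok = u"\u043f"
escaped = already_ok.encode("raw_unicode_escape")  # literal backslash-u escape, no error
try:
    already_ok.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
print(latin1_failed)                      # True
```

So 'raw_unicode_escape' never fails on encode, but a string that was already correct comes out with escape text in it instead of triggering an error. If you want a loud failure for strings that are not mis-decoded, 'latin-1' is probably the better fit.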