How to iterate over Unicode characters in Python 3?

2018-06-19 04:32:08

I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:

str = "abc\u20ac\U00010302\U0010fffd"
for ch in str:
    code = ord(ch)
    print("U+{:04X}".format(code))

That prints:

U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD

when what I wanted was:

U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD

Is there any way to get Python to give me the sequence of Unicode code points, regardless of how the string is actually encoded under the hood? I'm testing on Windows here, but I need code that will work anywhere. It only needs to work on Python 3, I don't care about Python 2.x.

The best I've been able to come up with so far is this:

import codecs
str = "abc\u20ac\U00010302\U0010fffd"
bytestr, _ = codecs.getencoder("utf_32_be")(str)
for i in range(0, len(bytestr), 4):
    code = 0
    for b in bytestr[i:i + 4]:
        code = (code << 8) + b
    print("U+{:04X}".format(code))

But I'm hoping there's a simpler way.

(Pedantic nitpicking over precise Unicode terminology will be ruthlessly beaten over the head with a clue-by-four. I think I've made it clear what I'm after here, please don't waste space with "but UTF-16 is technically Unicode too" kind of arguments.)

On Python 3.2.1 with narrow Unicode build:

PythonWin 3.2.1 (default, Jul 10 2011, 21:51:15) [MSC v.1500 32 bit (Intel)] on win32.
Portions Copyright 1994-2008 Mark Hammond - see 'Help/About PythonWin' for further copyright information.
>>> import sys
>>> sys.maxunicode
65535

What you've discovered (UTF-16 encoding):

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
8
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...     
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD

A way around it:

>>> import struct
>>> s=s.encode('utf-32-be')
>>> struct.unpack('>{}L'.format(len(s)//4),s)
(97, 98, 99, 8364, 66306, 1114109)
>>> for i in struct.unpack('>{}L'.format(len(s)//4),s):
...     print('U+{:04X}'.format(i))
...     
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
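
The same workaround can be wrapped in a small generator so the calling loop stays simple. This is only a sketch: the name code_points is made up for illustration, and it assumes the whole string fits comfortably in memory, since it encodes everything up front.

```python
import struct

def code_points(text):
    """Yield the Unicode code points of text as integers (hypothetical helper)."""
    # UTF-32-BE uses exactly 4 bytes per code point and writes no BOM.
    data = text.encode('utf-32-be')
    count = len(data) // 4
    # '>L' is a big-endian unsigned 32-bit integer in struct's standard sizes.
    for cp in struct.unpack('>{}L'.format(count), data):
        yield cp

for cp in code_points("abc\u20ac\U00010302\U0010fffd"):
    print('U+{:04X}'.format(cp))
```

Because the result comes from the encoded bytes rather than from the internal storage, it should be the same on narrow and wide builds.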

Update for Python 3.3:

Now it works the way the OP expects:

>>> s = "abc\u20ac\U00010302\U0010fffd"
>>> len(s)
6
>>> for c in s:
...     print('U+{:04X}'.format(ord(c)))
...     
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
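
The change comes from PEP 393, which removed the narrow/wide build distinction in Python 3.3. If code still has to run on older interpreters, a quick check of sys.maxunicode tells you which case you are in. A minimal sketch (the variable names are illustrative only):

```python
import sys

s = "abc\u20ac\U00010302\U0010fffd"

# On Python 3.3+ sys.maxunicode is always 0x10FFFF; on an older narrow
# build it is 0xFFFF and iteration yields UTF-16 code units instead.
is_wide = sys.maxunicode > 0xFFFF

points = [ord(c) for c in s]
labels = ['U+{:04X}'.format(p) for p in points]
print(is_wide, labels)
```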

Before Python 3.3, default ("narrow") builds store Unicode values internally as 16-bit units (UCS2). The UTF-16 representation of the character \U00010302 is the surrogate pair \uD800\uDF02, which is why you got that result.
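
The surrogate values follow directly from the UTF-16 encoding rule for code points above U+FFFF: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A short sketch of that arithmetic (to_surrogates is a name invented for this example):

```python
def to_surrogates(cp):
    """Return the (high, low) UTF-16 surrogate pair for a code point above U+FFFF."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

for cp in (0x10302, 0x10FFFD):
    high, low = to_surrogates(cp)
    print('U+{:04X} -> U+{:04X} U+{:04X}'.format(cp, high, low))
```

This gives U+D800 U+DF02 and U+DBFF U+DFFD, the same four values the loop in the question printed.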

That said, some Python builds use UCS4 instead, but UCS2 and UCS4 builds are not binary-compatible with each other.

Take a look here.

Py_UNICODE This type represents the storage type which is used by Python internally as basis for holding Unicode ordinals. Python's default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4. On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).

If you create the string as a unicode object, iterating over it should give you one character at a time automatically. E.g.:

Python 2.6:

s = u"abc\u20ac\U00010302\U0010fffd"   # note u in front!
for c in s:
    print "U+%04x" % ord(c)

I received:

U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd

Python 3.2:

s = "abc\u20ac\U00010302\U0010fffd"
for c in s:
    print ("U+%04x" % ord(c))

It worked for me:

U+0061
U+0062
U+0063
U+20ac
U+10302
U+10fffd

Additionally, I found this link, which explains that this behavior is working as intended. If the string came from a file, etc., it will likely need to be decoded first.

Update :

I've found an insightful explanation here. The internal Unicode representation size is a compile-time option, and if you work with "wide" characters outside the 16-bit range you'll need to build Python yourself to remove the limitation, or use one of the workarounds on this page. Apparently many Linux distros already do this for you, which is why it worked for me above.
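
One more workaround that does not depend on how Python was built is to recombine surrogate pairs by hand while iterating. The following is a sketch rather than a tested library routine; iter_code_points is a hypothetical name, and unpaired surrogates are passed through unchanged as an arbitrary design choice.

```python
def iter_code_points(text):
    """Yield code points, joining UTF-16 surrogate pairs if present (hypothetical helper)."""
    pending = None                      # high surrogate waiting for its partner
    for ch in text:
        unit = ord(ch)
        if pending is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Valid pair: rebuild the supplementary-plane code point.
                yield 0x10000 + ((pending - 0xD800) << 10) + (unit - 0xDC00)
                pending = None
                continue
            yield pending               # lone high surrogate, emit as-is
            pending = None
        if 0xD800 <= unit <= 0xDBFF:
            pending = unit
        else:
            yield unit
    if pending is not None:
        yield pending                   # string ended on a high surrogate

for cp in iter_code_points("abc\u20ac\U00010302\U0010fffd"):
    print('U+{:04X}'.format(cp))
```

On a wide build, or on Python 3.3 and later, no surrogates appear during iteration, so the function simply yields ord() of each character.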
