What does sys.maxunicode mean?
CPython stores Unicode strings internally as either UTF-16 or UTF-32, depending on compile options. On UTF-16 ("narrow") builds, string slicing, iteration, and len appear to work on code units, not code points, so characters outside the Basic Multilingual Plane behave strangely.
For example, on CPython 2.6 with sys.maxunicode == 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
 According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."  
Does this mean that Unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I have to use a UTF-32 build or write my own portable Unicode operations?
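For what it's worth, which kind of build you have can be read straight off sys.maxunicode (this check is just my own illustration, nothing official):
>>> import sys
>>> sys.maxunicode               # 65535 (0xFFFF) on a narrow build
65535
>>> sys.maxunicode == 0x10FFFF   # True only on a wide (UTF-32) build
False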
I came across this problem in "How to iterate over Unicode characters in Python 3?"
Characters beyond sys.maxunicode == 65535 are stored internally using UTF-16 surrogate pairs. Yes, you have to deal with this yourself or use a wide build (a sketch of handling surrogates by hand is at the end of this answer). Even with a wide build, you may also have to deal with single characters represented by a combination of code points. For example:
>>> print('a\u0301')
á
>>> print('\xe1')
á
The first uses a combining accent character and the second doesn't, yet both print the same. You can use unicodedata.normalize to convert between the forms.
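For instance, a quick round trip in the interpreter (NFC composes, NFD decomposes; standard library only):
>>> import unicodedata
>>> unicodedata.normalize('NFC', 'a\u0301') == '\xe1'
True
>>> unicodedata.normalize('NFD', '\xe1') == 'a\u0301'
True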
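As for dealing with surrogates yourself on a narrow build, here is a minimal sketch of pairing them up by hand. The function name code_points is my own invention, not a standard API; it yields one integer code point per character, and degrades to a plain ord per character on wide builds and Python 3.3+:
def code_points(s):
    # Iterate over code points rather than code units by combining
    # UTF-16 surrogate pairs on narrow builds.
    i = 0
    while i < len(s):
        unit = ord(s[i])
        # High surrogate followed by a low surrogate: combine them.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(s):
            low = ord(s[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
                continue
        yield unit
        i += 1
>>> [hex(cp) for cp in code_points(u'\U0001D49E')]
['0x1d49e']
On a narrow build, len(u'\U0001D49E') is still 2, but code_points yields a single value for it.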
