[Python-talk] unicode handling in older Python versions
Arc Riley
arcriley at gmail.com
Sat Oct 3 09:01:25 EDT 2009
Thanks Kent, it looks like OSX is broken in a different way for plane 1
Unicode.
My guess is that it's storing strings internally as UCS-2 and can thus only
handle code points up to \uffff, whereas these glyphs lie above that range and
require 4 bytes to encode properly. To complete the test, the same should be
done with a plane 0 glyph such as \uD000 (Hangul) and compared against
\U00010000 (Linear B), to verify that 16-bit Unicode encodes properly while
32-bit Unicode does not.
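
A minimal sketch of that comparison, assuming Python 3 syntax. The helper name describe() is made up for illustration; sys.maxunicode is the standard way to tell a 16-bit ("narrow") build from a 32-bit one:

```python
import sys

def describe(ch):
    """Return (len() of the string, UTF-8 byte count, UTF-16 byte count)."""
    return len(ch), len(ch.encode('utf-8')), len(ch.encode('utf-16-le'))

bmp = '\ud000'          # Hangul syllable, fits in 16 bits
astral = '\U00010000'   # Linear B syllable, needs more than 16 bits

# A narrow build reports 0xFFFF here; a wide build reports 0x10FFFF.
narrow = sys.maxunicode == 0xFFFF

print('narrow build:', narrow)
print('U+D000 :', describe(bmp))
print('U+10000:', describe(astral))
```

On a narrow build the first number for U+10000 should come out as 2 rather than 1, while the byte counts stay the same either way.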
In Python 3.0.1, 32-bit Unicode is supported until you try to index, slice, or
iterate over the string:
>>> a='\ud000'
>>> a[0]
'퀀'
>>> b='\U00010000'
>>> b[0]
'\ud800'
>>> b[1]
'\udc00'
>>> len(b)
2
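
The two values that b splits into are the UTF-16 surrogate pair for U+10000. As a rough sketch of the arithmetic (to_surrogates and from_surrogates are illustrative names, not anything in the standard library):

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('code point is not above U+FFFF')
    offset = cp - 0x10000                 # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x10000)
print(hex(high), hex(low))   # 0xd800 0xdc00, matching b[0] and b[1] above
```

Code that indexes or iterates on a build like this one would need to recombine the pair itself before treating it as a single character.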
Do you want to get your name on the ticket by adding the debug output for OSX
yourself?
http://bugs.python.org/issue7045
On Sat, Oct 3, 2009 at 7:31 AM, Kent Johnson <kent37 at tds.net> wrote:
> On Mac OSX:
>
> $ python2.4
> Python 2.4.4 (#1, Oct 18 2006, 10:34:39)
> [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>
> In [1]: line = u'𐑑𐑧𐑕𐑑𐑦𐑙'
>
> In [2]: first = u'𐑑'
>
> In [3]: first
> Out[3]: u'\xf0\x90\x91\x91'
>
> and the same with
> Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39)
> [GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
>
> Kent
>
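
For reference, the four escapes in Kent's Out[3] are exactly the UTF-8 bytes of the first Shavian letter (U+10451), each showing up as its own character. A small sketch that reproduces the value, assuming Python 3. The Latin-1 decode is only a stand-in for one character per byte, not a claim about what the interpreter actually did on OSX:

```python
shavian_tot = '\U00010451'                  # first letter of the test line
utf8_bytes = shavian_tot.encode('utf-8')    # the 4-byte UTF-8 encoding
as_latin1 = utf8_bytes.decode('latin-1')    # one character per byte

print(utf8_bytes)          # b'\xf0\x90\x91\x91'
print(ascii(as_latin1))    # '\xf0\x90\x91\x91'
print(len(as_latin1))      # 4
```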