Saving UTF-8 texts with json.dumps as UTF-8, not as \u escape sequences

sample code:

>>> import json
>>> json_string = json.dumps("ברי צקלה")
>>> print json_string
"u05d1u05e8u05d9 u05e6u05e7u05dcu05d4"

The problem: it's not human readable. My (smart) users want to verify or even edit text files with JSON dumps (and I'd rather not use XML).

Is there a way to serialize objects into UTF-8 JSON strings (instead of \uXXXX escapes)?

This doesn't help:

>>> output = json_string.decode('string-escape')
"u05d1u05e8u05d9 u05e6u05e7u05dcu05d4"

This works, but if any sub-object is a Python unicode string and not UTF-8, it'll dump garbage:

>>> #### ok:
>>> s= json.dumps( "ברי צקלה", ensure_ascii=False)    
>>> print json.loads(s)   
ברי צקלה

>>> #### NOT ok:
>>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
>>> print d
{1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 
 2: u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'}
>>> s = json.dumps( d, ensure_ascii=False, encoding='utf8')
>>> print json.loads(s)['1']
ברי צקלה
>>> print json.loads(s)['2']
××¨× ×¦×§××

I searched the json.dumps documentation but couldn't find anything useful.

Edit - Solution(?):

I'll try to sum up the comments and answers by Martijn Pieters:

(edit: 2nd thought after @Sebastian's comment and about a year later)

  • there might be no built-in solution in json.dumps.

  • I'll have to convert all strings in the object to Unicode before it's JSON-ed. I'll use Mark's function, which recursively converts strings in a nested object.

  • The example I gave depends too much on my computer & IDE environment, and doesn't run the same on all computers.

  • Thank you everybody :)
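
Mark's actual function isn't reproduced here; as a rough sketch of the idea (written in Python 3 terms, where the stray byte strings would be `bytes` — the name `to_unicode` is mine, not from the original answer):

```python
def to_unicode(obj, encoding="utf-8"):
    """Recursively decode any byte strings in a nested structure to text."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {to_unicode(k, encoding): to_unicode(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_unicode(x, encoding) for x in obj)
    return obj

mixed = {1: "ברי צקלה".encode("utf-8"), 2: "ברי צקלה"}
print(to_unicode(mixed))  # both values are now text strings
```

Once everything is text, json.dumps with ensure_ascii=False produces readable output for the whole structure.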


    Use the ensure_ascii=False switch to json.dumps(), then encode the value to UTF-8 manually:

    >>> json_string = json.dumps(u"ברי צקלה", ensure_ascii=False).encode('utf8')
    >>> json_string
    '"\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94"'
    >>> print json_string
    "ברי צקלה"
    

    If you are writing this to a file, you can use io.open() instead of open() to produce a file object that encodes Unicode values for you as you write, then use json.dump() instead to write to that file:

    with io.open('filename', 'w', encoding='utf8') as json_file:
        json.dump(u"ברי צקלה", json_file, ensure_ascii=False)
    

    In Python 3, the built-in open() is an alias for io.open(). Do note that there is a bug in the json module where the ensure_ascii=False flag can produce a mix of unicode and str objects. The workaround for Python 2 then is:

    with io.open('filename', 'w', encoding='utf8') as json_file:
        data = json.dumps(u"ברי צקלה", ensure_ascii=False)
        # unicode(data) auto-decodes data to unicode if str
        json_file.write(unicode(data))
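
In Python 3 none of this juggling is needed: the built-in open() takes an encoding argument and json.dump() writes str directly. A sketch (using a temporary file rather than a fixed filename):

```python
import json
import os
import tempfile

data = {"name": "ברי צקלה"}

# mkstemp gives us a real path to write to; any filename works the same way
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)

with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

with open(path, "rb") as f:
    raw = f.read()
os.remove(path)

print(raw.decode("utf-8"))  # the Hebrew appears literally, not as \uXXXX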
    

    If you are passing in byte strings (type str in Python 2, bytes in Python 3) encoded to UTF-8, make sure to also set the encoding keyword:

    >>> d={ 1: "ברי צקלה", 2: u"ברי צקלה" }
    >>> d
    {1: '\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94', 2: u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'}
    
    >>> s=json.dumps(d, ensure_ascii=False, encoding='utf8')
    >>> s
    u'{"1": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4", "2": "\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4"}'
    >>> json.loads(s)['1']
    u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
    >>> json.loads(s)['2']
    u'\u05d1\u05e8\u05d9 \u05e6\u05e7\u05dc\u05d4'
    >>> print json.loads(s)['1']
    ברי צקלה
    >>> print json.loads(s)['2']
    ברי צקלה
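
Note that the encoding keyword is Python 2 only: it was removed in Python 3, where json.dumps() rejects bytes values outright. A sketch of the Python 3 equivalent — decode any byte strings first, then dump with ensure_ascii=False:

```python
import json

d = {1: "ברי צקלה".encode("utf-8"), 2: "ברי צקלה"}  # bytes and str, as above

# Decode byte-string values to text before serializing
clean = {k: v.decode("utf-8") if isinstance(v, bytes) else v
         for k, v in d.items()}

s = json.dumps(clean, ensure_ascii=False)
print(s)  # {"1": "ברי צקלה", "2": "ברי צקלה"}
```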
    

    Note that your second sample is not valid Unicode; you gave it UTF-8 bytes as a unicode literal, and that could never work:

    >>> s = u'\xd7\x91\xd7\xa8\xd7\x99 \xd7\xa6\xd7\xa7\xd7\x9c\xd7\x94'
    >>> print s
    ××¨× ×¦×§××
    >>> print s.encode('latin1').decode('utf8')
    ברי צקלה
    

    Only when I encode that string to Latin-1 (whose Unicode codepoints map one-to-one to bytes) and then decode it as UTF-8 do you see the expected output. That has nothing to do with JSON and everything to do with using the wrong input. The result is called a Mojibake.

    If you got that Unicode value from a string literal, it was decoded using the wrong codec. Your terminal could be misconfigured, or your text editor saved your source code with a different codec than the one you told Python to read the file with. Or you sourced it from a library that applied the wrong codec. This all has nothing to do with the JSON library.
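
The same Mojibake, and its repair, can be reproduced deliberately (Python 3 shown; the mechanics are identical in Python 2):

```python
good = "ברי צקלה"

# Mis-decode: interpret the UTF-8 bytes as Latin-1 — visually garbled text
mojibake = good.encode("utf-8").decode("latin1")

# Repair: Latin-1 maps codepoints 0-255 back to the exact same bytes,
# so encoding to Latin-1 recovers the original UTF-8 byte sequence
repaired = mojibake.encode("latin1").decode("utf-8")
assert repaired == good
```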


    Easy as pie:

    To write to a file

    import codecs
    import json
    
    with codecs.open('your_file.txt', 'w', encoding='utf-8') as f:
        json.dump({"message":"xin chào việt nam"}, f, ensure_ascii=False)
    

    To print to stdout

    import json
    print(json.dumps({"message":"xin chào việt nam"}, ensure_ascii=False))
    

    UPDATE: This is a wrong answer, but it's still useful to understand why it's wrong. See the comments.

    How about unicode-escape?

    >>> d = {1: "ברי צקלה", 2: u"ברי צקלה"}
    >>> json_str = json.dumps(d).decode('unicode-escape').encode('utf8')
    >>> print json_str
    {"1": "ברי צקלה", "2": "ברי צקלה"}
    