How to get string objects instead of Unicode from JSON?

2018-06-30 13:14:03

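I'm using Python 2 to parse JSON from ASCII-encoded text files.

When loading these files with either json or simplejson, all my string values are converted to Unicode objects instead of string objects. The problem is that I have to use the data with some libraries that only accept string objects. I can't change the libraries or update them.

Is it possible to get string objects instead of Unicode ones?

Example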

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. A simple and clean solution today is to use a recent version of Python, namely Python 3 and later.
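
For comparison, here is a minimal sketch of the same round trip on Python 3. It assumes a Python 3 interpreter, where `str` is already the text type and `json.loads` returns `str` values directly; on Python 2 the items would still be `unicode`.

```python
import json

original_list = ['a', 'b']
json_list = json.dumps(original_list)

# On Python 3 the decoded items are plain str objects, so no conversion step is needed.
new_list = json.loads(json_list)
print(new_list)
print([type(item).__name__ for item in new_list])
```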

Solution with `object_hook`

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts = False):
    # if this is a unicode string, return its string representation
    if isinstance(data, unicode):
        return data.encode('utf-8')
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.iteritems()
        }
    # if it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

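How does this work and why would I use it?

Mark Amery's function is shorter and clearer than these, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer first decodes the JSON text fully into unicode strings, then recurses through the entire decoded value to convert every string to a byte string. This has a couple of undesirable effects:

A copy of the entire decoded structure gets created in memory

If your JSON object is really deeply nested (500 levels or more), you'll hit Python's maximum recursion depth

This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders

Since dictionaries nested many levels deep inside other dictionaries get passed to object_hook as they are decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.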
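
To illustrate the order in which the decoder hands dictionaries to the hook, here is a small sketch. The names `seen` and `recording_hook` are made up for this example and are not part of the answer's code; the hook only records what it receives and returns the dict unchanged.

```python
import json

seen = []  # key lists of each dict, in the order the decoder passes them to the hook

def recording_hook(decoded_dict):
    # Called once for every JSON object, as soon as that object is fully decoded.
    seen.append(sorted(decoded_dict.keys()))
    return decoded_dict

json.loads('{"outer": {"middle": {"inner": 1}}}', object_hook=recording_hook)

# On Python 3 this prints [['inner'], ['middle'], ['outer']]:
# the innermost dict reaches the hook first, the outermost last.
print(seen)
```

Because each dict arrives with its nested dicts already processed by the hook, the hook never has to descend into child dictionaries itself.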

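Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. In this answer we prevent that recursion with the ignore_dicts parameter of _byteify, which is passed as True at all times except when object_hook hands it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts, since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads, to handle the case where the JSON text being decoded doesn't have a dict at the top level.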
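
A short sketch of why that final call is needed: when the JSON text contains no object at all, the hook is never invoked, so nothing would be converted. The names `calls` and `counting_hook` are hypothetical and exist only for this illustration.

```python
import json

calls = []  # every dict the decoder passes to the hook

def counting_hook(decoded_dict):
    calls.append(decoded_dict)
    return decoded_dict

# Neither document contains a JSON object, so the hook is never called.
json.loads('["I am inside a list"]', object_hook=counting_hook)
json.loads('"I am a top-level string"', object_hook=counting_hook)
print(len(calls))  # 0

# As soon as an object appears anywhere in the text, the hook runs for it.
json.loads('[{"key": "value"}]', object_hook=counting_hook)
print(len(calls))  # 1
```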

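While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it returns the keys and values as str objects instead of unicode objects. Because JSON is a subset of YAML, it works nicely: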

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

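Notes

Some things to note, though:

I get string objects because all my entries are ASCII encoded. If any entry contained non-ASCII characters, I would get it back as a unicode object; there is no conversion!

You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.

If you want a YAML parser that has more support for version 1.2 of the spec (and correctly parses very low numbers), try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.

Conversion

As stated, there is no conversion! If you can't be sure that you only deal with ASCII values (and most of the time you can't), it is better to use a conversion function:

I have used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, which might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.

There is no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings: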

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

Just call this on the output you get from a json.load or json.loads call.
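
As a rough sketch of that call, the variant below repeats the same recursion as byteify but is adapted so that it also runs on Python 3, where encoding produces `bytes`. The names `text_type` and `byteify_portable` are introduced here for illustration and are not part of the original answer.

```python
import json

try:
    text_type = unicode  # Python 2: the text type that json.loads returns
except NameError:
    text_type = str      # Python 3: str is the text type; encoding it yields bytes

def byteify_portable(value):
    # Same recursion as byteify, using .items() so it works on both major versions.
    if isinstance(value, dict):
        return dict((byteify_portable(k), byteify_portable(v))
                    for k, v in value.items())
    elif isinstance(value, list):
        return [byteify_portable(element) for element in value]
    elif isinstance(value, text_type):
        return value.encode('utf-8')
    return value

# The JSON escape \u00e9 decodes to a non-ASCII character (e with acute accent).
decoded = json.loads('{"name": "caf\\u00e9", "tags": ["a", 1]}')
result = byteify_portable(decoded)
print(result)
```

The non-ASCII character ends up as the two UTF-8 bytes `\xc3\xa9`, while numbers and other non-string values pass through untouched.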

A couple of notes:

To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
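
The two spellings build the same dictionary, as this small sketch suggests. It uses `.items()` instead of `.iteritems()` so that it runs on any Python version from 2.7 on, and `convert` is a made-up stand-in for byteify that keeps the example self-contained.

```python
decoded = {"a": "x", "b": "y"}

def convert(value):
    # Hypothetical stand-in for byteify; it just upper-cases the string.
    return value.upper()

# Dictionary comprehension (Python 2.7 and later).
from_comprehension = {convert(k): convert(v) for k, v in decoded.items()}

# dict() over a list of (key, value) tuples, which also works on Python 2.6.
from_dict_call = dict([(convert(k), convert(v)) for k, v in decoded.items()])

print(from_comprehension == from_dict_call)  # True
```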

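Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence it is significantly more complicated than my approach.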
