Python, write json / dictionary objects to a file iteratively (one at a time)

2018-07-02 20:12:17

I have a large for loop in which I create JSON objects, and I would like to stream-write the object from each iteration to a file. I would like to be able to use the file later in a similar fashion (reading the objects one at a time). My JSON objects contain newlines, so I can't just dump each object as a line in a file. How can I achieve this?

To make it more concrete, consider the following:

for _id in collection:
    dict_obj = build_dict(_id)  # build a dictionary object 
    with open('file.json', 'a') as f:
        stream_dump(dict_obj, f)

stream_dump is the function that I want.

Note that I don't want to create a large list and dump the whole list using something like json.dump(obj, file). I want to be able to append each object to the file in its own iteration.

Thanks.

You need to work with a subclass of JSONEncoder and then proxy the build_dict function through a sequence, so that iterencode can emit the output chunk by chunk:

from __future__ import (absolute_import, division, print_function,)
#                        unicode_literals)

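import json

try:
    from collections.abc import Sequence  # Python 3.3+
except ImportError:
    from collections import Sequence  # Python 2


mycollection = [1, 2, 3, 4]


def build_dict(_id):
    d = dict()
    d['my_' + str(_id)] = _id
    return d


class SeqProxy(Sequence):
    """Sequence that applies func to each element of coll on access."""

    def __init__(self, func, coll):
        super(SeqProxy, self).__init__()

        self.func = func
        self.coll = coll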

    def __len__(self):
        return len(self.coll)

    def __getitem__(self, key):
        return self.func(self.coll[key])


class JsonEncoderProxy(json.JSONEncoder):
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return json.JSONEncoder.default(self, o)


jsonencoder = JsonEncoderProxy()
collproxy = SeqProxy(build_dict, mycollection)


for chunk in jsonencoder.iterencode(collproxy):
    print(chunk)

Output:

[
{
"my_1"
:
1
}
,
{
"my_2"
:
2
}
,
{
"my_3"
:
3
}
,
{
"my_4"
:
4
}
]

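To read it back object by object, use JSONDecoder and pass a callable as object_hook. The hook is called with each newly decoded object (each dict in your list) when you call JSONDecoder.decode(json_string), and its return value replaces that dict in the result. Note that decode still parses the whole string in a single call; the hook only lets you handle each dict as it is decoded.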
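As a minimal sketch of the object_hook approach, the snippet below collects every dict the decoder produces. The helper name read_with_hook and the sample text are assumptions made for illustration, not part of the answer above.

```python
import json


def read_with_hook(json_text):
    """Decode json_text and collect every dict the decoder produces."""
    seen = []

    def hook(obj):
        # Called once for each decoded JSON object (dict).
        seen.append(obj)
        return obj  # the return value replaces the dict in the result

    decoder = json.JSONDecoder(object_hook=hook)
    result = decoder.decode(json_text)
    return result, seen


# Hypothetical sample input: a list of small dicts, as in the output above.
text = '[{"my_1": 1}, {"my_2": 2}, {"my_3": 3}]'
result, seen = read_with_hook(text)
for d in seen:
    print(d)
```

Keep in mind that the hook also fires for nested dicts, so a document with nested objects would call it more often than once per top-level element.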

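Since you are generating the files yourself, you can simply write out one JSON object per line. Without an indent argument, json.dumps produces a single line, and newlines inside string values are escaped as \n, so this works even if your data contains newlines: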

import json

for _id in collection:
    dict_obj = build_dict(_id)  # build a dictionary object
    with open('file.json', 'a') as f:
        f.write(json.dumps(dict_obj))
        f.write('\n')  # one JSON document per line

And then read them in by iterating over lines:

with open('file.json', 'r') as f:
    for line in f:
        dict_obj = json.loads(line)

This isn't a great general solution, but it's a simple one if you are both the generator and consumer.
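The one-object-per-line idea can be wrapped into the stream_dump function the question asks for. This is only a sketch: stream_load, the temporary file path, and the sample records are assumptions introduced for the demonstration.

```python
import json
import os
import tempfile


def stream_dump(obj, f):
    """Append obj to the open text file f as one JSON document on one line."""
    # No indent argument, so json.dumps emits no literal newlines;
    # newlines inside string values are escaped as \n.
    f.write(json.dumps(obj))
    f.write("\n")


def stream_load(f):
    """Yield one decoded object per non-empty line of the open text file f."""
    for line in f:
        if line.strip():
            yield json.loads(line)


# Hypothetical demo data; the first record contains a newline in a value.
path = os.path.join(tempfile.mkdtemp(), "file.jsonl")
records = [{"id": 1, "text": "line one\nline two"}, {"id": 2, "text": "plain"}]

for rec in records:
    with open(path, "a") as f:  # append in each iteration, as in the question
        stream_dump(rec, f)

with open(path) as f:
    loaded = list(stream_load(f))
```

Because stream_load is a generator, the reader only holds one object in memory at a time unless the caller collects them, as the demo does with list().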

Simplest solution:

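Remove all whitespace characters from your JSON document. Be aware that this also strips whitespace inside string values, so it is only safe when your strings contain none:

import string

def remove_whitespaces(txt):
    """Return txt with every whitespace character removed."""
    for ch in string.whitespace:
        txt = txt.replace(ch, '')
    return txt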

You could also use json.dumps(json.loads(json_txt)), which additionally verifies that the text is valid JSON and leaves whitespace inside string values intact.

Now you can write your documents to a file, one per line.
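A short sketch of the re-serialization route follows. The function name to_single_line and the sample document are assumptions; the separators argument is optional and only makes the output more compact.

```python
import json


def to_single_line(json_txt):
    """Re-serialize a JSON text as a single compact line.

    Raises ValueError if json_txt is not valid JSON.
    """
    return json.dumps(json.loads(json_txt), separators=(",", ":"))


# Hypothetical pretty-printed input with whitespace inside a string value.
pretty = """{
    "name": "two words",
    "values": [1, 2, 3]
}"""

compact = to_single_line(pretty)
print(compact)
```

Unlike stripping whitespace characters directly, this keeps the space in "two words" while removing the layout whitespace between tokens.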

Second solution:

Create an in-memory stream (io.StringIO or io.BytesIO), write a valid document into it (with your documents as members of an enclosing object or list), and then write the stream's contents to a file (or upload them to the cloud).
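One way to read this suggestion, sketched under the assumption that the enclosing document is a JSON list: each object is dumped into an io.StringIO buffer between hand-written brackets and commas. The name build_document and the sample objects are hypothetical.

```python
import io
import json


def build_document(objects):
    """Write objects into an in-memory buffer as one valid JSON array."""
    buf = io.StringIO()
    buf.write("[")
    for i, obj in enumerate(objects):
        if i:
            buf.write(",")  # separator before every element but the first
        json.dump(obj, buf)  # serialize one object at a time
    buf.write("]")
    return buf


# Hypothetical input: a generator, so the objects are built one at a time.
buf = build_document({"my_%d" % i: i} for i in range(1, 5))
document = buf.getvalue()
print(document)

# The buffer's contents could then be written out in one step, e.g.:
# with open('file.json', 'w') as f:
#     f.write(document)
```

The serialized text still accumulates in memory, so this trades the large list of objects for one large string; it suits uploads better than very large files.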
