Python: yield

How do I yield an object from a generator and forget it immediately, so that it doesn't take up memory?

For example, in the following function:

import itertools

def grouper(iterable, chunksize):
    """
    Return elements from the iterable in `chunksize`-ed lists. The last returned
    element may be smaller (if length of collection is not divisible by `chunksize`).

    >>> print list(grouper(xrange(10), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    i = iter(iterable)
    while True:
        chunk = list(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk

I don't want the function to hold on to the reference to chunk after yielding it, as it is not used further and just consumes memory, even if all outside references are gone.


EDIT : using standard Python 2.5/2.6/2.7 from python.org.
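The retention described above can be made visible with a weak reference. What follows is only a sketch under some assumptions: it uses Python 3 syntax (the question targets Python 2.5/2.6/2.7), the `Chunk` subclass is a hypothetical helper that exists only because plain lists cannot be weakly referenced, and the prompt freeing it reports relies on CPython's reference counting.

```python
import itertools
import weakref


class Chunk(list):
    """Hypothetical helper: a list subclass, so that weakref can point at it."""


def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = Chunk(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk


gen = grouper(range(10), 3)
first = next(gen)
probe = weakref.ref(first)   # observes the object without keeping it alive

del first                    # the caller drops its only reference
still_alive = probe() is not None

second = next(gen)           # the generator rebinds its local `chunk`
freed_after_next = probe() is None
```

On CPython, `still_alive` comes out true even though the caller has deleted its only reference, and the first chunk is released only once the generator rebinds `chunk` on the next iteration.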


Solution (proposed almost simultaneously by @phihag and @Owen): wrap the result in a (small) mutable object and return the chunk anonymously, leaving only the small container behind:

import itertools

def chunker(iterable, chunksize):
    """
    Return elements from the iterable in `chunksize`-ed lists. The last returned
    chunk may be smaller (if length of collection is not divisible by `chunksize`).

    >>> print list(chunker(xrange(10), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    i = iter(iterable)
    while True:
        wrapped_chunk = [list(itertools.islice(i, int(chunksize)))]
        if not wrapped_chunk[0]:
            break
        yield wrapped_chunk.pop()

With this memory optimization, you can now do something like:

for big_chunk in chunker(some_generator, chunksize=10000):
    # ... process big_chunk
    del big_chunk  # big_chunk is ready to be garbage-collected :-)
    # ... do more stuff
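Whether the chunk really becomes collectable after `del big_chunk` can be checked with a weak reference. This is a hedged sketch rather than part of the original solution: it assumes CPython (reference counting) and Python 3 syntax, and `Chunk` is a hypothetical list subclass introduced only because plain lists do not support weak references.

```python
import itertools
import weakref


class Chunk(list):
    """Hypothetical helper: a list subclass that supports weak references."""


def chunker(iterable, chunksize):
    i = iter(iterable)
    while True:
        # The chunk lives only inside the one-element wrapper list.
        wrapped_chunk = [Chunk(itertools.islice(i, int(chunksize)))]
        if not wrapped_chunk[0]:
            break
        # pop() empties the wrapper, so the generator keeps no reference.
        yield wrapped_chunk.pop()


gen = chunker(range(10), 3)
big_chunk = next(gen)
probe = weakref.ref(big_chunk)

del big_chunk                # the generator is still suspended at the yield
freed = probe() is None

all_chunks = list(chunker(range(10), 3))
```

On CPython, `freed` is true while the generator is still suspended, which is the behaviour the original `grouper` lacked.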

If you really want this functionality, I suppose you could use a wrapper:

class Wrap:

    def __init__(self, val):
        self.val = val

    def unlink(self):
        val = self.val
        self.val = None
        return val

It could be used like this:

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = Wrap(list(itertools.islice(i, int(chunksize))))
        if not chunk.val:
            break
        yield chunk.unlink()

Which is essentially the same as what phihag does with pop() ;)


After yield chunk, the variable's value is never used again in the function, so a good interpreter/garbage collector will already free chunk for garbage collection (note: CPython 2.7 does not seem to do this; PyPy 1.6 with the default GC does). Therefore, you don't have to change anything except your code example, which is missing the second argument to grouper.

Note that garbage collection is non-deterministic in Python. The null garbage collector, which doesn't collect free objects at all, is a perfectly valid garbage collector. From the Python manual:

Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.

Therefore, it cannot be decided whether a Python program does or "doesn't take up memory" without specifying the Python implementation and garbage collector. Given a specific Python implementation and garbage collector, you can use the gc module to test whether the object is freed.
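One possible way to run such a test is to combine gc.collect() with a weak reference. The sketch below rests on assumptions: it uses Python 3 syntax, `Payload` and `holds_last_value` are hypothetical names invented for illustration, and the results described afterwards are those of CPython; other implementations may legitimately differ.

```python
import gc
import weakref


class Payload(list):
    """Hypothetical helper: a list subclass that supports weak references."""


def holds_last_value():
    value = Payload([1, 2, 3])
    yield value              # the local name `value` stays bound while suspended


gen = holds_last_value()
obj = next(gen)
probe = weakref.ref(obj)

del obj
gc.collect()                 # give a tracing collector the chance to run
alive_while_suspended = probe() is not None

gen.close()                  # finishes the generator and clears its locals
gc.collect()
alive_after_close = probe() is not None
```

On CPython the object survives the first collection, because the suspended generator still refers to it, and is gone after the generator has been closed.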

That being said, if you really want no reference from the function (not necessarily meaning the object will be garbage-collected), here's how to do it:

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        tmpr = [list(itertools.islice(i, int(chunksize)))]
        if not tmpr[0]:
            break
        yield tmpr.pop()

Instead of a list, you can also use any other data structure with a method that removes and returns an object, like Owen's wrapper.
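As an illustration of that remark, here is a variant that uses a dict and its pop() method as the container. It is a sketch only: the name `grouper_with_dict` and the key `'chunk'` are made up for this example, and the code uses Python 3 syntax.

```python
import itertools


def grouper_with_dict(iterable, chunksize):
    i = iter(iterable)
    holder = {}
    while True:
        holder['chunk'] = list(itertools.islice(i, int(chunksize)))
        if not holder['chunk']:
            break
        # dict.pop() removes the entry and returns its value, so the
        # generator keeps only the (now empty) dict after yielding.
        yield holder.pop('chunk')


result = list(grouper_with_dict(range(10), 3))
```

The yielded chunks are the same as with the list-based version; only the small container left behind in the generator differs.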


@Radim,

Several points in this thread perplexed me. I realize that I had failed to understand the basic issue: what your problem actually was.

I think I understand it now, and I would like you to confirm.

I'll represent your code like this:

import itertools

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = list(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk

# ...
gigi = grouper(an_iterable, 4)
# before A
A = gigi.next()
# after A
# ...
# deducing an object x from A; x doesn't consume a lot of memory
# ...
# deleting A because it consumes a lot of memory:
del A
# code carries on, taking time to execute
# ...
# before B
B = gigi.next()
# after B
# ...

Is your problem that, even during the time that elapses between
# code carries on, taking time to execute (after the deletion of A)
and
# before B,
the object formerly bound to the name 'A' still exists and consumes a lot of memory, because it is still bound to the identifier 'chunk' inside the generator function?

Excuse me for asking about a point that now seems evident to me.
However, since there was some confusion in the thread at one point, I'd like you to confirm that I have now understood your problem correctly.

.

@phihag

You wrote in a comment:

1)
After the yield chunk , there is no way to access the value stored in chunk from this function. Therefore, this function does not hold any references to the object in question

(By the way, I would have written 'because' rather than 'therefore'.)

I think that statement #1 is debatable.
In fact, I'm convinced it is false. But there is a subtlety in what you claim, not in this quotation alone but overall, once we also take into account what you say at the beginning of your answer.

Let us take things in order.

The following code seems to disprove your statement "After the yield chunk, there is no way to access the value stored in chunk from this function."

import itertools

def grouper(iterable, chunksize):
    i = iter(iterable)
    chunk = ''
    last = ''
    while True:
        print 'new turn   ',id(chunk)
        if chunk:
            last = chunk[-1]
        chunk = list(itertools.islice(i, int(chunksize)))
        print 'new chunk  ',id(chunk),'  len of chunk :',len(chunk)
        if not chunk:
            break
        yield '%s  -  %s' % (last,' , '.join(chunk))
        print 'end of turn',id(chunk),'\n'


for x in grouper(['1','2','3','4','5','6','7','8','9','10','11'],'4'):
    print repr(x)

Result:

new turn    10699768
new chunk   18747064   len of chunk : 4
'  -  1 , 2 , 3 , 4'
end of turn 18747064 

new turn    18747064
new chunk   18777312   len of chunk : 4
'4  -  5 , 6 , 7 , 8'
end of turn 18777312 

new turn    18777312
new chunk   18776952   len of chunk : 3
'8  -  9 , 10 , 11'
end of turn 18776952 

new turn    18776952
new chunk   18777512   len of chunk : 0

.

However, you also wrote (it's the beginning of your answer):

2)
After yield chunk, the variable's value is never used again in the function, so a good interpreter/garbage collector will already free chunk for garbage collection (note: CPython 2.7 does not seem to do this; PyPy 1.6 with the default GC does).

This time you don't say that the function no longer holds a reference to chunk after yield chunk; you say that its value is not used again before chunk is rebound in the next turn of the while loop. That is correct: in Radim's code, the object chunk isn't used again before the identifier 'chunk' is reassigned by the instruction chunk = list(itertools.islice(i, int(chunksize))) in the next turn of the loop.

.

Statement #2, which differs from statement #1 above, has two logical consequences:

FIRST, my code above cannot strictly prove, to someone who thinks as you do, that there is indeed a way to access the value of chunk after the yield chunk instruction.
That is because the conditions in my code are not the ones under which you assert the contrary: in Radim's code, which is the code you are talking about, the object chunk really is not used again before the next turn.
One can then argue that a reference to the object chunk is still held after the yield chunk precisely because my code uses chunk (the instructions print 'end of turn',id(chunk),'\n' , print 'new turn   ',id(chunk) and last = chunk[-1] do use it).

SECONDLY, taking the reasoning further, putting your two quotations together leads to the conclusion that, in your view, no reference to chunk is kept in Radim's code because chunk is no longer used after the yield chunk instruction.
It's a matter of logic, IMO: the absence of references to an object is the condition for freeing it. Hence, if you claim that the object's memory is freed because the object is no longer used, that is equivalent to claiming that its disuse causes the interpreter to delete the reference to it in the function.

To sum up:
you claim that in Radim's code chunk is no longer used after yield chunk, so no reference to it is held any more, and so... CPython 2.7 does not free it, but PyPy 1.6 with the default GC does free the memory of the object chunk.

At this point, I'm very surprised by the reasoning behind this consequence: PyPy 1.6 would free chunk because chunk is no longer used. You don't state this reasoning explicitly, but without it I find what you claim in the two quotations illogical and incomprehensible.

What perplexes me in this conclusion, and the reason I disagree with it, is that it implies that PyPy 1.6 is able to analyze the code and detect that chunk won't be used again after yield chunk. I find this idea completely unbelievable, and I would like you:

  • to explain exactly what you think about all this. Where am I misunderstanding your ideas?

  • to say whether you have proof that PyPy 1.6, at least, doesn't hold a reference to chunk when it is no longer used.
    The problem with Radim's initial code was that too much memory was consumed because the object chunk persisted, its reference still being held inside the generator function: the memory use was an indirect symptom of that persistent reference.
    Have you observed similar behavior with PyPy 1.6? I don't see another way to reveal the reference remaining inside the generator, since, according to your quotation #2, any use of chunk after yield chunk is enough to keep a reference to it alive. It's a problem similar to one in quantum mechanics: measuring the speed of a particle modifies its speed...
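One way to reveal the binding from outside, without adding any use of chunk to the generator body, might be to inspect the frame of the suspended generator. This is a minimal sketch assuming CPython and Python 3 syntax; gi_frame and f_locals are introspection attributes, and other implementations may not expose them in the same way.

```python
import itertools


def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = list(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk


gen = grouper(range(10), 3)
first = next(gen)

# Locals of the suspended generator frame, read from outside the generator.
frame_locals = gen.gi_frame.f_locals
same_object = frame_locals['chunk'] is first
print(same_object)
```

On CPython this prints True: the name 'chunk' in the suspended frame is still bound to the very object that was yielded. Note that reading it this way creates a reference of its own, so the technique shows that the binding exists rather than measuring when the memory is released.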
