What problems will one see in using Python multiprocessing naively?
We're considering re-factoring a large application with a complex GUI which is isolated in a decoupled fashion from the back-end, to use the new (Python 2.6) multiprocessing module. The GUI/backend interface uses Queues with Message objects exchanged in both directions.
One thing I've just concluded (tentatively, but feel free to confirm it) is that "object identity" would not be preserved across the multiprocessing interface. Currently when our GUI publishes a Message to the back-end, it expects to get the same Message back with a result attached as an attribute. It uses object identity ( if received_msg is message_i_sent:
) to identify returning messages in some cases... and that seems likely not to work with multiprocessing.
This question is to ask what "gotchas" like this you have seen in actual use or can imagine one would encounter in naively using the multiprocessing module, especially in refactoring an existing single-process application. Please specify whether your answer is based on actual experience. Bonus points for providing a usable workaround for the problem.
Edit: Although my intent with this question was to gather descriptions of problems in general, I think I made two mistakes: I made it community wiki from the start (which probably makes many people ignore it, as they won't get reputation points), and I included a too-specific example which -- while I appreciate the answers -- probably made many people miss the request for general responses. I'll probably re-word and re-ask this in a new question. For now I'm accepting one answer as best merely to close the question as far as it pertains to the specific example I included. Thanks to those who did answer!
I have not used multiprocessing itself, but the problems presented are similar to experience I've had in two other domains: distributed systems, and object databases. Python object identity can be a blessing and a curse!
As for general gotchas, it helps if the application you are refactoring can acknowledge that tasks are being handled asynchronously. If not, you will generally end up managing locks, and much of the performance you could have gained by using separate processes will be lost to waiting on those locks. I will also suggest that you spend the time to build some scaffolding for debugging across processes. Truly asynchronous processes tend to be doing much more than the mind can hold and verify -- or at least my mind!
For the specific case outlined, I would manage object identity at the process border when items queued and returned. When sending a task to be processed, annotate the task with an id(), and stash the task instance in a dictionary using the id() as the key. When the task is updated/completed, retrieve the exact task back by id() from the dictionary, and apply the newly updated state to it. Now the exact task, and therefore its identity, will be maintained.
Well, of course testing for identity on non-singleton object (es. "a is None" or "a is False") isn't usually a good practice - it might be quick, but a really-quick workaround would be to exchange the "is" for the "==" test and use an incremental counter to define identity:
# this is not threadsafe.
class Message(object):
def _next_id():
i = 0
while True:
i += 1
yield i
_idgen = _next_id()
del _next_id
def __init__(self):
self.id = self._idgen.next()
def __eq__(self, other):
return (self.__class__ == other.__class__) and (self.id == other.id)
This might be an idea.
Also, be aware that if you have tons of "worker processes", memory consumption might be far greater than with a thread-based approach.
You can try the persistent
package from my project GarlicSim. It's LGPL'ed.
http://github.com/cool-RR/GarlicSim/tree/development/garlicsim/garlicsim/misc/persistent/
(The main module in it is persistent.py
)
I often use it like this:
# ...
self.identity = Persistent()
Then I have an identity that is preserved across processes.
链接地址: http://www.djcxy.com/p/43746.html