Download files from URLs in parallel in Python
I have some links in a database which I want to download in parallel. Downloading them serially took too much time; there are around 1877 links.
I tried this code for running the downloads in parallel, but it throws an error: failed: 'tuple' object has no attribute 'read'
#!/usr/bin/env python
import urllib
from stream import ThreadPool

URLs = [
    'http://www.cnn.com/',
    'http://www.bbc.co.uk/',
    'http://www.economist.com/',
    'http://nonexistant.website.at.baddomain/',
    'http://slashdot.org/',
    'http://reddit.com/',
    'http://news.ycombinator.com/'
]

def retrieve(urls):
    for url in urls:
        print url, ' '
        res = urllib.urlretrieve(url).read()
        yield url, res

if __name__ == '__main__':
    retrieved = URLs >> ThreadPool(retrieve, poolsize=7)
    for url, content in retrieved:
        print '%r is %d bytes' % (url, len(content))
    for url, exception in retrieved.failure:
        print '%r failed: %s' % (url, exception)
I tried this as well:
import urllib
import tldextract
from multiprocessing.pool import ThreadPool

URLs = [
    'http://www.cnn.com/',
    'http://www.bbc.co.uk/',
    'http://www.economist.com/',
    'http://nonexistant.website.at.baddomain/',
    'http://slashdot.org/',
    'http://reddit.com/',
    'http://news.ycombinator.com/'
]

def dwld(url):
    print url
    res = urllib.urlopen(url).read()
    filename = tldextract.extract(url)
    with open(filename.domain, 'wb') as fh:
        fh.write(res)
    return url

pool = ThreadPool(processes=4)
pool.map(dwld, URLs)
This gives me:

Traceback (most recent call last):
  File "dwld_thread.py", line 26, in <module>
    pool.map(dwld, URLs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known
I have no idea what that stream.ThreadPool is that you're using, or what its API is… but the problem is obvious:
res = urllib.urlretrieve(url).read()
If you look at the doc for urlretrieve:
Return a tuple (filename, headers) where filename is the local file name under which the object can be found…
You obviously can't call read on that. If you want to download to a local file using this legacy API, and then read that file, you can:
filename, headers = urllib.urlretrieve(url)
with open(filename) as f:
    res = f.read()
But why? Just use urllib2.urlopen, which "returns a file-like object with two additional methods", so you can just call read on it, you won't be creating a temporary file, and you're not using an old function that wasn't quite designed right and that nobody has maintained in years.
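For illustration, a minimal sketch of that on its own (the URL is just the first one from the list above):

import urllib2

response = urllib2.urlopen('http://www.cnn.com/')  # file-like object, no temporary file on disk
content = response.read()
print '%r is %d bytes' % ('http://www.cnn.com/', len(content))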
But Python has a nice ThreadPoolExecutor built into the standard library. And if you look at the very first example they show you, it's exactly what you're trying to do.
Unfortunately, you're using Python 2.x, which doesn't have the concurrent.futures module. Fortunately, there is a backport on PyPI that works with 2.5+.
Python also has multiprocessing.dummy.Pool (also available under the undocumented, but probably more readable, name multiprocessing.ThreadPool); a rough sketch using it follows the futures example below. But if you're willing to go outside the stdlib for some module that you apparently aren't sure how to use and that I've never heard of, I'm guessing you won't have any problem using futures. So:
import futures
import urllib2

URLs = [
    'http://www.cnn.com/',
    'http://www.bbc.co.uk/',
    'http://www.economist.com/',
    'http://nonexistant.website.at.baddomain/',
    'http://slashdot.org/',
    'http://reddit.com/',
    'http://news.ycombinator.com/'
]

def load_url(url):
    return urllib2.urlopen(url).read()

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=7) as executor:
        fmap = dict((executor.submit(load_url, url), url) for url in URLs)
        for f in futures.as_completed(fmap):
            url = fmap[f]
            try:
                content = f.result()
            except Exception as exception:
                print '%r failed: %s' % (url, exception)
            else:
                print '%r is %d bytes' % (url, len(content))
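For comparison, here is a minimal sketch of the multiprocessing.dummy.Pool alternative mentioned above. It reuses URLs and load_url from the futures example and returns (url, result-or-exception) pairs, so one bad domain doesn't abort the whole map the way it did in the question's traceback:

from multiprocessing.dummy import Pool  # thread-backed Pool from the stdlib

def safe_load(url):
    # Catch per-URL errors so a single failure doesn't get re-raised by pool.map.
    try:
        return url, load_url(url)
    except Exception as exc:
        return url, exc

if __name__ == '__main__':
    pool = Pool(7)  # 7 worker threads, matching max_workers above
    for url, result in pool.map(safe_load, URLs):
        if isinstance(result, Exception):
            print '%r failed: %s' % (url, result)
        else:
            print '%r is %d bytes' % (url, len(result))
    pool.close()
    pool.join()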
urllib.urlretrieve(url).read()
should be urllib.urlopen(url).read()
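Applied to the first snippet in the question, that one-line change would look roughly like this (a sketch; the rest of the stream.ThreadPool pipeline stays as in the question):

def retrieve(urls):
    for url in urls:
        print url, ' '
        res = urllib.urlopen(url).read()  # urlopen returns a file-like object, so read() works
        yield url, res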
import threading
from time import sleep
# if Python2:
import urllib
# if Python3:
# import urllib.request

URLs = [
    'http://www.cnn.com/',
    'http://www.bbc.co.uk/',
    'http://www.economist.com/',
    'http://nonexistant.website.at.baddomain/',
    'http://slashdot.org/',
    'http://reddit.com/',
    'http://news.ycombinator.com/'
]

class worker(threading.Thread):
    def __init__(self, link):
        threading.Thread.__init__(self)
        self.link = link
        self.start()

    def run(self):
        # if Python2:
        res = urllib.urlopen(self.link).read()  # as mentioned by @DhruvPathak
        # if Python3:
        # res = urllib.request.urlopen(self.link).read()
        filename = self.link.replace('/', '_')  # crude file name derived from the link
        with open(filename, 'wb') as fh:
            fh.write(res)  # store the fetched data in a file named after <link>

for url in URLs:
    while len(threading.enumerate()) > 500:
        sleep(0.25)  # throttle: keep at most ~500 threads alive at once
    worker(url)

while len(threading.enumerate()) > 1:
    sleep(0.25)  # wait for all worker threads to finish
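One caveat: the nonexistant.website.at.baddomain URL from the list will make urlopen raise an IOError inside run(), which only kills that one thread but dumps a traceback to stderr. A minimal sketch of a more forgiving run(), assuming you would rather report failures the way the futures example above does:

    def run(self):
        try:
            res = urllib.urlopen(self.link).read()
        except IOError as exc:
            print '%r failed: %s' % (self.link, exc)
            return
        with open(self.link.replace('/', '_'), 'wb') as fh:
            fh.write(res)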