Download files from URLs in parallel in Python

2018-06-02 14:22:57

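I have some links in a database that I want to download in parallel. I tried downloading them one after another, but it took too much time. I have around 1877 links.

I tried this code to run the downloads in parallel, but it throws an error: failed: 'tuple' object has no attribute 'read'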

#!/usr/bin/env python

import urllib
from stream import ThreadPool

URLs = [
  'http://www.cnn.com/',
  'http://www.bbc.co.uk/',
  'http://www.economist.com/',
  'http://nonexistant.website.at.baddomain/',
  'http://slashdot.org/',
  'http://reddit.com/',
  'http://news.ycombinator.com/'
 ]

def retrieve(urls):
    for url in urls:
        print url,' '
        res = urllib.urlretrieve(url).read()
        yield url, res

if __name__ == '__main__':
    retrieved = URLs >> ThreadPool(retrieve, poolsize=7)
    for url, content in retrieved:
        print '%r is %d bytes' % (url, len(content))
    for url, exception in retrieved.failure:
        print '%r failed: %s' % (url, exception)

I also tried this:

import urllib
import tldextract
from multiprocessing.pool import ThreadPool

URLs = [
  'http://www.cnn.com/',
  'http://www.bbc.co.uk/',
  'http://www.economist.com/',
  'http://nonexistant.website.at.baddomain/',
   'http://slashdot.org/',
  'http://reddit.com/',
  'http://news.ycombinator.com/'
 ]


def dwld(url):
  print url
  res = urllib.urlopen(url).read() 
  filename = tldextract.extract(url)
  with open(filename.domain, 'wb') as fh:
     fh.write(res)
  return url 

pool = ThreadPool(processes = 4)
pool.map(dwld, URLs)

This gives me:

Traceback (most recent call last):
  File "dwld_thread.py", line 26, in <module>
    pool.map(dwld, URLs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known

I have no idea what the stream.ThreadPool you're using is, or what its API is... but the problem is obvious:

res = urllib.urlretrieve(url).read()

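If you look at the documentation for urlretrieve:

Return a tuple (filename, headers) where filename is the local file name under which the object can be found...

You obviously can't call read on a tuple. If you want to download to a local file using this legacy API and then read that file, you can: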

filename, headers = urllib.urlretrieve(url)
with open(filename, 'rb') as f:  # binary mode, so the bytes come back unchanged on every platform
    res = f.read()
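
The tuple return value is easy to confirm without any network access by pointing urlretrieve at a local file:// URL. The following is a sketch in Python 3 spelling (urllib.request.urlretrieve rather than Python 2's urllib.urlretrieve); the temporary files, the explicit destination path, and the helper name retrieve_and_read are all assumptions made for illustration.

```python
import os
import tempfile
import urllib.request
from pathlib import Path


def retrieve_and_read(url, dest):
    """Download url to dest with the legacy API, then read the local copy."""
    result = urllib.request.urlretrieve(url, dest)  # a (filename, headers) tuple
    filename, headers = result
    with open(filename, 'rb') as f:
        return result, f.read()


# A temporary local file stands in for a remote resource (illustration only).
with tempfile.TemporaryDirectory() as tmpdir:
    source = Path(tmpdir) / 'source.txt'
    source.write_bytes(b'hello')
    dest = os.path.join(tmpdir, 'copy.txt')
    result, body = retrieve_and_read(source.as_uri(), dest)

# The return value is a plain tuple, so it has no read() method.
print(isinstance(result, tuple), hasattr(result, 'read'), body)
```

Calling result.read() here would raise the same AttributeError as in the question, because the data lives in the file named by result[0], not in the tuple itself.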

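But why? Just use urllib2.urlopen, which "returns a file-like object with two additional methods", so you can call read on it directly. That way you don't create a temporary file, and you don't rely on an old function that wasn't designed very well and that nobody has maintained in years.

Python also has a nice ThreadPoolExecutor built into the standard library. If you look at the very first example its documentation shows, it's exactly what you're trying to do.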

Unfortunately, you're using Python 2.x, which doesn't have the concurrent.futures module. Fortunately, there is a backport on PyPI that works with 2.5+.
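
For readers on Python 3, where concurrent.futures ships with the standard library, the same idea might look like the sketch below. The names fetch_all and load_url, the 30-second timeout, and the choice to return two dicts are assumptions made for illustration; the fetch parameter exists only so the function can be tried with a stand-in instead of real network access.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_url(url, timeout=30):
    """Fetch one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=load_url, max_workers=7):
    """Run fetch(url) for every URL on a pool of worker threads.

    Returns (contents, failures): two dicts keyed by URL, holding the
    fetched value and the raised exception respectively.
    """
    contents, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so results can be labelled.
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                contents[url] = future.result()
            except Exception as exc:
                # One bad host is recorded and does not stop the others.
                failures[url] = exc
    return contents, failures
```

Calling fetch_all(URLs) with the list from the question would put the unresolvable host in failures and every page that downloaded successfully in contents.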

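Python also has multiprocessing.dummy.Pool (also available under the undocumented, but probably more readable, name multiprocessing.pool.ThreadPool). But if you're willing to go outside the stdlib for a module that you obviously don't know how to use, and that I've never heard of, I'm guessing you won't have any problem using futures. So: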

from concurrent import futures  # the PyPI backport installs as concurrent.futures
import urllib2

URLs = [
  'http://www.cnn.com/',
  'http://www.bbc.co.uk/',
  'http://www.economist.com/',
  'http://nonexistant.website.at.baddomain/',
  'http://slashdot.org/',
  'http://reddit.com/',
  'http://news.ycombinator.com/'
 ]

def load_url(url):
    return urllib2.urlopen(url).read()

if __name__ == '__main__':
    with futures.ThreadPoolExecutor(max_workers=7) as executor:
        fmap = dict((executor.submit(load_url, url), url) for url in URLs)
        for f in futures.as_completed(fmap):
            url = fmap[f]
            try:
                content = f.result()
            except Exception as exception:
                print '%r failed: %s' % (url, exception)
            else:
                print '%r is %d bytes' % (url, len(content))
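
The multiprocessing.pool.ThreadPool route works as well, with one caveat that explains the IOError in the question's second attempt: pool.map re-raises an exception from a worker call, so a single unresolvable host aborts the whole batch. A minimal sketch in Python 3 spelling follows; the names fetch_safely and map_urls and the (url, content, error) result shape are illustrative choices, not part of any library.

```python
import urllib.request
from functools import partial
from multiprocessing.pool import ThreadPool


def load_url(url):
    """Fetch one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_safely(fetch, url):
    """Call fetch(url), returning the exception instead of raising it."""
    try:
        return url, fetch(url), None
    except Exception as exc:
        return url, None, exc


def map_urls(urls, fetch=load_url, processes=4):
    """Return a list of (url, content, error) triples in input order."""
    pool = ThreadPool(processes=processes)
    try:
        # Because fetch_safely never raises, map() always completes.
        return pool.map(partial(fetch_safely, fetch), urls)
    finally:
        pool.close()
        pool.join()
```

With this wrapper, the bad domain shows up as a triple whose error field is set, while the remaining URLs are still downloaded.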

urllib.urlretrieve(url).read() should be urllib.urlopen(url).read()

from threading import Thread, enumerate  # threading.enumerate lists live threads (shadows the builtin)
from time import sleep
# if Python2:
import urllib
# if Python3:
# import urllib.request

URLs = [
  'http://www.cnn.com/',
  'http://www.bbc.co.uk/',
  'http://www.economist.com/',
  'http://nonexistant.website.at.baddomain/',
  'http://slashdot.org/',
  'http://reddit.com/',
  'http://news.ycombinator.com/'
 ]

class worker(Thread):
    def __init__(self, link):
        Thread.__init__(self)
        self.link = link
        self.start()
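    def run(self):
        # if Python2:
        res = urllib.urlopen(self.link).read() # as mentioned by @DhruvPathak
        # if Python3:
        # res = urllib.request.urlopen(self.link).read()
        # a URL contains '/', so percent-encode it to get a usable file name
        # if Python2:
        filename = urllib.quote(self.link, safe='')
        # if Python3:
        # filename = urllib.parse.quote(self.link, safe='')
        with open(filename, 'wb') as fh: # 'wb': the file is written, not read
            fh.write(res) # store fetched data in a file named after the link

for url in URLs: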
    while len(enumerate()) > 500:
        sleep(0.25)
    worker(url)

while len(enumerate()) > 1:
    sleep(0.25) # wait for all threads to finish
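
An alternative to polling threading.enumerate() in a sleep loop is to cap concurrency with a semaphore and wait for completion with join(). This swaps in a different technique from the code above rather than restating it. In the sketch below, the names run_bounded, fetch, and work are hypothetical, and fetch is any callable that takes a URL (for example, one that wraps urlopen).

```python
import threading


def run_bounded(urls, fetch, max_threads=500):
    """Start one thread per URL, with at most max_threads alive at once.

    Returns a dict mapping each URL to fetch(url)'s result, or to the
    exception it raised.
    """
    slots = threading.BoundedSemaphore(max_threads)
    results = {}
    lock = threading.Lock()

    def work(url):
        try:
            try:
                outcome = fetch(url)
            except Exception as exc:
                outcome = exc
            with lock:
                results[url] = outcome
        finally:
            slots.release()  # free the slot even if something went wrong

    threads = []
    for url in urls:
        slots.acquire()  # blocks while max_threads workers are running
        thread = threading.Thread(target=work, args=(url,))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()  # wait for all threads to finish
    return results
```

The semaphore blocks the main loop exactly when the limit is reached, so there is no 0.25-second polling delay, and join() returns as soon as the last worker is done.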

