扫描网站的内容（快速）

2018-06-19 14:42:32

我在数据库中有数千个网站，我想搜索特定字符串的所有网站。什么是最快的方法呢？我想我应该首先获得每个网站的内容 - 这将是我这样做的方式：

import urllib2, re
string = "search string"
source = urllib2.urlopen("http://website1.com").read()
if re.search(word,source):
    print "My search string: "+string

并搜索字符串。但这非常缓慢。我如何在python中加速它？

我不认为你的问题是该程序 - 这是事实，你正在执行一个数千个网站的HTTP请求。您可以调查涉及某种并行处理的不同解决方案，但不管您在解析代码时的效率如何，都会在当前实现中遇到瓶颈。

这是一个使用Queue和threading模块的基本示例。我建议阅读多处理与多线程的好处（比如@JonathanV提到的帖子），但是这将有助于理解正在发生的事情：

import Queue
import threading
import time
import urllib2

my_sites = [
    'http://news.ycombinator.com',
    'http://news.google.com',
    'http://news.yahoo.com',
    'http://www.cnn.com'
    ]

# Create a queue for our processing
queue = Queue.Queue()


class MyThread(threading.Thread):
  """Create a thread to make the url call."""

  def __init__(self, queue):
    super(MyThread, self).__init__()
    self.queue = queue

  def run(self):
    while True:
      # Grab a url from our queue and make the call.
      my_site = self.queue.get()
      url = urllib2.urlopen(my_site)

      # Grab a little data to make sure it is working
      print url.read(1024)

      # Send the signal to indicate the task has completed
      self.queue.task_done()


def main():

  # This will create a 'pool' of threads to use in our calls
  for _ in range(4):
    t = MyThread(queue)

    # A daemon thread runs but does not block our main function from exiting
    t.setDaemon(True)

    # Start the thread
    t.start()

  # Now go through our site list and add each url to the queue
  for site in my_sites:
    queue.put(site)

  # join() ensures that we wait until our queue is empty before exiting
  queue.join()

if __name__ == '__main__':
  start = time.time()
  main()
  print 'Total Time: {0}'.format(time.time() - start)

特别是关于threading好资源，请参阅Doug Hellmann的帖子，这里有一篇IBM文章（这已成为我上面证明的通用线程设置）和实际文档。

尝试考虑使用多处理来同时运行多个搜索。多线程也可以工作，但如果管理不当，共享内存可能变成诅咒。看看这个讨论，以帮助你看到哪种选择适合你。

链接地址: http://www.djcxy.com/p/55213.html

上一篇: scan websites for content (fast)

下一篇: continuosly running function in background in python