scan websites for content (fast)

I have thousands of websites in a database and I want to search all of them for a specific string. What is the fastest way to do it? I think I should fetch the content of each website first; this is how I would do it:

import urllib2, re
string = "search string"
# Fetch the page source and check whether it contains the string
source = urllib2.urlopen("http://website1.com").read()
if re.search(string, source):
    print "My search string: " + string

and then search it for the string. But this is very slow. How can I speed it up in Python?


I don't think your issue is the program itself - it is the fact that you are executing an HTTP request for each of thousands of sites. You could investigate different solutions involving some sort of parallel processing, but regardless of how efficient you make the parsing code, you are going to hit a bottleneck with the requests in your current implementation.

Here is a basic example that uses the Queue and threading modules. I would suggest reading up on the benefits of multiprocessing vs. multiple threads (for example, in the post mentioned by @JonathanV), but this will hopefully be somewhat helpful in understanding what is happening:

import Queue
import threading
import time
import urllib2

my_sites = [
    'http://news.ycombinator.com',
    'http://news.google.com',
    'http://news.yahoo.com',
    'http://www.cnn.com'
    ]

# Create a queue for our processing
queue = Queue.Queue()


class MyThread(threading.Thread):
  """Create a thread to make the url call."""

  def __init__(self, queue):
    super(MyThread, self).__init__()
    self.queue = queue

  def run(self):
    while True:
      # Grab a url from our queue and make the call.
      my_site = self.queue.get()
      url = urllib2.urlopen(my_site)

      # Grab a little data to make sure it is working
      print url.read(1024)

      # Send the signal to indicate the task has completed
      self.queue.task_done()


def main():

  # This will create a 'pool' of threads to use in our calls
  for _ in range(4):
    t = MyThread(queue)

    # A daemon thread runs but does not block our main function from exiting
    t.setDaemon(True)

    # Start the thread
    t.start()

  # Now go through our site list and add each url to the queue
  for site in my_sites:
    queue.put(site)

  # join() ensures that we wait until our queue is empty before exiting
  queue.join()

if __name__ == '__main__':
  start = time.time()
  main()
  print 'Total Time: {0}'.format(time.time() - start)
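The worker above only prints a sample of each page to show that the fetch worked. To do the actual search, one hedged variant would move the re.search call into run() and collect the matching urls in a shared list; the search_string and matches names below are illustrative assumptions, not part of the original example:

import re
import threading
import urllib2

search_string = "search string"  # assumed: the text being looked for
matches = []                     # list.append is atomic in CPython, so a plain list works here


class SearchThread(threading.Thread):
  """Fetch each url from the queue and search its content for search_string."""

  def __init__(self, queue):
    super(SearchThread, self).__init__()
    self.queue = queue

  def run(self):
    while True:
      my_site = self.queue.get()
      try:
        source = urllib2.urlopen(my_site, timeout=10).read()
        if re.search(re.escape(search_string), source):
          matches.append(my_site)
      except Exception:
        pass  # skip sites that time out or fail to load
      finally:
        # Always mark the task done so queue.join() does not hang on errors
        self.queue.task_done()

Wiring this class into the same main() pattern as above (a small pool of daemon threads, queue.put for each site, then queue.join) keeps the overall structure unchanged while doing the real work in the workers.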

For good resources on threading in particular, see Doug Hellmann's post on the topic, an IBM article on threading (the latter has become my general threading setup, as evidenced by the above), and the official docs.


Try looking into using multiprocessing to run multiple searches at the same time. Multithreading works too, but shared memory can turn into a curse if not managed properly. Take a look at this discussion to see which choice would work best for you.
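As a rough sketch of the multiprocessing route (the check_site helper, the pool size, and the url list below are illustrative assumptions, not taken from that discussion), a Pool can fan the downloads out across worker processes and collect the results:

import re
import urllib2
from multiprocessing import Pool

SEARCH_STRING = "search string"   # assumed: the text to look for

def check_site(url):
  """Download one site and report whether its source contains SEARCH_STRING."""
  try:
    source = urllib2.urlopen(url, timeout=10).read()
  except Exception:
    return (url, False)           # treat unreachable sites as non-matches
  return (url, re.search(re.escape(SEARCH_STRING), source) is not None)

if __name__ == '__main__':
  sites = ['http://news.ycombinator.com', 'http://www.cnn.com']
  pool = Pool(processes=8)        # one worker process per concurrent download
  for url, found in pool.map(check_site, sites):
    if found:
      print 'Found "%s" on %s' % (SEARCH_STRING, url)
  pool.close()
  pool.join()

Because each worker is a separate process, there is no shared memory to manage, at the cost of a little extra overhead per request.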
