Download all the links (related documents) on a webpage using Python

I have to download a lot of documents from a webpage. They are wmv files, PDFs, BMPs and so on, and of course they all have links. So each time I have to right-click (RMC) a file, select 'Save Link As', and then save it with the file type set to 'All Files'. Is it possible to do this in Python? I searched the SO database, and people have answered the question of how to get the links from a webpage; I want to download the actual files. Thanks in advance. (This is not a homework question :)).


Here is an example of how to download a few selected files from http://pypi.python.org/pypi/xlwt

You will need to install mechanize first: http://wwwsearch.sourceforge.net/mechanize/download.html

import mechanize
from time import sleep
#Make a Browser (think of this as chrome or firefox etc)
br = mechanize.Browser()

#visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
#for more ways to set up your br browser object e.g. so it looks like Mozilla
#and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

f=open("source.html","w")
f.write(br.response().read()) #can be helpful for debugging maybe

filetypes=[".zip",".exe",".tar.gz"] #you will need to do some kind of pattern matching on your files
myfiles=[]
for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    for t in filetypes:
        if t in str(l): #check if this link has the file extension we want (you may choose to use reg expressions or something)
            myfiles.append(l)


def downloadlink(l):
    f=open(l.text,"wb") #perhaps you should open in a better way & ensure that file doesn't already exist.
    f.write(br.follow_link(l).read()) #follow_link fetches the linked file and returns its response
    f.close()
    print l.text, "has been downloaded"
    #br.back()

for l in myfiles:
    sleep(1) #throttle so you don't hammer the site
    downloadlink(l)
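The comment in downloadlink above hints that opening l.text directly is fragile (the link text may be empty or contain characters that are not valid in a filename). Below is a rough, hedged variant of that helper, assuming mechanize's Link objects expose an absolute_url attribute as recent releases do; it reuses the br object created above, derives the filename from the link target, and skips files that already exist.

import os
import urlparse

def downloadlink_safe(l):
    # Build a filename from the link target instead of its display text.
    name = os.path.basename(urlparse.urlsplit(l.absolute_url).path) or "index.html"
    if os.path.exists(name):
        print name, "already exists, skipping"
        return
    data = br.follow_link(l).read()  # fetch the linked file
    with open(name, "wb") as f:      # binary mode so wmv/pdf/bmp files are not mangled
        f.write(data)
    print name, "has been downloaded"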

Note: in some cases you may wish to replace br.follow_link(l) with br.click_link(l). The difference is that click_link returns a Request object, whereas follow_link opens the link directly. See "Mechanize difference between br.click_link() and br.follow_link()".
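As a rough illustration of that difference (a sketch based on the behaviour described above, not code from the linked answer), the two calls would be used like this:

# follow_link fetches the target and returns the response object directly
data = br.follow_link(l).read()

# click_link only builds a mechanize Request object without fetching it;
# you then pass it to br.open() yourself to get the same data
request = br.click_link(l)
data = br.open(request).read()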


  • Follow the Python code in the following link: wget-vs-urlretrieve-of-python (a sketch of that approach appears after this list).
  • You can also do this easily with Wget. Try the --limit, --recursive and --accept command-line options in Wget. For example: wget --accept wmv,doc --limit 2 --recursive http://www.example.com/files/
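As promised in the first bullet, here is a minimal sketch of the urlretrieve approach, using only the Python 2 standard library; the page URL and the extension list are placeholders, and the crude regular expression stands in for a proper HTML parser.

import re
import urllib
import urllib2
import urlparse

page_url = "http://www.example.com/files/"   # placeholder URL
wanted = (".wmv", ".pdf", ".bmp")            # extensions you care about

html = urllib2.urlopen(page_url).read()
for href in re.findall(r'href="([^"]+)"', html):    # crude href extraction
    if href.lower().endswith(wanted):
        absolute = urlparse.urljoin(page_url, href)  # resolve relative links
        filename = absolute.rsplit("/", 1)[-1]
        urllib.urlretrieve(absolute, filename)       # download the actual file
        print filename, "has been downloaded"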