recursive wget with hotlinked requisites

I often use wget to mirror very large websites. Sites that contain hotlinked content (images, video, CSS, JS) pose a problem: I can't seem to tell wget to grab page requisites that live on other hosts without having the crawl also follow hyperlinks to other hosts.

For example, let's look at this page https://dl.dropbox.com/u/11471672/wget-all-the-things.html

Let's pretend that this is a large site that I would like to mirror completely, with all page requisites, including the hotlinked ones.

wget -e robots=off -r -l inf -pk https://dl.dropbox.com/u/11471672/wget-all-the-things.html

^^ gets everything but the hotlinked image

wget -e robots=off -r -l inf -pk -H https://dl.dropbox.com/u/11471672/wget-all-the-things.html

^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web

wget -e robots=off -r -l inf -pk -H --ignore-tags=a https://dl.dropbox.com/u/11471672/wget-all-the-things.html

^^ gets the first page, including both the hotlinked and the local image, and does not follow the hyperlink to the out-of-scope site, but it obviously also does not follow the hyperlink to the next page of the site.

I know there are other tools and methods for accomplishing this (HTTrack and Heritrix let the user distinguish between hotlinked content on other hosts and hyperlinks to other hosts), but I'd like to see whether it's possible with wget. Ideally this would not be done in post-processing, as I want the external content, requests, and headers included in the WARC file I'm outputting.
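For reference, the WARC output just rides along on the same crawl as extra switches. A minimal sketch, assuming a wget build with WARC support (1.14 or newer); the archive name "site" is only a placeholder:

wget -e robots=off -r -l inf -pk --warc-file=site --warc-cdx https://dl.dropbox.com/u/11471672/wget-all-the-things.html

^^ writes site.warc.gz (plus a CDX index) alongside the mirror; the hotlinked-requisites problem above is unchanged.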


You can't tell wget to span hosts for page requisites only; -H is all or nothing. Since -r combined with -H will pull down the entire Internet, you'll want to keep the two apart. To grab hotlinked page requisites, you have to run wget twice: once to recurse through the site's structure, and once to grab the hotlinked requisites. I've had luck with this method:

1) wget -r -l inf [other non-H non-p switches] http://www.example.com

2) build a list of all HTML files in the site structure (find . | grep html) and pipe it to a file

3) wget -pH [other non-r switches] -i [infile]

Step 1 builds the site's structure on your local machine and saves every HTML page in it. Step 2 gives you a list of those pages, and step 3 wgets all assets used on them. The result is a complete mirror on your local machine, so long as the hotlinked assets are still live.
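Put together, the whole thing looks something like this. It's a sketch rather than a drop-in script: www.example.com is a placeholder, it assumes you run everything from the same (initially empty) directory, and it assumes the saved paths map cleanly back onto the site's URLs.

# Pass 1: recurse through the site itself -- no -H, no -p
wget -e robots=off -r -l inf http://www.example.com

# Pass 2: list every HTML page the first pass saved and turn each saved
# path back into a URL (wget mirrors the URL layout on disk); adjust the
# -name pattern if the site uses other extensions for its pages
find . -type f -name '*.html' | sed 's|^\./|http://|' > pages.txt

# Pass 3: for those pages only, grab requisites and span hosts, without recursing
wget -e robots=off -p -H -k -i pages.txt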


I've managed to do this using regular expressions. Something like this mirrors http://www.example.com/docs:

wget --mirror --convert-links --adjust-extension \
     --page-requisites --span-hosts \
     --accept-regex '^http://www.example.com/docs|\.(js|css|png|jpeg|jpg|svg)$' \
     http://www.example.com/docs

You'll probably have to tune the regexes for each specific site. For example, some sites like to use query parameters on CSS files (e.g. style.css?key=value), which this example will exclude.

The files you want to pull in from other hosts will probably cover at least:

  • Images: png jpg jpeg gif
  • Fonts: ttf otf woff woff2 eot
  • Others: js css svg
  • Anybody know any others?

    So the actual regex you want will probably look more like this (as one string with no linebreaks):

    ^http://www.example.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj][Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww][Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$
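
    Plugged back into the command above, that gives something like the following (www.example.org/docs is just the example prefix, and depending on how your wget was built you may need --regex-type=pcre):

    wget --mirror --convert-links --adjust-extension --page-requisites --span-hosts \
         --accept-regex '^http://www.example.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj][Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww][Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$' \
         http://www.example.org/docs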
    