Using wget or curl to download web sites for archival

wget is useful for downloading entire web sites recursively. For archival purposes, what you want is usually something like this:

wget -rkp -l3 -np -nH --cut-dirs=1 http://web.psung.name/emacstips/

This will start at the specified URL and recursively download pages up to 3 links away from the original page (-l3), but only pages that are in the directory of the URL you specified (emacstips/) or one of its subdirectories (-np, "no parent").

Thanks to -k, wget will also rewrite the links in the pages it downloads so that your copy works as a self-contained local mirror, and -p tells it to fetch all page prerequisites (e.g. images, stylesheets, and the like).
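If the short flags are hard to remember, the same command can be spelled out with wget's long options, a one-for-one equivalent of the command above:

wget --recursive --convert-links --page-requisites --level=3 --no-parent --no-host-directories --cut-dirs=1 http://web.psung.name/emacstips/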

The last two options, -nH --cut-dirs=1, control where wget places the output. If you omitted those two options, wget would, for example, download http://web.psung.name/emacstips/index.html and place it under a subdirectory web.psung.name/emacstips of the current directory. With only -nH ("no host directories"), wget would write that same file to a subdirectory emacstips. And with both options, wget would write that same file to the current directory. In general, if you want to reduce the number of extraneous directories created, set --cut-dirs to the number of leading directories in your URL.
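Concretely, here is where http://web.psung.name/emacstips/index.html ends up in each case, relative to the current directory:

(neither option)     web.psung.name/emacstips/index.html
-nH                  emacstips/index.html
-nH --cut-dirs=1     index.html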

Bonus: downloading files with curl

Another tool, curl, provides some of the same functionality as wget, plus some complementary features. One thing curl can do is download sequentially numbered files, specified using brackets [..]. For example, the following pattern:

http://www.cl.cam.ac.uk/~rja14/Papers/SE-[01-24].pdf

refers to the 24 chapters of Ross Anderson's Security Engineering: http://www.cl.cam.ac.uk/~rja14/Papers/SE-01.pdf, http://www.cl.cam.ac.uk/~rja14/Papers/SE-02.pdf, and so on, up to http://www.cl.cam.ac.uk/~rja14/Papers/SE-24.pdf.
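If you just want each chapter saved under its remote filename, curl's -O (capital O) option works with such ranges as well; quoting the URL also keeps shells like zsh from trying to interpret the brackets themselves:

curl -O "http://www.cl.cam.ac.uk/~rja14/Papers/SE-[01-24].pdf"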

You can give curl a pattern for naming the output files. For example, if I wanted the files to be named SE-chapter-01.pdf, SE-chapter-02.pdf, etc., the appropriate curl incantation would be:

curl "http://www.cl.cam.ac.uk/~rja14/Papers/SE-[01-24].pdf" -o "SE-chapter-#1.pdf"

In addition to consecutively numbered files, you can also use braces {..} to specify alternatives, as you would in a shell, e.g. http://web.psung.name/page/{one,two,three}.html. Output patterns with "#1" work with braces too, as the example below shows.
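Here the page-#1.html name is just for illustration: curl substitutes whichever alternative matched, producing page-one.html, page-two.html, and page-three.html:

curl "http://web.psung.name/page/{one,two,three}.html" -o "page-#1.html"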
