Description
WebCrawl is a program designed to download an entire website without user interaction (although an interactive mode is available).
It works by starting from a single web page and following all links from that page, attempting to recreate the directory structure of the remote server.
As well as downloading the pages, it also rewrites them so that URLs which would otherwise not work on the local system (e.g. URLs that begin with http:// or with a /) are replaced with local URLs.
It stores the downloaded files in a directory structure that mirrors the original site's, under a directory called server.domain.com:port. This way, multiple sites can all be loaded into the same directory structure, and if they link to each other, they can be rewritten to link to the local, rather than remote, versions.
Comprehensive URL selection facilities allow you to describe what documents you want to download, so that you don't end up downloading much more than you need.
WebCrawl is written in ANSI C, and should work on any POSIX system. With minor modifications, it should be possible to make it work on any operating system that supports TCP/IP sockets. It has been tested only on Linux.

Usage

webcrawl [options] web-address destination-dir

WebCrawl will download the page web-address into a directory called destination-dir under the compiled-in server root directory (which can be changed with the -o option; see below). web-address should not contain a leading http://.
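As a minimal sketch of an invocation (www.example.com and the destination directory name example are placeholders, not defaults):

    webcrawl www.example.com example

This would fetch the front page of www.example.com and everything on that server reachable from it by links, storing the result under the example directory (itself relative to the server root), with the site's files expected in a www.example.com:80 subdirectory as described above.
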
URL selection:
-a
    This causes the program to ask the user whether to download a page that it hasn't been otherwise instructed to (by default, this means off-site pages).
-f string
    This causes the program to always follow links to URLs that contain string. You can use this, for example, to prevent a crawl from going up beyond a single directory on a site (in conjunction with the -x option below); say you wanted to get http://www.web-sites.co.uk/jules but not any other site located on the same server. You could use a command line like the first sketch below.
    Another use would be if a site contained links to (e.g.) pictures, videos or sound clips on a remote server; a command line like the second sketch below could be used to get them.
    Note that webcrawl always downloads inline images.
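    The examples referred to above are reconstructed here as sketches only; the destination directory names and the media.example.com host are illustrative, and the exact -f match strings depend on how webcrawl normalises links.

        webcrawl -x -f www.web-sites.co.uk/jules www.web-sites.co.uk/jules jules

    The -x option stops the crawl from automatically following other links on www.web-sites.co.uk, while -f forces it to follow anything under the jules directory.

        webcrawl -f media.example.com www.some-site.co.uk some-site

    Here the crawl follows the starting site as normal, and -f additionally lets it fetch the pictures, videos or sound clips hosted on the hypothetical remote server media.example.com.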
-d string
    The opposite of -f; this option tells webcrawl never to get a URL containing string. -d takes priority over all other URL selection options (except that it will not stop webcrawl from downloading inline images, which are always fetched).
-u filename
    Causes webcrawl to log unfollowed links to the file filename.
-x
    Causes webcrawl not to automatically follow links to pages on the same server. This is useful in conjunction with the -f option to specify a subsection of an entire site to download.
-X
    Causes webcrawl not to automatically download inline images (which it would otherwise do even when other options did not indicate that the image should be loaded). This is useful in conjunction with the -f option to specify a subsection of an entire site to download, when even the images concerned need careful selection.

Page re-writing:
-n
    Turns off page rewriting completely.
-rx
    Selects which URLs to rewrite. Only URLs that begin with / or http: are considered for rewriting; all others are always left unchanged. This option selects which of these URLs are rewritten to point to local files, depending on the value of x.
-k
    Keep original filenames. This disables the changing of filenames to remove metacharacters that may confuse a web server, and to ensure that the extension on the end of the filename is a correct .html or .htm whenever the page has a text/html content type. (See Configuration files below for a discussion of how to achieve this with other file types.)
-q
    Disables process ID insertion into query filenames. Without this flag, and whenever -k is not in use, webcrawl rewrites the filenames of queries (defined as any fetch from a web server that includes a '?' character in the filename) to include the process ID of the webcrawl process, in hexadecimal, after the (escaped) '?' in the filename; this may be desirable when performing the same query multiple times to get different results.

Recursion limiting:
-l[x] number
    This option is used to limit the depth to which webcrawl will search the tree (forest) of interlinked pages. There are two parameters that may be set: with x as l, the initial limit is set; with x as r, the limit used after jumping to a remote site is set. If x is omitted, both limits are set. (A sketch of its use follows this entry.)
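    As a sketch only (the depth values are arbitrary, and the limit is assumed to be passed as a separate argument, as in the synopsis above), the following would follow links to a depth of five from the starting page, but only two levels deep after jumping to a remote site:

        webcrawl -ll 5 -lr 2 www.example.com example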
-v
    Increases the program's verbosity. Without this option, no status reports are made unless errors occur. Used once, webcrawl will report which URLs it is trying to download, and also which links it has decided not to follow. -v may be used more than once, but this is probably only useful for debugging purposes.
-o dir
    Changes the server root directory. This is the directory that the path specified at the end of the command line is relative to.
-p dir
    Changes the URL rewriting prefix. This is prepended to rewritten URLs, and should form a (relative) URL that points to the current server root directory. An example of the use of the -o and -p options is given below.
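    The following is a sketch only; the directory /home/user/mirror and the /mirror/ prefix are hypothetical, and assume a local web server that publishes /home/user/mirror under the path /mirror/:

        webcrawl -o /home/user/mirror -p /mirror/ www.example.com example

    Downloaded files are stored under /home/user/mirror/example, and rewritten URLs are given the /mirror/ prefix so that they resolve against that server root when the pages are viewed through the local web server.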

HTTP-related options:
-A string
    Causes webcrawl to send the specified string as the HTTP 'User-Agent' value, rather than the compiled-in default (normally 'Mozilla/4.05 [en] (X11; I; Linux 2.0.27 i586; Nav)', although this can be changed in the file web.h at compile time).
-t n
    Specifies a timeout, in seconds. The default behaviour is to give up after this length of time from the initial connection attempt.
-T
    Changes the timeout behaviour. With this flag, the timeout occurs only if no data is received from the server for the specified length of time.
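    As a sketch (the 60-second value is arbitrary), the following gives up on a page only when nothing has been received from the server for a full minute:

        webcrawl -T -t 60 www.example.com example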

Configuration files

At present webcrawl uses configuration files to specify rules for the rewriting of filenames. It searches for the files /etc/webcrawl.conf, /usr/local/etc/webcrawl.conf and $HOME/.webcrawl, and processes all the files it finds, in that order. Parameters set in one file may be overridden by subsequent files. Note that it is perfectly possible to use webcrawl without a configuration file; one is only required for advanced features that are too complex to configure on the command line.
The overall syntax of a configuration file is a set of sections, each headed by a line of the form [section-name].
At present, only the [rename] section is defined. This may contain the following commands:
meta string
    Sets the metacharacter list. Any character in the list specified will be quoted in the filenames produced (unless filename rewriting is disabled with the -k option). Quoting is performed by prepending the quoting character (default @) to the hexadecimal ASCII value of the character being quoted, so a '?' would come out as something like @3f. The default metacharacter list is: ?&*%=#
quote char
    Sets the quoting character (default @).
type content/type preferred [extra extra ...]
    Sets the list of acceptable extensions for the specified MIME content type. The first item in the list is the preferred extension; if renaming is not disabled (with the -k option) and the extension of a file of this type is not on the list, then the first extension on the list will be appended to its name.
    An implicit line is defined internally, which reads:

        type text/html html htm

    This can be overridden; if, say, you preferred the 'htm' extension over 'html', you could use:

        type text/html htm html

    in a configuration file to cause .htm extensions to be used whenever a new extension was added.
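
As an illustration, a small $HOME/.webcrawl combining these directives might look like the following sketch; the extended metacharacter list and the image/jpeg line are examples only, not built-in defaults:

    [rename]
    meta ?&*%=#~
    type text/html htm html
    type image/jpeg jpg jpeg

This keeps the default quoting character, additionally quotes '~' in generated filenames, prefers the .htm extension for text/html pages, and accepts either .jpg or .jpeg for image/jpeg files (appending .jpg when neither is present).
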
WebCrawl was written by Julian R. Hall <jules@acris.co.uk>, with suggestions and prompting by Andy Smith.
Bugs should be submitted to Julian Hall at the address above. Please include information about the architecture, version, etc. that you are using.