LinkChecker

Documentation

Basic usage

To check a URL like http://www.example.org/, run linkchecker www.example.org/ on the command line, or enter www.example.org in the GUI application. This checks the complete domain of http://www.example.org recursively. All links pointing outside of the domain are also checked for validity, but not recursed into.
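
For example, the following command starts the recursive check described above (a minimal sketch, assuming the linkchecker program is installed and on your PATH); the http:// scheme is added automatically when it is omitted:

  linkchecker www.example.org/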

Performed checks

All URLs have to pass a preliminary syntax test. Minor quoting mistakes are fixed automatically and reported with a warning. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.
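
For example, a URL containing an unescaped space passes the syntax test after being quoted (a sketch; the URL is hypothetical and the exact warning text varies between versions):

  linkchecker "http://www.example.org/a file.html"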

Recursion

Before descending recursively into a URL, it has to fulfill several conditions. The conditions are checked in this order:

  1. The URL must be valid.

  2. The URL must be parseable. This currently includes HTML files, Opera bookmarks files, and directories. If a file type cannot be determined (for example it does not have a common HTML file extension, and the content does not look like HTML), it is assumed to be non-parseable.

  3. The URL content must be retrievable. This is usually the case; exceptions include, for example, mailto: or unknown URL types.

  4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level command line option, the recursion level GUI option, or through the configuration file. The recursion level is unlimited by default (see the example after this list).

  5. It must not match the ignored URL list. This is controlled with the --ignore-url command line option or through the configuration file (also shown in the example after this list).

  6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by searching for a "nofollow" directive in the HTML header data.
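
For example, the following command limits recursion to two levels and excludes all mailto: URLs from checking (a sketch; the regular expression is only an illustration):

  linkchecker --recursion-level=2 --ignore-url=^mailto: www.example.org/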

Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.
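
For example, the following command recursively checks every file in a local directory (a sketch; /var/www/ is a hypothetical path):

  linkchecker file:///var/www/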