Chapter 6. Programming interface

Table of Contents
6.1. Writing a document filter
6.2. Field data processing configuration
6.3. API

Recoll has an application programming interface (API), usable for both indexing and searching, currently accessible from the Python language.

Another less radical way to extend the application is to write filters for new types of documents.

The processing of metadata attributes for documents (fields) is highly configurable.

6.1. Writing a document filter

Recoll filters are executable programs which translate from a specific format (e.g. OpenOffice, Acrobat) to the Recoll indexing input format, which may be text/plain or text/html.

As of Recoll 1.13, there are two kinds of filters: simple filters, and a second, more complex kind.

The following describes only the simple filters. If you are programmer enough to write one of the other kind, it shouldn't be too difficult to make sense of one of the existing modules (e.g. rclzip).

Recoll simple filters are usually shell scripts, but this is in no way necessary. These programs are extremely simple, and most of the difficulty lies in extracting the text from the native format, not in outputting what Recoll expects. Happily enough, most document formats already have translators or text extractors which handle the difficult part and can be called from the filter. In some cases the output of the translating program is appropriate as is, and no intermediate shell script is needed.

Filters are called with a single argument which is the source file name. They should output the result to stdout.

The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells the filter whether the operation is for previewing or for indexing. Some filters use this to output a slightly different format. Handling it is not essential.
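The calling convention described above (one file name argument, result on stdout, optional check of RECOLL_FILTER_FORPREVIEW) can be sketched in Python as follows. This is only an illustration: the extract_text function is a hypothetical placeholder for the real format-specific extraction, the whitespace collapsing for indexing is an arbitrary example of "a slightly different format", and it is assumed that the value yes means previewing.

```python
import os
import sys


def is_preview(environ=os.environ):
    """True when the filter is being run for previewing (assumed: value 'yes')."""
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def extract_text(path):
    """Hypothetical placeholder: a real filter would call a translator here."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()


def main(argv, environ=os.environ, out=sys.stdout):
    """Filter entry point: argv[1] is the source file name, output goes to 'out'."""
    if len(argv) != 2:
        print("usage: %s <file>" % argv[0], file=sys.stderr)
        return 1
    text = extract_text(argv[1])
    if not is_preview(environ):
        # Illustrative choice only: collapse whitespace runs when indexing,
        # keep the original layout when previewing.
        text = " ".join(text.split())
    out.write(text)
    return 0

# In an actual filter script, finish with: sys.exit(main(sys.argv))
```

Because this sketch outputs plain text, its mimeconf entry would need to declare the output type and character set itself, as the antiword entry in the sample below does.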

The association of file types to filters is performed in the mimeconf file. A sample:


[index]
application/msword = exec antiword -t -i 1 -m UTF-8;\
     mimetype = text/plain ; charset=utf-8

application/ogg = exec rclogg

text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html

application/x-chm = execm rclchm

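The fragment specifies that application/msword files are processed by running the antiword program, whose output is text/plain encoded in UTF-8; that application/ogg files are processed by the rclogg script; that text/rtf files are processed by unrtf, which outputs text/html, with iso-8859-1 declared as the character set; and that application/x-chm files are handled by rclchm, which is declared with the execm keyword instead of exec.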

The easiest way to write a new filter is probably to start from an existing one.
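To make the structure of mimeconf entries such as those in the sample above more concrete, here is a small Python sketch which splits one entry into its MIME type, keyword, command line, and attributes. The function name parse_filter_entry is hypothetical and this is not Recoll's own configuration parser: it only handles the simple cases shown in the sample (a backslash line continuation, and semicolon-separated attributes after the command).

```python
import shlex


def parse_filter_entry(text):
    """Split 'mimetype = exec command ; attr = value ; ...' into its parts.

    Illustrative only. A semicolon inside a command argument is not supported.
    """
    # Join continuation lines: a trailing backslash continues the entry.
    logical = " ".join(
        line.strip().rstrip("\\").strip() for line in text.splitlines()
    )
    mimetype, _, value = logical.partition("=")
    command_part, *attr_parts = value.split(";")
    words = shlex.split(command_part)
    attrs = {}
    for part in attr_parts:
        if not part.strip():
            continue
        key, _, val = part.partition("=")
        attrs[key.strip()] = val.strip()
    return {
        "mimetype": mimetype.strip(),
        "keyword": words[0],      # exec or execm in the sample
        "command": words[1:],     # program name followed by its arguments
        "attrs": attrs,           # e.g. output mimetype and charset
    }


sample = (
    "application/msword = exec antiword -t -i 1 -m UTF-8;\\\n"
    "     mimetype = text/plain ; charset=utf-8"
)
entry = parse_filter_entry(sample)
print(entry["keyword"], entry["command"], entry["attrs"])
```

Seen this way, a new entry needs three decisions: which program to run, whether it is declared with exec or execm, and whether the output type and character set must be stated because the program's output does not carry them.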

Filters which output text/plain text are generally simpler, but they cannot specify the character set and other metadata, so they are limited to cases where these elements are not needed.

6.1.1. Filter HTML output

The output HTML could be very minimal like the following example:

<html><head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
</head>
<body>some text content</body></html>
         

You should take care to escape some characters inside the text by transforming them into the appropriate entities: "&" should be transformed into "&amp;" and "<" into "&lt;". This is not always done properly by translating programs which output HTML, and of course never by those which output plain text.
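For a filter written in Python, the escaping described above can be delegated to the standard library. The sketch below wraps the output of a plain-text extractor into a minimal HTML body; the function name wrap_body is hypothetical. Note that html.escape also converts ">" into "&gt;", which is harmless.

```python
import html


def wrap_body(text):
    """Escape extracted text and wrap it in a body element."""
    # quote=False: only &, < and > are converted; quotes are left alone,
    # which is fine for element content (as opposed to attribute values).
    return "<body>%s</body>" % html.escape(text, quote=False)


print(wrap_body('AT&T says "1 < 2"'))
```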

The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.

Recoll will also make use of other header fields if they are present: title, description, keywords.

Filters can also "invent" field names. These should be output as meta tags:

<meta name="somefield" content="Some textual data" />
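Putting the header elements together, a Python filter could assemble its output along the following lines. This is a sketch under assumptions: the function name build_filter_html and its parameters are hypothetical, and description and keywords are assumed to be passed as ordinary meta tags alongside any invented fields.

```python
import html


def build_filter_html(body_text, charset="UTF-8", title=None, fields=None):
    """Assemble filter output: charset declaration, optional title, meta fields."""
    head = [
        '<meta http-equiv="Content-Type" content="text/html;charset=%s">'
        % html.escape(charset)
    ]
    if title:
        head.append("<title>%s</title>" % html.escape(title, quote=False))
    for name, value in (fields or {}).items():
        # Attribute values need quotes escaped too, hence the default quote=True.
        head.append(
            '<meta name="%s" content="%s" />'
            % (html.escape(name), html.escape(value))
        )
    return "<html><head>\n%s\n</head>\n<body>%s</body></html>\n" % (
        "\n".join(head),
        html.escape(body_text, quote=False),
    )


doc = build_filter_html(
    "some text content",
    title="A title",
    fields={"keywords": "one two", "somefield": "Some textual data"},
)
print(doc)
```

The function returns a string; the bytes actually written to stdout must be encoded with the same character set that the header declares, for example with sys.stdout.buffer.write(doc.encode("utf-8")).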

See the following section for details about configuring how field data is processed by the indexer.