Recoll has an application programming interface (API), usable for both indexing and searching, currently accessible from the Python language.
Another, less radical way to extend the application is to write filters for new types of documents.
The processing of metadata attributes for documents (fields) is highly configurable.
Recoll filters are executable programs which translate from a specific format (e.g. OpenOffice, Acrobat) to the Recoll indexing input format, which may be text/plain or text/html.
Recoll filters are usually shell scripts, but this is in no way necessary. These programs are extremely simple: most of the difficulty lies in extracting the text from the native format, not in producing the output that Recoll expects. Happily, most document formats already have translators or text extractors which handle the difficult part and can be called from the filter. In some cases the output of the translating program is directly usable, and no intermediate shell script is needed.
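As an illustration of delegating the difficult part to an existing translator, the following Python sketch runs an external extractor on a file and returns whatever it prints. The function name run_extractor is hypothetical, and the choice of extractor command is left to the caller; this is one way such a wrapper might look, not code shipped with Recoll.

```python
import subprocess


def run_extractor(command, path):
    """Run an external text extractor on path and return its raw output.

    command is the extractor's argument list without the file name.
    Appending the file name as the last argument is an assumption of
    this sketch; some extractors may expect it elsewhere.
    """
    # check=True raises CalledProcessError if the extractor exits with an error.
    result = subprocess.run(command + [path], stdout=subprocess.PIPE, check=True)
    return result.stdout


# Hypothetical usage, with a made-up extractor name:
#     data = run_extractor(["some-extractor", "--some-option"], "/path/to/file")
```

The output is returned as bytes so that the wrapper does not have to guess the extractor's character set.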
Filters are called with a single argument which is the source file name. They should output the result to stdout.
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells the filter if the operation is for indexing or previewing. Some filters use this to output a slightly different format. This is not essential.
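The calling convention described above can be sketched in Python as follows. The helper names (is_preview, extract_text) and the whitespace handling are hypothetical and only illustrate where the environment variable might be consulted; a real filter would replace extract_text with a format-specific extractor.

```python
#!/usr/bin/env python3
import os
import sys


def is_preview(environ=None):
    """Return True when the filter is being run to build a preview."""
    environ = os.environ if environ is None else environ
    # The variable takes the values "yes" or "no". Treating a missing
    # value as "no" is an assumption of this sketch.
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def extract_text(path):
    """Placeholder extraction step: read the file as UTF-8 text."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def main(argv):
    # The filter receives exactly one argument: the source file name.
    if len(argv) != 2:
        sys.stderr.write("usage: %s file\n" % argv[0])
        return 1
    text = extract_text(argv[1])
    if not is_preview():
        # Hypothetical indexing-only tweak: collapse runs of whitespace.
        text = " ".join(text.split())
    # The result goes to stdout, written as UTF-8 bytes.
    sys.stdout.buffer.write(text.encode("utf-8"))
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Because this sketch writes plain text rather than HTML, its configuration entry would presumably need to declare mimetype=text/plain and the utf-8 character set.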
The association of file types to filters is performed in the mimeconf file. A sample:
    [index]
    application/msword = exec antiword -t -i 1 -m UTF-8;\
        mimetype=text/plain;charset=utf-8
    application/ogg = exec rclogg
    text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
The fragment specifies that:
application/msword files are processed by executing the antiword program, which outputs text/plain encoded in utf-8.
application/ogg files are processed by the rclogg script, with default output type (text/html, with encoding specified in the header, or utf-8 by default).
text/rtf is processed by unrtf, which outputs text/html. The iso-8859-1 encoding is specified explicitly because it differs from the utf-8 default and unrtf does not declare it in the HTML header section.
The easiest way to write a new filter is probably to start from an existing one.
Filters which output text/plain are generally simpler, but they cannot specify the character set or other metadata, so they are limited to cases where these elements are not needed.
The output HTML can be very minimal, as in the following example:
    <html><head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    </head>
    <body>some text content</body></html>
You should take care to escape some characters inside the text by transforming them into the appropriate entities: "&" should be transformed into "&amp;" and "<" should be transformed into "&lt;". This is not always properly done by translating programs which output HTML, and of course never by those which output plain text.
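For a filter written in Python, the escaping can be left to the standard library. The sketch below uses html.escape; the wrapper name escape_text is hypothetical.

```python
import html


def escape_text(text):
    """Escape characters that would otherwise be read as HTML markup."""
    # quote=False leaves quote characters alone, which is enough for
    # body text; "&", "<" and ">" are converted to entities.
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(escape_text("R&D: a < b"))
```

Text that ends up inside an attribute value, such as the content of a meta tag, should be escaped with the default quote=True so that quote characters are converted as well.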
The character set needs to be specified in the header. It does not need to be UTF-8 (Recoll will take care of translating it), but it must be accurate for good results.
Recoll will also make use of other header fields if they are present: title, description, keywords.
Filters can also "invent" field names. These should be output as meta tags:
<meta name="somefield" content="Some textual data" />
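Putting the pieces together, a filter written in Python might assemble its HTML output along the following lines. This is a minimal sketch: the function build_document and its parameters are hypothetical, and only the Content-Type declaration, the title and the meta tag layout come from the text above.

```python
import html


def build_document(body, title="", fields=None, charset="UTF-8"):
    """Return a minimal HTML document for the indexer.

    fields maps field names to values; each pair becomes one meta tag.
    """
    head = [
        '<meta http-equiv="Content-Type" content="text/html;charset=%s">' % charset
    ]
    if title:
        head.append("<title>%s</title>" % html.escape(title, quote=False))
    for name, value in (fields or {}).items():
        # Attribute values are escaped with quote=True (the default).
        head.append(
            '<meta name="%s" content="%s" />'
            % (html.escape(name), html.escape(value))
        )
    return "<html><head>\n%s\n</head>\n<body>%s</body></html>\n" % (
        "\n".join(head),
        html.escape(body, quote=False),
    )


if __name__ == "__main__":
    print(build_document("a < b", title="Demo",
                         fields={"somefield": "Some textual data"}))
```

The charset argument must match the encoding actually used when the returned string is written out, since the indexer relies on that declaration.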
See the following section for details about configuring how field data is processed by the indexer.