groonga - An open-source fulltext search engine and column store.

4.1. Basic operations

The groonga package provides a C library (libgroonga) and a command-line tool (groonga). This tutorial explains how to use the groonga command, which lets you create and operate databases, start a server, connect to a server, and so on.

4.1.1. Create a database

You can create a new database with the following command.

Form:

groonga -n DB_PATH_NAME

The '-n' option tells groonga to create a new database. DB_PATH_NAME specifies the path of the new database. Note that this command fails if the specified path already exists.

This command creates a database and then enters interactive mode, in which groonga prompts you to enter commands for operating that database. You can leave this mode with Ctrl-d.

Execution example:

% groonga -n /tmp/tutorial.db
> Ctrl-d
%

4.1.2. Operate a database

Form:

groonga DB_PATH_NAME [COMMAND]

DB_PATH_NAME specifies the path of a target database.

If COMMAND is specified, groonga executes COMMAND and returns the result. Otherwise, groonga starts in interactive mode, in which it reads commands from the standard input and executes them one by one. This tutorial focuses on the interactive mode.

Let's check the status of a groonga process with the status command.

Execution example:

% groonga -n /tmp/groonga-databases/introduction.db
> status
[[0,1322616280.40348,0.000158121],{"alloc_count":127,"starttime":1322616279,"uptime":1,"version":"1.2.8-9-gbf05b82","n_queries":0,"cache_hit_rate":0.0,"command_version":1,"default_command_version":1,"max_command_version":2}]

As shown in the above example, a command normally returns its result as a JSON array. The first element contains an error code, the execution time, etc. The second element is the result of the operation.
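The two-element layout can be taken apart with any JSON parser. The following Python sketch does so for the status output shown above. The helper name split_response is hypothetical, and reading the second and third header values as "start time" and "elapsed seconds" is an assumption; the text only says the first element holds an error code, the execution time, etc.

```python
import json

# Response text copied from the status example above.
raw = ('[[0,1322616280.40348,0.000158121],'
       '{"alloc_count":127,"starttime":1322616279,"uptime":1,'
       '"version":"1.2.8-9-gbf05b82","n_queries":0,"cache_hit_rate":0.0,'
       '"command_version":1,"default_command_version":1,"max_command_version":2}]')


def split_response(text):
    """Split a response into its first element (header) and second element (body)."""
    header, body = json.loads(text)
    return header, body


header, body = split_response(raw)
return_code = header[0]  # error code; 0 in every example in this tutorial
started_at = header[1]   # assumption: time at which the command started
elapsed = header[2]      # assumption: execution time in seconds

print(return_code, body["version"], body["uptime"])
```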

4.1.3. Command format

Commands for operating a database accept arguments as follows:

Form_1: COMMAND VALUE_1 VALUE_2 ..

Form_2: COMMAND --NAME_1 VALUE_1 --NAME_2 VALUE_2 ..

In the first form, arguments must be passed in a fixed order. Arguments of this kind are called positional arguments because the position of each argument determines its meaning.

In the second form, you specify each parameter name together with its value, so the arguments may be given in any order. Arguments of this kind are known as named parameters or keyword arguments.

If you want to specify a value which contains whitespace or special characters, such as quotes and parentheses, enclose the value in single quotes or double quotes.

For details, see also the "command" paragraph in the groonga executable file reference.
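As an illustration of the second form and of the quoting rule, here is a small Python sketch that assembles a command line from named parameters. The helpers quote_value and named_command are hypothetical and not part of groonga. The set of characters that triggers quoting, and the backslash-escaping of backslashes, are assumptions; escaping a double quote as \" follows the '_key' search example later in this tutorial.

```python
def quote_value(value):
    """Wrap a value in double quotes if it has whitespace, quotes, parentheses or backslashes."""
    text = str(value)
    special = set(" \t\n'\"()\\")  # assumption: characters treated as special here
    if text and not any(ch in special for ch in text):
        return text
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def named_command(command, **params):
    """Build 'COMMAND --NAME_1 VALUE_1 --NAME_2 VALUE_2 ..' from keyword arguments."""
    parts = [command]
    for name, value in params.items():
        parts.append("--" + name)
        parts.append(quote_value(value))
    return " ".join(parts)


create = named_command("table_create", name="Site",
                       flags="TABLE_HASH_KEY", key_type="ShortText")
search = named_command("select", table="Site",
                       query='_key:"http://example.org/"')
print(create)
print(search)
```

Because every value carries its parameter name, the keyword arguments could be passed in a different order and still describe the same command.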

4.1.4. Basic commands

status
    shows the status of a groonga process.
table_list
    shows a list of tables in a database.
column_list
    shows a list of columns in a table.
table_create
    adds a table to a database.
column_create
    adds a column to a table.
select
    searches records from a table and shows the result.
load
    inserts records into a table.

4.1.5. Create a table

A table_create command creates a table.

In most cases, a groonga table has a primary key, which must be specified together with its data type and index type.

There are various data types such as integers, floating-point numbers, etc. The index type determines the search performance and the availability of prefix searches. We will explain the details later.

Let's create a 'Site' table which has a primary key of ShortText. In this example, the index type is HASH.

Execution example:

> table_create --name Site --flags TABLE_HASH_KEY --key_type ShortText
[[0,1322616280.60791,0.01234375],true]

4.1.6. View a table

A select command shows the contents of a table.

Execution example:

> select --table Site
[[0,1322616280.82196,0.000451873],[[[0],[["_id","UInt32"],["_key","ShortText"]]]]]

When only a table is specified, the 'select' command returns the first (at most) 10 records of that table. "[0]" in the result shows the number of records in the 'Site' table. The next array is a list of columns. ["_id","UInt32"] is a column of UInt32, named "_id". ["_key","ShortText"] is a column of ShortText, named "_key".

The above two columns, '_id' and '_key', are mandatory columns. The '_id' column stores IDs that are automatically allocated by groonga. The '_key' column is associated with the primary key. You are not allowed to rename these columns.
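The structure of a select result described above can be unpacked as in the following Python sketch. The function name parse_select is hypothetical, and the sketch assumes the layout seen in this tutorial's examples: a count, then a column list, then one array per record.

```python
import json

# Response text copied from the select example above (empty 'Site' table).
raw = ('[[0,1322616280.82196,0.000451873],'
       '[[[0],[["_id","UInt32"],["_key","ShortText"]]]]]')


def parse_select(text):
    """Return (count, columns, records) from a select response."""
    header, body = json.loads(text)
    result = body[0]         # the result set of the select command
    count = result[0][0]     # "[0]" in the example: the number of records
    columns = result[1]      # list of [name, type] pairs
    names = [name for name, _type in columns]
    # Any remaining elements are records, one array per record.
    records = [dict(zip(names, row)) for row in result[2:]]
    return count, columns, records


count, columns, records = parse_select(raw)
print(count, columns, records)
```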

4.1.7. Create a column

A column_create command adds a column to a table.

Let's add a column of ShortText to store titles. We give the column the descriptive name 'title'.

Execution example:

> column_create --table Site --name title --flags COLUMN_SCALAR --type ShortText
[[0,1317212712.91734,0.077833747],true]
> select --table Site
[[0,1317212713.19572,0.000121119],[[[0],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]]]]]

The COLUMN_SCALAR flag indicates that the new column is a regular column.

4.1.8. Create a lexicon table for full text searches

Let's move on to full text search.

Groonga uses an inverted index to provide fast full text search. The first step is therefore to create a lexicon table, which stores the inverted index (also known as postings lists). The primary key of this table holds the vocabulary of index terms, and each record stores the postings list for one index term.

The following shows a command which creates a lexicon table named 'Terms'. The data type of its primary key is ShortText.

Execution example:

> table_create --name Terms --flags TABLE_PAT_KEY|KEY_NORMALIZE --key_type ShortText --default_tokenizer TokenBigram
[[0,1317212713.39679,0.092312046],true]

The table_create command takes many parameters, but you don't need to understand all of them. Feel free to skip the next paragraph if you are not interested in how it works.

The 'TABLE_PAT_KEY' flag stores index terms in a patricia trie. The 'KEY_NORMALIZE' flag normalizes index terms. In this example, both flags are enabled by joining them with '|'. The 'default_tokenizer' parameter specifies the method for tokenizing text. This example specifies 'TokenBigram', a tokenizer of the kind generally called 'N-gram'.
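To make the idea of a lexicon with postings lists more concrete, here is a toy Python sketch. It is not groonga's implementation and rests on several assumptions: normalization is approximated by lowercasing, the tokenizer simply emits overlapping two-character tokens (the real TokenBigram may treat letters, digits and symbols differently), and a sorted dict stands in for the patricia trie. All function names and the sample documents are made up for illustration.

```python
from collections import defaultdict


def normalize(text):
    # Assumption: normalization is approximated by lowercasing only.
    return text.lower()


def bigrams(text):
    # Overlapping two-character tokens; a single character is kept as one token.
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(documents):
    """Map each index term to the sorted list of record IDs containing it."""
    postings = defaultdict(set)
    for record_id, text in documents.items():
        for term in bigrams(normalize(text)):
            postings[term].add(record_id)
    # Sorting the terms mimics an ordered lexicon.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


documents = {1: "Test", 2: "rest", 3: "TEA"}  # hypothetical sample records
index = build_index(documents)
for term, ids in index.items():
    print(term, ids)
```

Each key of index plays the role of a record key in the lexicon table, and each value plays the role of the postings list stored for that index term.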

4.1.10. Load data

A load command loads JSON-formatted records into a table.

The following adds nine records to the 'Site' table.

Execution example:

> load --table Site
> [
> {"_key":"http://example.org/","title":"This is test record 1!"},
> {"_key":"http://example.net/","title":"test record 2."},
> {"_key":"http://example.com/","title":"test test record three."},
> {"_key":"http://example.net/afr","title":"test record four."},
> {"_key":"http://example.org/aba","title":"test test test record five."},
> {"_key":"http://example.com/rab","title":"test test test test record six."},
> {"_key":"http://example.net/atv","title":"test test test record seven."},
> {"_key":"http://example.org/gat","title":"test test record eight."},
> {"_key":"http://example.com/vdw","title":"test test record nine."},
> ]
[[0,1317212714.08816,2.203527402],9]
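When records come from a program rather than from the keyboard, the input for the load command can be generated. The following Python sketch builds the command line and a JSON array with one record per line. The helper name load_command is hypothetical. Unlike the execution example above, the last record is not followed by a comma, so the array is also valid strict JSON.

```python
import json

# Two of the records from the example above.
records = [
    {"_key": "http://example.org/", "title": "This is test record 1!"},
    {"_key": "http://example.net/", "title": "test record 2."},
]


def load_command(table, rows):
    """Return the text to send to groonga: the load line followed by a JSON array."""
    encoded = [json.dumps(row, ensure_ascii=False, separators=(",", ":"))
               for row in rows]
    return "\n".join(["load --table " + table, "[", ",\n".join(encoded), "]"])


text = load_command("Site", records)
print(text)
```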

Let's make sure that these records are correctly stored.

Execution example:

> select --table Site
[[0,1317212716.49285,0.000270908],[[[9],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]],[1,"http://example.org/","This is test record 1!"],[2,"http://example.net/","test record 2."],[3,"http://example.com/","test test record three."],[4,"http://example.net/afr","test record four."],[5,"http://example.org/aba","test test test record five."],[6,"http://example.com/rab","test test test test record six."],[7,"http://example.net/atv","test test test record seven."],[8,"http://example.org/gat","test test record eight."],[9,"http://example.com/vdw","test test record nine."]]]]

4.1.11. Search data

Before a full text search, let's try to search data by '_id' and '_key'. These columns work as unique keys.

You can search records by using a 'select' command with a 'query' parameter.

Execution example:

> select --table Site --query _id:1
[[0,1317212716.69871,0.000308514],[[[1],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]],[1,"http://example.org/","This is test record 1!"]]]]

'_id:1' searches for the record whose ID is 1.

Next, let's search for a record by its primary key.

Execution example:

> select --table Site --query "_key:\"http://example.org/\""
[[0,1317212716.9005,0.000478343],[[[1],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]],[1,"http://example.org/","This is test record 1!"]]]]

'_key:"http://example.org/"' searches for the record whose primary key is "http://example.org/".

4.1.13. Specify output columns

An 'output_columns' parameter in a 'select' command specifies the columns to be shown in the search result. If you want to specify more than one column, separate the column names with commas (,).

Execution example:

> select --table Site --output_columns _key,title,_score --query title:@test
[[0,1317212717.50916,0.00060758],[[[9],[["_key","ShortText"],["title","ShortText"],["_score","Int32"]],["http://example.org/","This is test record 1!",1],["http://example.net/","test record 2.",1],["http://example.com/","test test record three.",2],["http://example.net/afr","test record four.",1],["http://example.org/aba","test test test record five.",3],["http://example.com/rab","test test test test record six.",4],["http://example.net/atv","test test test record seven.",3],["http://example.org/gat","test test record eight.",2],["http://example.com/vdw","test test record nine.",2]]]]

This command specifies three output columns including the '_score' column, which stores the relevance score of each record.
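In this particular output, the '_score' values coincide with the number of times the word "test" occurs in each title. The Python sketch below only checks that observation against the data shown above; it is not a description of how groonga computes scores in general, and the helper count_term is hypothetical.

```python
import re

# (title, _score) pairs copied from the search result above.
results = [
    ("This is test record 1!", 1),
    ("test record 2.", 1),
    ("test test record three.", 2),
    ("test record four.", 1),
    ("test test test record five.", 3),
    ("test test test test record six.", 4),
    ("test test test record seven.", 3),
    ("test test record eight.", 2),
    ("test test record nine.", 2),
]


def count_term(text, term):
    """Count how many words of the lowercased text equal the given term."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(term)


counts = [count_term(title, "test") for title, _score in results]
scores = [score for _title, score in results]
print(counts == scores)
```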

4.1.14. Specify output ranges

A 'select' command returns only a part of its search result if the 'offset' and/or 'limit' parameters are specified. These parameters are useful for pagination, a widely used interface which shows a search result one page at a time.

An 'offset' parameter specifies the starting point and a 'limit' parameter specifies the maximum number of records to be returned. If you need the first record in a search result, the offset parameter must be 0 or omitted.

Execution example:

> select --table Site --offset 0 --limit 3
[[0,1317212717.71574,0.000238544],[[[9],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]],[1,"http://example.org/","This is test record 1!"],[2,"http://example.net/","test record 2."],[3,"http://example.com/","test test record three."]]]]
> select --table Site --offset 3 --limit 3
[[0,1317212717.91925,0.00023617],[[[9],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]],[4,"http://example.net/afr","test record four."],[5,"http://example.org/aba","test test test record five."],[6,"http://example.com/rab","test test test test record six."]]]]
> select --table Site --offset 7 --limit 3
[[0,1317212718.12219,0.00019999],[[[9],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]],[8,"http://example.org/gat","test test record eight."],[9,"http://example.com/vdw","test test record nine."]]]]
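The effect of 'offset' and 'limit' corresponds to taking a slice of the full result. The following Python sketch reproduces the three requests above using record IDs only. The function paginate is hypothetical; its default limit of 10 mirrors the default number of records returned by 'select' mentioned earlier.

```python
def paginate(records, offset=0, limit=10):
    """Return at most `limit` records, starting at position `offset` (0-based)."""
    return records[offset:offset + limit]


ids = list(range(1, 10))  # IDs 1..9, standing in for the nine 'Site' records

page1 = paginate(ids, offset=0, limit=3)
page2 = paginate(ids, offset=3, limit=3)
last = paginate(ids, offset=7, limit=3)
print(page1, page2, last)
```

As in the last execution example, the final request yields only two records, because fewer than 'limit' records remain after the offset.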

4.1.15. Sort

A 'select' command sorts its result when used with a 'sortby' parameter.

A 'sortby' parameter specifies a column as the sorting criterion. The search result is arranged in ascending order of the column values. If you want to sort a search result in descending order, add a leading hyphen (-) to the column name.

Execution example:

> select --table Site --sortby -_id
[[0,1317212718.32565,0.000385755],[[[9],[["_id","UInt32"],["_key","ShortText"],["title","ShortText"]],[9,"http://example.com/vdw","test test record nine."],[8,"http://example.org/gat","test test record eight."],[7,"http://example.net/atv","test test test record seven."],[6,"http://example.com/rab","test test test test record six."],[5,"http://example.org/aba","test test test record five."],[4,"http://example.net/afr","test record four."],[3,"http://example.com/","test test record three."],[2,"http://example.net/","test record 2."],[1,"http://example.org/","This is test record 1!"]]]]

You can use the '_score' column as a sorting criterion for ranking a search result.

Execution example:

> select --table Site --query title:@test --output_columns _id,_score,title --sortby _score
[[0,1317212718.5331,0.000667311],[[[9],[["_id","UInt32"],["_score","Int32"],["title","ShortText"]],[1,1,"This is test record 1!"],[2,1,"test record 2."],[4,1,"test record four."],[3,2,"test test record three."],[9,2,"test test record nine."],[8,2,"test test record eight."],[7,3,"test test test record seven."],[5,3,"test test test record five."],[6,4,"test test test test record six."]]]]

If you want to specify more than one column, separate the column names with commas. In that case, the search result is sorted by the values of the first column, and records having the same value in the first column are then sorted by the values of the second column.

Execution example:

> select --table Site --query title:@test --output_columns _id,_score,title --sortby _score,_id
[[0,1317212718.73819,0.00069225],[[[9],[["_id","UInt32"],["_score","Int32"],["title","ShortText"]],[1,1,"This is test record 1!"],[2,1,"test record 2."],[4,1,"test record four."],[3,2,"test test record three."],[8,2,"test test record eight."],[9,2,"test test record nine."],[5,3,"test test test record five."],[7,3,"test test test record seven."],[6,4,"test test test test record six."]]]]
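The ordering rules for 'sortby' can be imitated with a stable sort. The following Python sketch is only an illustration of the described behaviour, not groonga's algorithm. The function sort_rows is hypothetical, and the rows are the '_id' and '_score' values taken from the results above.

```python
# (_id, _score) values taken from the search results above.
scores_by_id = {1: 1, 2: 1, 3: 2, 4: 1, 5: 3, 6: 4, 7: 3, 8: 2, 9: 2}
rows = [{"_id": i, "_score": s} for i, s in scores_by_id.items()]


def sort_rows(rows, sortby):
    """Sort by comma-separated column names; a leading '-' means descending."""
    keys = [key.strip() for key in sortby.split(",")]
    result = list(rows)
    # Apply the keys from last to first. Python's sort is stable,
    # so the first key ends up as the primary ordering.
    for key in reversed(keys):
        descending = key.startswith("-")
        name = key[1:] if descending else key
        result.sort(key=lambda row: row[name], reverse=descending)
    return result


by_score_then_id = [row["_id"] for row in sort_rows(rows, "_score,_id")]
by_id_descending = [row["_id"] for row in sort_rows(rows, "-_id")]
print(by_score_then_id)
print(by_id_descending)
```

With "_score,_id", records that share a score appear in ascending ID order, which matches the last execution example.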

footnote

[1] Currently, a 'match_columns' parameter is available only if an inverted index for full text search exists. A 'match_columns' parameter for a regular column is not supported.