Ferret::Analysis::RegExpTokenizer

Summary

A tokenizer that recognizes tokens using a regular expression passed to the constructor. Most tokenizers can be built with this class.

Example

Below is a simple implementation of a LetterTokenizer using a RegExpTokenizer. Here a token is a sequence of alphabetic characters, and tokens are separated by one or more non-alphabetic characters.

# of course you would add more than just é
RegExpTokenizer.new(input, /[[:alpha:]é]+/)

"Dave's résumé, at http://www.davebalmain.com/ 1234"
  => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
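The token list above can be approximated without Ferret using plain Ruby's String#scan. This is only an illustrative sketch: it returns bare strings rather than whatever token objects the real tokenizer yields, and it assumes a UTF-8 source file.

```ruby
# Plain-Ruby approximation of the example above; Ferret itself is not used.
input  = "Dave's résumé, at http://www.davebalmain.com/ 1234"
regexp = /[[:alpha:]é]+/

# String#scan returns every non-overlapping match of the pattern,
# scanning the string from left to right.
tokens = input.scan(regexp)

p tokens
```

Punctuation, digits, and whitespace never match the pattern, so they act purely as separators and do not appear in the result.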

Constants

REGEXP

Public Class Methods

new(input, /[[:alpha:]]+/)

Create a new tokenizer based on a regular expression

input

text to tokenize

regexp

regular expression used to recognize tokens in the input

static VALUE
frb_rets_init(int argc, VALUE *argv, VALUE self) 
{
    VALUE rtext, regex, proc;
    TokenStream *ts;

    rb_scan_args(argc, argv, "11&", &rtext, &regex, &proc);

    ts = rets_new(rtext, regex, proc);

    Frt_Wrap_Struct(self, &frb_rets_mark, &frb_rets_free, ts);
    object_add(ts, self);
    return self;
}
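
The "11&" format string passed to rb_scan_args means one required argument, one optional argument, and an optional block. A rough pure-Ruby equivalent of that signature is sketched below. The method name tokenize_sketch is hypothetical, and the idea that the block is applied to each token's text is an assumption made for illustration, not something stated on this page.

```ruby
# Hypothetical pure-Ruby stand-in for the constructor's argument handling.
# "11&" in rb_scan_args: 1 required argument, 1 optional argument, plus a block.
def tokenize_sketch(input, regexp = /[[:alpha:]]+/, &block)
  tokens = input.scan(regexp)
  # Assumption: when a block is given, it transforms each token's text.
  block ? tokens.map(&block) : tokens
end

p tokenize_sketch("Hello, World 42")                      # regexp omitted
p tokenize_sketch("Hello, World 42", /\d+/)               # custom regexp
p tokenize_sketch("Hello, World 42") { |t| t.downcase }   # with a block
```

The three calls correspond to the three ways the argument list can be filled: input only, input plus regexp, and input plus a block.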

Public Instance Methods

text → text

Get the text being tokenized by the tokenizer.

static VALUE
frb_rets_get_text(VALUE self)
{
    TokenStream *ts;
    GET_TS(ts, self);
    return RETS(ts)->rtext;
}

text = text → text

Set the text to be tokenized by the tokenizer. The tokenizer gets reset to tokenize the text from the beginning.

static VALUE
frb_rets_set_text(VALUE self, VALUE rtext)
{
    TokenStream *ts;
    GET_TS(ts, self);

    rb_hash_aset(object_space, ((VALUE)ts)|1, rtext);
    StringValue(rtext);
    RETS(ts)->rtext = rtext;
    RETS(ts)->curr_ind = 0;

    return rtext;
}
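
The reset behaviour of text= can be illustrated with a small pure-Ruby model. The class name SketchTokenizer and its next_token method are hypothetical; only the idea of a current index that is set back to 0 when new text is assigned comes from the C source above (curr_ind = 0).

```ruby
# Hypothetical model of a regexp tokenizer with a resettable cursor.
# Assumes the pattern never matches the empty string.
class SketchTokenizer
  attr_reader :text

  def initialize(text, regexp = /[[:alpha:]]+/)
    @regexp = regexp
    self.text = text
  end

  # Assigning new text rewinds the cursor, mirroring curr_ind = 0.
  def text=(new_text)
    @text = String(new_text)
    @curr_ind = 0
  end

  # Returns the next matching token string, or nil when the input is exhausted.
  def next_token
    match = @regexp.match(@text, @curr_ind)
    return nil unless match
    @curr_ind = match.end(0)
    match[0]
  end
end

ts = SketchTokenizer.new("one two")
p ts.next_token
p ts.next_token
p ts.next_token

# Replacing the text restarts tokenization from the beginning of the new input.
ts.text = "three four"
p ts.next_token
```

Keeping the cursor inside the tokenizer is what makes the reset necessary: without it, a new, shorter text could be scanned starting from a stale position.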
