org.apache.solr.analysis
Class CommonGramsFilter

java.lang.Object
  extended by org.apache.lucene.util.AttributeSource
      extended by org.apache.lucene.analysis.TokenStream
          extended by org.apache.lucene.analysis.TokenFilter
              extended by org.apache.solr.analysis.CommonGramsFilter
All Implemented Interfaces:
Closeable

public final class CommonGramsFilter
extends org.apache.lucene.analysis.TokenFilter

Constructs bigrams for frequently occurring terms while indexing. Single terms are still indexed, with the bigrams overlaid on them. The overlay is achieved through PositionIncrementAttribute.setPositionIncrement(int). Bigrams have a type of GRAM_TYPE.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
 
Field Summary
 
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
 
Constructor Summary
CommonGramsFilter(org.apache.lucene.analysis.TokenStream input, Set<?> commonWords)
          Deprecated. Use CommonGramsFilter(Version, TokenStream, Set) instead
CommonGramsFilter(org.apache.lucene.analysis.TokenStream input, Set<?> commonWords, boolean ignoreCase)
          Deprecated. Use CommonGramsFilter(Version, TokenStream, Set, boolean) instead
CommonGramsFilter(org.apache.lucene.analysis.TokenStream input, String[] commonWords)
          Deprecated. Use CommonGramsFilter(Version, TokenStream, Set) instead.
CommonGramsFilter(org.apache.lucene.analysis.TokenStream input, String[] commonWords, boolean ignoreCase)
          Deprecated. Use CommonGramsFilter(Version, TokenStream, Set, boolean) instead.
CommonGramsFilter(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.TokenStream input, Set<?> commonWords)
          Construct a token stream filtering the given input using a Set of common words to create bigrams.
CommonGramsFilter(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.TokenStream input, Set<?> commonWords, boolean ignoreCase)
          Construct a token stream filtering the given input using a Set of common words to create bigrams, case-sensitive if ignoreCase is false (unless Set is CharArraySet).
 
Method Summary
 boolean incrementToken()
          Inserts bigrams for common words into a token stream.
static org.apache.lucene.analysis.CharArraySet makeCommonSet(String[] commonWords)
          Deprecated. Construct a CharArraySet directly instead.
static org.apache.lucene.analysis.CharArraySet makeCommonSet(String[] commonWords, boolean ignoreCase)
          Deprecated. Construct a CharArraySet directly instead.
 void reset()
          
 
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
 
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

CommonGramsFilter

@Deprecated
public CommonGramsFilter(org.apache.lucene.analysis.TokenStream input,
                                    Set<?> commonWords)
Deprecated. Use CommonGramsFilter(Version, TokenStream, Set) instead


CommonGramsFilter

@Deprecated
public CommonGramsFilter(org.apache.lucene.analysis.TokenStream input,
                                    Set<?> commonWords,
                                    boolean ignoreCase)
Deprecated. Use CommonGramsFilter(Version, TokenStream, Set, boolean) instead


CommonGramsFilter

public CommonGramsFilter(org.apache.lucene.util.Version matchVersion,
                         org.apache.lucene.analysis.TokenStream input,
                         Set<?> commonWords)
Construct a token stream filtering the given input using a Set of common words to create bigrams. Outputs both unigrams, with their position increment, and bigrams, with position increment 0 and type=gram, wherever one or both of the words in a potential bigram are in the set of common words.

Parameters:
input - TokenStream input in filter chain
commonWords - The set of common words.

CommonGramsFilter

public CommonGramsFilter(org.apache.lucene.util.Version matchVersion,
                         org.apache.lucene.analysis.TokenStream input,
                         Set<?> commonWords,
                         boolean ignoreCase)
Construct a token stream filtering the given input using a Set of common words to create bigrams. Matching is case-sensitive if ignoreCase is false, unless the Set is a CharArraySet. If commonWords is an instance of CharArraySet (which is the case if makeCommonSet() was used to construct the set), it is used directly and ignoreCase is ignored, since a CharArraySet controls its own case sensitivity.

If commonWords is not an instance of CharArraySet, a new CharArraySet will be constructed and ignoreCase will be used to specify the case sensitivity of that set.

Parameters:
input - TokenStream input in filter chain.
commonWords - The set of common words.
ignoreCase - Ignore case when constructing bigrams for common words.
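
The rule described for this constructor (a CharArraySet is used as-is and ignoreCase is disregarded; any other Set is copied into a new case-aware set) can be illustrated with a minimal plain-Java sketch. It does not use Lucene: CaseAwareSet is a hypothetical stand-in for CharArraySet, and resolve is a hypothetical helper that only models the documented decision, not the filter's actual code.

```java
import java.util.AbstractSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Locale;
import java.util.Set;

final class CommonWordsCaseSketch {

    /** Hypothetical stand-in for CharArraySet: case handling is fixed at construction. */
    static final class CaseAwareSet extends AbstractSet<String> {
        private final Set<String> words = new HashSet<>();
        private final boolean ignoreCase;

        CaseAwareSet(Iterable<?> source, boolean ignoreCase) {
            this.ignoreCase = ignoreCase;
            for (Object w : source) {
                words.add(normalize(w.toString()));
            }
        }

        private String normalize(String s) {
            return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
        }

        @Override public boolean contains(Object o) {
            return o != null && words.contains(normalize(o.toString()));
        }

        @Override public Iterator<String> iterator() { return words.iterator(); }

        @Override public int size() { return words.size(); }
    }

    /** Models the documented rule for choosing the effective common-word set. */
    static CaseAwareSet resolve(Set<?> commonWords, boolean ignoreCase) {
        if (commonWords instanceof CaseAwareSet) {
            // Already case-aware: used directly, the ignoreCase argument has no effect.
            return (CaseAwareSet) commonWords;
        }
        // Any other Set: copied into a new set that honors ignoreCase.
        return new CaseAwareSet(commonWords, ignoreCase);
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<>();
        plain.add("the");
        // Plain Set: ignoreCase=true is honored.
        System.out.println(resolve(plain, true).contains("The"));  // prints true

        // Case-sensitive stand-in set: ignoreCase=true is ignored.
        CaseAwareSet strict = new CaseAwareSet(plain, false);
        System.out.println(resolve(strict, true).contains("The")); // prints false
    }
}
```

The practical consequence, under these assumptions, is that callers who pre-build a CharArraySet should set its case sensitivity when building it rather than relying on the ignoreCase argument.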

CommonGramsFilter

@Deprecated
public CommonGramsFilter(org.apache.lucene.analysis.TokenStream input,
                                    String[] commonWords)
Deprecated. Use CommonGramsFilter(Version, TokenStream, Set) instead.

Construct a token stream filtering the given input using an Array of common words to create bigrams.

Parameters:
input - TokenStream input in filter chain
commonWords - words to be used in constructing bigrams

CommonGramsFilter

@Deprecated
public CommonGramsFilter(org.apache.lucene.analysis.TokenStream input,
                                    String[] commonWords,
                                    boolean ignoreCase)
Deprecated. Use CommonGramsFilter(Version, TokenStream, Set, boolean) instead.

Construct a token stream filtering the given input using an Array of common words to create bigrams. Matching is case-sensitive if ignoreCase is false.

Parameters:
input - TokenStream input in filter chain
commonWords - words to be used in constructing bigrams
ignoreCase - Ignore case when constructing bigrams for common words.

Method Detail

makeCommonSet

@Deprecated
public static org.apache.lucene.analysis.CharArraySet makeCommonSet(String[] commonWords)
Deprecated. Construct a CharArraySet directly instead.

Build a CharArraySet from an array of common words, appropriate for passing into the CommonGramsFilter constructor. This allows the commonWords set to be built once and cached when an Analyzer is constructed.

Parameters:
commonWords - Array of common words which will be converted into the CharArraySet
Returns:
CharArraySet of the given words, appropriate for passing into the CommonGramsFilter constructor
See Also:
makeCommonSet(String[], boolean), passing false to ignoreCase

makeCommonSet

@Deprecated
public static org.apache.lucene.analysis.CharArraySet makeCommonSet(String[] commonWords,
                                                                               boolean ignoreCase)
Deprecated. Construct a CharArraySet directly instead.

Build a CharArraySet from an array of common words, appropriate for passing into the CommonGramsFilter constructor. The set is case-sensitive if ignoreCase is false.

Parameters:
commonWords - Array of common words which will be converted into the CharArraySet
ignoreCase - If true, all words are lowercased first.
Returns:
a Set containing the words

incrementToken

public boolean incrementToken()
                       throws IOException
Inserts bigrams for common words into a token stream. For each input token, output the token. If the token and/or the following token is in the list of common words, also output a bigram with position increment 0 and type="gram".

TODO: Consider adding an option to not emit unigram stopwords, as in CDL XTF BigramStopFilter. CommonGramsQueryFilter would need to be changed to work with this.

TODO: Consider optimizing for the case of three commongrams, i.e. "man of the year" normally produces 3 bigrams: "man-of", "of-the", "the-year", but with proper management of positions we could eliminate the middle bigram "of-the" and save a disk seek and a whole set of position lookups.

Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Throws:
IOException
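
The emission rule described for incrementToken can be sketched in plain Java, without Lucene, to make the token order and position increments concrete. This is an illustration under assumptions, not the filter's implementation: the Tok class and commonGrams method are hypothetical, every input token is assumed to arrive with position increment 1, the unigram type name "word" is an assumption, and the "-" separator simply follows the notation used in the examples on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CommonGramsSketch {

    /** One emitted token: its text, position increment, and type (hypothetical model). */
    static final class Tok {
        final String term;
        final int posInc;
        final String type;

        Tok(String term, int posInc, String type) {
            this.term = term;
            this.posInc = posInc;
            this.type = type;
        }

        @Override public String toString() {
            return term + "(" + type + ",+" + posInc + ")";
        }
    }

    /**
     * Emits each unigram, followed by a bigram with position increment 0
     * whenever the token and/or its successor is a common word.
     */
    static List<Tok> commonGrams(List<String> input, Set<String> common) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < input.size(); i++) {
            String cur = input.get(i);
            out.add(new Tok(cur, 1, "word")); // the unigram is always emitted
            if (i + 1 < input.size()) {
                String next = input.get(i + 1);
                if (common.contains(cur) || common.contains(next)) {
                    // Increment 0 overlays the bigram on the unigram just emitted.
                    out.add(new Tok(cur + "-" + next, 0, "gram"));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("of", "the"));
        System.out.println(commonGrams(Arrays.asList("man", "of", "the", "year"), common));
    }
}
```

For the input "man of the year" with "of" and "the" as common words, this sketch emits four unigrams and the three bigrams named in the TODO above, each bigram sharing a position with the unigram that starts it.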

reset

public void reset()
           throws IOException

Overrides:
reset in class org.apache.lucene.analysis.TokenFilter
Throws:
IOException


Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.