org.apache.solr.analysis
Class HyphenatedWordsFilter
java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.TokenFilter
org.apache.solr.analysis.HyphenatedWordsFilter
- All Implemented Interfaces:
- Closeable
public final class HyphenatedWordsFilter
- extends org.apache.lucene.analysis.TokenFilter
When plain text is extracted from documents, many words are often hyphenated and split across two lines.
This is common in documents that use narrow text columns, such as newsletters.
So that such words can be matched at search time, this filter joins hyphenated words that were split across two lines back together.
This filter should be used at index time only.
Example field definition in schema.xml:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"/>
    <filter class="solr.HyphenatedWordsFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
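The joining behavior described above can be illustrated with a small, self-contained sketch. It is not the filter's source code and does not use the Lucene TokenStream API: the class name HyphenJoinSketch and the method join are hypothetical, and the input is assumed to be a list of whitespace-separated tokens. Keeping the hyphen on a fragment left dangling at the end of the input is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Solr or Lucene.
class HyphenJoinSketch {

    // Joins every token that ends in '-' with the token(s) that follow it.
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean inHyphenated = false;

        for (String token : tokens) {
            if (!token.isEmpty() && token.charAt(token.length() - 1) == '-') {
                // Leading part of a broken word: drop the hyphen and keep buffering.
                pending.append(token, 0, token.length() - 1);
                inHyphenated = true;
            } else if (inHyphenated) {
                // Final part of a broken word: emit the joined word.
                pending.append(token);
                out.add(pending.toString());
                pending.setLength(0);
                inHyphenated = false;
            } else {
                // Ordinary token: pass it through unchanged.
                out.add(token);
            }
        }

        if (inHyphenated) {
            // Assumption: a fragment left dangling at the end keeps its hyphen.
            out.add(pending.append('-').toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "ecologi-", "cal", "develop-", "ment", "is", "im-", "portant");
        // Prints [ecological, development, is, important]
        System.out.println(join(tokens));
    }
}
```

Because the joining happens before later filters see the tokens, the sketch also suggests why the example schema places HyphenatedWordsFilterFactory ahead of WordDelimiterFilterFactory in the index analyzer: the word delimiter filter would otherwise treat the trailing hyphen as a delimiter.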
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
Constructor Summary
HyphenatedWordsFilter(org.apache.lucene.analysis.TokenStream in)
Creates a new HyphenatedWordsFilter
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close, end
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
HyphenatedWordsFilter
public HyphenatedWordsFilter(org.apache.lucene.analysis.TokenStream in)
- Creates a new HyphenatedWordsFilter
- Parameters:
in - TokenStream that will be filtered
incrementToken
public boolean incrementToken()
throws IOException
- Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
- Throws:
IOException
reset
public void reset()
throws IOException
- Overrides:
reset in class org.apache.lucene.analysis.TokenFilter
- Throws:
IOException
Copyright © 2000-2011 Apache Software Foundation. All Rights Reserved.