public final class TibWordTokenizer
extends org.apache.lucene.analysis.Tokenizer
A word tokenizer for Tibetan backed by a Trie.
It consumes one syllable at a time and returns the longest sequence of syllables that forms a word within the Trie.
isTibLetter(int) is used to distinguish clusters of letters forming syllables, and U+0F0B (the tsheg) to distinguish syllables within a word.
- Unknown syllables are tokenized as separate words.
- All punctuation is discarded from the produced tokens, including the tsheg that usually follows "ང".
Due to its design, this tokenizer doesn't deal with contextual ambiguities.
For example, if both དོན and དོན་གྲུབ exist in the Trie, དོན་གྲུབ will be returned every time the sequence དོན + གྲུབ is found.
The sentence སེམས་ཅན་གྱི་དོན་གྲུབ་པར་ཤོག will therefore be tokenized into "སེམས་ཅན + གྱི + དོན་གྲུབ + པར + ཤོག", whereas "སེམས་ཅན + གྱི + དོན + གྲུབ་པར + ཤོག" would be expected.
Derived from Lucene 6.4.1 analysis.
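The greedy longest-match behaviour described above can be sketched in plain Java. This is an illustrative reconstruction, not the actual implementation: a java.util.Set stands in for the real Trie, the input is pre-split syllables rather than a character stream, and the lexicon below is a small hypothetical one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class MaximalMatchSketch {
    // Hypothetical lexicon; the real tokenizer loads its entries into a Trie
    // from a lexicon file.
    static final Set<String> LEXICON = new LinkedHashSet<>(Arrays.asList(
            "སེམས་ཅན", "གྱི", "དོན", "དོན་གྲུབ", "གྲུབ་པར", "པར", "ཤོག"));

    // Greedily match the longest run of syllables found in the lexicon;
    // unknown syllables fall back to single-syllable tokens.
    static List<String> tokenize(String[] syllables) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < syllables.length) {
            int bestEnd = i + 1;                     // fall back to one syllable
            String best = syllables[i];              // may be an unknown syllable
            StringBuilder candidate = new StringBuilder(syllables[i]);
            for (int j = i + 1; j <= syllables.length; j++) {
                if (LEXICON.contains(candidate.toString())) {
                    best = candidate.toString();     // longer match found
                    bestEnd = j;
                }
                if (j < syllables.length) {
                    candidate.append('\u0F0B').append(syllables[j]); // join with tsheg
                }
            }
            tokens.add(best);
            i = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] sentence = {"སེམས", "ཅན", "གྱི", "དོན", "གྲུབ", "པར", "ཤོག"};
        System.out.println(tokenize(sentence));
    }
}
```

Because the match at དོན keeps extending while a longer lexicon entry exists, དོན་གྲུབ always wins over དོན + གྲུབ་པར, reproducing the contextual-ambiguity limitation described above.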
| Constructor and Description |
|---|
| TibWordTokenizer()<br>Constructs a TibWordTokenizer using the default lexicon file (here "resource/output/total_lexicon.txt") |
| TibWordTokenizer(String filename)<br>Constructs a TibWordTokenizer using the file designated by filename |
| Modifier and Type | Method and Description |
|---|---|
| void | end() |
| boolean | incrementToken() |
| boolean | isTibLetter(int c)<br>Finds whether the given character is a Tibetan letter or not. |
| protected int | normalize(int c)<br>Called on each token character to normalize it before it is added to the token. |
| void | reset() |
| void | setDebug(boolean debug) |
| void | setLemmatize(boolean lemmatize) |
Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

public TibWordTokenizer(String filename)
                 throws FileNotFoundException,
                        IOException

Parameters:
filename - the path to the lexicon file
Throws:
FileNotFoundException - if the file containing the lexicon cannot be found
IOException - if the file containing the lexicon cannot be read

public TibWordTokenizer()
                 throws FileNotFoundException,
                        IOException

Throws:
FileNotFoundException - if the file containing the lexicon cannot be found
IOException - if the file containing the lexicon cannot be read

protected int normalize(int c)

Called on each token character to normalize it before it is added to the token.

Parameters:
c - the character to normalize

public final boolean incrementToken()
                             throws IOException

Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

public final boolean isTibLetter(int c)

Finds whether the given character is a Tibetan letter or not.

Parameters:
c - a Unicode code point
Returns:
true if c is in the specified range; false otherwise

public final void end()
               throws IOException

Overrides:
end in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

public void reset()
           throws IOException

Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
Throws:
IOException

public final void setLemmatize(boolean lemmatize)

public final void setDebug(boolean debug)
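For illustration, isTibLetter(int) can be approximated as a simple code-point range test. The exact range the tokenizer checks is not given on this page, so the bounds below are an assumption: the Tibetan consonant, vowel-sign and subjoined-letter area U+0F40..U+0FBC, which excludes Tibetan punctuation such as the tsheg (U+0F0B).

```java
public class TibLetterSketch {
    // Assumed range: Tibetan letters (consonants, vowel signs, subjoined
    // letters) in the Unicode Tibetan block. The real tokenizer's range
    // may differ; this is a sketch, not the library's implementation.
    public static boolean isTibLetter(int c) {
        return c >= 0x0F40 && c <= 0x0FBC;
    }

    public static void main(String[] args) {
        System.out.println(isTibLetter(0x0F44)); // ང (letter) → true
        System.out.println(isTibLetter(0x0F0B)); // ་ (tsheg, punctuation) → false
    }
}
```

A check like this is what lets the tokenizer treat runs of letters as syllable clusters while using the tsheg purely as a syllable separator.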
Copyright © 2018. All rights reserved.