public final class TibWordTokenizer
extends org.apache.lucene.analysis.Tokenizer
A Tibetan word tokenizer backed by a Trie.
It reads one syllable at a time and returns the longest sequence of syllables that
forms a word within the Trie.
isTibLetter(int) is used to identify the clusters of letters that form
syllables, and \u0F0B (tsheg) to separate syllables within a word.
- Unknown syllables are tokenized as separate words.
- All the punctuation is discarded from the produced tokens, including the
tsheg that usually follows "ང".
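The syllable-splitting behaviour described above can be sketched in plain Java. This is a hedged illustration, not the library's code: the class SyllableSplitSketch, the method looksLikeTibLetter, and the code-point range U+0F40..U+0FBC are all assumptions made for the example, since the actual range checked by isTibLetter(int) is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names exist in the library.
final class SyllableSplitSketch {

    // ASSUMPTION: a stand-in for isTibLetter(int). The range below covers
    // Tibetan consonants, vowel signs and subjoined letters, and excludes
    // the tsheg (U+0F0B) and the shad (U+0F0D).
    static boolean looksLikeTibLetter(int c) {
        return c >= 0x0F40 && c <= 0x0FBC;
    }

    // Groups consecutive letters into syllables; every other character
    // (tsheg, shad, spaces, ...) ends the current syllable and is dropped.
    static List<String> syllables(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            if (looksLikeTibLetter(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(c);
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Three syllables followed by a shad, which is discarded.
        System.out.println(syllables("སེམས་ཅན་གྱི།"));
    }
}
```

Iterating by code point rather than by char mirrors the int parameter of isTibLetter(int), although every character in the Tibetan block fits in a single char.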
Because it always returns the longest match, this tokenizer does not resolve
contextual ambiguities.
For example, if both དོན and དོན་གྲུབ exist in the Trie, དོན་གྲུབ will be
returned every time the sequence དོན + གྲུབ is found.
The sentence སེམས་ཅན་གྱི་དོན་གྲུབ་པར་ཤོག will be tokenized into "སེམས་ཅན +
གྱི + དོན་གྲུབ + པར + ཤོག", whereas the expected segmentation is "སེམས་ཅན + གྱི + དོན + གྲུབ་པར + ཤོག".
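The longest-match behaviour and the ambiguity above can be reproduced with a small stand-alone sketch. It is only an approximation under stated assumptions: a HashSet of words replaces the compiled Trie, the lexicon entries are invented for the example, and MaxMatchSketch, tokenize and MAX_WORD_SYLLABLES are hypothetical names that do not exist in the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: greedy longest match over tsheg-separated syllables.
final class MaxMatchSketch {
    static final String TSHEG = "\u0F0B";
    // ASSUMPTION: an arbitrary cap on word length, used only by this sketch.
    static final int MAX_WORD_SYLLABLES = 4;

    static List<String> tokenize(String text, Set<String> lexicon) {
        // Split on the tsheg and drop empty pieces.
        List<String> syllables = new ArrayList<>();
        for (String s : text.split(TSHEG)) {
            if (!s.isEmpty()) {
                syllables.add(s);
            }
        }
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < syllables.size()) {
            int maxLen = Math.min(MAX_WORD_SYLLABLES, syllables.size() - i);
            int matched = 1; // unknown syllables become one-syllable words
            // Try the longest candidate first and stop at the first hit.
            for (int len = maxLen; len >= 1; len--) {
                String candidate = String.join(TSHEG, syllables.subList(i, i + len));
                if (lexicon.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            words.add(String.join(TSHEG, syllables.subList(i, i + matched)));
            i += matched;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical lexicon containing both དོན and དོན་གྲུབ.
        Set<String> lexicon = new HashSet<>(Arrays.asList(
                "སེམས་ཅན", "གྱི", "དོན", "དོན་གྲུབ", "གྲུབ་པར", "པར", "ཤོག"));
        // Prints: སེམས་ཅན + གྱི + དོན་གྲུབ + པར + ཤོག
        System.out.println(String.join(" + ",
                tokenize("སེམས་ཅན་གྱི་དོན་གྲུབ་པར་ཤོག", lexicon)));
    }
}
```

Even though གྲུབ་པར is in the sketch's lexicon, it is never produced: by the time the matcher reaches གྲུབ, that syllable has already been consumed as part of དོན་གྲུབ. A greedy matcher has no way to backtrack, which is the limitation described above.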
Derived from Lucene 6.4.1 analysis.util.CharTokenizer
| Constructor and Description |
|---|
| TibWordTokenizer(): Constructs a TibWordTokenizer using a default lexicon file (here "resource/output/total_lexicon.txt") |
| TibWordTokenizer(String trieFile) |
| TibWordTokenizer(io.bdrc.lucene.stemmer.Trie trie): Constructs a TibWordTokenizer using a given trie |
| Modifier and Type | Method and Description |
|---|---|
| void | end() |
| boolean | incrementToken() |
| boolean | isTibLetter(int c): Finds whether the given character is a Tibetan letter or not. |
| protected int | normalize(int c): Called on each token character to normalize it before it is added to the token. |
| void | reset() |
| void | setDebug(boolean debug) |
| void | setLemmatize(boolean lemmatize) |
Methods inherited from class org.apache.lucene.util.AttributeSource:
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

public TibWordTokenizer()
throws IOException
Throws:
IOException - the file containing the lexicon cannot be read

public TibWordTokenizer(io.bdrc.lucene.stemmer.Trie trie)
Parameters:
trie - built with BuildCompiledTrie.java

public TibWordTokenizer(String trieFile) throws FileNotFoundException, IOException
Throws:
FileNotFoundException
IOException

protected int normalize(int c)
Parameters:
c - the character to normalize

public final boolean incrementToken()
throws IOException
Specified by:
incrementToken in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

public final boolean isTibLetter(int c)
Parameters:
c - a unicode code-point
Returns:
true if c in the specified range; false otherwise

public final void end()
throws IOException
Overrides:
end in class org.apache.lucene.analysis.TokenStream
Throws:
IOException

public void reset()
throws IOException
Overrides:
reset in class org.apache.lucene.analysis.Tokenizer
Throws:
IOException

public final void setLemmatize(boolean lemmatize)

public final void setDebug(boolean debug)
Copyright © 2019. All rights reserved.