Turkish analysis components for Lucene/Solr
Open source software usage is gaining momentum in Turkey, and the number of Turkish users on the Lucene/Solr mailing lists is increasing. This project wraps publicly available Turkish NLP tools as Lucene/Solr plugins. I created it in order to promote and support open source.

Stock Lucene/Solr ships SnowballPorterFilter(Factory) for the Turkish language. However, this stemmer performs poorly and produces odd collisions. For example, altın, alim, alın, altan, and alıntı are all reduced to the same stem. In other words, they are treated as if they were the same word even though they have different meanings. I will post some other harmful collisions here.
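To see the stock stemmer's collisions for yourself, a baseline field type along these lines could be used for comparison on Solr's analysis screen. This is only a sketch: the field type name `text_tr_snowball` is made up for illustration, and the tokenizer and the two preceding filters simply mirror the chains used for the plugins in this project.

```xml
<!-- Baseline for comparison only; the name "text_tr_snowball" is hypothetical. -->
<fieldType name="text_tr_snowball" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <!-- Stock Snowball stemmer for Turkish that ships with Lucene/Solr -->
    <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
  </analyzer>
</fieldType>
```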
Currently we have five TokenFilters. Detailed documentation is on the way.
TRMorphStemFilter(Factory)
Turkish Stemmer based on TRmorph
This one is not production ready yet. It requires an operating-system-specific foma executable. I couldn't find an elegant way to convert foma to Java, so I am using the "executing shell commands in Java to call flookup" workaround advised in the [FAQ](http://code.google.com/p/foma/wiki/FAQ). If you know something better, please let me know.
```xml
<fieldType name="text_tr_morph" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.TRMorphStemFilterFactory" lookup="/Applications/foma/flookup" fst="/Volumes/datadisk/Desktop/TRmorph-master/stem.fst"/>
  </analyzer>
</fieldType>
```
Zemberek2StemFilter(Factory)
Turkish Stemmer based on Zemberek2
You need three jars inside the solr/collection1/lib directory:
- zemberek-cekirdek-2.1.3.jar
- zemberek-tr-2.1.3.jar
- TurkishAnalysis-4.8.0.jar
```xml
<fieldType name="text_tr_zemberek2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek2StemFilterFactory" strategy="minMorpheme"/>
  </analyzer>
</fieldType>
```
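As an alternative to copying the jars into solr/collection1/lib, Solr can also load jars through a `lib` directive in solrconfig.xml. A minimal sketch, assuming the jars sit in a hypothetical directory named `turkish-libs` relative to the core's instance directory:

```xml
<!-- Goes inside <config> in solrconfig.xml. The directory name is hypothetical. -->
<lib dir="./turkish-libs" regex=".*\.jar"/>
```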
Zemberek2DeasciifyFilter(Factory)
Turkish Deasciifier based on Zemberek2
You need three jars inside the solr/collection1/lib directory:
- zemberek-cekirdek-2.1.3.jar
- zemberek-tr-2.1.3.jar
- TurkishAnalysis-4.8.0.jar
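A field type for this filter might look roughly like the sketch below. The field type name `text_tr_zemberek2_deascii` is made up for illustration, and the fully qualified factory class name is an assumption based on the naming pattern of the other filters in this project; no filter attributes are shown because none are documented here.

```xml
<!-- Sketch only; the field type name is hypothetical. -->
<fieldType name="text_tr_zemberek2_deascii" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <!-- Class name assumed from the package and naming used by the other factories -->
    <filter class="org.apache.lucene.analysis.tr.Zemberek2DeasciifyFilterFactory"/>
  </analyzer>
</fieldType>
```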
Zemberek3StemFilter(Factory)
Turkish Stemmer based on Zemberek3
Download the tr folder, which contains the dictionary files, and put it under solr/collection1/conf. You need three jars inside the solr/collection1/lib directory:
- zemberek-morphology-0.9.1.jar
- zemberek-core-0.9.1.jar
- TurkishAnalysis-4.8.0.jar

Please note that the zemberek-* jars need to be generated from my fork. Here is the difference over the original repository.
```xml
<fieldType name="text_tr_zemberek3" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" cache="tr/top-20K-words.txt" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
  </analyzer>
</fieldType>
```
TurkishDeasciifyFilter(Factory)
Translation of Emacs Turkish mode from Lisp to Java. This filter is intended to be used at query time to allow diacritics-insensitive search for Turkish.
```xml
<fieldType name="text_tr_deascii" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" cache="tr/top-20K-words.txt" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.TurkishDeasciifyFilterFactory" preserveOriginal="true"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" cache="tr/top-20K-words.txt" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
  </analyzer>
</fieldType>
```
I will post benchmark results of different field types (different stemmers) designed for different use-cases.
## Dependencies
- JRE 1.7 or above
- Apache Maven 3.0.3 or above
- Apache Lucene (Solr) 4.8.0
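
For a Maven build, the Lucene dependency listed above might be declared roughly as follows. This is a sketch, not a copy of this project's pom.xml: picking the `lucene-analyzers-common` artifact is an assumption, made because that module provides the stock Turkish lowercase and apostrophe filters used in the field types.

```xml
<!-- Assumed artifact choice; check the project's own pom.xml for the actual dependency set. -->
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>4.8.0</version>
</dependency>
```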
## Author
Please feel free to contact Ahmet Arslan at iorixxx at yahoo dot com
if you have any questions, comments or contributions.