GithubHelp home page GithubHelp logo

mikailceren / lucene-solr-analysis-turkish Goto Github PK

View Code? Open in Web Editor NEW

This project forked from iorixxx/lucene-solr-analysis-turkish

1.0 0.0 0.0 1.53 MB

Turkish analysis components for Lucene/Solr

License: Apache License 2.0

Shell 0.72% Java 99.28%

lucene-solr-analysis-turkish's Introduction

lucene-solr-analysis-turkish

Turkish analysis components for Lucene/Solr

Open Source Software usage gaining momentum in Turkey. Turkish users on lucene/solr mailing lists are increasing. This project makes use of publicly available Turkish nlp tools to create lucene/solr plugins from them. I created this project in order to promote and support open source. Stock lucene/solr has SnowballPorterFilter(Factory) for Turkish language. However, this stemmer performs poorly and has funny collisions. For example; altın, alim, alın, altan, and alıntı are all reduced to a same stem. In other words, they are treated as if they were the same word even though they have different meanings. I will post some other harmful collisions here.

Currently we have five TokenFilters. Detailed documentation is on the way.

TRMorphStemFilter(Factory) Turkish Stemmer based on TRmorph This one is not production ready yet. It requires Operating System specific foma executable. I couldn't find an elegant way to convert foma to java. I am using "executing shell commands in Java to call flookup" workaround advised in [FAQ] (http://code.google.com/p/foma/wiki/FAQ). If you know something better please let me know.

<fieldType name="text_tr_morph" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.TRMorphStemFilterFactory" lookup="/Applications/foma/flookup" fst="/Volumes/datadisk/Desktop/TRmorph-master/stem.fst" />
  </analyzer>
</fieldType>

Zemberek2StemFilter(Factory) Turkish Stemmer based on Zemberek2 You need two jars : zemberek-cekirdek-2.1.3.jar zemberek-tr-2.1.3.jar TurkishAnalysis-4.8.0.jar inside solr/collection1/lib directory.

<fieldType name="text_tr_zemberek2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek2StemFilterFactory" strategy="minMorpheme"/>
  </analyzer>
</fieldType>

Zemberek2DeasciifyFilter(Factory) Turkish Deasciifier based on Zemberek2 You need two jars : zemberek-cekirdek-2.1.3.jar zemberek-tr-2.1.3.jar TurkishAnalysis-4.8.0.jar inside solr/collection1/lib directory.

Zemberek3StemFilter(Factory) Turkish Stemmer based on Zemberek3 Download tr folder which contains dictionary files, and put it under solr/collection1/conf. You need three jars : zemberek-morphology-0.9.1.jar zemberek-core-0.9.1.jar TurkishAnalysis-4.8.0.jar inside solr/collection1/lib directory. Please note that zemberek-* jars need to generated from my fork. Here is the difference over original repository.

<fieldType name="text_tr_zemberek3" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ApostropheFilterFactory"/>
    <filter class="solr.TurkishLowerCaseFilterFactory"/>
    <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" cache="tr/top-20K-words.txt" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
  </analyzer>
</fieldType>

TurkishDeasciifyFilter(Factory) Translation of Emacs Turkish mode from Lisp to Java. This filter intended to be used at query time to allow diacritics-insensitive search for Turkish.

<fieldType name="text_tr_deascii" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" cache="tr/top-20K-words.txt" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.ApostropheFilterFactory"/>
     <filter class="solr.TurkishLowerCaseFilterFactory"/>
     <filter class="org.apache.lucene.analysis.tr.TurkishDeasciifyFilterFactory" preserveOriginal="true"/>
     <filter class="org.apache.lucene.analysis.tr.Zemberek3StemFilterFactory" strategy="maxLength" cache="tr/top-20K-words.txt" dictionary="tr/master-dictionary.dict,tr/secondary-dictionary.dict,tr/non-tdk.dict,tr/proper.dict"/>
   </analyzer>
 </fieldType>

I will post benchmark results of different field types (different stemmers) designed for different use-cases.

##Dependencies

  • JRE 1.7 or above
  • Apache Maven 3.0.3 or above
  • Apache Lucene (Solr) 4.8.0

##Author Please feel free to contact Ahmet Arslan at iorixxx at yahoo dot com if you have any questions, comments or contributions.

lucene-solr-analysis-turkish's People

Contributors

iorixxx avatar

Stargazers

MİKAİL avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.