GithubHelp home page GithubHelp logo

defuse1999 / elasticsearch-thulac-plugin Goto Github PK

View Code? Open in Web Editor NEW

This project forked from microbun/elasticsearch-thulac-plugin

0.0 0.0 0.0 189 KB

thulac analysis plugin for elasticsearch

Java 100.00%

elasticsearch-thulac-plugin's Introduction

THULAC Analysis for Elasticsearch

采用THULAC实现的Elasticsearch中文分词插件。

版本

Plugin 版本 ES 版本 THULAC 版本 Link
master 6.x -> master lite
6.4.1-181027 6.4.1 lite 下载
6.4.0-181027 6.4.0 lite 下载
6.3.0-181027 6.3.0 lite 下载
6.2.0-181027 6.2.0 lite 下载
6.1.0-181027 6.1.0 lite 下载

下载安装

直接下载已经打包好的插件,解压到elasticsearch的plugins目录下即可。

编译安装

1.编译打包

git clone [email protected]:microbun/elasticsearch-thulac-plugin.git
cd elasticsearch-thulac-plugin
./gradlew release

2.安装到elasticsearch

cp build/distributions/elasticsearch-thulac-plugin-6.1.0.zip ${ES_HOME}/plugins
cd ${ES_HOME}/plugins
unzip elasticsearch-thulac-plugin-6.1.0.zip
rm elasticsearch-thulac-plugin-6.1.0.zip

解压后在plugins目录下会有一个thulac文件夹。

thulac
 |-elasticsearch-thulac-plugin-6.1.0.jar
 |-models #算法模型�目录
 |-plugin-descriptor.properties
 |-plugin.xml

3.由于THULAC的模型太大,插件中没有包含模型数据,可以在THULAC 下载模型(lite),将模型拷贝到models中。

示例

1.创建索引

1.1 使用默认分词方式

curl -H "Content-Type:application/json" -XPUT http://localhost:9200/index -d'
{
  "settings": {
  },
  "mapping": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "thulac"
      }
    }
  }
}
'

1.2 自定义分词器

curl -H "Content-Type:application/json" -XPUT http://localhost:9200/index -d'
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "custom_thulac_tokenizer": {
          "type": "thulac",
          "user_dict": "userdict.txt",
          "t2s": true,
          "filter": false
        }
      },
      "analyzer": {
        "custom_thulac_analyzer": {
          "tokenizer": "custom_thulac_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mapping": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "custom_thulac_tokenizer"
      }
    }
  }
}
参数名称 含义
t2s 将句子从繁体转化为简体。默认:true false/true
filter 使用过滤器去除一些没有意义的词语,例如“可以”。默认:false false/true
user_dict 自定义词典路径,每一个词一行,UTF8编码,相对路径和绝对路径.
相对路径:userdict.txt 会加载 ${ES_HOME}/plugins/module/userdict.txt文件
绝对路径:/home/elasticsearch/userdict.txt
默认:userdict.txt

2.查看索引

curl http://localhost:9200/index

3.测试分词效果

curl -H "Content-Type:application/json"  -XPOST http://localhost:9200/index/_analyze -d'
{
 "analyzer":"thulac", 
 "text":"我是**人"
}
'

4.删除索引

curl -XDELETE http://localhost:9200/index

elasticsearch-thulac-plugin's People

Contributors

microbun avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.