taptap / pinyin-plus

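A project for converting Simplified and Traditional Chinese characters to pinyin that resolves polyphonic characters (characters with more than one reading). A pinyin tokenization tool for Elasticsearch and Solr.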

Home Page: https://gitee.com/kailing/pinyin-plus

License: Apache License 2.0

Java 100.00%
pinyin pin-yin pinyin-analysis pinyin4j elasticsearch elasticsearch-pinyin pinyin-data

pinyin-plus

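A library for converting Chinese characters to pinyin, with the following features:

  • Pinyin data is based on the open-source dictionaries cc-cedict and kaifangcidian
  • The segmentation engine is initialized from the pinyin dictionary data, so segmentation is highly accurate and polyphonic characters are resolved
  • Supports Traditional Chinese characters
  • Supports custom dictionaries, in the same format as the cc-cedict dictionary
  • Simple API with two modes: normal mode and index mode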
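
To illustrate why dictionary-driven segmentation helps with polyphonic characters, here is a minimal sketch. It is not pinyin-plus's actual implementation: the class name, the toy dictionary, and the choice of forward maximum matching are all assumptions made for illustration. It only shows the general idea that a word-level entry (银行, "yin hang") can fix a reading that a per-character lookup (行, "xing") would get wrong.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the pinyin-plus implementation. */
public class SegmentThenPinyinSketch {

    // Hypothetical toy dictionary (word -> toneless pinyin); not the library's data.
    static final Map<String, String> DICT = new HashMap<>();
    static final int MAX_WORD_LEN = 4;

    static {
        DICT.put("银", "yin");
        DICT.put("行", "xing");        // default single-character reading
        DICT.put("长", "chang");       // default single-character reading
        DICT.put("银行", "yin hang");   // word-level entries fix the reading by context
        DICT.put("行长", "hang zhang");
    }

    /** Forward maximum matching: at each position take the longest dictionary word. */
    public static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + MAX_WORD_LEN);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !DICT.containsKey(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    /** Looks up each segmented word; text missing from the dictionary passes through unchanged. */
    public static String toPinyin(String text) {
        List<String> parts = new ArrayList<>();
        for (String word : segment(text)) {
            parts.add(DICT.getOrDefault(word, word));
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(toPinyin("银行行长")); // prints "yin hang hang zhang"
    }
}
```

A character-by-character lookup against the same toy dictionary would produce "yin xing xing chang", which is why the segmentation dictionary and the pinyin dictionary benefit from being the same data.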

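Use cases

Converting Chinese characters to pinyin is commonly used to build pinyin indexes for search engines. There are generally two ways to do this. One is to use a tokenizer plugin with built-in pinyin support, which creates the pinyin index for you automatically. The other is to convert the Chinese text to a pinyin string yourself and tokenize it on spaces, which lets you customize the index. Either way, you need both word segmentation and pinyin conversion. pinyin-plus uses one and the same dictionary for index segmentation and for pinyin lookup, so its accuracy on polyphonic words is very high. The dictionary keeps the format of the open-source dictionaries, so it can easily be updated on a schedule. An extension point for custom dictionaries is also provided, so that customized entries take high priority.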

Performance

# pinyin-plus load-test data; test phrase: 率土之滨
kl@kldeMacBook-Pro-6 arthas % wrk -t16 -c100 -d15s --latency http://localhost:8080/%E7%8E%87%E5%9C%9F%E4%B9%8B%E6%BB%A8
Running 15s test @ http://localhost:8080/%E7%8E%87%E5%9C%9F%E4%B9%8B%E6%BB%A8
  16 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   733.97us  138.45us  16.40ms   96.12%
    Req/Sec     8.19k   293.50     8.90k    87.83%
  Latency Distribution
     50%  718.00us
     75%  739.00us
     90%  785.00us
     99%    1.02ms
  1970023 requests in 15.10s, 266.78MB read
Requests/sec: 130469.56
Transfer/sec:     17.67MB

Adding the dependency

Gradle

implementation "com.github.taptap:pinyin-plus:1.0"

Maven

        <dependency>
            <groupId>com.github.taptap</groupId>
            <artifactId>pinyin-plus</artifactId>
            <version>1.0</version>
        </dependency>

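Usage

    import org.junit.jupiter.api.Assertions;
    import org.junit.jupiter.api.Test;
    // Also import PinyinPlus from the pinyin-plus library (its package is not shown here).

    class PinyinPlusTest {

        // Normal mode: after conversion, individual characters are separated by spaces
        @Test
        void testToPinYin() {
            String pinyin = PinyinPlus.to("率土之滨");
            System.err.println(pinyin);
            Assertions.assertEquals("shuai tu zhi bin", pinyin);
        }

        // Index mode: after conversion, words are separated by spaces
        @Test
        void testToPinYin2() {
            String pinyin = PinyinPlus.toIndex("写的射雕英雄传");
            System.err.println(pinyin);
            Assertions.assertEquals("xie de shediaoyingxiongzhuan", pinyin);
        }
    }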
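
The difference between the two modes can be sketched without the library. The snippet below is an assumption-based illustration, not pinyin-plus code: the class and method names are hypothetical, and the input is already segmented by hand into 写 / 的 / 射雕英雄传. It only shows how the same syllables can be joined per character (normal mode) or per word (index mode).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only; names are hypothetical and not part of pinyin-plus. */
public class OutputModeSketch {

    /** Normal-mode style: every syllable is separated by a space. */
    public static String joinPlain(List<List<String>> words) {
        return words.stream()
                .flatMap(List::stream)
                .collect(Collectors.joining(" "));
    }

    /** Index-mode style: syllables of one word are concatenated, words are separated by a space. */
    public static String joinIndex(List<List<String>> words) {
        return words.stream()
                .map(syllables -> String.join("", syllables))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hand-segmented syllables for "写的射雕英雄传" (segmentation assumed, not computed).
        List<List<String>> words = Arrays.asList(
                Arrays.asList("xie"),
                Arrays.asList("de"),
                Arrays.asList("she", "diao", "ying", "xiong", "zhuan"));
        System.out.println(joinPlain(words)); // prints "xie de she diao ying xiong zhuan"
        System.out.println(joinIndex(words)); // prints "xie de shediaoyingxiongzhuan"
    }
}
```

The per-word form suits whitespace tokenizers, because each dictionary word becomes exactly one index term.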
    

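Custom dictionary

In the project's resources directory, add a text file named custom_cedict_ts.u8 and enter data in the following format. Lines starting with # are comments, for example:

# Custom dictionary
血花 血花 [xue4 hua1] //

The format follows the same style as the open-source cc-cedict dictionary. When a custom entry duplicates a word in the built-in dictionary, the custom entry has the highest priority and overrides the system default.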
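
As a rough illustration of the entry format and the override rule, the sketch below parses cc-cedict style lines (traditional form, simplified form, bracketed pinyin) and loads the custom lines last so that they win. It is not the library's loader: the class name, the regular expression, and the built-in line used in `main` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the pinyin-plus dictionary loader. */
public class CustomDictSketch {

    // Matches: <traditional> <simplified> [<pinyin with tone digits>]
    private static final Pattern ENTRY =
            Pattern.compile("^(\\S+)\\s+(\\S+)\\s+\\[([^\\]]+)\\]");

    /** Adds every parsable entry to target; later entries overwrite earlier ones. */
    public static void load(List<String> lines, Map<String, String> target) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            Matcher m = ENTRY.matcher(trimmed);
            if (m.find()) {
                target.put(m.group(1), m.group(3)); // traditional form
                target.put(m.group(2), m.group(3)); // simplified form
            }
        }
    }

    /** Loads built-in entries first and custom entries last, so custom entries take priority. */
    public static Map<String, String> merge(List<String> builtIn, List<String> custom) {
        Map<String, String> dict = new HashMap<>();
        load(builtIn, dict);
        load(custom, dict);
        return dict;
    }

    public static void main(String[] args) {
        // The built-in line is a made-up placeholder, not real dictionary data.
        List<String> builtIn = Arrays.asList("血花 血花 [xie3 hua1] /placeholder/");
        List<String> custom = Arrays.asList("# Custom dictionary", "血花 血花 [xue4 hua1] //");
        System.out.println(merge(builtIn, custom).get("血花")); // prints "xue4 hua1"
    }
}
```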

Acknowledgements

Contributors

klboke
