
kuromoji.js's Introduction

kuromoji.js


A JavaScript implementation of a Japanese morphological analyzer. This is a pure JavaScript port of Kuromoji.

You can see how kuromoji.js works on the demo site.

Directory

The directory tree is as follows:

build/
  kuromoji.js -- JavaScript file for browser (Browserified)
demo/         -- Demo
dict/         -- Dictionaries for tokenizer (gzipped)
example/      -- Examples for use in Node.js
src/          -- JavaScript source
test/         -- Unit tests

Usage

You can tokenize sentences with only 5 lines of code. If you need working examples, you can see the files under the demo or example directory.

Node.js

Install with the npm package manager:

npm install kuromoji

Load this library as follows:

var kuromoji = require("kuromoji");

You can prepare a tokenizer like this:

kuromoji.builder({ dicPath: "path/to/dictionary/dir/" }).build(function (err, tokenizer) {
    // tokenizer is ready
    var path = tokenizer.tokenize("すもももももももものうち");
    console.log(path);
});

Browser

You only need the build/kuromoji.js and dict/*.dat.gz files.

Install with the Bower package manager:

bower install kuromoji

Or you can use the kuromoji.js file and dictionary files from the GitHub repository.

In your HTML:

<script src="url/to/kuromoji.js"></script>

In your JavaScript:

kuromoji.builder({ dicPath: "/url/to/dictionary/dir/" }).build(function (err, tokenizer) {
    // tokenizer is ready
    var path = tokenizer.tokenize("すもももももももものうち");
    console.log(path);
});

API

The function tokenize() returns an array of token objects like this:

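[ {
    word_id: 509800,          // Word ID in the dictionary
    word_type: 'KNOWN',       // Word type (KNOWN if the word is in the dictionary, UNKNOWN otherwise)
    word_position: 1,         // Start position of the word
    surface_form: '黒文字',    // Surface form
    pos: '名詞',               // Part of speech
    pos_detail_1: '一般',      // Part-of-speech subcategory 1
    pos_detail_2: '*',        // Part-of-speech subcategory 2
    pos_detail_3: '*',        // Part-of-speech subcategory 3
    conjugated_type: '*',     // Conjugation type
    conjugated_form: '*',     // Conjugation form
    basic_form: '黒文字',      // Base form
    reading: 'クロモジ',       // Reading
    pronunciation: 'クロモジ'  // Pronunciation
  } ]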

(This is defined in src/util/IpadicFormatter.js)

See the JSDoc page for details.
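As a minimal sketch of how the token array can be consumed, the snippet below picks out surface forms by part of speech. The helper name surfacesByPos is hypothetical (it is not part of kuromoji.js), and the sample token is copied from the example above rather than produced by a real tokenizer run.

```javascript
// Sample token in the documented shape (values copied from the README example).
var sampleTokens = [{
    word_id: 509800, word_type: "KNOWN", word_position: 1,
    surface_form: "黒文字", pos: "名詞", pos_detail_1: "一般",
    pos_detail_2: "*", pos_detail_3: "*",
    conjugated_type: "*", conjugated_form: "*",
    basic_form: "黒文字", reading: "クロモジ", pronunciation: "クロモジ"
}];

// Hypothetical helper (not a kuromoji.js API): return the surface forms
// of all tokens whose part of speech equals `pos`.
function surfacesByPos(tokens, pos) {
    return tokens
        .filter(function (token) { return token.pos === pos; })
        .map(function (token) { return token.surface_form; });
}

console.log(surfacesByPos(sampleTokens, "名詞"));
```

In real use, the array returned by tokenizer.tokenize() would be passed in place of sampleTokens.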


kuromoji.js's Issues

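Cannot run gulp tasks

When I run
gulp build-dict
the following error occurs and the task cannot run:
/ほげぱす/kuromoji.js/node_modules/globby/index.js:28 } catch { ^'
'SyntaxError: Unexpected token {'

As a result, I cannot update the dictionary.

Self-reply:
(I have not confirmed this, but) the cause may be that I ran "npm install" on my own inside node_modules, which put the module versions out of sync.

After following the steps in the comment below, it seemed to work for now.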

Wrong pos?

I'm searching for "名詞" (nouns) in a sentence,

but "、" is classified as "名詞", as shown below.

{ word_id: 51340, word_type: 'KNOWN', word_position: 16, surface_form: '、', pos: '名詞', pos_detail_1: '数', pos_detail_2: '*', pos_detail_3: '*', conjugated_type: '*', conjugated_form: '*', basic_form: '、', reading: '、', pronunciation: '、' }
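One possible workaround, sketched here under assumptions, is to post-filter the tokens and skip any noun whose surface form consists only of punctuation or symbols. The helper name isNounWord is hypothetical, and this does not change how the dictionary classifies "、"; it only hides the entry from a noun search.

```javascript
// Matches strings made up entirely of Unicode punctuation, symbols, or whitespace.
var PUNCTUATION_ONLY = /^[\p{P}\p{S}\s]+$/u;

// Hypothetical helper (not a kuromoji.js API): true for tokens tagged 名詞
// whose surface form is not just punctuation such as "、".
function isNounWord(token) {
    return token.pos === "名詞" && !PUNCTUATION_ONLY.test(token.surface_form);
}

// The token reported in this issue, plus an ordinary noun for comparison.
var commaToken = { surface_form: "、", pos: "名詞", pos_detail_1: "数" };
var nounToken = { surface_form: "黒文字", pos: "名詞", pos_detail_1: "一般" };

console.log([commaToken, nounToken].filter(isNounWord).length);
```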

Not getting the same results as Kuromoji java

Hi,

I was trying to tokenize the following sentence:

第1条 この法人は、一般社団法人国際銀行協会(以下「本協会」という。)と称し、英文では、 International Bankers Association of Japanと記載する。

and the results differ between the Java version of Kuromoji (with the IPADIC dictionary) and the tokenizer provided by kuromoji.js. In particular, the sequence 協会 is split in kuromoji.js.

I saw a closed issue (#16) stating this could be due to the Viterbi version of the tokenizer. Is there a way to disable it?

Many thanks in advance,

Best

微笑み is broken down to 微 and 笑み

I am trying to use kuroshiro with kuromoji to annotate Japanese lyrics with romaji. In that context, "微笑み" should be kept together and converted to "hohoemi", but because it is broken down into "微" and "笑み", the romaji conversion outputs "bi emi".

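kuromoji.js works locally but not on a web server

Locally, kuromoji.js works with the code below, but as soon as I upload it to a web server (Lolipop), I get an error (shown in the attached screenshot) and it stops working.
How can I resolve this error? Is it a problem with the server?
I would appreciate any advice. Thank you.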

<script src="assets/kuromoji.js"></script>
<script>
kuromoji.builder({ dicPath: "assets/dict" }).build(function (err, tokenizer) {
    // tokenizer is ready
    var path = tokenizer.tokenize("すもももももももものうち");
    console.log(path);
});
</script>


研究 is broken down into 研 and 究

I'm using kuromoji.js for a web app: https://kuromoji.fluentcards.com/

If you enter 研究, it gets broken down to 研 and 究. The same on the demo site.

However, when I try the same word on the original Java version's website, 研究 gets parsed as a single token.

Is this behavior configurable?

Edit: I've looked through the source and have realized it's using the Viterbi algorithm which is not the default on the original Kuromoji demo site. Hence the difference in the output. Closing the issue.

Using only kanji->kana data

Thanks for the great project! I'm only interested in breaking down kanji into its kana form. I don't need data about parts of speech, pronunciation, etc. There are currently 12 dictionary files (~17.8MB gzipped) and I want to bring that number down for my simple purposes.

I'm having trouble working out whether it's possible to exclude the extra info and reduce the amount of dictionary data I need.

Are all the dictionaries critical for getting the kana, or will I be able to modify the code and still get just kana with less dictionary data?
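For the kana part alone, a rough sketch is to read only the reading field of each token and convert katakana to hiragana. This does not reduce the dictionary size, which is the actual question here. The helper names are hypothetical, and the fallback to surface_form assumes that tokens without a usable reading (for example unknown words) should pass through unchanged.

```javascript
// Convert katakana (U+30A1..U+30F6) to hiragana by shifting code points by 0x60.
// Other characters, including the long-vowel mark "ー", are left as they are.
function katakanaToHiragana(text) {
    return text.replace(/[\u30A1-\u30F6]/g, function (ch) {
        return String.fromCharCode(ch.charCodeAt(0) - 0x60);
    });
}

// Hypothetical helper (not a kuromoji.js API): join the readings of all tokens.
// Assumption: a missing or "*" reading falls back to the surface form.
function tokensToKana(tokens) {
    return tokens.map(function (token) {
        var hasReading = token.reading && token.reading !== "*";
        return hasReading ? katakanaToHiragana(token.reading) : token.surface_form;
    }).join("");
}

console.log(tokensToKana([
    { surface_form: "黒文字", reading: "クロモジ" },
    { surface_form: "abc" }
]));
```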

Builder won't accept URL to data folder in Chrome extension

I'm trying to use the library inside a Chrome extension. When I set up the builder, I pass in a path to the data folder generated by Chrome, like this:

let builder = kuromoji
  .builder({ dicPath: chrome.extension.getURL("data/") })
  .build(function (err, tokenizer) {
    var path = tokenizer.tokenize("すもももももももものうち");
  });

However I get an error:

kuromoji.js:7724 Uncaught TypeError: Cannot read property 'lookup' of null
    at UnknownDictionary.lookup (kuromoji.js:7724)
    at ViterbiBuilder.build (kuromoji.js:8806)
    at Tokenizer.getLattice (kuromoji.js:6961)
    at Tokenizer.tokenizeForSentence (kuromoji.js:6916)
    at Tokenizer.tokenize (kuromoji.js:6907)
    at furigana.js:103
    at kuromoji.js:7010
    at kuromoji.js:8272
    at kuromoji.js:3876
    at kuromoji.js:475
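The snippet in this report never checks err. One possibility (an assumption, not confirmed in this thread) is that the dictionary files failed to load and the error was silently ignored before tokenize() was called. A defensive sketch with a hypothetical guardedCallback helper:

```javascript
// Hypothetical helper (not a kuromoji.js API): wrap the build() callback so the
// tokenizer is only used when no error was reported.
function guardedCallback(onReady, onError) {
    return function (err, tokenizer) {
        if (err) {
            onError(err); // e.g. the dictionary files could not be fetched
            return;
        }
        onReady(tokenizer);
    };
}

// Usage sketch (assumes kuromoji is loaded and dicUrl points at the dictionary files):
// kuromoji.builder({ dicPath: dicUrl }).build(guardedCallback(
//     function (tokenizer) { console.log(tokenizer.tokenize("すもももももももものうち")); },
//     function (err) { console.error("dictionary load failed", err); }
// ));
```

With this in place, a load failure surfaces as a logged error rather than a TypeError deep inside tokenize().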

Phrase tokenized as a single token

I was trying to follow the instructions on the official website:

For example, we want a search for 空港 (airport) to match 関西国際空港 (Kansai International Airport), but most analyzers don’t allow this since 関西国際空港 tends to become one token.

For me, it gets tokenized as a single word:

kuromoji.builder({ dicPath: "dict/" }).build((err, tokenizer) => {
  console.log(tokenizer.tokenize("関西国際空港"));
});

// => [ { word_id: 1271160,
//        word_type: 'KNOWN',
//        word_position: 1,
//        surface_form: '関西国際空港',
//        pos: '名詞',
//        pos_detail_1: '固有名詞',
//        pos_detail_2: '組織',
//        pos_detail_3: '*',
//        conjugated_type: '*',
//        conjugated_form: '*',
//        basic_form: '関西国際空港',
//        reading: 'カンサイコクサイクウコウ',
//        pronunciation: 'カンサイコクサイクーコー' } ]

@takuyaa, is this an issue with the dictionary? Is there a way to convert the IPADIC dictionary to .dat format?

Cannot load dict from external URL

We are trying to use kuromoji.js with kuromojin in the browser.
When we use an external URL as the dicPath option, the double slash is normalized by path.join.

> require("path").join("http://external-url.com/dict", "base.dat.gz");
'http:/external-url.com/dict/base.dat.gz'

So the XMLHttpRequest goes to https://mydomain.com/external-url.com/dict/base.dat.gz
and fails to load the dictionary.
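The normalization can be reproduced in Node.js, and one way around it is to join with the URL constructor instead of path.join. The helper joinDicUrl below is hypothetical and only illustrates the idea; it is not how kuromoji.js builds dictionary paths.

```javascript
var path = require("path");

// path.posix.join() treats its arguments as file path segments,
// so the "//" after the scheme collapses to a single "/".
var broken = path.posix.join("http://external-url.com/dict", "base.dat.gz");

// Hypothetical helper: resolve a file name against a base URL without
// path normalization. A trailing slash is added so the last segment is kept.
function joinDicUrl(baseUrl, fileName) {
    var base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
    return new URL(fileName, base).href;
}

var fixed = joinDicUrl("http://external-url.com/dict", "base.dat.gz");

console.log(broken);
console.log(fixed);
```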


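The analysis result for 「見れる」 is wrong

「居れる」 is analyzed like this:

Surface form | POS | POS detail 1 | POS detail 2 | POS detail 3 | Conjugation type | Conjugation form | Base form | Reading | Pronunciation
居 | 動詞 | 自立 | * | * | 一段 | 未然形 | 居る | イ | イ
れる | 動詞 | 接尾 | * | * | 一段 | 基本形 | れる | レル | レル

「着れる」 is analyzed like this:

Surface form | POS | POS detail 1 | POS detail 2 | POS detail 3 | Conjugation type | Conjugation form | Base form | Reading | Pronunciation
着 | 動詞 | 自立 | * | * | 一段 | 未然形 | 着る | キ | キ
れる | 動詞 | 接尾 | * | * | 一段 | 基本形 | れる | レル | レル

So 「見れる」 would be expected to come out like this:

Surface form | POS | POS detail 1 | POS detail 2 | POS detail 3 | Conjugation type | Conjugation form | Base form | Reading | Pronunciation
見 | 動詞 | 自立 | * | * | 一段 | 未然形 | 見る | ミ | ミ
れる | 動詞 | 接尾 | * | * | 一段 | 基本形 | れる | レル | レル

However, the actual result is:

Surface form | POS | POS detail 1 | POS detail 2 | POS detail 3 | Conjugation type | Conjugation form | Base form | Reading | Pronunciation
見れる | 動詞 | 自立 | * | * | 一段 | 基本形 | 見れる | ミレル | ミレル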

Outdated and misleading description on npmjs.org

Thank you for the wonderful library. Thanks to it, I was able to add morphological analysis to hubot in no time.

One point: README.md is fine, but at
https://www.npmjs.com/package/kuromoji#node-js
the callback of the build function is shown as function(tokenizer) instead of function(err, tokenizer).

I think it should be updated.

Compatibility with React Native

I would love to help get this working in React Native. It seems that the biggest obstacle would be the Dictionary Loader. Any help in understanding how the app decides which of the two dictionary loaders to use would be great.

Any plans for UniDic support?

UniDic is, like IPADIC, a morpheme lexicon which Java Kuromoji has excellent support for. (Well, Atilika says Kuromoji UniDic support is "experimental" but in our experiments it works really well.) Any plans for UniDic support in kuromoji.js? Thanks 🍻!

Stop offering bower package

Problem

Bower has no repository of its own, so we must commit compiled builds to this git repo.

Cons:

  • Pull requests often contain a large number of diffs
  • The git repository is bigger because of compiled js, docs, and binary dictionaries

To do

ref. How to drop Bower support? · Bower blog

To browser users

Migrate to the npm ecosystem. Use npm, Yarn, or webpack.

Browser users who use npm can install kuromoji.js as follows:

  • $ npm install kuromoji --save
  • Or, add kuromoji to the dependency field in your package.json, then $ npm install

Bower packages published up to v0.0.5 will remain for now, but I would like to stop publishing to the Bower repo.

、 as 名詞 数

Try はじめまして。どうぞ、よろしく。
The result will be:

{
    "type":"WordNode",
    "children":[
        {
            "type":"TextNode",
            "value":"、",
            "position":{
                "start":{
                    "line":1,
                    "column":11,
                    "offset":10
                },
                "end":{
                    "line":1,
                    "column":12,
                    "offset":11
                }
            },
            "data":{
                "word_id":51340,
                "word_type":"KNOWN",
                "surface_form":"、",
                "pos":"名詞",
                "pos_detail_1":"数",
                "pos_detail_2":"*",
                "pos_detail_3":"*",
                "conjugated_type":"*",
                "conjugated_form":"*",
                "basic_form":"、",
                "reading":"、",
                "pronunciation":"、"
            }
        }
    ],
    "position":{
        "start":{
            "line":1,
            "column":11,
            "offset":10
        },
        "end":{
            "line":1,
            "column":12,
            "offset":11
        }
    },
    "data":{
        "word_id":51340,
        "word_type":"KNOWN",
        "surface_form":"、",
        "pos":"名詞",
        "pos_detail_1":"数",
        "pos_detail_2":"*",
        "pos_detail_3":"*",
        "conjugated_type":"*",
        "conjugated_form":"*",
        "basic_form":"、",
        "reading":"、",
        "pronunciation":"、"
    }
}

Is this wrong?
How can I fix it?

User dictionary support

I found the TODO for user dictionaries:
https://github.com/takuyaa/kuromoji.js/blob/master/src/Tokenizer.js#L110

But, compared to the atilika version of kuromoji, I think the user dictionary code actually needs to go here:
https://github.com/takuyaa/kuromoji.js/blob/master/src/viterbi/ViterbiBuilder.js#L96

Ref: https://github.com/atilika/kuromoji/blob/master/kuromoji-core/src/main/java/com/atilika/kuromoji/viterbi/ViterbiBuilder.java#L101
https://github.com/atilika/kuromoji/blob/master/kuromoji-core/src/main/java/com/atilika/kuromoji/viterbi/ViterbiBuilder.java#L176
https://github.com/atilika/kuromoji/blob/d0700ab6dd489aaf0fcb1e4e78ce2f682be9f255/kuromoji-core/src/main/java/com/atilika/kuromoji/dict/UserDictionary.java#L72

That seems like a fair bit of code - no wonder you skipped over it before! Based on your knowledge of the code, is it all fairly mechanical porting work, or is it going to be hard work?

(Do you have any other ideas for dynamically adding new entries, that don't require recompiling, then reloading, all the dictionary files?)

Can't resolve 'path'

Module not found: Error: Can't resolve 'path' in \node_modules\kuromoji\src\loader. I'm using React, webpack, and Node.js.
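If the build uses webpack 5 (an assumption, since the report does not state the version), Node.js core modules such as path are no longer polyfilled automatically. One commonly used approach is a resolve.fallback entry. The sketch below additionally assumes the path-browserify package has been installed (npm install path-browserify); it is a config fragment, not a verified fix for this report.

```javascript
// webpack.config.js (fragment): assumes webpack 5 and an installed path-browserify.
module.exports = {
    resolve: {
        fallback: {
            // Point imports of the Node.js core module "path" at a browser implementation.
            path: require.resolve("path-browserify")
        }
    }
};
```

Projects that do not expose their webpack configuration directly would need a way to extend it before this fragment can be applied.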
