takuyaa / kuromoji.js


JavaScript implementation of Japanese morphological analyzer

JavaScript 100.00%

kuromoji.js's Introduction

kuromoji.js

JavaScript implementation of a Japanese morphological analyzer. This is a pure JavaScript port of Kuromoji.

You can see how kuromoji.js works on the demo site.

Usage

You can tokenize sentences with only 5 lines of code. If you need working examples, you can see the files under the demo or example directory.

Node.js

Install with npm package manager:

npm install kuromoji

Load this library as follows:

var kuromoji = require("kuromoji");

You can prepare a tokenizer like this:

kuromoji.builder({ dicPath: "path/to/dictionary/dir/" }).build(function (err, tokenizer) {
    // tokenizer is ready
    var path = tokenizer.tokenize("すもももももももものうち");
    console.log(path);
});

Browser

You only need the build/kuromoji.js and dict/*.dat.gz files.

Install with Bower package manager:

bower install kuromoji

Or you can use the kuromoji.js file and dictionary files from the GitHub repository.

In your HTML:

<script src="url/to/kuromoji.js"></script>

In your JavaScript:

kuromoji.builder({ dicPath: "/url/to/dictionary/dir/" }).build(function (err, tokenizer) {
    // tokenizer is ready
    var path = tokenizer.tokenize("すもももももももものうち");
    console.log(path);
});

API

The function tokenize() returns a JSON array like this:

[ {
    word_id: 509800,          // ID of the word in the dictionary
    word_type: 'KNOWN',       // Word type (KNOWN if the word is in the dictionary, UNKNOWN otherwise)
    word_position: 1,         // Start position of the word
    surface_form: '黒文字',    // Surface form
    pos: '名詞',               // Part of speech
    pos_detail_1: '一般',      // Part-of-speech subcategory 1
    pos_detail_2: '*',        // Part-of-speech subcategory 2
    pos_detail_3: '*',        // Part-of-speech subcategory 3
    conjugated_type: '*',     // Conjugation type
    conjugated_form: '*',     // Conjugation form
    basic_form: '黒文字',      // Base form
    reading: 'クロモジ',       // Reading
    pronunciation: 'クロモジ'  // Pronunciation
  } ]

(This is defined in src/util/IpadicFormatter.js)
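As a small illustration of how these fields can be combined, the sketch below derives 0-based character offsets from word_position and surface_form. The helper name tokenSpans is hypothetical (it is not part of kuromoji.js), and it assumes that word_position is 1-based, which the sample output above suggests but does not state.

```javascript
// Hypothetical helper, not part of kuromoji.js: derive 0-based [start, end)
// offsets for each token. Assumes word_position is 1-based.
function tokenSpans(tokens) {
  return tokens.map(function (token) {
    var start = token.word_position - 1;
    return {
      text: token.surface_form,
      start: start,
      end: start + token.surface_form.length
    };
  });
}

// A token shaped like the sample output above (unused fields omitted).
var sample = [{ word_position: 1, surface_form: "黒文字", pos: "名詞" }];
console.log(tokenSpans(sample));
```

The helper only reads plain token objects, so it can be tried without loading any dictionary.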


kuromoji.js's Issues

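Cannot run the gulp task

Running
gulp build-dict
fails with the following error:
/ほげぱす/kuromoji.js/node_modules/globby/index.js:28 } catch { ^'
'SyntaxError: Unexpected token {'

so the task cannot be run.

I cannot update the dictionary.

Replying to myself:
(I have not verified this, but) the cause may be that the module versions in node_modules drifted out of sync because I ran "npm install" on my own.

After following the steps in the comment below, it seemed to work, at least for now.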

Wrong pos?

I'm searching for "名詞" (nouns) in a sentence,

but "、" is classified as "名詞", as shown below.

{ word_id: 51340, word_type: 'KNOWN', word_position: 16, surface_form: '、', pos: '名詞', pos_detail_1: '数', pos_detail_2: '*', pos_detail_3: '*', conjugated_type: '*', conjugated_form: '*', basic_form: '、', reading: '、', pronunciation: '、' }
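Until the classification itself is resolved, a caller could filter such tokens out when collecting nouns. The following is only a workaround sketch: findNouns is a hypothetical helper, and treating any surface form made up entirely of Unicode punctuation as "not a noun" is an assumption about what the caller wants.

```javascript
// Hypothetical helper: keep noun tokens, skipping any whose surface form
// consists only of Unicode punctuation (such as "、").
function findNouns(tokens) {
  var punctuationOnly = /^\p{P}+$/u;
  return tokens.filter(function (token) {
    return token.pos === "名詞" && !punctuationOnly.test(token.surface_form);
  });
}

// Tokens shaped like the output above (unused fields omitted).
var tokens = [
  { surface_form: "、", pos: "名詞", pos_detail_1: "数" },
  { surface_form: "黒文字", pos: "名詞", pos_detail_1: "一般" }
];
console.log(findNouns(tokens).map(function (t) { return t.surface_form; }));
```

The \p{P} property escape needs a JavaScript engine that supports Unicode property escapes in regular expressions.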

Not getting the same results as Kuromoji java

Hi,

I was trying to tokenize the following sentence :

第1条この法人は、一般社団法人国際銀行協会（以下「本協会」という。）と称し、英文では、 International Bankers Association of Japanと記載する。

and the results are different when using the Java version of Kuromoji (with the IPADIC dictionary) and the tokenizer provided by kuromoji.js. In particular, the sequence 協会 is split in kuromoji.js.

I saw a closed issue (#16) stating this could be due to the Viterbi version of the tokenizer. Is there a way to disable it?

Many thanks in advance,

Best

How do you import a dictionary in React Native?

This repository is not maintained now?

This repository is not maintained now?
I forked kuromoji.js, and I'm going to publish it to NPM.

https://github.com/hata6502/kuromoji.js

微笑み is broken down to 微 and 笑み

I am trying to use kuroshiro with kuromoji to annotate Japanese lyrics with romaji. In that context, "微笑み" should be kept together and converted to "hohoemi", but because it is broken down into "微" and "笑み", the romaji conversion outputs "bi emi".

kuromoji.js works locally but not on the web server

Locally, kuromoji.js works with the code below, but as soon as I uploaded it to a web server (Lolipop), it failed with the following error and stopped working.
How can I resolve this error? Is it a problem with the server?
Any advice would be appreciated. Thank you.

<script src="assets/kuromoji.js"></script>
<script>
kuromoji.builder({ dicPath: "assets/dict" }).build(function (err, tokenizer) {
    // tokenizer is ready
    var path = tokenizer.tokenize("すもももももももものうち");
    console.log(path);
});
</script>

研究 is broken down into 研 and 究

I'm using kuromoji.js for a web app: https://kuromoji.fluentcards.com/

If you enter 研究, it gets broken down to 研 and 究. The same on the demo site.

However, when I try the same word on the original Java version's website, 研究 gets parsed as a single token.

Is this behavior configurable?

Edit: I've looked through the source and have realized it's using the Viterbi algorithm which is not the default on the original Kuromoji demo site. Hence the difference in the output. Closing the issue.

Using only kanji->kana data

Thanks for the great project! I'm only interested in breaking down kanji into its kana form. I don't need data about the parts of speech, pronunciation, etc. There are currently 12 dictionary files (~17.8MB gzipped) and I want to bring the number down for my simple purposes.

I'm having trouble working out whether it's possible to exclude the extra info and reduce the amount of dictionary data I need.

Are all the dictionaries critical for getting the kana, or will I be able to modify the code and still get just kana with less dictionary data?
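This does not answer the dictionary-size question, but for reference, here is a rough sketch of how kana could be pulled out of the tokenizer output using only the reading field. Both helper names are hypothetical, and the fallback to surface_form assumes that some tokens (for example unknown words) may have no reading.

```javascript
// Hypothetical helpers, not part of kuromoji.js.

// Shift full-width katakana (U+30A1..U+30F6) down to hiragana (U+3041..U+3096).
function katakanaToHiragana(text) {
  return text.replace(/[\u30A1-\u30F6]/g, function (ch) {
    return String.fromCharCode(ch.charCodeAt(0) - 0x60);
  });
}

// Join the reading of every token; fall back to the surface form when a
// token has no reading (assumed to happen for unknown words).
function toKana(tokens) {
  return tokens.map(function (token) {
    return katakanaToHiragana(token.reading || token.surface_form);
  }).join("");
}

console.log(toKana([{ surface_form: "黒文字", reading: "クロモジ" }]));
```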

Builder won't accept URL to data folder in Chrome extension

I'm trying to use the library inside a Chrome extension. When I set up the builder, I pass in a path to the data folder generated by Chrome, like this:

let builder = kuromoji
  .builder({ dicPath: chrome.extension.getURL("data/") })
  .build(function (err, tokenizer) {
    var path = tokenizer.tokenize("すもももももももものうち");
  });

However I get an error:

kuromoji.js:7724 Uncaught TypeError: Cannot read property 'lookup' of null
    at UnknownDictionary.lookup (kuromoji.js:7724)
    at ViterbiBuilder.build (kuromoji.js:8806)
    at Tokenizer.getLattice (kuromoji.js:6961)
    at Tokenizer.tokenizeForSentence (kuromoji.js:6916)
    at Tokenizer.tokenize (kuromoji.js:6907)
    at furigana.js:103
    at kuromoji.js:7010
    at kuromoji.js:8272
    at kuromoji.js:3876
    at kuromoji.js:475
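Note that the callback above never inspects err. One possible (unconfirmed) explanation for the stack trace is that a dictionary file failed to load and the error was passed to the callback but ignored. A sketch of a wrapper that surfaces such errors before tokenize() is called could look like this; buildTokenizer and failingBuilder are hypothetical names, and the stand-in builder exists only so the failure path can be shown without loading dictionaries.

```javascript
// Hypothetical wrapper: turn builder.build(callback) into a Promise that
// rejects when the callback receives an error.
function buildTokenizer(builder) {
  return new Promise(function (resolve, reject) {
    builder.build(function (err, tokenizer) {
      if (err) {
        reject(err); // e.g. a dictionary file could not be fetched
        return;
      }
      resolve(tokenizer);
    });
  });
}

// Stand-in builder used only to demonstrate the failure path; real code
// would pass the object returned by kuromoji.builder({ dicPath: ... }).
var failingBuilder = {
  build: function (callback) {
    callback(new Error("dictionary not found"), null);
  }
};

buildTokenizer(failingBuilder).catch(function (err) {
  console.error("Could not build tokenizer:", err.message);
});
```

With a wrapper like this, tokenize() is only reached when the build callback reported no error, so a loading problem shows up as its own message instead of a TypeError deep inside the tokenizer.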

byte length of Int16Array should be a multiple of 2

I'm getting this error when running in production.
Please help.

Phrase tokenized as single token

I was trying to follow the instructions on the official website:

For example, we want a search for 空港 (airport) to match 関西国際空港 (Kansai International Airport), but most analyzers don’t allow this since 関西国際空港 tends to become one token.

For me it gets tokenized as a single word:

kuromoji.builder({ dicPath: "dict/" }).build((err, tokenizer) => {
  console.log(tokenizer.tokenize("関西国際空港"));
});

// => [ { word_id: 1271160,
//        word_type: 'KNOWN',
//        word_position: 1,
//        surface_form: '関西国際空港',
//        pos: '名詞',
//        pos_detail_1: '固有名詞',
//        pos_detail_2: '組織',
//        pos_detail_3: '*',
//        conjugated_type: '*',
//        conjugated_form: '*',
//        basic_form: '関西国際空港',
//        reading: 'カンサイコクサイクウコウ',
//        pronunciation: 'カンサイコクサイクーコー' } ]

@takuyaa, is it an issue with the dictionary? Is there a way to convert the IPADIC dictionary to .dat format?

Cannot load dict from external URL

We are trying to use kuromoji.js with kuromojin in the browser.
When we pass an external URL as the dicPath option, the double slash is collapsed by path.join:

> require("path").join("http://external-url.com/dict", "base.dat.gz");
'http:/external-url.com/dict/base.dat.gz'

As a result, XMLHttpRequest requests https://mydomain.com/external-url.com/dict/base.dat.gz
and fails to load the dictionary.
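The behaviour can be reproduced in Node.js, and a join that leaves the scheme intact is straightforward to sketch. joinDicPath below is a hypothetical helper, not the fix adopted by the project; it uses path.posix.join so the result does not depend on the platform's separator.

```javascript
var path = require("path");

// Hypothetical helper: join a base URL or directory with a file name without
// collapsing the "//" that follows a URL scheme.
function joinDicPath(base, file) {
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(base)) {
    // base looks like "scheme://...": concatenate instead of normalizing.
    return base.replace(/\/+$/, "") + "/" + file;
  }
  return path.posix.join(base, file);
}

// path.join-style normalization collapses "//" into "/".
console.log(path.posix.join("http://external-url.com/dict", "base.dat.gz"));
// The helper keeps the scheme separator for URLs and still joins plain paths.
console.log(joinDicPath("http://external-url.com/dict", "base.dat.gz"));
console.log(joinDicPath("dict/", "base.dat.gz"));
```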

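The analysis result for 「見れる」 looks wrong

「居れる」 is analyzed like this:

Surface form	POS	POS detail 1	POS detail 2	POS detail 3	Conjugation type	Conjugation form	Base form	Reading	Pronunciation
居	動詞	自立	*	*	一段	未然形	居る	イ	イ
れる	動詞	接尾	*	*	一段	基本形	れる	レル	レル

「着れる」 is analyzed like this:

Surface form	POS	POS detail 1	POS detail 2	POS detail 3	Conjugation type	Conjugation form	Base form	Reading	Pronunciation
着	動詞	自立	*	*	一段	未然形	着る	キ	キ
れる	動詞	接尾	*	*	一段	基本形	れる	レル	レル

So 「見れる」 would be expected to be analyzed like this:

Surface form	POS	POS detail 1	POS detail 2	POS detail 3	Conjugation type	Conjugation form	Base form	Reading	Pronunciation
見	動詞	自立	*	*	一段	未然形	見る	ミ	ミ
れる	動詞	接尾	*	*	一段	基本形	れる	レル	レル

In practice, however, the result is:

Surface form	POS	POS detail 1	POS detail 2	POS detail 3	Conjugation type	Conjugation form	Base form	Reading	Pronunciation
見れる	動詞	自立	*	*	一段	基本形	見れる	ミレル	ミレル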

Outdated and misleading description on npmjs.org

Thank you for this wonderful library. Thanks to it, I was able to add morphological analysis to hubot in no time.

Just one thing: README.md is fine, but at
https://www.npmjs.com/package/kuromoji#node-js
the callback of the build function is shown as function(tokenizer) instead of function(err, tokenizer).

I think it would be good to update it.

Compatibility with React Native

I would love to help get this working in React Native. It seems that the biggest obstacle would be the dictionary loader. Any help in understanding how the app decides which of the two dictionary loaders to use would be great.

Any plans for UniDic support?

UniDic is, like IPADIC, a morpheme lexicon which Java Kuromoji has excellent support for. (Well, Atilika says Kuromoji UniDic support is "experimental" but in our experiments it works really well.) Any plans for UniDic support in kuromoji.js? Thanks 🍻!

Infection blocked (by Avast)

Help me.
The install was blocked by Avast with an "Infection blocked" warning.

https://raw.githubusercontent.com/takuyaa/kuromoji.js/master/README.md

Stop offering bower package

Problem

Bower has no repository of its own, so we must keep compiled builds in this Git repository.

Cons:

Pull requests often contain a large number of diffs
The Git repository is bigger because of compiled JS, docs, and binary dictionaries

To do

update README.md
- ref. How to migrate away from Bower? · Bower blog
remove build/ from git repository
remove bower.json
maintain gulpfile.js and package.json

ref. How to drop Bower support? · Bower blog

To browser users

Migrate to npm ecosystem. Use npm, Yarn, or webpack.

Browser users who use npm can install kuromoji.js with:

$ npm install kuromoji --save
Or add kuromoji to the dependencies field in your package.json, then run $ npm install

Bower packages published up to v0.0.5 will remain for now, but I would like to stop publishing to the Bower registry.

gzip library not needed in the browser version

https://github.com/takuyaa/kuromoji.js/blob/master/src/loader/BrowserDictionaryLoader.js#L50

The gzip library is not needed for the BrowserDictionaryLoader. As long as the server responds with Content-Encoding: gzip, the response will be decompressed automatically.

Doesn't work in Firefox because of error in loading array buffer

Only in Firefox, loading the dictionary fails with an error while loading the array buffer.

、 as 名詞数

Try tokenizing はじめまして。どうぞ、よろしく。
The 、 comes out as:

{

    "type":"WordNode",
    "children":[
        {
            "type":"TextNode",
            "value":"、",
            "position":{
                "start":{
                    "line":1,
                    "column":11,
                    "offset":10
                },
                "end":{
                    "line":1,
                    "column":12,
                    "offset":11
                }
            },
            "data":{
                "word_id":51340,
                "word_type":"KNOWN",
                "surface_form":"、",
                "pos":"名詞",
                "pos_detail_1":"数",
                "pos_detail_2":"*",
                "pos_detail_3":"*",
                "conjugated_type":"*",
                "conjugated_form":"*",
                "basic_form":"、",
                "reading":"、",
                "pronunciation":"、"
            }
        }
    ],
    "position":{
        "start":{
            "line":1,
            "column":11,
            "offset":10
        },
        "end":{
            "line":1,
            "column":12,
            "offset":11
        }
    },
    "data":{
        "word_id":51340,
        "word_type":"KNOWN",
        "surface_form":"、",
        "pos":"名詞",
        "pos_detail_1":"数",
        "pos_detail_2":"*",
        "pos_detail_3":"*",
        "conjugated_type":"*",
        "conjugated_form":"*",
        "basic_form":"、",
        "reading":"、",
        "pronunciation":"、"
    }

}

Is this wrong?
How to fix it?

User dictionary support

I found the TODO for user dictionaries:
https://github.com/takuyaa/kuromoji.js/blob/master/src/Tokenizer.js#L110

But, comparing to the atilika version of kuromoji, I think the user dictionary code actually needs to go here:
https://github.com/takuyaa/kuromoji.js/blob/master/src/viterbi/ViterbiBuilder.js#L96

Ref: https://github.com/atilika/kuromoji/blob/master/kuromoji-core/src/main/java/com/atilika/kuromoji/viterbi/ViterbiBuilder.java#L101
https://github.com/atilika/kuromoji/blob/master/kuromoji-core/src/main/java/com/atilika/kuromoji/viterbi/ViterbiBuilder.java#L176
https://github.com/atilika/kuromoji/blob/d0700ab6dd489aaf0fcb1e4e78ce2f682be9f255/kuromoji-core/src/main/java/com/atilika/kuromoji/dict/UserDictionary.java#L72

That seems like a fair bit of code - no wonder you skipped over it before! Based on your knowledge of the code, is it all fairly mechanical porting work, or is it going to be hard work?

(Do you have any other ideas for dynamically adding new entries, that don't require recompiling, then reloading, all the dictionary files?)

kuromoji-vercel

👋 I just updated https://github.com/martinheidegger/kuromoji-vercel and wanted to say HI! and thanks for the cool project.

Can't resolve 'path'

Module not found: Error: Can't resolve 'path' in \node_modules\kuromoji\src\loader. I'm using React, webpack, and Node.js.