neologd / mecab-ipadic-neologd

Neologism dictionary based on the language resources on the Web for mecab-ipadic

License: Other

Shell 92.13% Perl 7.87%
mecab-ipadic named-entities dictionary furigana neologism-dictionary mecab language-resources japanese-language

mecab-ipadic-neologd's Introduction

NEologd : Neologism dictionary generator

NEologd generates a neologism dictionary from various language resources.

Each entry of the neologism dictionary has the following four columns.

  • Surface
  • Phonetic signs
    • IPA (International Phonetic Alphabet)
    • kana indicating the pronunciation (in Japanese)
  • Base form of Surface
  • Part-Of-Speech (POS) tags

NEologd keeps up with the world's newly coined words so that you do not have to.

For Japanese

README.ja.md is written in Japanese.

Application example

Copyrights

Copyright (c) 2015 Toshinori Sato (@overlast) All rights reserved.

mecab-ipadic-neologd's People

Contributors

felixonmars, neologd, overlast, pecorarista

mecab-ipadic-neologd's Issues

Download common-nouns.csv of specific date

Motivation

  • Extract the nouns newly added to the dictionary by comparing the current common-nouns.csv with last year's common-nouns.csv.

Goal

  • Download the latest common-nouns.csv and last year's common-nouns.csv.
  • Is there any way to download the common-nouns.csv of a specific date?
    I looked in the /seeds/ directory, but it seems to contain only the 2017/02 common-nouns.csv.
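
Once two snapshots are available, the comparison described under Motivation could be sketched roughly as follows. This is only an illustration, assuming each line is a mecab-ipadic style CSV row whose first column is the surface form; the function name `newly_added_surfaces` is hypothetical and not part of the repository.

```python
import csv
import io


def newly_added_surfaces(old_csv_text, new_csv_text):
    """Return surfaces that appear in the new snapshot but not in the old one."""
    old = {row[0] for row in csv.reader(io.StringIO(old_csv_text)) if row}
    seen = set()
    added = []
    # Walk the new file in order so the result keeps the file's ordering.
    for row in csv.reader(io.StringIO(new_csv_text)):
        if row and row[0] not in old and row[0] not in seen:
            seen.add(row[0])
            added.append(row[0])
    return added
```

For real seed files the two texts would be read from the decompressed CSV files; for very large files, streaming the rows instead of holding the whole text in memory may be preferable.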

With best regards

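Not an issue, but...

Sorry to bother you. It is recommended to use this dictionary together with MeCab's existing dictionary; could you tell me how to use both?

すもももももももものうち

Training a dictionary myself is tedious, so I was looking for a reasonably recent one and came across mecab-ipadic-neologd. Phrases that used to be split into small pieces are now recognized as single words, and it seems to work well. However, I ran into one problem: when I analyze 「すもももももももものうち」, it is analyzed as a single common noun, 「すもももももももものうち」.

Is this because something was missing when I ran make on the dictionary? Or is this the intended behavior?

I have confirmed that mecab-unidic-neologd likewise treats it as a common noun.

Specifying the output encoding

I use this on Windows (C#, NMeCab), but because the output encoding is UTF-8 I cannot use it without some modification.

Compiling on Unix is fine for now, so it would help if the output encoding could be specified as an installer option.

Reference (my own blog post): "mecab-ipadic-neologdをNMeCab甚にshift-jisでコンパむルした" (Compiling mecab-ipadic-neologd in Shift-JIS for NMeCab) - 雲行きそらゆきココロむキ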

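Error "line 525: 6288 Killed ${MECAB_LIBEXEC_DIR}/mecab-dict-index -f UTF8 -t UTF8" during build

Error description

After a git clone of the recent repository, an error is shown and the installation fails.
It looks as if the wrong directory is being referenced; I would appreciate any advice.

Situation

  • I am using a Dockerfile.
  • The Dockerfile runs git clone and then builds.

Code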

# Dockerfile

FROM python:3.6
WORKDIR /code
ENV PYTHONUNBUFFERED 1
COPY requirements.txt /code/
RUN apt-get update -y&&\
    apt-get upgrade -y&&\
    apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8 sudo -y&&\
    apt-get install git make curl xz-utils file&&\
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git&&\
    /code/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y &&\
    mkdir /code/media && \
    mkdir /code/static &&\
    python -m pip install --upgrade pip &&\
    pip install -r requirements.txt
COPY . /code/

Full error output

(Unrelated output above this point is omitted.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
Setting up mecab-ipadic-utf8 (2.7.0-20070801+main-2.1) ...
Compiling IPA dictionary for Mecab.  This takes long time...
reading /usr/share/mecab/dic/ipadic/unk.def ... 40
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
update-alternatives: using /var/lib/mecab/dic/ipadic-utf8 to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
Processing triggers for libc-bin (2.28-10) ...
Reading package lists...
Building dependency tree...
Reading state information...
curl is already the newest version (7.64.0-4+deb10u1).
file is already the newest version (1:5.35-4+deb10u1).
git is already the newest version (1:2.20.1-2+deb10u3).
make is already the newest version (4.2.1-1.2).
xz-utils is already the newest version (5.2.4-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Cloning into 'mecab-ipadic-neologd'...
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
[install-mecab-ipadic-NEologd] :     patch => ok
[install-mecab-ipadic-NEologd] :     which => ok
[install-mecab-ipadic-NEologd] :     file => ok
[install-mecab-ipadic-NEologd] :     openssl => ok
[install-mecab-ipadic-NEologd] :     awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : create /code/mecab-ipadic-neologd/libexec/../build
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
[make-mecab-ipadic-NEologd] : Try to access to https://ja.osdn.net
[make-mecab-ipadic-NEologd] : Try to download from https://ja.osdn.net/frs/g_redir.php?m=kent&f=mecab%2Fmecab-ipadic%2F2.7.0-20070801%2Fmecab-ipadic-2.7.0-20070801.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 11.6M  100 11.6M    0     0  7350k      0  0:00:01  0:00:01 --:--:-- 7731k
Hash value of /code/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz matched
[make-mecab-ipadic-NEologd] : Decompress original mecab-ipadic file
mecab-ipadic-2.7.0-20070801/
mecab-ipadic-2.7.0-20070801/README
mecab-ipadic-2.7.0-20070801/AUTHORS
mecab-ipadic-2.7.0-20070801/COPYING
mecab-ipadic-2.7.0-20070801/ChangeLog
mecab-ipadic-2.7.0-20070801/INSTALL
mecab-ipadic-2.7.0-20070801/Makefile.am
mecab-ipadic-2.7.0-20070801/Makefile.in
mecab-ipadic-2.7.0-20070801/NEWS
mecab-ipadic-2.7.0-20070801/aclocal.m4
mecab-ipadic-2.7.0-20070801/config.guess
mecab-ipadic-2.7.0-20070801/config.sub
mecab-ipadic-2.7.0-20070801/configure
mecab-ipadic-2.7.0-20070801/configure.in
mecab-ipadic-2.7.0-20070801/install-sh
mecab-ipadic-2.7.0-20070801/missing
mecab-ipadic-2.7.0-20070801/mkinstalldirs
mecab-ipadic-2.7.0-20070801/Adj.csv
mecab-ipadic-2.7.0-20070801/Adnominal.csv
mecab-ipadic-2.7.0-20070801/Adverb.csv
mecab-ipadic-2.7.0-20070801/Auxil.csv
mecab-ipadic-2.7.0-20070801/Conjunction.csv
mecab-ipadic-2.7.0-20070801/Filler.csv
mecab-ipadic-2.7.0-20070801/Interjection.csv
mecab-ipadic-2.7.0-20070801/Noun.adjv.csv
mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv
mecab-ipadic-2.7.0-20070801/Noun.csv
mecab-ipadic-2.7.0-20070801/Noun.demonst.csv
mecab-ipadic-2.7.0-20070801/Noun.nai.csv
mecab-ipadic-2.7.0-20070801/Noun.name.csv
mecab-ipadic-2.7.0-20070801/Noun.number.csv
mecab-ipadic-2.7.0-20070801/Noun.org.csv
mecab-ipadic-2.7.0-20070801/Noun.others.csv
mecab-ipadic-2.7.0-20070801/Noun.place.csv
mecab-ipadic-2.7.0-20070801/Noun.proper.csv
mecab-ipadic-2.7.0-20070801/Noun.verbal.csv
mecab-ipadic-2.7.0-20070801/Others.csv
mecab-ipadic-2.7.0-20070801/Postp-col.csv
mecab-ipadic-2.7.0-20070801/Postp.csv
mecab-ipadic-2.7.0-20070801/Prefix.csv
mecab-ipadic-2.7.0-20070801/Suffix.csv
mecab-ipadic-2.7.0-20070801/Symbol.csv
mecab-ipadic-2.7.0-20070801/Verb.csv
mecab-ipadic-2.7.0-20070801/char.def
mecab-ipadic-2.7.0-20070801/feature.def
mecab-ipadic-2.7.0-20070801/left-id.def
mecab-ipadic-2.7.0-20070801/matrix.def
mecab-ipadic-2.7.0-20070801/pos-id.def
mecab-ipadic-2.7.0-20070801/rewrite.def
mecab-ipadic-2.7.0-20070801/right-id.def
mecab-ipadic-2.7.0-20070801/unk.def
mecab-ipadic-2.7.0-20070801/dicrc
mecab-ipadic-2.7.0-20070801/RESULT
[make-mecab-ipadic-NEologd] : Configure custom system dictionary on /code/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801-neologd-20200813
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets $(MAKE)... yes
checking for working aclocal-1.4... missing
checking for working autoconf... found
checking for working automake-1.4... missing
checking for working autoheader... found
checking for working makeinfo... missing
checking for a BSD-compatible install... /usr/bin/install -c
checking for mecab-config... /usr/bin/mecab-config
configure: creating ./config.status
config.status: creating Makefile
[make-mecab-ipadic-NEologd] : Encode the character encoding of system dictionary resources from EUC_JP to UTF-8
./../../libexec/iconv_euc_to_utf8.sh ./Adnominal.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp-col.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Filler.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.nai.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.verbal.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.proper.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Conjunction.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adj.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.number.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.name.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.place.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Interjection.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Auxil.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.demonst.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adverb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adverbal.csv
./../../libexec/iconv_euc_to_utf8.sh ./Verb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Prefix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Suffix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Symbol.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adjv.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.org.csv 
rm ./Adnominal.csv 
rm ./Postp-col.csv 
rm ./Filler.csv 
rm ./Others.csv
rm ./Noun.nai.csv 
rm ./Noun.others.csv
rm ./Noun.verbal.csv
rm ./Noun.proper.csv
rm ./Conjunction.csv
rm ./Adj.csv
rm ./Postp.csv
rm ./Noun.number.csv
rm ./Noun.name.csv 
rm ./Noun.place.csv
rm ./Noun.csv
rm ./Interjection.csv
rm ./Auxil.csv
rm ./Noun.demonst.csv
rm ./Adverb.csv
rm ./Noun.adverbal.csv
rm ./Verb.csv 
rm ./Prefix.csv
rm ./Suffix.csv 
rm ./Symbol.csv
rm ./Noun.adjv.csv
rm ./Noun.org.csv
./../../libexec/iconv_euc_to_utf8.sh ./right-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./left-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./feature.def 
./../../libexec/iconv_euc_to_utf8.sh ./unk.def 
./../../libexec/iconv_euc_to_utf8.sh ./rewrite.def 
./../../libexec/iconv_euc_to_utf8.sh ./pos-id.def
./../../libexec/iconv_euc_to_utf8.sh ./matrix.def 
./../../libexec/iconv_euc_to_utf8.sh ./char.def 
rm ./right-id.def 
rm ./left-id.def 
rm ./feature.def
rm ./unk.def
rm ./rewrite.def
rm ./pos-id.def 
rm ./matrix.def
rm ./char.def
mv ./Postp.csv.utf8 ./Postp.csv 
mv ./Noun.org.csv.utf8 ./Noun.org.csv 
mv ./Prefix.csv.utf8 ./Prefix.csv
mv ./Noun.demonst.csv.utf8 ./Noun.demonst.csv
mv ./rewrite.def.utf8 ./rewrite.def
mv ./Others.csv.utf8 ./Others.csv 
mv ./matrix.def.utf8 ./matrix.def
mv ./pos-id.def.utf8 ./pos-id.def
mv ./Noun.others.csv.utf8 ./Noun.others.csv
mv ./Noun.adjv.csv.utf8 ./Noun.adjv.csv 
mv ./Interjection.csv.utf8 ./Interjection.csv
mv ./Adj.csv.utf8 ./Adj.csv
mv ./unk.def.utf8 ./unk.def
mv ./Auxil.csv.utf8 ./Auxil.csv
mv ./Noun.number.csv.utf8 ./Noun.number.csv 
mv ./char.def.utf8 ./char.def
mv ./Conjunction.csv.utf8 ./Conjunction.csv
mv ./feature.def.utf8 ./feature.def
mv ./Filler.csv.utf8 ./Filler.csv
mv ./Symbol.csv.utf8 ./Symbol.csv 
mv ./Postp-col.csv.utf8 ./Postp-col.csv
mv ./Noun.csv.utf8 ./Noun.csv
mv ./Adnominal.csv.utf8 ./Adnominal.csv 
mv ./Adverb.csv.utf8 ./Adverb.csv
mv ./Noun.nai.csv.utf8 ./Noun.nai.csv
mv ./Noun.name.csv.utf8 ./Noun.name.csv
mv ./Noun.adverbal.csv.utf8 ./Noun.adverbal.csv
mv ./Noun.proper.csv.utf8 ./Noun.proper.csv 
mv ./Noun.place.csv.utf8 ./Noun.place.csv
mv ./Suffix.csv.utf8 ./Suffix.csv
mv ./left-id.def.utf8 ./left-id.def
mv ./right-id.def.utf8 ./right-id.def
mv ./Noun.verbal.csv.utf8 ./Noun.verbal.csv
mv ./Verb.csv.utf8 ./Verb.csv 
[make-mecab-ipadic-NEologd] : Fix yomigana field of IPA dictionary
patching file Noun.csv
patching file Noun.place.csv
patching file Verb.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.adverbal.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.others.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Prefix.csv
patching file Suffix.csv
patching file Noun.proper.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Suffix.csv
patching file Noun.demonst.csv
patching file Noun.csv
patching file Noun.name.csv
[make-mecab-ipadic-NEologd] : Copy user dictionary resource
[make-mecab-ipadic-NEologd] : Install adverb entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adverb-dict-seed.20150623.csv.xz
[make-mecab-ipadic-NEologd] : Install interjection entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-interjection-dict-seed.20170216.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-common-noun-ortho-variant-dict-seed.20170228.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-proper-noun-ortho-variant-dict-seed.20161110.csv.xz
[make-mecab-ipadic-NEologd] : Install entries of orthographic variant of a noun used as verb form using /code/mecab-ipadic-neologd/libexec/../seed/neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv.xz
[make-mecab-ipadic-NEologd] : Install frequent adjective orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-std-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-exp-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-adjective-exp-dict-seed.20151126.csv.xz, please set --install_adjective_exp option

[make-mecab-ipadic-NEologd] : Install adjective verb orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-verb-dict-seed.20160324.csv.xz
[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-date-time-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-date-time-infreq-dict-seed.20190415.csv.xz, please set --install_infreq_datetime option

[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-quantity-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-quantity-infreq-dict-seed.20190415.csv.xz, please set --install_infreq_quantity option

[make-mecab-ipadic-NEologd] : Install entries of ill formed words using /code/mecab-ipadic-neologd/libexec/../seed/neologd-ill-formed-words-dict-seed.20170127.csv.xz
[make-mecab-ipadic-NEologd] : Re-Index system dictionary
reading ./unk.def ... 40
emitting double-array: 100% |###########################################|
./model.def is not found. skipped.
reading ./Adnominal.csv ... 135
reading ./Postp-col.csv ... 91
reading ./Filler.csv ... 19
reading ./Others.csv ... 2
reading ./Noun.nai.csv ... 42
reading ./neologd-ill-formed-words-dict-seed.20170127.csv ... 60616
reading ./neologd-proper-noun-ortho-variant-dict-seed.20161110.csv ... 138379
reading ./Noun.others.csv ... 153
reading ./Noun.verbal.csv ... 12150
reading ./Noun.proper.csv ... 27493
reading ./Conjunction.csv ... 171
reading ./Adj.csv ... 27210
reading ./neologd-common-noun-ortho-variant-dict-seed.20170228.csv ... 152869
reading ./Postp.csv ... 146
reading ./Noun.number.csv ... 42
reading ./Noun.name.csv ... 34215
reading ./Noun.place.csv ... 73194
reading ./neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv ... 26058
reading ./Symbol.csv ... 208
reading ./neologd-adjective-verb-dict-seed.20160324.csv ... 20268
reading ./Noun.adjv.csv ... 3328
reading ./Noun.org.csv ... 17149
/code/mecab-ipadic-neologd/bin/../libexec/make-mecab-ipadic-neologd.sh: line 525:  6288 Killed
         ${MECAB_LIBEXEC_DIR}/mecab-dict-index -f UTF8 -t UTF8
ERROR: Service 'python-django' failed to build: The command '/bin/sh -c apt-get update -y&&    apt-get upgrade -y&&    apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8 sudo -y&&    apt-get install git make curl xz-utils file&&    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git&&    /code/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y &&    mkdir /code/media &&     mkdir /code/static &&    python -m pip install --upgrade pip &&    pip install -r requirements.txt' returned a non-zero code: 137
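
The trailing code 137 can be decoded. By shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" message in the log. A SIGKILL during a memory-heavy step such as dictionary indexing is often the kernel's out-of-memory killer, although the log alone does not prove that. A minimal sketch of the decoding, where the helper name `describe_exit_status` is hypothetical:

```python
import signal


def describe_exit_status(status):
    """Explain a shell-style exit status such as the 137 reported by Docker."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not map to a signal known on this platform.
            name = "unknown signal"
        return "terminated by signal {} ({})".format(signum, name)
    return "exited with status {}".format(status)


print(describe_exit_status(137))
```

If memory does turn out to be the cause, giving the Docker build more memory would be the first thing to try.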

Improper proper nouns

I found that some phrases ending in "。" are incorrectly registered as 固有名詞 (proper nouns).

$ echo '奜きだ。' | mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd
奜きだ。	名詞,固有名詞,䞀般,*,*,*,奜きだ。,スキダ,スキダ

Examples:

  • 奜きだ。
  • 元気です。
  • おはよう。
  • あなた。
  • たたね。
  • 嚘。
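
Entries of this kind could be listed mechanically from a seed CSV. The sketch below assumes the mecab-ipadic column layout (surface first, POS in the fifth column, POS subtype in the sixth); the function name `suspicious_proper_nouns` is hypothetical.

```python
import csv
import io


def suspicious_proper_nouns(csv_text):
    """Return surfaces that end with '。' but are tagged 名詞,固有名詞."""
    found = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Columns: surface, left id, right id, cost, POS, POS subtype 1, ...
        if (len(row) >= 6 and row[0].endswith("。")
                and row[4] == "名詞" and row[5] == "固有名詞"):
            found.append(row[0])
    return found
```

Some legitimate proper nouns may end in "。", so the output is a list of candidates to review, not a list of errors.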

How to use on Windows10 and Python?

I am not a Japanese speaker. First, I installed MeCab from this page:
https://pypi.org/project/mecab-python3/

although it did not create a mecab folder on my PC.

In my Python file I wrote wakati = MeCab.Tagger("-Owakati") and it worked well. However, I have read that mecab-ipadic-neologd is better and that I should use it, but all the guides are written for Linux and macOS. Please help.

Make it possible to install mecab-ipadic-NEologd to a user directory without sudo privileges

Currently, I have to set the "--asuser" option to install mecab-ipadic-NEologd to a user directory without sudo privileges.
But I would like mecab-ipadic-NEologd to detect whether sudo privileges are required.

So I will implement the following features:

  • A process that compares the uid of the current user with the uid of the target directory
  • An assudo option
    • It is required when I want to install using sudoer privileges
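
The uid comparison in the first item could look roughly like this. It is a sketch under the assumption of a POSIX system, not the installer's actual implementation, and the function name `owned_by_current_user` is hypothetical.

```python
import os
import tempfile


def owned_by_current_user(target_dir):
    """Return True if the nearest existing ancestor of target_dir belongs to the current user."""
    path = os.path.abspath(target_dir)
    # The install directory may not exist yet, so walk up to one that does.
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return os.stat(path).st_uid == os.geteuid()


with tempfile.TemporaryDirectory() as d:
    print(owned_by_current_user(os.path.join(d, "mecab-ipadic-neologd")))
```

Ownership is not the same as write permission (group-writable directories, for example), so a real check might also consult `os.access(path, os.W_OK)`.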

Nothing happened after "Download original mecab-ipadic file"

I tried to install ipadic-neologd on a Mac by following the steps, but after the "Download original mecab-ipadic file" step nothing happened and the program seems to have stopped. Can you help? Thanks.

./bin/install-mecab-ipadic-neologd -n
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] : find => ok
[install-mecab-ipadic-NEologd] : sort => ok
[install-mecab-ipadic-NEologd] : head => ok
[install-mecab-ipadic-NEologd] : cut => ok
[install-mecab-ipadic-NEologd] : egrep => ok
[install-mecab-ipadic-NEologd] : mecab => ok
[install-mecab-ipadic-NEologd] : mecab-config => ok
[install-mecab-ipadic-NEologd] : make => ok
[install-mecab-ipadic-NEologd] : curl => ok
[install-mecab-ipadic-NEologd] : sed => ok
[install-mecab-ipadic-NEologd] : cat => ok
[install-mecab-ipadic-NEologd] : diff => ok
[install-mecab-ipadic-NEologd] : tar => ok
[install-mecab-ipadic-NEologd] : unxz => ok
[install-mecab-ipadic-NEologd] : xargs => ok
[install-mecab-ipadic-NEologd] : grep => ok
[install-mecab-ipadic-NEologd] : iconv => ok
[install-mecab-ipadic-NEologd] : patch => ok
[install-mecab-ipadic-NEologd] : which => ok
[install-mecab-ipadic-NEologd] : file => ok
[install-mecab-ipadic-NEologd] : openssl => ok
[install-mecab-ipadic-NEologd] : awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/local/lib/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
MacBook-Pro:mecab-ipadic-neologd User$

「䞖界の秘密」

Hello,
This dictionary is very good!
I use it every day.
Thank you so much.

By the way, the phrase "䞖界の秘密" is analyzed as a single token by this dictionary.
The phrase is the name of a TV quiz program.
However, that program ended after only five months.
I think the phrase should be analyzed as "侖界" + "の" + "秘密".

A small script to find wrong yomigana entries

Hello,

First of all, your mecab-ipadic-neologd is amazing.
Thank you so much!

I wrote a small script and found some wrong yomigana entries.
find-neologd-error-entries.rb.txt
neologd-error-entries.txt

$ ruby find-neologd-error-entries.rb mecab-user-dict-seed.20160111.csv
It generates "neologd-error-entries.txt".

e.g.

  • 京郜垂䞊京区西町,1293,1293,-1319,名詞,固有名詞,地域,䞀般,,,京郜垂䞊京区西町,キョりトシカミギョりクニシマチニシマチニシマチニシマチニシマチニシマチニシマチ,キョヌトシカミギョヌクニシマチニシマチニシマチニシマチニシマチニシマチニシマチ
  • 神接小孊校,1288,1288,352,名詞,固有名詞,䞀般,,,*,神接小孊校,カミツショりガッコりコりヅショりガッコり,カミツショヌガッコヌコヌズショりガッコヌ

I know we can't get yomigana perfectly, but neologd may have some errors in zip code data splitting.
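
One way to catch the first kind of error (a reading chunk repeated many times) is a backreference regex. This is an independent sketch, not the attached Ruby script, and the minimum chunk length of three characters is an arbitrary assumption.

```python
import re

# A chunk of three or more characters that is immediately repeated at least once.
REPEATED_CHUNK = re.compile(r"(.{3,}?)\1+")


def has_repeated_reading(reading):
    """Return True if the reading contains an immediately repeated chunk."""
    return REPEATED_CHUNK.search(reading) is not None
```

This heuristic would not flag the second example, where two alternative readings of the same word are concatenated without an immediate repetition; that case would need a different check, such as comparing reading length against surface length.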

What is the correct way to customize the pos-id.def file in mecab-ipadic-neologd?

Hi,
I'm trying to modify the pos-id.def that comes with the neologd dictionary. But after changing that file, whether I execute
sudo ./mecab-dict-index -f UTF8 -t UTF8 -d /usr/lib/mecab/dic/mecab-ipadic-neologd
or
sudo ./mecab-dict-index -f UTF8 -t UTF8 -d .../build/mecab-ipadic-2.7.0-20070801-neologd-20170710
I get the error

dictionary_compiler.cpp(133) [dic.size()] no dictionaries are specified
or
char_property.cpp(236) [unk.find(it->first) != unk.end()] category [ALPHA] is undefined in ...../mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20170710/unk.def

respectively.

So could anyone tell me the correct way to compile the new pos-id.def for the neologd dictionary? Any hint is appreciated. Thanks.

How to produce mecab-user-dict-seed.YYYYMMDD.csv.xz?

Hi, I love and appreciate this helpful dictionary!

A quick question: how do you produce the seed file mecab-user-dict-seed.YYYYMMDD.csv.xz?
I suppose you use some scripts to do it; if so, are the scripts also uploaded to the git repo?

I'm looking for a way to build a slightly customized version of the dictionary.

Thanks in advance!

Pronunciations for 1日間 ~ 10日間 are wrong

The readings for 1日間 through 10日間 are wrong.
The 'カン' from '間' is missing.
Also, the furigana of "1日間" should be "むチニチカン", not "ツむタチカン".

1日間   名詞,固有名詞,䞀般,*,*,*,1日間,ツむタチ,ツむタチ
2日間   名詞,固有名詞,䞀般,*,*,*,2日間,フツカ,フツカ
3日間
4日間
...
10日間  名詞,固有名詞,䞀般,*,*,*,10日間,トオカ,トオカ

11日間 is correct.

11日間  名詞,固有名詞,䞀般,*,*,*,11日間,ゞュりむチニチカン,ゞュりむチニチカン
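
Entries with this problem can be found with a simple check on the reading's suffix. The sketch below is restricted to surfaces of the form "digits + 日間", because 間 has other readings in other words; the name `missing_kan_reading` is hypothetical.

```python
import re

# ASCII or fullwidth digits followed by 日間.
DAY_SPAN = re.compile(r"^[0-9\uFF10-\uFF19]+日間$")


def missing_kan_reading(surface, reading):
    """Return True if a 'N日間' surface has a reading that does not end with カン."""
    return bool(DAY_SPAN.match(surface)) and not reading.endswith("カン")
```

Applied to the rows above, the check flags 1日間 (ツむタチ) and 10日間 (トオカ) but not 11日間 (ゞュりむチニチカン). It does not verify the rest of the reading, so it would not notice "ツむタチカン" being used where "むチニチカン" is expected.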

Wide "" is included in 原圢

$ ag アヌスりィンドアンドファむアヌ mecab-user-dict-seed.20200123.csv 
151101:Earth Wind & Fire,1288,1288,4131,名詞,固有名詞,䞀般,*,*,*,EarthWind&Fire,アヌスりィンドアンドファむアヌ,アヌスりィンドアンドファむアヌ

I think this is better.

- EarthWind&Fire
+ Earth, Wind&Fire

mecab-ipadic-NEologd won't be updated when running the installer with the full path

I ran the installer from a cron job, using its full path and the -n option:

00 03 * * 2 /opt/mecab/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y

Then the following errors occurred.

fatal: Not a git repository (or any of the parent directories): .git
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

This occurred in the following code because the current directory was not a git repository.

if [ `git log refs/heads/master --pretty=%H | head -1` = `git ls-remote origin -h refs/heads/master |cut -f1` ]; then
    echo "$ECHO_PREFIX mecab-ipadic-NEologd is already up-to-date"

In this case, the condition is always true because both of the results are empty.
Therefore, the message "mecab-ipadic-NEologd is already up-to-date" is always displayed.
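
The two ingredients of a fix are running git inside the repository directory and refusing to treat two empty results as a match. The following sketch shows both ideas; the helper names are hypothetical, and it illustrates the logic without being a patch for the installer script.

```python
import subprocess


def git_output(repo_dir, *args):
    """Run git inside repo_dir (via `git -C`) and return stdout, or '' on failure."""
    result = subprocess.run(
        ["git", "-C", repo_dir] + list(args),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.stdout.strip() if result.returncode == 0 else ""


def is_up_to_date(local_hash, remote_hash):
    """Treat empty hashes as 'unknown' instead of as a match."""
    return bool(local_hash) and local_hash == remote_hash


def check(repo_dir):
    """Compare the local master commit with the one on origin."""
    local = git_output(repo_dir, "log", "-1", "--pretty=%H", "refs/heads/master")
    remote = git_output(repo_dir, "ls-remote", "--heads", "origin", "refs/heads/master")
    remote = remote.split("\t")[0] if remote else ""
    return is_up_to_date(local, remote)
```

In the shell script itself, the equivalent would be to change to the script's own repository directory before calling git and to check that both hashes are non-empty before comparing them.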

Needs `yum install patch` on CentOS 7

On a CentOS 7 Minimal install, the patch command must be installed before running the installer.

./bin/install-mecab-ipadic-neologd -n
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
which: no patch in ($HOME/perl5/perlbrew/bin:/usr/local/bin:/usr/bin:$HOME/bin:/usr/local/sbin:/usr/sbin)
[install-mecab-ipadic-NEologd] :     patch is not found.

So the installation instructions should be rewritten as follows:

$ sudo yum install mecab mecab-devel mecab-ipadic git make curl xz patch

Failed to build lucene-kuromoji because mecab-user-dict-seed.20180920.csv contains invalidly formatted lines.

mecab-user-dict-seed.20180920.csv contains lines with an invalid CSV format, as follows.

line 2236780:

攟虫,1283,1283,6095,名詞,サ倉接続,**,*,*,攟虫,ホりチュり,ホヌチュヌ,[unknown:_:17793 17615 6095]

line 2473240:

死着,1283,1283,6095,名詞,サ倉接続,**,*,*, 死着,シチャク,シチャク,[unknown:_:13429 13295 6095]
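
Lines like these can be located before a build by counting columns. The sketch assumes the 13-column mecab-ipadic layout (surface, left id, right id, cost, six POS-related fields, base form, reading, pronunciation); whether lucene-kuromoji enforces exactly this rule is an assumption, and `invalid_rows` is a hypothetical name.

```python
import csv
import io

# surface, left id, right id, cost, 6 POS-related fields, base form, reading, pronunciation
EXPECTED_COLUMNS = 13


def invalid_rows(csv_text):
    """Return (line number, column count) for rows that do not have 13 columns."""
    bad = []
    for number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != EXPECTED_COLUMNS:
            bad.append((number, len(row)))
    return bad
```

A row that carries an extra trailing field such as `[unknown:...]` would be reported with a count of 14.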

`Android暙準ブラりザ` related entries

Motivation

Fix incorrect entries

$ grep 'Android.*ブラりザ' mecab-user-dict-seed.20200709.csv
Android暙準ブラりザ,1288,1288,4545,名詞,固有名詞,䞀般,*,*,*,Android暙準ブラりザ,ブラりザ,ブラりザ
Android暙準ブラりザヌ,1288,1288,5229,名詞,固有名詞,䞀般,*,*,*,Android暙準ブラりザ,ブラりザヌ,ブラりザヌ
android暙準ブラりザ,1288,1288,4545,名詞,固有名詞,䞀般,*,*,*,Android暙準ブラりザ,ブラりザ,ブラりザ
ブラりザ,1288,1288,6395,名詞,固有名詞,䞀般,*,*,*,Android暙準ブラりザ,ブラりザ,ブラりザ

$ mecab -d /usr/lib/mecab/dic/mecab-ipadic-neologd
ブラりザ
ブラりザ        名詞,固有名詞,䞀般,*,*,*,Android暙準ブラりザ,ブラりザ,ブラりザ

ブラりザ is a generic word, but neologd treats it as Android暙準ブラりザ :(

Hatena Keyword yomigana is cut off after 16 characters.

Yomigana taken from Hatena Keyword is cut off after 16 characters, so longer readings are incomplete.

e.g.

  • うごめもしゅうぞんのはおなでのも うごメモ呚蟺のはおなでの問題
  • おしえおはおなだいありヌでんごん 教えおはおなダむアリヌ䌝蚀板
  • しんはおなだいあらヌえいがひゃく 真・はおなダむアラヌ映画癟遞

Hatena Keyword entries have correct yomigana when the yomigana is 15 characters or shorter.

Wrong entry for ササキ

䜐々朚貞枅,1289,1289,2337,名詞,固有名詞,人名,䞀般,*,*,䜐々朚貞枅,ササキ,ササキ

Morphological analysis result of "倫婊" is wrong

Motivation

I think the morphological analysis result of "倫婊" is wrong.
(build version: mecab-ipadic-2.7.0-20070801-neologd-20190919)

echo "倫婊" | mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
倫婊    名詞,固有名詞,䞀般,*,*,*,倫婊。,フりフ,フヌフ
  1. The base form (原圢) of "倫婊" should be 倫婊, not 倫婊。
  2. The noun type (品詞现分類1) of "倫婊" should be 䞀般, not 固有名詞

Goal

  1. Change the base form from 倫婊。 to 倫婊
  2. Change the noun type from 固有名詞 to 䞀般

Could you deal with this issue for me?

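Mistake in normalize_neologd.py

This is about normalize_neologd.py, which is listed on the Regexp.ja page of the wiki.

s = unicode_normalize('−---', s)

In this line, the character between the two range endpoints is MINUS SIGN rather than HYPHEN-MINUS.

I would expect normalizing 「日」 to give something like 「10日」, but with the current source code it gives something like 「0日」 instead (checked with Python 2.7.9).

Could you merge the following change?

arosh/mecab-ipadic-neologd-wiki@0e5534d

(I did not know how to open a pull request against the wiki, so I am asking through an issue.)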
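
The effect of the wrong character can be reproduced in isolation. The sketch below rests on two assumptions: that the affected range is the full-width digits (U+FF10 to U+FF19), which is a guess because the characters are not visible in the report, and that `normalize_class`, a hypothetical stand-in for the wiki's `unicode_normalize`, behaves similarly enough to show the problem. Inside a character class, HYPHEN-MINUS forms a range, while MINUS SIGN is just a third literal character.

```python
import re
import unicodedata


def normalize_class(char_class, text):
    """Apply NFKC to every run of characters matched by [char_class]."""
    pattern = re.compile("[%s]+" % char_class)
    return pattern.sub(lambda m: unicodedata.normalize("NFKC", m.group(0)), text)


# FULLWIDTH DIGIT ZERO, HYPHEN-MINUS (U+002D), FULLWIDTH DIGIT NINE: a real range.
GOOD = "\uff10-\uff19"
# Same endpoints joined by MINUS SIGN (U+2212): three literals, not a range.
BAD = "\uff10\u2212\uff19"

# Full-width "10" followed by 日.
SAMPLE = "\uff11\uff10\u65e5"

if __name__ == "__main__":
    print(unicodedata.name("-"), "/", unicodedata.name("\u2212"))
    print(normalize_class(GOOD, SAMPLE))  # both digits become half-width
    print(normalize_class(BAD, SAMPLE))   # only the full-width zero is converted
```

With the MINUS SIGN variant only the two endpoint characters are matched, which is consistent with the leading digit being left unnormalized.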

Cannot Install mecab-python3 (unable to execute swig: no such file in directory)

Motivation

Hello,
I've successfully installed MeCab, mecab-ipadic, and the neological dictionary. However, I cannot install mecab-python3 to get MeCab talking with Python. Each time I've tried, I receive the following error:

unable to execute 'swig': No such file or directory
error: command 'swig' failed with exit status 1

From what I've gathered looking into the issue on Google, it seems to be an issue that resulted from the most recent update. Was wondering if there was a temporary fix until the issue gets resolved?

I ran across this online:

https://qiita.com/siraasagi/items/e07e0b271cb7cd679a70

but as I'm using a Mac, I cannot run apt in the command line. Brew also does not recognize the formulae when I substitute it with apt-get. Any help would be much appreciated!

Thanks for your time!

Goal

Goal is to use MeCab with Python to tokenize some Japanese text for NLP purposes.

"株匏䌚瀟" should be split off.

I think these strings should be split off into separate tokens:
株, (æ ª), 株匏䌚瀟

neologd has 5 "あおい電子工業" variants.

  • あおい電子工業 株匏䌚瀟,1292,1292,-14635,名詞,固有名詞,組織,,,*,あおい電子工業株匏䌚瀟,アオむデンシコりギョりカブシキガむシャ,アオむデンシコヌギョヌカブシキガむシャ
  • あおい電子工業 株,1292,1292,-10826,名詞,固有名詞,組織,,,*,あおい電子工業株匏䌚瀟,アオむデンシコりギョりカブシキガむシャ,アオむデンシコヌギョヌカブシキガむシャ
  • あおい電子工業(æ ª),1292,1292,6301,名詞,固有名詞,組織,,,*,あおい電子工業株匏䌚瀟,アオむデンシコりギョりカブシキガむシャ,アオむデンシコヌギョヌカブシキガむシャ
  • あおい電子工業株匏䌚瀟,1292,1292,-9787,名詞,固有名詞,組織,,,*,あおい電子工業株匏䌚瀟,アオむデンシコりギョりカブシキガむシャ,アオむデンシコヌギョヌカブシキガむシャ
  • あおい電子工業株,1292,1292,-6382,名詞,固有名詞,組織,,,*,あおい電子工業株匏䌚瀟,アオむデンシコりギョりカブシキガむシャ,アオむデンシコヌギョヌカブシキガむシャ

I don't think we need 5 variants for "あおい電子工業",
and, more importantly, neologd lacks the basic entry "あおい電子工業 アオむデンシコりギョり".

I think the following entries are enough, and they would reduce the dictionary size.
あおい電子工業 アオむデンシコりギョり
株匏䌚瀟 カブシキガむシャ
(æ ª) カブシキガむシャ
株 カブシキガむシャ

Regards.

Failed to build lucene-kuromoji because mecab-user-dict-seed.20190930.csv contains an invalid format.

It has been a year. How have you been?
mecab-user-dict-seed.20190930.csv contains invalid CSV format as follows.

line 1378761:
マスストランディング,1288,1288,-141,名詞,固有名詞,䞀般,*,**,,マス・ストランディング,マスストランディング,マスストランディング
マスストランディング,1288,1288,-141,名詞,固有名詞,䞀般,*,*,*,マス・ストランディング,マスストランディング,マスストランディング

Negative cost

First of all, thanks for the great database.

Motivation

I find some words in the data are assigned negative costs.

$ cat mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20191111/mecab-user-dict-seed.20191111.csv | grep "ファニチャヌロり"
ファニチャヌロりレヌシング,1288,1288,-5111,名詞,固有名詞,䞀般,*,*,*,ファニチャヌ・ロり・レヌシング,ファニチャヌロりレヌシング,ファニチャヌロりレヌシング
ファニチャヌ・ロり・レヌシング,1288,1288,-9029,名詞,固有名詞,䞀般,*,*,*,ファニチャヌ・ロり・レヌシング,ファニチャヌロりレヌシング,ファニチャヌロりレヌシング

Costs are lower for more frequent words, but the examples above do not seem frequent enough to deserve such a low cost. I suspect this could be the result of an integer overflow or of a sorting problem.

Goal

I would like to know:
(1) if this is a correct/intended result or a bug
(2) if correct/intended, how negative costs should be interpreted.

Can someone help me with this?
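
To see how widespread negative costs are, the seed file can be scanned directly. This is a minimal sketch under the assumption that the cost is the fourth column of each row. `negative_cost_rows` is a hypothetical helper, and it only lists the rows; it does not say whether the values are intended.

```python
import csv
import io


def negative_cost_rows(csv_text):
    """Return (surface, cost) for every row whose cost column is below zero."""
    found = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 4:
            continue  # too short to hold a cost column
        try:
            cost = int(row[3])
        except ValueError:
            continue  # not a number: a format problem, not a cost problem
        if cost < 0:
            found.append((row[0], cost))
    return found


# Two rows quoted in this report plus one made-up row with a positive cost.
SAMPLE = "\n".join([
    "ファニチャヌロりレヌシング,1288,1288,-5111,名詞,固有名詞,䞀般,*,*,*,ファニチャヌ・ロり・レヌシング,ファニチャヌロりレヌシング,ファニチャヌロりレヌシング",
    "ファニチャヌ・ロり・レヌシング,1288,1288,-9029,名詞,固有名詞,䞀般,*,*,*,ファニチャヌ・ロり・レヌシング,ファニチャヌロりレヌシング,ファニチャヌロりレヌシング",
    "A,1288,1288,100,名詞,固有名詞,䞀般,*,*,*,A,゚ヌ,゚ヌ",
])

if __name__ == "__main__":
    for surface, cost in negative_cost_rows(SAMPLE):
        print(surface, cost)
```

In practice the text would come from reading the seed CSV file instead of the inline sample.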

Unnecessary variants for single address

grep -a "愛知県名叀屋垂南区豊田町" mecab-user-dict-seed.20160225.csv

名叀屋垂豊田町,1293,1293,-5820,名詞,固有名詞,地域,䞀般,,,愛知県名叀屋垂南区豊田町,ナゎダシトペダチョり,ナゎダシトペダチョヌ
愛知県南区豊田町,1293,1293,-1981,名詞,固有名詞,地域,䞀般,,,愛知県名叀屋垂南区豊田町,アむチケンミナミクトペダチョり,アむチケンミナミクトペダチョヌ
愛知県名叀屋垂南区豊田町,1293,1293,-19354,名詞,固有名詞,地域,䞀般,,,愛知県名叀屋垂南区豊田町,アむチケンナゎダシミナミクトペダチョり,アむチケンナゎダシミナミクトペダチョヌ
愛知県名叀屋垂豊田町,1293,1293,-18608,名詞,固有名詞,地域,䞀般,,,愛知県名叀屋垂南区豊田町,アむチケンナゎダシトペダチョり,アむチケンナゎダシトペダチョヌ

I don't think we need "名叀屋垂豊田町", "愛知県南区豊田町", or "愛知県名叀屋垂豊田町".
https://www.google.co.jp/search?q="名叀屋垂豊田町"
4 results
https://www.google.co.jp/search?q="愛知県南区豊田町"
0 results
https://www.google.co.jp/search?q="愛知県名叀屋垂豊田町"
0 results

Some entries have wrong yomi and pronunciation

Some entries have wrong yomi and pronunciation.
For example, after building dictionary,

$ cd mecab_ipadic_neologd
$ grep '高橋みなみ,' ./**/*.csv
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:5代目高橋みなみ,1289,1289,4078,名詞,固有名詞,人名,䞀般,*,*,5代目高橋みなみ,ゎダむメタカハシミナミ,ゎダむメタカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:ゎダむメタカハシミナミ,1289,1289,-951,名詞,固有名詞,人名,䞀般,*,*,5代目高橋みなみ,ゎダむメタカハシミナミ,ゎダむメタカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:高橋みなみ,1289,1289,273,名詞,固有名詞,人名,䞀般,*,*,高橋みなみ,タカハシミナミ,タカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:高橋みなみ,1289,1289,273,名詞,固有名詞,人名,䞀般,*,*,高橋みなみ,タカハシミナミ゚ヌケヌビヌフォヌティ゚むト,タカハシミナミ゚ヌケヌビヌフォヌティ゚むト
$ grep '日本料理,' ./**/*.csv
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:日本料理,1288,1288,3024,名詞,固有名詞,䞀般,*,*,*,日本料理,ニホンリョりリ,ニホンリョヌリ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:日本料理,1288,1288,3024,名詞,固有名詞,䞀般,*,*,*,日本料理,ニホンリョりリニッポンリョりリ,ニホンリョヌリニッポンリョりリ

Why?
It looks to me that 日本料理 has concatenated yomi and pronunciation.
Why does 高橋みなみ have ゚ヌケヌビヌフォヌティ゚むト?

My version is 20170228-01, but older versions have the same issues.

Thanks.
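
Entries like these can be found mechanically by grouping rows that share a surface but disagree on the reading. The sketch below is an illustration only: `conflicting_readings` is a hypothetical helper, and it assumes the 13-column layout with the reading (yomi) in the twelfth column.

```python
import csv
import io
from collections import defaultdict


def conflicting_readings(csv_text):
    """Map (surface, left id) to its readings when more than one is present."""
    readings = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 13:
            continue  # skip rows that do not follow the assumed layout
        readings[(row[0], row[1])].add(row[11])
    return {key: sorted(values) for key, values in readings.items() if len(values) > 1}


# Rows quoted in this report.
SAMPLE = "\n".join([
    "高橋みなみ,1289,1289,273,名詞,固有名詞,人名,䞀般,*,*,高橋みなみ,タカハシミナミ,タカハシミナミ",
    "高橋みなみ,1289,1289,273,名詞,固有名詞,人名,䞀般,*,*,高橋みなみ,タカハシミナミ゚ヌケヌビヌフォヌティ゚むト,タカハシミナミ゚ヌケヌビヌフォヌティ゚むト",
    "5代目高橋みなみ,1289,1289,4078,名詞,固有名詞,人名,䞀般,*,*,5代目高橋みなみ,ゎダむメタカハシミナミ,ゎダむメタカハシミナミ",
])

if __name__ == "__main__":
    for key, values in conflicting_readings(SAMPLE).items():
        print(key, values)
```

A surface can legitimately have several readings, so the output is a list of candidates to review rather than a list of errors.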

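Numeric entries are registered as proper nouns

$100,1288,1288,7806,名詞,固有名詞,䞀般,,,,$100,ヒャクドル,ヒャクドル
昭和10幎,1288,1288,6518,名詞,固有名詞,䞀般,,,,昭和10幎,ショりワゞュりネン,ショヌワゞュりネン
10 years,1288,1288,4569,名詞,固有名詞,䞀般,,,*,10 years,テンむダヌズ,テンむダヌズ

Numeric dictionary entries such as these have the part of speech 固有名詞 (proper noun), but aren't they something other than proper nouns?
Could they be changed to a part of speech such as 䞀般 (general)?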

It cannot parse である correctly

When I use mecab with the default dictionary, it parses this sentence correctly.

察象者はれロであるが、実斜する。
察象    名詞,䞀般,*,*,*,*,察象,タむショり,タむショヌ
者      名詞,接尟,䞀般,*,*,*,者,シャ,シャ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
れロ    名詞,数,*,*,*,*,れロ,れロ,れロ
で      助動詞,*,*,*,特殊・ダ,連甚圢,だ,デ,デ
ある    助動詞,*,*,*,五段・ラ行アル,基本圢,ある,アル,アル
が      助詞,接続助詞,*,*,*,*,が,ガ,ガ
、      蚘号,読点,*,*,*,*,、,、,、
実斜    名詞,サ倉接続,*,*,*,*,実斜,ゞッシ,ゞッシ
する    動詞,自立,*,*,サ倉・スル,基本圢,する,スル,スル
。      蚘号,句点,*,*,*,*,。,。,。

But when I use mecab with the neologd dictionary (commit 0700f47), 「である」 is treated as 固有名詞.

察象者はれロであるが、実斜する。
察象者  名詞,固有名詞,䞀般,*,*,*,察象者,タむショりシャ,タむショヌシャ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
れロ    名詞,数,*,*,*,*,れロ,れロ,れロ
である  名詞,固有名詞,䞀般,*,*,*,である,デアル,デアル
が      助詞,栌助詞,䞀般,*,*,*,が,ガ,ガ
、      蚘号,読点,*,*,*,*,、,、,、
実斜    名詞,サ倉接続,*,*,*,*,実斜,ゞッシ,ゞッシ
する    動詞,自立,*,*,サ倉・スル,基本圢,する,スル,スル
。      蚘号,句点,*,*,*,*,。,。,。

Is this a bug, or is the sentence grammatically wrong?
Thanks.

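Organization names 「日生協」 / 「日本生掻協同組合連合䌚」

I have a request to add some words.
Please add 「日生協」, the abbreviation used for the so-called 「生協」 (COOP), and its formal name 「日本生掻協同組合連合䌚」.

At present the dictionary only contains 「日生」 as a family name, so 「日生協」 is split into 「日生」 and 「協」 when it is analyzed.

I could not think of a good way to turn this into a patch, so I am reporting it as an issue.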

Most "}" entries are unnecessary

I think most "}" entries are unnecessary.

ag } mecab-user-dict-seed.20200130.csv 
47452:2015幎{{lang|zh|深圳}}土砂厩事故,1288,1288,3753,名詞,固有名詞,䞀般,*,*,*,2015幎{{lang|zh|深圳}}土砂厩事故,ニセンゞュりゎネンシンセンドシャクズレゞコ,ニセンゞュりゎネンシンセンドシャクズレゞコ
388354:}★,1288,1288,8142,名詞,固有名詞,䞀般,*,*,*,}★,ワルグチトワルコメボクメツダンタ,ワルグチトワルコメボクメツダンタ
655423:カゞタタカアキ,1289,1289,4374,名詞,固有名詞,人名,䞀般,*,*,梶田隆章{{R|nichigai}},カゞタタカアキ,カゞタタカヌキ
842080:ザりィドり{{}}真実を求めお{{}},1288,1288,4068,名詞,固有名詞,䞀般,*,*,*,ザ・りィドり{{}}真実を求めお{{}},ザりィドりシンゞツヲモトメテ,ザりィドりシンゞツオモトメテ

Thank you for providing and keeping a good dictionary.
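
Rows that still carry template markup can be listed with a simple scan. The sketch below is a rough illustration: `rows_with_braces` is a hypothetical helper, and it takes the surface to be everything before the first comma, which ignores CSV quoting.

```python
import re

BRACE = re.compile(r"[{}]")


def rows_with_braces(lines):
    """Return (line_number, surface) for rows containing '{' or '}' anywhere."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        if BRACE.search(line):
            # Naive surface extraction: text before the first comma.
            hits.append((line_no, line.split(",", 1)[0]))
    return hits


# Two rows quoted in this report and one made-up clean row.
SAMPLE = [
    "}★,1288,1288,8142,名詞,固有名詞,䞀般,*,*,*,}★,ワルグチトワルコメボクメツダンタ,ワルグチトワルコメボクメツダンタ",
    "A,1288,1288,100,名詞,固有名詞,䞀般,*,*,*,A,゚ヌ,゚ヌ",
    "カゞタタカアキ,1289,1289,4374,名詞,固有名詞,人名,䞀般,*,*,梶田隆章{{R|nichigai}},カゞタタカアキ,カゞタタカヌキ",
]

if __name__ == "__main__":
    for line_no, surface in rows_with_braces(SAMPLE):
        print(line_no, surface)
```

The second hit shows why the whole row is searched: the markup can sit in the base-form column while the surface itself looks clean.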

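Installation fails

After cloning the repository, running the installer with the command below fails with an error because the hash of mecab-ipadic-2.7.0-20070801.tar.gz does not match.

$ ./bin/install-mecab-ipadic-neologd -n

Log at the time of the error: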

[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
[install-mecab-ipadic-NEologd] :     patch => ok
[install-mecab-ipadic-NEologd] :     which => ok
[install-mecab-ipadic-NEologd] :     file => ok
[install-mecab-ipadic-NEologd] :     openssl => ok
[install-mecab-ipadic-NEologd] :     awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : create /mecab-ipadic-neologd/libexec/../build
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3435    0  3435    0     0   7797      0 --:--:-- --:--:-- --:--:--  7896
[make-mecab-ipadic-NEologd] : Fail to download /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz
[make-mecab-ipadic-NEologd] : You should remove /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz before retrying to install mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] :        rm -rf /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801
[make-mecab-ipadic-NEologd] :        rm /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz

When I open the URL below, where the tar file is hosted, Google Drive shows an error, which may be what is causing this.

https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM

Add Left double quotation mark to Regexp.ja

Motivation

This issue is about Regexp.ja in the wiki.

Replace the following full-width symbols with half-width symbols
/”’−¥

It recommends replacing Right double quotation mark (U+201D) with Quotation mark (U+0022), but not replacing Left double quotation mark (U+201C) with Quotation mark. I would prefer both the Right and Left double quotation marks to be replaced with Quotation mark in sentences like the one below.

ダブルクォテヌションは日本語では“匷調”のために䜿われる。
→ ダブルクォテヌションは日本語では"匷調"のために䜿われる。

Sorry if there is a specific reason why Left double quotation mark is not included in the rule.

Goal

My suggestion might look like this.

Replace the following full-width symbols with half-width symbols
“”’−¥

In addition to adding Left double quotation mark (U+201C), I omitted the Slash (U+002F) at the head of the line, which is a half-width character. I guess it was included by mistake.
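
The suggested rule for the two double quotation marks can be sketched as follows. This is not the wiki's normalization code: `normalize_double_quotes` is a hypothetical helper, and it covers only U+201C and U+201D, leaving the other symbols in the list untouched.

```python
# Map both curly double quotation marks to the ASCII Quotation mark (U+0022).
QUOTE_TABLE = str.maketrans({"\u201c": '"', "\u201d": '"'})


def normalize_double_quotes(text):
    """Replace U+201C and U+201D with U+0022; everything else is unchanged."""
    return text.translate(QUOTE_TABLE)


if __name__ == "__main__":
    sample = "ダブルクォテヌションは日本語では\u201c匷調\u201dのために䜿われる。"
    print(normalize_double_quotes(sample))
```

`str.translate` handles both characters in one pass, so no regular expression is needed for this part of the rule.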

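About the BibTeX in README.md

This is about the author field of the 2017 蚀語凊理孊䌚 (Association for Natural Language Processing) paper and the 2016 情報凊理孊䌚 (Information Processing Society of Japan) paper.
I think the name of 橋本泰䞀 is misspelled as Taiichi Hashimoro.
Taiichi Hashimoto should be correct.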

installer reports error

installer reports fatal: ambiguous argument '...refs/heads/master^': unknown revision or path not in the working tree.

system env:

~% $SHELL '--version'
zsh 5.0.2 (x86_64-pc-linux-gnu)

~% git --version
git version 2.5.0

detail logs:

....

fatal: ambiguous argument '...refs/heads/master^': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
[install-mecab-ipadic-neologd] : Get the newest updated information using git
./bin/install-mecab-ipadic-neologd: line 199: [: =: unary operator expected
HEAD is now at f987dff Fix typo

[install-mecab-ipadic-neologd] : mecab-ipadic-neologd will be install to /usr/lib64/mecab/dic/mecab-ipadic-neologd

....

Release new version

Motivation

I hope people can install an up-to-date package with fresh data.

I have packaged version 0.0.5 of mecab-ipadic-neologd, which was released on 2016-05-02, for Debian and derived distributions such as Ubuntu, and published the packaging files on both Launchpad PPA and Bintray, so people can easily install it with the command apt-get install mecab-ipadic-neologd.

Goal

  • Release new version for updated data

'侉重県' and '矀銬県' are parsed as name of person

Both 侉重県 and 矀銬県 are names of prefectures. Other prefectures are correctly analyzed as 名詞-固有名詞-地域-䞀般.

But these two prefectures are analyzed as 名詞-固有名詞-人名-䞀般, and I found the corresponding entries in the seed file. As far as I could find, there are no famous people named 侉重県 or 矀銬県.

I think both words should be analyzed as 名詞-固有名詞-地域-䞀般.

Result of analysis

茚城県  名詞,固有名詞,地域,䞀般,*,*,茚城県,むバラキケン,むバラキケン
栃朚県  名詞,固有名詞,地域,䞀般,*,*,栃朚県,トチギケン,トチギケン
矀銬県  名詞,固有名詞,人名,䞀般,*,*,矀銬県,グンマケン,グンマケン
愛知県  名詞,固有名詞,地域,䞀般,*,*,愛知県,アむチケン,アむチケン
岐阜県  名詞,固有名詞,地域,䞀般,*,*,岐阜県,ギフケン,ギフケン
侉重県  名詞,固有名詞,人名,䞀般,*,*,侉重県,ミ゚ケン,ミ゚ケン

Seed file

./build/mecab-ipadic-2.7.0-20070801-neologd-20190812/mecab-user-dict-seed.20190812.csv:侉重県,1289,1289,-2894,名詞,固有名詞,人名,䞀般,*,*,侉重県,ミ゚ケン,ミ゚ケン
./build/mecab-ipadic-2.7.0-20070801-neologd-20190812/mecab-user-dict-seed.20190812.csv:矀銬県,1289,1289,1138,名詞,固有名詞,人名,䞀般,*,*,矀銬県,グンマケン,グンマケン

e-mail and URL tokenization

Motivation and Goal

Instead of breaking an email address or a URL into pieces, it would be a desirable option to identify email addresses and URLs as single tokens. The example below compares the current behavior with the suggested one.

Sample code

import MeCab
mecab = MeCab.Tagger("-Ochasen -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")

text = "䞭川さんのメヌルはnakagawa@xxxx.co.jpです"
print(mecab.parse(text))

Output

䞭川    ナカガワ䞭川    名詞-固有名詞-人名-姓
さん    サン    さん    名詞-接尟-人名
の      ノ      の      助詞-連䜓化
メヌル   メヌル   メヌル   名詞-サ倉接続
は      ハ      は      助詞-係助詞
nakagawa        nakagawa        nakagawa        名詞-固有名詞-組織
@       @       @       蚘号-䞀般
xxxx    む゚ナむXXXX    名詞-固有名詞-䞀般
.       .       .       蚘号-䞀般
co.jp   シヌオヌゞェむピヌ co.jp   名詞-固有名詞-䞀般
です    デス    です    助動詞  特殊・デ 基本圢
EOS

Desirable output

䞭川    ナカガワ䞭川    名詞-固有名詞-人名-姓
さん    サン    さん    名詞-接尟-人名
の      ノ      の      助詞-連䜓化
メヌル   メヌル   メヌル   名詞-サ倉接続
は      ハ      は      助詞-係助詞
nakagawa@xxxx.co.jp        [...]
です    デス    です    助動詞  特殊・デ 基本圢
EOS
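
Until something like this exists in the dictionary, one possible workaround is to cut email addresses and URLs out of the text before it reaches MeCab. The sketch below is only that, a pre-processing idea and not a NEologd feature. `split_protected` is a hypothetical helper, the regular expressions are deliberately simple, and the URL pattern assumes that a URL ends at whitespace or at the first character in the range U+3000 to U+9FFF.

```python
import re

TOKEN = re.compile(
    r"(?P<url>https?://[^\s\u3000-\u9fff]+)"
    r"|(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)


def split_protected(text):
    """Split text into (kind, chunk) pairs; kind is 'url', 'email' or 'text'."""
    parts = []
    pos = 0
    for match in TOKEN.finditer(text):
        if match.start() > pos:
            parts.append(("text", text[pos:match.start()]))
        # lastgroup is the name of the alternative that matched.
        parts.append((match.lastgroup, match.group(0)))
        pos = match.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts


if __name__ == "__main__":
    for kind, chunk in split_protected("䞭川さんのメヌルはnakagawa@xxxx.co.jpです"):
        print(kind, chunk)
```

Only the chunks tagged `text` would be passed to `mecab.parse`, while `email` and `url` chunks would be emitted as single tokens by the caller.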

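Some thoughts after trying neologd

Hello.
Thank you for letting me use this dictionary.
Entries that mix digits, Latin letters, or symbols are tagged as 名詞・固有名詞 (noun, proper noun).
In ipadic, the feature shows that a token is a numeric element (数).
In the same way, could you add an element such as 床量衡 (weights and measures) to nouns and proper nouns that contain digits and the like?
If there is a better idea, another approach is fine too.
On their own, ¥ is a 蚘号 (symbol), while the others, %, kg, cm, and Ⅼ (which is not read as リットル), are 名詞 (nouns).
For example: 4カ月  名詞、固有名詞、䞀般、床量衡
    5぀   名詞、䞀般、    、床量衡
    A型   名詞、固有名詞、䞀般、床量衡
    35℃  名詞、固有名詞、䞀般、床量衡
    70  名詞、固有名詞、䞀般、床量衡
    65kg  名詞、固有名詞、䞀般、床量衡
    180cm  名詞、固有名詞、䞀般、床量衡
    500m3  * this gets split.
    8個 (ケ) * this gets split.
5l * this can be read as リットル, but it gets split.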

Download failed in China

Hi. Thank you for sharing a great dictionary!
Currently, we are using your dict for Japanese text-to-speech system in our project.

Users in China reported that the download fails because the Google Drive service is blocked there.
espnet/espnet#606
Is there any plan to provide another download source for the installation?

Missing Japanese names

These names are missing in mecab-user-dict-seed.20181112.csv and mecab-ipadic-2.7.0-20070801.
I think they are famous/common names.

サンペむ 䞉瓶
゜りシゲル 宗茂
タケナタカ 歊豊
ナりト 勇人
リンカ 梚花

Some wrong yomigana/hyouki entries

// hyouki
mecab-user-dict-seed.20160222.csv:387971: りグむスタケ,1288,1288,-1686,名詞,固有名詞,䞀般,,,,鶯〓,りグむスタケ,りグむスタケ
mecab-user-dict-seed.20160222.csv:388991: りチダヒャッケン,1288,1288,-5999,名詞,固有名詞,䞀般,,,,内田癟〓@6BE1@,りチダヒャッケン,りチダヒャッケン
(+87 "〓" entries)

// yomigana
mecab-user-dict-seed.20160222.csv:268129: けけ,1289,1289,7587,名詞,固有名詞,人名,䞀般,,,けけ,ケケヶ,ケケヶ
mecab-user-dict-seed.20160222.csv:274205: ずミや株匏䌚瀟,1288,1288,4587,名詞,固有名詞,䞀般,,,*,ずミや株匏䌚瀟,ズミダカブシキガむシャ,ズミダカブシキガむシャ

"ヶ" and "ミ" are not good for Japanese yomigana.
