kensuke-mitsuzawa / japanesetokenizers
Aims to make JapaneseTokenizer as easy to use as possible
License: MIT License
It would be better to process a dummy text right after the package initializes a jumanpp process.
It would also be better to add an automated restart procedure to the jumanpp process handler.
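A minimal sketch of what such a handler could look like. It is not the package's actual implementation: `RestartingHandler`, `start_process`, `warmup_text` and `max_restarts` are hypothetical names, and the wrapped process object is assumed to expose a `query(text)` method that raises `TimeoutError` when the process hangs.

```python
class RestartingHandler:
    """Wraps a process-like object, warms it up, and restarts it on timeout.

    Hypothetical sketch: `start_process` is any zero-argument callable that
    returns an object with a `query(text)` method.
    """

    def __init__(self, start_process, warmup_text="テスト", max_restarts=1):
        self._start_process = start_process
        self._warmup_text = warmup_text
        self._max_restarts = max_restarts
        self._proc = self._launch()

    def _launch(self):
        proc = self._start_process()
        # Send a dummy text right after start-up so the first real call
        # does not pay the initialization cost.
        proc.query(self._warmup_text)
        return proc

    def query(self, text):
        for attempt in range(self._max_restarts + 1):
            try:
                return self._proc.query(text)
            except TimeoutError:
                if attempt == self._max_restarts:
                    raise  # give up after the allowed number of restarts
                self._proc = self._launch()
```

Keeping the warm-up inside `_launch` means every restarted process is warmed up the same way as the first one.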
Seems like this is lacking macOS support?
I installed with
pip install JapaneseTokenizer
make install
make install_neologd
During make install
I received the following error:
install_tokenizers.sh: line 89: ldconfig: command not found
And during make install_neologd
I got:
[install-mecab-ipadic-NEologd] : unxz is not found.
make: *** [install_neologd] Error 1
And while trying to run the example starter code, I got
[Y/12/06 15:03:43]ERROR - mecab_wrapper.py#__CallMecab:137: ('',)
[Y/12/06 15:03:43]ERROR - mecab_wrapper.py#__CallMecab:138: Possibly Path to userdict is invalid. Check the path
Traceback (most recent call last):
File "test.py", line 7, in <module>
mecab_wrapper = JapaneseTokenizer.MecabWrapper(dictType='ipadic')
File "/Users/rpryzant/kana/venv/lib/python2.7/site-packages/JapaneseTokenizer/mecab_wrapper/mecab_wrapper.py", line 45, in __init__
self.mecabObj = self.__CallMecab()
File "/Users/rpryzant/kana/venv/lib/python2.7/site-packages/JapaneseTokenizer/mecab_wrapper/mecab_wrapper.py", line 139, in __CallMecab
raise subprocess.CalledProcessError(returncode=-1, cmd="Failed to initialize Mecab object")
subprocess.CalledProcessError: Command 'Failed to initialize Mecab object' returned non-zero exit status -1
I am running macOS 10.13.1
https://stackoverflow.com/questions/53718267/module-import-issue-with-a-japanese-tokenizer
This issue occurs because the author of pyknp removed the juman++ module from the pyknp package.
However, the module still exists in pyknp 0.3.
So the setup.py script should install pyknp==0.3.
Hi, how do we segment a paragraph of Japanese text into sentences?
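One simple way to approach this, independent of this package, is to split on sentence-final punctuation with a regular expression. The sketch below is an assumption about what is wanted, not a feature of JapaneseTokenizer; `split_sentences` is a hypothetical helper, and it does not handle terminators inside quotations such as 「…。」.

```python
import re

# A sentence is a run of non-terminator characters followed by any
# number of terminators (full-width or ASCII).
_SENTENCE = re.compile(r'[^。!?!?]+[。!?!?]*')


def split_sentences(paragraph):
    """Split Japanese text into sentences on 。!? (and ASCII !?)."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


for sentence in split_sentences("今日は晴れ。明日は雨?たぶん"):
    print(sentence)
```

Each resulting sentence can then be passed to a tokenizer separately.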
This error message is too ambiguous:
Exception: You could not call neologd dictionary bacause you do NOT install the package neologdn.
It should tell the user how to fix the problem:
Exception: You could not call neologd dictionary bacause you do NOT install the package neologdn. run pip install neologdn
Thank you for including Mykytea! I added Python 3 support to Mykytea-python. Could you support kytea with Python 3?
I tried to implement a python3 version, but I can't install jctconv (ikegami-yukino/jctconv#3), so I haven't created a patch yet.
MeCab_wrap.cxx:8434:80: error: ‘MECAB_ONE_BEST’ was not declared in this scope
MeCab_wrap.cxx:8435:77: error: ‘MECAB_NBEST’ was not declared in this scope
MeCab_wrap.cxx:8436:79: error: ‘MECAB_PARTIAL’ was not declared in this scope
MeCab_wrap.cxx:8437:85: error: ‘MECAB_MARGINAL_PROB’ was not declared in this scope
MeCab_wrap.cxx:8438:83: error: ‘MECAB_ALTERNATIVE’ was not declared in this scope
MeCab_wrap.cxx:8439:82: error: ‘MECAB_ALL_MORPHS’ was not declared in this scope
MeCab_wrap.cxx:8440:89: error: ‘MECAB_ALLOCATE_SENTENCE’ was not declared in this scope
MeCab_wrap.cxx:8441:84: error: ‘MECAB_ANY_BOUNDARY’ was not declared in this scope
MeCab_wrap.cxx:8442:86: error: ‘MECAB_TOKEN_BOUNDARY’ was not declared in this scope
MeCab_wrap.cxx:8443:84: error: ‘MECAB_INSIDE_TOKEN’ was not declared in this scope
error: Setup script exited with error: command 'gcc' failed with exit status 1
The exact cause is unknown; it is most likely because the jumanpp server script is not stable.
Proposal: put the jumanpp server in this package and handle the error like this:
if timeout:
    try to start the jumanpp server again
else:
    raise the exception
result = self.juman.analysis(input_str)
File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 128, in analysis
return self.juman(input_str)
File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 121, in juman
result = MList(self.juman_lines(input_str))
File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 116, in juman_lines
return self.socket.query(input_str, pattern=self.pattern)
File "/usr/local/lib/python2.7/dist-packages/pyknp/juman/juman.py", line 41, in query
return recv.strip().decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
Users expect the stopword list to be matched against the root form of each word;
however, the argument is currently matched against the surface form.
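A minimal sketch of a filter that checks a stopword against both forms. The `Token` class and `filter_stopwords` function are hypothetical stand-ins for illustration, not the package's own objects.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Token:
    """Hypothetical token with both forms of a word."""
    surface: str  # the form as it appears in the text, e.g. 'あっ'
    stem: str     # the root (dictionary) form, e.g. 'ある'


def filter_stopwords(tokens: Iterable[Token], stopwords: Iterable[str]) -> List[Token]:
    """Drop a token when either its surface form or its root form is a stopword."""
    stop = set(stopwords)
    return [t for t in tokens if t.surface not in stop and t.stem not in stop]


tokens = [Token('あっ', 'ある'), Token('た', 'た'), Token('SMAP', 'SMAP')]
print(filter_stopwords(tokens, ['ある', 'SMAP']))
```

Matching on both forms means a user who passes root forms and a user who passes surface forms both get the behavior they expect.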
Traceback (most recent call last):
File "/Users/kensuke-mi/Desktop/analysis_work/fuman-ds-py-fuman2vector/job_scripts/train_word2vec_jumanpp.py", line 55, in <module>
port=config_obj.get('Tokenizer', 'jumanpp_port'))
File "/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper_python3.py", line 73, in __init__
self.jumanpp_obj = JumanppClient(hostname=server, port=port, timeout=timeout)
File "/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/lib/python3.5/site-packages/JapaneseTokenizer/jumanpp_wrapper/jumanpp_wrapper_python3.py", line 30, in __init__
self.sock.connect((hostname, port))
TypeError: an integer is required (got type str)
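The traceback suggests that the port was read from a config file with `get`, which returns a string, while `socket.connect` needs an integer port. A short sketch of the difference using the standard-library `configparser`; the section and option names follow the traceback, while the host name and port value are made up for illustration.

```python
import configparser

config_obj = configparser.ConfigParser()
# Illustrative config content; the real file is not shown in the report.
config_obj.read_string(
    "[Tokenizer]\n"
    "jumanpp_server = localhost\n"
    "jumanpp_port = 12000\n"
)

port_as_str = config_obj.get('Tokenizer', 'jumanpp_port')     # a str
port_as_int = config_obj.getint('Tokenizer', 'jumanpp_port')  # an int

print(type(port_as_str).__name__, type(port_as_int).__name__)
```

Passing the `getint` result (or wrapping the `get` result in `int(...)`) avoids the `TypeError` on the caller's side; the wrapper could also coerce the port itself.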
I encountered this error when I tried to install the package on macOS Mojave.
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include -arch x86_64 -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include -arch x86_64 -I/usr/local/Cellar/mecab/0.996/include -I/Users/kensuke-mi/.pyenv/versions/anaconda3-5.3.1/include/python3.7m -c MeCab_wrap.cpp -o build/temp.macosx-10.7-x86_64-3.7/MeCab_wrap.o
warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
MeCab_wrap.cpp:3051:10: fatal error: 'stdexcept' file not found
#include <stdexcept>
^~~~~~~~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
The main cause of the error is that the C/C++ compiler is outdated,
so updating the C/C++ compiler resolves it:
brew install gcc
ln -s /usr/local/bin/gcc-8 /usr/local/bin/gcc
ln -s /usr/local/bin/g++-8 /usr/local/bin/g++
Then add the following line to ~/.bash_profile:
export PATH=$PATH:/usr/local/bin
and reload it:
source ~/.bash_profile
The setup fails because neologdn fails to compile.
Complete output from command /Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/sp/z0_0lktj7nn2s31db2dt5md40000gq/T/pip-build-_kqnspts/neologdn/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/sp/z0_0lktj7nn2s31db2dt5md40000gq/T/pip-hylasjm2-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_ext
building 'neologdn' extension
creating build
creating build/temp.macosx-10.6-x86_64-3.5
/usr/bin/clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/include -arch x86_64 -I/Users/kensuke-mi/.pyenv/versions/anaconda3-4.0.0/include/python3.5m -c neologdn.cpp -o build/temp.macosx-10.6-x86_64-3.5/neologdn.o -std=c++11
neologdn.cpp:255:10: fatal error: 'unordered_map' file not found
#include <unordered_map>
^
1 error generated.
error: command '/usr/bin/clang' failed with exit status 1
Proposal: skip installing neologdn when it fails to compile.
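On the runtime side, one way to tolerate a missing neologdn is an optional import. This is a sketch under assumptions, not the package's code: `normalize_text` and `require_neologdn` are hypothetical names, and `neologdn.normalize` is only called when the import succeeded.

```python
try:
    import neologdn  # optional: needs a C++11-capable compiler to build
except ImportError:
    neologdn = None


def normalize_text(text, require_neologdn=False):
    """Normalize with neologdn when available, otherwise return text unchanged."""
    if neologdn is not None:
        return neologdn.normalize(text)
    if require_neologdn:
        # Tell the user exactly how to fix the problem.
        raise RuntimeError(
            "neologdn is not installed; run `pip install neologdn`."
        )
    return text


print(normalize_text("abc"))
```

With this shape, a failed neologdn build only disables the neologd-specific normalization instead of breaking the whole installation.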
For convenience, it would be great to convert POS tags into the universal tagset.
Tagset conversion table:
https://universaldependencies.org/tagset-conversion/ja-ipadic-uposf.html
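A rough sketch of how such a conversion could be structured. The mapping below is a small, hand-written, coarse subset chosen for illustration only; it is not copied from the linked table, which is finer grained and should be treated as the authoritative source. `COARSE_UPOS` and `to_upos` are hypothetical names.

```python
# Illustrative subset only: keys are prefixes of an IPADIC POS tuple.
COARSE_UPOS = {
    ('名詞',): 'NOUN',
    ('名詞', '固有名詞'): 'PROPN',
    ('名詞', '数'): 'NUM',
    ('名詞', '代名詞'): 'PRON',
    ('動詞',): 'VERB',
    ('形容詞',): 'ADJ',
    ('副詞',): 'ADV',
    ('助動詞',): 'AUX',
    ('感動詞',): 'INTJ',
}


def to_upos(pos):
    """Return the tag of the longest matching POS prefix, or 'X' if none matches."""
    for n in range(len(pos), 0, -1):
        tag = COARSE_UPOS.get(tuple(pos[:n]))
        if tag is not None:
            return tag
    return 'X'


print(to_upos(('名詞', '固有名詞', '人名')))
print(to_upos(('名詞', '一般')))
```

Longest-prefix lookup lets a fine-grained entry such as ('名詞', '固有名詞') override the coarse ('名詞',) entry without listing every subcategory.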
In [4]: input_sentence = '10日放送の「中居正広のミになる図書館」(テレビ朝日系)で、SMAPの中居正広が、篠原信一の過去の勘違いを明かす一幕があった。'
In [17]: mecab_wrapper.tokenize(input_sentence).filter(stopwords=['SMAP']).convert_list_object()
Out[17]:
['1',
'0',
'日',
'放送',
'の',
'「',
'中居',
'正広',
'の',
'ミ',
'に',
'なる',
'図書館',
'」',
'(',
'テレビ朝日',
'系',
')',
'で',
'、',
'SMAP',
'の',
'中居',
'正広',
'が',
'、',
'篠原',
'信一',
'の',
'過去',
'の',
'勘違い',
'を',
'明かす',
'一幕',
'が',
'ある',
'た',
'。']
'SMAP' was passed as a stopword,
but it still appears in the output list.
Maintaining separate python2 and python3 files is costly.
Mecab -> use a different python package depending on the python version
juman & jumanpp & kytea -> put the python2 and python3 code into the same file
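For the Mecab part, the package choice could be made from the interpreter version, for example inside setup.py. This is a sketch under the assumption that the two bindings are the `mecab-python` and `mecab-python3` distributions that appear in the install logs in other reports here; `mecab_requirement` is a hypothetical helper.

```python
import sys


def mecab_requirement(version_info=sys.version_info):
    """Pick the MeCab binding that matches the running Python major version."""
    return 'mecab-python3' if version_info[0] >= 3 else 'mecab-python'


# The result could be appended to install_requires in setup.py.
print(mecab_requirement())
```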
Word filtering by P.O.S. does NOT work under a specific P.O.S. condition.
The failing case is pos_condition = ('名詞', '一般', )
when the P.O.S. of the word is ('名詞', '非自立', '一般').
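The report does not say how the package compares the two tuples, so the following is only a sketch of one plausible cause and fix. It assumes the intended semantics is a positional prefix match, and contrasts that with an order-insensitive membership check, which would accept the case above. Both function names are hypothetical.

```python
def pos_matches_unordered(condition, token_pos):
    """Naive check: every element of the condition appears somewhere in the POS."""
    return set(condition) <= set(token_pos)


def pos_matches_prefix(condition, token_pos):
    """Positional check: the condition must equal the leading elements of the POS."""
    return tuple(token_pos[:len(condition)]) == tuple(condition)


condition = ('名詞', '一般')
token_pos = ('名詞', '非自立', '一般')

print(pos_matches_unordered(condition, token_pos))  # accepts the token
print(pos_matches_prefix(condition, token_pos))     # rejects the token
```

Under the prefix interpretation, ('名詞', '一般') selects only words whose second-level category is '一般', which is usually what a hierarchical POS condition is meant to express.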
The Works Applications team released their own morphological analyzer called "Sudachi".
Sudachi has features that are quite useful for business users.
It would be convenient to be able to call it from this package.
They have also released a Python implementation of Sudachi.
Calling that package would be the easy route. The main drawback is that sudachi-py does not work on python2.x.
Sudachi-py requires the dictionary file to be deployed manually.
We would like to automate that somehow.
this is needed by Mecab3
Traceback (most recent call last):
File "generate_theme1.py", line 3, in <module>
from JapaneseTokenizer import MecabWrapper
File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/__init__.py", line 2, in <module>
from JapaneseTokenizer.juman_wrapper import JumanWrapper
File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/juman_wrapper/__init__.py", line 2, in <module>
from .juman_wrapper import JumanWrapper
File "/share/data/home/kensuke_mitsuzawa/outsource-ds-py-company-review/conda-env/lib/python3.5/site-packages/JapaneseTokenizer/juman_wrapper/juman_wrapper.py", line 11, in <module>
from pyknp import MList
ImportError: No module named 'pyknp'
Best match: neologdn 0.2.1
Processing neologdn-0.2.1.tar.gz
Writing /tmp/easy_install-mdYouk/neologdn-0.2.1/setup.cfg
Running neologdn-0.2.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-mdYouk/neologdn-0.2.1/egg-dist-tmp-9BvPzi
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for Ada/C/ObjC but not for C++ [enabled by default]
cc1plus: error: unrecognized command line option ‘-std=c++11’
error: Setup script exited with error: command 'gcc' failed with exit status 1
Mecab has been the standard POS tagger for a long time, but it takes a lot of work to install.
So the janome tagger would be a better default than mecab.
The Mecab tagger would then become a 'plugin'-style tagger.
The Travis environment could not install the boost library correctly. That causes the Jumanpp installation to fail, which in turn makes the test cases fail.
The error log is:
checking for boostlib >= 1.57... configure: We could not detect the boost libraries (version 1.57 or higher). If you have a staged boost library (still not installed) please specify $BOOST_ROOT in your environment and do not give a PATH to --with-boost option. If you are sure you have boost installed, then check your version number looking in <boost/version.hpp>. See http://randspringer.de/boost for more documentation.
This warning message appears twice:
[Y/09/29 16:54:36]WARNING - jumanpp_wrapper.py#call_juman_interface:197: Re-starting unix process because it takes longer time than 30 seconds...
[Y/09/29 16:55:06]WARNING - jumanpp_wrapper.py#call_juman_interface:197: Re-starting unix process because it takes longer time than 30 seconds...
The final exception seems to be this one:
Traceback (most recent call last):
File "/share/data/home/kensuke_mitsuzawa/fuman-ds-py-academic-service/conda-env/lib/python3.5/site-packages/p$
xpect-4.2.1-py3.5.egg/pexpect/spawnbase.py", line 150, in read_nonblocking
s = os.read(self.child_fd, size)
OSError: [Errno 5] Input/output error
MacBook-Pro% pip install JapaneseTokenizer
Requirement already satisfied (use --upgrade to upgrade): JapaneseTokenizer in /Users/kensuke-mi/Desktop/analysis_work/python_morphology_splitters
Requirement already satisfied (use --upgrade to upgrade): future in /Users/kensuke-mi/.pyenv/versions/3.5.1/lib/python3.5/site-packages/future-0.15.2-py3.5.egg (from JapaneseTokenizer)
Requirement already satisfied (use --upgrade to upgrade): six in /Users/kensuke-mi/.pyenv/versions/3.5.1/lib/python3.5/site-packages (from JapaneseTokenizer)
Collecting mecab-python (from JapaneseTokenizer)
Downloading mecab-python-0.996.tar.gz (40kB)
100% |████████████████████████████████| 40kB 6.4MB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python/setup.py", line 18, in <module>
include_dirs=cmd2("mecab-config --inc-dir"),
File "/private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python/setup.py", line 10, in cmd2
return string.split (cmd1(str))
AttributeError: module 'string' has no attribute 'split'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/nq/13lcpk354h51bgkmx4q4ttr00000gp/T/pip-build-kd5ixe8l/mecab-python
You are using pip version 7.1.2, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Some methods are annotated with type hints, but the hints are wrong.
Hi,
Thank you for making this package available. When I try to install it, I get the following error. Please let me know how I can resolve it.
Collecting mecab-python3
Using cached https://files.pythonhosted.org/packages/ac/48/295efe525df40cbc2173748eb869290e81a57e835bc41f6d3834fc5dad5f/mecab-python3-0.996.1.tar.gz
Complete output from command python setup.py egg_info:
/bin/sh: mecab-config: command not found
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/rd/qqy1bpm93qj1qcmrj8624qz91pyzr5/T/pip-build-pqo_16ve/mecab-python3/setup.py", line 29, in
inc_dir = mecab_config("--inc-dir")
File "/private/var/folders/rd/qqy1bpm93qj1qcmrj8624qz91pyzr5/T/pip-build-pqo_16ve/mecab-python3/setup.py", line 27, in mecab_config
return os.popen("mecab-config " + arg).readlines()[0].split()
IndexError: list index out of range
Thank You
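The line "/bin/sh: mecab-config: command not found" in the log shows that the `mecab-config` executable, which ships with MeCab itself, is not on PATH, so the binding's setup.py has nothing to read. A small pre-flight check, as a sketch; `find_mecab_config` is a hypothetical helper.

```python
import shutil


def find_mecab_config():
    """Return the full path of mecab-config, or None when it is not on PATH."""
    return shutil.which("mecab-config")


path = find_mecab_config()
if path is None:
    print("mecab-config not found: install MeCab itself before the Python binding")
else:
    print("mecab-config found at", path)
```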
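The package runs string normalization for juman & jumanpp.
All カタカナ (katakana) is converted into 全角カタカナ (full-width katakana), and all numeric expressions are converted into 全角数字 (full-width digits).
However, 全角カタカナ & 全角数字 are not the usual way Japanese text is written.
Proposal: convert back after tokenization:
全角カタカナ (full-width katakana) -> 半角カタカナ (half-width katakana)
全角数字 (full-width digits) -> 半角数字 (half-width digits)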
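The digit half of this conversion can be sketched with a translation table from the standard library alone. `to_halfwidth_digits` is a hypothetical helper; the katakana direction is left out because voiced sounds expand to two half-width characters and need a fuller mapping.

```python
# Map full-width digits U+FF10..U+FF19 to ASCII digits 0..9.
FULL_TO_HALF_DIGITS = {ord('0') + i: ord('0') + i for i in range(10)}


def to_halfwidth_digits(token):
    """Replace full-width digits in a token with their half-width equivalents."""
    return token.translate(FULL_TO_HALF_DIGITS)


print(to_halfwidth_digits('10日'))
```

Applying this to each token after tokenization keeps the analyzer's full-width input requirement while returning the digits in the form the caller supplied.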
Installed /opt/conda/lib/python3.5/site-packages/kytea-0.1.3-py3.5-linux-x86_64.egg
Searching for pyknp
Reading https://pypi.python.org/simple/pyknp/
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
Couldn't find index page for 'pyknp' (maybe misspelled?)
No local packages or working download links found for pyknp
error: Could not find suitable distribution for Requirement.parse('pyknp')