
lancopku / pkuseg-python

6.4K stars · 209 watchers · 974 forks · 4.28 MB

The pkuseg toolkit for multi-domain Chinese word segmentation

License: MIT License

Languages: Python 66.57%, Cython 33.43%
Topics: chinese-word-segmentation

pkuseg-python's People

Contributors

honnibal, jingjingxupku, jklj077, luoruixuan, xusun26, zhangyics


pkuseg-python's Issues

model/bigram_word.txt does not exist

Trying to run the Python version:

python3 main.py test a.txt out.txt

It errors out with:

Traceback (most recent call last):
  File "main.py", line 316, in <module>
    run()
  File "main.py", line 17, in run
    testFeature = Feature(config.readFile, 'test')
  File "/Users/xilins/Personal/PKUSeg-python/feature.py", line 11, in __init__
    self.test_init()
  File "/Users/xilins/Personal/PKUSeg-python/feature.py", line 26, in test_init
    self.readBigramFeature(config.modelDir+'/bigram_word.txt')
  File "/Users/xilins/Personal/PKUSeg-python/feature.py", line 54, in readBigramFeature
    with open(file, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'model/bigram_word.txt'

Suggestions

First off, the code reads like the purely functional scripts I write myself, haha. I also write casually at school, but with 2,000+ stars this project deserves a refactor. A couple of ideas: 1. Could you put a classifier on top, so the input is classified first and the matching domain model is loaded automatically for segmentation? 2. Fuse the outputs of multiple models into the final result.

A question about fine-tuning

Does this support fine-tuning? I would like to fine-tune the pre-trained general model on another corpus.
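
A hedged sketch of what this might look like: later pkuseg READMEs document an init_model argument to pkuseg.train that initializes training from an existing model directory. Whether the installed version supports it is an assumption here, and the corpus file names are placeholders.

import pkuseg

# Assumption: init_model is supported by the installed pkuseg release;
# 'domain_train.utf8' / 'domain_test.utf8' are hypothetical corpus files.
pkuseg.train('domain_train.utf8', 'domain_test.utf8', './finetuned_model',
             nthread=20, init_model='./models/')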

Questions about model training

If I want to train a model for my own domain with the command above, pkuseg.train('train', 'test', 'model', nthread=20), I have four questions (a sketch of the expected input format follows this list):
1. Must the train and test data be in .utf8 format, or is .txt also acceptable?
2. Should the train and test data be split 8/2 by volume?
3. How many characters or words do the train/test data need before they count as sufficient?
4. Must all words in the train/test files be separated by spaces before training?
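
A minimal sketch of the expected input, as far as the README describes it: the training and test files are plain UTF-8 text, one pre-segmented sentence per line, with words separated by spaces. The file names below are placeholders.

import pkuseg

# Each line of the train/test files is one pre-segmented sentence, e.g.:
#   今天 天气 不错
# Assumption: the extension does not matter as long as the content is UTF-8 text.
pkuseg.train('train.txt', 'test.txt', './models', nthread=20)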

What causes this problem?

length = 1 : 0
length = 2 : 2496
length = 3 : 2642
length = 4 : 2568
length = 5 : 1313
length = 6 : 633
length = 7 : 249
length = 8 : 133
length = 9 : 66
length = 10 : 16
length = 11 : 6
length = 12 : 1
length = 13 : 1

start training...

reading training & test data...
done! train/test data sizes: 1/1

r: 1
iter0 diff=1.00e+100 train-time(sec)=5.64 f-score=0.06%
iter1 diff=1.00e+100 train-time(sec)=5.63 f-score=0.00%
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/__init__.py", line 324, in train
    trainer.train(config)
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 103, in train
    score_list = trainer.test(testset, i)
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 169, in test
    testset, self.model, writer
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 357, in _decode_fscore
    gold_tags, pred_tags, self.idx_to_chunk_tag
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/scorer.py", line 37, in getFscore
    pre = correct_chunk / res_chunk * 100
ZeroDivisionError: division by zero
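
The crash itself is mechanical: with the f-score at 0.00%, the model predicts no chunks, res_chunk is zero, and the precision computation divides by zero (the "train/test data sizes: 1/1" line also suggests each corpus was read as a single sample, which is worth checking). A minimal defensive sketch of the kind of guard scorer.getFscore would need, not the library's actual code:

# Guard precision/recall against empty chunk counts instead of dividing blindly.
# 'gold_chunk' is an assumed counterpart for recall; only res_chunk and
# correct_chunk appear in the traceback above.
pre = correct_chunk / res_chunk * 100 if res_chunk else 0.0
rec = correct_chunk / gold_chunk * 100 if gold_chunk else 0.0
fscore = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0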

Could you compare against LTP and HanLP on a completely fresh, held-out dataset?

1. Training on the test corpus and then evaluating on that same corpus is simply unprofessional.
2. Even if every segmentation toolkit is trained on the test corpus, you are the ones doing the tuning, and you know how to tune pkuseg best, so the final comparison is biased.

In short: if you cannot provide comparison results on completely fresh, held-out data, stop overselling the toolkit.

Which Python versions are supported?

For an open-source tool written in Python, the installation instructions astonishingly never say which Python versions are supported, and this for a tool that processes Chinese text, where Python 2.7 and Python 3.x differ so greatly in how they handle Chinese strings!

What should I do when the segmentation results are poor?

A jieba vs. pkuseg comparison:

jieba: 这个 不是 我刷 的 哦
今天 可能 不行 啊 今天 工资 没到 账 今天 是 礼拜五
你好 不是 对不起

pkuseg: 这 个 不 是 我 刷 的 哦
今天 可能 不 行 啊 今天 工资 没 到 账 今天 是 礼拜五
你 好 不 是 对 不 起

On these cases pkuseg clearly does worse than jieba. If I want pkuseg to segment cases like these better than jieba, what should I do? Test every released model, or retrain a model from scratch? (See the sketch below.)
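
A hedged sketch of the usual first step: later pkuseg releases ship domain-specific pretrained models selectable via model_name. Whether the installed release includes them, and which domain fits chat-style text best, are assumptions here.

import pkuseg

# Assumption: the installed release supports model_name; 'web' is a guess
# at the closest domain for informal chat text.
seg = pkuseg.pkuseg(model_name='web')
print(seg.cut('今天可能不行啊 今天工资没到账'))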

Real-world test on vulgar internet slang, compared with jieba

Tested the ctb8 model; it is slightly worse than jieba's default model. I suggest publishing some best-practice documentation.

print(seg.cut('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'))
['工信', '处女', '干事', '每', '月', '经过', '下属', '科室', '都', '要', '亲口', '交代', '24', '口', '交换机', '等', '技术性', '器件', '的', '安装', '工作']
print(", ".join(jieba.cut("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")))
工信处, 女干事, 每月, 经过, 下属, 科室, 都, 要, 亲口, 交代, 24, 口, 交换机, 等, 技术性, 器件, 的, 安装, 工作
print(seg.cut('贱狗奴,鸡巴套子把爸爸加上'))
['贱', '狗', '奴', ',', '鸡', '巴', '套子', '把', '爸爸', '加上']
print(", ".join(jieba.cut("贱狗奴,鸡巴套子把爸爸加上",)))
贱狗奴, ,, 鸡巴, 套子, 把, 爸爸, 加上

Just one sentence's output is enough to settle the match with jieba

pkuseg:
seg = pkuseg.pkuseg()
print(seg.cut('结婚的和尚未结婚的确实在干扰分词啊'))
['结婚', '的', '和尚', '未', '结婚', '的确', '实在', '干扰', '分词', '啊']

jieba:
print([i[0] for i in jieba.tokenize('结婚的和尚未结婚的确实在干扰分词啊')])
['结婚', '的', '和', '尚未', '结婚', '的', '确实', '在', '干扰', '分词', '啊']

Three words wrong in a single sentence; I don't know where the nerve to so loudly proclaim it "far surpasses jieba" comes from...

ValueError: 'pkuseg/inference.pyx' doesn't match any files — what on earth is this?

Lestin-MacBook-Pro:~ lestin.yin$ pip3 install -U pkuseg
Collecting pkuseg
Using cached https://files.pythonhosted.org/packages/a5/83/5c6379ff4737bcd26ee1b5c83f2ae78b76651aa8ab1cd3ae1225329371fe/pkuseg-0.0.14.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/var/folders/y5/vr8wrq014hqclky244604c_r0000gn/T/pip-install-gekq4gpk/pkuseg/setup.py", line 48, in <module>
    setup_package()
  File "/private/var/folders/y5/vr8wrq014hqclky244604c_r0000gn/T/pip-install-gekq4gpk/pkuseg/setup.py", line 42, in setup_package
    ext_modules=cythonize(extensions, annotate=True),
  File "/anaconda3/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 920, in cythonize
    aliases=aliases)
  File "/anaconda3/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 800, in create_extension_list
    for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
  File "/anaconda3/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 125, in nonempty
    raise ValueError(error_msg)
ValueError: 'pkuseg/inference.pyx' doesn't match any files

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/y5/vr8wrq014hqclky244604c_r0000gn/T/pip-install-gekq4gpk/pkuseg/

Unreasonable results from the new release

ub16hp@UB16HP:/media/ub16hp/WINDOWS/ub16_prj/bumblebee$ python3
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import PKUSeg
>>> seg = PKUSeg.PKUSeg()
loading model
finish
>>> text = seg.cut('我爱北京***')
>>> text
['我', '爱', '北', '京', '天', '安', '门']

Isn't the performance comparison with the other segmentation toolkits unfair?

Were the jieba and THULAC models in the comparison trained on the corresponding corpora (MSRA, CTB8)? If they had been trained on them, their results should not be that bad: an F-score around 80% is close to unsupervised segmentation.

Comparing a pkuseg trained on in-domain corpora against jieba and THULAC without the corresponding in-domain training data is plainly unfair. The conclusion that segmentation accuracy is greatly improved cannot be drawn from this kind of comparison.

In fact, MSRA segmentation results in the literature basically all exceed 97.5 already.

Common personal names are segmented incorrectly

>>> seg.cut("***总书记先后六次就“秦岭违建”作出批示指示")
['习近', '平', '总书记', '先后', '六次', '就', '“', '秦岭', '违建', '”', '作', '出', '批示', '指示']
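
A hedged workaround sketch: pkuseg accepts a user dictionary of words that should never be split, one word per line. Whether the installed release supports it is an assumption, and the file name is a placeholder.

import pkuseg

# Assumption: user_dict is supported by the installed release; 'names.txt'
# is a hypothetical file listing one word per line to keep intact.
seg = pkuseg.pkuseg(user_dict='names.txt')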

AttributeError: type object 'pkuseg.inference.array' has no attribute '__reduce_cython__'

Importing under Python 3.7 raises an error:

AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>
      5 from xgboost import XGBRegressor
      6 from sklearn import preprocessing
----> 7 import pkuseg
      8 import re
      9 from sklearn.feature_selection import SelectFromModel

/usr/local/anaconda3/lib/python3.7/site-packages/pkuseg/__init__.py in <module>
     12 from multiprocessing import Process, Queue
     13
---> 14 import pkuseg.trainer
     15 import pkuseg.inference as _inf
     16

/usr/local/anaconda3/lib/python3.7/site-packages/pkuseg/trainer.py in <module>
     17 # from .feature_generator import *
     18 from pkuseg.model import Model
---> 19 import pkuseg.inference as _inf
     20
     21 # from .inference import *

/usr/local/anaconda3/lib/python3.7/site-packages/pkuseg/inference.cpython-37m-x86_64-linux-gnu.so in init pkuseg.inference()

AttributeError: type object 'pkuseg.inference.array' has no attribute '__reduce_cython__'

Punctuation handling is not great

Punctuation handling is not great; full-width spaces and half-width opening parentheses, for example, get merged into the neighboring word. See below:

>>> seg.cut("诗人  贾岛")
['诗人', '\u3000贾岛']
>>> seg.cut("诗人(贾岛)")
['诗人', '(贾岛', ')']
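
A hedged user-side preprocessing sketch, not a fix in pkuseg itself: normalize full-width whitespace before segmentation so it cannot be merged into a neighboring word.

import pkuseg

# Replace the full-width space (U+3000) with an ASCII space before cutting,
# assuming pkuseg treats ASCII spaces as token separators.
def normalize(text):
    return text.replace('\u3000', ' ')

seg = pkuseg.pkuseg()
print(seg.cut(normalize('诗人\u3000贾岛')))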

Import fails under Python 3.6

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pkuseg in d:\dev_tools\python3.6\lib\site-packages (0.0.14)
Requirement already satisfied: numpy in d:\dev_tools\python3.6\lib\site-packages (from pkuseg) (1.13.3+mkl)

python
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

>>> import pkuseg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
    import pkuseg.trainer as trainer
  File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
    import pkuseg.inference as _inf
  File "__init__.pxd", line 918, in init pkuseg.inference
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

A few suggestions

If the goal is to build an industrial-strength segmentation tool, then:

1. Tidy up the code to follow PEP 8;
2. Don't support Python 2; it is about to be retired, and the effort is not worth it;
3. Host a copy of the model files on GitHub (see distributing-large-binaries) or S3;
4. Report progress while loading the model (through proper logging rather than printing straight to the console), so users know when loading has finished and don't mistake load time for segmentation time; a sketch follows this list;
5. Run a more comprehensive comparison against other tools (HanLP, LTP, etc.) and add performance benchmarks (e.g. words processed per second);
6. Handle punctuation, digits, and similar tokens specially;
7. Add C++/Java interfaces (if only inference is needed, rewriting the CRF part in C++ would actually be the better option).
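
A minimal sketch of what suggestion 4 asks for, using the standard logging module; this illustrates the suggestion, not pkuseg's current behavior.

import logging

# Route load-progress messages through a named logger so applications can
# silence, redirect, or timestamp them, instead of bare print() calls.
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(name)s: %(message)s')
logger = logging.getLogger('pkuseg')

logger.info('loading model...')
# ... model loading work would happen here ...
logger.info('model loaded')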

AttributeError: module 'tokenize' has no attribute 'generate_tokens'

# -*- coding: utf-8 -*-

import pkuseg

seg = pkuseg.pkuseg()  # load the model with the default configuration
text = seg.cut('我爱北京***')  # segment the text
print(text)

Traceback (most recent call last):
  File "C:/Users/sinnus/NLP/Gideon/tokenize/model/tt.py", line 5, in <module>
    seg = pkuseg.pkuseg()  # load the model with the default configuration
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\pkuseg\__init__.py", line 188, in __init__
    self.model = Model.load()
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\pkuseg\model.py", line 30, in load
    sizes = npz["sizes"]
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\npyio.py", line 258, in __getitem__
    pickle_kwargs=self.pickle_kwargs)
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\format.py", line 678, in read_array
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\format.py", line 540, in _read_array_header
    header = _filter_header(header)
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\format.py", line 502, in _filter_header
    for token in tokenize.generate_tokens(StringIO(string).readline):
AttributeError: module 'tokenize' has no attribute 'generate_tokens'

ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

Installed via pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

import pkuseg

seg = pkuseg.pkuseg()
text = "我爱北京***"
cut = seg.cut(text)
print(cut)

Traceback (most recent call last):
  File "E:/python/work/spider/bx/piggy.py", line 1, in <module>
    import pkuseg
  File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
    import pkuseg.trainer
  File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
    import pkuseg.inference as _inf
  File "__init__.pxd", line 918, in init pkuseg.inference
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

Linux conda Python 3.5 import error

from pkuseg.feature_extractor import FeatureExtractor
ImportError: /home/anaconda3/envs/CAD/lib/python3.5/site-packages/pkuseg/feature_extractor.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

Training takes too long

Hi, I tried training with msr_training.utf8 (86,000 sentences) and validating with msr_test.utf8 (3,000+ sentences). Each iteration took nearly half an hour, and the whole training run took about ten and a half hours. What can I do to speed up training?
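
A hedged sketch of the knobs that usually shorten training: more worker threads and fewer iterations. train_iter appears in later READMEs; whether the installed version accepts it is an assumption.

import pkuseg

# Assumptions: train_iter is supported by the installed release; fewer
# iterations trade some accuracy for wall-clock time, and nthread controls
# the number of worker processes.
pkuseg.train('msr_training.utf8', 'msr_test.utf8', './models',
             nthread=20, train_iter=10)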

What is the 'richEdge' feature?

ub16hp@UB16HP:~/ub16_prj/PKUSeg-python$ grep -inr 'richedge'
main.py:46: richEdge.train()
main.py:52: richEdge.test()
main.py:193: score = richEdge.train()

Python 3.7 is not supported

Hello, I tried pkuseg yesterday and found that it does not support Python 3.7.
python 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pkuseg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pkuseg'

When will an updated version support 3.7? Thanks!

How can the fine-domain model results be reproduced?

  • How can the fine-domain model results be reproduced? Training with the default parameter settings, following the documentation, I only reach an F-score of just over 60%.
  • The input training data separates words with spaces.
  • Parameters are the defaults; training took about six hours.
  • The corpus is the training and test data from icwb2.

AttributeError: 'list' object has no attribute 'copy'

Hi, when I import pkuseg something seems to go wrong; please help check this issue.

>>> import pkuseg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pkuseg/__init__.py", line 2, in <module>
    from .config import Config
  File "/usr/local/lib/python2.7/dist-packages/pkuseg/config.py", line 174, in <module>
    config = Config()
  File "/usr/local/lib/python2.7/dist-packages/pkuseg/config.py", line 39, in __init__
    self.regList = self.regs.copy()
AttributeError: 'list' object has no attribute 'copy'

Error in code example 4

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
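
This RuntimeError is the standard multiprocessing spawn-start issue: on platforms that spawn rather than fork, any pkuseg call that launches worker processes must run under the main-module guard. A hedged sketch of example 4 with the guard added; the file names follow the README's example and should be treated as placeholders.

import pkuseg

# Multi-process segmentation of a file; nthread workers are spawned, so this
# call must not execute at import time, otherwise each worker re-imports the
# script and tries to spawn again.
if __name__ == '__main__':
    pkuseg.test('input.txt', 'output.txt', nthread=20)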
