lancopku / pkuseg-python
pkuseg: a toolkit for multi-domain Chinese word segmentation (多领域中文分词工具)
License: MIT License
Python 3.6
pip3 install pkuseg succeeds
import pkuseg succeeds
seg = pkuseg.pkuseg()
TypeError: 'module' object is not callable
As the title says.
Does the string '几二三四五六七八九十千万亿兆零' in config.py have any special meaning? Why is '几' in there?
Hello, I used pkuseg yesterday and found that Python 3.7 is not supported.
Python 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pkuseg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pkuseg'
When will there be an updated release that supports 3.7? Thanks!
pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pkuseg in d:\dev_tools\python3.6\lib\site-packages (0.0.14)
Requirement already satisfied: numpy in d:\dev_tools\python3.6\lib\site-packages (from pkuseg) (1.13.3+mkl)
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pkuseg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
    import pkuseg.trainer as trainer
  File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
    import pkuseg.inference as _inf
  File "__init__.pxd", line 918, in init pkuseg.inference
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
What encoding should the files use, and should tokens be separated by spaces or by some other delimiter?
Lestin-MacBook-Pro:~ lestin.yin$ pip3 install -U pkuseg
Collecting pkuseg
Using cached https://files.pythonhosted.org/packages/a5/83/5c6379ff4737bcd26ee1b5c83f2ae78b76651aa8ab1cd3ae1225329371fe/pkuseg-0.0.14.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/var/folders/y5/vr8wrq014hqclky244604c_r0000gn/T/pip-install-gekq4gpk/pkuseg/setup.py", line 48, in <module>
setup_package()
File "/private/var/folders/y5/vr8wrq014hqclky244604c_r0000gn/T/pip-install-gekq4gpk/pkuseg/setup.py", line 42, in setup_package
ext_modules=cythonize(extensions, annotate=True),
File "/anaconda3/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 920, in cythonize
aliases=aliases)
File "/anaconda3/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 800, in create_extension_list
for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
File "/anaconda3/lib/python3.6/site-packages/Cython/Build/Dependencies.py", line 125, in nonempty
raise ValueError(error_msg)
ValueError: 'pkuseg/inference.pyx' doesn't match any files
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/y5/vr8wrq014hqclky244604c_r0000gn/T/pip-install-gekq4gpk/pkuseg/
length = 1 : 0
length = 2 : 2496
length = 3 : 2642
length = 4 : 2568
length = 5 : 1313
length = 6 : 633
length = 7 : 249
length = 8 : 133
length = 9 : 66
length = 10 : 16
length = 11 : 6
length = 12 : 1
length = 13 : 1
start training...
reading training & test data...
done! train/test data sizes: 1/1
r: 1
iter0 diff=1.00e+100 train-time(sec)=5.64 f-score=0.06%
iter1 diff=1.00e+100 train-time(sec)=5.63 f-score=0.00%
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/__init__.py", line 324, in train
trainer.train(config)
File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 103, in train
score_list = trainer.test(testset, i)
File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 169, in test
testset, self.model, writer
File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 357, in _decode_fscore
gold_tags, pred_tags, self.idx_to_chunk_tag
File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/scorer.py", line 37, in getFscore
pre = correct_chunk / res_chunk * 100
ZeroDivisionError: division by zero
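The ZeroDivisionError above comes from `pre = correct_chunk / res_chunk * 100` in scorer.py: when the model predicts no chunks at all (plausible here, given the f-score collapsing to 0.00%), `res_chunk` is zero. A defensive version of the metric, as a sketch rather than pkuseg's actual code, would guard each division:

```python
def fscore(correct_chunk, res_chunk, gold_chunk):
    """Precision/recall/F1 in percent; returns 0.0 instead of dividing by zero."""
    pre = correct_chunk / res_chunk * 100 if res_chunk else 0.0
    rec = correct_chunk / gold_chunk * 100 if gold_chunk else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1

print(fscore(0, 0, 100))    # (0.0, 0.0, 0.0) rather than ZeroDivisionError
print(fscore(80, 100, 90))
```

Guarding the metric only hides the symptom, though; the underlying problem is that the model learned nothing in the first two iterations.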
For example: jieba.add_word('人工智能')
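The request above is for a jieba-style user dictionary; recent pkuseg releases do take a `user_dict` argument in the constructor (a path to a one-word-per-line file), though check the README for your version. As a self-contained toy illustration of why a user dictionary changes segmentation, here is a forward-maximum-matching segmenter; pkuseg itself is CRF-based, so this is only a sketch of the idea:

```python
# Toy forward-maximum-matching segmenter: greedily take the longest
# dictionary word starting at the current position, else a single character.
def fmm_cut(text, dictionary, max_len=5):
    words, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + l]
            if l == 1 or cand in dictionary:
                words.append(cand)
                i += l
                break
    return words

base = {'北京'}
print(fmm_cut('人工智能在北京', base))                 # ['人', '工', '智', '能', '在', '北京']
print(fmm_cut('人工智能在北京', base | {'人工智能'}))  # ['人工智能', '在', '北京']
```

Adding '人工智能' to the dictionary makes the whole term come out as one token instead of four single characters, which is exactly the effect jieba.add_word gives.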
First off, the code reads like the quick single-purpose scripts I write myself (and I write those casually at school too); with 2000+ stars this really deserves a refactor. Two suggestions: 1. add a classifier on top that detects the domain of the input and automatically loads the matching domain model before segmenting; 2. ensemble the outputs of multiple models.
For example, words like 的, 吗, 哪个?
Looks badly over-hyped.
import pkuseg
seg = pkuseg.pkuseg()  # load the model with the default configuration
text = seg.cut('我爱北京***')  # segment the text
print(text)
Traceback (most recent call last):
  File "C:/Users/sinnus/NLP/Gideon/tokenize/model/tt.py", line 5, in <module>
    seg = pkuseg.pkuseg()  # load the model with the default configuration
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\pkuseg\__init__.py", line 188, in __init__
    self.model = Model.load()
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\pkuseg\model.py", line 30, in load
    sizes = npz["sizes"]
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\npyio.py", line 258, in __getitem__
    pickle_kwargs=self.pickle_kwargs)
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\format.py", line 678, in read_array
    shape, fortran_order, dtype = _read_array_header(fp, version)
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\format.py", line 540, in _read_array_header
    header = _filter_header(header)
  File "C:\Users\sinnus\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\format.py", line 502, in _filter_header
    for token in tokenize.generate_tokens(StringIO(string).readline):
AttributeError: module 'tokenize' has no attribute 'generate_tokens'
How do I get part-of-speech tags along with the segmentation?
pkuseg:
seg = pkuseg.pkuseg()
print(seg.cut('结婚的和尚未结婚的确实在干扰分词啊'))
['结婚', '的', '和尚', '未', '结婚', '的确', '实在', '干扰', '分词', '啊']
jieba:
print([i[0] for i in jieba.tokenize('结婚的和尚未结婚的确实在干扰分词啊')])
['结婚', '的', '和', '尚未', '结婚', '的', '确实', '在', '干扰', '分词', '啊']
Three mis-segmented words in a single sentence. I don't know where the confidence to announce so loudly that it far surpasses jieba comes from ......
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
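This RuntimeError is the standard spawn-mode multiprocessing failure on Windows and macOS: pkuseg's multi-process code paths (e.g. training with nthread > 1) start worker processes, so the calling script must keep its work under the `if __name__ == '__main__':` guard. A minimal skeleton of the correct idiom (the function body is a placeholder):

```python
import multiprocessing

def main():
    # Calls that spawn worker processes (e.g. pkuseg.train with nthread > 1)
    # belong here, not at module top level, so that spawned children can
    # re-import this file without re-running the work itself.
    return "running in the main process"

if __name__ == '__main__':
    multiprocessing.freeze_support()  # only matters for frozen executables
    print(main())
```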
Could the pretrained models also be hosted on another file-sharing service? Baidu Cloud throttles downloads so heavily that it is barely usable. Thanks!
ub16hp@UB16HP:~/ub16_prj/PKUSeg-python$ grep -inr 'richedge'
main.py:46: richEdge.train()
main.py:52: richEdge.test()
main.py:193: score = richEdge.train()
jieba has an ElasticSearch plugin.
Importing under Python 3.7 reports an error:
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
      5 from xgboost import XGBRegressor
      6 from sklearn import preprocessing
----> 7 import pkuseg
      8 import re
      9 from sklearn.feature_selection import SelectFromModel
/usr/local/anaconda3/lib/python3.7/site-packages/pkuseg/__init__.py in <module>
     12 from multiprocessing import Process, Queue
     13
---> 14 import pkuseg.trainer
     15 import pkuseg.inference as _inf
     16
/usr/local/anaconda3/lib/python3.7/site-packages/pkuseg/trainer.py in <module>
     17 # from .feature_generator import *
     18 from pkuseg.model import Model
---> 19 import pkuseg.inference as _inf
     20
     21 # from .inference import *
/usr/local/anaconda3/lib/python3.7/site-packages/pkuseg/inference.cpython-37m-x86_64-linux-gnu.so in init pkuseg.inference()
AttributeError: type object 'pkuseg.inference.array' has no attribute '__reduce_cython__'
>>> seg.cut("***总书记先后六次就“秦岭违建”作出批示指示")
['习近', '平', '总书记', '先后', '六次', '就', '“', '秦岭', '违建', '”', '作', '出', '批示', '指示']
For an open-source tool written in Python, the installation instructions do not even state which Python versions are supported. That matters doubly for a Chinese word segmentation tool, given how large the differences between Python 2.7 and Python 3.x are in handling Chinese text!
As the title says. Thanks!
The README says: "jieba's default model is a statistical model based mainly on word-frequency information from its training data; we re-computed the word-frequency information on different training sets." I would like to know: how exactly do you compute that word-frequency information?
when will pkuseg support python2.7?
from pkuseg.feature_extractor import FeatureExtractor
ImportError: /home/anaconda3/envs/CAD/lib/python3.5/site-packages/pkuseg/feature_extractor.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf
If I want to train a model for my own domain with the command above, pkuseg.train('train', 'test', 'model', nthread=20), I have four questions:
1. Do the train and test data have to be .utf8 files, or is the .txt format also acceptable?
2. Should the train and test data also be split 8:2?
3. How many characters or words should the train/test data contain to be sufficient?
4. In the train/test files, must all words be separated by spaces before training?
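Regarding question 4: as I understand the pkuseg README, training data is plain UTF-8 text with one sentence per line and words separated by spaces (the file extension does not matter, only the content and encoding). A self-contained sketch that writes such a file and checks it round-trips (the file name is a placeholder):

```python
import os
import tempfile

# Each sentence is a list of already-segmented words.
sentences = [['我', '爱', '北京'], ['今天', '是', '礼拜五']]

# Write one space-separated sentence per line, UTF-8 encoded.
path = os.path.join(tempfile.gettempdir(), 'train.utf8')
with open(path, 'w', encoding='utf-8') as f:
    for words in sentences:
        f.write(' '.join(words) + '\n')

# Sanity check: every line splits back into the original word list.
with open(path, encoding='utf-8') as f:
    parsed = [line.split() for line in f]
assert parsed == sentences
print(f"wrote {len(parsed)} training sentences to {path}")
```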
Punctuation handling is not great: full-width spaces and opening parentheses, for instance, get merged into the neighboring word. For example:
>>> seg.cut("诗人 贾岛")
['诗人', '\u3000贾岛']
>>> seg.cut("诗人(贾岛)")
['诗人', '(贾岛', ')']
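Until the model handles this, a post-processing pass can peel punctuation off token edges. This is a workaround sketch, not part of pkuseg; the character set `_EDGE` is an assumption you would extend as needed:

```python
# Characters to split off token edges: full-width space, full-width and
# ASCII parentheses, curly quotes, ASCII space (an assumed, extendable set).
_EDGE = '\u3000()()""\u0020'

def strip_punct(tokens):
    """Split leading/trailing punctuation off each token into separate tokens;
    whitespace characters are dropped entirely."""
    out = []
    for tok in tokens:
        head = []
        while tok and tok[0] in _EDGE:
            if not tok[0].isspace():
                head.append(tok[0])
            tok = tok[1:]
        tail = []
        while tok and tok[-1] in _EDGE:
            if not tok[-1].isspace():
                tail.append(tok[-1])
            tok = tok[:-1]
        out.extend(head)
        if tok:
            out.append(tok)
        out.extend(reversed(tail))
    return out

print(strip_punct(['诗人', '\u3000贾岛']))    # ['诗人', '贾岛']
print(strip_punct(['诗人', '(贾岛', ')']))  # ['诗人', '(', '贾岛', ')']
```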
Does this support fine-tuning? I would like to fine-tune the pretrained general model on another corpus.
Without a paid account, downloading from Baidu Netdisk is far too slow. Please provide other download channels for the models, for example in GitHub releases.
An open-source project should still hold itself to a standard of code quality.
Were the jieba and THULAC models in your comparison trained on the corresponding corpora (MSRA, CTB8)? If they had been, their results should not be that bad: F-scores around 80% are close to unsupervised segmentation. Comparing a pkuseg trained on in-domain data against jieba and THULAC without the corresponding in-domain training data is obviously unfair, and the claim that pkuseg "greatly improves segmentation accuracy" cannot be drawn from that kind of experiment. In fact, published MSRA segmentation results basically all exceed 97.5 F-score.
Tested the ctb8 model: it is slightly worse than jieba's default model. I suggest publishing some best-practice documentation.
print(seg.cut('工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作'))
['工信', '处女', '干事', '每', '月', '经过', '下属', '科室', '都', '要', '亲口', '交代', '24', '口', '交换机', '等', '技术性', '器件', '的', '安装', '工作']
print(", ".join(jieba.cut("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作")))
工信处, 女干事, 每月, 经过, 下属, 科室, 都, 要, 亲口, 交代, 24, 口, 交换机, 等, 技术性, 器件, 的, 安装, 工作
print(seg.cut('贱狗奴,鸡巴套子把爸爸加上'))
['贱', '狗', '奴', ',', '鸡', '巴', '套子', '把', '爸爸', '加上']
print(", ".join(jieba.cut("贱狗奴,鸡巴套子把爸爸加上",)))
贱狗奴, ,, 鸡巴, 套子, 把, 爸爸, 加上
It doesn't seem to run on Python 2.7.10.
ub16hp@UB16HP:/media/ub16hp/WINDOWS/ub16_prj/bumblebee$ python3
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import PKUSeg
>>> seg = PKUSeg.PKUSeg()
loading model
finish
>>> text = seg.cut('我爱北京***')
>>> text
['我', '爱', '北', '京', '天', '安', '门']
Hi, when I import pkuseg something seems to go wrong; please help check the issue.
>>> import pkuseg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pkuseg/__init__.py", line 2, in <module>
    from .config import Config
  File "/usr/local/lib/python2.7/dist-packages/pkuseg/config.py", line 174, in <module>
    config = Config()
  File "/usr/local/lib/python2.7/dist-packages/pkuseg/config.py", line 39, in __init__
    self.regList = self.regs.copy()
AttributeError: 'list' object has no attribute 'copy'
Installed with pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
import pkuseg
seg = pkuseg.pkuseg()
text = "我爱北京***"
cut = seg.cut(text)
print(cut)
Traceback (most recent call last):
  File "E:/python/work/spider/bx/piggy.py", line 1, in <module>
    import pkuseg
  File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
    import pkuseg.trainer
  File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
    import pkuseg.inference as _inf
  File "__init__.pxd", line 918, in init pkuseg.inference
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
If the goal is to build an industrial-strength segmentation tool, then:
1. Clean up the code to follow PEP 8;
2. Don't bother supporting Python 2; it is about to be retired, and the effort is not worth it;
3. Host a copy of the model files on GitHub (see distributing-large-binaries) or on S3;
4. Report progress while the model is loading (through proper log output rather than printing straight to the console), so users know when loading has finished and do not mistake load time for segmentation time;
5. Run a more comprehensive comparison against other toolkits (hanlp, LTP, etc.) and add performance benchmarks (e.g. how many words per second it can process);
6. Add special handling for punctuation, numbers, and the like;
7. Add C++/Java bindings (if only inference is needed, rewriting the CRF part in C++ would actually be the better route).
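Point 5's words-per-second benchmark can be sketched with a tiny timing harness. `segment` below is a stand-in for any toolkit's cut function (e.g. pkuseg's `seg.cut` or `jieba.cut`); a trivial per-character splitter is substituted so the sketch is self-contained:

```python
import time

def segment(sentence):
    # Placeholder segmenter: one token per character. Swap in the real
    # toolkit call (e.g. seg.cut) to benchmark it.
    return list(sentence)

def benchmark(sentences, rounds=5):
    """Return segmented words per second over `rounds` passes on `sentences`."""
    start = time.perf_counter()
    words = 0
    for _ in range(rounds):
        for s in sentences:
            words += len(segment(s))
    elapsed = time.perf_counter() - start
    return words / elapsed

corpus = ["我爱北京", "今天是礼拜五"] * 1000
print(f"{benchmark(corpus):.0f} words/sec")
```

Timing whole passes over a fixed corpus, rather than single calls, keeps per-call overhead and model load time out of the measurement.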
jieba vs. pkuseg segmentation comparison
jieba: 这个 不是 我刷 的 哦
今天 可能 不行 啊 今天 工资 没到 账 今天 是 礼拜五
你好 不是 对不起
pkuseg: 这 个 不 是 我 刷 的 哦
今天 可能 不 行 啊 今天 工资 没 到 账 今天 是 礼拜五
你 好 不 是 对 不 起
On these cases pkuseg clearly does worse than jieba. My question: if I want pkuseg to beat jieba on cases like these, what should I do? Test the segmentation output of every released model, or retrain a model from scratch?
Tried running the Python version:
python3 main.py test a.txt out.txt
It fails with:
Traceback (most recent call last):
File "main.py", line 316, in <module>
run()
File "main.py", line 17, in run
testFeature = Feature(config.readFile, 'test')
File "/Users/xilins/Personal/PKUSeg-python/feature.py", line 11, in __init__
self.test_init()
File "/Users/xilins/Personal/PKUSeg-python/feature.py", line 26, in test_init
self.readBigramFeature(config.modelDir+'/bigram_word.txt')
File "/Users/xilins/Personal/PKUSeg-python/feature.py", line 54, in readBigramFeature
with open(file, 'r', encoding='utf-8') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'model/bigram_word.txt'
1. Training on the test corpus and then evaluating on that same test corpus is simply unprofessional.
2. Even if every toolkit were trained on the test corpus, you run all the experiments yourselves, and you know how to tune pkuseg best, so its numbers benefit and the final comparison is biased.
In short: unless you can publish comparisons on fresh black-box data, stop making such strong claims.