rockyzhengwu / foolnltk

A Chinese Natural Language Toolkit

License: Apache License 2.0

Python 97.05% Shell 2.95%

foolnltk's Issues

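How to train the NER model

How can I train my own entity recognition model? Following the documentation, I successfully trained my own word segmentation model. Is there any description of the training process for entity recognition?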

TypeError: the JSON object must be str, not 'bytes'

I ran the example in the README to segment a Chinese sentence, but a TypeError occurred, as follows:

$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fool
>>> text="一个傻子在北京"
>>> print(fool.cut(text))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/fool/__init__.py", line 65, in cut
    all_words = LEXICAL_ANALYSER.cut(text)
  File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 98, in cut
    self._load_seg_model()
  File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 39, in _load_seg_model
    self.seg_model = self._load_model("seg.pb", "char_map", "seg_map")
  File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 35, in _load_model
    char_to_id, id_to_seg = _load_map_file(self.map_file_path, word_map_name, tag_name)
  File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 18, in _load_map_file
    data = json.load(myfile)
  File "/usr/lib/python3.5/json/__init__.py", line 268, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'

How can I handle this error? Many thanks!

FYI:

foolnltk version: 0.1.1 (installed with sudo pip3 install foolnltk --no-cache-dir)
Python 3.5
Ubuntu 16.04 (x64)
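The traceback above is consistent with json.load receiving a file object that yields bytes: on Python 3.5, json.loads accepts only str, while Python 3.6 and later also accept bytes. A minimal sketch of a workaround, assuming the map file is UTF-8 encoded JSON; the helper names `load_json_from_binary` and `load_json_map` are hypothetical and not part of foolnltk:

```python
import json


def load_json_from_binary(binary_file):
    """Decode the bytes explicitly so json.loads always receives str.

    `binary_file` is any object whose read() returns bytes, for example
    a file opened with mode 'rb' or a member opened from a zip archive.
    UTF-8 is an assumption about how the map file was written.
    """
    return json.loads(binary_file.read().decode("utf-8"))


def load_json_map(path):
    """Open a JSON file in text mode with an explicit encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Upgrading to Python 3.6 or later is another way to avoid this specific error, since json.loads there accepts bytes directly.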

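After reading the source carefully, I found that the training task is a label tagging task. So how are the part-of-speech tagging and topic recognition provided by the fool module implemented?

The labels are B (word beginning), M (word middle), E (word end) and S (single-character word), so the resulting model can only do word segmentation. How are the other features of the module implemented?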
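To make the B/M/E/S scheme concrete, here is a minimal sketch that turns a tag sequence back into words. It illustrates the general tagging scheme only; `bmes_to_words` is a hypothetical helper, not foolnltk's own decoder. In general, a sequence labeller of this kind can be trained on a different label set (for example part-of-speech or entity tags) by changing the training data, though whether foolnltk does exactly that is not confirmed here.

```python
def bmes_to_words(chars, tags):
    """Rebuild words from per-character B/M/E/S tags."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            # Single-character word; flush any unfinished word first.
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            # A new word starts; flush any unfinished word first.
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        else:
            # "E": the current word ends with this character.
            current += ch
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


words = bmes_to_words("一个傻子在北京", ["B", "E", "B", "E", "S", "B", "E"])
print(words)
```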

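What NER types are there, and what does each one mean?

I tried it and the accuracy seems quite high. Thumbs up!
I have one question, though: I don't know which NER types this project recognizes or what each of them means exactly. For example, does the product entity refer to an actual product such as a car model, or does it cover every product of human intelligence, such as a book? I hope there can be a clear definition. Thanks!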

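About generating dev.txt, test.txt and train.txt in the demo

Hello! I can use word2vec to turn the corpus into vectors of a given dimension, but which tool generates the dev.txt, test.txt and train.txt files in the source? Is it the 4-tag labelling scheme (is that what the corpus is)? And how can it be extended with additional characters?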

Cannot install foolnltk

I get: ERROR: Cannot uninstall 'wrapt'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
I cannot install it. Please help.

TypeError: 'NoneType' object is not callable

Hello, when testing I sometimes get:
TypeError: 'NoneType' object is not callable
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x0000000019C8A208>>
When I run it again it works normally. My platform is TensorFlow 1.4.0 on Windows 7. Is this a problem with my versions?

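How does the user dictionary take effect?

Hello, how can I make sure that a word added to the user dictionary is always segmented as a word? In other words, how exactly is the user dictionary processed?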

A question

@rockyzhengwu
I have a question about the following code:

        with tf.variable_scope("crf_loss" if not name else name):
            small = -1000.0
            start_logits = tf.concat(
                [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])],
                axis=-1)

            pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
            logits = tf.concat([project_logits, pad_logits], axis=-1)
            logits = tf.concat([start_logits, logits], axis=1)
            targets = tf.concat(
                [tf.cast(self.num_tags * tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)

            self.trans = tf.get_variable(
                "transitions",
                shape=[self.num_tags + 1, self.num_tags + 1],
                initializer=self.initializer)

            log_likelihood, self.trans = crf_log_likelihood(
                inputs=logits,
                tag_indices=targets,
                transition_params=self.trans,
                sequence_lengths=lengths + 1)

In the CRF layer, what is the purpose of padding the predictions and the targets with an extra dimension filled with -1000?
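One plausible reading of the quoted code, not confirmed by the author, is that it adds an artificial "start" tag with id num_tags. Every real time step gets a score of -1000 for that tag, so it is effectively never chosen, and one extra time step is prepended where only the start tag has a non-negative score. The enlarged transition matrix can then learn start-to-first-tag transitions. The sketch below reproduces just the shape manipulation for a single sequence using plain Python lists; `pad_for_crf` is a hypothetical name.

```python
SMALL = -1000.0


def pad_for_crf(project_logits, targets):
    """Mimic the padding for one sequence (no batch dimension).

    project_logits: list of T rows, each with num_tags scores.
    targets: list of T gold tag ids.
    Returns (T + 1) rows of (num_tags + 1) scores and T + 1 targets.
    """
    num_tags = len(project_logits[0])
    # Extra column for the start tag: very unlikely at every real step.
    padded = [list(row) + [SMALL] for row in project_logits]
    # Extra first step: real tags very unlikely, start tag scores 0.
    start_row = [SMALL] * num_tags + [0.0]
    logits = [start_row] + padded
    # The gold sequence now begins with the start tag id (== num_tags).
    new_targets = [num_tags] + list(targets)
    return logits, new_targets


logits, new_targets = pad_for_crf([[1.0, 2.0], [3.0, 4.0]], [0, 1])
print(len(logits), len(logits[0]), new_targets)
```

This also matches the `sequence_lengths=lengths + 1` argument in the quoted call, since each sequence gains one step.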

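How are the NER labels determined?

So far I can see that segmentation uses a CRF, and that words are built from single characters tagged as B, M and E.

But how is the relationship between words and entities established?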

Error when loading a model I trained myself

My environment:
Ubuntu 18.04

The error is as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1: invalid start byte

./main export

I got this error when running './main export'. How can I fix it?

Which part-of-speech tag set is used?

It looks similar to the ICTPOS3.0 part-of-speech tag set.

Please upload map.zip

The following error is raised when running the project:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/fool/map.zip'
Looking at the code, these files may also need to be loaded:

    def _load_seg_model(self):
        self.seg_model = self._load_model("seg.pb", "char_map", "seg_map")

    def _load_pos_model(self):
        self.pos_model = self._load_model("pos.pb", "word_map", "pos_map")

    def _load_ner_model(self):
        self.ner_model = self._load_model("ner.pb", "char_map", "ner_map")

Could the author please provide the corresponding download package? Many thanks.

training src?

Cool! Is there no training code?

Run error:text = ""

import fool

text = ""
print(fool.cut(text))

/Users/howie/anaconda3/envs/python36/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
starting load model 
2017-12-28 21:16:02.363773: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
loaded model cost : 1.258011s
Traceback (most recent call last):
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnimplementedError: TensorArray has size zero, but element shape [?,100] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
	 [[Node: prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,100], _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/range, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/while/Exit_1)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/howie/Documents/programming/python36/hi/01.py", line 10, in <module>
    print(fool.cut(text))
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/__init__.py", line 37, in cut
    words, _, _ = LEXICAL_ANALYSER.cut(text)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/lexical.py", line 121, in cut
    seg_path = self.seg_model.predict(input_chars)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/predictor.py", line 59, in predict
    logits, trans = self.sess.run([self.logits, self.trans], feed_dict=feed_dict)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnimplementedError: TensorArray has size zero, but element shape [?,100] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
	 [[Node: prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,100], _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/range, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/while/Exit_1)]]

Caused by op 'prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3', defined at:
  File "/Users/howie/Documents/programming/python36/hi/01.py", line 10, in <module>
    print(fool.cut(text))
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/__init__.py", line 36, in cut
    _check_model()
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/__init__.py", line 26, in _check_model
    LEXICAL_ANALYSER.load_model()
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/lexical.py", line 78, in load_model
    self.seg_model = Predictor(os.path.join(data_path, "seg.pb"), self.map.num_seg)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/predictor.py", line 42, in __init__
    self.graph = load_graph(model_path)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/predictor.py", line 36, in load_graph
    tf.import_graph_def(graph_def, name="prefix")
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 313, in import_graph_def
    op_def=op_def)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

UnimplementedError (see above for traceback): TensorArray has size zero, but element shape [?,100] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
	 [[Node: prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,100], _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/range, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/while/Exit_1)]]
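The traceback above is triggered by an empty input string reaching the model. Until that is handled inside the library, a caller-side guard is one option. This is a minimal sketch: `safe_cut` is a hypothetical wrapper, and returning an empty list for empty input is an assumption about what the caller wants.

```python
def safe_cut(text, cut_fn):
    """Call cut_fn only when text contains non-whitespace characters.

    cut_fn is the segmentation callable, passed in so the wrapper does
    not depend on a particular library (for example, fool.cut).
    """
    if not text or not text.strip():
        # Nothing to segment; skip the model call entirely.
        return []
    return cut_fn(text)


# Usage with a stand-in segmenter that splits into characters.
print(safe_cut("", list))
print(safe_cut("北京", list))
```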

Error when trying to load exported model

I trained the model and exported it following the instructions.

I wrote a script to load the model as described in the "load model" section.
The error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf7 in position 1: invalid start byte"
occurred when execution reached this line:
fool.load_model(map_file=map_file, model_file=checkpoint_ifle)

I fixed it by changing load_graph in fool/model.py from
with tf.gfile.GFile(path) as f:
to
with tf.gfile.GFile(path,'rb') as f:

Please fix it, thank you!
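The underlying cause can be shown without TensorFlow: a serialized model is binary data, and reading it in text mode makes Python try to decode it as UTF-8. The sketch below uses the built-in open as a stand-in for tf.gfile.GFile, which is an assumption made purely for illustration; `read_model_bytes` is a hypothetical helper.

```python
import os
import tempfile


def read_model_bytes(path):
    """Read a binary file without any text decoding."""
    with open(path, "rb") as f:
        return f.read()


# Write a few bytes that are not valid UTF-8, like a serialized graph.
demo_path = os.path.join(tempfile.mkdtemp(), "model.pb")
with open(demo_path, "wb") as f:
    f.write(b"\n\xf7\x01")

try:
    with open(demo_path, "r", encoding="utf-8") as f:
        f.read()
    text_mode_failed = False
except UnicodeDecodeError:
    # Text mode fails because 0xf7 is an invalid start byte here.
    text_mode_failed = True

print(text_mode_failed, read_model_bytes(demo_path))
```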

./main.sh map error

I got this error when running ./main.sh data. How can I solve it?

What is the part-of-speech tagging standard?

Which part-of-speech tagging standard is your code based on?

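Which version of TensorFlow do you use? When I run the ner module I get "No module named 'tensorflow.contrib.crf'".

Which version of TensorFlow do you use? When I run the ner module I get "No module named 'tensorflow.contrib.crf'". Do you know what causes this? Thanks.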

CRF

Which features does the CRF use for the part-of-speech tagging task?
If I train a model myself, which features does the default CRF use?

Python 2.7 support

Just change the following in model.py under the fool directory:
    def load_map(path):
        with open(path, 'rb') as f:
            char_to_id, tag_to_id, id_to_tag = pickle.load(f)
        return char_to_id, id_to_tag
to:
    def load_map(path):
        with open(path) as f:
            char_to_id, tag_to_id, id_to_tag = pickle.load(f)
        return char_to_id, id_to_tag
and that is all.

UnicodeDecodeError in Windows

I'm trying it in Windows and get UnicodeDecodeError, wondering if we can replace "open(path)" with "codecs.open(path, encoding='utf-8')", e.g. line 21 in dictionary.py, to avoid this. Thanks
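On Windows, open(path) without an encoding falls back to the locale's default code page, which is often not UTF-8, so passing the encoding explicitly is the usual remedy. A minimal sketch, assuming the dictionary file is UTF-8 and uses a "word weight" line format; `load_user_dict` is a hypothetical parser, not the one in dictionary.py:

```python
import io


def load_user_dict(path):
    """Parse 'word weight' lines from a UTF-8 file into a dict."""
    entries = {}
    # io.open with an explicit encoding behaves the same on every
    # platform and works on both Python 2 and Python 3.
    with io.open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                # Skip blank or malformed lines.
                continue
            word, weight = parts
            entries[word] = int(weight)
    return entries
```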

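Could you provide the dataset used for training the model?

Thank you very much for sharing. I would like to use your code to reproduce the training process as a learning exercise. Could you provide the training set?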

Which labels does the ner method produce? I have seen location, org and company. Are there others?

user dict has no effect in some cases

The user dict has an obvious effect in most cases, but not always. For example:
contents of user_dict.txt:
二十 10000
四百 10000
一千 10000
When I run the command:
echo "一一千四百二十九" | python3 -m fool -d -u user_dict.txt
the result is "一一千四百二十九", which is not cut apart.
But if the sentence to be cut is "一千四百二十九", it is cut to "一千四百二十九". That's what I want.
How can these results be explained, and what can be done if I want the result to be "一一千四百二十九"?

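A very handy tool, with a few suggestions

Will future changes be large? I hope it stays as simple and practical as it is now.
Also:
Could a custom part-of-speech dictionary be supported?
Could a custom entity dictionary be supported?
Thanks!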

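Hello. Regarding part-of-speech tagging and entity recognition, what corpus does foolnltk use? Can I reuse your model training process to train an entity recognition model?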

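Model files

Hello, if I do not train a model for now, do your files include a trained model? I would like to use a model directly to test word segmentation.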

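Could you explain the CRF code? Why pad_logits, and why num_tags + 1?

In the training code, in loss_layer of the bi_lstm file, why are logits and targets padded with an extra dimension before the CRF is computed?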

FileNotFoundError: [Errno 2] No such file or directory: '/usr/fool/map.zip'

After running pip3 install foolnltk
and executing the example code,
I get a FileNotFoundError.

What is the license of the FoolNLTK source code, model and corpus?

Hi all,

I want to integrate FoolNLTK into my project for personal use, or possibly into a commercial product.

So, what is the license of the FoolNLTK source code, model and corpus?

Can the source code, model and corpus be licensed separately under different licenses?

Thanks in advance.

Which named entity types does FOOL recognize?

Suppressing warnings

How can I suppress these annoying warnings? Note that fool itself also triggers warnings.

np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING: Logging before flag parsing goes to stderr.
W0804 17:59:34.962490 140736280798080 deprecation_wrapper.py:119] From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fool/predictor.py:32: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

W0804 17:59:34.962846 140736280798080 deprecation_wrapper.py:119] From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fool/predictor.py:33: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.

W0804 17:59:35.032593 140736280798080 deprecation_wrapper.py:119] From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fool/predictor.py:53: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-08-04 17:59:35.033006: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
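The output above mixes three sources: NumPy FutureWarnings, TensorFlow's Python-side deprecation messages, and TensorFlow's native log lines. A minimal sketch that may quiet them, assuming it runs before TensorFlow or fool is imported; `quiet_logs` is a hypothetical helper, and how much it silences can vary between TensorFlow versions.

```python
import logging
import os
import warnings


def quiet_logs():
    """Reduce log noise; call this before importing tensorflow or fool."""
    # "2" filters INFO and WARNING lines from TensorFlow's native code,
    # such as the cpu_feature_guard message.
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
    # Hide FutureWarning, including the NumPy dtype deprecation notices.
    warnings.filterwarnings("ignore", category=FutureWarning)
    # TensorFlow's Python logger is named "tensorflow".
    logging.getLogger("tensorflow").setLevel(logging.ERROR)


quiet_logs()
```

Hiding warnings does not fix the deprecated calls themselves; those would need changes in the library code.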

Part-of-speech reference table

Hello, which standard part-of-speech reference table do you use for part-of-speech tagging?

./main train

When I train the model myself, how can I assign a specific GPU to use?
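A common, library-independent way to restrict a process to one GPU is the CUDA_VISIBLE_DEVICES environment variable, set before TensorFlow initializes CUDA. This is a general CUDA mechanism rather than an option of this project's main script; `select_gpu` is a hypothetical helper.

```python
import os


def select_gpu(index):
    """Expose only one GPU to this process.

    Must be called before TensorFlow (or any CUDA library) is imported,
    because the variable is read when CUDA is initialized.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


select_gpu(0)
```

Inside the process, the selected card then appears as device 0 regardless of its physical index.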

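Loading a user dictionary has no effect, and an entity is not recognized

Hello, while using foolnltk I found that loading a user dictionary has no effect, and I do not know why. Details:
Environment: Windows 10 + Python 3.6

fool.analysis('阿里收购饿了么')
Returns: ([[('阿里', 'nz'), ('收购', 'v'), ('饿', 'v'), ('了', 'y'), ('么', 'y')]], [[(0, 3, 'company', '阿里')]])

User dictionary format:
饿了么 10

fool.load_userdict(path)
fool.analysis('阿里收购饿了么')
Returns: ([[('阿里', 'nz'), ('收购', 'v'), ('饿', 'v'), ('了', 'y'), ('么', 'y')]], [[(0, 3, 'company', '阿里')]])

Loading the user dictionary seems to have no effect. "饿了么" is still split apart during segmentation, and it is not recognized in entity recognition either.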

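About the structure of the segmentation and NER results

In earlier versions, fool.analysis and fool.cut both returned a flat list. Why is the result now (0.1.3) a nested list? Does it serve a special purpose?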
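A nested result like this is consistent with one inner list per input text, which would let the same call handle a batch of texts; that is an assumption here, not something the issue confirms. The sketch below unpacks a result of the shape reported in another issue on this page, using a hard-coded value so it runs without the library.

```python
# A result shaped like the fool.analysis output quoted on this page:
# (list of per-text word/tag lists, list of per-text entity lists).
result = (
    [[("阿里", "nz"), ("收购", "v"), ("饿", "v"), ("了", "y"), ("么", "y")]],
    [[(0, 3, "company", "阿里")]],
)

words_per_text, entities_per_text = result

# For a single input text, take the first (and only) inner list.
first_words = [word for word, _tag in words_per_text[0]]
first_entities = entities_per_text[0]

print(first_words)
print(first_entities)
```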

Is char_embeding off by one?

I see that you added a row of zero vectors to char_embeding, but I found that this causes lookups to be shifted by one position.

        # zero_pad = tf.constant(0.0, dtype=tf.float32, shape=[1, config["char_dim"]])
        # self.char_embeding = tf.concat(axis=0, values=[zero_pad, tf.get_variable(name="char_embeding", initializer=embeddings)])
        self.char_embeding = tf.get_variable(name="char_embeding", initializer=embeddings)

@rockyzhengwu
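The concern can be illustrated with plain Python lists standing in for the embedding matrix: prepending a zero row moves every original row down by one, so a lookup with unshifted ids returns the previous character's vector. This is a sketch with made-up values; whether the ids in the real vocabulary already reserve 0 for padding is not something this page confirms.

```python
# Toy embedding table: row i is the vector for character id i.
embeddings = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]

# Prepending a zero row, as in the commented-out code above.
zero_pad = [[0.0, 0.0]]
padded = zero_pad + embeddings

char_id = 1
original_vector = embeddings[char_id]
shifted_vector = padded[char_id]
corrected_vector = padded[char_id + 1]

# With unshifted ids the padded table returns the wrong row; adding 1
# to each id (or reserving id 0 for padding) restores the match.
print(original_vector, shifted_vector, corrected_vector)
```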

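Why is processing sometimes fast and sometimes slow when deployed on a server?

When deployed on a server, data processing is sometimes fast and sometimes slow. Why? Sometimes it takes 1 second, sometimes ten to twenty seconds.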

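Training failed

Hello,
When I train a model with my own corpus, the following error occurs:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [6299,100] rhs shape= [1104,100]
(This happens at the train step; the earlier tfrecord steps all passed.)

How can I solve this?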

Run error on macOS

/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/altman/IdeaProjects/gopath/src/labs/fool_test/test.py
/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
  return f(*args, **kwds)
starting load model
Traceback (most recent call last):
  File "/Users/altman/IdeaProjects/gopath/src/labs/fool_test/test.py", line 4, in <module>
    print(fool.cut(text))
  File "/usr/local/lib/python3.6/site-packages/fool/__init__.py", line 36, in cut
    _check_model()
  File "/usr/local/lib/python3.6/site-packages/fool/__init__.py", line 26, in _check_model
    LEXICAL_ANALYSER.load_model()
  File "/usr/local/lib/python3.6/site-packages/fool/lexical.py", line 75, in load_model
    self.map = DataMap(os.path.join(data_path, "maps.pkl"))
  File "/usr/local/lib/python3.6/site-packages/fool/lexical.py", line 20, in __init__
    self._load(path)
  File "/usr/local/lib/python3.6/site-packages/fool/lexical.py", line 26, in _load
    with open(path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/fool/maps.pkl'

Process finished with exit code 1
