rockyzhengwu / foolnltk
A Chinese Nature Language Toolkit
License: Apache License 2.0
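As in the title.
How do I train my own entity recognition model? Following the documentation, I successfully trained my own word segmentation model. Is there a description of the training process for entity recognition?
Can training improve the accuracy of entity recognition?
I ran the example in the README to segment a Chinese sentence, but a TypeError occurred, as follows: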
$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fool
>>> text="一个傻子在北京"
>>> print(fool.cut(text))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/fool/__init__.py", line 65, in cut
all_words = LEXICAL_ANALYSER.cut(text)
File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 98, in cut
self._load_seg_model()
File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 39, in _load_seg_model
self.seg_model = self._load_model("seg.pb", "char_map", "seg_map")
File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 35, in _load_model
char_to_id, id_to_seg = _load_map_file(self.map_file_path, word_map_name, tag_name)
File "/usr/local/lib/python3.5/dist-packages/fool/lexical.py", line 18, in _load_map_file
data = json.load(myfile)
File "/usr/lib/python3.5/json/__init__.py", line 268, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
How can I handle this error? Many thanks!
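One possible workaround, sketched under assumptions: on Python 3.5, `json.loads` accepts only `str`, so bytes read from a binary file object can be decoded before parsing. The helper name `load_json_bytes`, the UTF-8 encoding, and the sample map content are assumptions for illustration and are not part of fool.

```python
import io
import json


def load_json_bytes(binary_file, encoding="utf-8"):
    """Parse JSON from a file object opened in binary mode.

    Decoding explicitly keeps this working on Python 3.5, where
    json.load() rejects bytes input.
    """
    raw = binary_file.read()
    return json.loads(raw.decode(encoding))


# In-memory stand-in for a map file (hypothetical content).
fake_map_file = io.BytesIO('{"一": 1, "个": 2}'.encode("utf-8"))
char_to_id = load_json_bytes(fake_map_file)
```

Upgrading to Python 3.6 or later, where `json.loads` also accepts `bytes`, may avoid the error without code changes.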
FYI:
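The labels are B (word start), M (word middle), E (word end), and S (single-character word), so the final model can only do word segmentation. How are the other features of the module implemented, then?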
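To make the B/M/E/S scheme concrete, here is a minimal sketch of how a tag sequence can be turned back into words. The function name `bmes_to_words` and the sample tags are illustrative assumptions; this is not taken from fool's source.

```python
def bmes_to_words(chars, tags):
    """Group characters into words according to B/M/E/S tags."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            # A single-character word; flush any unfinished word first.
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            # Start of a new word; flush any unfinished word first.
            if current:
                words.append(current)
            current = ch
        elif tag == "M":
            current += ch
        else:  # "E" closes the current word
            words.append(current + ch)
            current = ""
    if current:
        words.append(current)
    return words


# Hand-written tags for illustration, not model output.
sentence = "一个傻子在北京"
tags = ["B", "E", "B", "E", "S", "B", "E"]
words = bmes_to_words(sentence, tags)
```

The same decoding idea carries over to other tag sets: a sequence labeller with a different tag inventory (for example entity tags) could be decoded in a similar way, though whether fool does so is not stated here.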
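I tried it and the accuracy seems quite high. Thumbs up!
One question, though: I don't know which NER types this project recognizes or what each of them means exactly. For example, does the product entity refer to an actual product such as a car model, or does it cover every product of human ingenuity, such as a book? A clear definition would be appreciated, thanks!
Hello! word2vec can turn the corpus into vectors of a given dimension, but which tool was used to generate the dev.txt, test.txt, and train.txt files in the source? Do they use the 4-tag annotation scheme (is that what the corpus is)? And how can it be extended with additional characters?
I get: ERROR: Cannot uninstall 'wrapt'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
I can't install it. Please help.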
How to train this model?
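Hello, during testing I sometimes get:
TypeError: 'NoneType' object is not callable
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x0000000019C8A208>>
When I run it again, it works normally. My platform is TensorFlow 1.4.0 on Windows 7. Is this a problem with my versions?
Hello, how can I make sure that words added to the custom dictionary are always segmented as words? In other words, how exactly is the custom dictionary processed?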
@rockyzhengwu
I have a question.
with tf.variable_scope("crf_loss" if not name else name):
    small = -1000.0
    start_logits = tf.concat(
        [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])],
        axis=-1)
    pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
    logits = tf.concat([project_logits, pad_logits], axis=-1)
    logits = tf.concat([start_logits, logits], axis=1)
    targets = tf.concat(
        [tf.cast(self.num_tags * tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)
    self.trans = tf.get_variable(
        "transitions",
        shape=[self.num_tags + 1, self.num_tags + 1],
        initializer=self.initializer)
    log_likelihood, self.trans = crf_log_likelihood(
        inputs=logits,
        tag_indices=targets,
        transition_params=self.trans,
        sequence_lengths=lengths + 1)
When building the CRF layer, what is the purpose of padding the predictions and the targets with an extra dimension filled with -1000?
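One reading of the snippet above, offered as an interpretation rather than the author's confirmed intent: the extra tag index `num_tags` behaves like an artificial start tag. A score of -1000 makes the start tag effectively impossible at real time steps and makes every real tag effectively impossible at the prepended step, so the `(num_tags + 1) x (num_tags + 1)` transition matrix can also learn start-to-tag scores. The pure-Python sketch below mirrors the shapes for a single sequence without a batch dimension; `pad_for_start_tag` is a hypothetical name.

```python
def pad_for_start_tag(step_logits, targets, small=-1000.0):
    """Prepend a start step and append a start-tag column.

    step_logits: list of per-step score lists, each of length num_tags.
    targets:     list of gold tag ids, one per step.
    """
    num_tags = len(step_logits[0])
    # Prepended step: real tags get `small`, the start tag gets 0.
    start_row = [small] * num_tags + [0.0]
    # Real steps: the extra start-tag column gets `small`.
    padded_logits = [start_row] + [list(row) + [small] for row in step_logits]
    # The gold tag of the prepended step is the start tag (id num_tags).
    padded_targets = [num_tags] + list(targets)
    return padded_logits, padded_targets


logits, gold = pad_for_start_tag([[0.1, 0.2], [0.3, 0.4]], [1, 0])
```

The sequence grows by one step, which matches the `sequence_lengths=lengths + 1` argument in the quoted code.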
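From what I can see, segmentation uses a CRF, and words are built from single characters tagged B, M, and E.
But how is the relationship between words and entities established?
My environment:
ubuntu 18.04
The error is as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 1: invalid start byte
It looks like the ICTPOS3.0 part-of-speech tag set.
The following error is raised while the project runs:
FileNotFoundError: [Errno 2] No such file or directory:
'/usr/fool/map.zip'
Looking at the code, these files may also need to be loaded: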
def _load_seg_model(self):
    self.seg_model = self._load_model("seg.pb", "char_map", "seg_map")
def _load_pos_model(self):
    self.pos_model = self._load_model("pos.pb", "word_map", "pos_map")
def _load_ner_model(self):
    self.ner_model = self._load_model("ner.pb", "char_map", "ner_map")
Could the author please provide the relevant download package? Much appreciated.
COOL! Is there no training code?
import fool
text = ""
print(fool.cut(text))
/Users/howie/anaconda3/envs/python36/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
starting load model
2017-12-28 21:16:02.363773: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
loaded model cost : 1.258011s
Traceback (most recent call last):
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnimplementedError: TensorArray has size zero, but element shape [?,100] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,100], _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/range, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/while/Exit_1)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/howie/Documents/programming/python36/hi/01.py", line 10, in <module>
print(fool.cut(text))
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/__init__.py", line 37, in cut
words, _, _ = LEXICAL_ANALYSER.cut(text)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/lexical.py", line 121, in cut
seg_path = self.seg_model.predict(input_chars)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/predictor.py", line 59, in predict
logits, trans = self.sess.run([self.logits, self.trans], feed_dict=feed_dict)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnimplementedError: TensorArray has size zero, but element shape [?,100] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,100], _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/range, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/while/Exit_1)]]
Caused by op 'prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3', defined at:
File "/Users/howie/Documents/programming/python36/hi/01.py", line 10, in <module>
print(fool.cut(text))
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/__init__.py", line 36, in cut
_check_model()
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/__init__.py", line 26, in _check_model
LEXICAL_ANALYSER.load_model()
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/lexical.py", line 78, in load_model
self.seg_model = Predictor(os.path.join(data_path, "seg.pb"), self.map.num_seg)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/predictor.py", line 42, in __init__
self.graph = load_graph(model_path)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/fool/predictor.py", line 36, in load_graph
tf.import_graph_def(graph_def, name="prefix")
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 313, in import_graph_def
op_def=op_def)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/Users/howie/anaconda3/envs/python36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
UnimplementedError (see above for traceback): TensorArray has size zero, but element shape [?,100] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray"], dtype=DT_FLOAT, element_shape=[?,100], _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArray, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/TensorArrayStack/range, prefix/char_BiLSTM/bidirectional_rnn/bw/bw/while/Exit_1)]]
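The traceback above is triggered by passing an empty string to `fool.cut`. Until that case is handled upstream, a caller-side guard is one option. The sketch below is an assumption-laden illustration: `safe_cut` is a hypothetical helper, returning an empty list for empty input is a design choice made here, and whether whitespace-only input also fails is not confirmed by the report.

```python
def safe_cut(text, cut_fn):
    """Call cut_fn only when text has non-whitespace content.

    Returns an empty list for empty or whitespace-only input instead of
    handing a zero-length sequence to the model.
    """
    if not text or not text.strip():
        return []
    return cut_fn(text)


# Stand-in segmenter for demonstration; with fool installed one would
# pass fool.cut instead (assumption: it takes a single string).
def fake_cut(text):
    return [list(text)]


empty_result = safe_cut("", fake_cut)
normal_result = safe_cut("北京", fake_cut)
```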
I trained the model and exported it following the instructions.
I wrote a script to load the model, as in the "load model" section.
The error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf7 in position 1: invalid start byte"
occurred when execution reached this line:
fool.load_model(map_file=map_file, model_file=checkpoint_ifle)
I fixed it by changing load_graph in fool/model.py from
with tf.gfile.GFile(path) as f:
to
with tf.gfile.GFile(path,'rb') as f:
Please fix it, thank you!
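The underlying issue can be reproduced without TensorFlow: a serialized model is binary data, and reading it in text mode forces a UTF-8 decode that fails on bytes such as 0xf7. The sketch below uses only the standard library; the payload bytes and the helper name `read_model_bytes` are made up for illustration.

```python
import os
import tempfile


def read_model_bytes(path):
    """Read a serialized model file as raw bytes (binary mode)."""
    with open(path, "rb") as f:
        return f.read()


# Made-up binary payload whose second byte is not valid UTF-8.
payload = b"\n\xf7\x01\x00"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    model_path = tmp.name

# Text mode tries to decode the bytes and fails.
try:
    with open(model_path, encoding="utf-8") as f:
        f.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode returns the bytes unchanged.
model_bytes = read_model_bytes(model_path)
os.remove(model_path)
```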
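Which standard is the part-of-speech tagging in your code based on?
Which version of TensorFlow are you using? When I run the NER module I get "No module named 'tensorflow.contrib.crf'". Do you know what causes this? Thanks.
Which features does the CRF use for the part-of-speech tagging task?
If I train a model myself, which features does the default CRF use?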
Just change the following in model.py under the fool directory:
def load_map(path):
    with open(path, 'rb') as f:
        char_to_id, tag_to_id, id_to_tag = pickle.load(f)
    return char_to_id, id_to_tag
to:
def load_map(path):
    with open(path) as f:
        char_to_id, tag_to_id, id_to_tag = pickle.load(f)
    return char_to_id, id_to_tag
and it works.
I'm trying it on Windows and get a UnicodeDecodeError. Could we replace "open(path)" with "codecs.open(path, encoding='utf-8')", e.g. at line 21 in dictionary.py, to avoid this? Thanks.
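A minimal sketch of the idea, assuming the dictionary file is UTF-8 text with one "word weight" pair per line (the format shown in other reports in this thread). On Windows the default text encoding is often not UTF-8, so naming the encoding explicitly avoids depending on the locale. `read_user_dict` is a hypothetical helper, not fool's own loader.

```python
import io
import os
import tempfile


def read_user_dict(path, encoding="utf-8"):
    """Read 'word weight' lines using an explicit encoding."""
    entries = []
    with io.open(path, encoding=encoding) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            word, weight = parts
            entries.append((word, int(weight)))
    return entries


# Write a small sample dictionary as UTF-8 and read it back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("二十 10000\n四百 10000\n".encode("utf-8"))
    dict_path = tmp.name

user_entries = read_user_dict(dict_path)
os.remove(dict_path)
```

`io.open` with `encoding=` behaves the same on Python 2 and Python 3, which is one reason to prefer it over a bare `open(path)` here.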
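Thank you very much for sharing. I would like to use your code to reproduce the training process as a learning exercise. Could you provide the training set?
Which labels does the ner method produce? I have seen location, org, and company. Are there others?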
The user dictionary usually has an obvious effect, but not always. For example:
Contents of user_dict.txt:
二十 10000
四百 10000
一千 10000
When I run the command:
echo "一一千四百二十九" | python3 -m fool -d -u user_dict.txt
the result is "一一千四百二十九", which is not split at all.
But if the sentence to be cut is "一千四百二十九", it is cut into "一千 四百 二十 九". That's what I want.
How can these results be explained, and what can I do if I want the result to be "一 一千 四百 二十 九"?
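Will there be big changes in the future? I hope it stays as simple and practical as it is now.
Also:
Can a custom part-of-speech dictionary be supported?
Can a custom entity dictionary be supported?
Thanks~
Hello. Regarding part-of-speech tagging and entity recognition, what corpus does foolnltk use? Can your training process be reused to train an entity recognition model?
Hello, if I don't train a model for now, do your files include a pretrained model? I would like to use it directly to test word segmentation.
In the training code, in loss_layer of the bi_lstm file, why are the logits and targets padded with an extra dimension before the CRF is computed?
After running pip3 install foolnltk
and executing the example code,
a FileNotFoundError is raised.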
Hi all,
I want to integrate FoolNLTK into my project for personal use, or maybe into a business product.
So, what is the license of the FoolNLTK source code, model, and corpus?
Can the source code, model, and corpus be licensed separately?
Thanks in advance.
How can I suppress these annoying warnings? Note that fool itself also triggers warnings.
np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING: Logging before flag parsing goes to stderr.
W0804 17:59:34.962490 140736280798080 deprecation_wrapper.py:119] From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fool/predictor.py:32: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.
W0804 17:59:34.962846 140736280798080 deprecation_wrapper.py:119] From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fool/predictor.py:33: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.
W0804 17:59:35.032593 140736280798080 deprecation_wrapper.py:119] From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/fool/predictor.py:53: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2019-08-04 17:59:35.033006: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
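A hedged sketch of one way to quiet part of this output using only standard mechanisms: Python's `warnings` filter for the NumPy `FutureWarning` lines, and the `TF_CPP_MIN_LOG_LEVEL` environment variable for TensorFlow's C++ log lines such as the cpu_feature_guard message. Both must be set before TensorFlow (and therefore fool) is imported. The `W0804 ... deprecation_wrapper.py` lines come from TensorFlow's Python-side logger and may need a separate setting that is not covered here. `silence_tf_noise` is a hypothetical helper name.

```python
import os
import warnings


def silence_tf_noise():
    """Reduce import-time noise; call before importing tensorflow/fool."""
    # "2" hides INFO and WARNING messages from TensorFlow's C++ logging.
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
    # Hide FutureWarning messages such as the NumPy dtype deprecations.
    warnings.filterwarnings("ignore", category=FutureWarning)


silence_tf_noise()
# import fool  # import only after the filters are in place
```

Suppressing warnings hides them rather than fixing their cause, so it may be worth re-enabling them when upgrading NumPy or TensorFlow.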
Hello, which standard part-of-speech tag table did you use for part-of-speech tagging?
When I train the model myself, how do I assign a specific GPU to use?
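A common, framework-independent approach is to restrict which GPUs the process can see through the `CUDA_VISIBLE_DEVICES` environment variable, set before TensorFlow is imported. This is a general CUDA mechanism and not something specific to this project's training script, whose own options are not described here. `select_gpu` is a hypothetical helper name.

```python
import os


def select_gpu(index):
    """Expose only the GPU with the given index to this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


select_gpu(0)
# import tensorflow as tf  # import only after the variable is set
```

The same effect can be had from the shell by prefixing the training command with `CUDA_VISIBLE_DEVICES=0`.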
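Hello, when using foolnltk I found that loading a user dictionary has no effect, and I don't know why. Details:
Environment: win10 + python3.6
fool.analysis('阿里收购饿了么')
Returns: ([[('阿里', 'nz'), ('收购', 'v'), ('饿', 'v'), ('了', 'y'), ('么', 'y')]], [[(0, 3, 'company', '阿里')]])
User dictionary format:
饿了么 10
fool.load_userdict(path)
fool.analysis('阿里收购饿了么')
Returns: ([[('阿里', 'nz'), ('收购', 'v'), ('饿', 'v'), ('了', 'y'), ('么', 'y')]], [[(0, 3, 'company', '阿里')]])
Loading the user dictionary seems to have no effect? "饿了么" is still split apart during segmentation, and it is not recognized in entity recognition either.
I see that you added a row of zero vectors to char_embeding, but I found that this causes misaligned indices at lookup time.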
# zero_pad = tf.constant(0.0, dtype=tf.float32, shape=[1, config["char_dim"]])
# self.char_embeding = tf.concat(axis=0, values=[zero_pad, tf.get_variable(name="char_embeding", initializer=embeddings)])
self.char_embeding = tf.get_variable(name="char_embeding", initializer=embeddings)
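It is deployed on a server, and processing time varies widely: sometimes 1 second, sometimes ten to twenty seconds. What is going on?
Hello,
When I train a model on my own corpus, I get the following error:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [6299,100] rhs shape= [1104,100]
(This happens at the train step; the earlier tfrecord steps all passed.)
How can I solve this?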
/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/altman/IdeaProjects/gopath/src/labs/fool_test/test.py
/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6
return f(*args, **kwds)
starting load model
Traceback (most recent call last):
File "/Users/altman/IdeaProjects/gopath/src/labs/fool_test/test.py", line 4, in <module>
print(fool.cut(text))
File "/usr/local/lib/python3.6/site-packages/fool/__init__.py", line 36, in cut
_check_model()
File "/usr/local/lib/python3.6/site-packages/fool/__init__.py", line 26, in _check_model
LEXICAL_ANALYSER.load_model()
File "/usr/local/lib/python3.6/site-packages/fool/lexical.py", line 75, in load_model
self.map = DataMap(os.path.join(data_path, "maps.pkl"))
File "/usr/local/lib/python3.6/site-packages/fool/lexical.py", line 20, in __init__
self._load(path)
File "/usr/local/lib/python3.6/site-packages/fool/lexical.py", line 26, in _load
with open(path, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/fool/maps.pkl'
Process finished with exit code 1
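Following the author's steps, I trained and obtained maps.pkl and model.pb, but they can only be used to build a model for predict. How can I use them for word segmentation? Thanks.
As in the title.
Running the ner call in test gives:
ners: [[(2, 8, 'location', '北京***')], [(2, 5, 'location', '北京'), (9, 12, 'location', '非洲')], [], [(2, 8, 'location', '北京***')]]
==> But the text indices of entities such as '北京***' should actually be (2, 7).
I have not yet looked into whether there is some other intent behind this.
If the values are understood simply as the word's indices in the sentence, the end index seems to be one too large.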
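For comparison, here is a minimal sketch of the convention the report seems to expect, where `text[start:end]` reproduces the entity (end exclusive). In the output above, '北京' has length 2 but is reported as (2, 5), which suggests the returned end is `start + len(entity) + 1`; whether that is intentional is not stated. `locate_entity` and the sample sentence are hypothetical.

```python
def locate_entity(text, entity, search_from=0):
    """Return (start, end) with end exclusive, or None if not found."""
    start = text.find(entity, search_from)
    if start == -1:
        return None
    return start, start + len(entity)


# Hypothetical sentence; '北京' starts at character index 2.
sample_text = "我在北京工作"
span = locate_entity(sample_text, "北京")
```

With this convention the span can be checked directly by slicing, which makes off-by-one differences easy to spot.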
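Hello. I have recently been using your code for Chinese word segmentation. After loading a custom dictionary, segmentation handles those words correctly, but entity recognition does not appear to load the custom dictionary.
I would also like to ask whether the loaded dictionary could support custom part-of-speech tags.
Otherwise it really would be a "fool"-style segmentation tool.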