
brikerman / kashgari


Kashgari is a production-level NLP transfer learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT-2 language embeddings.

Home Page: http://kashgari.readthedocs.io/

License: Apache License 2.0

Python 99.36% Shell 0.64%
nlp sequence-labeling text-classification bert-model ner machine-learning nlp-framework named-entity-recognition gpt-2 transfer-learning

kashgari's Introduction


🎉🎉🎉 We released the 2.0.0 version with TF2 Support. 🎉🎉🎉

If you use this project for your research, please cite:

@misc{Kashgari,
  author = {Eliyar Eziz},
  title = {Kashgari},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/BrikerMan/Kashgari}}
}

Overview

Kashgari is a simple and powerful NLP transfer learning framework that lets you build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.

  • Human-friendly. Kashgari's code is straightforward, well documented, and tested, which makes it very easy to understand and modify.
  • Powerful and simple. Kashgari allows you to apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), and text classification.
  • Built-in transfer learning. Kashgari ships with pre-trained BERT and Word2Vec embedding models, which makes it very simple to use transfer learning when training your model.
  • Fully scalable. Kashgari provides a simple, fast, and scalable environment for experimentation: train your models and try new approaches using different embeddings and model structures.
  • Production ready. Kashgari can export models in the SavedModel format for TensorFlow Serving, so you can deploy them directly to the cloud (a minimal usage sketch follows below).
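To make the bullets above concrete, here is a minimal quick-start sketch. It assumes the kashgari 2.x API and the bundled SMP2018ECDTCorpus; class names differ in the 1.x releases discussed in some of the issues below.

# A minimal quick-start sketch, assuming the kashgari 2.x API and the
# bundled SMP2018ECDTCorpus (class names differ in kashgari 1.x).
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, epochs=5)
model.save('classification_model')  # a model folder you can reload later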

Our Goal

  • Academic users: easier experimentation to test hypotheses without coding from scratch.
  • NLP beginners: learn how to build an NLP project with production-level code quality.
  • NLP developers: build a production-level classification/labeling model within minutes.

Performance

Performance reports are welcome; feel free to add yours.

Task                      Language  Dataset                     Score
Named Entity Recognition  Chinese   People's Daily NER Corpus   95.57
Text Classification       Chinese   SMP2018ECDTCorpus           94.57

Installation

The project is based on Python 3.6+, because it is 2019 and type hinting is cool.

Backend            kashgari version                        desc
TensorFlow 2.2+    pip install 'kashgari>=2.0.2'           TF2.2+ with tf.keras
TensorFlow 1.14+   pip install 'kashgari>=1.0.0,<2.0.0'    TF1.14+ with tf.keras
Keras              pip install 'kashgari<1.0.0'            keras version

You also need to install a matching tensorflow_addons release alongside TensorFlow:

TensorFlow version         tensorflow_addons version
TensorFlow 2.1             pip install tensorflow_addons==0.9.1
TensorFlow 2.2             pip install tensorflow_addons==0.11.2
TensorFlow 2.3, 2.4, 2.5   pip install tensorflow_addons==0.13.0
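A quick runtime sanity check that the installed versions match one of the rows above (the __version__ attributes are standard for these packages; treat the exact pairing as per the tables):

# Print the installed versions to compare against the compatibility tables above.
import tensorflow as tf
import tensorflow_addons as tfa
import kashgari

print('tensorflow        :', tf.__version__)
print('tensorflow_addons :', tfa.__version__)
print('kashgari          :', kashgari.__version__)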

Tutorials

Here is a set of quick tutorials to get you started with the library:

There are also articles and posts that illustrate how to use Kashgari:

Examples:

Contributors ✨

Thanks go to these wonderful people. There are many ways to get involved: start with the contributor guidelines and then check the open issues for specific tasks.

kashgari's People

Contributors

adline125, alexwwang, allcontributors[bot], bradfora, bratao, brikerman, cyberzhg, dependabot[bot], echan00, eryueniaobp, fossabot, haoyuhu, lemoz, lsgrep, mangopomelo, monkeywithacupcake, nirantk, sharpkoi, sunyancn, zxy199803


kashgari's Issues

[Question] TPU support for BERT

Hi there,
I am running a classification problem using BERT embeddings on Colab. How can I make Kashgari support TPU?
Thanks!
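Not a Kashgari-specific answer, just the generic TF 2.x TPU bootstrap used on Colab (TF 2.3+ names), sketched here as a starting point; whether a given Kashgari model trains correctly inside the strategy scope is exactly what this issue asks about.

# Generic Colab TPU setup for TF 2.3+ (an assumption, not a documented Kashgari feature).
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    pass  # build and compile the model here so its variables are placed on the TPU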

[BUG] Error when evaluating a multi-label model

Loading a multi-label classification model to evaluate the test set

I trained a multi-label classification project; training works fine, but running

best_model.evaluate(test_x, test_y)

throws the following error:

ValueErrorTraceback (most recent call last)
<ipython-input-2-63b84ac3fbe4> in <module>
----> 1 train(seq_len=300, epochs=100)

<ipython-input-1-7c2988bf06e3> in train(base_dir, seq_len, epochs, batch_size)
     23     model = FastTextModel(embedding, multi_label=True)
     24     data = [train_x, train_y, valid_x, valid_y, valid_x, valid_y]
---> 25     helper.train(model, data, base_dir, epochs, batch_size, 'val_categorical_accuracy')
     26 
     27 

/notebooks/info_extract/helper.py in train(model, data, base_dir, epochs, batch_size, monitor)
    111               fit_kwargs={'callbacks': callback_lists})
    112     best_model = model.load_model(os.path.join(base_dir, 'best_model'))
--> 113     best_model.evaluate(test_x, test_y)
    114     export_saved_model(best_model.model, base_dir)

/usr/local/lib/python3.6/dist-packages/kashgari/tasks/classification/base_model.py in evaluate(self, x_data, y_data, batch_size, digits, debug_info)
    327     def evaluate(self, x_data, y_data, batch_size=None, digits=4, debug_info=False) -> Tuple[float, float, Dict]:
    328         y_pred = self.predict(x_data, batch_size=batch_size)
--> 329         report = metrics.classification_report(y_data, y_pred, output_dict=True, digits=digits)
    330         print(metrics.classification_report(y_data, y_pred, digits=digits))
    331         if debug_info:

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict)
   1522     """
   1523 
-> 1524     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1525 
   1526     labels_given = True

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py in _check_targets(y_true, y_pred)
     70     """
     71     check_consistent_length(y_true, y_pred)
---> 72     type_true = type_of_target(y_true)
     73     type_pred = type_of_target(y_pred)
     74 

/usr/local/lib/python3.6/dist-packages/sklearn/utils/multiclass.py in type_of_target(y)
    260         if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
    261                 and not isinstance(y[0], string_types)):
--> 262             raise ValueError('You appear to be using a legacy multi-label data'
    263                              ' representation. Sequence of sequences are no'
    264                              ' longer supported; use a binary array or sparse'

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
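For context on the error message, this is the multi-label representation scikit-learn expects: a binary indicator matrix rather than lists of label strings. The sketch below is a generic illustration with made-up labels, not a description of what Kashgari does internally.

# Convert list-of-label-lists into the binary indicator matrix sklearn expects.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn import metrics

y_true = [['a'], ['b', 'c'], ['a', 'c']]
y_pred = [['a'], ['b'], ['a', 'c']]

mlb = MultiLabelBinarizer()
y_true_bin = mlb.fit_transform(y_true)  # shape: (n_samples, n_classes)
y_pred_bin = mlb.transform(y_pred)

print(metrics.classification_report(y_true_bin, y_pred_bin, target_names=mlb.classes_))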

[Question] How to use my own dataset for NER

My dataset has the same format as the People's Daily data in ChinaPeoplesDailyNerCorpus, but the entity categories are different (15 types). Is it enough to process my data with get_sequence_tagging_data and feed it in directly, or do I need to add a tag dictionary myself?
With my dataset, the original BERT model only reaches an acc of 0.67, while your BLSTMCRF model reached a crf_acc of 0.91 yesterday. How can I tell whether I made a mistake somewhere, or whether the BLSTMCRF model really is that much better?
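As a rough illustration of the expected format (one "token tag" pair per line, blank lines between sentences), a reader like the sketch below produces the (x, y) lists that the sequence-labeling models take. The file path is a placeholder for your own data; whether a separate tag dictionary is also needed is left to the maintainers to answer.

# Sketch: read a People's-Daily-style tagging file into token/tag sequences.
# 'my_ner/train.txt' is a placeholder path.
def read_tagging_file(path):
    x, y = [], []
    tokens, tags = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line = sentence boundary
                if tokens:
                    x.append(tokens)
                    y.append(tags)
                    tokens, tags = [], []
                continue
            token, tag = line.split()[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:                                # last sentence without a trailing blank line
        x.append(tokens)
        y.append(tags)
    return x, y

train_x, train_y = read_tagging_file('my_ner/train.txt')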

[BUG] Listcomp on Line 133 for multi_label classification in base_model.py may have a bug

Check List

Thanks for considering to open an issue. Before you submit your issue, please confirm these boxes are checked.

Environment

  • OS [e.g. Mac OS, Linux]:
  • requirements.txt:
Not OS-related.

Issue Description

What

unhashable type: 'list' when dealing with multi-label y data.

Reproduce

For multi_label classification tasks, if the label data (y) has a structure like [ list[str] ], e.g. [ ['a'], ['b', 'c'], ['a', 'c'] ], this error occurs (a minimal repro follows below).
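A framework-free repro of the same failure mode, on the assumption that the label values end up being hashed somewhere (e.g. used as dict keys or set members):

# Hashing list-valued labels raises the reported error.
y = [['a'], ['b', 'c'], ['a', 'c']]
try:
    set(y)                # hashing a list element fails
except TypeError as err:
    print(err)            # unhashable type: 'list'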

Other Comment

I will revise the related method, namely ClassificationModel.convert_idx_to_label.

Could you please check whether the test suite covers this case? I have my doubts.

[BUG] Incorrect val_acc results with BLSTMCRFModel

val_acc is computed incorrectly.

The test code uses the example from the article 《NLP - 基于 BERT 的中文命名实体识别(NER)》 (NLP - Chinese named entity recognition with BERT).

from kashgari.corpus import *
train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')
validate_x, validate_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('validate')
test_x, test_y  = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('test')

print(f"train data count: {len(train_x)}")
print(f"validate data count: {len(validate_x)}")
print(f"test data count: {len(test_x)}")

from kashgari.embeddings import BERTEmbedding
embedding = BERTEmbedding('./bert', 128)
from kashgari.tasks.seq_labeling import BLSTMCRFModel
model = BLSTMCRFModel(embedding)
model.fit(train_x,
          train_y,
          y_validate=validate_y,
          x_validate=validate_x,
          epochs=10,
          batch_size=500)

Partial results:

Epoch 1/10
41/41 [==============================] - 105s 3s/step - loss: 0.2520 - crf_accuracy: 0.9303 - 
acc: 0.6245 - val_loss: 0.0724 - val_crf_accuracy: 0.9789 - val_acc: 0.9789
Epoch 2/10
41/41 [==============================] - 100s 2s/step - loss: 0.0548 - crf_accuracy: 0.9838 - 
acc: 0.6246 - val_loss: 0.0357 - val_crf_accuracy: 0.9898 - val_acc: 0.9898

val_crf_accuracy and val_acc give identical results.

After uninstalling TensorFlow 1.11.0 and installing only tensorflow-gpu (1.11.0), training fails with an error

Using TensorFlow backend.
Traceback (most recent call last):
File "train.py", line 235, in <module>
from kashgari.embeddings import BERTEmbedding
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/kashgari/__init__.py", line 13, in <module>
import kashgari.embeddings
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/kashgari/embeddings/__init__.py", line 13, in <module>
from .embeddings import BERTEmbedding
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/kashgari/embeddings/embeddings.py", line 18, in <module>
import keras_bert
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras_bert/__init__.py", line 1, in <module>
from .bert import gelu, get_model, get_custom_objects, get_base_dict, gen_batch_inputs
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras_bert/bert.py", line 2, in <module>
import keras
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/__init__.py", line 3, in <module>
from . import utils
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/utils/__init__.py", line 6, in <module>
from . import conv_utils
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9, in <module>
from .. import backend as K
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/backend/__init__.py", line 89, in <module>
from .tensorflow_backend import *
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 6, in <module>
from tensorflow.python.framework import ops as tf_ops
ModuleNotFoundError: No module named 'tensorflow.python'

[BUG] Different behavior in 0.1.8 and 0.2.1

Environment

  • Colab.research.google.com
  • Kashgari 0.1.8 / 0.2.1

Issue Description

Different behavior in 0.1.8 and 0.2.1.
In Kashgari 0.1.8, BLSTMModel converges during training and I see val_acc: 0.98 and train acc: 0.9594.
In Kashgari 0.2.1, BLSTMModel overfits and I see val_acc ~0.5 and train acc ~0.96.
There is no difference in my code, only the version of the library.

Reproduce

code:

from sklearn.model_selection import train_test_split
import pandas as pd
import nltk
from kashgari.tasks.classification import BLSTMModel

# get and process data
!wget https://www.dropbox.com/s/265kphxkijj1134/fontanka.zip

df1 = pd.read_csv('fontanka.zip')
df1.fillna(' ', inplace = True)
nltk.download('punkt')

# split on train/test
X_train, X_test, y_train, y_test = train_test_split(df1.full_text[:3570].values, df1.textrubric[:3570].values, test_size=0.2, random_state=42)
X_train = [nltk.word_tokenize(sentence) for sentence in X_train]
X_test  = [nltk.word_tokenize(sentence) for sentence in X_test]
y_train = y_train.tolist()
y_test  = y_test.tolist()

# train model
model = BLSTMModel()
model.fit(X_train, y_train, x_validate=X_test, y_validate=y_test, epochs = 10)

code in colab:
https://colab.research.google.com/drive/1yTBMeiBl2y7-Yw0DS_vTn2A4y_Vj3N-8

Result

Last epoch:

Kashgari 0.1.8

Epoch 10/10
55/55 [==============================] - 90s 2s/step - loss: 0.1378 - acc: 0.9615 - val_loss: 0.0921 - val_acc: 0.9769

Kashgari 0.2.1

Epoch 10/10
44/44 [==============================] - 76s 2s/step - loss: 0.0990 - acc: 0.9751 - val_loss: 2.3739 - val_acc: 0.5323

Other Comment

In 0.2.1 all models are now in separate files and the lr hyperparameter is given explicitly (1e-3).
In 0.1.8 the lr hyperparameter was omitted; I suppose it used the Keras default, which is the same (1e-3).

Also, in 0.1.8 you had (dense size = number of classes + 1 on the classifier), see
#21, and it was omitted in 0.2.1. I don't see how this could affect the training process.

I couldn't find any more differences between the versions. Could you help with this - why did the models start to overfit in the new version of the library?

[Question] How to save and load the best model?

I want to load the best model file rather than the one from the last epoch, but ModelCheckpoint does not take effect here.

I added a function inside class ClassificationModel(BaseModel):

def load_weights(self, model_path):
        return self.model.load_weights(model_path)

Then I call it like this:

early_stopping = EarlyStopping(monitor='val_loss',min_delta=0.01, patience=5, mode='min', verbose=1)
reduce_lr = ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=5, min_lr=0.0001, verbose=2)
bst_model_path = 'weight_%d.h5' % count
checkpoint = ModelCheckpoint(bst_model_path, monitor='val_loss', mode='min',
                                       save_best_only=True, verbose=1, save_weights_only=True)
callbacks = [checkpoint,reduce_lr,early_stopping]
hist = model.fit(x_train,y_train,
                     validation_data=(x_val, y_val),
                     epochs=4, batch_size=512,
#                      class_weight="auto",
#                      callbacks=callbacks,
                     fit_kwargs={"callbacks":callbacks,"verbose":1}
                     
                     )
model.load_weights(bst_model_path)

But it reports that the weight_0.h5 file does not exist, which means ModelCheckpoint was never invoked.

[Question] Import error when running the text classification example

When trying the example, both of these lines raise an error:
#from kashgari.embeddings import BERTEmbedding

from kashgari.tasks.classification import CNNModel

#embedding = BERTEmbedding('/home/xwjia/bert-base-chinese', sequence_length=400)

Traceback (most recent call last):

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 2, in
from kashgari.tasks.classification import CNNModel

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/kashgari/init.py", line 13, in
import kashgari.embeddings

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/kashgari/embeddings/init.py", line 13, in
from .embeddings import BERTEmbedding

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/kashgari/embeddings/embeddings.py", line 65
self._token2idx: Dict[str, int] = None
^
SyntaxError: invalid syntax
It points to this line and I really don't know how to fix it. Hoping for a reply - thank you.

[BUG] Wrong predictions from the NER example model

I just trained an NER model following the NER tutorial article, but its predictions are wrong and I'm not sure where things went wrong...

import jieba
from kashgari.tasks.seq_labeling import BLSTMCRFModel
from kashgari.corpus import ChinaPeoplesDailyNerCorpus
from kashgari.embeddings import BERTEmbedding
embedding = BERTEmbedding('/home/eee/sentence-alignment-classification-model/model/multi_cased_L-12_H-768_A-12', 100)

train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')
validate_x, validate_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('validate')
test_x, test_y  = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('test')

model = BLSTMCRFModel(embedding)
model.fit(train_x,
          train_y,
          validate_y=validate_y,
          validate_x=validate_x,
          epochs=200,
          batch_size=500)
model.save('./model')

new_model = BLSTMCRFModel.load_model('./model')

# EXAMPLE 1
news = "「DeepMind 击败人类职业玩家的方式与他们声称的 AI 使命,以及所声称的『正确』方式完全相反。」"
x = list(jieba.cut(news))
>>> x
['「', 'DeepMind', ' ', '击败', '人类', '职业', '玩家', '的', '方式', '与', '他们', '声称', '的', ' ', 'AI', ' ', '使命', ',', '以及', '所', '声称', '的', '『', '正确', '』', '方式', '完全', '相反', '。', '」']
>>> new_model.predict(x)                                                                                                     
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

# EXAMPLE 2
news = "陈志衍是有个非常好的男孩子,他住在香港的九龙塘区,他今年二十三号生日。"
x = list(jieba.cut(news))
>>> x
['陈志衍', '是', '有', '个', '非常', '好', '的', '男孩子', ',', '他', '住', '在', '香港', '的', '吉林', '区', ',', '他', '今年', '二十三', '号', '生日', '。']
>>> new_model.predict(x)
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
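One thing worth checking, offered as an assumption rather than a confirmed diagnosis: ChinaPeoplesDailyNerCorpus is tagged per character, so a model trained on it sees character tokens, while the examples above feed jieba word tokens. A character-level call would look like:

# Character-level input to match the per-character tags of the training corpus
# (an assumption about the cause, not a confirmed fix).
news = "陈志衍是有个非常好的男孩子,他住在香港的九龙塘区,他今年二十三号生日。"
x = [ch for ch in news if not ch.isspace()]
print(new_model.predict(x))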

[Question] How to serve the model with TensorFlow Serving

The problem has been solved - thanks to BrikerMan for the kind help. I hope this solution saves others some exploration time and lets them better enjoy the convenience this project provides!

I tried to export the trained model in the saved_model format with the following code:

import tensorflow as tf
from kashgari.tasks.seq_labeling import BLSTMCRFModel
from keras import backend as K

# K.set_learning_phase(1)
# the key change
K.set_learning_phase(0)

model = BLSTMCRFModel.load_model('./model')
legacy_init_op = tf.group(tf.tables_initializer())

xmodel = model.model

with K.get_session() as sess:
    export_path = './saved_model/14'
    builder = tf.saved_model.builder.SavedModelBuilder(export_path)

    signature_inputs = {
        'token_input': tf.saved_model.utils.build_tensor_info(xmodel.input[0]),
        'seg_input': tf.saved_model.utils.build_tensor_info(xmodel.input[1]),
    }

    signature_outputs = {
        tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES: tf.saved_model.utils.build_tensor_info(
            xmodel.output)
    }

    classification_signature_def = tf.saved_model.signature_def_utils.build_signature_def(
        inputs=signature_inputs,
        outputs=signature_outputs,
        method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME)

    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            'predict_webshell_php': classification_signature_def
        },
        legacy_init_op=legacy_init_op
    )

    builder.save()

After saving successfully, I load the saved_model for prediction and the results are all zeros. What could be the cause?
Calling code:

import json

import tensorflow as tf
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants

export_dir = './saved_model/14/'

with open('./model/words.json', 'r', encoding='utf-8') as f:
    dict = json.load(f)

s = ['[CLS]', '国', '正', '学', '长', '的', '文', '章', '与', '诗', '词', ',', '早', '就', '读', '过', '一', '些', ',', '很', '是', '喜',
     '欢', '。', '[CLS]']
s1 = [dict[x] for x in s]
if len(s1) < 100:
    s1 += [0] * (100 - len(s1))
print(s1)
s2 = [0] * 100

with tf.Session() as sess:
    meta_graph_def = tf.saved_model.loader.load(sess, [tag_constants.SERVING], export_dir)
    signature = meta_graph_def.signature_def

    x1_tensor_name = signature['predict_webshell_php'].inputs['token_input'].name
    x2_tensor_name = signature['predict_webshell_php'].inputs['seg_input'].name

    y_tensor_name = signature['predict_webshell_php'].outputs[
        signature_constants.CLASSIFY_OUTPUT_CLASSES].name
    x1 = sess.graph.get_tensor_by_name(x1_tensor_name)
    x2 = sess.graph.get_tensor_by_name(x2_tensor_name)
    y = sess.graph.get_tensor_by_name(y_tensor_name)
    result = sess.run(y, feed_dict={x1: [s1], x2: [s2]})  # predicted values
    print(result.argmax(-1))
    print(result.shape)


How does BERT produce its vectors?

Hello!
I have two questions and hope you can answer them.

  1. Does BertEmbedding produce sentence vectors or word (token) vectors?
  2. Where is the keras_bert code?
    Thank you!

[BUG] DataLossError: Checksum does not match: stored 3531060969 vs. calculated on the restored bytes 1701788620

After upgrading to tensorflow 1.13.1, running the program again fails with the following error:

DataLossErrorTraceback (most recent call last)
<ipython-input-1-34122276fdd3> in <module>
      9 
     10 # run the training task for the given number of epochs and save the best-performing model seen during training
---> 11 train(base_dir='./ner', epochs=2)

/notebooks/ner.py in train(base_dir, seq_len, epochs, batch_size, extend_flag)
     93 """
     94 def train(base_dir='./ner', seq_len=100, epochs=20, batch_size=100, extend_flag=True):
---> 95     embedding = BERTEmbedding('./bert', seq_len)
     96     train_x, train_y = get_sequence_tagging_data(os.path.join(base_dir, 'data/train.txt'))
     97     valid_x, valid_y = get_sequence_tagging_data(os.path.join(base_dir, 'data/valid.txt'))

/usr/local/lib/python3.6/dist-packages/kashgari/embeddings/embeddings.py in __init__(self, name_or_path, sequence_length, embedding_size, **kwargs)
     67         self._model: Model = None
     68         self._kwargs = kwargs
---> 69         self.build(**kwargs)
     70 
     71     def update(self, info: Dict[str, Any]):

/usr/local/lib/python3.6/dist-packages/kashgari/embeddings/embeddings.py in build(self)
    299         model = keras_bert.load_trained_model_from_checkpoint(config_path,
    300                                                               check_point_path,
--> 301                                                               seq_len=self.sequence_length)
    302         output_layer = NonMaskingLayer()(model.output)
    303         self._model = Model(model.inputs, output_layer)

/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py in load_trained_model_from_checkpoint(config_file, checkpoint_file, training, seq_len)
     62             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/key/kernel' % i),
     63             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/key/bias' % i),
---> 64             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/value/kernel' % i),
     65             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/value/bias' % i),
     66             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/output/dense/kernel' % i),

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/checkpoint_utils.py in load_variable(ckpt_dir_or_file, name)
     80     name = name[:-2]
     81   reader = load_checkpoint(ckpt_dir_or_file)
---> 82   return reader.get_tensor(name)
     83 
     84 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py in get_tensor(self, tensor_str)
    368         from tensorflow.python.util import compat
    369         return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str),
--> 370                                           status)
    371 
    372     __swig_destroy__ = _pywrap_tensorflow_internal.delete_CheckpointReader

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    526             None, None,
    527             compat.as_text(c_api.TF_Message(self.status.status)),
--> 528             c_api.TF_GetCode(self.status.status))
    529     # Delete the underlying status object from memory otherwise it stays alive
    530     # as there is a reference to status from this from the traceback due to

DataLossError: Checksum does not match: stored 3531060969 vs. calculated on the restored bytes 1701788620

Is there a known fix for this?
Also, pip install kashgari pulls in the CPU build of tensorflow 1.13.1, while my Docker container already has tensorflow-gpu 1.12.0 installed. Could the installer check whether a suitable tensorflow version is already present instead of always installing the latest one?
Thanks

[Question] Installing kashgari offline

Due to special circumstances I have no network access. When installing kashgari offline it reports that tensorflow is not installed, even though I already have tensorflow_gpu==1.12.0 installed and TensorFlow jobs run fine. After I additionally installed the CPU version of tensorflow, kashgari worked normally. What could be causing this?

[Question] OOM after setting trainable=True on the word embedding

After changing trainable=False to trainable=True in your word embedding and looping over batch sizes and epochs in a for loop, an OOM error is raised partway through the loop. What could be the reason?

Under the same conditions, trainable=False and a custom embedding do not trigger the OOM.
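A common mitigation when models are built repeatedly inside a loop is to clear the Keras session between iterations so earlier graphs are released; whether that resolves this particular OOM with a trainable embedding is only an assumption. A self-contained sketch with a toy model:

# Toy loop: clear the session before each new model so old graphs are released.
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

x = np.random.randint(0, 1000, size=(256, 50))
y = np.random.randint(0, 2, size=(256,))

for batch_size in (32, 64, 128):
    K.clear_session()                                  # free the previous graph
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(1000, 64, trainable=True),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)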

Error downloading the ChinaPeoplesDailyNerCorpus corpus

Has the dataset been removed? The data cannot be downloaded.

RuntimeError: Error while fetching file http://storage.eliyar.biz/corpus/china-people-daily-ner-corpus.tar.gz. Dataset fetching aborted.
Error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>
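A generic Python workaround for CERTIFICATE_VERIFY_FAILED during urlopen-based downloads is to relax certificate checking for the process; this is a blunt instrument and says nothing about whether the corpus URL itself is still online.

# Disable certificate verification for this process before triggering the download.
# Generic Python workaround, not a Kashgari feature; use with care.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

from kashgari.corpus import ChinaPeoplesDailyNerCorpus
train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')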

[BUG] model.to_json() raises an error, so the model structure cannot be saved to a JSON file

Test code:

from kashgari.embeddings import BERTEmbedding
embedding = BERTEmbedding('./model', 100)
embedding.model.to_json()

TypeErrorTraceback (most recent call last)
<ipython-input-6-a9e7b24a398e> in <module>
----> 1 embedding.model.to_json()

/usr/local/lib/python3.6/dist-packages/keras/engine/network.py in to_json(self, **kwargs)
   1211 
   1212         model_config = self._updated_config()
-> 1213         return json.dumps(model_config, default=get_json_type, **kwargs)
   1214 
   1215     def to_yaml(self, **kwargs):

/usr/lib/python3.6/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, default=default, sort_keys=sort_keys,
--> 238         **kw).encode(obj)
    239 
    240 

/usr/lib/python3.6/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

/usr/lib/python3.6/json/encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/usr/local/lib/python3.6/dist-packages/keras/engine/network.py in get_json_type(obj)
   1208                 return obj.__name__
   1209 
-> 1210             raise TypeError('Not JSON Serializable:', obj)
   1211 
   1212         model_config = self._updated_config()

TypeError: ('Not JSON Serializable:', <function linear at 0x7efc42e98730>)

[BUG] IndexError: list index out of range on line 197 in tasks.classification.base_model.py

Check List

Thanks for considering to open an issue. Before you submit your issue, please confirm these boxes are checked.

Environment

  • OS [e.g. Mac OS, Linux]: Ubuntu 16.04
  • requirements.txt:
absl-py==0.7.0
art==3.0
astor==0.7.1
atomicwrites==1.3.0
attrs==18.2.0
boto==2.49.0
boto3==1.9.91
botocore==1.12.91
bz2file==0.98
certifi==2018.1.18
chardet==3.0.4
codecov==2.0.15
colorlog==4.0.2
coverage==4.5.2
docutils==0.14
download==0.3.3
gast==0.2.2
gensim==3.7.1
grpcio==1.18.0
h5py==2.9.0
idna==2.8
jieba==0.39
jmespath==0.9.3
kashgari==0.1.6
Keras==2.2.4
Keras-Applications==1.0.7
keras-bert==0.29.0
keras-embed-sim==0.2.0
keras-layer-normalization==0.9.0
keras-multi-head==0.15.0
keras-pos-embd==0.7.0
keras-position-wise-feed-forward==0.3.0
Keras-Preprocessing==1.0.9
keras-self-attention==0.33.0
keras-transformer==0.17.0
Markdown==3.0.1
more-itertools==6.0.0
numpy==1.16.1
pandas==0.24.1
pluggy==0.8.1
protobuf==3.6.1
py==1.7.0
pycm==1.8
pytest==4.2.1
pytest-cov==2.6.1
python-dateutil==2.8.0
pytz==2018.9
PyYAML==3.13
regex==2019.2.18
requests==2.21.0
s3transfer==0.2.0
scikit-learn==0.20.2
scipy==1.2.1
seqeval==0.0.5
six==1.12.0
sklearn==0.0
smart-open==1.8.0
tensorboard==1.12.2
tensorflow==1.12.0
termcolor==1.1.0
tqdm==4.31.1
urllib3==1.24.1
Werkzeug==0.14.1
xlrd==1.2.0

Issue Description

What

  File "$InstalledPath/kashgari/tasks/classification/base_model.py", line 244, in predict
    results.append(self._format_output_dict(words_list[index], res[index]))
  File "$InstalledPath/kashgari/tasks/classification/base_model.py", line 197, in _format_output_dict
    'class': candidates[0],
IndexError: list index out of range

After the model is trained you may want to make some predictions, so in the model.predict method, if you set the parameter output_dict=True you may hit this error in the private method _format_output_dict of class tasks.classification.base_model.

Reproduce

As described above, the error occurs when you try to output the probability dict of the predicted value.

Other Comment

I haven't figured out why this happens yet, but it seems the variable candidates ends up wrong.

[Proposal] Migrate keras to tf.keras

I am proposing to change keras to tf.keras for better performance, better serving, and TPU support. Maybe we should rewrite the whole project, clean up the code, add missing documents, and so on.

Here are the features I am planning to add.

  1. Multi-GPU/TPU support
  2. Export model for Tensorflow Serving (see the sketch below)
  3. Fine-tune Ability for W2V and BERT
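As a plain tf.keras illustration of feature 2 (the eventual Kashgari export API may differ), saving to a versioned directory produces the SavedModel layout TensorFlow Serving expects:

# SavedModel export sketch in plain tf.keras (TF 2.x); not the final Kashgari API.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.save('exported_model/1')  # versioned SavedModel directory for TensorFlow Serving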

[Question] Why do labels predicted with model.predict contain [BOS] and [EOS]?

I trained with my own data and the result is decent (0.85), but why do the predicted labels still contain [BOS] and [EOS]? In predict, the result should have had bos and eos removed when converting idx to labels, so how do they still appear in the final prediction labels?

康 B-brand
元 E-brand
的 O
饼 B-category
干 [EOS]

[Question] Error when using Kashgari together with Flask

Error: ValueError: Fetch argument <tf.Variable 'Embedding-Token/embeddings:0' shape=(21128, 768) dtype=float32_ref> cannot be interpreted as a Tensor. (Tensor Tensor("Embedding-Token/embeddings:0", shape=(21128, 768), dtype=float32_ref) is not an element of this graph.)
Where the error occurs:
bert_embedding = BERTEmbedding('bert-base-chinese', sequence_length=512)
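A widely used workaround for this "not an element of this graph" error with TF1-era Keras models served from Flask is to capture the graph and session at load time and re-enter them inside the request handler. The sketch below uses placeholder paths and is not an official Kashgari recipe.

# Capture the graph/session when the model is loaded, re-enter them per request.
# Path is a placeholder; applies to TF1-style graphs/sessions (kashgari 1.x era).
import tensorflow as tf
from keras import backend as K
from kashgari.tasks.seq_labeling import BLSTMCRFModel

model = BLSTMCRFModel.load_model('./model')
graph = tf.get_default_graph()   # capture right after loading
sess = K.get_session()

def predict(tokens):
    # Flask handlers run in worker threads where the default graph is not ours,
    # hence the "not an element of this graph" error; re-enter it explicitly.
    with graph.as_default():
        K.set_session(sess)
        return model.predict(tokens)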
