
brikerman / kashgari


Kashgari is a production-level NLP transfer learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT-2 language embeddings.

Home Page: http://kashgari.readthedocs.io/

License: Apache License 2.0

Python 99.36% Shell 0.64%
nlp sequence-labeling text-classification bert-model ner machine-learning nlp-framework named-entity-recognition gpt-2 transfer-learning

kashgari's Introduction


🎉🎉🎉 We released the 2.0.0 version with TF2 Support. 🎉🎉🎉

If you use this project for your research, please cite:

@misc{Kashgari,
  author = {Eliyar Eziz},
  title = {Kashgari},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/BrikerMan/Kashgari}}
}

Overview

Kashgari is a simple and powerful NLP transfer learning framework that lets you build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.

  • Human-friendly. Kashgari's code is straightforward, well documented, and tested, which makes it very easy to understand and modify.
  • Powerful and simple. Kashgari allows you to apply state-of-the-art natural language processing (NLP) models to your text, such as named entity recognition (NER), part-of-speech tagging (PoS), and text classification.
  • Built-in transfer learning. Kashgari ships with pre-trained BERT and Word2Vec embedding models, which makes it very simple to use transfer learning when training your model.
  • Fully scalable. Kashgari provides a simple, fast, and scalable environment for experimentation: train your models and try new approaches using different embeddings and model structures.
  • Production ready. Kashgari can export models in the SavedModel format for TensorFlow Serving, so you can deploy them directly to the cloud (a minimal usage sketch follows below).
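To make the bullets above concrete, here is a minimal quick-start sketch. It assumes the kashgari 2.x API and the bundled SMP2018ECDTCorpus; class names differ in the 1.x releases discussed in some of the issues below.

# A minimal quick-start sketch, assuming the kashgari 2.x API and the
# bundled SMP2018ECDTCorpus (class names differ in kashgari 1.x).
from kashgari.corpus import SMP2018ECDTCorpus
from kashgari.tasks.classification import BiLSTM_Model

train_x, train_y = SMP2018ECDTCorpus.load_data('train')
valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

model = BiLSTM_Model()
model.fit(train_x, train_y, valid_x, valid_y, epochs=5)
model.save('classification_model')  # a model folder you can reload later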

Our Goal

  • Academic users: easier experimentation to test hypotheses without coding from scratch.
  • NLP beginners: learn how to build an NLP project with production-level code quality.
  • NLP developers: build a production-level classification/labeling model within minutes.

Performance

Performance reports are welcome; feel free to add yours.

Task                      Language  Dataset                     Score
Named Entity Recognition  Chinese   People's Daily NER Corpus   95.57
Text Classification       Chinese   SMP2018ECDTCorpus           94.57

Installation

The project is based on Python 3.6+, because it is 2019 and type hinting is cool.

Backend            kashgari version                        desc
TensorFlow 2.2+    pip install 'kashgari>=2.0.2'           TF2.2+ with tf.keras
TensorFlow 1.14+   pip install 'kashgari>=1.0.0,<2.0.0'    TF1.14+ with tf.keras
Keras              pip install 'kashgari<1.0.0'            keras version

You also need to install a matching tensorflow_addons release alongside TensorFlow:

TensorFlow version         tensorflow_addons version
TensorFlow 2.1             pip install tensorflow_addons==0.9.1
TensorFlow 2.2             pip install tensorflow_addons==0.11.2
TensorFlow 2.3, 2.4, 2.5   pip install tensorflow_addons==0.13.0
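A quick runtime sanity check that the installed versions match one of the rows above (the __version__ attributes are standard for these packages; treat the exact pairing as per the tables):

# Print the installed versions to compare against the compatibility tables above.
import tensorflow as tf
import tensorflow_addons as tfa
import kashgari

print('tensorflow        :', tf.__version__)
print('tensorflow_addons :', tfa.__version__)
print('kashgari          :', kashgari.__version__)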

Tutorials

Here is a set of quick tutorials to get you started with the library:

There are also articles and posts that illustrate how to use Kashgari:

Examples:

Contributors ✨

Thanks go to these wonderful people. There are many ways to get involved: start with the contributor guidelines and then check the open issues for specific tasks.

kashgari's People

Contributors

adline125, alexwwang, allcontributors[bot], bradfora, bratao, brikerman, cyberzhg, dependabot[bot], echan00, eryueniaobp, fossabot, haoyuhu, lemoz, lsgrep, mangopomelo, monkeywithacupcake, nirantk, sharpkoi, sunyancn, zxy199803


kashgari's Issues

[Question] TPU support for BERT

Hi there,
I am running a classification problem using BERT embeddings on Colab. How can I make Kashgari support TPU?
Thanks!
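Not a Kashgari-specific answer, just the generic TF 2.x TPU bootstrap used on Colab (TF 2.3+ names), sketched here as a starting point; whether a given Kashgari model trains correctly inside the strategy scope is exactly what this issue asks about.

# Generic Colab TPU setup for TF 2.3+ (an assumption, not a documented Kashgari feature).
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    pass  # build and compile the model here so its variables are placed on the TPU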

[BUG] Error when evaluating a multi-label model

Loading a multi-label classification model to evaluate the test set

I trained a multi-label classification project; training works fine, but running

best_model.evaluate(test_x, test_y)

throws the following error:

ValueErrorTraceback (most recent call last)
<ipython-input-2-63b84ac3fbe4> in <module>
----> 1 train(seq_len=300, epochs=100)

<ipython-input-1-7c2988bf06e3> in train(base_dir, seq_len, epochs, batch_size)
     23     model = FastTextModel(embedding, multi_label=True)
     24     data = [train_x, train_y, valid_x, valid_y, valid_x, valid_y]
---> 25     helper.train(model, data, base_dir, epochs, batch_size, 'val_categorical_accuracy')
     26 
     27 

/notebooks/info_extract/helper.py in train(model, data, base_dir, epochs, batch_size, monitor)
    111               fit_kwargs={'callbacks': callback_lists})
    112     best_model = model.load_model(os.path.join(base_dir, 'best_model'))
--> 113     best_model.evaluate(test_x, test_y)
    114     export_saved_model(best_model.model, base_dir)

/usr/local/lib/python3.6/dist-packages/kashgari/tasks/classification/base_model.py in evaluate(self, x_data, y_data, batch_size, digits, debug_info)
    327     def evaluate(self, x_data, y_data, batch_size=None, digits=4, debug_info=False) -> Tuple[float, float, Dict]:
    328         y_pred = self.predict(x_data, batch_size=batch_size)
--> 329         report = metrics.classification_report(y_data, y_pred, output_dict=True, digits=digits)
    330         print(metrics.classification_report(y_data, y_pred, digits=digits))
    331         if debug_info:

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict)
   1522     """
   1523 
-> 1524     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1525 
   1526     labels_given = True

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py in _check_targets(y_true, y_pred)
     70     """
     71     check_consistent_length(y_true, y_pred)
---> 72     type_true = type_of_target(y_true)
     73     type_pred = type_of_target(y_pred)
     74 

/usr/local/lib/python3.6/dist-packages/sklearn/utils/multiclass.py in type_of_target(y)
    260         if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
    261                 and not isinstance(y[0], string_types)):
--> 262             raise ValueError('You appear to be using a legacy multi-label data'
    263                              ' representation. Sequence of sequences are no'
    264                              ' longer supported; use a binary array or sparse'

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
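For context on the error message, this is the multi-label representation scikit-learn expects: a binary indicator matrix rather than lists of label strings. The sketch below is a generic illustration with made-up labels, not a description of what Kashgari does internally.

# Convert list-of-label-lists into the binary indicator matrix sklearn expects.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn import metrics

y_true = [['a'], ['b', 'c'], ['a', 'c']]
y_pred = [['a'], ['b'], ['a', 'c']]

mlb = MultiLabelBinarizer()
y_true_bin = mlb.fit_transform(y_true)  # shape: (n_samples, n_classes)
y_pred_bin = mlb.transform(y_pred)

print(metrics.classification_report(y_true_bin, y_pred_bin, target_names=mlb.classes_))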

[Question] How to use my own dataset for NER

My dataset has the same format as the People's Daily data in ChinaPeoplesDailyNerCorpus, but the entity categories are different (15 types). Is it enough to process my data with get_sequence_tagging_data and feed it in directly, or do I need to add a tag dictionary myself?
With my dataset, the original BERT model only reaches an acc of 0.67, while your BLSTMCRF model reached a crf_acc of 0.91 yesterday. How can I tell whether I made a mistake somewhere, or whether the BLSTMCRF model really is that much better?
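As a rough illustration of the expected format (one "token tag" pair per line, blank lines between sentences), a reader like the sketch below produces the (x, y) lists that the sequence-labeling models take. The file path is a placeholder for your own data; whether a separate tag dictionary is also needed is left to the maintainers to answer.

# Sketch: read a People's-Daily-style tagging file into token/tag sequences.
# 'my_ner/train.txt' is a placeholder path.
def read_tagging_file(path):
    x, y = [], []
    tokens, tags = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line = sentence boundary
                if tokens:
                    x.append(tokens)
                    y.append(tags)
                    tokens, tags = [], []
                continue
            token, tag = line.split()[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:                                # last sentence without a trailing blank line
        x.append(tokens)
        y.append(tags)
    return x, y

train_x, train_y = read_tagging_file('my_ner/train.txt')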

[BUG] Listcomp on Line 133 for multi_label classification in base_model.py may have a bug

Check List

Thanks for considering to open an issue. Before you submit your issue, please confirm these boxes are checked.

Environment

  • OS [e.g. Mac OS, Linux]:
  • requirements.txt:
Not OS-related.

Issue Description

What

unhashable type: 'list' when dealing with multi-label y data.

Reproduce

For multi_label classification tasks, if the label data (y) has a structure like [ list[str] ], e.g. [ ['a'], ['b', 'c'], ['a', 'c'] ], this error occurs (a minimal repro follows below).
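A framework-free repro of the same failure mode, on the assumption that the label values end up being hashed somewhere (e.g. used as dict keys or set members):

# Hashing list-valued labels raises the reported error.
y = [['a'], ['b', 'c'], ['a', 'c']]
try:
    set(y)                # hashing a list element fails
except TypeError as err:
    print(err)            # unhashable type: 'list'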

Other Comment

I will revise the related method, namely ClassificationModel.convert_idx_to_label.

Could you please check whether the test suite covers this case? I have my doubts.

[BUG] Incorrect val_acc results with BLSTMCRFModel

val_acc is computed incorrectly.

The test code uses the example from the article 《NLP - 基于 BERT 的中文命名实体识别(NER)》 (NLP - Chinese named entity recognition with BERT).

from kashgari.corpus import *
train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')
validate_x, validate_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('validate')
test_x, test_y  = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('test')

print(f"train data count: {len(train_x)}")
print(f"validate data count: {len(validate_x)}")
print(f"test data count: {len(test_x)}")

from kashgari.embeddings import BERTEmbedding
embedding = BERTEmbedding('./bert', 128)
from kashgari.tasks.seq_labeling import BLSTMCRFModel
model = BLSTMCRFModel(embedding)
model.fit(train_x,
          train_y,
          y_validate=validate_y,
          x_validate=validate_x,
          epochs=10,
          batch_size=500)

Partial results:

Epoch 1/10
41/41 [==============================] - 105s 3s/step - loss: 0.2520 - crf_accuracy: 0.9303 - 
acc: 0.6245 - val_loss: 0.0724 - val_crf_accuracy: 0.9789 - val_acc: 0.9789
Epoch 2/10
41/41 [==============================] - 100s 2s/step - loss: 0.0548 - crf_accuracy: 0.9838 - 
acc: 0.6246 - val_loss: 0.0357 - val_crf_accuracy: 0.9898 - val_acc: 0.9898

val_crf_accuracy and val_acc give identical results.

After uninstalling TensorFlow 1.11.0 and installing only tensorflow-gpu (1.11.0), training fails with an error

Using TensorFlow backend.
Traceback (most recent call last):
File "train.py", line 235, in <module>
from kashgari.embeddings import BERTEmbedding
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/kashgari/__init__.py", line 13, in <module>
import kashgari.embeddings
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/kashgari/embeddings/__init__.py", line 13, in <module>
from .embeddings import BERTEmbedding
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/kashgari/embeddings/embeddings.py", line 18, in <module>
import keras_bert
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras_bert/__init__.py", line 1, in <module>
from .bert import gelu, get_model, get_custom_objects, get_base_dict, gen_batch_inputs
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras_bert/bert.py", line 2, in <module>
import keras
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/__init__.py", line 3, in <module>
from . import utils
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/utils/__init__.py", line 6, in <module>
from . import conv_utils
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9, in <module>
from .. import backend as K
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/backend/__init__.py", line 89, in <module>
from .tensorflow_backend import *
File "/data1/chh/software/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 6, in <module>
from tensorflow.python.framework import ops as tf_ops
ModuleNotFoundError: No module named 'tensorflow.python'

[BUG] Different behavior in 0.1.8 and 0.2.1

Environment

  • Colab.research.google.com
  • Kashgari 0.1.8 / 0.2.1

Issue Description

Different behavior in 0.1.8 and 0.2.1.
In Kashgari 0.1.8, BLSTMModel converges during training and I see val_acc: 0.98 and train acc: 0.9594.
In Kashgari 0.2.1, BLSTMModel overfits and I see val_acc ~0.5 and train acc ~0.96.
There is no difference in my code, only the version of the library.

Reproduce

code:

from sklearn.model_selection import train_test_split
import pandas as pd
import nltk
from kashgari.tasks.classification import BLSTMModel

# get and process data
!wget https://www.dropbox.com/s/265kphxkijj1134/fontanka.zip

df1 = pd.read_csv('fontanka.zip')
df1.fillna(' ', inplace = True)
nltk.download('punkt')

# split on train/test
X_train, X_test, y_train, y_test = train_test_split(df1.full_text[:3570].values, df1.textrubric[:3570].values, test_size=0.2, random_state=42)
X_train = [nltk.word_tokenize(sentence) for sentence in X_train]
X_test  = [nltk.word_tokenize(sentence) for sentence in X_test]
y_train = y_train.tolist()
y_test  = y_test.tolist()

# train model
model = BLSTMModel()
model.fit(X_train, y_train, x_validate=X_test, y_validate=y_test, epochs = 10)

code in colab:
https://colab.research.google.com/drive/1yTBMeiBl2y7-Yw0DS_vTn2A4y_Vj3N-8

Result

Last epoch:

Kashgari 0.1.8

Epoch 10/10
55/55 [==============================] - 90s 2s/step - loss: 0.1378 - acc: 0.9615 - val_loss: 0.0921 - val_acc: 0.9769

Kashgari 0.2.1

Epoch 10/10
44/44 [==============================] - 76s 2s/step - loss: 0.0990 - acc: 0.9751 - val_loss: 2.3739 - val_acc: 0.5323

Other Comment

In 0.2.1 all models are now in separate files and the lr hyperparameter is given explicitly (1e-3).
In 0.1.8 the lr hyperparameter was omitted; I suppose it used the Keras default, which is the same (1e-3).

Also, in 0.1.8 you had (dense size = number of classes + 1 on the classifier), see
#21, and it was omitted in 0.2.1. I don't see how this could affect the training process.

I couldn't find any more differences between the versions. Could you help with this - why did the models start to overfit in the new version of the library?

[Question] How to save and load the best model?

I want to load the best model file rather than the one from the last epoch, but ModelCheckpoint does not take effect here.

I added a function inside class ClassificationModel(BaseModel):

def load_weights(self, model_path):
        return self.model.load_weights(model_path)

Then I call it like this:

early_stopping = EarlyStopping(monitor='val_loss',min_delta=0.01, patience=5, mode='min', verbose=1)
reduce_lr = ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=5, min_lr=0.0001, verbose=2)
bst_model_path = 'weight_%d.h5' % count
checkpoint = ModelCheckpoint(bst_model_path, monitor='val_loss', mode='min',
                                       save_best_only=True, verbose=1, save_weights_only=True)
callbacks = [checkpoint,reduce_lr,early_stopping]
hist = model.fit(x_train,y_train,
                     validation_data=(x_val, y_val),
                     epochs=4, batch_size=512,
#                      class_weight="auto",
#                      callbacks=callbacks,
                     fit_kwargs={"callbacks":callbacks,"verbose":1}
                     
                     )
model.load_weights(bst_model_path)

But it reports that the weight_0.h5 file does not exist, which means ModelCheckpoint was never invoked.

[Question] Import error when running the text classification example

When trying the example, both of these lines raise an error:
#from kashgari.embeddings import BERTEmbedding

from kashgari.tasks.classification import CNNModel

#embedding = BERTEmbedding('/home/xwjia/bert-base-chinese', sequence_length=400)

Traceback (most recent call last):

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 3296, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)

File "", line 2, in
from kashgari.tasks.classification import CNNModel

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/kashgari/init.py", line 13, in
import kashgari.embeddings

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/kashgari/embeddings/init.py", line 13, in
from .embeddings import BERTEmbedding

File "/home/xwjia/.virtualenvs/bert_venv/lib/python3.5/site-packages/kashgari/embeddings/embeddings.py", line 65
self._token2idx: Dict[str, int] = None
^
SyntaxError: invalid syntax
It points to this line and I really don't know how to fix it. Hoping for a reply - thank you.

[BUG] Wrong predictions from the NER example model

I just trained an NER model following the NER tutorial article, but its predictions are wrong and I'm not sure where things went wrong...

import jieba
from kashgari.tasks.seq_labeling import BLSTMCRFModel
from kashgari.corpus import ChinaPeoplesDailyNerCorpus
from kashgari.embeddings import BERTEmbedding
embedding = BERTEmbedding('/home/eee/sentence-alignment-classification-model/model/multi_cased_L-12_H-768_A-12', 100)

train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')
validate_x, validate_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('validate')
test_x, test_y  = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('test')

model = BLSTMCRFModel(embedding)
model.fit(train_x,
          train_y,
          validate_y=validate_y,
          validate_x=validate_x,
          epochs=200,
          batch_size=500)
model.save('./model')

new_model = BLSTMCRFModel.load_model('./model')

# EXAMPLE 1
news = "「DeepMind 击败人类职业玩家的方式与他们声称的 AI 使命,以及所声称的『正确』方式完全相反。」"
x = list(jieba.cut(news))
>>> x
['「', 'DeepMind', ' ', '击败', '人类', '职业', '玩家', '的', '方式', '与', '他们', '声称', '的', ' ', 'AI', ' ', '使命', ',', '以及', '所', '声称', '的', '『', '正确', '』', '方式', '完全', '相反', '。', '」']
>>> new_model.predict(x)                                                                                                     
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

# EXAMPLE 2
news = "陈志衍是有个非常好的男孩子,他住在香港的九龙塘区,他今年二十三号生日。"
x = list(jieba.cut(news))
>>> x
['陈志衍', '是', '有', '个', '非常', '好', '的', '男孩子', ',', '他', '住', '在', '香港', '的', '吉林', '区', ',', '他', '今年', '二十三', '号', '生日', '。']
>>> new_model.predict(x)
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
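One thing worth checking, offered as an assumption rather than a confirmed diagnosis: ChinaPeoplesDailyNerCorpus is tagged per character, so a model trained on it sees character tokens, while the examples above feed jieba word tokens. A character-level call would look like:

# Character-level input to match the per-character tags of the training corpus
# (an assumption about the cause, not a confirmed fix).
news = "陈志衍是有个非常好的男孩子,他住在香港的九龙塘区,他今年二十三号生日。"
x = [ch for ch in news if not ch.isspace()]
print(new_model.predict(x))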

[Question] How to serve the model with TensorFlow Serving

The problem has been solved - thanks to BrikerMan for the kind help. I hope this solution saves others some exploration time and lets them better enjoy the convenience this project provides!

I tried to export the trained model in the saved_model format with the following code:

import tensorflow as tf
from kashgari.tasks.seq_labeling import BLSTMCRFModel
from keras import backend as K

# K.set_learning_phase(1)
# the key change
K.set_learning_phase(0)

model = BLSTMCRFModel.load_model('./model')
legacy_init_op = tf.group(tf.tables_initializer())

xmodel = model.model

with K.get_session() as sess:
    export_path = './saved_model/14'
    builder = tf.saved_model.builder.SavedModelBuilder(export_path)

    signature_inputs = {
        'token_input': tf.saved_model.utils.build_tensor_info(xmodel.input[0]),
        'seg_input': tf.saved_model.utils.build_tensor_info(xmodel.input[1]),
    }

    signature_outputs = {
        tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES: tf.saved_model.utils.build_tensor_info(
            xmodel.output)
    }

    classification_signature_def = tf.saved_model.signature_def_utils.build_signature_def(
        inputs=signature_inputs,
        outputs=signature_outputs,
        method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME)

    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            'predict_webshell_php': classification_signature_def
        },
        legacy_init_op=legacy_init_op
    )

    builder.save()

After saving successfully, I load the saved_model for prediction and the results are all zeros. What could be the cause?
Calling code:

import json

import tensorflow as tf
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import tag_constants

export_dir = './saved_model/14/'

with open('./model/words.json', 'r', encoding='utf-8') as f:
    dict = json.load(f)

s = ['[CLS]', '国', '正', '学', '长', '的', '文', '章', '与', '诗', '词', ',', '早', '就', '读', '过', '一', '些', ',', '很', '是', '喜',
     '欢', '。', '[CLS]']
s1 = [dict[x] for x in s]
if len(s1) < 100:
    s1 += [0] * (100 - len(s1))
print(s1)
s2 = [0] * 100

with tf.Session() as sess:
    meta_graph_def = tf.saved_model.loader.load(sess, [tag_constants.SERVING], export_dir)
    signature = meta_graph_def.signature_def

    x1_tensor_name = signature['predict_webshell_php'].inputs['token_input'].name
    x2_tensor_name = signature['predict_webshell_php'].inputs['seg_input'].name

    y_tensor_name = signature['predict_webshell_php'].outputs[
        signature_constants.CLASSIFY_OUTPUT_CLASSES].name
    x1 = sess.graph.get_tensor_by_name(x1_tensor_name)
    x2 = sess.graph.get_tensor_by_name(x2_tensor_name)
    y = sess.graph.get_tensor_by_name(y_tensor_name)
    result = sess.run(y, feed_dict={x1: [s1], x2: [s2]})  # predicted values
    print(result.argmax(-1))
    print(result.shape)


How does BERT produce its vectors?

Hello!
I have two questions and hope you can answer them.

  1. Does BertEmbedding produce sentence vectors or word (token) vectors?
  2. Where is the keras_bert code?
    Thank you!

[BUG] DataLossError: Checksum does not match: stored 3531060969 vs. calculated on the restored bytes 1701788620

After upgrading to tensorflow 1.13.1, running the program again fails with the following error:

DataLossErrorTraceback (most recent call last)
<ipython-input-1-34122276fdd3> in <module>
      9 
     10 # run the training task for the given number of epochs and save the best-performing model seen during training
---> 11 train(base_dir='./ner', epochs=2)

/notebooks/ner.py in train(base_dir, seq_len, epochs, batch_size, extend_flag)
     93 """
     94 def train(base_dir='./ner', seq_len=100, epochs=20, batch_size=100, extend_flag=True):
---> 95     embedding = BERTEmbedding('./bert', seq_len)
     96     train_x, train_y = get_sequence_tagging_data(os.path.join(base_dir, 'data/train.txt'))
     97     valid_x, valid_y = get_sequence_tagging_data(os.path.join(base_dir, 'data/valid.txt'))

/usr/local/lib/python3.6/dist-packages/kashgari/embeddings/embeddings.py in __init__(self, name_or_path, sequence_length, embedding_size, **kwargs)
     67         self._model: Model = None
     68         self._kwargs = kwargs
---> 69         self.build(**kwargs)
     70 
     71     def update(self, info: Dict[str, Any]):

/usr/local/lib/python3.6/dist-packages/kashgari/embeddings/embeddings.py in build(self)
    299         model = keras_bert.load_trained_model_from_checkpoint(config_path,
    300                                                               check_point_path,
--> 301                                                               seq_len=self.sequence_length)
    302         output_layer = NonMaskingLayer()(model.output)
    303         self._model = Model(model.inputs, output_layer)

/usr/local/lib/python3.6/dist-packages/keras_bert/loader.py in load_trained_model_from_checkpoint(config_file, checkpoint_file, training, seq_len)
     62             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/key/kernel' % i),
     63             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/key/bias' % i),
---> 64             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/value/kernel' % i),
     65             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/self/value/bias' % i),
     66             tf.train.load_variable(checkpoint_file, 'bert/encoder/layer_%d/attention/output/dense/kernel' % i),

/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/checkpoint_utils.py in load_variable(ckpt_dir_or_file, name)
     80     name = name[:-2]
     81   reader = load_checkpoint(ckpt_dir_or_file)
---> 82   return reader.get_tensor(name)
     83 
     84 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py in get_tensor(self, tensor_str)
    368         from tensorflow.python.util import compat
    369         return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str),
--> 370                                           status)
    371 
    372     __swig_destroy__ = _pywrap_tensorflow_internal.delete_CheckpointReader

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    526             None, None,
    527             compat.as_text(c_api.TF_Message(self.status.status)),
--> 528             c_api.TF_GetCode(self.status.status))
    529     # Delete the underlying status object from memory otherwise it stays alive
    530     # as there is a reference to status from this from the traceback due to

DataLossError: Checksum does not match: stored 3531060969 vs. calculated on the restored bytes 1701788620

Is there a known fix for this?
Also, pip install kashgari pulls in the CPU build of tensorflow 1.13.1, while my Docker container already has tensorflow-gpu 1.12.0 installed. Could the installer check whether a suitable tensorflow version is already present instead of always installing the latest one?
Thanks

[Question] Installing kashgari offline

Due to special circumstances I have no network access. When installing kashgari offline it reports that tensorflow is not installed, even though I already have tensorflow_gpu==1.12.0 installed and TensorFlow jobs run fine. After I additionally installed the CPU version of tensorflow, kashgari worked normally. What could be causing this?

[Question] OOM after setting trainable=True on the word embedding

After changing trainable=False to trainable=True in your word embedding and looping over batch sizes and epochs in a for loop, an OOM error is raised partway through the loop. What could be the reason?

Under the same conditions, trainable=False and a custom embedding do not trigger the OOM.
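A common mitigation when models are built repeatedly inside a loop is to clear the Keras session between iterations so earlier graphs are released; whether that resolves this particular OOM with a trainable embedding is only an assumption. A self-contained sketch with a toy model:

# Toy loop: clear the session before each new model so old graphs are released.
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

x = np.random.randint(0, 1000, size=(256, 50))
y = np.random.randint(0, 2, size=(256,))

for batch_size in (32, 64, 128):
    K.clear_session()                                  # free the previous graph
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(1000, 64, trainable=True),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)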

Error downloading the ChinaPeoplesDailyNerCorpus corpus

Has the dataset been removed? The data cannot be downloaded.

RuntimeError: Error while fetching file http://storage.eliyar.biz/corpus/china-people-daily-ner-corpus.tar.gz. Dataset fetching aborted.
Error: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>
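A generic Python workaround for CERTIFICATE_VERIFY_FAILED during urlopen-based downloads is to relax certificate checking for the process; this is a blunt instrument and says nothing about whether the corpus URL itself is still online.

# Disable certificate verification for this process before triggering the download.
# Generic Python workaround, not a Kashgari feature; use with care.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

from kashgari.corpus import ChinaPeoplesDailyNerCorpus
train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')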

[BUG] model.to_json() raises an error, so the model structure cannot be saved to a JSON file

Test code:

from kashgari.embeddings import BERTEmbedding
embedding = BERTEmbedding('./model', 100)
embedding.model.to_json()

TypeErrorTraceback (most recent call last)
<ipython-input-6-a9e7b24a398e> in <module>
----> 1 embedding.model.to_json()

/usr/local/lib/python3.6/dist-packages/keras/engine/network.py in to_json(self, **kwargs)
   1211 
   1212         model_config = self._updated_config()
-> 1213         return json.dumps(model_config, default=get_json_type, **kwargs)
   1214 
   1215     def to_yaml(self, **kwargs):

/usr/lib/python3.6/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, default=default, sort_keys=sort_keys,
--> 238         **kw).encode(obj)
    239 
    240 

/usr/lib/python3.6/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

/usr/lib/python3.6/json/encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258 
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/usr/local/lib/python3.6/dist-packages/keras/engine/network.py in get_json_type(obj)
   1208                 return obj.__name__
   1209 
-> 1210             raise TypeError('Not JSON Serializable:', obj)
   1211 
   1212         model_config = self._updated_config()

TypeError: ('Not JSON Serializable:', <function linear at 0x7efc42e98730>)

[BUG] IndexError: list index out of range on line 197 in tasks.classification.base_model.py

Check List

Thanks for considering to open an issue. Before you submit your issue, please confirm these boxes are checked.

Environment

  • OS [e.g. Mac OS, Linux]: Ubuntu 16.04
  • requirements.txt:
absl-py==0.7.0
art==3.0
astor==0.7.1
atomicwrites==1.3.0
attrs==18.2.0
boto==2.49.0
boto3==1.9.91
botocore==1.12.91
bz2file==0.98
certifi==2018.1.18
chardet==3.0.4
codecov==2.0.15
colorlog==4.0.2
coverage==4.5.2
docutils==0.14
download==0.3.3
gast==0.2.2
gensim==3.7.1
grpcio==1.18.0
h5py==2.9.0
idna==2.8
jieba==0.39
jmespath==0.9.3
kashgari==0.1.6
Keras==2.2.4
Keras-Applications==1.0.7
keras-bert==0.29.0
keras-embed-sim==0.2.0
keras-layer-normalization==0.9.0
keras-multi-head==0.15.0
keras-pos-embd==0.7.0
keras-position-wise-feed-forward==0.3.0
Keras-Preprocessing==1.0.9
keras-self-attention==0.33.0
keras-transformer==0.17.0
Markdown==3.0.1
more-itertools==6.0.0
numpy==1.16.1
pandas==0.24.1
pluggy==0.8.1
protobuf==3.6.1
py==1.7.0
pycm==1.8
pytest==4.2.1
pytest-cov==2.6.1
python-dateutil==2.8.0
pytz==2018.9
PyYAML==3.13
regex==2019.2.18
requests==2.21.0
s3transfer==0.2.0
scikit-learn==0.20.2
scipy==1.2.1
seqeval==0.0.5
six==1.12.0
sklearn==0.0
smart-open==1.8.0
tensorboard==1.12.2
tensorflow==1.12.0
termcolor==1.1.0
tqdm==4.31.1
urllib3==1.24.1
Werkzeug==0.14.1
xlrd==1.2.0

Issue Description

What

  File "$InstalledPath/kashgari/tasks/classification/base_model.py", line 244, in predict
    results.append(self._format_output_dict(words_list[index], res[index]))
  File "$InstalledPath/kashgari/tasks/classification/base_model.py", line 197, in _format_output_dict
    'class': candidates[0],
IndexError: list index out of range

After the model is trained you may want to make some predictions, so in the model.predict method, if you set the parameter output_dict=True you may hit this error in the private method _format_output_dict of class tasks.classification.base_model.

Reproduce

As described above, the error occurs when you try to output the probability dict of the predicted value.

Other Comment

I haven't figured out why this happens yet, but it seems the variable candidates ends up wrong.

[Proposal] Migrate keras to tf.keras

I am proposing to change keras to tf.keras for better performance, better serving, and TPU support. Maybe we should rewrite the whole project, clean up the code, add missing documents, and so on.

Here are the features I am planning to add.

  1. Multi-GPU/TPU support
  2. Export model for Tensorflow Serving (see the sketch below)
  3. Fine-tune Ability for W2V and BERT
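As a plain tf.keras illustration of feature 2 (the eventual Kashgari export API may differ), saving to a versioned directory produces the SavedModel layout TensorFlow Serving expects:

# SavedModel export sketch in plain tf.keras (TF 2.x); not the final Kashgari API.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.save('exported_model/1')  # versioned SavedModel directory for TensorFlow Serving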

[Question] Why do labels predicted with model.predict contain [BOS] and [EOS]?

I trained with my own data and the result is decent (0.85), but why do the predicted labels still contain [BOS] and [EOS]? In predict, the result should have had bos and eos removed when converting idx to labels, so how do they still appear in the final prediction labels?

康 B-brand
元 E-brand
的 O
饼 B-category
干 [EOS]

[Question] Error when using Kashgari together with Flask

Error: ValueError: Fetch argument <tf.Variable 'Embedding-Token/embeddings:0' shape=(21128, 768) dtype=float32_ref> cannot be interpreted as a Tensor. (Tensor Tensor("Embedding-Token/embeddings:0", shape=(21128, 768), dtype=float32_ref) is not an element of this graph.)
Where the error occurs:
bert_embedding = BERTEmbedding('bert-base-chinese', sequence_length=512)
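A widely used workaround for this "not an element of this graph" error with TF1-era Keras models served from Flask is to capture the graph and session at load time and re-enter them inside the request handler. The sketch below uses placeholder paths and is not an official Kashgari recipe.

# Capture the graph/session when the model is loaded, re-enter them per request.
# Path is a placeholder; applies to TF1-style graphs/sessions (kashgari 1.x era).
import tensorflow as tf
from keras import backend as K
from kashgari.tasks.seq_labeling import BLSTMCRFModel

model = BLSTMCRFModel.load_model('./model')
graph = tf.get_default_graph()   # capture right after loading
sess = K.get_session()

def predict(tokens):
    # Flask handlers run in worker threads where the default graph is not ours,
    # hence the "not an element of this graph" error; re-enter it explicitly.
    with graph.as_default():
        K.set_session(sess)
        return model.predict(tokens)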
