brightmart / albert_zh


3.9K 103.0 757.0 2.05 MB

A Lite BERT for Self-Supervised Learning of Language Representations; large-scale Chinese pre-trained ALBERT models

Home Page: https://arxiv.org/pdf/1909.11942.pdf

Python 98.85% Shell 1.15%

albert bert roberta xlnet tensorflow pytorch pre-trained-model chinese-corpus pre-trained


albert_zh's Issues

English pre-trained release?

Hello,

Are you planning on releasing English pre-trained versions of Albert in the future?

Thank you,

Are you going to publish pre-trained English Models as well?

Hello,

I have e-mailed you about this issue. I would be happy if you answer my question.
Are you going to publish pre-trained English models as well?

Thank you so much.
D

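Pre-training does not use the GPU

Hello, I am pre-training the tiny model on my own data, but it appears to run only on the CPU. My environment is as follows:
1. device_list shows 1 CPU and 4 GPUs
2. TF version: I tried both installing only the GPU build and installing the GPU and CPU builds side by side (with the GPU build version >= the CPU build version)
3. CUDA_VISIBLE_DEVICES is set to the index of an existing GPU

Result: heavy CPU usage, while the GPU uses only a little over 100 MB: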
/Users/zhangyang/Documents/Albert cpu使用情况.png

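Missing example of prediction on user input

The existing examples show batch processing of text; there is no example of prediction on user input. When I use other open-source BERT tools (TF-based) to load the tiny_bert model and predict on user input, I always get an error saying that some variables have shapes different from those in the checkpoint, and it cannot run.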

Pre-trained models cannot be downloaded

https://storage.googleapis.com/albert_zh/albert_base_zh_additional_36k_steps.zip

None of these links can be downloaded any more.

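Does this use SentencePiece or WordPiece?

Does this project use SentencePiece or WordPiece?

Please include the MLM weights in the release!

Please include the MLM weights in the release! The earlier RoBERTa release did not include them, so some tasks could not be done, or had no good initialization.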

Hi,
When generating SOP training instances, if there is only one sentence in the document, or in another extreme case where every sentence in the document is very long (i.e., longer than target_token_num),
the generated output is
[CLS] tokens_a [SEP] tokens_a [SEP]. This is due to the continue statement, which does not increase the index i in create_instances_from_document_albert().

If there is only one sentence in the current chunk, how about randomly picking a split position in tokens_a so that SOP can still be done? All tokens after that position would go to tokens_b, and the tokens before it would stay in tokens_a.

if len(tokens_b) == 0 and len(current_chunk) == 1:
    if len(tokens_a) > 1:
        # There is only one sentence in the chunk. It may be very long,
        # or it may be the last sentence in the document.
        # To build an SOP instance, split the sentence into 2 parts.
        index = rng.randint(1, len(tokens_a) - 1)
        # index = int(len(tokens_a) / 2)
        tokens_b = tokens_a[index:]
        tokens_a = tokens_a[0:index]
    else:
        print("only 1 token in tokens_a, can't split it into 2 parts. just skip this sentence.")
        break
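
The proposal above can be tried in isolation. The following is a minimal sketch, not the repository's code: the function names `split_for_sop` and `make_sop_pair` are hypothetical, and swapping the two segments with probability 0.5 is an assumption about how SOP negatives are drawn.

```python
import random


def split_for_sop(tokens, rng):
    """Split one token list into two non-empty segments at a random position.

    Returns None when the list has fewer than 2 tokens and cannot be split.
    """
    if len(tokens) < 2:
        return None
    # randint is inclusive on both ends, so both segments keep >= 1 token.
    index = rng.randint(1, len(tokens) - 1)
    return tokens[:index], tokens[index:]


def make_sop_pair(tokens, rng):
    """Return (tokens_a, tokens_b, is_swapped), or None if no split is possible.

    Assumption: a swapped pair serves as the negative SOP example.
    """
    parts = split_for_sop(tokens, rng)
    if parts is None:
        return None
    tokens_a, tokens_b = parts
    is_swapped = rng.random() < 0.5
    if is_swapped:
        tokens_a, tokens_b = tokens_b, tokens_a
    return tokens_a, tokens_b, is_swapped
```

Because the two segments always come from different positions of the same sentence, tokens_a and tokens_b can no longer be identical copies of the whole sentence.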

Where is the GitHub repository for the English ALBERT?

I could not find the official GitHub repository T_T

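When will the model code and pre-trained parameters be released?

Will the model code and pre-trained parameters be released according to this schedule?
1. albert_base, 12M parameters, 12 layers, October 5

2. albert_large, 18M parameters, 24 layers, October 13

3. albert_xlarge, 59M parameters, 24 layers, October 6

4. albert_xxlarge, 233M parameters, 12 layers, October 7 (the best-performing model)

Although the parameter count is reduced, inference time is not reduced by much; is it only training speed that improves?

By the way, will the pre-training code be released?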

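What are the hardware requirements for training the model?

How much GPU memory does training need, and roughly how long does training take?

Does ALBERT reduce GPU memory usage? Does it train faster?

The most obvious thing I notice is that the model file is much smaller than BERT's, but GPU memory usage seems about the same as with BERT. Some articles say that ALBERT (1) solves the memory limitation problem and (2) speeds up training. Why do I not see this? Can anyone explain?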

How to measure the embedding/representation difference between layers?

Can anyone explain how to draw such a figure? L2 distance and cosine similarity are usually computed on vectors rather than on an embedding matrix. Also, how can this be measured over the whole dataset?
Thanks for any suggestions or references~
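
One possible reading of the question, offered as an assumption and not as the paper's confirmed procedure, is to compare each token's hidden vector at one layer with the same token's vector at the next layer, then average over all tokens. The sketch below uses plain Python lists; `layer_transition_stats` is a hypothetical name.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_transition_stats(layers):
    """layers[k][t] is the hidden vector of token t at layer k.

    Returns one (mean_l2, mean_cosine) pair per pair of consecutive layers.
    """
    stats = []
    for lower, upper in zip(layers, layers[1:]):
        n = len(lower)
        mean_l2 = sum(l2_distance(u, v) for u, v in zip(lower, upper)) / n
        mean_cos = sum(cosine_similarity(u, v) for u, v in zip(lower, upper)) / n
        stats.append((mean_l2, mean_cos))
    return stats
```

To cover a whole dataset under this reading, the token vectors from every example would be pooled into each `layers[k]` before averaging, and the resulting per-layer values plotted against the layer index.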

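When do you expect to release the Chinese xxlarge model? Thank you!

ALBERT prediction is much slower than BERT

Hello, I fine-tuned a classification model with ALBERT using the main parameters below. I used the two xlarge models released on GitHub, albert_xlarge_zh_177k and albert_xlarge_zh_183k.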
--do_train=true
--do_eval=true
--do_predict=false
--do_export=false
--vocab_file=$BERT_BASE_DIR/vocab.txt
--bert_config_file=$BERT_BASE_DIR/albert_config_xlarge.json
--init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt
--max_seq_length=32
--train_batch_size=64
--save_checkpoints_steps=500
--max_steps_without_decrease=15000
--learning_rate=1e-6
--num_train_epochs=10 \

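Python: 3.6.4
TensorFlow: 1.14
Training: V100 GPU
Prediction: Dell laptop
I use tornado as the web framework. Everything was tested on my own laptop, and the test code is exactly the same.
BERT prediction takes 220 ms;
ALBERT prediction takes 2200 ms.
This gap is rather large. I thought a smaller model with fewer parameters would predict faster; is that not the case? Any advice would be appreciated.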
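
When comparing two models like this, it can help to time the prediction call alone, outside the web framework, with a few warm-up calls first. The helper below is a minimal sketch; `mean_latency_ms` and its parameters are hypothetical names, and `predict_fn` stands for whatever callable wraps the model. Note that cross-layer parameter sharing shrinks the stored checkpoint, but every layer is still executed at inference time, and the xlarge checkpoint reported elsewhere on this page has a hidden size of 2048.

```python
import time


def mean_latency_ms(predict_fn, inputs, warmup=2, repeats=5):
    """Average wall-clock time per call of predict_fn, in milliseconds.

    Warm-up passes are run first and excluded from the measurement.
    """
    for _ in range(warmup):
        for x in inputs:
            predict_fn(x)
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            predict_fn(x)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(inputs))
```

Running the same helper on both models with identical inputs and sequence length gives numbers that are easier to compare than end-to-end request times.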

create_pretraining_data.py line 322

Hello, brightmart:
Thank you for the implementation of ALBERT. Looking at line 322 of create_pretraining_data.py:
if len(tokens_a)==0 or len(tokens_b)==0: continue
I have a question:
if current_chunk contains only the last sentence of the document, the instance will contain the same sentence twice:
tokens_a is the last sentence and tokens_b is also the last sentence.

Where is the SOP module? Why does run_pretraining.py still use NSP?

final layer normalization

The Pre-LN Transformer puts the layer normalization inside the residual connection and equips with an additional final-layer normalization before prediction

I could not find this final layer normalization in prelln_transformer_model.
Did I miss it, or is this layer not implemented?

LCQMC.zip + albert_xlarge_zh_183k.zip ValueError: Shape of variable bert/embeddings/LayerNorm/beta:0 ((312,)) doesn't match with shape of tensor bert/embeddings/LayerNorm/beta ([2048]) from checkpoint reader.

I followed the instructions exactly. Why is there still a shape mismatch?
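
The message says the graph built from the config expects a size of 312 where the checkpoint stores 2048. One plausible cause, stated here as an assumption, is that the config file and the checkpoint belong to different model sizes. A small helper like the one below (hypothetical name `find_shape_mismatches`) lists every variable whose shape differs, given two name-to-shape mappings.

```python
def find_shape_mismatches(graph_shapes, checkpoint_shapes):
    """Return {name: (graph_shape, checkpoint_shape)} for differing shapes.

    Only variables present in both mappings are compared. Shapes may be
    lists or tuples; they are normalized to tuples before comparison.
    """
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        if name not in checkpoint_shapes:
            continue
        g = tuple(graph_shape)
        c = tuple(checkpoint_shapes[name])
        if g != c:
            mismatches[name] = (g, c)
    return mismatches
```

In TensorFlow 1.x, the checkpoint side of the mapping can be built from `tf.train.list_variables(checkpoint_prefix)`, which returns (name, shape) pairs.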

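When an input token is a character inside a word, is the token '字' or '##字'?

I noticed that in the Chinese vocab.txt provided with the official BERT, every Chinese character has two tokens, one with the '##' prefix and one without. My understanding is that the unprefixed token represents the first character of a word and the prefixed token a non-initial character. Since the two map to different ids, I would like to ask: for non-initial characters within a word, does the pre-training input use the prefixed token (thereby giving the model word segmentation information)? Also, do the MLM labels use the prefixed version? Many thanks!

About the hyperparameter intermediate_size

The code sets intermediate_size=4096. As I understand it, the paper (page 4, line 2) says intermediate_size is 4 times the hidden size, so it should be 16384. I am not sure whether I have misunderstood. Thanks for any answer!

Hello, a question about the saved parameters

The pre-trained models you provide contain only the BERT part of the model parameters; the "cls/predictions" and "cls/seq_relationship" parameters are missing. Could you release the complete model parameters (mainly for tiny-albert and albert-base)? Many thanks.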

Failed to find any matching files for ./albert_large_zh/bert_model.ckpt

I got an error when trying to run fine-tuning with the albert_large_zh model. I ran the following commands:

export BERT_BASE_DIR=./albert_large_zh
export TEXT_DIR=./lcqmc
nohup python3 run_classifier.py   --task_name=lcqmc_pair   --do_train=true   --do_eval=true   --data_dir=$TEXT_DIR   --vocab_file=./albert_config/vocab.txt  \
--bert_config_file=./albert_config/albert_config_large.json --max_seq_length=128 --train_batch_size=64   --learning_rate=2e-5  --num_train_epochs=3 \
--output_dir=albert_large_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

The error output is as follows:

ERROR:tensorflow:Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt
E1013 17:11:41.890616 140328154277632 error_handling.py:70] Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt
INFO:tensorflow:training_loop marked as finished
I1013 17:11:41.890880 140328154277632 error_handling.py:96] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1013 17:11:41.891006 140328154277632 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
  File "run_classifier.py", line 947, in <module>
    tf.app.run()
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_classifier.py", line 819, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2709, in _call_model_fn
    config)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2967, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1549, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1867, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "run_classifier.py", line 498, in model_fn
    ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
  File "/home/lcy/nlp/SecretProject/ALBERT/albert_zh/modeling.py", line 352, in get_assignment_map_from_checkpoint
    init_vars = tf.train.list_variables(init_checkpoint)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 97, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 636, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 648, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt

How can I solve this? Thank you very much.
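
A TensorFlow checkpoint is addressed by a prefix, and the files on disk are named prefix.index, prefix.data-..., and so on. One plausible cause of this error, stated as an assumption, is that the files in the model directory use a different prefix from bert_model.ckpt; another report on this page passes albert_model.ckpt, for instance. The sketch below lists the prefixes actually present; both function names are hypothetical.

```python
import os


def prefixes_from_filenames(filenames):
    """Derive checkpoint prefixes from file names ending in '.index'."""
    suffix = ".index"
    return sorted(
        name[: -len(suffix)] for name in filenames if name.endswith(suffix)
    )


def find_checkpoint_prefixes(model_dir):
    """Return full checkpoint prefixes found directly inside model_dir."""
    names = os.listdir(model_dir)
    return [os.path.join(model_dir, p) for p in prefixes_from_filenames(names)]
```

For example, `find_checkpoint_prefixes("./albert_large_zh")` would show which value is valid for --init_checkpoint, or return an empty list if the archive was not unpacked into that directory.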

Was albert_tiny trained from scratch?

Hello, was tinybert trained directly from scratch, or was it distilled from a large model into a small one?

Could I run pretraining on multi gpu?

Hi,
I've been looking through the code and couldn't find a way to train on multiple GPUs.
Is there a way to do that, or do you have any plans to implement this feature?
Please let me know and thanks in advance :)

Are you going to publish an xxlarge English model?

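Request to release a 6-layer ALBERT model

The currently released albert_tiny model has only 4 layers. Although the model is small, its quality still lags behind other models. albert_base has 12 layers, but the model is relatively large overall, and its prediction efficiency is close to that of bert_base and roberta_base. I therefore hope the author can release a 6-layer ALBERT model to suit more tasks. Thank you.

cmrc2018 task

Could you share the code for the cmrc2018 task? I would like to verify it myself.

ALBERT pre-training question

When training a deeper ALBERT, is it necessary, as the paper says, to train a 6-layer model first and then fine-tune the 12-layer model from the 6-layer parameters? Or can it be trained directly from scratch? Will the pre-training results of the two approaches differ much?

Error when running albert_large; the word embedding dim seems wrong, please help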

ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((21128, 1024)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([21128, 128]) from checkpoint reader.
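The error is as above.
I ran the official BERT classification script directly. It works with BERT, so I wonder whether the embedding dim is set differently. Did you modify any other code when running classification? Thanks.

Fine-tuning albert_large_zh on a text classification task gives only 49.99% test accuracy, while albert_tiny_zh and albert_base_zh reach 95% with the same data, code, and scripts

As the title says: is there anything special about albert_large_zh on text classification tasks?

On the msra-ner dataset, albert-base trained on 75 samples from the training set reaches 74% accuracy on the test set, better than BERT

How exactly is the factorization of the embedding layer implemented?

From the paper, my guess is that tokens are embedded into 128 dimensions, followed by a fully connected layer up to the higher dimension?
Or is it implemented some other way? The paper does not seem to explain this very clearly.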
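
The ALBERT paper describes decomposing the V x H embedding matrix into a V x E lookup table followed by an E x H projection. The sketch below illustrates that idea with plain Python lists; it is not the repository's implementation, the function names are hypothetical, and bias terms are ignored as a simplifying assumption. The sizes in the usage note (21128, 128, 1024) are taken from the shape error reported elsewhere on this page.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the embedding block, without bias terms.

    With embedding_size=None the table maps straight to hidden_size (V*H);
    otherwise it is factorized into V*E plus E*H.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


def factorized_embed(token_id, table, projection):
    """Look up a low-dimensional row, then project it to the hidden size.

    table has shape V x E and projection has shape E x H.
    """
    low = table[token_id]
    hidden_size = len(projection[0])
    return [
        sum(low[e] * projection[e][h] for e in range(len(low)))
        for h in range(hidden_size)
    ]
```

With these sizes, `embedding_params(21128, 1024)` gives 21,635,072 parameters, while `embedding_params(21128, 1024, 128)` gives 2,835,456.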

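How do I remove the Adam-related variables from a model pre-trained on TPU?

I load the trained model with the code below. Is this happening because GPU and TPU are different?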

# tf.__version__ 1.15.0
sess = tf.Session()
imported_meta = tf.train.import_meta_graph('model.ckpt-250000.meta')
imported_meta.restore(sess,  'model.ckpt-250000.data-00000-of-00001')  # <-- the exception is raised here

my_vars = []
for var in tf.all_variables():
    if 'adam_v' not in var.name and 'adam_m' not in var.name:
        my_vars.append(var)
saver = tf.train.Saver(my_vars)
saver.save(sess, './model.ckpt')

INFO:tensorflow:Restoring parameters from gs://medical_bert/medical/model.ckpt-250000
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
   1364     try:
-> 1365       return fn(*args)
   1366     except errors.OpError as e:

8 frames
InvalidArgumentError: No OpKernel was registered to support Op 'TPUReplicatedInput' used by {{node input0}}with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
InvalidArgumentError: No OpKernel was registered to support Op 'TPUReplicatedInput' used by node input0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py in restore(self, sess, save_path)
   1324       # We add a more reasonable error message here to help users (b/110263146)
   1325       raise _wrap_restore_error_with_msg(
-> 1326           err, "a mismatch between the current graph and the graph")
   1327 
   1328   @staticmethod

InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'TPUReplicatedInput' used by node input0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]
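
One way to avoid the TPU-specific ops is to skip the .meta graph entirely and read variable values straight from the checkpoint prefix (model.ckpt-250000, not the .data-00000-of-00001 file). The sketch below is an untested outline under that assumption. It relies on `tf.train.list_variables` and `tf.train.load_variable` from the TF 1.x API; the function names and the list of slot suffixes are hypothetical, with the suffixes taken from the variable names quoted on this page.

```python
# Slot names seen in reports on this page; extend as needed (assumption).
OPTIMIZER_SLOT_SUFFIXES = ("adam_m", "adam_v", "lamb_m", "lamb_v")


def is_optimizer_slot(var_name):
    """True if the last path component of var_name is an optimizer slot."""
    base = var_name.split(":")[0]
    return base.rsplit("/", 1)[-1] in OPTIMIZER_SLOT_SUFFIXES


def strip_optimizer_slots(src_prefix, dst_prefix):
    """Copy a checkpoint, leaving out optimizer slot variables."""
    # Imported lazily so the name filter works without TensorFlow installed.
    import tensorflow.compat.v1 as tf

    with tf.Graph().as_default():
        kept = []
        for name, _shape in tf.train.list_variables(src_prefix):
            if is_optimizer_slot(name):
                continue
            value = tf.train.load_variable(src_prefix, name)
            kept.append(tf.Variable(value, name=name))
        saver = tf.train.Saver(kept)
        with tf.Session() as sess:
            sess.run(tf.variables_initializer(kept))
            # Only the weights are written; no .meta graph is exported.
            saver.save(sess, dst_prefix, write_meta_graph=False)
        return [v.op.name for v in kept]
```

A call such as `strip_optimizer_slots("model.ckpt-250000", "./model.ckpt")` would then write a smaller checkpoint that a CPU or GPU graph can restore by variable name.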

How can I continue training from a trained ALBERT checkpoint on my own corpus?

I use the run_pretrain.py code and point model_dir at the trained model, but it raises the exception Key bert/embedding/Layernorm/beta/lamb_m not found:

ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:worker/replica:0/task:0:
Key bert/embeddings/LayerNorm/beta/lamb_m not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

The lamb_m optimizer parameters are removed after training finishes. How do I correctly load the pre-trained checkpoint to continue training?
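
Restoring through model_dir asks the saver for every variable in the training graph, including optimizer slots such as lamb_m that the released checkpoint no longer contains. A common alternative is to warm-start only the variables that exist in both places. A traceback elsewhere on this page shows the repository calling modeling.get_assignment_map_from_checkpoint for this purpose; the version below is a simplified sketch of the idea with a hypothetical name, not that function's actual code.

```python
import re


def build_assignment_map(graph_var_names, checkpoint_var_names):
    """Pair up variables that exist in both the graph and the checkpoint.

    Returns (assignment_map, missing), where missing lists graph variables
    absent from the checkpoint (for example, optimizer slots).
    """
    available = set(checkpoint_var_names)
    assignment_map = {}
    missing = []
    for full_name in graph_var_names:
        # Graph variable names carry an output suffix such as ':0'.
        name = re.sub(r":\d+$", "", full_name)
        if name in available:
            assignment_map[name] = name
        else:
            missing.append(name)
    return assignment_map, missing
```

In TensorFlow 1.x such a map can be passed to `tf.train.init_from_checkpoint(init_checkpoint, assignment_map)`, so the variables left in `missing` keep their fresh initial values. Whether the pre-training script exposes an init_checkpoint-style option, with model_dir pointing at a new empty directory, is an assumption to verify against the script itself.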

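Is MXNet supported?

I would like to import ALBERT into GluonNLP. Could you provide MXNet support (for example, a param file)?

A bug I found, and a question: is character-level or word-level segmentation better?

A bug I fixed:
In create_pretraining_data.py, the labels in masked_lm_labels carry the ## prefix while the tokens do not, so convert_tokens_to_ids raises an error during conversion. The fix is to add logic to convert_by_vocab that strips the ##.

Question:
I see that the code in create_pretraining_data.py strips every '##' prefix before calling convert_tokens_to_ids. However, vocab.txt actually contains many tokens with the ## prefix, such as '##好' and '##在'. This raises the question: is character-level or word-level segmentation better?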
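
The fix described above can be sketched as follows. This is an illustration only: `strip_wordpiece_prefix` and `convert_tokens_stripping_prefix` are hypothetical names, and the real convert_by_vocab in the repository may differ.

```python
def strip_wordpiece_prefix(token):
    """Drop a leading '##' so '##好' maps to the same entry as '好'."""
    if token.startswith("##") and len(token) > 2:
        return token[2:]
    return token


def convert_tokens_stripping_prefix(vocab, tokens, strip_prefix=True):
    """Map tokens to ids, optionally stripping the '##' prefix first.

    vocab is a dict from token string to integer id.
    """
    ids = []
    for token in tokens:
        key = strip_wordpiece_prefix(token) if strip_prefix else token
        ids.append(vocab[key])
    return ids
```

With stripping enabled, a label such as '##在' resolves through the unprefixed entry, so it no longer fails when only '在' is used on the input side. With stripping disabled, prefixed and unprefixed characters keep separate ids, which is the word-boundary signal the question is about.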

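Could the high-quality training corpus be released?

Could the cleaned, sentence-segmented, high-quality corpus used for training be released now?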

Using the multilingual pre-trained BERT model

Please tell me: can I use the multilingual pre-trained model from BERT to train on custom data with the ALBERT code?

How to create vocab for other language?

Thanks for the implementation!
I would like to know how you have created vocab for training ALBERT.
I am using SentencePiece for this, but which model_type should I choose: bpe, word, or char?

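Error from the similarity script

Running with the albert_tiny_zh pre-trained model as the model in the output folder fails (that is, running the similarity script directly on albert_tiny_zh without the fine-tuning step).

Error message: tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key global_step not found in checkpoint

Note: after fine-tuning on the LCQMC dataset, it runs without error.

Why not use WordPiece directly for Chinese instead of removing the ##?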

train albert on my own data, convergence speed is slow

Hi, I tried to train ALBERT on my own data using 32 GTX 1080 Ti GPUs. After some steps the loss stopped decreasing; it seems to be stuck in a local minimum.

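How do I load an already-trained model and continue training on my own data?

As the title says.

Is there a task like translation?

Tasks like machine translation (MT) are rarely seen here. Why is that?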

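Review of Chinese pre-trained model results

The README reports test results from the "double-review" (双审) stage. Could you also provide the corresponding test results for pre-training?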

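About batch_size and learning rate for the xlarge model

Hello, I have recently been fine-tuning xlarge-albert on my own task. Initially I set batch_size to 16 and the learning rate to 2e-5, and found that the loss oscillated badly during training and validation results were very poor.
I then lowered the learning rate to 2e-6 and results improved somewhat, but validation accuracy still fell short of the original BERT.
Finally, I lowered the learning rate further to 2e-7, and results improved a little more, but still fell short of the original BERT. They also fell short of albert-base, so I suspect something is wrong with the training.
So I would like to ask: what learning rate and batch_size are appropriate when fine-tuning xlarge-albert? I saw you mention that batch_size must not be too small or accuracy may suffer. Is my batch_size of 16 too small?

Fine-tuning albert_xlarge_zh on a classification task converges more slowly than bert_base?

With the same batch_size, max_seq_len, learning_rate, and epoch_num, on the same classification data of 400,000 examples, bert_base finishes in 12 hours, whereas albert_xlarge_zh has been running for 24 hours and apparently has not finished even 1 epoch. Is this normal, or is something misconfigured?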
