brightmart / albert_zh

A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS, large-scale pre-trained Chinese ALBERT models

Home Page: https://arxiv.org/pdf/1909.11942.pdf

Python 98.85% Shell 1.15%
albert bert roberta xlnet tensorflow pytorch pre-trained-model chinese-corpus pre-trained

albert_zh's People

Contributors

brightmart, bringtree, multiverse-tf, solumilken, stopit

albert_zh's Issues

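Pre-training does not use the GPU

Hi, I am pre-training the tiny model on my own data, and it appears to run on the CPU only. My environment:
1. device_list shows 1 CPU and 4 GPUs.
2. TF version: I tried both a GPU-only install and a GPU and CPU install side by side (keeping the GPU version >= the CPU version).
3. CUDA_VISIBLE_DEVICES is set to the index of an existing GPU.

Result: the CPU is heavily used, while the GPU uses only a little over 100 MB of memory:
/Users/zhangyang/Documents/Albert cpu使用情况.png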

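No example of prediction on user input

The existing examples show batch processing of text only; there is no example of prediction on user input. When I use another open-source BERT tool (TF-based) to load the tiny_bert model and predict on user input, it always fails with an error saying that the shapes of some variables differ from those in the checkpoint, so it cannot run.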

SOP data preparation

Hi,
When generating SOP training instances, there are two edge cases: the document contains only one sentence, or every sentence in the document is very long (i.e. longer than target_token_num). In both cases the generated output is
[CLS] tokens_a [SEP] tokens_a [SEP]. This is caused by the continue statement in create_instances_from_document_albert(), which does not increment the index i.

If there is only one sentence in the current chunk, how about randomly choosing a split position in tokens_a so that SOP is still possible? Tokens before that position stay in tokens_a, and all tokens from that position onward go to tokens_b.

    if len(tokens_b) == 0 and len(current_chunk) == 1:
        if len(tokens_a) > 1:
            # There is only one sentence in the chunk. It may be very long, or it may be the last one in the document.
            # To build an SOP instance, split the sentence into 2 parts.
            index = rng.randint(1, len(tokens_a) - 1)
            # index = int(len(tokens_a)/2)
            tokens_b = tokens_a[index:]
            tokens_a = tokens_a[0:index]
        else:
            print("only 1 token in tokens_a, can't split it into 2 parts. just skip this sentence.")
            break
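
The proposed split can be tried in isolation. The following is a minimal, self-contained sketch, not the repository's code: the helper names split_for_sop and make_sop_instance are hypothetical, and the label convention (1 means the two segments were swapped) is an assumption made for illustration.

```python
import random


def split_for_sop(tokens_a, rng):
    """Split one token list into two non-empty parts at a random position."""
    if len(tokens_a) < 2:
        return None  # a single token cannot be split into two parts
    # random.Random.randint is inclusive on both ends, so both parts are non-empty.
    index = rng.randint(1, len(tokens_a) - 1)
    return tokens_a[:index], tokens_a[index:]


def make_sop_instance(tokens_a, tokens_b, rng):
    """Return (tokens, label); label 1 means the segments were swapped (assumed convention)."""
    swapped = rng.random() < 0.5
    if swapped:
        tokens_a, tokens_b = tokens_b, tokens_a
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    return tokens, int(swapped)


rng = random.Random(0)
sentence = list("今天天气很好")
first, second = split_for_sop(sentence, rng)
tokens, label = make_sop_instance(first, second, rng)
print(tokens, label)
```

Because the split index is drawn from 1 to len(tokens_a) - 1, neither segment is ever empty, which avoids the [CLS] tokens_a [SEP] tokens_a [SEP] output described above.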

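When will the model code and pre-trained weights be released?

Will the model code and pre-trained weights be released according to this schedule?
1. albert_base, 12M parameters, 12 layers, October 5
2. albert_large, 18M parameters, 24 layers, October 13
3. albert_xlarge, 59M parameters, 24 layers, October 6
4. albert_xxlarge, 233M parameters, 12 layers, October 7 (the best-performing model)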

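Does ALBERT reduce GPU memory usage? Does it train faster?

The most obvious difference I see is that the model file is much smaller than BERT's, yet GPU memory usage seems about the same as with BERT. Some articles say that ALBERT (1) solves the memory limitation problem and (2) speeds up training. Why am I not seeing either? Can anyone explain?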
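
One way to look at this question is to count weights. ALBERT factorizes the embedding matrix and shares one set of layer weights across all layers, which shrinks the stored model; the forward pass still applies that shared layer once per layer, so activations and compute stay roughly the same. The sketch below is a rough estimate only: it ignores biases and LayerNorm parameters, the function names are hypothetical, and the sizes (vocabulary 21128, embedding 128, hidden 768, 12 layers) are illustrative assumptions rather than the values of any particular released model.

```python
def bert_style_params(vocab, hidden, intermediate, layers):
    """Rough weight count: full-size embedding, separate weights for every layer."""
    embedding = vocab * hidden
    # Four attention projections plus the two feed-forward matrices.
    per_layer = 4 * hidden * hidden + 2 * hidden * intermediate
    return embedding + layers * per_layer


def albert_style_params(vocab, embed, hidden, intermediate):
    """Rough weight count: factorized embedding, one layer's weights shared by all layers."""
    embedding = vocab * embed + embed * hidden
    per_layer = 4 * hidden * hidden + 2 * hidden * intermediate
    return embedding + per_layer  # stored once, regardless of depth


bert_count = bert_style_params(vocab=21128, hidden=768, intermediate=3072, layers=12)
albert_count = albert_style_params(vocab=21128, embed=128, hidden=768, intermediate=3072)
print(bert_count, albert_count)
```

Under these assumptions the stored weights shrink by roughly a factor of ten, while the number of layer applications per input (12) is unchanged, which is consistent with a smaller model file but similar memory use at run time.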

ALBERT prediction is much slower than BERT

Hi, I fine-tuned a classification model with ALBERT, using the two xlarge models released on GitHub, albert_xlarge_zh_177k and albert_xlarge_zh_183k. The main parameters are:
--do_train=true
--do_eval=true
--do_predict=false
--do_export=false
--vocab_file=$BERT_BASE_DIR/vocab.txt
--bert_config_file=$BERT_BASE_DIR/albert_config_xlarge.json
--init_checkpoint=$BERT_BASE_DIR/albert_model.ckpt
--max_seq_length=32
--train_batch_size=64
--save_checkpoints_steps=500
--max_steps_without_decrease=15000
--learning_rate=1e-6
--num_train_epochs=10 \

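python: 3.6.4
TensorFlow: 1.14
Training: V100 GPU
Prediction: Dell laptop
I use tornado as the web framework and run all tests on my own laptop with exactly the same test code.
BERT prediction takes 220 ms.
ALBERT prediction takes 2200 ms.
That is a large gap. Shouldn't a smaller model with fewer parameters predict faster? Any advice would be appreciated.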
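
A rough estimate of the dense matrix work per token suggests why a slowdown of this size is plausible: inference time follows the hidden size and the number of layers applied, not the number of stored parameters. The sketch below is an approximation under stated assumptions. The function name is hypothetical, the sizes are taken from the published BERT-base and ALBERT-xlarge configurations and should be checked against the config JSON actually used, and attention-score products, biases and the embedding lookup are ignored.

```python
def matmul_cost_per_token(hidden, intermediate, layers):
    """Rough multiply count per token in the dense layers of the encoder.

    Counts the four attention projections (4 * hidden^2) and the two
    feed-forward matrices (2 * hidden * intermediate) for every layer.
    """
    per_layer = 4 * hidden * hidden + 2 * hidden * intermediate
    return layers * per_layer


# Assumed sizes; verify against your own bert_config_file.
bert_base = matmul_cost_per_token(hidden=768, intermediate=3072, layers=12)
albert_xlarge = matmul_cost_per_token(hidden=2048, intermediate=8192, layers=24)
print(albert_xlarge / bert_base)
```

With these assumed sizes the xlarge encoder does more than ten times the dense multiplications of BERT-base per token, the same order of magnitude as the 220 ms versus 2200 ms reported above.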

create_pretraining_data.py line 322

Hello brightmart,
Thank you for the implementation of ALBERT. Looking at line 322 of create_pretraining_data.py:
if len(tokens_a)==0 or len(tokens_b)==0: continue
I see a problem:
if the list current_chunk contains only the last sentence of the document, current_chunk ends up with the same sentence twice. Then tokens_a is the last sentence and tokens_b is also the last sentence.

final layer normalization

The Pre-LN Transformer puts the layer normalization inside the residual connection and equips with an additional final-layer normalization before prediction

I could not find this final layer normalization in prelln_transformer_model.
Did I miss it, or is this layer not implemented?
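
For reference, the structure described in the quoted sentence can be sketched in plain Python. This is a toy illustration, not the repository's prelln_transformer_model: toy_sublayer is a hypothetical stand-in for attention or the feed-forward block, and the layer normalization here has no learned scale or shift.

```python
import math


def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward."""
    return [2.0 * v + 1.0 for v in x]


def pre_ln_encoder(x, num_layers):
    for _ in range(num_layers):
        # Pre-LN: normalize inside the residual branch, then add the result back to x.
        x = [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]
    # Final layer normalization: the residual stream itself is never
    # normalized inside the loop, so it is normalized once before prediction.
    return layer_norm(x)


out = pre_ln_encoder([1.0, 2.0, 4.0], num_layers=3)
print(out)
```

Without the last layer_norm call, the output would be the raw, unnormalized residual sum, which is why the Pre-LN formulation adds it.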

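When an input token is a character inside a word, is the token '字' or '##字'?

I noticed that in the Chinese vocab.txt shipped with the official BERT release, every Chinese character has two tokens, one with the '##' prefix and one without. My understanding is that the unprefixed token represents the first character of a word and the prefixed token a non-first character. Since the two map to different ids, I would like to ask: for non-first characters inside a word, does the pre-training input use the prefixed token (which would feed word segmentation information to the model)? Also, do the MLM labels use the prefixed version? Many thanks!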

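About the hyperparameter intermediate_size

The code sets intermediate_size=4096. As I understand it, the paper (page 4, line 2) says intermediate_size is 4 times the hidden size, so it should be 16384. I am not sure whether I have misunderstood. Thanks for any clarification!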

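A question about the saved parameters

The pre-trained checkpoints you provide contain only the BERT part of the model parameters; the "cls/predictions" and "cls/seq_relationship" parameters are missing. Could you release the complete model parameters (mainly for tiny-albert and albert-base)? Many thanks.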

Failed to find any matching files for ./albert_large_zh/bert_model.ckpt

I got an error when trying to run fine-tuning with the albert_large_zh model. I ran the following commands:

    export BERT_BASE_DIR=./albert_large_zh
    export TEXT_DIR=./lcqmc
    nohup python3 run_classifier.py --task_name=lcqmc_pair --do_train=true --do_eval=true --data_dir=$TEXT_DIR --vocab_file=./albert_config/vocab.txt \
    --bert_config_file=./albert_config/albert_config_large.json --max_seq_length=128 --train_batch_size=64 --learning_rate=2e-5 --num_train_epochs=3 \
    --output_dir=albert_large_lcqmc_checkpoints --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

The ERROR output is as follows:

ERROR:tensorflow:Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt
E1013 17:11:41.890616 140328154277632 error_handling.py:70] Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt
INFO:tensorflow:training_loop marked as finished
I1013 17:11:41.890880 140328154277632 error_handling.py:96] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1013 17:11:41.891006 140328154277632 error_handling.py:130] Reraising captured error
Traceback (most recent call last):
  File "run_classifier.py", line 947, in <module>
    tf.app.run()
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_classifier.py", line 819, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2876, in train
    rendezvous.raise_errors()
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 131, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2871, in train
    saving_listeners=saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1188, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2709, in _call_model_fn
    config)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1146, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2967, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1549, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1867, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "run_classifier.py", line 498, in model_fn
    ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
  File "/home/lcy/nlp/SecretProject/ALBERT/albert_zh/modeling.py", line 352, in get_assignment_map_from_checkpoint
    init_vars = tf.train.list_variables(init_checkpoint)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 97, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 636, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/home/lcy/.conda/envs/lcyVenv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 648, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./albert_large_zh/bert_model.ckpt

How can I fix this? Many thanks.
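
A TensorFlow checkpoint path is a prefix shared by several files (for example <prefix>.index and <prefix>.data-00000-of-00001), and the error above means no files match the prefix ./albert_large_zh/bert_model.ckpt. One way to check which prefix a downloaded model directory actually contains is to list its .index files. The helper below is a hypothetical sketch using only the standard library. Note that another report on this page passes $BERT_BASE_DIR/albert_model.ckpt, so the file name may differ from bert_model.ckpt; that is an observation, not a confirmed diagnosis.

```python
import glob
import os


def find_checkpoint_prefixes(model_dir):
    """Return checkpoint prefixes in model_dir, one per '<prefix>.index' file."""
    index_files = glob.glob(os.path.join(model_dir, "*.index"))
    return sorted(path[: -len(".index")] for path in index_files)


# Prints an empty list if the directory is missing or holds no checkpoint.
print(find_checkpoint_prefixes("./albert_large_zh"))
```

Whatever prefix this prints is the value to pass to --init_checkpoint.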

Could I run pre-training on multiple GPUs?

Hi,
I have been looking through the code and could not find a way to train on multiple GPUs.
Is there a way to do that, or do you have any plans to implement this feature?
Please let me know, and thanks in advance :)

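Request to release a 6-layer ALBERT model

The released albert_tiny model has only 4 layers. The model is small, but its accuracy still lags behind other models. albert_base has 12 layers, but the model as a whole is fairly large, and its prediction speed is close to that of bert_base and roberta_base. I therefore hope the author can release a 6-layer ALBERT model to suit a wider range of tasks. Thanks.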

cmrc2018 task

Could you share the code for the cmrc2018 task? I would like to verify the results myself.

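ALBERT pre-training question

When training a deeper ALBERT, do I need to train a 6-layer model first and then fine-tune the 12-layer model from the 6-layer parameters, as the paper describes? Or can I train from scratch directly? Would the pre-training results of the two approaches differ much?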

Error when running albert_large; the word embedding dim seems wrong, please help

ValueError: Shape of variable bert/embeddings/word_embeddings:0 ((21128, 1024)) doesn't match with shape of tensor bert/embeddings/word_embeddings ([21128, 128]) from checkpoint reader.
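The error is shown above.
I ran the official BERT classification script directly. It works with BERT, so I wonder whether the embedding dim is set differently. Did you modify any other code when running classification? Thanks.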

How do I remove the Adam-related variables from a model pre-trained on TPU?

I load the trained model with the code below. Does the failure come from a difference between GPU and TPU?

# tf.__version__ 1.15.0
import tensorflow as tf

sess = tf.Session()
imported_meta = tf.train.import_meta_graph('model.ckpt-250000.meta')
# restore() expects the checkpoint prefix, not the .data-00000-of-00001 file.
imported_meta.restore(sess, 'model.ckpt-250000')  # the exception is raised here

my_vars = []
for var in tf.global_variables():
    if 'adam_v' not in var.name and 'adam_m' not in var.name:
        my_vars.append(var)
saver = tf.train.Saver(my_vars)
saver.save(sess, './model.ckpt')
INFO:tensorflow:Restoring parameters from gs://medical_bert/medical/model.ckpt-250000
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
   1364     try:
-> 1365       return fn(*args)
   1366     except errors.OpError as e:

8 frames
InvalidArgumentError: No OpKernel was registered to support Op 'TPUReplicatedInput' used by {{node input0}}with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
InvalidArgumentError: No OpKernel was registered to support Op 'TPUReplicatedInput' used by node input0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/saver.py in restore(self, sess, save_path)
   1324       # We add a more reasonable error message here to help users (b/110263146)
   1325       raise _wrap_restore_error_with_msg(
-> 1326           err, "a mismatch between the current graph and the graph")
   1327 
   1328   @staticmethod

InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'TPUReplicatedInput' used by node input0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [N=8, T=DT_INT32]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

	 [[input0]]
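
The traceback points at the imported meta graph: it contains the TPU-only op TPUReplicatedInput, which has no kernel on a machine that registers only CPU devices. One possible alternative, offered as an assumption and not a confirmed fix, is to skip import_meta_graph and read variables by name directly from the checkpoint, for example with tf.train.list_variables and tf.train.load_variable in TF 1.x, and then keep only the model weights. The sketch below covers just the name-filtering step in plain Python so it runs without TensorFlow. The helper names are hypothetical and the sample names are made up for illustration; the slot suffixes follow the adam_m, adam_v and lamb_m names that appear on this page, with lamb_v assumed by analogy.

```python
# Suffixes of optimizer slot variables; '/lamb_v' is an assumption by analogy.
OPTIMIZER_SLOT_SUFFIXES = ("/adam_m", "/adam_v", "/lamb_m", "/lamb_v")


def is_model_variable(name):
    """True for model weights, False for optimizer slot variables."""
    name = name.split(":")[0]  # drop a trailing ':0' if present
    return not name.endswith(OPTIMIZER_SLOT_SUFFIXES)


def model_variable_names(checkpoint_names):
    """Keep only the names that should survive in the slimmed-down checkpoint."""
    return [n for n in checkpoint_names if is_model_variable(n)]


# Made-up sample names, standing in for the output of a checkpoint listing.
sample = [
    "bert/embeddings/word_embeddings",
    "bert/embeddings/word_embeddings/adam_m",
    "bert/embeddings/word_embeddings/adam_v",
    "bert/embeddings/LayerNorm/beta",
    "bert/embeddings/LayerNorm/beta/lamb_m",
    "global_step",
]
print(model_variable_names(sample))
```

Matching on the suffix instead of a substring avoids dropping a model variable whose name merely contains one of these strings.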

How can I continue training from a trained ALBERT checkpoint with my own corpus?

I use run_pretrain.py and point model_dir at the trained model, but it raises the error Key bert/embeddings/LayerNorm/beta/lamb_m not found:

ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:worker/replica:0/task:0:
Key bert/embeddings/LayerNorm/beta/lamb_m not found in checkpoint
	 [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

The lamb_m optimizer variables are removed after training finishes. How do I correctly load the pre-trained checkpoint to continue training?
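
A common pattern in BERT-style code is to initialize from a checkpoint by matching variable names, so that variables absent from the checkpoint (such as optimizer slots) are freshly initialized and not restored. The traceback of another issue on this page shows modeling.get_assignment_map_from_checkpoint doing something along these lines. The sketch below shows the name-matching idea in plain Python; build_assignment_map is a hypothetical helper, and whether run_pretrain.py exposes an init_checkpoint-style option to use it with is an assumption.

```python
import re


def build_assignment_map(graph_var_names, checkpoint_var_names):
    """Map checkpoint names to graph names for variables present in both.

    Returns (assignment_map, missing), where missing lists graph variables
    that the checkpoint does not contain and that must be initialized fresh.
    """
    available = set(checkpoint_var_names)
    assignment_map = {}
    missing = []
    for name in graph_var_names:
        base = re.sub(r":\d+$", "", name)  # 'w:0' -> 'w'
        if base in available:
            assignment_map[base] = base
        else:
            missing.append(base)
    return assignment_map, missing


# Made-up names for illustration.
graph_vars = [
    "bert/embeddings/LayerNorm/beta:0",
    "bert/embeddings/LayerNorm/beta/lamb_m:0",
]
ckpt_vars = ["bert/embeddings/LayerNorm/beta"]
assignment_map, missing = build_assignment_map(graph_vars, ckpt_vars)
print(assignment_map, missing)
```

Here the model weight is mapped from the checkpoint, while the lamb_m slot lands in missing and would start from its default initial value.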

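A bug I found, and a question: is character-level or word-level segmentation better?

Bug fix:
In create_pretraining_data.py, the labels in masked_lm_labels carry the '##' prefix while the tokens do not, so convert_tokens_to_ids raises an error. The fix is to add logic to convert_by_vocab that strips the '##'.

Question:
I see that the code in create_pretraining_data.py strips every '##' prefix before calling convert_tokens_to_ids. But vocab.txt actually contains many tokens with the '##' prefix, such as '##好' and '##在'. This raises the question: is character-level or word-level segmentation better?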
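
The fix described above can be sketched as follows. This is a minimal illustration, not the repository's code: strip_wordpiece_prefix is a hypothetical helper, the (vocab, items) signature of convert_by_vocab is assumed, and the vocabulary is a toy dictionary standing in for the real vocab.txt.

```python
def strip_wordpiece_prefix(token):
    """Remove a leading '##' so '##好' and '好' look up the same vocab entry."""
    if token.startswith("##") and len(token) > 2:
        return token[2:]
    return token


def convert_by_vocab(vocab, items):
    """Convert tokens to ids, ignoring any '##' continuation prefix."""
    return [vocab[strip_wordpiece_prefix(item)] for item in items]


vocab = {"[MASK]": 0, "你": 1, "好": 2}  # toy vocabulary for illustration
ids = convert_by_vocab(vocab, ["你", "##好"])
print(ids)
```

With this lookup, prefixed and unprefixed forms of a character map to the same id, so the '##' marker serves only to record word boundaries during data creation (for example, for whole-word masking) and never reaches the model.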

How to create a vocab for another language?

Thanks for the implementation!
I would like to know how you created the vocab for training ALBERT.
I am using SentencePiece for this, but which model_type should I choose: bpe, word, or char?

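similarity script error

Using the albert_tiny_zh pre-trained model directory as the output directory raises an error (that is, running the similarity script directly on the albert_tiny_zh model, without the fine-tuning step).

Error message: tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key global_step not found in checkpoint

Note: after fine-tuning on the lcqmc dataset, the script runs without this error.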

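About batch_size and learning rate for the xlarge model

Hi, I have recently been fine-tuning xlarge-albert on my own task. I started with a batch_size of 16 and a learning rate of 2e-5, and found that the loss oscillated heavily during training and results on the validation set were very poor.
I then lowered the learning rate to 2e-6, which helped somewhat, but validation accuracy still fell short of the original BERT.
Finally, I lowered the learning rate further to 2e-7, which helped a bit more, but there is still a gap to the original BERT. There is also a gap compared with albert-base, so I suspect something is wrong with the training.
So I would like to ask: what learning rate and batch_size are appropriate when fine-tuning xlarge-albert? I saw you mention that batch_size should not be too small or accuracy may suffer. Is my batch_size of 16 too small?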
