google-research / electra
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
License: Apache License 2.0
Hi clarkkev,
The official TensorFlow website does not list the build configuration for version 1.15:
https://tensorflow.google.cn/install/source_windows#gpu
But you are using version 1.15 of tensorflow_gpu (readme.md). How did you configure the corresponding CUDA, cuDNN, and Bazel versions? My main question is about the versions.
Or can I use other versions of TensorFlow (on my laptop)?
Thanks
I have pretraining data generated for ALBERT. Can I use it for pretraining ELECTRA?
As I am doing this for Hindi, will using a SentencePiece vocab suffice?
Hi, where can I find a pretrained model for Chinese text?
Hello, @clarkkev @michelole @stefan-it @ormandi, thank you so much for your great work!
I am wondering if the current implementation has an interface to obtain the original/replaced predictions for an input sequence, i.e., input a sentence and output the original/replaced label for each token.
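For reference, a minimal sketch of what such an interface could look like, using the Hugging Face port rather than this repo's code (assumes a transformers version with ELECTRA support): the discriminator head emits one logit per token, with positive logits meaning "replaced".

from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

inputs = tokenizer("The quick brown fox ate over the lazy dog", return_tensors="pt")
logits = model(**inputs)[0]      # shape: (1, seq_len)
labels = (logits > 0).long()     # 1 = predicted replaced, 0 = original
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label in zip(tokens, labels[0].tolist()):
    print(token, label)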
Hello,
I noticed that BasicTokenizer runs _run_strip_accents by default and it's not parameterizable. Is it critical for training ELECTRA? Is it OK to turn it off for training non-English models?
Thanks.
Hi Kevin,
Thanks for the great work and for releasing the code/models. I was wondering if you have tried multi-GPU training for ELECTRA-Base and ELECTRA-Large (does the current code support multi-GPU)? And do you have stats for multi-GPU experiments as well?
Also, do you have stats for single-GPU training of ELECTRA-Base and ELECTRA-Large (how many days are needed until they converge to decent performance)?
Thanks!
-Hamid
Hi,
I just wanted to train a small model on a v3-8 TPU. I modified the parameters in the configure_pretraining.py file.
Training command shows the following output at the beginning:
Config:
debug False
disallow_correct False
disc_weight 50.0
do_eval False
do_lower_case False
do_train True
electra_objective True
embedding_size 128
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.25
generator_layers 1.0
iterations_per_loop 200
learning_rate 0.0005
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 19
max_seq_length 128
model_dir gs://tr-electra/models/electra-small-cased
model_hparam_overrides {}
model_name electra-small-cased
model_size small
num_eval_steps 100
num_tpu_cores 8
num_train_steps 1000000
num_warmup_steps 10000
pretrain_tfrecords gs://tr-electra/pretrain_tfrecords/pretrain_data.tfrecord*
results_pkl gs://tr-electra/models/electra-small-cased/results/unsup_results.pkl
results_txt gs://tr-electra/models/electra-small-cased/results/unsup_results.txt
save_checkpoints_steps 1000
temperature 1.0
tpu_name bert-4
tpu_zone None
train_batch_size 128
uniform_generator False
untied_generator True
untied_generator_embeddings False
use_tpu True
vocab_file gs://tr-electra/vocab.txt
vocab_size 32000
weight_decay_rate 0.01
Training command I used was:
$ python3 run_pretraining.py --data-dir gs://tr-electra --model-name electra-small-cased
Then the following error message is thrown:
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
init_fn=self._scaffold.init_fn)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No matching devices found for '/job:train_tpu_worker/device:TPU_SYSTEM:0'
I changed the logging level to debug and it seems that the specified TPU is found:
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 6331406465745664406)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 8261923844496220977)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 4004177864999644671)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 15443343382830505559)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 4979557051736282662)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 11642345092301563746)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 14428381851821878348)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 10090976446270558365)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 4924679865202343225)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 5665404154059623606)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5630790993080310184)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops
.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
I used this GCP instance:
gcloud compute instances create bert --zone=<zone> --machine-type=n1-standard-2 \
--image-project=ml-images --image-family=tf-1-15 --scopes=cloud-platform
and created the TPU with:
gcloud compute tpus create bert-4 --zone=<zone> --accelerator-type=v3-8 \
--network=default --range=192.168.4.0/29 --version=1.15
Pretty much the same configuration as I used for training my BERT models, like for Turkish, as documented in this cheatsheet.
Do you have any idea what causes this error message? It would be awesome to train ELECTRA models on TPU 🤗
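For what it's worth, the debug log above shows the devices registered under /job:worker, while the error asks for /job:train_tpu_worker, which looks like a job-name mismatch. A minimal connectivity check I would try (just a sketch, TF 1.15, same TPU name as above):

# Minimal TPU connectivity check under TF 1.15 (TPU name "bert-4" as above)
import tensorflow.compat.v1 as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="bert-4")
with tf.Session(resolver.master()) as sess:
    for device in sess.list_devices():
        print(device.name)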
Many thanks in advance!
Can you explain a method to build the vocab.txt file?
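One possible approach (a sketch assuming the Hugging Face tokenizers package; the corpus path is hypothetical): train a BERT-style WordPiece vocab over your corpus and write out vocab.txt.

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocab compatible with this repo's BERT-style tokenization.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["corpus.txt"], vocab_size=32000)
tokenizer.save_model(".")  # writes ./vocab.txt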
first_segment = []
second_segment = []
for sentence in self._current_sentences:
# the sentence goes to the first segment if (1) the first segment is
# empty, (2) the sentence doesn't put the first segment over length or
# (3) 50% of the time when it does put the first segment over length
if (not first_segment or
    len(first_segment) + len(sentence) < first_segment_target_length or
    (not second_segment and
     len(first_segment) < first_segment_target_length and
     random.random() < 0.5)):
first_segment += sentence
else:
second_segment += sentence
I have two questions about the above code snippet:
When running run_pretraining.py, I get this error before pretraining starts:
2020-04-28 04:43:55.132186: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:356] GrpcSession::ListDevices will initialize the session with an empty graph and other
defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from training_loop: '[SEP]'
Traceback (most recent call last):
File "run_pretraining.py", line 384, in
main()
.
(lines ignored because they're not useful)
.
File "/home/manai_elye2s/pretrain/electra/pretrain/pretrain_helpers.py", line 121, in _get_candidates_mask
ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]
KeyError: '[SEP]'
I got this error both with my own vocab and with the default one downloaded from this repo.
In both vocab.txt files the [SEP], [CLS], and [MASK] tokens are present, without extra spaces.
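A quick sanity check (paths hypothetical): a KeyError on an exact string lookup like vocab["[SEP]"] often comes down to invisible characters, e.g. a UTF-8 BOM on the first line or Windows \r line endings, so printing the membership test can help:

with open("vocab.txt", encoding="utf-8") as f:
    vocab = {line.rstrip("\n"): i for i, line in enumerate(f)}
for token in ("[SEP]", "[CLS]", "[MASK]"):
    print(repr(token), token in vocab)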
Hi @clarkkev ,
I just wanted to hear your opinion on adding a new section to the main readme file, where users can link their own trained ELECTRA models - maybe something like "Community models" 🤔
E.g. I just saw ELECTRA models on the Hugging Face model hub for various languages like Malay, Indonesian or Korean. Today I released base and small models for Turkish.
What do you think?
Best,
Stefan
Thank you for releasing your code.
I succeeded in training a small model on a GPU by following "Quickstart: Pre-train a small ELECTRA model", but a "NaN loss during training" error occurred when I trained a base model.
Do you have any idea why?
I use TensorFlow 1.15.0 and a Tesla V100-PCIE-32GB; the error log is as follows:
$ python run_pretraining.py --data-dir ../electra-en-data --model-name electra_base_owt_200k --hparams '{"num_train_steps": 200000, "model_size": "base", "train_batch_size": 128}'
..
2020-04-07 09:51:45.360762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30458 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:83:00.0, compute capability: 7.0)
2020-04-07 09:52:37.813499: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
1/200000 = 0.0%, SPS: 0.0, ELAP: 21, ETA: 47 days, 14:12:08 - loss: 44.4968
2/200000 = 0.0%, SPS: 0.1, ELAP: 39, ETA: 45 days, 4:02:14 - loss: 44.3760
3/200000 = 0.0%, SPS: 0.1, ELAP: 40, ETA: 31 days, 5:01:16 - loss: 44.5174
4/200000 = 0.0%, SPS: 0.1, ELAP: 42, ETA: 24 days, 5:38:09 - loss: 44.1623
5/200000 = 0.0%, SPS: 0.1, ELAP: 43, ETA: 20 days, 1:12:59 - loss: 44.2913
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
Traceback (most recent call last):
File "run_pretraining.py", line 385, in <module>
main()
File "run_pretraining.py", line 381, in main
args.model_name, args.data_dir, **hparams))
File "run_pretraining.py", line 344, in train_or_eval
max_steps=config.num_train_steps)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/home/***/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/home/***/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
Hello, I am trying to further pretrain the base and large models using a domain-specific corpus. But the documentation says that when continuing pre-training from the released small ELECTRA checkpoint, we should:
Set num_train_steps by (for example) adding "num_train_steps": 4010000 to the --hparams. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).
But Table 6 of the paper shows that the small ELECTRA model is trained for 1M steps. Which one should we use?
If 4e6 is correct, for how many steps have the base and large models been trained?
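For reference, under my reading of that instruction the continuation run would look something like this (data dir and model name are placeholders):

python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt \
    --hparams '{"num_train_steps": 4010000}'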
Thanks for sharing your implementation.
I did fine-tuning on SQuAD 2.0 using an 8 GB GPU and achieved great results, even though only part of the network was trained.
Here's the repo: https://github.com/renatoviolin/electra-squad-8GB
What if I want to fine-tune ELECTRA on both classification and sequence tagging?
What should I do?
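For what it's worth, a minimal sketch of one way to do this outside this repo (PyTorch + the Hugging Face port; the model name, class, and heads are illustrative): share one ELECTRA encoder between a sequence-level classification head and a token-level tagging head and train them jointly.

import torch.nn as nn
from transformers import ElectraModel

class MultiTaskElectra(nn.Module):
    def __init__(self, num_classes, num_tags):
        super().__init__()
        self.encoder = ElectraModel.from_pretrained("google/electra-small-discriminator")
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, num_classes)  # sentence-level classification
        self.tag_head = nn.Linear(hidden, num_tags)     # per-token tagging

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(input_ids, attention_mask=attention_mask)[0]
        cls_logits = self.cls_head(hidden_states[:, 0])  # vector at the [CLS] position
        tag_logits = self.tag_head(hidden_states)        # logits for every position
        return cls_logits, tag_logits

The two losses (e.g. cross-entropy for each head) can then simply be summed, possibly with task weights.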
First of all, very nice work! I love the idea of using a discriminator, and that current model architectures become even more capable when given a more difficult pretraining task.
Unfortunately I think I have to bother you with a problem. I was planning to continue pretraining the models with domain-specific data for some experiments, but the necessary data is not in the released models.
raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.NotFoundError: From /job:worker/replica:0/task:0: Key discriminator_predictions/dense/bias/adam_m not found in checkpoint [[node save/RestoreV2 (defined at tensorflow_core/python/framework/ops.py:1748) ]]
(unrelated)
A 60k test run on the cord19 dataset with eval:
https://tensorboard.dev/experiment/QwOMXKluQJKcn8D9i9p69A/#scalars
Lines 188 to 193 in 7911132
According to the code here, assume that n_layers = 24; then key_to_depths["encoder/layer_23/"] = 24, which is the depth of the last encoder layer. But the learning rate for the last layer is:
learning_rate * (layer_decay ** (24 + 2 - 24)) = learning_rate * (layer_decay ** 2)
That's what confuses me. Why is the learning rate for the last layer learning_rate * (layer_decay ** 2) rather than learning_rate? Am I missing something?
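The referenced lines are not reproduced above, but as I read optimization.py the logic is roughly the following (my paraphrase, not a verbatim copy): embeddings sit at depth 0, encoder layer i at depth i + 1, and the task-specific head at depth n_layers + 2, which is why the top encoder layer still gets a factor of layer_decay ** 2.

import collections

def get_layer_lrs(learning_rate, layer_decay, n_layers):
    key_to_depths = collections.OrderedDict({
        "/embeddings/": 0,
        "/embeddings_project/": 0,
        "task_specific/": n_layers + 2,
    })
    for layer in range(n_layers):
        key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
    return {
        key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
        for key, depth in key_to_depths.items()
    }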
Hello,
Is there any plan in releasing this model through the Tensorflow Hub API?
Thanks!
Hi,
Following the commands, I pre-trained ELECTRA-Small on my dataset. After pre-training, I want the learned embeddings, which I need for some other, more complicated downstream tasks. Could you please help me with how to extract the word embeddings after pre-training?
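A hedged sketch of one way to do this: read the learned token-embedding matrix straight out of the pre-training checkpoint. The path is hypothetical and the variable name follows the usual BERT/ELECTRA convention; verify it with tf.train.list_variables first.

import tensorflow.compat.v1 as tf

ckpt = tf.train.latest_checkpoint("models/electra_small")
reader = tf.train.load_checkpoint(ckpt)
word_embeddings = reader.get_tensor("electra/embeddings/word_embeddings")
print(word_embeddings.shape)  # (vocab_size, embedding_size)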
There seems to be a conflict about generator sizes. Those of ELECTRA small/base/large are 1/4, 1/3, and 1/4 as the latest paper claims. However, the provided pre-trained weights contradict that, with generator sizes of 1, 1/3, and 1/4 respectively.
ELECTRA contains a generator and a discriminator, where the generator replaces [MASK] tokens with plausible alternatives and the discriminator figures out which tokens were replaced.
As the paper says:
Typically k = [0.15n], i.e., 15% of the tokens are masked out
if the generator happens to generate the correct token, that token is considered "real" instead of "fake"
Thus only about (1 - generator_accuracy) * 0.15 of the tokens are fake, which leads to an extremely unbalanced binary classification task for the discriminator. But it does work. Why?
Here, the corrupted tokens produced by the generator serve as fake data. I can understand why we should handle duplicated positions and only apply the replacement once. However, I am confused about why the implementation below takes the average of the corrupted token ids at duplicated positions; what is the intuition behind it?
electra/pretrain/pretrain_helpers.py
Lines 101 to 103 in 19175a1
The paper also describes a model called ELECTRA-1.75M, which performs better than ELECTRA-1.45M (ELECTRA-Large).
So will the pre-trained ELECTRA-1.75M be released?
I'm trying to validate that I am running experiments on finetuned ELECTRA model correctly by using only tf.train.init_from_checkpoint() to load the weights. When evaluating my finetuned ELECTRA model using run_finetuning.py, I've found I get different accuracy results when using a different model_dir than the directory containing the finetuned model for run_config = tf.estimator.tpu.RunConfig(...).
With the original code, I get the expected accuracy (~81) but when I modify the model_dir argument in tf.estimator.tpu.RunConfig() to an empty directory, I get much lower and non-deterministic accuracy (~32). I was wondering why that is since the weights are still being loaded using tf.train.init_from_checkpoint(). Are there variables being loaded using tf.estimator.tpu.RunConfig()?
Hi guys!
I tried to evaluate the pre-trained ELECTRA-Small model with OpenWebText, but it cannot be loaded with the default configuration.
(electra) judith@judith-dev:~/workspace/ELECTRA/electra$ python3 run_pretraining.py --data-dir /home/judith/workspace/ELECTRA/data_dir --model-name electra_small --hparams '{"do_train": false, "do_eval": true}'
(...)
ERROR:tensorflow:Error recorded from evaluation_loop: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
2 root error(s) found.
(0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [64] rhs shape= [256]
[[node save/Assign_275 (defined at /home/judith/anaconda3/envs/electra/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [64] rhs shape= [256]
[[node save/Assign_275 (defined at /home/judith/anaconda3/envs/electra/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[save/RestoreV2/_460]]
0 successful operations.
0 derived errors ignored.
But I found that if I change generator_hidden_size (which actually means the fraction of the discriminator hidden size used for the generator) from 0.25 to 1 (making the generator hidden size 256 instead of 64), it runs successfully:
1/100 = 1.0%, SPS: 0.5, ELAP: 2, ETA: 3:32 - loss: 11.9305
2/100 = 2.0%, SPS: 0.6, ELAP: 3, ETA: 2:42 - loss: 11.8205
3/100 = 3.0%, SPS: 0.7, ELAP: 4, ETA: 2:25 - loss: 11.7672
4/100 = 4.0%, SPS: 0.7, ELAP: 6, ETA: 2:16 - loss: 11.7212
5/100 = 5.0%, SPS: 0.7, ELAP: 7, ETA: 2:10 - loss: 11.7709
(...)
Was that intended, or is it a mistake? I would be glad to hear your opinion.
Thanks in advance!
According to the paper, ELECTRA does not involve NSP (next sentence prediction) task. In that case, do we need sentence segmentation?
Does build_pretraining_dataset.py consider each line as a separate sentence? Or can we just feed raw text (with empty lines as separators for documents)?
Hi! Thanks for making source code available and for great paper.
Are there any plans to support loading models in Pytorch? Or implementation in transformers by Huggingface?
I tried pretraining small, base, and large models on a single 2080 Ti GPU, but GPU utilization seems very low. Am I doing something wrong, or is GPU usage actually that low?
When I try to load the model following the instructions on huggingface.co/models, i.e.:
tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")
I get the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-bb330c08e050> in <module>
----> 1 tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")
/opt/conda/lib/python3.6/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
179 config = kwargs.pop("config", None)
180 if not isinstance(config, PretrainedConfig):
--> 181 config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
182
183 if "bert-base-japanese" in pretrained_model_name_or_path:
/opt/conda/lib/python3.6/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
185
186 if "model_type" in config_dict:
--> 187 config_class = CONFIG_MAPPING[config_dict["model_type"]]
188 return config_class.from_dict(config_dict, **kwargs)
189 else:
KeyError: 'electra'
The version of transformers is 2.7.0. I reproduced the problem in colab here.
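For what it's worth, the KeyError suggests this transformers version simply predates ELECTRA support (it was added in 2.8.0, if I recall correctly), so upgrading should resolve it:

pip install --upgrade "transformers>=2.8.0"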
Here is a set of tokens that should not be masked during dynamic masking.
electra/pretrain/pretrain_helpers.py
Line 121 in 7911132
But should we also avoid masking all the [PAD] tokens at the end of a sentence (if the sentence is shorter than max_seq_length and there is no second sentence segment)?
I understand [PAD] itself has token_id = 0, but I do not see this being used to prevent masking in downstream steps. If we do not ignore it, this will affect the probability calculation here:
electra/pretrain/pretrain_helpers.py
Lines 167 to 170 in 7911132
Also, we would be trying to predict [PAD] tokens that are outside the actual sequence, which is a bit unintuitive.
Maybe I am missing something here. Thanks again for putting together such great work!
Hi there,
In the paper, I noticed that you tried training the generator ahead of the discriminator, then initializing the discriminator from the generator weights and training only the discriminator afterwards.
I was wondering if you also tried the same procedure, but instead of training the generator, initializing both generator and discriminator from a fully MLM pre-trained BERT (e.g. BERT-Large).
Sorry if this was mentioned in the paper and I missed it.
Cheers
I know that https://github.com/shehzaadzd/pytorch-pretrained-BERT can predict the masked token for a single example, but ELECTRA is a newly released model. How exactly can I input a sentence with some masked tokens, run inference with the pretrained ELECTRA model, and get the predicted tokens?
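A minimal sketch of one way to do this, assuming the Hugging Face port: the generator checkpoint is the masked-LM half of ELECTRA, so a fill-mask pipeline over it predicts [MASK] tokens (the discriminator checkpoint has no LM head and cannot do this).

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="google/electra-small-generator")
print(fill_mask("The quick brown fox [MASK] over the lazy dog."))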
Hello, I was wondering whether it is possible to train an ELECTRA model on Cloud TPUs, since you didn't mention TPUs in your research paper. Following this tutorial on BERT pre-training and fine-tuning on Google Cloud TPUs, I ran into some trouble in the TPU environment. Here is the link to the BERT TPU tutorial: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb
The Colab notebook I've created from the aforementioned one is here:
https://colab.research.google.com/drive/1ZG8xwXSJm8iQe19M4lACUcCetjf-eZii
The last cell was intended to run the training script with the use_tpu flag specified. However, it did not succeed.
So, am I doing anything wrong, or is this a bug on your side?
Thanks for your help!
While training, I see a single loss reported. Is it the training loss or the validation loss?
Hi, congrats on the paper. I really like the idea. I was wondering, what is your approach to masking tokens? Do you mask individual tokens independently, regardless of whether they might be units of a multi-token word, or do you mask all the tokens of a given word?
Let's say we have this tokenized sentence and we want to mask shareholder:
<s> ▁Meanwhile , ▁share hold er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>
Option 1: shareholder consists of 3 tokens and you allow one of them to be masked, without masking the other 2:
<s> ▁Meanwhile , ▁share <mask> er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>
Option 2: all the tokens of shareholder have to be masked:
<s> ▁Meanwhile , <mask> <mask> <mask> ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>
Which one is it? Or do you have a different approach?
Hi there,
great work on reducing training times!
I was wondering if the reported SQuAD 2.0 results for ELECTRA-Base in this repo are correct: 83.7 EM. This diverges strongly from the 80.5 EM reported in the paper.
Or is the score reported here meant to be F1?
Cheers
While trying to further pretrain the models with domain-specific data in Colab, I encountered a problem: the official pretrained model could not be loaded.
Here is the command for further pretraining:
hparam = '{"model_size": "small", \
"use_tpu":true, \
"num_tpu_cores":8, \
"tpu_name":"grpc://10.53.161.26:8470", \
"num_train_steps":4000100,\
"pretrain_tfrecords":"gs://tweet_torch/electra/electra/data/pretrain_tf_records/pretrain_data.tfrecord*", \
"model_dir":"gs://tweet_torch/electra/electra/data/electra_small/", \
"generator_hidden_size":1.0\
}'
!python electra/run_pretraining.py \
--data-dir "gs://tweet_torch/electra/electra/data/" \
--model-name "electra_small" \
--hparams '{hparam}'
The error message is pretty long, so I'll just paste the part that seems useful:
ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
From /job:worker/replica:0/task:0:
Key discriminator_predictions/dense/bias/adam_m not found in checkpoint
[[node save/RestoreV2 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
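A quick way to see whether the Adam slot variables the training graph asks for are actually in the checkpoint (the checkpoint path below is hypothetical):

import tensorflow.compat.v1 as tf

for name, shape in tf.train.list_variables("electra_small/model.ckpt-4000000"):
    if "adam" in name or "discriminator_predictions" in name:
        print(name, shape)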
I am generating pretraining data for Hindi using a SentencePiece vocab, and I get the following error:
python build_pretraining_dataset.py --corpus-dir data --vocab-file spiece.vocab --output-dir out --max-seq-length 128 --num-processes 1
Job 0: Creating example writer
Job 0: Writing tf examples
Traceback (most recent call last):
File "build_pretraining_dataset.py", line 230, in <module>
main()
File "build_pretraining_dataset.py", line 218, in main
write_examples(0, args)
File "build_pretraining_dataset.py", line 190, in write_examples
example_writer.write_examples(os.path.join(args.corpus_dir, fname))
File "build_pretraining_dataset.py", line 143, in write_examples
example = self._example_builder.add_line(line)
File "build_pretraining_dataset.py", line 50, in add_line
bert_tokids = self._tokenizer.convert_tokens_to_ids(bert_tokens)
File "/home/gamut/Downloads/electra-master/model/tokenization.py", line 130, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)
File "/home/gamut/Downloads/electra-master/model/tokenization.py", line 91, in convert_by_vocab
output.append(vocab[item])
KeyError: '[UNK]'
I found a suggested solution for this kind of error elsewhere, but since the script here only takes a vocab file as input and not a SentencePiece model, generating pretraining data from a SentencePiece vocab is a problem. Any solution?
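A hedged workaround sketch: build_pretraining_dataset.py expects a BERT-style WordPiece vocab containing [PAD]/[UNK]/[CLS]/[SEP]/[MASK], while a raw SentencePiece .vocab file (one token<TAB>score per line) has none of these. Prepending them lets the script run, though how well WordPiece tokenization matches the original SentencePiece segmentation is a separate question.

specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
with open("spiece.vocab", encoding="utf-8") as f:
    pieces = [line.split("\t")[0] for line in f]
with open("vocab.txt", "w", encoding="utf-8") as f:
    for token in specials + pieces:
        f.write(token + "\n")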
Hello,
I was wondering about the possibility of making the discriminator un-trainable for the first n steps, to give the generator time to learn to produce meaningful tokens before feeding them to the discriminator.
Since the training of the generator and discriminator is not adversarial (as far as I understand from the paper), this should be possible.
Is this option already implemented and I am missing some flag?
Would it be difficult to implement? Maybe there is some workaround with existing parameters?
Moreover, I see that the global loss, which is a linear combination of the generator and discriminator losses, is by default heavily weighted toward the discriminator (with a relative weight of 50); this of course makes sense since the "real" model is the discriminator.
Do you believe that re-weighting the losses could enhance generator training?
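For reference, with the defaults mentioned above (gen_weight = 1.0, disc_weight = 50.0, matching the config dump earlier in this page), the combined objective amounts to total_loss = 1.0 * mlm_loss + 50.0 * disc_loss.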
Thank you very much for the answers.
It seems that run_pretraining.py, when launched with do_train=False and do_eval=True, removes the input model (the one you want to evaluate) specified by --model-name, leaving only the evaluation folder containing the discriminator and generator scores.
In your paper, sometimes you use the dev set and sometimes the dev & test sets. But in this GitHub repo you did not mention which dataset you are using. Could you clarify, or update the paper? Thanks.
Besides, what are the training FLOPs for the results in this repo? Thanks!
It is not clear what num_eval_steps is, or what the loss printed when evaluating the performance of the pretrained model means.
Is it just the number of batches used for the evaluation?
Is the loss just the loss on the current batch, so that the final value is the average over the selected batches?
Thanks for the answers!
I downloaded this model and tried to use it.
I chose the SQuAD 2.0 dataset for fine-tuning.
When I tried to fine-tune the model on the command line, the program just stopped working and the command line seemed to exit the program.
The output is like this:
(env_tf115) D:\python_code\NLP\electra>python run_finetuning.py --data-dir "D:\python_code\NLP\electra\datadir" --model-name electra_small --hparams {"model_size": "small", "task_names": ["squad"], "num_trials": 2}
2020-03-26 22:00:19.349133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
{"model_size": "small", "task_names": ["squad"], "num_trials": 2}
================================================================================
Config: model=electra_small, trial 1/2
================================================================================
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir D:\python_code\NLP\electra\datadir
debug False
do_eval True
do_lower_case True
do_train True
doc_stride 128
double_unordered True
embedding_size 128
eval_batch_size 32
gcp_project None
init_checkpoint D:\python_code\NLP\electra\datadir\models\electra_small
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.8
learning_rate 0.0001
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 512
model_dir D:\python_code\NLP\electra\datadir\models\electra_small\finetuning_models\squad_model
model_hparam_overrides {}
model_name electra_small
model_size small
n_best_size 20
n_writes_test 5
num_tpu_cores 1
num_train_epochs 2.0
num_trials 2
predict_batch_size 32
preprocessed_data_dir D:\python_code\NLP\electra\datadir\models\electra_small\finetuning_tfrecords\squad_tfrecords
qa_eval_file <built-in method format of str object at 0x0000027494EB70B0>
qa_na_file <built-in method format of str object at 0x0000027494EB1AE0>
qa_na_threshold -2.75
qa_preds_file <built-in method format of str object at 0x0000027494EB70F0>
raw_data_dir <built-in method format of str object at 0x0000027494EAECA8>
results_pkl D:\python_code\NLP\electra\datadir\models\electra_small\results\squad_results.pkl
results_txt D:\python_code\NLP\electra\datadir\models\electra_small\results\squad_results.txt
save_checkpoints_steps 1000000
task_names ['squad']
test_predictions <built-in method format of str object at 0x0000027494EAD8F0>
tpu_job_name None
tpu_name None
tpu_zone None
train_batch_size 32
use_tfrecords_if_existing True
use_tpu False
vocab_file D:\python_code\NLP\electra\datadir\models\electra_small\vocab.txt
vocab_size 30522
warmup_proportion 0.1
weight_decay_rate 0.01
write_distill_outputs False
write_test_outputs False
Loading dataset squad_train
Existing tfrecords not found so creating
(env_tf115) D:\python_code\NLP\electra>
My computer setting:
Windows 10
Cuda 10.0.130
Cudnn 7.6.3
Tensorflow 1.15 GPU
I've put the pre-trained model files under the "datadir\models\electra_small" directory and the SQuAD dataset under the "datadir\finetuning_data\squad" directory.
Does anyone know why the fine-tuning is not working?
Hello,
I was wondering whether it is possible to add some loss metrics to the training cycle. The only thing I see while training an ELECTRA model is
1275000/3000000 = 42.5%, SPS: 3.1, ELAP: 9:24:02, ETA: 6 days, 11:55:19
which says nothing about how good it is. I'm trying to add some code to the estimator, but it seems to me that it would be much easier if all the metrics were shown, to see how successful the model is at each stage.
I'm training a non-English model, so I wanted better insight into how my model is performing at the moment.
Thanks
What are the standard tricks on the GLUE benchmark that achieve +4.2 accuracy over ELECTRA-Large?
I could not find a URL to download openwebtext.tar.xz. Could you share the URLs?
Hello,
I am trying to compare the time performance of BERT and ELECTRA pre-training.
Looking at the paper, it seems that at fixed FLOPs (i.e. fixed time) ELECTRA performs best on downstream fine-tuning tasks.
I am struggling a bit with the comparison, since the pre-training tasks are quite different between the models, and I am not sure they can be directly compared while keeping the same hyper-parameters.
For instance: max_seq_len in BERT seems to stand for the total length of the sentence pair, while for ELECTRA it is just one segment; this would suggest halving the length for ELECTRA (or doubling it for BERT). Am I right?
On the other hand, if I do so, the number of masks (or replacements in ELECTRA) would be smaller, since it scales as masked_lm_prob * max_seq_len.
Are there other parameters I should take into account for this comparison?
I am keeping the number of steps the same throughout, so I would expect increased performance on fine-tuning tasks after ELECTRA pretraining.
As usual, any comment or help is really appreciated!
electra/build_pretraining_dataset.py
Line 53 in 62e478a
Is _current_length ever bigger than _target_length (max_seq_length)? I think the comparison should be the other way around.
Hi,
I'm currently working on a new non-English ELECTRA model. Training on GPU seems to work and is running fine 🤗
The next step is to try model training on a TPU, so I would like to ask if you could post the final loss of both the base and large models (or even share the loss training curve), so that we have a reference point when training our own models 🤔
Many thanks in advance,
Stefan