google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

License: Apache License 2.0

Python 100.00%
deep-learning nlp tensorflow

electra's Issues

BasicTokenizer: _run_strip_accents

Hello,

I noticed that BasicTokenizer always runs _run_strip_accents and offers no parameter to disable it. Is accent stripping critical for training ELECTRA? Is it OK to turn it off when training non-English models?

Thanks.
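
For context, accent stripping in BERT-style tokenizers is commonly done by decomposing each character (Unicode NFD) and dropping the combining marks. The sketch below is a standalone illustration of that technique, not the repository's code; the function name `normalize_text` and the `strip_accents` switch are hypothetical and only show what making the step optional could look like.

```python
import unicodedata


def normalize_text(text, strip_accents=True):
    """Optionally remove combining accent marks from text.

    `strip_accents` is a hypothetical switch added for illustration.
    """
    if not strip_accents:
        return text
    # NFD splits an accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    word = "\u00e7\u00f6\u011f\u00fc\u015f"  # Turkish letters c-cedilla, o-umlaut, g-breve, u-umlaut, s-cedilla
    print(normalize_text(word))                       # accents removed
    print(normalize_text(word, strip_accents=False))  # text left unchanged
```

With stripping enabled, letters that are distinct in a language such as Turkish collapse onto their unaccented counterparts, which is why the question matters for non-English models.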

Multi-GPU training

Hi Kevin,

Thanks for the great work and for releasing the code and models. I was wondering whether you have tried multi-GPU training for ELECTRA-base and ELECTRA-large (does the current code support multi-GPU training?), and whether you have stats for multi-GPU experiments as well.

Also, do you have stats for single-GPU training of ELECTRA-base and ELECTRA-large (how many days are needed until they converge to a decent performance)?

Thanks!
-Hamid

TPU training: No matching devices found for

Hi,

I wanted to train a small model on a v3-8 TPU. I modified the parameters in the configure_pretraining.py file.

The training command prints the following output at startup:

Config:                                                                                                                                                                         
debug False
disallow_correct False                                                                                                                                                         
disc_weight 50.0                                                                                                                          
do_eval False                                                                                                                                                                             
do_lower_case False                                                                                                                                        
do_train True                                                                                                                             
electra_objective True                                                                                                                                     
embedding_size 128                                                                                                                          
eval_batch_size 128                                                                                                                                                      
gcp_project None                                                                                                                                                 
gen_weight 1.0                                                                                                                                                                            
generator_hidden_size 0.25                                                                                                                                                 
generator_layers 1.0                                                                                                             
iterations_per_loop 200                                                                                                    
learning_rate 0.0005                                                                                                                                                                     
lr_decay_power 1.0                                                                                                          
mask_prob 0.15                                            
max_predictions_per_seq 19                                                                                                         
max_seq_length 128                                                                                              
model_dir gs://tr-electra/models/electra-small-cased                                                                                                                                      
model_hparam_overrides {}                                                                                      
model_name electra-small-cased                                                                                                   
model_size small                                                                                                                                                                          
num_eval_steps 100                                                                                                                                                                       
num_tpu_cores 8                                                                                                                                                                           
num_train_steps 1000000                                                                                                                                                                   
num_warmup_steps 10000                                                                                                                                                                    
pretrain_tfrecords gs://tr-electra/pretrain_tfrecords/pretrain_data.tfrecord*                                                                                                             
results_pkl gs://tr-electra/models/electra-small-cased/results/unsup_results.pkl                                                                                                          
results_txt gs://tr-electra/models/electra-small-cased/results/unsup_results.txt                                                                                                          
save_checkpoints_steps 1000                  
temperature 1.0                                                                                                             
tpu_name bert-4                                                                                                                                                                           
tpu_zone None                                                                                                                          
train_batch_size 128                                                                                                                                                                      
uniform_generator False                                                                                                                                                                   
untied_generator True                     
untied_generator_embeddings False                                                                                                                                           
use_tpu True                                                                                                                                                         
vocab_file gs://tr-electra/vocab.txt                                        
vocab_size 32000
weight_decay_rate 0.01

The training command I used was:

$ python3 run_pretraining.py --data-dir gs://tr-electra --model-name electra-small-cased

Then the following error message is thrown:

  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No matching devices found for '/job:train_tpu_worker/device:TPU_SYSTEM:0'

I changed the logging level to debug and it seems that the specified TPU is found:

INFO:tensorflow:Found TPU system:                                                                                                                                                         
INFO:tensorflow:*** Num TPU Cores: 8                                                                                                                                                      
INFO:tensorflow:*** Num TPU Workers: 1                                                                                                                                                    
INFO:tensorflow:*** Num TPU Cores Per Worker: 8                                                                                                                                           
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 6331406465745664406)                                                          
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 8261923844496220977)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 4004177864999644671)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 15443343382830505559)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 4979557051736282662)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 11642345092301563746)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 14428381851821878348)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 10090976446270558365)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 4924679865202343225)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 5665404154059623606)                                   
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5630790993080310184)                                         
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.

I used this GCP instance:

gcloud compute instances create bert --zone=<zone> --machine-type=n1-standard-2 \
--image-project=ml-images --image-family=tf-1-15 --scopes=cloud-platform

and created the TPU with:

gcloud compute tpus create bert-4 --zone=<zone> --accelerator-type=v3-8 \
--network=default --range=192.168.4.0/29 --version=1.15

Pretty much the same configuration as I used for training my BERT models, like for Turkish, as documented in this cheatsheet.

Do you have any idea what causes this error message? Would be awesome to train ELECTRA models on TPU 🤗

Many thanks in advance!

issue about segment in build_pretrain_dataset.py

first_segment = []
second_segment = []
for sentence in self._current_sentences:
  # the sentence goes to the first segment if (1) the first segment is
  # empty, (2) the sentence doesn't put the first segment over length or
  # (3) 50% of the time when it does put the first segment over length
  if (first_segment or
      len(first_segment) + len(sentence) < first_segment_target_length or
      (second_segment and
       len(first_segment) < first_segment_target_length and
       random.random() < 0.5)):
    first_segment += sentence
  else:
    second_segment += sentence

I have two questions about the snippet above:

  1. Should the conditions in the "if" branch be "not first_segment" and "not second_segment"?
  2. It seems the following could happen: given sentences A, B, C, sentences A and C end up in first_segment while B ends up in second_segment. Does the sentence order not matter?
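
Both questions can be checked with a small standalone sketch. It assumes the reading suggested in question 1 (the conditions test for an empty first and second segment, as the code comments describe); the function name `split_segments` and the `AlwaysHigh` stand-in for the random source are hypothetical and exist only to make the run deterministic.

```python
import random


def split_segments(sentences, first_segment_target_length, rng=random):
    """Assign sentences (lists of tokens) to two segments.

    Uses `not first_segment` / `not second_segment`, i.e. the reading
    proposed in question 1. This is an assumption about the intent.
    """
    first_segment, second_segment = [], []
    for sentence in sentences:
        if (not first_segment or
                len(first_segment) + len(sentence) < first_segment_target_length or
                (not second_segment and
                 len(first_segment) < first_segment_target_length and
                 rng.random() < 0.5)):
            first_segment += sentence
        else:
            second_segment += sentence
    return first_segment, second_segment


class AlwaysHigh:
    """Stand-in random source that always fails the `< 0.5` check."""

    def random(self):
        return 0.9


if __name__ == "__main__":
    a, b, c = ["A"] * 5, ["B"] * 8, ["C"] * 2
    first, second = split_segments([a, b, c], 10, rng=AlwaysHigh())
    print(first)   # tokens of A followed by tokens of C
    print(second)  # tokens of B
```

In this run B is too long for the first segment and goes to the second one, while the shorter C still fits into the first, so the scenario from question 2 (A and C together, B apart) does occur under these conditions.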

Confusion about stop_gradient in the code

Why are there two stop_gradient calls in the "mask" function of pretrain_helpers.py? See the screenshot below.
image

I can understand the stop_gradient here, but it is still followed by an argmax op, which seems strange.
image

KeyError: '[SEP]'

When running run_pretraining.py, I get this error before pre-training starts:

================================================================================
Running training

2020-04-28 04:43:55.132186: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:356] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from training_loop: '[SEP]'
Traceback (most recent call last):
  File "run_pretraining.py", line 384, in <module>
    main()
.
(lines ignored because they're not useful)
.
  File "/home/manai_elye2s/pretrain/electra/pretrain/pretrain_helpers.py", line 121, in _get_candidates_mask
    ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]
KeyError: '[SEP]'

I got this both with my own vocab and with the default one I downloaded from this repo.
Both vocab.txt files contain the [SEP], [CLS] and [MASK] tokens, without spaces.
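
One way to narrow this down is to check what the vocabulary looks like once it is loaded into a dictionary, rather than inspecting the file by eye. The sketch below is a generic diagnostic, not the repository's loader: `load_vocab` and `missing_special_tokens` are hypothetical helpers, and the handling of a byte-order mark and Windows line endings covers only two possible causes that are assumed here, not confirmed for the error above.

```python
import collections


def load_vocab(vocab_path):
    """Read one token per line into an ordered token -> id mapping."""
    vocab = collections.OrderedDict()
    # "utf-8-sig" drops a leading byte-order mark if the file has one.
    with open(vocab_path, encoding="utf-8-sig") as f:
        for index, line in enumerate(f):
            # strip() removes the newline and any stray "\r" or spaces.
            vocab[line.strip()] = index
    return vocab


def missing_special_tokens(vocab, required=("[SEP]", "[CLS]", "[MASK]")):
    """Return the required tokens that are not keys of the vocabulary."""
    return [token for token in required if token not in vocab]


if __name__ == "__main__":
    import sys

    loaded = load_vocab(sys.argv[1])  # path to a local vocab.txt
    print("vocab size:", len(loaded))
    print("missing:", missing_special_tokens(loaded))
```

If this reports nothing missing for a local copy of the file, the next thing to check would be whether the training job actually reads the same file.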

RFC: List of community provided models

Hi @clarkkev ,

I just wanted to hear your opinion on adding a new section to the main readme file, where users can link their own trained ELECTRA models - maybe something like "Community models" 🤔

E.g. I just saw ELECTRA models on the Hugging Face model hub for various languages like Malay, Indonesian or Korean. Today I released base and small models for Turkish.

What do you think?

Best,

Stefan

NaN loss during training

Thank you for releasing your code.

I succeeded in training a small model on a GPU by following Quickstart: Pre-train a small ELECTRA model, but a "NaN loss during training" error occurred when I trained a base model.

Do you have any idea what might cause this?

I use TensorFlow 1.15.0 on a Tesla V100-PCIE-32GB, and the error log is as follows:

$ python run_pretraining.py --data-dir ../electra-en-data --model-name electra_base_owt_200k --hparams '{"num_train_steps": 200000, "model_size": "base", "train_batch_size": 128}'
..
2020-04-07 09:51:45.360762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30458 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:83:00.0, compute capability: 7.0)
2020-04-07 09:52:37.813499: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
1/200000 = 0.0%, SPS: 0.0, ELAP: 21, ETA: 47 days, 14:12:08 - loss: 44.4968
2/200000 = 0.0%, SPS: 0.1, ELAP: 39, ETA: 45 days, 4:02:14 - loss: 44.3760
3/200000 = 0.0%, SPS: 0.1, ELAP: 40, ETA: 31 days, 5:01:16 - loss: 44.5174
4/200000 = 0.0%, SPS: 0.1, ELAP: 42, ETA: 24 days, 5:38:09 - loss: 44.1623
5/200000 = 0.0%, SPS: 0.1, ELAP: 43, ETA: 20 days, 1:12:59 - loss: 44.2913
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
Traceback (most recent call last):
  File "run_pretraining.py", line 385, in <module>
    main()
  File "run_pretraining.py", line 381, in main
    args.model_name, args.data_dir, **hparams))
  File "run_pretraining.py", line 344, in train_or_eval
    max_steps=config.num_train_steps)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/***/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/***/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

`num_train_steps` for further pretraining

Hello, I am trying to further pre-train the base and large models on a domain-specific corpus. The documentation says that when continuing pre-training from the released small ELECTRA checkpoint, we should:

Setting num_train_steps by (for example) adding "num_train_steps": 4010000 to the --hparams. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).

But Table 6 of the paper shows that the small ELECTRA model is trained for 1M steps. Which value should we use?

If 4e6 is correct, how many steps have the base and large models been trained for?
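
The arithmetic behind the quoted instruction is that num_train_steps is a total, so it has to include the steps the checkpoint has already seen. A minimal sketch of building the --hparams value that way follows; the helper name `continued_pretraining_hparams` is hypothetical, and the already-trained step count must be supplied by the user (4e6 is the figure quoted above for the small model; no figure for base or large is given here).

```python
import json


def continued_pretraining_hparams(steps_already_trained, extra_steps, **other):
    """Build the --hparams JSON string for continuing pre-training.

    num_train_steps is the total step count, so it is the sum of the steps
    already trained and the additional steps wanted.
    """
    hparams = {"num_train_steps": steps_already_trained + extra_steps}
    hparams.update(other)
    return json.dumps(hparams)


if __name__ == "__main__":
    # 4000000 is the value quoted from the documentation for the small model.
    print(continued_pretraining_hparams(4000000, 10000, model_size="small"))
```

If the true number of steps for a checkpoint is lower than the value passed in, training would run for more additional steps than intended, and the reverse if it is higher.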

multi-task training

What if I want to fine-tune ELECTRA on both classification and sequence tagging at the same time?
What should I do?

Continue pretraining on custom dataset

First of all, very nice work! I love the idea of using a discriminator, and that current model architectures turn out to be even more capable when given a more difficult pre-training task.

Unfortunately, I have to bother you with a problem. I was planning to continue pre-training the models on domain-specific data for some experiments, but the necessary data is not in the released models.

raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.NotFoundError: From /job:worker/replica:0/task:0: Key discriminator_predictions/dense/bias/adam_m not found in checkpoint [[node save/RestoreV2 (defined at tensorflow_core/python/framework/ops.py:1748) ]]

(unrelated)
A 60k test run on the cord19 dataset with eval:
https://tensorboard.dev/experiment/QwOMXKluQJKcn8D9i9p69A/#scalars

The implementation of layerwise learning rate decay

for layer in range(n_layers):
  key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
return {
    key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
    for key, depth in key_to_depths.items()
}

According to the code here, assume that n_layers=24. Then key_to_depths["encoder/layer_23/"] = 24, which is the depth of the last encoder layer, but the learning rate for the last layer is:
learning_rate * (layer_decay ** (24 + 2 - 24)) = learning_rate * (layer_decay ** 2).

That's what confuses me. Why is the learning rate for the last layer learning_rate * (layer_decay ** 2) rather than learning_rate? Am I missing something?
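
The exponents can be tabulated with a small standalone version of the quoted formula. In this sketch the `extra_depths` argument and the `head/` key are hypothetical: they stand in for whatever entries key_to_depths holds before the loop, which the snippet does not show. The learning rate 1e-4 and decay 0.8 are illustrative values only.

```python
def layer_learning_rates(learning_rate, layer_decay, n_layers, extra_depths=None):
    """Apply the quoted formula to every key in key_to_depths."""
    # extra_depths is a hypothetical stand-in for entries set before the loop.
    key_to_depths = dict(extra_depths or {})
    for layer in range(n_layers):
        key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
    return {
        key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
        for key, depth in key_to_depths.items()
    }


if __name__ == "__main__":
    n_layers = 24
    # "head/" at depth n_layers + 2 is an assumed entry, for illustration only.
    lrs = layer_learning_rates(1e-4, 0.8, n_layers, extra_depths={"head/": n_layers + 2})
    print(lrs["encoder/layer_0/"])   # exponent 25
    print(lrs["encoder/layer_23/"])  # exponent 2
    print(lrs["head/"])              # exponent 0, i.e. the undecayed rate
```

Under the formula, only a key whose depth equals n_layers + 2 receives the undecayed learning_rate; whether such a key exists in the actual code is not visible from the snippet.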

How to get the embedding vector or matrix after pre-training

Hi,
Following the commands, I pre-trained electra-small on my dataset. After pre-training, I want to use the learned embeddings in some other, more complicated downstream tasks. Could you please help me with how to extract the word embeddings after pre-training?

Model size conflict

There seems to be a conflict regarding the generator sizes. According to the latest version of the paper, the generator sizes of ELECTRA small/base/large are 1/4, 1/3 and 1/4. However, the provided pre-trained weights contradict that, with generator sizes of 1, 1/3 and 1/4 respectively.

Problem with ELECTRA's pre-training method

ELECTRA contains a generator and a discriminator: the generator replaces [MASK] tokens with plausible alternatives, and the discriminator has to figure out which tokens were replaced.

As stated in the paper,

  1. Typically k = ⌈0.15n⌉, i.e., 15% of the tokens are masked out

  2. if the generator happens to generate the correct token, that token is considered “real” instead of “fake”

Thus only (1 - generator_inference_acc) * 0.15 of the tokens are fake, which leads to an extremely unbalanced binary classification task for the discriminator. Yet it works. Why?
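
To put numbers on the imbalance, the expression above can be evaluated for a few generator accuracies. The accuracies in this sketch are illustrative values, not measurements, and `replaced_fraction` is a hypothetical helper name.

```python
def replaced_fraction(mask_prob, generator_accuracy):
    """Fraction of all tokens that end up labelled as replaced ("fake").

    A masked token counts as fake only when the generator's sample differs
    from the original token.
    """
    return mask_prob * (1.0 - generator_accuracy)


if __name__ == "__main__":
    for accuracy in (0.0, 0.5, 0.8):  # illustrative values only
        fraction = replaced_fraction(0.15, accuracy)
        print("generator accuracy %.1f: %.1f%% of tokens are fake"
              % (accuracy, 100 * fraction))
```

Even with a generator that never guesses correctly, at most 15% of the positions carry the "fake" label, so the negative class dominates in every case.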

Deal with the duplicated positions in generator

Here, the corrupted tokens are produced by the generator as fake data. I can understand why duplicated positions have to be handled so that an update is applied only once. However, I am confused by the implementation below, which takes the average of the corrupted token ids at duplicated positions. What is the intuition behind it?

if sequence.dtype == tf.float32:
  updates_mask_3d = tf.cast(updates_mask_3d, tf.float32)
  updates /= tf.maximum(1.0, updates_mask_3d)
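
What the division does numerically can be seen in a one-dimensional NumPy analogue. This is a sketch under an assumption: that updates_mask_3d holds the number of updates scattered to each position (its construction is not shown in the snippet). The function name `scatter_average` is hypothetical.

```python
import numpy as np


def scatter_average(length, positions, values):
    """Scatter values into a zero vector, averaging values that share a position."""
    positions = np.asarray(positions)
    values = np.asarray(values, dtype=np.float32)
    updates = np.zeros(length, dtype=np.float32)
    counts = np.zeros(length, dtype=np.float32)
    np.add.at(updates, positions, values)  # duplicate positions are summed
    np.add.at(counts, positions, 1.0)      # number of updates per position
    # Dividing by max(1, count) turns the sums into means and leaves
    # untouched positions (count 0) at zero without dividing by zero.
    return updates / np.maximum(1.0, counts)


if __name__ == "__main__":
    # Position 2 appears twice, so its two updates (4.0 and 6.0) are averaged.
    print(scatter_average(5, [1, 2, 2], [3.0, 4.0, 6.0]))
```

Without the division, a position that was hit twice would hold the sum of both updates, which for float-valued sequences would be on a different scale from positions hit once.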

Issue with loading weights for eval

I'm trying to validate that I am running experiments on a fine-tuned ELECTRA model correctly when using only tf.train.init_from_checkpoint() to load the weights. When evaluating my fine-tuned ELECTRA model with run_finetuning.py, I get different accuracy results if the model_dir passed to run_config = tf.estimator.tpu.RunConfig(...) is not the directory containing the fine-tuned model.

With the original code, I get the expected accuracy (~81), but when I change the model_dir argument in tf.estimator.tpu.RunConfig() to an empty directory, I get a much lower and non-deterministic accuracy (~32). I was wondering why that is, since the weights are still being loaded using tf.train.init_from_checkpoint(). Are there variables being loaded via tf.estimator.tpu.RunConfig()?

Pre-trained SMALL model cannot be loaded

Hi guys!
I tried to evaluate the pre-trained ELECTRA-small model on OpenWebText, but it cannot be loaded with the default configuration.

(electra) judith@judith-dev:~/workspace/ELECTRA/electra$ python3 run_pretraining.py --data-dir /home/judith/workspace/ELECTRA/data_dir --model-name electra_small --hparams '{"do_train": false, "do_eval": true}'
(...)
ERROR:tensorflow:Error recorded from evaluation_loop: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [64] rhs shape= [256]
         [[node save/Assign_275 (defined at /home/judith/anaconda3/envs/electra/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [64] rhs shape= [256]
         [[node save/Assign_275 (defined at /home/judith/anaconda3/envs/electra/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[save/RestoreV2/_460]]
0 successful operations.
0 derived errors ignored.

However, I accidentally found that if I change generator_hidden_size (which is actually the generator's hidden size as a fraction of the discriminator's hidden size) from 0.25 to 1, which changes the generator hidden size from 64 to 256, it runs successfully.

1/100 = 1.0%, SPS: 0.5, ELAP: 2, ETA: 3:32 - loss: 11.9305
2/100 = 2.0%, SPS: 0.6, ELAP: 3, ETA: 2:42 - loss: 11.8205
3/100 = 3.0%, SPS: 0.7, ELAP: 4, ETA: 2:25 - loss: 11.7672
4/100 = 4.0%, SPS: 0.7, ELAP: 6, ETA: 2:16 - loss: 11.7212
5/100 = 5.0%, SPS: 0.7, ELAP: 7, ETA: 2:10 - loss: 11.7709
(...)

Was that intended or just a mistake? I would be very glad to hear your opinion.
Thanks in advance!
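
The two shapes in the error message follow directly from the fraction described above. A minimal sketch of that arithmetic, assuming a discriminator hidden size of 256 for the small model (implied by the 64 and 256 in the log); the helper name and the rounding are assumptions, not the repository's code:

```python
def generator_hidden_size(discriminator_hidden_size, fraction):
    """Generator width when given as a fraction of the discriminator width."""
    # Rounding to the nearest integer is an assumption made for this sketch.
    return int(round(discriminator_hidden_size * fraction))


if __name__ == "__main__":
    print(generator_hidden_size(256, 0.25))  # width built by the default config
    print(generator_hidden_size(256, 1.0))   # width that matches the checkpoint
```

This matches "lhs shape= [64] rhs shape= [256]" in the log: the graph built from the default configuration expects 64, while the checkpoint holds 256.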

Format of corpus

According to the paper, ELECTRA does not involve NSP (next sentence prediction) task. In that case, do we need sentence segmentation?
Does build_pretraining_dataset.py consider each line as a separate sentence? Or can we just feed raw text (with empty lines as separators for documents) ?

Load model in Pytorch.

Hi! Thanks for making the source code available and for the great paper.

Are there any plans to support loading the models in PyTorch, or an implementation in Hugging Face's transformers?

Low usage of gpu

I tried pre-training the small, base and large models on a single 2080 Ti GPU, but GPU utilization seems very low. Am I doing something wrong, or is GPU utilization actually expected to be low?

Auto loading in huggingface Transformers is broken

When I try to load the model following the instructions on huggingface.co/models, i.e.:

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")

I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-bb330c08e050> in <module>
----> 1 tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")

/opt/conda/lib/python3.6/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    179         config = kwargs.pop("config", None)
    180         if not isinstance(config, PretrainedConfig):
--> 181             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    182 
    183         if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.6/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    185 
    186         if "model_type" in config_dict:
--> 187             config_class = CONFIG_MAPPING[config_dict["model_type"]]
    188             return config_class.from_dict(config_dict, **kwargs)
    189         else:

KeyError: 'electra'

The version of transformers is 2.7.0. I reproduced the problem in colab here.

Should dynamic masking also ignore ['PAD']

Here is a set of tokens that should not be masked during dynamic masking.

ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]

But should we also avoid masking all those ['PAD'] at the end of a sentence (if the sentence is shorter than max_seq_length and if there is no second sentence segment)?

I understand ['PAD'] itself has token_id = 0, but I do not see this being used to prevent masking in downstream steps. If we do not ignore it, this will affect the probability calculation here

# Get a probability of masking each position in the sequence
candidate_mask_float = tf.cast(candidates_mask, tf.float32)
sample_prob = (proposal_distribution * candidate_mask_float)
sample_prob /= tf.reduce_sum(sample_prob, axis=-1, keepdims=True)

Also, we will be trying to predict 'PAD' that is outside a sequence, which is a bit unintuitive.

Maybe I am missing something here. Thanks again for putting up such great work!
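
The effect on the probabilities can be reproduced with a NumPy version of the three quoted lines. This is a sketch under stated assumptions: the proposal distribution is taken to be uniform, and the token ids (0 for padding, 101/102/103 for [CLS]/[SEP]/[MASK]) are hypothetical values chosen for the example.

```python
import numpy as np


def masking_probabilities(token_ids, ignore_ids):
    """Per-position masking probability, mirroring the quoted computation."""
    token_ids = np.asarray(token_ids)
    candidates_mask = ~np.isin(token_ids, list(ignore_ids))
    # A uniform proposal distribution is an assumption for this sketch.
    proposal_distribution = np.ones(token_ids.shape, dtype=np.float32)
    candidate_mask_float = candidates_mask.astype(np.float32)
    sample_prob = proposal_distribution * candidate_mask_float
    sample_prob /= sample_prob.sum(axis=-1, keepdims=True)
    return sample_prob


if __name__ == "__main__":
    PAD, CLS, SEP, MASK = 0, 101, 102, 103  # hypothetical ids
    sequence = [CLS, 7, 8, 9, SEP, PAD, PAD, PAD]
    print(masking_probabilities(sequence, [SEP, CLS, MASK]))       # padding shares the mass
    print(masking_probabilities(sequence, [SEP, CLS, MASK, PAD]))  # padding excluded
```

In the first call the three padding positions take half of the probability mass away from the three real tokens; in the second call all of it stays on the real tokens.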

Init disc/generator from pre-trained BERT

Hi there,

in the paper I noticed that you tried training the generator ahead of the discriminator, then initializing the discriminator from the generator weights and train only the discriminator subsequently.

I was wondering, if you also tried the same procedure but instead of training the generator you just initialized both generator and discriminator from a fully MLM pre-trained BERT (e.g. BERT-Large).

Sorry if something was mentioned in the paper and I missed it.

Cheers

TPU training

Hello, I was wondering whether it is possible to train an ELECTRA model on Cloud TPUs, since you didn't mention TPUs in your research paper. While following this tutorial on BERT pre-training and fine-tuning on Google Cloud TPUs, I ran into some trouble working in a TPU environment. Here is the link to the BERT TPU tutorial: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb

The Colab Notebook I've created from the aforementioned one is here:
https://colab.research.google.com/drive/1ZG8xwXSJm8iQe19M4lACUcCetjf-eZii

The last cell was intended to run the training script with the use_tpu option flag specified. However, it did not succeed.

So, am I doing anything wrong, or is this a bug on your side?

Thanks for your help!

Definition of Loss

During training, I see that only one loss is reported. Is it the training loss or the validation loss?

Token-masking method: whole words or sub-words?

Hi, congrats on the paper. I really like the idea. I was wondering what your approach to masking tokens is. Do you mask individual tokens independently, regardless of whether they are part of a multi-token word, or do you mask all the tokens of a given word?

Let's say that we have this tokenized sentence and we want to mask shareholder:

<s> ▁Meanwhile , ▁share hold er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>
  1. Independent masking: shareholder consists of 3 tokens and you allow for one of them to be masked, without masking the other 2.
<s> ▁Meanwhile , ▁share <mask> er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>
  2. Whole word masking: all tokens of shareholder have to be masked.
<s> ▁Meanwhile , <mask> <mask> <mask> ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>

Which one is it? Or do you have a different approach?
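
To make the two options concrete, here is a minimal sketch in Python. It assumes SentencePiece-style tokens where a leading `▁` marks the start of a word. The names `word_spans`, `mask_independent` and `mask_whole_word` are made up for illustration and are not taken from the ELECTRA code base, so this says nothing about which strategy ELECTRA actually uses.

```python
MASK = "<mask>"
SPECIAL = {"<s>", "</s>"}

def word_spans(tokens):
    """Group token indices into words.

    A new word starts at a token with a leading '▁', at a special token,
    or right after a special token. Pieces without '▁' (including
    punctuation) attach to the preceding word in this simple grouping.
    """
    spans = []
    for i, tok in enumerate(tokens):
        starts_word = tok.startswith("▁") or tok in SPECIAL
        prev_is_special = i > 0 and tokens[i - 1] in SPECIAL
        if starts_word or prev_is_special or not spans:
            spans.append([i])
        else:
            spans[-1].append(i)
    return spans

def mask_independent(tokens, index):
    """Mask only the token at `index`, leaving the rest of its word visible."""
    out = list(tokens)
    out[index] = MASK
    return out

def mask_whole_word(tokens, index):
    """Mask every token of the word that contains `index`."""
    out = list(tokens)
    for span in word_spans(tokens):
        if index in span:
            for j in span:
                out[j] = MASK
    return out

tokens = ["<s>", "▁Meanwhile", ",", "▁share", "hold", "er", "▁funds", "</s>"]
print(mask_independent(tokens, 4))  # only "hold" is replaced
print(mask_whole_word(tokens, 4))   # "▁share", "hold" and "er" are all replaced
```

Picking index 4 (`hold`) reproduces the two cases above on a shortened version of the sentence.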

SQuAD2 Score ELECTRA-Base

Hi there,

great work on reducing training times!

I was wondering if the SQuAD 2.0 result reported for ELECTRA-Base in this repo is correct: 83.7 EM. This diverges strongly from the result reported in the paper (80.5 EM).

Or is the reported score here meant to be F1?

Cheers

'adam_m not found in checkpoint ' when further pretraining

When I was trying further pretraining on the models with domain-specific data in Colab, I encountered a problem that the official pretrained model could not be loaded.

Here is the command for further pretraining.

hparam =    '{"model_size": "small", \
             "use_tpu":true, \
             "num_tpu_cores":8, \
             "tpu_name":"grpc://10.53.161.26:8470", \
             "num_train_steps":4000100,\
             "pretrain_tfrecords":"gs://tweet_torch/electra/electra/data/pretrain_tf_records/pretrain_data.tfrecord*", \
             "model_dir":"gs://tweet_torch/electra/electra/data/electra_small/", \
             "generator_hidden_size":1.0\
            }'
!python electra/run_pretraining.py  \
                    --data-dir "gs://tweet_torch/electra/electra/data/" \
                    --model-name "electra_small" \
                    --hparams '{hparam}'

The error message is pretty long, so I am pasting only the part that seems useful.

ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
From /job:worker/replica:0/task:0:
Key discriminator_predictions/dense/bias/adam_m not found in checkpoint
	 [[node save/RestoreV2 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
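
The missing key ends in `adam_m`, which looks like an optimizer slot rather than a model weight. One possible reading, which I have not confirmed, is that the released checkpoint only stores model weights while the training graph also asks for optimizer state. A minimal sketch for separating the two kinds of missing keys, assuming slot names end in `/adam_m` or `/adam_v` (the first appears in the error above, the second is a guess by analogy). `missing_from_checkpoint` is a hypothetical helper, not part of ELECTRA or TensorFlow, and the name lists are made-up examples.

```python
# "/adam_m" comes from the error message; "/adam_v" is assumed by analogy.
OPTIMIZER_SUFFIXES = ("/adam_m", "/adam_v")

def missing_from_checkpoint(graph_names, checkpoint_names):
    """Return (missing optimizer slots, missing model weights), each sorted."""
    available = set(checkpoint_names)
    missing = sorted(n for n in graph_names if n not in available)
    slots = [n for n in missing if n.endswith(OPTIMIZER_SUFFIXES)]
    weights = [n for n in missing if not n.endswith(OPTIMIZER_SUFFIXES)]
    return slots, weights

# Illustrative names only; a real run would list far more variables.
graph_names = [
    "discriminator_predictions/dense/bias",
    "discriminator_predictions/dense/bias/adam_m",
    "discriminator_predictions/dense/bias/adam_v",
]
checkpoint_names = ["discriminator_predictions/dense/bias"]

slots, weights = missing_from_checkpoint(graph_names, checkpoint_names)
print("missing optimizer slots:", slots)
print("missing model weights:", weights)
```

The checkpoint side of the comparison could be obtained with `tf.train.list_variables(checkpoint_path)`. If only optimizer slots turn out to be missing, the model weights themselves are probably intact.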

Issue while generating pre training data

I am generating pre-training data for Hindi using a SentencePiece vocab, and I am getting the following error.

python build_pretraining_dataset.py --corpus-dir data --vocab-file spiece.vocab --output-dir out --max-seq-length 128 --num-processes 1
Job 0: Creating example writer
Job 0: Writing tf examples
Traceback (most recent call last):
  File "build_pretraining_dataset.py", line 230, in <module>
    main()
  File "build_pretraining_dataset.py", line 218, in main
    write_examples(0, args)
  File "build_pretraining_dataset.py", line 190, in write_examples
    example_writer.write_examples(os.path.join(args.corpus_dir, fname))
  File "build_pretraining_dataset.py", line 143, in write_examples
    example = self._example_builder.add_line(line)
  File "build_pretraining_dataset.py", line 50, in add_line
    bert_tokids = self._tokenizer.convert_tokens_to_ids(bert_tokens)
  File "/home/gamut/Downloads/electra-master/model/tokenization.py", line 130, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/home/gamut/Downloads/electra-master/model/tokenization.py", line 91, in convert_by_vocab
    output.append(vocab[item])
KeyError: '[UNK]'

I found that this kind of error has this solution. However, since the script only accepts a vocab file and not a SentencePiece model, generating pre-training data from a SentencePiece vocab is a problem. Is there any solution?
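
The traceback shows a plain dictionary lookup of `'[UNK]'` failing, so the vocab file apparently has no `[UNK]` entry. A minimal sketch of a pre-flight check, assuming the tokenizer expects BERT-style special tokens (the `REQUIRED` list is an assumption, only `[UNK]` is confirmed by the error) and that the SentencePiece `.vocab` file has one `piece<TAB>score` entry per line. The helper names and the sample lines are made up for illustration.

```python
# Assumed BERT-style special tokens; only "[UNK]" is confirmed by the traceback.
REQUIRED = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")

def load_vocab_lines(lines):
    """Map each token (first tab-separated field of a line) to its line index."""
    vocab = {}
    for idx, line in enumerate(lines):
        token = line.rstrip("\n").split("\t")[0]
        vocab[token] = idx
    return vocab

def missing_special_tokens(vocab, required=REQUIRED):
    """List the required special tokens that the vocab does not contain."""
    return [tok for tok in required if tok not in vocab]

# Made-up sample in the SentencePiece .vocab layout (piece, tab, score).
spiece_lines = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "▁the\t-3.2\n"]
vocab = load_vocab_lines(spiece_lines)
print(missing_special_tokens(vocab))
```

A SentencePiece vocab names its unknown token `<unk>`, so a check like this reports `[UNK]` as missing before the dataset builder ever runs.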

freeze discriminator and train generator only

Hello,

I was wondering about the possibility of making the discriminator untrainable for the first n steps, to give the generator time to learn how to produce meaningful tokens before feeding them to the discriminator.
Since the training of the generator and the discriminator is not adversarial (as far as I understand from the paper), this should be possible.
Is this option already implemented, and am I missing some flag?
Would it be difficult to implement? Maybe there is a workaround with the existing parameters?

Moreover, I see that the global loss, which is a linear combination of the generator and discriminator losses,
is by default heavily weighted towards the discriminator (with a relative weight of 50). This of course makes sense, since the "real" model is the discriminator.
Do you believe that re-weighting the losses could improve generator training?
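
To make the idea concrete, here is a minimal sketch of a step-dependent weighting, in plain Python rather than TensorFlow. The discriminator weight of 50 is the default mentioned above; `gen_weight=1.0` and the `disc_start_step` parameter are assumptions introduced for illustration, and `combined_loss` is a hypothetical function, not something that exists in the repository.

```python
def combined_loss(gen_loss, disc_loss, step,
                  gen_weight=1.0, disc_weight=50.0, disc_start_step=0):
    """Weighted sum of the two losses.

    Before `disc_start_step` (hypothetical parameter) the discriminator
    term is switched off, so only the generator loss drives the update.
    """
    active_disc_weight = disc_weight if step >= disc_start_step else 0.0
    return gen_weight * gen_loss + active_disc_weight * disc_loss

# Generator-only phase for the first 100 steps, then the default weighting.
print(combined_loss(2.0, 0.1, step=0, disc_start_step=100))
print(combined_loss(2.0, 0.1, step=100, disc_start_step=100))
```

Zeroing the loss term is not quite the same as freezing the discriminator: weight decay, or any parameters shared with the generator, could still change its weights during the first phase.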

Thank you very much for the answers.

eval pretrained model

It seems that run_pretraining.py, when launched with do_train set to False and do_eval set to True,
removes the input model (the one to be evaluated) specified by --model-name, leaving only the evaluation folder containing the discriminator and generator scores.

Is the result based on dev or test set?

In your paper you sometimes report results on the dev set and sometimes on both the dev and test sets. On GitHub, however, you do not mention which set the results are based on. Could you clarify, or update the paper? Thanks.

Besides, how many training FLOPs were used for the results reported on GitHub? Thanks!

num_eval_steps

It is not clear what num_eval_steps is, or what the loss printed when evaluating the
pretrained model represents.

Is it just the number of batches used for evaluation?
Is the printed loss computed on the current batch, so that the final value is the average over the selected batches?
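
In other words, the reading I have in mind is the following, written as a minimal sketch. I have not confirmed that the script behaves this way; `average_eval_loss` and the loss values are made up for illustration.

```python
def average_eval_loss(batch_losses, num_eval_steps):
    """Mean loss over the first `num_eval_steps` evaluation batches."""
    used = batch_losses[:num_eval_steps]
    if not used:
        raise ValueError("need at least one evaluation batch")
    return sum(used) / len(used)

# Made-up per-batch losses; only the first three are used here.
batch_losses = [1.0, 2.0, 3.0, 10.0]
print(average_eval_loss(batch_losses, num_eval_steps=3))
```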

Thanks for the answers!

Question about fine-tuning on squad dataset

I downloaded this model and tried to use it.
I chose the SQuAD 2.0 dataset for fine-tuning.
When I tried to fine-tune the model from the command line,
the program just stopped and returned to the prompt.
The output looks like this:


(env_tf115) D:\python_code\NLP\electra>python run_finetuning.py --data-dir "D:\python_code\NLP\electra\datadir" --model-name electra_small --hparams {"model_size": "small", "task_names": ["squad"], "num_trials": 2}
2020-03-26 22:00:19.349133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
{"model_size": "small", "task_names": ["squad"], "num_trials": 2}
================================================================================
Config: model=electra_small, trial 1/2
================================================================================
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir D:\python_code\NLP\electra\datadir
debug False
do_eval True
do_lower_case True
do_train True
doc_stride 128
double_unordered True
embedding_size 128
eval_batch_size 32
gcp_project None
init_checkpoint D:\python_code\NLP\electra\datadir\models\electra_small
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.8
learning_rate 0.0001
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 512
model_dir D:\python_code\NLP\electra\datadir\models\electra_small\finetuning_models\squad_model
model_hparam_overrides {}
model_name electra_small
model_size small
n_best_size 20
n_writes_test 5
num_tpu_cores 1
num_train_epochs 2.0
num_trials 2
predict_batch_size 32
preprocessed_data_dir D:\python_code\NLP\electra\datadir\models\electra_small\finetuning_tfrecords\squad_tfrecords
qa_eval_file <built-in method format of str object at 0x0000027494EB70B0>
qa_na_file <built-in method format of str object at 0x0000027494EB1AE0>
qa_na_threshold -2.75
qa_preds_file <built-in method format of str object at 0x0000027494EB70F0>
raw_data_dir <built-in method format of str object at 0x0000027494EAECA8>
results_pkl D:\python_code\NLP\electra\datadir\models\electra_small\results\squad_results.pkl
results_txt D:\python_code\NLP\electra\datadir\models\electra_small\results\squad_results.txt
save_checkpoints_steps 1000000
task_names ['squad']
test_predictions <built-in method format of str object at 0x0000027494EAD8F0>
tpu_job_name None
tpu_name None
tpu_zone None
train_batch_size 32
use_tfrecords_if_existing True
use_tpu False
vocab_file D:\python_code\NLP\electra\datadir\models\electra_small\vocab.txt
vocab_size 30522
warmup_proportion 0.1
weight_decay_rate 0.01
write_distill_outputs False
write_test_outputs False

Loading dataset squad_train
Existing tfrecords not found so creating

(env_tf115) D:\python_code\NLP\electra>

My computer setting:
Windows 10
Cuda 10.0.130
Cudnn 7.6.3
Tensorflow 1.15 GPU

I've put the pre-trained model file under "datadir\models\electra_small" directory
and squad dataset under "datadir\finetuning_data\squad" directory.
Does anyone know why the fine-tuning is not working?

Training loss

Hello,

I was wondering whether it is possible to add some loss metrics to the training loop. The only thing I see while training an ELECTRA model is

1275000/3000000 = 42.5%, SPS: 3.1, ELAP: 9:24:02, ETA: 6 days, 11:55:19

which says nothing about how well training is going. I'm trying to add some code to the estimator, but it seems to me that it would be much easier if all the metrics were shown, so one can see how well the model is doing at this stage.

I'm training a non-English model, so I would like better insight into how my model is performing at the moment.

Thanks

Bert vs Electra performances

Hello,

I am trying to compare the pre-training time performance of BERT and ELECTRA.
According to the paper, at fixed FLOPs (i.e. fixed time) ELECTRA performs best on downstream fine-tuning tasks.

I am struggling a bit with the comparison, since the pre-training tasks of the two models are quite different and I am not sure they can be compared directly by keeping the same hyper-parameters.

For instance:
max_seq_len in BERT seems to be the total length of a pair of sentences, while
in ELECTRA it covers just one sentence.
This would suggest halving the length for ELECTRA (or doubling it for BERT), am I right?
On the other hand, if I do so, the number of masks (or replacements in ELECTRA) would be smaller, since it scales as masked_lm_prob*max_seq_len.
Are there other parameters I should take into account for this comparison?
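
The scaling I am worried about can be written out as a small sketch. The masking probability of 0.15 and the batch sizes are assumed values for illustration, and `masked_positions_per_step` is a hypothetical helper, not a function from either code base.

```python
def masked_positions_per_step(batch_size, max_seq_len, masked_lm_prob=0.15):
    """Expected number of masked (or replaced) positions in one training step."""
    return batch_size * max_seq_len * masked_lm_prob

full = masked_positions_per_step(batch_size=32, max_seq_len=512)
halved = masked_positions_per_step(batch_size=32, max_seq_len=256)
rebalanced = masked_positions_per_step(batch_size=64, max_seq_len=256)
print(full, halved, rebalanced)
```

Halving the sequence length at a fixed batch size halves the number of prediction targets per step, while doubling the batch size at the same time restores it. Whether that makes the two runs comparable in FLOPs is a separate question.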

I am keeping the number of steps the same throughout, so I would expect better performance on fine-tuning tasks after ELECTRA pre-training.

As usual, any comment or help is really appreciated!

Loss of base and large models

Hi,

I'm currently working on a new non-English ELECTRA model. Training on GPU seems to work and is running fine 🤗

The next step would be to try training on a TPU, so I would just like to ask whether you could post the final loss of both the base and large models (or even share the training loss curve), so that we have a reference point when training our own models 🤔

Many thanks in advance,

Stefan
