google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

License: Apache License 2.0

Language: Python 100.00%
Topics: deep-learning, nlp, tensorflow

electra's Introduction

ELECTRA

Introduction

ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU. At large scale, ELECTRA achieves state-of-the-art results on the SQuAD 2.0 dataset.

For a detailed description and experimental results, please refer to our ICLR 2020 paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

This repository contains code to pre-train ELECTRA, including small ELECTRA models on a single GPU. It also supports fine-tuning ELECTRA on downstream tasks including classification tasks (e.g., GLUE), QA tasks (e.g., SQuAD), and sequence tagging tasks (e.g., text chunking).

This repository also contains code for Electric, a version of ELECTRA inspired by energy-based models. Electric provides a more principled view of ELECTRA as a "negative sampling" cloze model. It can also efficiently produce pseudo-likelihood scores for text, which can be used to re-rank the outputs of speech recognition or machine translation systems. For details on Electric, please refer to our EMNLP 2020 paper Pre-Training Transformers as Energy-Based Cloze Models.

Released Models

We are initially releasing three pre-trained models:

Model         | Layers | Hidden Size | Params | GLUE score (test set) | Download
ELECTRA-Small | 12     | 256         | 14M    | 77.4                  | link
ELECTRA-Base  | 12     | 768         | 110M   | 82.7                  | link
ELECTRA-Large | 24     | 1024        | 335M   | 85.2                  | link

The models were trained on uncased English text. They correspond to ELECTRA-Small++, ELECTRA-Base++, ELECTRA-1.75M in our paper. We hope to release other models, such as multilingual models, in the future.

On GLUE, ELECTRA-Large scores slightly better than ALBERT/XLNet, ELECTRA-Base scores better than BERT-Large, and ELECTRA-Small scores slightly worse than TinyBERT (but uses no distillation). See the expected results section below for detailed performance numbers.

Requirements

Python 3 and TensorFlow 1.15.

Pre-training

Use build_pretraining_dataset.py to create a pre-training dataset from a dump of raw text. It has the following arguments:

  • --corpus-dir: A directory containing raw text files to turn into ELECTRA examples. A text file can contain multiple documents with empty lines separating them.
  • --vocab-file: File defining the wordpiece vocabulary.
  • --output-dir: Where to write out ELECTRA examples.
  • --max-seq-length: The number of tokens per example (128 by default).
  • --num-processes: If >1 parallelize across multiple processes (1 by default).
  • --blanks-separate-docs: Whether blank lines indicate document boundaries (True by default).
  • --do-lower-case/--no-lower-case: Whether to lower case the input text (True by default).
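
For example, a typical invocation might look like the following (the raw-text directory name is only illustrative; adjust the paths to your setup):

python3 build_pretraining_dataset.py --corpus-dir $DATA_DIR/raw_text --vocab-file $DATA_DIR/vocab.txt --output-dir $DATA_DIR/pretrain_tfrecords --max-seq-length 128 --num-processes 4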

Use run_pretraining.py to pre-train an ELECTRA model. It has the following arguments:

  • --data-dir: a directory where pre-training data, model weights, etc. are stored. By default, the training loads examples from <data-dir>/pretrain_tfrecords and a vocabulary from <data-dir>/vocab.txt.
  • --model-name: a name for the model being trained. Model weights will be saved in <data-dir>/models/<model-name> by default.
  • --hparams (optional): a JSON dict or path to a JSON file containing model hyperparameters, data paths, etc. See configure_pretraining.py for the supported hyperparameters.
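
For example, assuming the default data layout above, a pre-training run might be launched as follows (the model name and hyperparameter values are only illustrative):

python3 run_pretraining.py --data-dir $DATA_DIR --model-name my_electra --hparams '{"model_size": "small", "num_train_steps": 1000000}'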

If training is halted, re-running run_pretraining.py with the same arguments will continue the training where it left off.

You can continue pre-training from the released ELECTRA checkpoints by:

  1. Setting the model-name to point to a downloaded model (e.g., --model-name electra_small if you downloaded weights to $DATA_DIR/electra_small).
  2. Setting num_train_steps by (for example) adding "num_train_steps": 4010000 to the --hparams. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).
  3. Increasing the learning rate to account for the linear learning rate decay. For example, to start with a learning rate of 2e-4 you should set the learning_rate hparam to 2e-4 * (4e6 + 10000) / 10000.
  4. For ELECTRA-Small, additionally specifying "generator_hidden_size": 1.0 in the hparams, because we did not use a small generator for that model.
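
Putting these steps together, continuing pre-training of the released small model for 10,000 more steps might look roughly like the following; the learning rate comes from 2e-4 * (4e6 + 10000) / 10000 = 8.02e-2, and the exact values should be treated as an illustration only:

python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small --hparams '{"num_train_steps": 4010000, "learning_rate": 8.02e-2, "generator_hidden_size": 1.0}'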

Quickstart: Pre-train a small ELECTRA model.

These instructions pre-train a small ELECTRA model (12 layers, 256 hidden size). Unfortunately, the data we used in the paper is not publicly available, so we will use the OpenWebTextCorpus released by Aaron Gokaslan and Vanya Cohen instead. The fully-trained model (~4 days on a v100 GPU) should perform roughly in between GPT and BERT-Base in terms of GLUE performance. By default the model is trained on length-128 sequences, so it is not suitable for running on question answering. See the "expected results" section below for more details on model performance.

Setup

  1. Place a vocabulary file in $DATA_DIR/vocab.txt. Our ELECTRA models all used the exact same vocabulary as English uncased BERT, which you can download here.
  2. Download the OpenWebText corpus (12G) and extract it (i.e., run tar xf openwebtext.tar.xz). Place it in $DATA_DIR/openwebtext.
  3. Run python3 build_openwebtext_pretraining_dataset.py --data-dir $DATA_DIR --num-processes 5. It pre-processes/tokenizes the data and outputs examples as tfrecord files under $DATA_DIR/pretrain_tfrecords. The tfrecords require roughly 30G of disk space.

Pre-training the model.

Run python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt to train a small ELECTRA model for 1 million steps on the data. This takes slightly over 4 days on a Tesla V100 GPU. However, the model should achieve decent results after 200k steps (10 hours of training on the v100 GPU).

To customize the training, add --hparams '{"hparam1": value1, "hparam2": value2, ...}' to the run command. --hparams can also be a path to a .json file containing the hyperparameters. Some particularly useful options:

  • "debug": true trains a tiny ELECTRA model for a few steps.
  • "model_size": one of "small", "base", or "large": determines the size of the model
  • "electra_objective": false trains a model with masked language modeling instead of replaced token detection (essentially BERT with dynamic masking and no next-sentence prediction).
  • "num_train_steps": n controls how long the model is pre-trained for.
  • "pretrain_tfrecords": <paths> determines where the pre-training data is located. Note you need to specify the specific files not just the directory (e.g., <data-dir>/pretrain_tf_records/pretrain_data.tfrecord*)
  • "vocab_file": <path> and "vocab_size": n can be used to set a custom wordpiece vocabulary.
  • "learning_rate": lr, "train_batch_size": n, etc. can be used to change training hyperparameters
  • "model_hparam_overrides": {"hidden_size": n, "num_hidden_layers": m}, etc. can be used to changed the hyperparameters for the underlying transformer (the "model_size" flag sets the default values).

See configure_pretraining.py for the full set of supported hyperparameters.
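
For instance, a quick debug run or a small model with a custom transformer configuration could be launched along these lines (the model names and override values are arbitrary examples):

python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_debug --hparams '{"debug": true}'

python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_custom --hparams '{"model_size": "small", "model_hparam_overrides": {"hidden_size": 512, "num_hidden_layers": 8}}'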

Evaluating the pre-trained model.

To evaluate the model on a downstream task, see the fine-tuning instructions below. To evaluate the generator/discriminator on the OpenWebText data, run python3 run_pretraining.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"do_train": false, "do_eval": true}'. This will print out eval metrics such as the accuracy of the generator and discriminator, and also write the metrics out to data-dir/model-name/results.

Fine-tuning

Use run_finetuning.py to fine-tune and evaluate an ELECTRA model on a downstream NLP task. It expects three arguments:

  • --data-dir: a directory where data, model weights, etc. are stored. By default, the script loads finetuning data from <data-dir>/finetuning_data/<task-name> and a vocabulary from <data-dir>/vocab.txt.
  • --model-name: the name of the pre-trained model; the pre-trained weights should exist in <data-dir>/models/<model-name>.
  • --hparams: a JSON dict containing model hyperparameters, data paths, etc. (e.g., --hparams '{"task_names": ["rte"], "model_size": "base", "learning_rate": 1e-4, ...}'). See configure_pretraining.py for the supported hyperparameters. Instead of a dict, this can also be a path to a .json file containing the hyperparameters. You must specify the "task_names" and "model_size" (see examples below).

Eval metrics will be saved in data-dir/model-name/results and model weights will be saved in data-dir/model-name/finetuning_models by default. Evaluation is done on the dev set by default. To customize the training, add --hparams '{"hparam1": value1, "hparam2": value2, ...}' to the run command. Some particularly useful options:

  • "debug": true fine-tunes a tiny ELECTRA model for a few steps.
  • "task_names": ["task_name"]: specifies the tasks to train on. A list because the codebase nominally supports multi-task learning, (although be warned this has not been thoroughly tested).
  • "model_size": one of "small", "base", or "large": determines the size of the model; you must set this to the same size as the pre-trained model.
  • "do_train" and "do_eval": train and/or evaluate a model (both are set to true by default). For using "do_eval": true with "do_train": false, you need to specify the init_checkpoint, e.g., python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"], "do_train": false, "do_eval": true, "init_checkpoint": "<data-dir>/models/electra_base/finetuning_models/mnli_model_1"}'
  • "num_trials": n: If >1, does multiple fine-tuning/evaluation runs with different random seeds.
  • "learning_rate": lr, "train_batch_size": n, etc. can be used to change training hyperparameters.
  • "model_hparam_overrides": {"hidden_size": n, "num_hidden_layers": m}, etc. can be used to changed the hyperparameters for the underlying transformer (the "model_size" flag sets the default values).

Setup

Get a pre-trained ELECTRA model either by training your own (see the pre-training instructions above), or by downloading the released ELECTRA weights and unzipping them under $DATA_DIR/models (e.g., you should have a directory $DATA_DIR/models/electra_large if you are using the large model).

Finetune ELECTRA on a GLUE task

Download the GLUE data by running this script. Set up the data by running mv CoLA cola && mv MNLI mnli && mv MRPC mrpc && mv QNLI qnli && mv QQP qqp && mv RTE rte && mv SST-2 sst && mv STS-B sts && mv diagnostic/diagnostic.tsv mnli && mkdir -p $DATA_DIR/finetuning_data && mv * $DATA_DIR/finetuning_data.

Then run run_finetuning.py. For example, to fine-tune ELECTRA-Base on MNLI:

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["mnli"]}'

Or, to fine-tune a small model pre-trained using the above instructions on CoLA:

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_small_owt --hparams '{"model_size": "small", "task_names": ["cola"]}'

Finetune ELECTRA on question answering

The code supports SQuAD 1.1 and 2.0, as well as datasets in the 2019 MRQA shared task.

  • SQuAD 1.1: Download the train and dev datasets and move them under $DATA_DIR/finetuning_data/squadv1/(train|dev).json.
  • SQuAD 2.0: Download the datasets from the SQuAD website and move them under $DATA_DIR/finetuning_data/squad/(train|dev).json.
  • MRQA tasks: Download the data from here. Move the data to $DATA_DIR/finetuning_data/(newsqa|naturalqs|triviaqa|searchqa)/(train|dev).jsonl.

Then run (for example)

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["squad"]}'

This repository uses the official evaluation code released by the SQuAD authors and the MRQA shared task to compute metrics.

Finetune ELECTRA on sequence tagging

Download the CoNLL-2000 text chunking dataset from here and put it under $DATA_DIR/finetuning_data/chunk/(train|dev).txt. Then run

python3 run_finetuning.py --data-dir $DATA_DIR --model-name electra_base --hparams '{"model_size": "base", "task_names": ["chunk"]}'

Adding a new task

The easiest way to run on a new task is to implement a new finetune.task.Task, add it to finetune.task_builder.py, and then use run_finetuning.py as normal. For classification/QA/sequence tagging, you can inherit from finetune.classification.classification_tasks.ClassificationTask, finetune.qa.qa_tasks.QATask, or finetune.tagging.tagging_tasks.TaggingTask. For preprocessing data, we use the same tokenizer as BERT.
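
As a rough sketch only (the actual constructor arguments may differ, so mirror one of the existing GLUE tasks in finetune/classification/classification_tasks.py rather than copying this verbatim), a new classification task could look something like:

from finetune.classification import classification_tasks


class MyTask(classification_tasks.ClassificationTask):
  """Hypothetical example task; data would live in <data-dir>/finetuning_data/mytask/."""

  def __init__(self, config, tokenizer):
    # Assumed constructor signature: (config, task_name, tokenizer, label_list);
    # check the built-in GLUE tasks for the exact arguments.
    super(MyTask, self).__init__(config, "mytask", tokenizer, ["0", "1"])

The new task then needs to be registered in finetune/task_builder.py so that passing "task_names": ["mytask"] to run_finetuning.py resolves to this class.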

Expected Results

Here are expected results for ELECTRA on various tasks (test set for chunking, dev set for the other tasks). Note that variance in fine-tuning can be quite large, so for some tasks you may see big fluctuations in scores when fine-tuning from the same checkpoint multiple times. The below scores show median performance over a large number of random seeds. ELECTRA-Small/Base/Large are our released models. ELECTRA-Small-OWT is the OpenWebText-trained model from above (it performs a bit worse than ELECTRA-Small due to being trained for less time and on a smaller dataset).

                  | CoLA | SST  | MRPC | STS      | QQP  | MNLI | QNLI | RTE  | SQuAD 1.1 | SQuAD 2.0 | Chunking
Metrics           | MCC  | Acc  | Acc  | Spearman | Acc  | Acc  | Acc  | Acc  | EM        | EM        | F1
ELECTRA-Large     | 69.1 | 96.9 | 90.8 | 92.6     | 92.4 | 90.9 | 95.0 | 88.0 | 89.7      | 88.1      | 97.2
ELECTRA-Base      | 67.7 | 95.1 | 89.5 | 91.2     | 91.5 | 88.8 | 93.2 | 82.7 | 86.8      | 80.5      | 97.1
ELECTRA-Small     | 57.0 | 91.2 | 88.0 | 87.5     | 89.0 | 81.3 | 88.4 | 66.7 | 75.8      | 70.1      | 96.5
ELECTRA-Small-OWT | 56.8 | 88.3 | 87.4 | 86.8     | 88.3 | 78.9 | 87.9 | 68.5 | --        | --        | --

See here for losses / training curves of the models during pre-training.

Electric

To train Electric, use the same pre-training script and command as ELECTRA. Pass "electra_objective": false and "electric_objective": true to the hyperparameters. We plan to release pre-trained Electric models soon!
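
For example, reusing the quickstart setup above, an Electric model could be pre-trained with a command along these lines (the model name is arbitrary):

python3 run_pretraining.py --data-dir $DATA_DIR --model-name electric_small_owt --hparams '{"electra_objective": false, "electric_objective": true}'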

Citation

If you use this code for your publication, please cite the original paper:

@inproceedings{clark2020electra,
  title = {{ELECTRA}: Pre-training Text Encoders as Discriminators Rather Than Generators},
  author = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},
  booktitle = {ICLR},
  year = {2020},
  url = {https://openreview.net/pdf?id=r1xMH1BtvB}
}

If you use the code for Electric, please cite the Electric paper:

@inproceedings{clark2020electric,
  title = {Pre-Training Transformers as Energy-Based Cloze Models},
  author = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},
  booktitle = {EMNLP},
  year = {2020},
  url = {https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf}
}

Contact Info

For help or issues using ELECTRA, please submit a GitHub issue.

For personal communication related to ELECTRA, please contact Kevin Clark ([email protected]).

electra's People

Contributors

clarkkev, michelole, mrm8488, philipmay, stefan-it

electra's Issues

Question about fine-tuning on squad dataset

I downloaded this model and tried to use it.
I chose the SQuAD 2.0 dataset to fine-tune on.
When I tried to fine-tune the model on the command line,
the program just stopped working and seemed to exit back to the command line.
The output is like this:


(env_tf115) D:\python_code\NLP\electra>python run_finetuning.py --data-dir "D:\python_code\NLP\electra\datadir" --model-name electra_small --hparams {"model_size": "small", "task_names": ["squad"], "num_trials": 2}
2020-03-26 22:00:19.349133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
{"model_size": "small", "task_names": ["squad"], "num_trials": 2}
================================================================================
Config: model=electra_small, trial 1/2
================================================================================
answerable_classifier True
answerable_uses_start_logits True
answerable_weight 0.5
beam_size 20
data_dir D:\python_code\NLP\electra\datadir
debug False
do_eval True
do_lower_case True
do_train True
doc_stride 128
double_unordered True
embedding_size 128
eval_batch_size 32
gcp_project None
init_checkpoint D:\python_code\NLP\electra\datadir\models\electra_small
iterations_per_loop 1000
joint_prediction True
keep_all_models True
layerwise_lr_decay 0.8
learning_rate 0.0001
log_examples False
max_answer_length 30
max_query_length 64
max_seq_length 512
model_dir D:\python_code\NLP\electra\datadir\models\electra_small\finetuning_models\squad_model
model_hparam_overrides {}
model_name electra_small
model_size small
n_best_size 20
n_writes_test 5
num_tpu_cores 1
num_train_epochs 2.0
num_trials 2
predict_batch_size 32
preprocessed_data_dir D:\python_code\NLP\electra\datadir\models\electra_small\finetuning_tfrecords\squad_tfrecords
qa_eval_file <built-in method format of str object at 0x0000027494EB70B0>
qa_na_file <built-in method format of str object at 0x0000027494EB1AE0>
qa_na_threshold -2.75
qa_preds_file <built-in method format of str object at 0x0000027494EB70F0>
raw_data_dir <built-in method format of str object at 0x0000027494EAECA8>
results_pkl D:\python_code\NLP\electra\datadir\models\electra_small\results\squad_results.pkl
results_txt D:\python_code\NLP\electra\datadir\models\electra_small\results\squad_results.txt
save_checkpoints_steps 1000000
task_names ['squad']
test_predictions <built-in method format of str object at 0x0000027494EAD8F0>
tpu_job_name None
tpu_name None
tpu_zone None
train_batch_size 32
use_tfrecords_if_existing True
use_tpu False
vocab_file D:\python_code\NLP\electra\datadir\models\electra_small\vocab.txt
vocab_size 30522
warmup_proportion 0.1
weight_decay_rate 0.01
write_distill_outputs False
write_test_outputs False

Loading dataset squad_train
Existing tfrecords not found so creating

(env_tf115) D:\python_code\NLP\electra>

My computer setting:
Windows 10
Cuda 10.0.130
Cudnn 7.6.3
Tensorflow 1.15 GPU

I've put the pre-trained model file under "datadir\models\electra_small" directory
and squad dataset under "datadir\finetuning_data\squad" directory.
Does anyone know why the fine-tuning is not working?

KeyError: '[SEP]'

When running run_pretraining.py, I get this error before it starts pre-training:

================================================================================
Running training

2020-04-28 04:43:55.132186: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:356] GrpcSession::ListDevices will initialize the session with an empty graph and other
defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from training_loop: '[SEP]'
Traceback (most recent call last):
File "run_pretraining.py", line 384, in
main()
.
(lines ignored because they're not useful)
.
File "/home/manai_elye2s/pretrain/electra/pretrain/pretrain_helpers.py", line 121, in _get_candidates_mask
ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]
KeyError: '[SEP]'

I got this both with my own vocab and the default one I downloaded from this repo.
In both vocab.txt files, the [SEP], [CLS], and [MASK] tokens are present, without spaces.

TPU training: No matching devices found for

Hi,

I just wanted to train a small model on a v3-8 TPU. I did modify the parameters in the configure_pretraining.py file.

Training command shows the following output at the beginning:

Config:                                                                                                                                                                         
debug False
disallow_correct False                                                                                                                                                         
disc_weight 50.0                                                                                                                          
do_eval False                                                                                                                                                                             
do_lower_case False                                                                                                                                        
do_train True                                                                                                                             
electra_objective True                                                                                                                                     
embedding_size 128                                                                                                                          
eval_batch_size 128                                                                                                                                                      
gcp_project None                                                                                                                                                 
gen_weight 1.0                                                                                                                                                                            
generator_hidden_size 0.25                                                                                                                                                 
generator_layers 1.0                                                                                                             
iterations_per_loop 200                                                                                                    
learning_rate 0.0005                                                                                                                                                                     
lr_decay_power 1.0                                                                                                          
mask_prob 0.15                                            
max_predictions_per_seq 19                                                                                                         
max_seq_length 128                                                                                              
model_dir gs://tr-electra/models/electra-small-cased                                                                                                                                      
model_hparam_overrides {}                                                                                      
model_name electra-small-cased                                                                                                   
model_size small                                                                                                                                                                          
num_eval_steps 100                                                                                                                                                                       
num_tpu_cores 8                                                                                                                                                                           
num_train_steps 1000000                                                                                                                                                                   
num_warmup_steps 10000                                                                                                                                                                    
pretrain_tfrecords gs://tr-electra/pretrain_tfrecords/pretrain_data.tfrecord*                                                                                                             
results_pkl gs://tr-electra/models/electra-small-cased/results/unsup_results.pkl                                                                                                          
results_txt gs://tr-electra/models/electra-small-cased/results/unsup_results.txt                                                                                                          
save_checkpoints_steps 1000                  
temperature 1.0                                                                                                             
tpu_name bert-4                                                                                                                                                                           
tpu_zone None                                                                                                                          
train_batch_size 128                                                                                                                                                                      
uniform_generator False                                                                                                                                                                   
untied_generator True                     
untied_generator_embeddings False                                                                                                                                           
use_tpu True                                                                                                                                                         
vocab_file gs://tr-electra/vocab.txt                                        
vocab_size 32000
weight_decay_rate 0.01

Training command I used was:

$ python3 run_pretraining.py --data-dir gs://tr-electra --model-name electra-small-cased

Then the following error message is thrown:

  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/session_manager.py", line 296, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No matching devices found for '/job:train_tpu_worker/device:TPU_SYSTEM:0'

I changed the logging level to debug and it seems that the specified TPU is found:

INFO:tensorflow:Found TPU system:                                                                                                                                                         
INFO:tensorflow:*** Num TPU Cores: 8                                                                                                                                                      
INFO:tensorflow:*** Num TPU Workers: 1                                                                                                                                                    
INFO:tensorflow:*** Num TPU Cores Per Worker: 8                                                                                                                                           
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 6331406465745664406)                                                          
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 8261923844496220977)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 4004177864999644671)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 15443343382830505559)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 4979557051736282662)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 11642345092301563746)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 14428381851821878348)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 10090976446270558365)                                                
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 4924679865202343225)                                                 
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 5665404154059623606)                                   
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5630790993080310184)                                         
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops
.resource_variable_ops) with constraint is deprecated and will be removed in a future version.

I used this GCP instance:

gcloud compute instances create bert --zone=<zone> --machine-type=n1-standard-2 \
--image-project=ml-images --image-family=tf-1-15 --scopes=cloud-platform

and created the TPU with:

gcloud compute tpus create bert-4 --zone=<zone> --accelerator-type=v3-8 \
--network=default --range=192.168.4.0/29 --version=1.15

Pretty much the same configuration as I used for training my BERT models, like for Turkish, as documented in this cheatsheet.

Do you have any idea what causes this error message? Would be awesome to train ELECTRA models on TPU 🤗

Thanks many in advance!

Token-masking method: whole words or sub-words?

Hi, congrats for the paper. I really like the idea. I was wondering, what is your approach for masking tokens. Do you mask individual tokens independently, regardless of whether they might be units of a multi-token word, or do you mask all the tokens of a given word?

Let's say that we have this tokenized sentence and we want to mask shareholder:

<s> ▁Meanwhile , ▁share hold er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>'
  1. Independent masking: shareholder consists of 3 tokens and you allow for one of them to be masked, without masking the other 2.
<s> ▁Meanwhile , ▁share <mask> er ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>'
  2. Whole word masking: all tokens of shareholder have to be masked.
<s> ▁Meanwhile , <mask> <mask> <mask> ▁funds ▁have ▁climbed ▁160 ▁per ▁cent ▁since ▁2009 , ▁when ▁Miss ▁Moore ▁first ▁became ▁a ▁fully - fl ed ged ▁BBC ▁commissioner , </s>'

Which one is it? Or do you have a different approach?

'adam_m not found in checkpoint ' when further pretraining

When I was trying further pretraining on the models with domain-specific data in Colab, I encountered a problem that the official pretrained model could not be loaded.

Here is the command for further pretraining.

hparam =    '{"model_size": "small", \
             "use_tpu":true, \
             "num_tpu_cores":8, \
             "tpu_name":"grpc://10.53.161.26:8470", \
             "num_train_steps":4000100,\
             "pretrain_tfrecords":"gs://tweet_torch/electra/electra/data/pretrain_tf_records/pretrain_data.tfrecord*", \
             "model_dir":"gs://tweet_torch/electra/electra/data/electra_small/", \
             "generator_hidden_size":1.0\
            }'
!python electra/run_pretraining.py  \
                    --data-dir "gs://tweet_torch/electra/electra/data/" \
                    --model-name "electra_small" \
                    --hparams '{hparam}'

And the error message is pretty long so I just paste some of it that seems useful.

ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
From /job:worker/replica:0/task:0:
Key discriminator_predictions/dense/bias/adam_m not found in checkpoint
	 [[node save/RestoreV2 (defined at /tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]

Low usage of gpu

I tried pre-training the small, base, and large models on a single 2080 Ti GPU, but GPU utilization seems very low. Am I doing something wrong, or is GPU usage actually this low?

BasicTokenizer: _run_strip_accents

Hello,

I noticed that BasicTokenizer runs _run_strip_accents by default and it's not parameterizable. Is it critical for training Electra? Is it OK to turn it off for training non-English models?

Thanks.

Deal with the duplicated positions in generator

Here, the corrupted tokens are produced by the generator as fake data. I can understand why we should deal with the duplicated positions and only apply the update once. However, I am confused about why the implementation below takes the average value of the corrupted token ids at duplicated positions. What's the intuition behind it?

if sequence.dtype == tf.float32:
  updates_mask_3d = tf.cast(updates_mask_3d, tf.float32)
  updates /= tf.maximum(1.0, updates_mask_3d)

Bert vs Electra performances

Hello,

I am trying to compare the time performance of BERT and ELECTRA pre-training.
Looking at the paper, it seems that at fixed FLOPs (i.e., fixed time) ELECTRA performs best on downstream fine-tuning tasks.

I am struggling a bit with the comparison, since the pre-training tasks are quite different between the models, so I am not sure they can be directly compared while keeping the same hyper-parameters.

For instance:
max_seq_len in BERT seems to stand for the total length of the sentence pair, while
for ELECTRA it is just one sentence;
this would suggest halving the length for ELECTRA (or doubling it for BERT), am I right?
On the other hand, if I do so, then the number of masks (or replacements in ELECTRA) would be smaller, since it scales as masked_lm_prob * max_seq_len.
Are there other parameters I should take into account for this comparison?

I am keeping the number of steps the same in all cases, so I would expect improved performance on fine-tuning tasks after ELECTRA pre-training.

As usual, any comment or help is really appreciated!

Format of corpus

According to the paper, ELECTRA does not involve an NSP (next sentence prediction) task. In that case, do we need sentence segmentation?
Does build_pretraining_dataset.py consider each line as a separate sentence? Or can we just feed raw text (with empty lines as separators for documents)?

SQuAD2 Score ELECTRA-Base

Hi there,

great work on reducing training times!

I was wondering if the reported SQuAD 2.0 result for ELECTRA-Base in this repo is correct: 83.7 EM. This seems to diverge strongly from the result reported in the paper (80.5 EM).

Or is the reported score here meant to be F1?

Cheers

Continue pretraining on custom dataset

First of all, very nice work! I love the idea of using a discriminator, and that the current model architectures become even more capable when given a more difficult pre-training task.

Unfortunately, I think I have to bother you with a problem. I was planning to continue pre-training the models with domain-specific data for some experiments, but the necessary data is not in the released models.

raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.NotFoundError: From /job:worker/replica:0/task:0: Key discriminator_predictions/dense/bias/adam_m not found in checkpoint [[node save/RestoreV2 (defined at tensorflow_core/python/framework/ops.py:1748) ]]

(unrelated)
A 60k test run on the cord19 dataset with eval:
https://tensorboard.dev/experiment/QwOMXKluQJKcn8D9i9p69A/#scalars

freeze discriminator and train generator only

Hello,

I was wondering about the possibility of making the discriminator untrainable for the first n steps, to give the generator time to learn how to produce meaningful tokens before feeding them to the discriminator.
Since the training of the generator and discriminator is not adversarial (as far as I understand from the paper), this should be possible.
Is this option already implemented and I am missing some flag?
Would it be difficult to implement? Maybe there is some workaround with existing parameters?

Moreover, I see that the global loss, which is a linear combination of the generator and discriminator losses,
is by default heavily weighted toward the discriminator (with a relative weight of 50); this of course makes sense since the "real" model is the discriminator.
Do you believe that re-weighting the losses could enhance the generator training?

Thank you very much for the answers.

Load model in Pytorch.

Hi! Thanks for making the source code available and for the great paper.

Are there any plans to support loading models in Pytorch? Or implementation in transformers by Huggingface?

Definition of Loss

While training, I see a single loss reported. Is it the training loss or the validation loss?

multi-task training

What if I want to fine-tune ELECTRA on both classification and sequence tagging?
What should I do?

Issue about segments in build_pretraining_dataset.py

first_segment = []
second_segment = []
for sentence in self._current_sentences:
  # the sentence goes to the first segment if (1) the first segment is
  # empty, (2) the sentence doesn't put the first segment over length or
  # (3) 50% of the time when it does put the first segment over length
  if (first_segment or
      len(first_segment) + len(sentence) < first_segment_target_length or
      (second_segment and
       len(first_segment) < first_segment_target_length and
       random.random() < 0.5)):
    first_segment += sentence
  else:
    second_segment += sentence

I have two questions about the above code snippet:

  1. Should the conditions in the "if" branch be "not first_segment" and "not second_segment"?
  2. It seems there is a chance of something like this: suppose the sentences are A, B, C, and A and C end up in first_segment but B in second_segment. Does the sentence order not matter?

Init disc/generator from pre-trained BERT

Hi there,

in the paper I noticed that you tried training the generator ahead of the discriminator, then initializing the discriminator from the generator weights and training only the discriminator subsequently.

I was wondering, if you also tried the same procedure but instead of training the generator you just initialized both generator and discriminator from a fully MLM pre-trained BERT (e.g. BERT-Large).

Sorry if something was mentioned in the paper and I missed it.

Cheers

The implementation of layerwise learning rate decay

for layer in range(n_layers):
  key_to_depths["encoder/layer_" + str(layer) + "/"] = layer + 1
return {
    key: learning_rate * (layer_decay ** (n_layers + 2 - depth))
    for key, depth in key_to_depths.items()
}

According to the code here, assume that n_layers = 24; then key_to_depths["encoder/layer_23/"] = 24, which is the depth of the last encoder layer, but the learning rate for the last layer is:
learning_rate * (layer_decay ** (24 + 2 - 24)) = learning_rate * (layer_decay ** 2).

That's what confuses me. Why is the learning rate for the last layer learning_rate * (layer_decay ** 2) rather than learning_rate? Am I missing something?

TPU training

Hello, I was wondering whether it is possible to train an ELECTRA model on Cloud TPUs, since you didn't mention TPUs in your research paper. Following this tutorial on BERT pre-training and fine-tuning on Google Cloud TPUs, I ran into some trouble while working in a TPU environment. Here is the link to the BERT TPU tutorial: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb

The Colab Notebook I've created from the aforementioned one is here:
https://colab.research.google.com/drive/1ZG8xwXSJm8iQe19M4lACUcCetjf-eZii

The last cell was intended to run the training script with the use_tpu option flag specified. However, it had no success.

So, am I doing anything wrong, or is this a bug on your side?

Thanks for your help!

Multi-GPU training

Hi Kevin,

Thanks for the great work and for releasing the code/models. I was wondering if you have tried multi-GPU training for ELECTRA-Base and ELECTRA-Large (does the current code support multi-GPU)? And do you have stats for multi-GPU experiments as well?

Also, the stats for single-GPU training of ELECTRA-Base and ELECTRA-Large (how many days are needed until they converge to a decent performance?).

Thanks!
-Hamid

eval pretrained model

It seems that run_pretraining.py, when launched with do_train == false and do_eval == true,
removes the input model (the one you want to evaluate) specified by --model-name, leaving only the evaluation folder containing the discriminator and generator scores.

Issue while generating pre training data

I am generating pre-training data for Hindi, using a SentencePiece vocab for it. I am getting the following error.

python build_pretraining_dataset.py --corpus-dir data --vocab-file spiece.vocab --output-dir out --max-seq-length 128 --num-processes 1
Job 0: Creating example writer
Job 0: Writing tf examples
Traceback (most recent call last):
  File "build_pretraining_dataset.py", line 230, in <module>
    main()
  File "build_pretraining_dataset.py", line 218, in main
    write_examples(0, args)
  File "build_pretraining_dataset.py", line 190, in write_examples
    example_writer.write_examples(os.path.join(args.corpus_dir, fname))
  File "build_pretraining_dataset.py", line 143, in write_examples
    example = self._example_builder.add_line(line)
  File "build_pretraining_dataset.py", line 50, in add_line
    bert_tokids = self._tokenizer.convert_tokens_to_ids(bert_tokens)
  File "/home/gamut/Downloads/electra-master/model/tokenization.py", line 130, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)
  File "/home/gamut/Downloads/electra-master/model/tokenization.py", line 91, in convert_by_vocab
    output.append(vocab[item])
KeyError: '[UNK]'

I found that this kind of error has this solution. As there is only an input for a vocab here and not for a SentencePiece model, generating pre-training data through a SentencePiece vocab is a problem. Any solution?

Model size conflict

There seems to be a conflict about generator sizes. Those of ELECTRA small/base/large are 1/4, 1/3, 1/4 as the latest paper claims. However, the provided pre-trained weights conflict with that, having generator sizes 1, 1/3, 1/4 respectively.

NaN loss during training

Thank you for releasing your codes.

I have succeeded in training a small model using a GPU by following Quickstart: Pre-train a small ELECTRA model, but a "NaN loss during training" error occurred when I trained a base model.

Do you have any idea?

I use tensorflow 1.15.0 and Tesla V100-PCIE-32GB, and an error log is as follows:

$ python run_pretraining.py --data-dir ../electra-en-data --model-name electra_base_owt_200k --hparams '{"num_train_steps": 200000, "model_size": "base", "train_batch_size": 128}'
..
2020-04-07 09:51:45.360762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30458 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:83:00.0, compute capability: 7.0)
2020-04-07 09:52:37.813499: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
1/200000 = 0.0%, SPS: 0.0, ELAP: 21, ETA: 47 days, 14:12:08 - loss: 44.4968
2/200000 = 0.0%, SPS: 0.1, ELAP: 39, ETA: 45 days, 4:02:14 - loss: 44.3760
3/200000 = 0.0%, SPS: 0.1, ELAP: 40, ETA: 31 days, 5:01:16 - loss: 44.5174
4/200000 = 0.0%, SPS: 0.1, ELAP: 42, ETA: 24 days, 5:38:09 - loss: 44.1623
5/200000 = 0.0%, SPS: 0.1, ELAP: 43, ETA: 20 days, 1:12:59 - loss: 44.2913
ERROR:tensorflow:Model diverged with loss = NaN.
ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
Traceback (most recent call last):
  File "run_pretraining.py", line 385, in <module>
    main()
  File "run_pretraining.py", line 381, in main
    args.model_name, args.data_dir, **hparams))
  File "run_pretraining.py", line 344, in train_or_eval
    max_steps=config.num_train_steps)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/***/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/home/***/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/monitored_session.py", line 1426, in run
    run_metadata=run_metadata))
  File "/home/***/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

Auto loading in huggingface Transformers is broken

When I try to load the model following the instructions on huggingface.co/models, i.e.:

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")

I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-9-bb330c08e050> in <module>
----> 1 tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")

/opt/conda/lib/python3.6/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    179         config = kwargs.pop("config", None)
    180         if not isinstance(config, PretrainedConfig):
--> 181             config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
    182 
    183         if "bert-base-japanese" in pretrained_model_name_or_path:

/opt/conda/lib/python3.6/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    185 
    186         if "model_type" in config_dict:
--> 187             config_class = CONFIG_MAPPING[config_dict["model_type"]]
    188             return config_class.from_dict(config_dict, **kwargs)
    189         else:

KeyError: 'electra'

The version of transformers is 2.7.0. I reproduced the problem in colab here.

RFC: List of community provided models

Hi @clarkkev ,

I just wanted to hear your opinion of adding a new section to the main readme file, where users can link their own trained ELECTRA models - maybe something like "Community models" 🤔

E.g. I just saw ELECTRA models on the Hugging Face model hub for various languages like Malay, Indonesian or Korean. Today I released base and small models for Turkish.

What do you think?

Best,

Stefan

Pre-trained SMALL model cannot be loaded

Hi guys!
I tried to evaluate the pre-trained ELECTRA-small model with OpenWebText, but it cannot be loaded with the default configuration.

(electra) judith@judith-dev:~/workspace/ELECTRA/electra$ python3 run_pretraining.py --data-dir /home/judith/workspace/ELECTRA/data_dir --model-name electra_small --hparams '{"do_train": false, "do_eval": true}'
(...)
ERROR:tensorflow:Error recorded from evaluation_loop: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [64] rhs shape= [256]
         [[node save/Assign_275 (defined at /home/judith/anaconda3/envs/electra/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Invalid argument: Assign requires shapes of both tensors to match. lhs shape= [64] rhs shape= [256]
         [[node save/Assign_275 (defined at /home/judith/anaconda3/envs/electra/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
         [[save/RestoreV2/_460]]
0 successful operations.
0 derived errors ignored.

But I happened to find that if I change generator_hidden_size (which actually means the generator hidden size as a fraction of the discriminator hidden size) from 0.25 to 1, which changes the generator hidden size from 64 to 256, it runs successfully.

1/100 = 1.0%, SPS: 0.5, ELAP: 2, ETA: 3:32 - loss: 11.9305
2/100 = 2.0%, SPS: 0.6, ELAP: 3, ETA: 2:42 - loss: 11.8205
3/100 = 3.0%, SPS: 0.7, ELAP: 4, ETA: 2:25 - loss: 11.7672
4/100 = 4.0%, SPS: 0.7, ELAP: 6, ETA: 2:16 - loss: 11.7212
5/100 = 5.0%, SPS: 0.7, ELAP: 7, ETA: 2:10 - loss: 11.7709
(...)

Was that intended or just a mistake? It would be my great pleasure to hear your opinion.
Thanks in advance!

Loss of base and large models

Hi,

I'm currently working on a new non-English ELECTRA model. Training on GPU seems to work and is running fine 🤗

Next steps would be to try model training on a TPU, so I would just like to ask if you can post the final loss of both base and large models (or even share the loss training curve) so that we have a kind of reference point when training own models 🤔

Thanks many in advance,

Stefan

How to get the embedding vector or matrix after pre-training

Hi,
following the commands, I pre-trained electra-small on my dataset. After pre-training I want the learned embeddings which I need to use on some other complicated downstream tasks. Could you please help me with how to extract the word embeddings after pre-training?

problem on electra's pretraining method

ELECTRA contains a generator and a discriminator, where the generator replaces [MASK] tokens with plausible alternatives and the discriminator figures them out.

As said in the paper,

  1. Typically k = [0.15n], i.e., 15% of the tokens are masked out

  2. if the generator happens to generate the correct token, that token is considered “real” instead of “fake

Thus only (1 - generator_inference_acc) * 0.15 of the tokens are fake, which leads to an extremely unbalanced binary classification task for the discriminator, but it does work. Why?

Should dynamic masking also ignore ['PAD']

Here is a set of tokens that should not be masked during dynamic masking.

ignore_ids = [vocab["[SEP]"], vocab["[CLS]"], vocab["[MASK]"]]

But should we also avoid masking all those ['PAD'] at the end of a sentence (if the sentence is shorter than max_seq_length and if there is no second sentence segment)?

I understand ['PAD'] itself has token_id = 0, but I do not see this being used to prevent masking in downstream steps. If we do not ignore it, this will affect the probability calculation here

# Get a probability of masking each position in the sequence
candidate_mask_float = tf.cast(candidates_mask, tf.float32)
sample_prob = (proposal_distribution * candidate_mask_float)
sample_prob /= tf.reduce_sum(sample_prob, axis=-1, keepdims=True)

Also, we will be trying to predict 'PAD' that is outside a sequence, which is a bit unintuitive.

Maybe I am missing something here. Thanks again for putting up such a great work!

confusing about stop_gradient in the code

Why are there two stop_gradients in the "mask" function of pretrain_helpers.py? See the snapshot below.
[screenshot of the code]

I can understand the stop_gradient here, but after that there is still an argmax op, which seems strange.
[screenshot of the code]

num_eval_steps

It is not clear what num_eval_steps is, or what the printed loss means when evaluating the performance of
the pre-trained model.

Is it just the number of batches used for evaluation?
Is the loss just the loss evaluated on the current batch, and hence the final one just the average over the selected batches?

Thanks for the answers!

Issue with loading weights for eval

I'm trying to validate that I am running experiments on finetuned ELECTRA model correctly by using only tf.train.init_from_checkpoint() to load the weights. When evaluating my finetuned ELECTRA model using run_finetuning.py, I've found I get different accuracy results when using a different model_dir than the directory containing the finetuned model for run_config = tf.estimator.tpu.RunConfig(...).

With the original code, I get the expected accuracy (~81) but when I modify the model_dir argument in tf.estimator.tpu.RunConfig() to an empty directory, I get much lower and non-deterministic accuracy (~32). I was wondering why that is since the weights are still being loaded using tf.train.init_from_checkpoint(). Are there variables being loaded using tf.estimator.tpu.RunConfig()?

Training loss

Hello,

I was wondering whether it is possible to add some loss metrics to the training cycle? The only thing I see while training an ELECTRA model is

1275000/3000000 = 42.5%, SPS: 3.1, ELAP: 9:24:02, ETA: 6 days, 11:55:19

which tells nothing about how good it is. I'm trying to add some code to the estimator, but it seems to me that it would be much easier to show all the metrics in order to see how successful the model is at this stage.

I'm training non-English model, so I wanted to get better insight into how my model is performing at the moment.

Thanks

`num_train_steps` for further pretraining

Hello, I am trying to further pretrain the base and large models using a domain-specific corpus. But I see in the documentation that, when continuing pre-training from the released small ELECTRA checkpoint, we should:

Setting num_train_steps by (for example) adding "num_train_steps": 4010000 to the --hparams. This will continue training the small model for 10000 more steps (it has already been trained for 4e6 steps).

But Table 6 of the paper shows that the small ELECTRA model is trained for 1M steps. Which one should we set?

If 4e6 is correct, for how many steps have the base and large models been trained?

Is the result based on dev or test set?

In your paper, sometimes you use the dev set and sometimes the dev & test sets. But in this GitHub repo, you did not mention which dataset you are using. Could you clarify, or would you update the paper? Thanks.


Besides, what are the training FLOPs you used for the results in this GitHub repo? Thanks!
