google-research / bigbird

Transformers for Longer Sequences

Home Page: https://arxiv.org/abs/2007.14062

License: Apache License 2.0

Python 74.61% Shell 0.94% Jupyter Notebook 24.45%

bert deep-learning longer-sequences nlp transformer

bigbird's Introduction

Google Research

This repository contains code released by Google Research.

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.

Because the repo is large, we recommend you download only the subdirectory of interest:

SUBDIR=foo
svn export https://github.com/google-research/google-research/trunk/$SUBDIR

If you'd like to submit a pull request, you'll need to clone the repository; we recommend making a shallow clone (without history).

git clone git@github.com:google-research/google-research.git --depth=1

Disclaimer: This is not an official Google product.

Updated in 2023.


bigbird's Issues

Learning rate mentioned in paper vs run_summarization.py

Hi,

The learning rate mentioned in the paper for summarization is around 3e-5, but in run_summarization.py the default value of the flag is 0.32.
The roberta_base.sh script does not override the learning rate.

Could anyone clarify this? The learning rate is crucial for models like these.

Thanks

I've added bigbird's attention to my model, but not seeing a decrease in memory

I've replaced the attention layers in Enformer with those in bigbird, but the memory usage calculated by tf.get_memory_info shows the usage is still basically the same (within 1%). I'm wondering if I need to include code from the encoder or decoder to see a decrease in memory usage?

Thanks!
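
One way to reason about where savings should appear is to count attention-score entries per head. The sketch below compares a dense score matrix with a block-sparse layout. The per-row block counts (3 window, 3 random, 2 global) and both function names are illustrative assumptions, not values or APIs taken from the repository.

```python
def dense_score_entries(seq_len):
    """Number of query-key scores in a full (dense) attention matrix."""
    return seq_len * seq_len


def block_sparse_score_entries(seq_len, block_size,
                               window_blocks=3, random_blocks=3,
                               global_blocks=2):
    """Scores computed when every query attends to a fixed number of key blocks.

    The default block counts are assumptions made for this example only.
    """
    blocks_per_row = window_blocks + random_blocks + global_blocks
    return seq_len * blocks_per_row * block_size


seq_len, block_size = 4096, 64
dense = dense_score_entries(seq_len)
sparse = block_sparse_score_entries(seq_len, block_size)
print(dense, sparse, dense / sparse)
```

Under these assumed settings, an input of 512 tokens gives the same count in both layouts (512 * 8 * 64 equals 512 * 512), so a reduction would only be expected at longer sequence lengths, and only if no dense mask or score tensor is still being materialized elsewhere in the model.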

What are the versions of all libraries in the deployment environment?

Unconditional assert False in bigbird/core/utils.py

Hi,

I wanted to point out that in bigbird/core/utils.py at line 58, there is an unconditional assert False:

assert False, "Static shape not available for {}".format(tensor)

However, there is code after the assert statement. If I'm not mistaken, that means it is dead code:

  assert False, "Static shape not available for {}".format(tensor)

  dyn_shape = tf.shape(tensor)
  for index in non_static_indexes:
    shape[index] = dyn_shape[index]
  return shape
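
The unreachable lines appear intended to fall back to the dynamic shape for any dimension that is not known statically. A minimal sketch of that idea follows, using plain Python lists in place of tensors; `merge_shapes` is a hypothetical helper and is not part of the repository.

```python
def merge_shapes(static_shape, dynamic_shape):
    """Return static_shape with unknown (None) entries taken from dynamic_shape."""
    shape = list(static_shape)
    # Indexes whose size is not known statically.
    non_static_indexes = [i for i, dim in enumerate(shape) if dim is None]
    if not non_static_indexes:
        return shape
    for index in non_static_indexes:
        shape[index] = dynamic_shape[index]
    return shape


print(merge_shapes([None, 128, 768], [4, 128, 768]))
```

Note that Python strips assert statements when run with the -O flag, so the code after assert False would execute only in that mode.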

Are encoder and decoder both implemented with sparse attention? How long is the verified output length for the decoder?

Any plan to provide a Chinese pretrained model?

TFDS Custom Dataset Issue - normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.

I am using BigBird with a custom dataset (essay, label) for classification. I successfully imported the dataset as a custom tfds dataset and the BigBird classifier runs but does not return any results as shown in the log below. In my_datset.py configuration file for tfds, I am using this code to define the text feature - 'text': tfds.features.Text(). However, I believe that I need to add an encoder but TensorFlow has deprecated this in tfds.features.Text and recommends using the new tensorflow_text but doesn't explain how to do this in tfds.features.Text. Can anyone provide a recommendation for how to encode the text so BigBird can perform the classification?

My GPUS are 0
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
{'label': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:0' shape=() dtype=int64>, 'text': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:1' shape=() dtype=string>}
Tensor("args_1:0", shape=(), dtype=string)
Tensor("args_0:0", shape=(), dtype=int64)

0%| | 0/199 [00:00<?, ?it/s]
42%|████▏ | 84/199 [00:00<00:00, 838.07it/s]
100%|██████████| 199/199 [00:00<00:00, 1124.10it/s]

0%| | 0/2000 [00:00<?, ?it/s]
0%| | 0/2000 [00:00<?, ?it/s]
{'label': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:0' shape=() dtype=int64>, 'text': <tf.Tensor 'ParseSingleExample/ParseExample/ParseExampleV2:1' shape=() dtype=string>}
Tensor("args_1:0", shape=(), dtype=string)
Tensor("args_0:0", shape=(), dtype=int64)

0it [00:00, ?it/s]
0it [00:00, ?it/s]
Loss = 0.0 Accuracy = 0.0

Why is BigBird Pegasus/Pegasus Repeating the Same Sentence for Summarization?

Hello,

BigBird Pegasus, when creating summaries of text, repeats the same sentence over and over. I have tried using text on the Hugging Face model hub, and there is an issue posted on Stack Overflow (https://stackoverflow.com/questions/68911203/big-bird-pegasus-summarization-output-is-repeating-itself). Additionally, below are some images from the Hugging Face hub.

I am doing text summarization for my thesis and I am not sure why this is happening, but apparently it has been an issue for 6 months. Is there a way to prevent this from happening?

Thank you.
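
One commonly used decoding-time mitigation for repeated output is n-gram blocking: before choosing the next token, forbid any token that would recreate an n-gram already generated. The sketch below illustrates the idea on a plain token list; `banned_next_tokens` is a hypothetical helper, and the thread does not establish that this resolves the behavior reported here.

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would complete an n-gram already present in `generated`."""
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = "the cat sat on the cat".split()
print(banned_next_tokens(tokens, 3))
```

In the Hugging Face `generate` method, the `no_repeat_ngram_size` argument applies this kind of constraint during decoding.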

Preprocessing code for TriviaQA dataset

Dear authors,

Do you use the same preprocessing code as Longformer for the TriviaQA dataset, such as truncating each document to under 4096 tokens, the answer string-matching algorithm, and normalized aliases as training labels?

Variable error with the full_bigbird_mask method in the multi head attention class

There is a variable error in the full_bigbird_mask function used by the multi-head attention class: it uses MAX_SEQ_LEN instead of the from_seq_length that is passed in. This affects the attention_mask built by the convert_attn_list_to_mask(self, rand_attn) method:

temp_mask = [
    full_bigbird_mask(  # pylint: disable=g-complex-comprehension
        self.from_seq_length, self.to_seq_length, self.from_block_size,
        self.to_block_size, rand_attn=rand_attn[i])
    for i in range(self.num_attention_heads)
]

def full_bigbird_mask(from_seq_length,
                      to_seq_length,
                      from_block_size,
                      to_block_size,
                      rand_attn):
  """Calculate BigBird attention pattern as a full dense matrix.

  Args:
    from_seq_length: int. length of from sequence.
    to_seq_length: int. length of to sequence.
    from_block_size: int. size of block in from sequence.
    to_block_size: int. size of block in to sequence.
    rand_attn: adjajency matrix for random attention.

  Returns:
    attention mask matrix of shape [from_seq_length, to_seq_length]
  """

  attn_mask = np.zeros((MAX_SEQ_LEN, MAX_SEQ_LEN), dtype=np.int32)
  for i in range(1, (MAX_SEQ_LEN // from_block_size) - 1):

full_bigbird_mask uses MAX_SEQ_LEN instead of from_seq_length or to_seq_length, so the function is not dynamic: MAX_SEQ_LEN is defined only at the top of the module, and this seems to cause a glitch in the convert_attn_list_to_mask method.
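
To make the suggestion concrete, here is a minimal sketch of a dense mask builder that depends only on its arguments. It is not the repository's implementation: `dense_block_mask` is a hypothetical name, plain Python lists are used instead of NumPy, and the layout of `rand_attn` (one list of to-block indices per middle from-block) is an assumption.

```python
def dense_block_mask(from_seq_length, to_seq_length,
                     from_block_size, to_block_size, rand_attn):
    """Build a [from_seq_length][to_seq_length] 0/1 mask from block patterns.

    rand_attn[i - 1] lists the to-blocks randomly attended by from-block i,
    for the middle blocks 1 .. num_from_blocks - 2 (assumed layout).
    """
    num_from_blocks = from_seq_length // from_block_size
    num_to_blocks = to_seq_length // to_block_size
    mask = [[0] * to_seq_length for _ in range(from_seq_length)]

    def fill(from_block, to_block):
        # Mark every (query, key) pair inside one block pair as attended.
        if not 0 <= to_block < num_to_blocks:
            return
        for row in range(from_block * from_block_size,
                         (from_block + 1) * from_block_size):
            for col in range(to_block * to_block_size,
                             (to_block + 1) * to_block_size):
                mask[row][col] = 1

    for i in range(1, num_from_blocks - 1):
        for j in (i - 1, i, i + 1):      # sliding window
            fill(i, j)
        for j in rand_attn[i - 1]:       # random blocks
            fill(i, j)
    for i in (0, num_from_blocks - 1):   # global rows
        for j in range(num_to_blocks):
            fill(i, j)
    for j in (0, num_to_blocks - 1):     # global columns
        for i in range(num_from_blocks):
            fill(i, j)
    return mask


mask = dense_block_mask(12, 12, 2, 2, rand_attn=[[4], [5], [0], [1]])
print(len(mask), len(mask[0]))
```

Because the sizes come from the arguments, the same function works for any sequence length that is a multiple of the block size.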

Error in PubMed evaluation using run_summarization.py

I am using the script roberta_base.sh to train and test the model on PubMed summarization task. I am able to successfully train the model for multiple steps (5000) but it fails during evaluation time. Below is some of the error string.

I0416 18:16:41.567906 139788890330944 error_handling.py:115] evaluation_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0416 18:16:41.568143 139788890330944 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "bigbird/summarization/run_summarization.py", line 534, in <module>
    app.run(main)
...
File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2268, in create_tpu_hostcall
    'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:0", shape=(), dtype=float32, device=/job:worker/task:0/device:CPU:0)

I am not very familiar with the code or with this error. I searched online but did not find much help, so I hope you can help. Below is the command I ran to reproduce the error:

python3 bigbird/summarization/run_summarization.py \
  --data_dir="tfds://scientific_papers/pubmed" \
  --output_dir=gs://bigbird-replication-bucket/summarization/pubmed \
  --attention_type=block_sparse \
  --couple_encoder_decoder=True \
  --max_encoder_length=3072 \
  --max_decoder_length=256 \
  --num_attention_heads=12 \
  --num_hidden_layers=12 \
  --hidden_size=768 \
  --intermediate_size=3072 \
  --block_size=64 \
  --train_batch_size=2 \
  --eval_batch_size=4 \
  --num_train_steps=1000 \
  --do_train=True \
  --do_eval=True \
  --use_tpu=True \
  --tpu_name=bigbird \
  --tpu_zone=us-central1-b \
  --gcp_project=bigbird-replication \
  --num_tpu_cores=8 \
  --save_checkpoints_steps=1000 \
  --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0

code error in version of tensorflow?

Hello Google Research,
Thanks for the BigBird code.
I have run into trouble with the code.
The error message from the terminal is below.

My Python version is 3.9.7,
and the package versions are listed below.

If you have an answer to this problem, please let me know.

---------------------------------------VERSION OF PACKAGES----------------------------------------------------

---------------------------------------ERROR MESSAGE----------------------------------------------------
(bigbird) kjk88@gpu2:~/bigbird$ sh -x bigbird/classifier/base_size.sh

python3 bigbird/classifier/run_classifier.py --data_dir=tfds://imdb_reviews/plain_text --output_dir=gs://bigbird-transformer-training/classifier/imdb
WARNING:tensorflow:From /home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Traceback (most recent call last):
  File "/home/kjk88/bigbird/bigbird/classifier/run_classifier.py", line 460, in <module>
    app.run(main)
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/kjk88/bigbird/bigbird/classifier/run_classifier.py", line 375, in main
    bert_config = flags.as_dictionary()
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/site-packages/bigbird/core/flags.py", line 187, in as_dictionary
    FLAGS.vocab_model_file = str(importlib_resources.files(bigbird).joinpath(
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/importlib/resources.py", line 147, in files
    return _common.from_package(_get_package(package))
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/importlib/_common.py", line 14, in from_package
    return fallback_resources(package.__spec__)
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/importlib/_common.py", line 18, in fallback_resources
    package_directory = pathlib.Path(spec.origin).parent
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/pathlib.py", line 1082, in __new__
    self = cls._from_parts(args, init=False)
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/pathlib.py", line 707, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/home/kjk88/anaconda3/envs/bigbird/lib/python3.9/pathlib.py", line 691, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
--attention_type=block_sparse --max_encoder_length=4096 --num_attention_heads=12 --num_hidden_layers=12 --hidden_size=768 --intermediate_size=3072 --block_size=64 --train_batch_size=2 --eval_batch_size=2 --do_train=True --do_eval=True --use_tpu=True --tpu_name=bigbird --tpu_zone=europe-west4-a --gcp_project=bigbird-project
bigbird/classifier/base_size.sh: 8: bigbird/classifier/base_size.sh: --attention_type=block_sparse: not found
--num_tpu_cores=32 --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0
bigbird/classifier/base_size.sh: 24: bigbird/classifier/base_size.sh: --num_tpu_cores=32: not found
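
The TypeError is raised at pathlib.Path(spec.origin), which suggests that the import spec of the bigbird package has no origin. One plausible cause, assumed here rather than confirmed, is that Python resolved bigbird as a namespace package (a directory without an __init__.py). The sketch below shows a way to check a package's origin using only the standard library; `package_origin` is a hypothetical helper.

```python
import importlib.util


def package_origin(name):
    """Return the origin recorded in a package's import spec (may be None)."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ModuleNotFoundError(name)
    return spec.origin


# A regular package reports the path of its __init__.py file.
print(package_origin("json"))
```

Calling package_origin("bigbird") in the failing environment would show whether the origin is None. Separately, the "not found" lines indicate that the shell tried to run the continuation lines as commands, which usually points to a broken line continuation in the script (for example a trailing backslash followed by other characters); that cause is also an assumption.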

Precision equals Recall in run_classifier.py script run.

I am trying to replicate the results of the paper. I ran the run_classifier.py script for 7000 training steps on IMDB reviews. After every 1000 batches, precision, recall, accuracy, F1 score and loss are printed to the terminal. For all checkpoints, precision, recall, F1 and accuracy are identical to every decimal place. I wonder whether there is a mistake in the calculation: for a binary dataset, we would not normally expect precision, recall and accuracy to be equal.

For example, for ckpt-1000 I got 0.9408210 as the value for precision, recall, accuracy and F1.
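
One possible explanation, assuming the metrics are micro-averaged over both classes (not verified against the script), is that micro-averaged precision and recall always equal accuracy for single-label classification, because every wrong prediction counts once as a false positive and once as a false negative. The sketch below demonstrates this on made-up labels; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(labels, predictions, num_classes=2):
    """Compute accuracy, micro-averaged and positive-class precision/recall."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for label, pred in zip(labels, predictions):
        if label == pred:
            tp[label] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[label] += 1  # true class gains a false negative

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(sum(tp), len(labels)),
        "micro_precision": ratio(sum(tp), sum(tp) + sum(fp)),
        "micro_recall": ratio(sum(tp), sum(tp) + sum(fn)),
        "positive_precision": ratio(tp[1], tp[1] + fp[1]),
        "positive_recall": ratio(tp[1], tp[1] + fn[1]),
    }


metrics = classification_metrics(
    labels=[1, 1, 1, 0, 0, 0, 0, 0],
    predictions=[1, 0, 0, 1, 0, 0, 0, 0])
print(metrics)
```

In this example the micro-averaged values all coincide with accuracy, while precision and recall for the positive class alone differ from each other.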

How are prior-art models, which can only accept short text input, evaluated on long-text datasets?

For example, Attn-Seq2Seq.

Preprocessing code for the arxiv classification dataset.

Dear authors,

Could you kindly provide the preprocessing code for the Arxiv Classification dataset? Or, some descriptions about how to do the preprocessing.

Error in run_classifier.py for attention_type=simulated_sparse

I am using the base_size.sh script to run run_classifier.py. I am able to train and evaluate on IMDB data with attention_type set to original_full and block_sparse, but when I set it to simulated_sparse I see errors while initializing training. The 12 layers are initialized but training doesn't start. The main error log is below:

File "/home/amitghattimare/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3211, in _as_graph_def
    graph.ParseFromString(compat.as_bytes(data))
google.protobuf.message.DecodeError: Error parsing message

I used the below script to run the code in case it helps in investigation. If I change attention_type to the other 2 options, it works fine. I am using only 8 cores because that's the max available in preemptible mode. I have reduced train_batch_size so that it fits in memory. I wonder if that's causing the issue though error logs don't indicate that.

python3 bigbird/classifier/run_classifier.py \
  --data_dir=tfds://imdb_reviews/plain_text \
  --output_dir=gs://bigbird-replication-bucket/classifier/imdb/sim_sparse_attention \
  --attention_type=simulated_sparse \
  --max_encoder_length=4096 \
  --num_attention_heads=12 \
  --num_hidden_layers=12 \
  --hidden_size=768 \
  --intermediate_size=3072 \
  --block_size=64 \
  --train_batch_size=1 \
  --eval_batch_size=2 \
  --do_train=True \
  --do_eval=False \
  --num_train_steps=1000 \
  --use_tpu=True \
  --tpu_name=bigbird \
  --tpu_zone=us-central1-b \
  --gcp_project=bigbird-replication \
  --num_tpu_cores=8 \
  --init_checkpoint=gs://bigbird-transformer/pretrain/bigbr_base/model.ckpt-0

What's the difference of bigbr_base and bigbr_base_tf2 at the gs://bigbird-transformer/pretrain ?

I found two bigbr_base pretrained checkpoints in the Google Cloud Storage bucket. What is the difference between them? I checked with this script that their word embeddings differ, which means the difference is not only TF1 versus TF2 format.

Question about pre-trained weights

Thanks so much for releasing BigBird!

Quick question about the pre-trained weights. Do the bigbr_large and bigbr_base correspond to BERT-like encoder-only checkpoints and bigbp_large to the encoder-decoder version?

Export predictions for each example

I have successfully run Google's BigBird NLP on the IMDB dataset and also a custom dataset imported using tfds. BigBird's imdb.ipynb only prints the overall accuracy and loss. I'm trying to export the predictions for each record in the dataset and have been unable to find any information on how to do this. Any help is appreciated!

Here is the current code that I used for the summary metrics:
eval_loss = tf.keras.metrics.Mean(name='eval_loss')
eval_accuracy = tf.keras.metrics.CategoricalAccuracy(name='eval_accuracy')

opt = tf.keras.optimizers.Adam(FLAGS.learning_rate)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.CategoricalAccuracy(name='train_accuracy')

for i, ex in enumerate(tqdm(dataset.take(FLAGS.num_train_steps), position=0)):
  loss, log_probs, grads = fwd_bwd(ex[0], ex[1])
  opt.apply_gradients(zip(grads, model.trainable_weights+headl.trainable_weights))
  train_loss(loss)
  train_accuracy(tf.one_hot(ex[1], 2), log_probs)
  if i % 200 == 0:
    print('Loss = {} Accuracy = {}'.format(train_loss.result().numpy(), train_accuracy.result().numpy()))
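
A minimal sketch of one way to record per-example predictions is shown below. It assumes that the labels and log-probabilities of each batch have already been converted to plain Python lists (for instance with .numpy().tolist()); `prediction_rows` and `write_predictions` are hypothetical helpers, not part of BigBird.

```python
import csv
import io


def prediction_rows(start_index, labels, log_probs):
    """Return (example index, true label, predicted label) for one batch."""
    rows = []
    for offset, (label, scores) in enumerate(zip(labels, log_probs)):
        # The predicted class is the index of the highest log-probability.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        rows.append((start_index + offset, label, predicted))
    return rows


def write_predictions(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.writer(handle)
    writer.writerow(["example", "label", "prediction"])
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for an open file
rows = prediction_rows(0, [1, 0], [[-2.3, -0.1], [-0.2, -1.7]])
write_predictions(rows, buffer)
print(buffer.getvalue())
```

In an evaluation loop, prediction_rows could be called once per batch with a running start_index, and the accumulated rows written to a file at the end.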

Differences between ETC and BigBird-ETC version

@manzilz Thank you for sharing the excellent research. :)

I have two quick questions. If I missed some info in your paper, could you please let me know what I missed?

Q1. Is the global-local attention method used in the BigBird-ETC version exactly the same as in the ETC paper, or is it the same as in Longformer?
As far as I know, according to the ETC paper, some special tokens (global tokens) attend fully only to restricted spans. For example, in the HotpotQA task, a paragraph token attends to all tokens within the paragraph, and a sentence token attends to all tokens within the sentence. (I can't find how the [CLS] and question tokens attend.)

In Longformer, the special tokens between sentences attend fully to the whole context.

In the BigBird paper (just above Section 3), the authors say

"we add g global tokens that attend to all existing tokens."

This seems to say that the BigBird-ETC version is similar to Longformer. However, when the authors describe the differences between Longformer and BigBird-ETC, they cite ETC as the reference (in Appendix E.3). This confuses me.

Q2. Is there source code or a pre-trained model for the BigBird-ETC version? If you could share the one used in your paper, I would really appreciate it!

I look forward to your response.

Could you release the code for training BigBird on another language?

@manzilz I want to train BigBird on another language.

Pre-trained model for genomic sequences

Good morning,

Thank you for sharing the paper, code and pre-trained model for NLP text data. Your research work results are impressive. Because I am developing embeddings solutions for genes and proteins, the application to genomic sequences part interests me the most.

Is there any chance to try BigBird nucleotide-based pre-trained model for research purpose? I would like to include it in my benchmark and compare it with existing non-contextual embeddings (Word2Vec, FastText and Glove).

Regards,
Piotr

How can we finetune the pretrained model using tfrecord files?

I've tried to finetune the model on my own text summarization dataset. Before doing that, I tested using tfrecord as the input file. So I put /tmp/bigb/tfds/aeslc/1.0.0 as data_dir:

flags.DEFINE_string(
    "data_dir", "/tmp/bigb/tfds/aeslc/1.0.0",
    "The input data dir. Should contain the TFRecord files. "
    "Can be TF Dataset with prefix tfds://")

Then I run run_summarization.py. But I got the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Feature: document (data type: string) is required but could not be found.
         [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]]
         [[Mean/_19475]]
  (1) Invalid argument: Feature: document (data type: string) is required but could not be found.
         [[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
         [[MultiDeviceIteratorGetNextFromShard]]
         [[RemoteCall]]
         [[IteratorGetNext]]

Could anyone advise me how to finetune the model using tfrecord as the input file?

Why is ``last_idx`` set to 1024 even when the sequence length goes up to 4096?

I wonder why the last_idx variable (the last index up to which blocks are chosen for random attention) has been set to 1024 here even when the sequence length increases to 4096. Is this an error, or am I misunderstanding something?

Thank you for your precious time.
Yours gratefully

Unable to save and load the model after fine-tuning

In BigBird summarization, I loaded the pretrained model and fine-tuned it on the gigaword TensorFlow dataset. I then tried to save the model using tf.saved_model.save(model, data_dir=export_dir) and load it using loaded_model = tf.keras.models.load_model("/drive/My Drive/Checkpoint_Summarization/original_saved"), which throws
ValueError: Found zero restored functions for caller function.

bug in line-494 of script- run_pretraining.py

There is a small bug in the following line:

bigbird/bigbird/pretrain/run_pretraining.py

Line 494 in 103a334

self._trainable_weights = (self.extra_layer +

It should be self._trainable_weights = (self.extra_layer.trainable_weights +

Is it valid to train on GRCh38.p13 human reference instead of GRCh37 ?

Dear authors,

Thank you for this outstanding work!

I have a question regarding the reference genome for training the genomic model.
In your paper you refer to GRCh37, but it now seems to be outdated and Build 38 can be used instead (https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39).
Do you think it is valid to train a BigBird model on the chromosomes of GRCh38.p13 for chromatin profile prediction, considering that the DeepSEA training dataset is based on GRCh37? Or should the same reference genome, GRCh37, be used for both datasets?

I want to understand how d.map("preprocess function", ...) is processed

I tried to debug the "do_making" function in run_pretraining.py using the PyCharm IDE, but execution does not stop at breakpoints inside that function.

My TensorFlow version is tensorflow-2.4.0, and the PyCharm debugger is pydev.

I have two questions:

Q1. Why do breakpoints not work in this function?
Q2. How can this problem be solved?

Thank you

Model for genomic sequences

Hi
I could not find a pretrained model for the genomic sequences task, nor could I find the scripts (training algorithm, tokenizer) that I could use to train my own model on the MLM task for genomic sequences.

Pegasus variables mapping

I have my own pretrained Pegasus model, now I want to finetune using BigBird, so this is my mapping function,

import re
import collections

import tensorflow as tf  # needed for tf.train.list_variables below

def get_assignment_map_from_checkpoint(tvars, init_checkpoint):
    """Compute the union of the current variables and checkpoint variables."""
    assignment_map = {}
    initialized_variable_names = {}

    name_to_variable = collections.OrderedDict()
    for var in tvars:
        name = var.name
        m = re.match('^(.*):\\d+$', name)
        if m is not None:
            name = m.group(1)
        name_to_variable[name] = var

    init_vars = tf.train.list_variables(init_checkpoint)
    assignment_map = collections.OrderedDict()
    for x in init_vars:
        (name, var) = (x[0], x[1])

        l = 'pegasus/' + name
        l = l.replace('embeddings/weights', 'embeddings/word_embeddings')
        l = l.replace('self/output', 'output')
        l = l.replace('ffn/dense_1', 'output/dense')
        l = l.replace('ffn', 'intermediate')
        l = l.replace('memory_attention/output', 'attention/encdec_output')
        l = l.replace('memory_attention', 'attention/encdec')

        if l not in name_to_variable:
            continue
        assignment_map[name] = name_to_variable[l]
        initialized_variable_names[l + ':0'] = 1

    return (assignment_map, initialized_variable_names)
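
The renaming step inside the loop can be checked in isolation, without a checkpoint. The sketch below applies the same replacement rules, in the same order, to a single variable name; `RENAME_RULES` and `to_bigbird_name` are hypothetical names introduced only for this illustration.

```python
# Same substitutions as in the function above, applied in the same order.
RENAME_RULES = [
    ('embeddings/weights', 'embeddings/word_embeddings'),
    ('self/output', 'output'),
    ('ffn/dense_1', 'output/dense'),
    ('ffn', 'intermediate'),
    ('memory_attention/output', 'attention/encdec_output'),
    ('memory_attention', 'attention/encdec'),
]


def to_bigbird_name(checkpoint_name, prefix='pegasus/'):
    """Map a checkpoint variable name to the name expected by the model."""
    name = prefix + checkpoint_name
    for old, new in RENAME_RULES:
        name = name.replace(old, new)
    return name


print(to_bigbird_name('decoder/layer_0/ffn/dense_1/kernel'))
print(to_bigbird_name('decoder/layer_0/memory_attention/key/kernel'))
```

The order of the rules matters: 'ffn/dense_1' has to be handled before the more general 'ffn', and 'memory_attention/output' before 'memory_attention', otherwise the more specific rule would never match.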

The resulting assignment_map:

OrderedDict([('decoder/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_0/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_0/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_0/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_1/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_1/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_1/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_2/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_2/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_2/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_3/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_3/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_3/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_4/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_4/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_4/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/attention/self/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense/bias',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense_1/bias',
              <tf.Variable 'pegasus/decoder/layer_5/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/LayerNorm/beta',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/LayerNorm/gamma',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/key/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/output/dense/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec_output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/query/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('decoder/layer_5/memory_attention/value/kernel',
              <tf.Variable 'pegasus/decoder/layer_5/attention/encdec/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('embeddings/weights',
              <tf.Variable 'pegasus/embeddings/word_embeddings:0' shape=(32128, 512) dtype=float32_ref>),
             ('encoder/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_0/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_0/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_0/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_0/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_1/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_1/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_1/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_1/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_2/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_2/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_2/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_2/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_3/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_3/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_3/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_3/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_4/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_4/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_4/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_4/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/key/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/key/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/output/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/output/dense/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/query/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/query/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/attention/self/value/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/attention/self/value/kernel:0' shape=(512, 512) dtype=float32_ref>),
             ('encoder/layer_5/ffn/LayerNorm/beta',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/LayerNorm/gamma',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/LayerNorm/gamma:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense/bias',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/dense/bias:0' shape=(3072,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/intermediate/dense/kernel:0' shape=(512, 3072) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense_1/bias',
              <tf.Variable 'pegasus/encoder/layer_5/output/dense/bias:0' shape=(512,) dtype=float32_ref>),
             ('encoder/layer_5/ffn/dense_1/kernel',
              <tf.Variable 'pegasus/encoder/layer_5/output/dense/kernel:0' shape=(3072, 512) dtype=float32_ref>)])
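
The pairs listed above seem to follow a handful of regular renaming rules, so the map may not need to be written out by hand. The following is a minimal sketch derived only from the pairs shown here. The function name `to_bigbird_name` and the `RULES` table are hypothetical and not part of the BigBird code base, and checkpoint names not shown in this listing are assumed to follow the same pattern. The returned names omit the `:0` suffix that TensorFlow prints for variables.

```python
import re

# Renaming rules inferred from the (checkpoint name, variable name) pairs above.
# Order matters: the more specific patterns must be tried first.
RULES = [
    (r"^embeddings/weights$", "embeddings/word_embeddings"),
    (r"/memory_attention/output/dense/", "/attention/encdec_output/dense/"),
    (r"/memory_attention/", "/attention/encdec/"),
    (r"/attention/self/output/dense/", "/attention/output/dense/"),
    (r"/ffn/LayerNorm/", "/intermediate/LayerNorm/"),
    (r"/ffn/dense_1/", "/output/dense/"),
    (r"/ffn/dense/", "/intermediate/dense/"),
]


def to_bigbird_name(ckpt_name, scope="pegasus"):
    """Map a checkpoint variable name to the scoped variable name (assumed rules)."""
    name = ckpt_name
    for pattern, replacement in RULES:
        name = re.sub(pattern, replacement, name)
    return f"{scope}/{name}"


if __name__ == "__main__":
    for example in (
        "decoder/layer_1/memory_attention/output/dense/kernel",
        "encoder/layer_0/ffn/dense_1/kernel",
        "embeddings/weights",
    ):
        print(example, "->", to_bigbird_name(example))
```

Comparing the output of such a function against the full hand-written map is one way to spot a pair that was mapped inconsistently.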

My Pegasus config, copied from https://github.com/google-research/bigbird/blob/master/bigbird/summarization/pegasus_large.sh:

bert_config = {
    # transformer basic configs
    'attention_probs_dropout_prob': 0.1,
    'hidden_act': 'relu',
    'hidden_dropout_prob': 0.1,
    'hidden_size': 512,
    'initializer_range': 0.02,
    'intermediate_size': 3072,
    'max_position_embeddings': 4096,
    'max_encoder_length': 2048,
    'max_decoder_length': 512,
    'num_attention_heads': 8,
    'num_hidden_layers': 6,
    'type_vocab_size': 2,
    'scope': 'pegasus',
    'use_bias': False,
    'rescale_embedding': True,
    'vocab_model_file': None,
    # sparse mask configs
    'attention_type': 'block_sparse',
    'norm_type': 'prenorm',
    'block_size': 64,
    'num_rand_blocks': 3,
    'vocab_size': 32128,
    'beam_size': 1,
    'alpha': 0.0,
    'couple_encoder_decoder': False,
    'num_warmup_steps': 10000,
    'learning_rate': 0.1,
    'label_smoothing': 0.1,
    'optimizer': 'Adafactor',
    'use_tpu': True,
}
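
One quick way to sanity-check a config like this is to compare it with the variable shapes in the listing above. The sketch below is only an illustration: `check_config` is a hypothetical helper, not a BigBird function, and the two divisibility checks are assumptions based on how multi-head and block-sparse attention are usually set up, not checks taken from the repository.

```python
# Values copied from the config above.
config = {
    "hidden_size": 512,
    "intermediate_size": 3072,
    "num_attention_heads": 8,
    "max_encoder_length": 2048,
    "block_size": 64,
    "vocab_size": 32128,
}

# Shapes copied from the variable listing above.
shapes = {
    "embeddings/word_embeddings": (32128, 512),
    "encoder/layer_0/intermediate/dense/kernel": (512, 3072),
    "encoder/layer_0/attention/self/query/kernel": (512, 512),
}


def check_config(config, shapes):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    hidden = config["hidden_size"]

    # Assumption: each attention head gets an equal slice of the hidden size.
    if hidden % config["num_attention_heads"] != 0:
        problems.append("hidden_size is not divisible by num_attention_heads")

    # Assumption: block-sparse attention needs a whole number of blocks.
    if config["max_encoder_length"] % config["block_size"] != 0:
        problems.append("max_encoder_length is not a multiple of block_size")

    expected = {
        "embeddings/word_embeddings": (config["vocab_size"], hidden),
        "encoder/layer_0/intermediate/dense/kernel": (hidden, config["intermediate_size"]),
        "encoder/layer_0/attention/self/query/kernel": (hidden, hidden),
    }
    for name, shape in shapes.items():
        if name in expected and shape != expected[name]:
            problems.append(f"{name}: shape {shape} != expected {expected[name]}")
    return problems


if __name__ == "__main__":
    print(check_config(config, shapes) or "config agrees with the listed shapes")
```

For the values above, every check passes, so at least the sizes in the config and the listed variables agree.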

I am not sure this is the correct one. Fine-tuning is really slow, so any guidance on the variable mapping would be really helpful.

Details about warm start from RoBERTa’s checkpoint

How should the pretrained RoBERTa checkpoint be used? I am unsure whether the pretrained position embeddings from RoBERTa are used.

Roberta Training

Hello,

First, congratulations on your work.

Second, from what I have discovered so far, you only allow BERT-like training and not RoBERTa training.
Even if NSP is set to false, your script still requires the "next_sentence_labels" field, which is generated by the BERT script.

My question is:
How can we generate data for and train a model like RoBERTa, where there is only a single sequence per example and no NSP?

@manzilz @ppham27 your feedback is highly appreciated.
Thanks in advance for your reply.

reproduce arxiv classification task

We are trying to reproduce the arXiv classification result of F1 92 reported in the paper. We use the default hyperparameters defined in bigbird/classifier/base_size.sh and the pretrained checkpoint here, but with a batch size of 2 per GPU due to memory limitations (total batch size = 8 GPUs * 2 = 16). After 16k steps (16000 * 16 / 30034 ≈ 8.5 epochs) we only get F1 84, which is much lower than the result in the paper, which was trained for 10 epochs.
Are we missing something, such as the preprocessing of arXiv? Or is it just because the batch size is too small?
Will you release the arXiv checkpoint in the future?

Regarding possible dataset differences: we fine-tuned RoBERTa on the same arXiv dataset and got F1 86, which is pretty close to the paper.
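
For reference, the epoch arithmetic in this report can be reproduced with a few lines. This is only a sketch using the numbers quoted in the post (8 GPUs, batch size 2 per GPU, 16000 steps, 30034 training examples); the helper names are made up for illustration.

```python
import math


def epochs_seen(steps, total_batch_size, num_examples):
    """Number of passes over the training set after `steps` updates."""
    return steps * total_batch_size / num_examples


def steps_for_epochs(epochs, total_batch_size, num_examples):
    """Number of updates needed to cover `epochs` passes over the training set."""
    return math.ceil(epochs * num_examples / total_batch_size)


if __name__ == "__main__":
    total_batch = 8 * 2  # 8 GPUs, batch size 2 each
    print(round(epochs_seen(16000, total_batch, 30034), 2))  # 8.52
    print(steps_for_epochs(10, total_batch, 30034))          # 18772
```

With a total batch size of 16, the run described above stops roughly 2,800 steps short of 10 full epochs.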
