GPT2

Disclaimer: This is not the official GPT2 implementation! I've done my best to follow the specifications of the original GPT2 model as closely as possible, but be warned that I have not been able to replicate the full performance of the original model using this code. I don't know why this is; I haven't been able to track down any bug that could be causing it.

An implementation of training for GPT2 that supports both GPUs and TPUs. The dataset scripts are a bit hacky and will probably need to be adapted to your needs.

Requirements

For GPUs:

pip3 install tensorflow-gpu regex

For TPUs:

pip3 install tensorflow regex google-api-python-client oauth2client

For downloading the models:

pip3 install requests tqdm

For generating the dataset (in addition to Tensorflow):

pip3 install ftfy tqdm newspaper3k

Downloading Pretrained Models

If you want to use my models, I currently have "117M", "PrettyBig" and "1.5B" on offer. 117M was trained on a single v2 TPU for a week (probably less compute than the original OpenAI model received); PrettyBig is slightly bigger than 345M and was trained on a v2-256 pod for a week. I was originally planning to also release my version of the 1.5B model but decided against it; you can read about my reasoning here. Since OpenAI has released their model, I have now also released my (inferior) 1.5B model, which was trained on a v3-512 pod for a week.

python3 download_model.py PrettyBig

This will create two directories, one named after the model and another named "encoder". Change the "model_path" and "encoder_path" parameters in the .json corresponding to your model to point to these paths, respectively.
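
For example, the relevant fields in the model's .json might then look like this (the paths here are illustrative):

    "model_path": "PrettyBig",
    "encoder_path": "encoder",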

If you only want the encoder, use:

python3 download_model.py encoder

Generating Text

To predict, you can either pass the prompt directly on the command line or have it read from a file (this is useful for prompts that include newlines). Text is output to the console and to the file specified in the "predict_path" parameter. You need a model checkpoint and a copy of the BPE encoder at an accessible location for this to work (change the "model_path" and "encoder_path" parameters in the .json).

From command line:

python3 main.py --model Your-Model.json [--top_k Top-K-Truncation] --predict_text "Hello there! My name is"

From file:

python3 main.py --model Your-Model.json [--top_k Top-K-Truncation] --predict_file input.txt

The optional top_k parameter causes the model to only consider the k most likely tokens at each step. Setting this to around 40 tends to create better results, but with less variety.
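
For example, with top-k truncation of 40:

python3 main.py --model PrettyBig.json --top_k 40 --predict_text "Hello there! My name is"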

Prediction on TPUs is not supported.

Training

To train a model, define its parameters in a .json file (see examples) and then simply call

python3 main.py --model Your-Model.json [--tpu Your-TPU-Name]

Using a TPU is optional; the code runs fine on GPUs without modification. (Note: evaluation doesn't work on TPU pods and must be commented out.)

This assumes you have a version of the openwebtext corpus stored in an accessible location. If you don't, see below for how to generate your own version.

Generating the Dataset

GPT2 is trained on the webtext corpus, which is basically all websites linked to from Reddit with at least 3 karma. Since the database is huge and contains a lot of copyrighted material, I can't provide a download here. Instead, I'll describe how I got it. Be aware that it cost me around 500€ in cloud compute resources to download and process the whole thing, but I'm not claiming I was optimally efficient.

  1. Use the download script from here to download the archives (I used the prefiltered URLs file)
  2. Use datasets/openwebtext/run_newspaper_extract.py to extract the text
  3. Once you have the raw .txt files, use datasets/openwebtext/create_tfrecords.py to encode them into .tfrecords files (requires a copy of the encoder, see Downloading Pretrained Models; a sketch of these two steps follows this list)
  4. Place the .tfrecords files into an accessible folder or Google Storage bucket (Placing in a Google Storage bucket is mandatory if you're using TPUs)
  5. Change the "data_path" parameter in your .json to point to where your .tfrecords files are located and, if you changed the filenames, adapt the functions in inputs.py to open the correct files
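
A hedged sketch of steps 2 and 3, assuming both scripts are configured by editing the variables at the top of each file (as described under "Using Your Own Data" below) rather than via command line flags:

python3 datasets/openwebtext/run_newspaper_extract.py
python3 datasets/openwebtext/create_tfrecords.py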

Using Your Own Data

You can also use your own text files as training data, but you'll need to modify some code by hand.

  1. Modify the parameters in datasets/openwebtext/create_tfrecords.py:
base_dir = "/home/connor/my_text_dir" # Path to where your .txt files are located
files_per = 175000 # How many txt files to put in one tfrecord, not too important
name = "my-custom-data" # Name of output files will be name_i.tfrecords where i is the number of the file
output_dir = "/home/connor/output" # Where to place the .tfrecords files
log_dir = "logs" # Some logs will be placed here to support restarting if the encoding is interrupted
files = glob.glob(os.path.join(base_dir, "**/*.txt"), recursive=True) # This needs to result in a list of paths to all of your txt files; recursive=True lets "**" match subdirectories
processes = 64 # Number of encoding processes to run
encoder_path = "/home/connor/encoder" # Path to encoder files
minimum_size = 128 # The minimum length (in BPE tokens) a file is allowed to have, otherwise it is discarded.
  2. Run the script. This will result in a bunch of name_i.tfrecords files. Put these somewhere accessible (they must be in a Google Storage bucket if you're using TPUs).
  3. Create a new input function in inputs.py. Any input function should have the signature function_name(params, eval=False). The stitch value controls how many texts are concatenated so that you never end up with a sample that is too small. It should be ceil((n_ctx + 1) / minimum_size). For example, if minimum_size is 128 and n_ctx is 1024, stitch should be 9 (see the comment after the code below).
def my_input(params, eval=False):
    if not eval:
        numbers = [0, 3, 4, 5, 6, 7, 8, 9] # A random subset of files for train
    else:
        numbers = [1, 2] # Random subset for eval
    files = [os.path.join(params["data_path"], "my-custom-data_{}.tfrecords".format(str(i))) for i in numbers] # Generates the list of files

    return bpe_text(params["batch_size"], files, amount=params["n_ctx"], iterations=params["iterations"], stitch=9, batch=True)
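
# Note (a sketch, not code from this repo): rather than hard-coding stitch=9,
# it can be computed from the formula above, assuming minimum_size matches the
# value used in create_tfrecords.py:
#   import math
#   stitch = math.ceil((params["n_ctx"] + 1) / 128)  # ceil(1025 / 128) == 9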
  4. Register your new input in main.py.
inputs = {
    "openwebtext": openwebtext, # Standard OpenWebtext input
    "openwebtext_longbiased": openwebtext_longbiased, # OpenWebtext with a bias towards showing more long (>512 tokens) examples
    "openwebtext_long": openwebtext_long, # Openwebtext that only shows long examples
    "my_input": my_input,
}
  5. Set your .json to use the new input.
[...]
    "iterations": 500,
    "n_embd": 768,
    "input": "my_input",
    "model": "GPT2",
[...]
  6. You're done. The input described here should be as close to GPT2 as possible and run perfectly on TPUs.

Explanation of Parameters

Because passing two dozen parameters over the command line would be tedious, you pass all the model parameters in a .json file. Note that all paths also support Google Storage (gs://) paths, and they must be gs:// paths if you're running on TPUs.

Values you'll definitely want to change:

  • model_path: Where to save and load checkpoints from
  • data_path: Where your .tfrecords files are located
  • encoder_path: Path to the BPE encoder files. To get these, use the download_model.py script to download any model (or just the encoder); you will get a folder called "encoder", which is what this parameter should point to (only required for prediction)

Values you'll probably want to change:

  • train_batch_size: Batch size during training phase
  • eval_batch_size: Batch size during evaluation
  • predict_batch_size: Batch size during prediction
  • predict_path: Where to save predictions (point this to a text file to append to)

Model parameters:

  • model: A string that refers to which model to use. This should always just be "GPT2" (no other models are implemented here)
  • n_ctx: Number of tokens the model looks at (default: 1024)
  • n_vocab: Size of vocabulary (default: 50257)
  • n_embd: Dimension of embedding layers
  • n_layer: Number of layers in the model
  • n_head: Number of attention heads (default: n_embd / 64)
  • scale_by_depth: Whether or not to scale init by the number of layers (Default: true)
  • scale_by_in: Whether to scale init by the number of input channels (Default: true)

Training parameters:

  • precision: Whether to use float32 or bfloat16 variables (use "bfloat16" when training very large models) (optional, defaults to float32)
  • input: Which input function to use (default: "openwebtext")
  • lr: Learning rate (default: 0.00025)
  • warmup_steps: Number of warmup steps. If this is set, a linear warmup + cosine decay schedule is used (default: 2000) (optional)
  • opt_name: Name of optimizer, currently there are "adam" and "adafactor" (default: "adam")
  • weight_decay: Weight decay parameter, if not present no weight decay is used (the weight decay fix for Adam is used) (default: 0.01) (optional)
  • beta1: Adam/Adafactor beta1 parameter (adam default: 0.9, adafactor default: 0.0)
  • beta2: Adam/Adafactor beta2 parameter (default: 0.98) (optional for adafactor with pow decay type)
  • epsilon: Adam epsilon parameter (default: 1e-9)
  • decay_type: Adafactor decay type, either "pow" or "adam" (default: "pow")
  • decay_exponent: Adafactor pow decay exponent (default: 0.8)
  • train_steps: Number of training steps to take between evaluations
  • eval_steps: Number of steps per evaluation
  • max_steps: The maximum number of training steps (important for the decaying learning rate schedule)
  • iterations: Number of iterations to perform on TPUs (Default: 100) (Only required for TPUs)
  • embed_dropout: Dropout chance on the word embedding, set to 0 to disable (default: 0.1)
  • attn_dropout: Dropout chance on attention layers, set to 0 to disable (default: 0.1)
  • res_dropout: Dropout chance on residual connections, set to 0 to disable (default: 0.1)
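
Putting the above together, here is a minimal sketch of a model .json assembled from the documented defaults (the values and paths are illustrative, not a shipped example file):

{
    "model": "GPT2",
    "model_path": "checkpoints/GPT2-117M",
    "data_path": "datasets/tfrecords",
    "encoder_path": "encoder",
    "predict_path": "logs/predictions.txt",
    "input": "openwebtext",
    "n_ctx": 1024,
    "n_vocab": 50257,
    "n_embd": 768,
    "n_layer": 12,
    "n_head": 12,
    "lr": 0.00025,
    "warmup_steps": 2000,
    "opt_name": "adam",
    "beta1": 0.9,
    "beta2": 0.98,
    "epsilon": 1e-9,
    "weight_decay": 0.01,
    "train_batch_size": 8,
    "eval_batch_size": 8,
    "predict_batch_size": 1,
    "train_steps": 10000,
    "eval_steps": 10,
    "max_steps": 500000,
    "iterations": 500,
    "precision": "float32"
}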


Issues

Training problem

@ConnorJL Thanks for the great work.

Unfortunately, I found that my training using OpenWebTextCorpus is too slow even for the 117M model. The cross entropy loss decreases rapidly before 10k steps using a batch size of 64; after that it stays around 3.0. Is this a known phenomenon, or is it a dataset problem? I also found that the loss function in model_fns is not shifted. It should be loss_batch = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=output["logits"][:, :-1], labels=features[:, 1:]), am I right?
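
For reference, here is a self-contained illustration of the shifted next-token loss the poster describes (a sketch for TF 1.x with random stand-in tensors, independent of this repo's model_fns.py):

import tensorflow as tf

batch_size, n_ctx, n_vocab = 2, 8, 50257
logits = tf.random.normal([batch_size, n_ctx, n_vocab])  # stand-in for output["logits"]
features = tf.random.uniform([batch_size, n_ctx], maxval=n_vocab, dtype=tf.int32)  # input token ids

# Position t predicts token t+1, so drop the last logit and the first label.
loss_batch = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=logits[:, :-1], labels=features[:, 1:])
loss = tf.reduce_mean(loss_batch)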

Has anyone managed to get this working on Windows? Which OS did you use to make it work?

I have Windows 10 x64, a Core i7 2600K CPU, 32 GB of RAM, and a GTX 1050 Ti GPU.

I have installed the latest Python and TensorFlow.

I also ran these commands:

1) pip3 install tensorflow-gpu regex

2) pip3 install requests tqdm

3) cd GPT2 folder (cloned via bash)

4) python download_model.py PrettyBig

I believe everything is ready, however I am not able to make it work.

Here are my configurations and the errors I am getting.

(Screenshots attached: the main folder, the PrettyBig folder, and PrettyBig.json. The file paths in PrettyBig.json are correct and working.)

Here is the command line I used:

C:\GPT2>python main.py --model PrettyBig.json --predict_text "Pikachu"

At first it runs for several minutes with around 70% CPU usage and over 2 GB of RAM usage.

Here is the full console output of the above command:

C:\GPT2>python main.py --model PrettyBig.json --predict_text "Pikachu"
{'n_head': 16, 'encoder_path': 'C:\GPT2\encoder', 'n_vocab': 50257, 'embed_dropout': 0.0, 'lr': 0.00025, 'warmup_steps': 2000, 'weight_decay': 0.01, 'beta1': 0.9, 'beta2': 0.98, 'epsilon': 1e-09, 'opt_name': 'adam', 'train_batch_size': 256, 'attn_dropout': 0.0, 'train_steps': 10000, 'eval_steps': 10, 'max_steps': 604800, 'data_path': 'gs://connors-datasets/openwebtext/', 'scale': 0.14433756729740646, 'res_dropout': 0.1, 'predict_batch_size': 1, 'eval_batch_size': 256, 'iterations': 100, 'n_embd': 1024, 'input': 'openwebtext_longbiased', 'model': 'GPT2', 'model_path': 'C:\GPT2\PrettyBig', 'n_ctx': 1024, 'predict_path': 'logs/predictions_SortaBig.txt', 'n_layer': 25, 'use_tpu': False, 'precision': 'float32'}
Using config: {'_model_dir': 'C:\GPT2\PrettyBig', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000016DD33ECEB8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Generating predictions...
From C:\Python37\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Calling model_fn.
From C:\GPT2\models\gpt2\sample.py:57: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
From C:\GPT2\models\gpt2\sample.py:59: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
Done calling model_fn.
Graph was finalized.
From C:\Python37\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Restoring parameters from C:\GPT2\PrettyBig\model.ckpt
Running local_init_op.
Done running local_init_op.
Traceback (most recent call last):
File "C:\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
return fn(*args)
File "C:\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)
[[{{node sample_sequence/while/model/GatherV2_1}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 131, in
predict_fn(network, text, params)
File "C:\GPT2\predict_fns.py", line 18, in gpt2_predict
for i, p in enumerate(predictions):
File "C:\Python37\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 629, in predict
preds_evaluated = mon_sess.run(predictions)
File "C:\Python37\lib\site-packages\tensorflow\python\training\monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "C:\Python37\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1171, in run
run_metadata=run_metadata)
File "C:\Python37\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1270, in run
raise six.reraise(*original_exc_info)
File "C:\Python37\lib\site-packages\six.py", line 693, in reraise
raise value
File "C:\Python37\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1255, in run
return self._sess.run(*args, **kwargs)
File "C:\Python37\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1327, in run
run_metadata=run_metadata)
File "C:\Python37\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1091, in run
return self._sess.run(*args, **kwargs)
File "C:\Python37\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
run_metadata_ptr)
File "C:\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
run_metadata)
File "C:\Python37\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)
[[node sample_sequence/while/model/GatherV2_1 (defined at C:\GPT2\models\gpt2\gpt2.py:208) ]]

Caused by op 'sample_sequence/while/model/GatherV2_1', defined at:
File "main.py", line 131, in
predict_fn(network, text, params)
File "C:\GPT2\predict_fns.py", line 18, in gpt2_predict
for i, p in enumerate(predictions):
File "C:\Python37\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 611, in predict
features, None, model_fn_lib.ModeKeys.PREDICT, self.config)
File "C:\Python37\lib\site-packages\tensorflow_estimator\python\estimator\estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "C:\GPT2\model_fns.py", line 62, in gpt2_model
temperature=1.0, top_k=params["top_k"]
File "C:\GPT2\models\gpt2\sample.py", line 82, in sample_sequence
back_prop=False,
File "C:\Python37\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "C:\Python37\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "C:\Python37\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "C:\Python37\lib\site-packages\tensorflow\python\ops\control_flow_ops.py", line 3525, in
body = lambda i, lv: (i + 1, orig_body(*lv))
File "C:\GPT2\models\gpt2\sample.py", line 56, in body
next_outputs = step(params, prev[:, tf.newaxis], past=past)
File "C:\GPT2\models\gpt2\sample.py", line 40, in step
lm_output = lm_output = gpt2.model(params=params, X=tokens, past=past, reuse=tf.AUTO_REUSE)
File "C:\GPT2\models\gpt2\gpt2.py", line 208, in model
h = tf.gather(wte, X) + tf.gather(wpe, positions_for(X, past_length))
File "C:\Python37\lib\site-packages\tensorflow\python\util\dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "C:\Python37\lib\site-packages\tensorflow\python\ops\array_ops.py", line 3273, in gather
return gen_array_ops.gather_v2(params, indices, axis, name=name)
File "C:\Python37\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 4390, in gather_v2
"GatherV2", params=params, indices=indices, axis=axis, name=name)
File "C:\Python37\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "C:\Python37\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Python37\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
op_def=op_def)
File "C:\Python37\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in init
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[0,0] = 1024 is not in [0, 1024)
[[node sample_sequence/while/model/GatherV2_1 (defined at C:\GPT2\models\gpt2\gpt2.py:208) ]]

I have used a single text prompt as the author suggested, but it still fails.

I have also tested the input.txt method.

Question about the metric reported in the paper?

Hello! I am new to NLP. I am confused about the pipeline (pretrain -> finetune -> test) of pre-training large language models.

  1. I would like to know which stage of the model used the unlabeled dataset (e.g., WebText) and the labeled datasets (e.g., LAMBADA, CoQA, CNN and Daily Mail), respectively.
    Was the GPT2 model pre-trained on the unlabeled dataset and then fine-tuned on each labeled dataset (e.g., LAMBADA, CoQA, CNN and Daily Mail) respectively, with the resulting scores reported in the paper?
  2. For other large language models, like BART, RoBERTa and MASS, were they fine-tuned on the labeled datasets (e.g., LAMBADA, CoQA, CNN and Daily Mail) before reporting their scores?

Thank you!

Training 1.5B?

Hello,

I was wondering if you were able to train the 1.5B model or the large model on TPUs? As far as I know, it's too large to fit.
I would really like to know if you did succeed. Thanks.

Predicting with PrettyBigModel `InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)`

Hi, I was interested in testing your PrettyBig model. I've downloaded the model and edited the PrettyBig.json to point to the downloaded encoder and model paths. When running:

python3 main.py --model PrettyBig.eval.json --predict_text "Hello there! My name is"

I get the following error:

{'n_head': 16, 'encoder_path': '/Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/encoder', 'n_vocab': 50257, 'embed_dropout': 0.0, 'lr': 0.00025, 'warmup_steps': 2000, 'weight_decay': 0.01, 'beta1': 0.9, 'beta2': 0.98, 'epsilon': 1e-09, 'opt_name': 'adam', 'train_batch_size': 256, 'attn_dropout': 0.0, 'train_steps': 10000, 'eval_steps': 10, 'max_steps': 604800, 'data_path': 'gs://connors-datasets/openwebtext/', 'scale': 0.14433756729740646, 'res_dropout': 0.1, 'predict_batch_size': 1, 'eval_batch_size': 256, 'iterations': 100, 'n_embd': 1024, 'input': 'openwebtext_longbiased', 'model': 'GPT2', 'model_path': '/Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/PrettyBig', 'n_ctx': 1024, 'predict_path': 'logs/predictions_SortaBig.txt', 'n_layer': 25, 'use_tpu': False, 'precision': 'float32'}
Using config: {'_model_dir': '/Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/PrettyBig', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x13fbf8ef0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Generating predictions...
From /Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
Calling model_fn.
From /Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/models/gpt2/sample.py:57: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
From /Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/models/gpt2/sample.py:59: multinomial (from tensorflow.python.ops.random_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.random.categorical instead.
Done calling model_fn.
Graph was finalized.
2019-06-08 15:55:47.498527: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
From /Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Restoring parameters from /Users/pkmital/freelance/pkm/gpt-2/gpt-1.5b/PrettyBig/model.ckpt
Running local_init_op.
Done running local_init_op.
Traceback (most recent call last):
  File "/Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/pkmital/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 1024 is not in [0, 1024)
         [[{{node sample_sequence/while/model/GatherV2_1}}]]
$ python3 --version
Python 3.6.8 :: Anaconda, Inc.
$ pip3 list | grep tensorflow
mesh-tensorflow                    0.0.5
tensorflow                         1.13.1
tensorflow-datasets                1.0.1
tensorflow-estimator               1.13.0
tensorflow-metadata                0.13.0
tensorflow-probability             0.6.0

Any ideas appreciated. Thanks!

Are there some research papers about text-to-set generation?

I know this question is a little off topic, but it would be helpful to me. Thank you.

Text-to-(word)set generation or sequence-to-(token)set generation.

For example, input a text and then output the tags for this text:

'Peter is studying English' --> {'good behavior','person','doing something'}

Thank you!

Your 1.5B model

Seeing as OpenAI released theirs, and those other researchers did prior, I would like to see yours for research and comparison. Thank you.

format dataset

Wow, nice repository! I've also been looking for a GPT2 repo to train on TPU, because I just got access to Google Cloud TPUs through the TensorFlow Research Cloud program. I have a plain text dataset, but I don't know how to reformat it into a trainable format like in your repo. Is there any formatted dataset you created to train with this repo?

Many thanks for your answer and for creating this repo, it's awesome!

Input Chinese, the predicted is Japanese.

Hello Connor Leahy. Thank you very much for your excellent model project; it's very cool. And I'm happy and excited that there will be more open source models in the future. But I'm a Chinese user: when using the PrettyBig model, I input Chinese and the results are predicted in Japanese. Could a Chinese model be supported?

Training on artificial language data (server logs, medical records, etc.)

Hi, and thank you for your amazing work! I would like to train GPT-2 on a Colab TPU on non-natural-language sequential categorical data like server logs, medical records or weather events. What do I have to change in your code to prepare a dataset with word-level encoding (instead of BPE) and successfully run training?

P.S. I think it would be very useful for the community if we had a quick tutorial section on this in the Readme.

Thank you!

Error on output

After about 5 hours, I think I am throwing in the towel here. I have run all the commands as noted. I am running this in Google Colab, where I tested a few other systems. However, this one works great until I try to run this last command.

!python3 main.py --model 1.5B.json [--top_k Top-K-Truncation] --predict_text "Hello there! My name is"

Error:

2020-10-16 23:42:00.183472: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "main.py", line 12, in <module>
    from model_fns import *
  File "/home/GPT2/model_fns.py", line 6, in <module>
    from optimizers import create_train_op
  File "/home/GPT2/optimizers.py", line 100, in <module>
    class AdafactorOptimizer(tf.train.Optimizer):
AttributeError: module 'tensorflow._api.v2.train' has no attribute 'Optimizer'

Unable to predict with bfloat16 model

I can train a bfloat16 model, but prediction on either GPU or CPU gives a missing kernel op for bfloat16 for 'Rsqrt'. Have you been able to predict using bfloat16 models?

Also, would it be possible to do batch gradient averaging to simulate a larger batch size on TPU without requiring more memory?

A meaningful performance comparison with OpenAI's models

Hi Connor,

I'd like to see some meaningful comparison with OpenAI's released and, if possible, unreleased pretrained GPT-2 models.

My concern is that if you used different training techniques, the result may be very far off from what they got, including the possibility that the 1.5B model could be worse than the 345M model that they have released.

P.S. Also pinged you on Twitter about this.

DOCKER: Web interface doesn't work

EDIT: I just realized this is the wrong repository to post this issue. The one(s) who forked your project and posted this docker solution, deepai-org, decided to go without issue reporting, but you can close this if you want.

I apologize if there's a better place to report docker image problems.

After putting a value in the textbox and hitting the upload/submit button, the result is always the same: the console displays a thrown exception and the web interface shows no visible reaction. Different types of input all have the same result: raw input text, JSON (like the piped input example), or raw input text formatted as base64.

exception follows:

172.17.0.1 - - [13/Jan/2020 02:56:03] "POST / HTTP/1.1" 500 -
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.5/dist-packages/ai_integration/modes/http.py", line 41, in hello
    'url': 'data:' + 'text/plain' + ';base64,' + base64.b64encode(inputs_dict[key]).decode("utf-8")
  File "/usr/lib/python3.5/base64.py", line 59, in b64encode
    encoded = binascii.b2a_base64(s)[:-1]
TypeError: a bytes-like object is required, not 'str'

Docker documentation for CUDA

EDIT: I just realized this is the wrong repository to post this issue. The one(s) who forked your project and posted this docker solution, deepai-org, decided to go without issue reporting, but you can close this if you want.

Please add these CUDA specific notes to the docker page or please correct me if I had missed them somewhere.

The requirement is to install nvidia-docker.

A CUDA-enabled docker run using the latest toolkit is:

docker run --gpus all --rm -it -e MODE=http -p 5000:5000 deepaiorg/gpt2

It might also help others to include the (probably common) error people get when trying to run the image without enabling CUDA:

[root@machine ~]# docker run --rm -it -e MODE=http -p 5000:5000 deepaiorg/gpt2
Unable to find image 'deepaiorg/gpt2:latest' locally
latest: Pulling from deepaiorg/gpt2
Digest: sha256:805592697648bd7e83ea558d071b3db3e486553e32d5622b56c74f6da97cb0a8
Status: Downloaded newer image for deepaiorg/gpt2:latest
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.5/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 4, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.5/imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.5/imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

117M/model.ckpt.index is corrupted?

I kept getting this error:

Create CheckpointSaverHook.
Done calling model_fn.
TPU job name worker
Graph was finalized.
Restoring parameters from gs://kogpt2/models/117M/model.ckpt
Error recorded from training_loop: From /job:worker/replica:0/task:0:
File contents are inconsistent for file: gs://kogpt2/models/117M/model.ckpt.index @ 0.
         [[node save/RestoreV2 (defined at /home/ksjcom0705_gmail_com/GPT2/venv/lib/python3.7/site-packages/tensorflow_co
re/python/framework/ops.py:1748) ]]

Does anyone have a trained 117M model I could continue training from? It looks like the source is damaged somehow (or gsutil is damaging it).

Retraining a new model, only gpu 0 can be used

my batch size:
"train_batch_size": 4,

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... Off | 00000000:34:00.0 Off | 0 |
| N/A 62C P0 349W / 350W | 30630MiB / 32480MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM3... Off | 00000000:36:00.0 Off | 0 |
| N/A 28C P0 70W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM3... Off | 00000000:39:00.0 Off | 0 |
| N/A 37C P0 71W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM3... Off | 00000000:3B:00.0 Off | 0 |
| N/A 57C P0 75W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM3... Off | 00000000:57:00.0 Off | 0 |
| N/A 27C P0 68W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM3... Off | 00000000:59:00.0 Off | 0 |
| N/A 36C P0 67W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM3... Off | 00000000:5C:00.0 Off | 0 |
| N/A 30C P0 66W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM3... Off | 00000000:5E:00.0 Off | 0 |
| N/A 38C P0 69W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 8 Tesla V100-SXM3... Off | 00000000:B7:00.0 Off | 0 |
| N/A 30C P0 66W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 9 Tesla V100-SXM3... Off | 00000000:B9:00.0 Off | 0 |
| N/A 30C P0 66W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 10 Tesla V100-SXM3... Off | 00000000:BC:00.0 Off | 0 |
| N/A 36C P0 68W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 11 Tesla V100-SXM3... Off | 00000000:BE:00.0 Off | 0 |
| N/A 38C P0 68W / 350W | 428MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 12 Tesla V100-SXM3... Off | 00000000:E0:00.0 Off | 0 |
| N/A 30C P0 66W / 350W | 428MiB / 32480MiB | 0% Default |

about encoder.json

How can I get encoder.json for my own dataset? I am confused about it. I got a vocab file using SentencePiece.

quirks that hold the model back

In Addendum: Evaluation of My Model you mention:

Although I used the same amount of hardware (or more), the differences in my training setup and hyperparameters made a significant difference. Which is an unfortunate reality to anyone familiar with reproducing deep learning papers. I don’t think my model in its current state is even as dangerous as 117M in its text generating abilities. But I believe to have found the quirks in my setup that have held the model back, and they are easy to fix.

Are you willing to elaborate on this, and describe or fix the quirks? I think it would be really interesting/informative/useful for students of deep learning as a case study, showing how small non-obvious changes can make a big difference. Please consider doing so :) Thank you.

when reading metadata of gs://openwebtext/stuff/encoder/encoder.json

I get an error while executing the command:

$ python3 main.py --model 345M.json --predict_text "Hello World. Hello there! My name"
The output is below:
{'n_head': 16, 'encoder_path': 'gs://openwebtext/stuff/encoder', 'n_vocab': 50257, 'embed_dropout': 0.1, 'lr': 0.00025, 'warmup_steps': 2000, 'weight_decay': 0.01, 'beta1': 0.9, 'beta2': 0.98, 'epsilon': 1e-09, 'opt_name': 'adam', 'train_batch_size': 8, 'attn_dropout': 0.1, 'train_steps': 10000, 'eval_steps': 10, 'max_steps': 500000, 'data_path': 'gs://connors-datasets/openwebtext/', 'res_dropout': 0.1, 'predict_batch_size': 8, 'eval_batch_size': 8, 'iterations': 500, 'n_embd': 1024, 'input': 'openwebtext', 'model': 'GPT2', 'model_path': 'gs://connors-models/GPT2-345M', 'n_ctx': 1024, 'predict_path': 'logs/predictions.txt', 'n_layer': 24, 'scale_by_depth': True, 'scale_by_in': True, 'use_tpu': False, 'precision': 'float32'}
2019-10-21 12:38:38.103626: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.159809 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:38.272828: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.053047 seconds (attempt 2 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:38.370688: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.050504 seconds (attempt 3 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:38.433094: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.564422 seconds (attempt 4 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:39.022315: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 0.256678 seconds (attempt 5 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:39.300586: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.24113 seconds (attempt 6 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:40.675821: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.13431 seconds (attempt 7 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:41.867547: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.20263 seconds (attempt 8 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:43.087045: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.05564 seconds (attempt 9 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:44.151391: I tensorflow/core/platform/cloud/retrying_utils.cc:73] The operation failed and will be automatically retried in 1.43831 seconds (attempt 10 out of 10), caused by: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'
2019-10-21 12:38:45.596157: W tensorflow/core/platform/cloud/google_auth_provider.cc:157] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Traceback (most recent call last):
File "main.py", line 118, in
enc = encoder.get_encoder(params["encoder_path"])
File "/home/kiran1/KiranResearch/TextSummerization/GPT2/models/gpt2/encoder.py", line 111, in get_encoder
encoder = json.load(f)
File "/home/kiran1/anaconda3/envs/tf_gpu/lib/python3.6/json/init.py", line 296, in load
return loads(fp.read(),
File "/home/kiran1/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 128, in read
length = self.size() - self.tell()
File "/home/kiran1/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 104, in size
return stat(self.__name).length
File "/home/kiran1/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 735, in stat
return stat_v2(filename)
File "/home/kiran1/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 754, in stat_v2
return file_statistics
File "/home/kiran1/anaconda3/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 401 with body '{
"error": {
"code": 401,
"message": "Anonymous caller does not have storage.objects.get access to openwebtext/stuff/encoder/encoder.json.",
"errors": [
{
"message": "Anonymous caller does not have storage.objects.get access to openwebtext/stuff/encoder/encoder.json.",
"domain": "global",
"reason": "required",
"locationType": "header",
"location": "Authorization"
}
]
}
}
'
when reading metadata of gs://openwebtext/stuff/encoder/encoder.json

Downloading Encoder Model fails

Hi, could someone please provide me with the pre-trained encoder? There seems to be an issue with the GCP account when I run python3 download_model.py encoder:

<?xml version='1.0' encoding='UTF-8'?><Error><Code>UserProjectAccountProblem</Code><Message>User project billing account not in good standing.</Message><Details>The billing account for project 916430819220 is disabled in state delinquent</Details></Error>$

error when using create_tfrecords.py

Got 142 files, divided into 1 chunks.
  0% 0/1 [00:00<?, ?it/s]0

Traceback (most recent call last):
  File "./GPT2/datasets/openwebtext/create_tfrecords.py", line 86, in <module>
    good += g
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

How to process raw text files to create similar "PrettyBig" model?

Thanks for the repo. Have sampling working fine from your "PrettyBig" model.

I would like to generate my own dataset from 6 gigs of raw, header-free Gutenberg text files, and I was wondering how this can be done using datasets/create_tfrecords.py.

Using tar, I've created "RS_2017-04-4_data.xz" from the raw text files and placed it at "openwebtext/RS_2017-04-4_data.xz".

I've edited one of your .json files to include the paths in the required "files.json" (# This file should contain paths to all your RS_--_data. files)

Running create_tfrecords.py creates the parse/RS_2017-04 folders.

90 minutes later, from the terminal:

Parsing chunk 1 took 54.41039276123047 seconds
-- 0.0% of chunk 1's docs yielded text.
Saving chunk 1 took 1.6689300537109375e-06 seconds
Parsing chunk 2 took 49.19901156425476 seconds
-- 0.0% of chunk 2's docs yielded text.
Saving chunk 2 took 1.1920928955078125e-06 seconds

... parse/RS_2017-04 is still empty

I stopped at this point because I assume this is wrong. Any suggestions for how I can prepare a model similar to "PrettyBig" using standard raw text files?

Cheers,

P.S. Do you plan on releasing the 1.7 model?
