Comments (12)
What batch size are you using? Also, are you using --fp16 O1?
from transfer-learning-conv-ai.
https://www.gwern.net/GPT-2#training
Seems like a single 1080Ti with 11GB should be enough - if you switch to FP16 you wouldn't even need to use gradient checkpointing (Gwern used FP32).
Tweaking the gradient checkpointing was enough to get 774M to work, but not 1.5b. We experimented with FP16 when we were trying to get 1.5b to work on a 1080ti. It caused a lot of issues: the codebase multiplies by constants which can't be represented in FP16 (so it wound up generating '!!!!!!' infinitely, because '!' happens to be the first BPE token), and once we figured that out and converted the pretrained model over to FP16, the output was completely screwed up, so something slightly more clever was obviously required to make reduced precision work. At that point we switched over to Colab TPUs (which opened up an entirely different can of worms relating to TPU iterations randomly freezing; our best guess so far is that some reshape or loop makes the TPU very unhappy).
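The overflow failure mode described here is easy to reproduce in isolation. A numpy sketch for illustration (the ~1e10 constant is the attention-masking value the GPT-2 codebase uses; everything else is made up): float16 tops out around 65504, so the constant overflows to inf, and the NaNs that follow can make greedy sampling collapse onto a single token.

```python
import numpy as np

# float16 can only represent magnitudes up to ~65504, so large constants
# (e.g. the ~1e10 used to mask attention logits) overflow to inf.
print(np.finfo(np.float16).max)      # 65504.0
mask_const = np.float16(1e10)
print(mask_const)                    # inf

# Subtracting inf from every logit turns the row into -inf, and the usual
# "subtract the max" softmax trick then produces NaNs (-inf - (-inf) = nan).
logits = np.array([0.0, 1.0, 2.0], dtype=np.float16) - mask_const
shifted = logits - logits.max()
print(np.exp(shifted))               # [nan nan nan]
```

Once every row of logits is NaN, argmax degenerates (often to token id 0, which is '!' in the GPT-2 BPE vocabulary), matching the '!!!!!!' output described above.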
@gwern mind mentioning broadly what you tweaked? I am using checkpointing in PyTorch and can't fit even one sample into a 12 GB GPU for the 774M version.
I believe we needed something like this:
diff --git a/src/model.py b/src/model.py
index 4e942d8..71092bc 100644
--- a/src/model.py
+++ b/src/model.py
@@ -124,10 +124,10 @@ def block(x, scope, *, past, hparams):
     with tf.variable_scope(scope):
         nx = x.shape[-1].value
         a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
-        x = x + a
+        x = x1 = x + a
         m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)
         x = x + m
-        return x, present
+        return x, present, x1
 
 def past_shape(*, hparams, batch_size=None, sequence=None):
     return [batch_size, hparams.n_layer, 2, hparams.n_head, sequence, hparams.n_embd // hparams.n_head]
@@ -161,9 +161,9 @@ def model(hparams, X, past=None, scope='model', reuse=tf.AUTO_REUSE):
         pasts = tf.unstack(past, axis=1) if past is not None else [None] * hparams.n_layer
         assert len(pasts) == hparams.n_layer
         for layer, past in enumerate(pasts):
-            h, present = block(h, 'h%d' % layer, past=past, hparams=hparams)
-            if layer == 10:
-                tf.add_to_collection('checkpoints', h)
+            h, present, x1 = block(h, 'h%d' % layer, past=past, hparams=hparams)
+            if layer < 48:
+                tf.add_to_collection('checkpoints', x1)
             presents.append(present)
         results['present'] = tf.stack(presents, axis=1)
         h = norm(h, 'ln_f')
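For intuition about what that diff buys, here is a toy, framework-free sketch (all names invented; this is not code from either repo): gradient checkpointing stores only selected activations during the forward pass and recomputes the rest from the nearest stored one when the backward pass needs them, trading compute for memory.

```python
# Toy illustration of gradient checkpointing: layer() stands in for a
# transformer block, and we keep only every k-th activation.

def layer(x, w):
    return x * w  # stand-in for a real transformer block

def forward(x, weights, keep_every=2):
    saved = {0: x}                     # checkpointed activations, by layer index
    for i, w in enumerate(weights):
        x = layer(x, w)
        if (i + 1) % keep_every == 0:
            saved[i + 1] = x           # only some activations are stored
    return x, saved

def activation_at(i, saved, weights):
    """Recompute the activation entering layer i from the nearest checkpoint."""
    start = max(k for k in saved if k <= i)
    x = saved[start]
    for j in range(start, i):          # replay the skipped layers
        x = layer(x, weights[j])
    return x

weights = [2.0, 3.0, 0.5, 4.0]
out, saved = forward(1.0, weights)
print(out)                             # 12.0
print(sorted(saved))                   # [0, 2, 4]
print(activation_at(3, saved, weights))  # 3.0 (recomputed, not stored)
```

The diff above applies the same idea to every layer (checkpointing the post-attention residual x1) rather than only layer 10's output, which is why it fit 774M where the stock code did not.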
What batch size are you using? Also, are you using --fp16 O1?
@martinritchie I've been using FP16 O3 -- it gives me a NaN error for the loss computation after about 55% of an epoch of training: WARNING:root:NaN or Inf found in input tensor. Training continues after the warning, but it simply keeps printing NaN for the loss with the same warning. I've also been using a batch size of 2, but I think the NaN error above is specific to FP16.
@michaelklachko @gwern I am using FP16 and facing NaN errors.
FP16 O3 can be unstable (check the apex docs); stick to O1 and this should reduce the chance of it diverging. As Gwern suggested, use gradient checkpointing and reduce the batch size to one if you are still having memory problems.
@martinritchie do you have any thoughts on how exactly to perform gradient checkpointing when the underlying modules return a variable number of tensors, like here:
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py#L478
I get errors like CheckpointFunctionBackward.forward: expected Variable (got list) for return value 0, and when I looked it up, it seems that returning a variable number of tensors is not supported. Perhaps I could unpack the tensors explicitly in every sub-module of the GPT2Model and then place the checkpoints?
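One workaround that usually suffices (a toy sketch; ToyBlock and run_block are invented stand-ins, not the real transformers Block) is to wrap the block in a closure that captures the keyword arguments and coerces its list output to a tuple, since checkpoint() wants a callable taking positional tensors and returning a tensor or tuple of tensors:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in for a transformer block that takes keyword
# arguments and returns several values as a list.
class ToyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x, mask=None):
        h = self.lin(x)
        if mask is not None:
            h = h * mask
        return [h, h.sum()]  # a list, which checkpointing rejects

block = ToyBlock()
x = torch.randn(2, 4, requires_grad=True)
mask = torch.ones(2, 4)

# Close over the kwargs so checkpoint() only sees positional tensors,
# and tuple-ify the list output before returning it.
def run_block(inp):
    return tuple(block(inp, mask=mask))

h, s = checkpoint(run_block, x, use_reentrant=False)
s.backward()                 # gradients flow through the recomputed block
```

This avoids having to unpack outputs inside every sub-module of the GPT2Model: only the call site changes.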
That sounds like it would be a little heavy-handed. Could you provide a minimal working example or show me how you are using it?
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_gpt2.py#L478
@martinritchie So the above line, I basically replaced it with:

if i == 10:
    outputs = checkpoint(block, hidden_states, layer_past=layer_past,
                         attention_mask=attention_mask, head_mask=head_mask[i])
else:
    outputs = block(hidden_states,
                    layer_past=layer_past,
                    attention_mask=attention_mask,
                    head_mask=head_mask[i])

That initially failed, saying that checkpoint.py does not support keyword arguments, so I removed the keywords from the arguments:

if i == 10:
    outputs = checkpoint(block, hidden_states, layer_past, attention_mask, head_mask[i])
else:
    outputs = block(hidden_states,
                    layer_past=layer_past,
                    attention_mask=attention_mask,
                    head_mask=head_mask[i])

With that, the previous error went away and I now get CheckpointFunctionBackward.forward: expected Variable (got list) for return value 0.
Here's a thread I found about this on the PyTorch forums: https://discuss.pytorch.org/t/checkpoint-didnt-support-list-output/16957/3
Update: It looks like in the transformers library I simply need to change forward() in Block: it returns a list of outputs, which needs to become tuple(outputs) for checkpointing to work.
FP16 O3 can be unstable (check the apex docs); stick to O1 and this should reduce the chance of it diverging. As Gwern suggested, use gradient checkpointing and reduce the batch size to one if you are still having memory problems.
@martinritchie So I'm using gradient checkpointing like above (I changed the if condition to i <= 12 so all layers are checkpointed), with a batch size of 1, along with fp16 O1. I'm still facing CUDA OOM errors. My train dataset tensor is (Batch, Candidates, Seq length): torch.Size([33410, 3, 150]). Any ideas how to get this to work?