Comments (17)
@thomwolf I think I have figured this out. Let me know what you think.
transfer-learning-conv-ai/train.py, line 55 in b7f295f
If len(instance["input_ids"]) above is greater than 512 (the default value of n_positions in OpenAIGPTConfig, defined in modeling_openai.py in pytorch-pretrained-bert), then the position_ids created at the link below will contain values of 512 or more, which fall outside the position embedding table.
Consequently, this line (https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L633) will fail.
I think you need to add truncation logic for the sequence above, prior to doing instance["input_ids"] = list(chain(*sequence)), so that the length is always less than or equal to 512.
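A minimal sketch of the length check implied above, assuming each instance is a dict with an "input_ids" list as in train.py. The helper name find_overlong and the constant N_POSITIONS are hypothetical and not part of the repository. It reports instances whose position ids would run past the embedding table, so the problem can surface on the CPU before any CUDA kernel is launched.

```python
# Default n_positions of OpenAIGPTConfig, as stated in the comment above.
N_POSITIONS = 512

def find_overlong(instances, n_positions=N_POSITIONS):
    """Return (index, length) pairs for instances longer than n_positions.

    Position ids for a sequence of length n are 0..n-1, so any n greater
    than n_positions yields at least one id outside the embedding table.
    """
    return [
        (i, len(inst["input_ids"]))
        for i, inst in enumerate(instances)
        if len(inst["input_ids"]) > n_positions
    ]

# Toy data standing in for real instances (hypothetical values).
instances = [
    {"input_ids": list(range(100))},
    {"input_ids": list(range(512))},
    {"input_ids": list(range(600))},
]
print(find_overlong(instances))  # [(2, 600)]
```

A length of exactly 512 is accepted, since its largest position id is 511.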
from transfer-learning-conv-ai.
@g-karthik I think the "device-side assert triggered" error is caused by the position embedding, which is limited to 512 entries: the sequence length of your inputs in the training loader is greater than that.
from transfer-learning-conv-ai.
Since the above stacktrace ends at the torch.addmm() call happening in modeling_openai.py, I hypothesized that perhaps the addmm() op doesn't play well with the pytorch_p36 env on my EC2 instance. So I tried to reproduce the above RuntimeError in the interpreter on the instance (with the pytorch_p36 env activated).
>>> import torch
>>> a = torch.randn(2,3)
>>> b = torch.randn(3,2)
>>> c = torch.randn(2,2)
>>> x = torch.addmm(c,a,b)
The above worked perfectly fine. So I do not think the issue is due to the addmm() op itself, but I'm confused how to proceed.
My best hypothesis is that the matrix multiplication happening internally is allocating memory and causing this failure, but if that were the case, we'd see many more similar issues on GitHub :)
from transfer-learning-conv-ai.
I printed the shapes of the tensors self.bias, x.view(-1, x.size(-1)) and self.weight right before the above failure in the forward pass. The shapes were torch.Size([2304]), torch.Size([12560, 768]) and torch.Size([768, 2304]) respectively.
So I did the following test:
>>> a = torch.randn(2304).cuda()
>>> b = torch.randn(12560, 768).cuda()
>>> c = torch.randn(768, 2304).cuda()
>>> a.shape
torch.Size([2304])
>>> b.shape
torch.Size([12560, 768])
>>> c.shape
torch.Size([768, 2304])
>>> x = torch.addmm(a,b,c)
>>>
The above test worked fine - no CUDA resource allocation errors. So I really am clueless at this point.
from transfer-learning-conv-ai.
@thomwolf do you have any thoughts?
from transfer-learning-conv-ai.
You can try with:
- a batch size of 1, to be sure you are not just out of memory (--train_batch_size 1 --valid_batch_size 1)
- CUDA_LAUNCH_BLOCKING=1 before the python train.py, to be sure where the error comes from
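One way to apply both suggestions from a Python launcher, as a rough sketch. The helper name blocking_env is hypothetical, and the launch itself is left commented out because it assumes train.py is in the working directory. CUDA_LAUNCH_BLOCKING needs to be in the environment before the process initializes CUDA, which is why it is set on the child process here.

```python
import os
import subprocess
import sys

def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env

# Batch size flags taken from the suggestion above.
cmd = [
    sys.executable, "train.py",
    "--train_batch_size", "1",
    "--valid_batch_size", "1",
]

# Uncomment to launch; assumes train.py exists in the current directory.
# subprocess.run(cmd, env=blocking_env(), check=True)
```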
from transfer-learning-conv-ai.
and with CUDA_LAUNCH_BLOCKING=1, what exact error message do you get?
from transfer-learning-conv-ai.
@thomwolf here's a part of the error I get with that flag set:
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: CUDA error: device-side assert triggered.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: CUDA error: device-side assert triggered.
Traceback (most recent call last):
  File "train.py", line 358, in <module>
    train()
  File "train.py", line 349, in train
    trainer.run(train_loader, max_epochs=args.n_epochs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
    hours, mins, secs = self._run_once_on_dataset()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
    self.state.output = self._process_function(self, batch)
  File "train.py", line 275, in update
    lm_loss, mc_loss = model(*batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 811, in forward
    hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 639, in forward
    token_type_embeds = self.tokens_embed(token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
from transfer-learning-conv-ai.
Okay, I think the reason I'm facing these embedding matrix index out of bounds errors is that my dataset has really long utterances and the default config in OpenAIGPTConfig (like n_positions=512) is insufficient to handle the utterances in my dataset.
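The assertion `srcIndex < srcSelectDimSize` in the log fires when an index passed to an embedding lookup is not smaller than the number of rows in the table. A small sketch of a CPU-side check follows. The function name and the toy sizes are assumptions for illustration, and the same check can be applied to input_ids and token_type_ids against the vocabulary size, as well as to position ids against n_positions.

```python
def out_of_range(ids, table_size):
    """Return the ids that are not valid row indices for a table of table_size rows."""
    return [i for i in ids if not 0 <= i < table_size]

n_positions = 512  # default mentioned in the thread
seq_len = 520      # hypothetical overlong sequence
position_ids = list(range(seq_len))

bad = out_of_range(position_ids, n_positions)
print(len(bad), bad[0], bad[-1])  # 8 512 519
```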
from transfer-learning-conv-ai.
I added some truncation code and managed to resolve the error corresponding to the previous stack trace. But I'm now facing a different stack trace (with CUDA_LAUNCH_BLOCKING=1):
/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [637,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: CUDA error: device-side assert triggered.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: CUDA error: device-side assert triggered.
Traceback (most recent call last):
  File "train.py", line 376, in <module>
    train()
  File "train.py", line 367, in train
    trainer.run(train_loader, max_epochs=args.n_epochs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
    hours, mins, secs = self._run_once_on_dataset()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
    self.state.output = self._process_function(self, batch)
  File "train.py", line 293, in update
    lm_loss, mc_loss = model(*batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 808, in forward
    hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 641, in forward
    hidden_states = inputs_embeds + position_embeds + token_type_embeds
RuntimeError: CUDA error: device-side assert triggered
As a quick next step, I'm going to print the shapes of inputs_embeds, position_embeds and token_type_embeds to see if they're all the same shape.
from transfer-learning-conv-ai.
Not sure why I got the above failure, but things seem to be working fine now.
from transfer-learning-conv-ai.
@sashank06 yes that is correct, I had realized it back then after a bunch of debugging, and I forgot to update this issue! Thanks for the reminder!
from transfer-learning-conv-ai.
@sashank06 yes that is correct, I had realized it back then after a bunch of debugging, and I forgot to update this issue! Thanks for the reminder!
How did you do the truncation? I mean wouldn't there be loss of information if we just omit the words > 512 length?
from transfer-learning-conv-ai.
@geekygirl123 You can truncate the individual components of your input_ids, while ensuring len(input_ids) <= 512 for GPT. For example, you can truncate the dialog context to comprise just the recent turns.
Yes, there would be a loss of information if you truncated the dialog context this way, but it is something you could live with from a performance point of view.
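A minimal sketch of this kind of truncation, assuming the input is built from a persona, a list of history turns, and a reply, each already tokenized into lists of ids. The function name, the argument layout, and the choice to always keep the persona and reply are illustrative assumptions, not the code used in the thread.

```python
def truncate_history(persona, history, reply, max_len=512):
    """Drop the oldest history turns until the flattened length fits max_len.

    The most recent turn is always kept, so the result can still exceed
    max_len if persona, last turn and reply are too long on their own.
    """
    history = list(history)  # copy so the caller's list is not modified

    def total(turns):
        return len(persona) + sum(len(t) for t in turns) + len(reply)

    while len(history) > 1 and total(history) > max_len:
        history.pop(0)  # remove the oldest turn first
    return history

# Toy token lists (hypothetical lengths).
persona = [1] * 50
history = [[2] * 200, [3] * 200, [4] * 150]
reply = [5] * 40

kept = truncate_history(persona, history, reply)
print([len(t) for t in kept])  # [200, 150]
```

Any special tokens added when the sequence is assembled also count toward the limit, so max_len may need to be set a little below 512.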
from transfer-learning-conv-ai.
Had the same problem. My strategy was to pop history items up to a minimum of one item, and then history items up to a minimum of one item, until the sequence length was 512 or lower.
from transfer-learning-conv-ai.
@ioannist, @g-karthik, or anyone else who managed to make their truncation work, could you share the code with me? I'm unable to make the truncation work despite the lengths of each of the elements in the sequence being truncated to 512.
I tried a simple slice:
sequence[0] = sequence[0][:512]
sequence[1] = sequence[1][:512]
sequence[2] = sequence[2][:512]
but that didn't seem to work. I also tried a number of other things, but I was mainly fumbling around in the dark.
Any pointers or code for this?
Thanks in advance!
Edit: I tried the following, which got me around the length issue but now I run into an assertion error.
Code I wrote to truncate:
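from itertools import chain

# Drop trailing tokens, emptying sequence[1] first, then sequence[2], then
# sequence[0], until the flattened length is at most 512.
input_ids_len = len(list(chain(*sequence)))
while input_ids_len > 512:
    if len(sequence[1]) > 0:
        sequence[1].pop()
    elif len(sequence[2]) > 0:
        sequence[2].pop()
    elif len(sequence[0]) > 0:
        sequence[0].pop()
    input_ids_len = len(list(chain(*sequence)))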
Error Traceback:
Traceback (most recent call last):
  File "transfer-learning-conv-ai/train.py", line 308, in <module>
    train()
  File "transfer-learning-conv-ai/train.py", line 212, in train
    train_loader, val_loader, train_sampler, valid_sampler = get_data_loaders(args, tokenizer)
  File "transfer-learning-conv-ai/train.py", line 147, in get_data_loaders
    train_dataset, valid_dataset = TensorDataset(*tensor_datasets["train"]), TensorDataset(*tensor_datasets["valid"])
  File "C:\Anaconda3\miniconda3\envs\chatbot\lib\site-packages\torch\utils\data\dataset.py", line 166, in __init__
    assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
AssertionError
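The assertion in the last frame checks that every tensor passed to TensorDataset has the same size along its first dimension. A small diagnostic sketch, using plain lists in place of tensors, can show which input disagrees. The function names and the sample field names and sizes are hypothetical.

```python
def first_dim_sizes(named_sequences):
    """Map each name to the length of its first dimension."""
    return {name: len(seq) for name, seq in named_sequences.items()}

def mismatched(named_sequences):
    """Return names whose first-dimension length differs from the first entry's."""
    sizes = first_dim_sizes(named_sequences)
    if not sizes:
        return []
    reference = next(iter(sizes.values()))
    return [name for name, size in sizes.items() if size != reference]

# Toy inputs standing in for the padded dataset fields (hypothetical sizes).
data = {
    "input_ids": [[0] * 4] * 10,
    "token_type_ids": [[0] * 4] * 10,
    "mc_labels": [0] * 9,
}
print(mismatched(data))  # ['mc_labels']
```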
from transfer-learning-conv-ai.