RuntimeError: cublas runtime error : resource allocation failed at THCGeneral.cpp:250,about huggingface/transfer-learning-conv-ai

Comments (17)

g-karthik commented on July 22, 2024 3

@thomwolf I think I have figured this out. Let me know what you think.

transfer-learning-conv-ai/train.py

Line 55 in b7f295f

instance["input_ids"] = list(chain(*sequence))

If len(instance["input_ids"]) above is greater than 512 (the default value of n_positions in OpenAIGPTConfig in modeling_openai.py in pytorch-pretrained-bert), then the position_ids created at the link below will contain values of 512 or more, which fall outside the position embedding table.

https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L620

Consequently, this line (https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L633) will fail.

I think you need to add truncation logic for the sequence above prior to doing instance["input_ids"] = list(chain(*sequence)) so that the length is always less than or equal to 512.
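
A minimal sketch of the length check this implies, assuming `sequence` is a list of token-id lists as in train.py and that 512 matches the model's n_positions. The names `MAX_POSITIONS` and `check_fits` are hypothetical and not part of the repository.

```python
from itertools import chain

MAX_POSITIONS = 512  # assumed to match n_positions in OpenAIGPTConfig


def check_fits(sequence, max_positions=MAX_POSITIONS):
    """Return (length, fits) for the flattened list of token-id segments."""
    input_ids = list(chain(*sequence))
    # Position ids run from 0 to len(input_ids) - 1, so the largest
    # position id is in range only when len(input_ids) <= max_positions.
    return len(input_ids), len(input_ids) <= max_positions


# Toy segments standing in for real token ids.
short = [[1, 2, 3], [4, 5], [6]]
too_long = [[0] * 300, [0] * 200, [0] * 13]
print(check_fits(short))
print(check_fits(too_long))
```

Running a check like this over the dataset before training would flag the offending instances on the CPU, where the failure is easier to read than a device-side assert.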

from transfer-learning-conv-ai.

sashank06 commented on July 22, 2024 1

@g-karthik I think the "device-side assert triggered" error is due to the position embedding, which is limited to 512 positions, while the input sequence length in your training loader is greater than that.


g-karthik commented on July 22, 2024

Since the above stacktrace ends at the torch.addmm() call happening in modeling_openai.py, I hypothesized that perhaps the addmm() op doesn't play well with the pytorch_p36 env in my EC2 instance. So I tried to reproduce the above RuntimeError in the interpreter in the instance (with the pytorch_p36 env activated).

>>> import torch
>>> a = torch.randn(2,3)
>>> b = torch.randn(3,2)
>>> c = torch.randn(2,2)
>>> x = torch.addmm(c,a,b)

The above worked perfectly fine. So I do not think the issue is due to the addmm() op itself, but I'm confused how to proceed.

My best hypothesis is that the matrix multiplication allocates memory internally and that allocation is what fails, but if that were the case, we'd see many more similar issues on GitHub :)


g-karthik commented on July 22, 2024

I printed the shapes of the tensors self.bias , x.view(-1, x.size(-1)) and self.weight right before the above failure in the forward pass. The shapes were torch.Size([2304]), torch.Size([12560, 768]) and torch.Size([768, 2304]) respectively.

So I did the following test:

>>> a = torch.randn(2304).cuda()
>>> b = torch.randn(12560, 768).cuda()
>>> c = torch.randn(768, 2304).cuda()
>>> a.shape
torch.Size([2304])
>>> b.shape
torch.Size([12560, 768])
>>> c.shape
torch.Size([768, 2304])
>>> x = torch.addmm(a,b,c)
>>>

The above test worked fine - no CUDA resource allocation errors. So I really am clueless at this point.
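
As a rough sanity check on the out-of-memory hypothesis, the sizes of the tensors in this one call can be worked out by hand. The sketch below assumes float32 (the default for torch.randn); `tensor_mib` is a hypothetical helper. It covers only this single addmm call, not the rest of the training step.

```python
BYTES_PER_FLOAT32 = 4  # assumption: all tensors are float32


def tensor_mib(*shape):
    """Size in MiB of a dense float32 tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * BYTES_PER_FLOAT32 / 2**20


bias_mib = tensor_mib(2304)
x_mib = tensor_mib(12560, 768)
weight_mib = tensor_mib(768, 2304)
out_mib = tensor_mib(12560, 2304)  # result shape of addmm(bias, x, weight)

total_mib = bias_mib + x_mib + weight_mib + out_mib
print(f"output: {out_mib:.1f} MiB, total: {total_mib:.1f} MiB")
```

The operands and result together come to roughly 150 MiB, which is consistent with the standalone test succeeding and suggests the reported cublas error originates elsewhere.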


g-karthik commented on July 22, 2024

@thomwolf do you have any thoughts?


thomwolf commented on July 22, 2024

You can try with:

- a batch size of 1 to be sure you are not just out-of-memory (--train_batch_size 1 --valid_batch_size 1)
- CUDA_LAUNCH_BLOCKING=1 before the python train.py to be sure where the error comes from
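
CUDA kernel launches are asynchronous by default, so an error raised inside one kernel can be reported by a later, unrelated call. Setting CUDA_LAUNCH_BLOCKING=1 makes launches synchronous so that the traceback points at the operation that actually failed. If prefixing the command is inconvenient, the variable can also be set from Python. This sketch assumes it is placed at the very top of the training script, before torch is imported.

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch
# (assumption: this sits at the very top of the training script).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Synchronous launches slow training down, so this is best used only while debugging.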


g-karthik commented on July 22, 2024

I already tried with a batch size of 1. The tensors were smaller in shape, but the exact same error repeated.

On Thu, May 23, 2019, 11:46 PM Thomas Wolf wrote:

You can try with:
- a batch size of 1 to be sure you are not just out-of-memory
- CUDA_LAUNCH_BLOCKING=1 before the python train.py to be sure where the error comes from


thomwolf commented on July 22, 2024

And with CUDA_LAUNCH_BLOCKING=1, what exact error message do you get?


g-karthik commented on July 22, 2024

@thomwolf here's a part of the error I get with that flag set:

/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: CUDA error: device-side assert triggered.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: CUDA error: device-side assert triggered.
Traceback (most recent call last):
  File "train.py", line 358, in <module>
    train()
  File "train.py", line 349, in train
    trainer.run(train_loader, max_epochs=args.n_epochs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
    hours, mins, secs = self._run_once_on_dataset()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
    self.state.output = self._process_function(self, batch)
  File "train.py", line 275, in update
    lm_loss, mc_loss = model(*batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 811, in forward
    hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 639, in forward
    token_type_embeds = self.tokens_embed(token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered


g-karthik commented on July 22, 2024

Okay, I think the reason I'm facing these embedding matrix index out of bounds errors is that my dataset has really long utterances and the default config in OpenAIGPTConfig (like n_positions=512) is insufficient to handle the utterances in my dataset.
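
The `srcIndex < srcSelectDimSize` assertion means some index passed to an embedding lookup is at least as large as the number of rows in the table. A pre-flight scan on the CPU can find such instances before training starts. This is a sketch under assumptions: `find_bad_instances` and `n_embeddings` are hypothetical names, and each instance is assumed to be a dict with "input_ids" and "token_type_ids" lists. Both are checked against the same table because the traceback shows token_type_ids being looked up through tokens_embed.

```python
from itertools import chain


def find_bad_instances(instances, n_positions, n_embeddings):
    """Yield (index, reason) for instances that would index out of range.

    n_positions: rows in the position embedding table.
    n_embeddings: rows in the token embedding table (hypothetical name).
    """
    for i, inst in enumerate(instances):
        ids = inst["input_ids"]
        types = inst["token_type_ids"]
        if len(ids) > n_positions:
            yield i, f"length {len(ids)} > n_positions {n_positions}"
        bad = [t for t in chain(ids, types) if not 0 <= t < n_embeddings]
        if bad:
            yield i, f"token id {bad[0]} outside [0, {n_embeddings})"


# Toy data with deliberately small limits.
instances = [
    {"input_ids": [1, 2, 3], "token_type_ids": [5, 5, 5]},
    {"input_ids": list(range(6)), "token_type_ids": [5] * 6},
    {"input_ids": [1, 99], "token_type_ids": [5, 5]},
]
problems = list(find_bad_instances(instances, n_positions=4, n_embeddings=10))
for index, reason in problems:
    print(index, reason)
```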


g-karthik commented on July 22, 2024

I added some truncation code and managed to resolve the error corresponding to the previous stack trace. But I'm now facing a different stack trace (with CUDA_LAUNCH_BLOCKING=1):

/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [637,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: CUDA error: device-side assert triggered.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: CUDA error: device-side assert triggered.
Traceback (most recent call last):
  File "train.py", line 376, in <module>
    train()
  File "train.py", line 367, in train
    trainer.run(train_loader, max_epochs=args.n_epochs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
    hours, mins, secs = self._run_once_on_dataset()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
    self.state.output = self._process_function(self, batch)
  File "train.py", line 293, in update
    lm_loss, mc_loss = model(*batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 808, in forward
    hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 641, in forward
    hidden_states = inputs_embeds + position_embeds + token_type_embeds
RuntimeError: CUDA error: device-side assert triggered

As a quick next step, I'm going to print the shapes of inputs_embeds, position_embeds and token_type_embeds to see if they're all the same shape.


g-karthik commented on July 22, 2024

Not sure why I got the above failure, but things seem to be working fine now.


g-karthik commented on July 22, 2024

@sashank06 yes that is correct, I had realized it back then after a bunch of debugging, and I forgot to update this issue! Thanks for the reminder!


dishavarshney082 commented on July 22, 2024

@sashank06 yes that is correct, I had realized it back then after a bunch of debugging, and I forgot to update this issue! Thanks for the reminder!

How did you do the truncation? Wouldn't information be lost if we simply drop the tokens beyond position 512?


g-karthik commented on July 22, 2024

@geekygirl123 You can do the truncation of individual components of your input_ids, while ensuring len(input_ids) <= 512 for GPT. For example, you can truncate the dialog context to comprise just the recent turns.
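
One way to keep only the recent turns could look like the sketch below. It is not the code used in this thread: `keep_recent_turns` and its `reserved` parameter (room for any special tokens added later) are hypothetical, and persona, history and reply are assumed to be already tokenized.

```python
def keep_recent_turns(persona, history, reply, max_len=512, reserved=0):
    """Drop the oldest history turns until the flattened input fits.

    persona and reply are flat lists of token ids; history is a list of
    turns (each a list of token ids), oldest first. Returns the kept
    history and whether the total now fits within max_len.
    """
    history = list(history)  # copy so the caller's list is not modified

    def total():
        turns = sum(len(turn) for turn in history)
        return len(persona) + turns + len(reply) + reserved

    # Always keep at least the most recent turn.
    while total() > max_len and len(history) > 1:
        history.pop(0)  # discard the oldest turn first
    return history, total() <= max_len


persona = [0] * 100
history = [[1] * 200, [2] * 200, [3] * 50]
reply = [4] * 30
kept, fits = keep_recent_turns(persona, history, reply)
print(len(kept), fits)
```

If the result still does not fit with a single turn left, the caller has to decide what else to shorten, for example the persona or the remaining turn itself.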

Yes, there would be a loss of information if you truncated the dialog context this way, but it is something you could live with from a performance point of view.


ioannist commented on July 22, 2024

Had the same problem. My strategy was to pop history items up to a minimum of one item, and then history items up to a minimum of one item, until the sequence length was 512 or lower.


AniAggarwal commented on July 22, 2024

@ioannist, @g-karthik, or anyone else who managed to make their truncation work, could you share the code with me? I'm unable to make the truncation work despite the lengths of each of the elements in the sequence being truncated to 512.

I tried a simple slice:

sequence[0] = sequence[0][:512]
sequence[1] = sequence[1][:512]
sequence[2] = sequence[2][:512]

but that didn't seem to work. I also tried a number of other things, but I was mainly fumbling around in the dark.

Any pointers or code for this?

Thanks in advance!

Edit: I tried the following, which got me around the length issue, but now I run into an assertion error.
Code I wrote to truncate:

from itertools import chain

# Pop one token at a time until the flattened sequence fits in 512 positions.
input_ids_len = len(list(chain(*sequence)))
while input_ids_len > 512:
    if len(sequence[1]) > 0:
        sequence[1].pop()
    elif len(sequence[2]) > 0:
        sequence[2].pop()
    elif len(sequence[0]) > 0:
        sequence[0].pop()
    else:
        break  # nothing left to pop in the first three segments
    input_ids_len = len(list(chain(*sequence)))

Error Traceback:

Traceback (most recent call last):
  File "transfer-learning-conv-ai/train.py", line 308, in <module>
    train()
  File "transfer-learning-conv-ai/train.py", line 212, in train
    train_loader, val_loader, train_sampler, valid_sampler = get_data_loaders(args, tokenizer)
  File "transfer-learning-conv-ai/train.py", line 147, in get_data_loaders
    train_dataset, valid_dataset = TensorDataset(*tensor_datasets["train"]), TensorDataset(*tensor_datasets["valid"])
  File "C:\Anaconda3\miniconda3\envs\chatbot\lib\site-packages\torch\utils\data\dataset.py", line 166, in __init__
    assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
AssertionError
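
The assertion in the traceback checks that every tensor passed to TensorDataset has the same size along its first dimension. The thread does not say which input ended up with a different number of rows, so the sketch below only shows how one might find out. `mismatched_inputs` is a hypothetical helper and the input names in the toy data are illustrative.

```python
def mismatched_inputs(named_columns):
    """Return {name: n_rows} for inputs whose row count differs from the first.

    named_columns maps an input name to a list or tensor; len() of a
    tensor is its size along the first dimension.
    """
    sizes = {name: len(rows) for name, rows in named_columns.items()}
    expected = next(iter(sizes.values()), None)
    return {name: n for name, n in sizes.items() if n != expected}


# Toy data: three examples for two inputs, but only two for the third.
columns = {
    "input_ids": [[1, 2], [3, 4], [5, 6]],
    "token_type_ids": [[0, 0], [0, 0], [0, 0]],
    "lm_labels": [[1, 2], [3, 4]],
}
print(mismatched_inputs(columns))
```

Calling a helper like this on each split just before the TensorDataset is built would name the input whose example count diverged.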


RuntimeError: cublas runtime error : resource allocation failed at THCGeneral.cpp:250 about transfer-learning-conv-ai HOT 17 CLOSED
