
Comments (17)

g-karthik avatar g-karthik commented on July 22, 2024 3

@thomwolf I think I have figured this out. Let me know what you think.

instance["input_ids"] = list(chain(*sequence))

If len(instance["input_ids"]) above is greater than 512 (the default value of n_positions in OpenAIGPTConfig in modeling_openai.py in pytorch-pretrained-bert), then the position_ids created at the link below will contain values of 512 or more, which are out of range for the position embedding.

https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L620

Consequently, this line (https://github.com/huggingface/pytorch-pretrained-BERT/blob/372a5c1ceec49b52c503707e9657bfaae7c236a0/pytorch_pretrained_bert/modeling_openai.py#L633) will fail.

I think you need to add truncation logic for the sequence above prior to doing instance["input_ids"] = list(chain(*sequence)) so that the length is always less than or equal to 512.
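
A minimal sketch of what such truncation logic could look like (an illustration only, not the repo's actual code; it assumes sequence is the list of token-id lists being chained above and that trimming tokens off the longest segment is acceptable for your data):

from itertools import chain

MAX_LEN = 512  # n_positions of the default OpenAI GPT config

def truncate_sequence(sequence, max_len=MAX_LEN):
    # Drop tokens from the end of the longest segment until the flattened
    # sequence fits within the position-embedding limit.
    while len(list(chain(*sequence))) > max_len:
        longest = max(sequence, key=len)
        longest.pop()
    return sequence

sequence = truncate_sequence(sequence)
instance["input_ids"] = list(chain(*sequence))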

from transfer-learning-conv-ai.

sashank06 avatar sashank06 commented on July 22, 2024 1

@g-karthik I think the device-side assert is triggered by the position embedding, which is limited to 512 positions, while the sequence length of your inputs in the training loader is greater than that.

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

Since the above stack trace ends at the torch.addmm() call in modeling_openai.py, I hypothesized that perhaps the addmm() op doesn't play well with the pytorch_p36 env on my EC2 instance. So I tried to reproduce the above RuntimeError in the interpreter on the instance (with the pytorch_p36 env activated).

>>> import torch
>>> a = torch.randn(2,3)
>>> b = torch.randn(3,2)
>>> c = torch.randn(2,2)
>>> x = torch.addmm(c,a,b)

The above worked perfectly fine. So I do not think the issue is due to the addmm() op itself, but I'm unsure how to proceed.

My best hypothesis is that the matrix multiplication happening internally is allocating memory and causing this failure, but if that were the case, we'd see many more similar issues on GitHub :)

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

I printed the shapes of the tensors self.bias , x.view(-1, x.size(-1)) and self.weight right before the above failure in the forward pass. The shapes were torch.Size([2304]), torch.Size([12560, 768]) and torch.Size([768, 2304]) respectively.

So I did the following test:

>>> a = torch.randn(2304).cuda()
>>> b = torch.randn(12560, 768).cuda()
>>> c = torch.randn(768, 2304).cuda()
>>> a.shape
torch.Size([2304])
>>> b.shape
torch.Size([12560, 768])
>>> c.shape
torch.Size([768, 2304])
>>> x = torch.addmm(a,b,c)
>>>

The above test worked fine - no CUDA resource allocation errors. So I really am clueless at this point.

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

@thomwolf do you have any thoughts?

from transfer-learning-conv-ai.

thomwolf avatar thomwolf commented on July 22, 2024

You can try with:

  • a batch size of 1, to make sure you are not simply running out of memory (--train_batch_size 1 --valid_batch_size 1)
  • CUDA_LAUNCH_BLOCKING=1 before the python train.py command, to pinpoint exactly where the error comes from

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

from transfer-learning-conv-ai.

thomwolf avatar thomwolf commented on July 22, 2024

And with CUDA_LAUNCH_BLOCKING=1, what exact error message do you get?

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

@thomwolf here's a part of the error I get with that flag set:

/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1544199946412/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [85,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: CUDA error: device-side assert triggered.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: CUDA error: device-side assert triggered.
Traceback (most recent call last):
  File "train.py", line 358, in <module>
    train()
  File "train.py", line 349, in train
    trainer.run(train_loader, max_epochs=args.n_epochs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
    hours, mins, secs = self._run_once_on_dataset()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
    self.state.output = self._process_function(self, batch)
  File "train.py", line 275, in update
    lm_loss, mc_loss = model(*batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 811, in forward
    hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 639, in forward
    token_type_embeds = self.tokens_embed(token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

Okay, I think the reason I'm hitting these embedding-matrix index-out-of-bounds errors is that my dataset has really long utterances, and the default OpenAIGPTConfig (n_positions=512) is too small to handle them.
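
For anyone hitting the same assert, here is a minimal repro of that failure mode (it assumes nothing about the repo, just an embedding table with 512 rows indexed by a longer sequence of position ids):

import torch
import torch.nn as nn

n_positions = 512                              # size of the position embedding table
positions_embed = nn.Embedding(n_positions, 768)

position_ids = torch.arange(600).unsqueeze(0)  # sequence longer than n_positions
try:
    positions_embed(position_ids)              # indices 512..599 are out of bounds
except IndexError as err:
    # On CPU this raises an IndexError; on GPU the same out-of-range lookup
    # surfaces as the "srcIndex < srcSelectDimSize" device-side assert above.
    print(err)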

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

I added some truncation code and managed to resolve the error corresponding to the previous stack trace. But I'm now facing a different stack trace (with CUDA_LAUNCH_BLOCKING=1):

/opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCTensorIndex.cu:362: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [637,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR:ignite.engine.engine.Engine:Current run is terminating due to exception: CUDA error: device-side assert triggered.
ERROR:ignite.engine.engine.Engine:Engine run is terminating due to exception: CUDA error: device-side assert triggered.
Traceback (most recent call last):
  File "train.py", line 376, in <module>
    train()
  File "train.py", line 367, in train
    trainer.run(train_loader, max_epochs=args.n_epochs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 388, in run
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 375, in run
    hours, mins, secs = self._run_once_on_dataset()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 341, in _run_once_on_dataset
    self._handle_exception(e)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 352, in _handle_exception
    raise e
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/ignite/engine/engine.py", line 333, in _run_once_on_dataset
    self.state.output = self._process_function(self, batch)
  File "train.py", line 293, in update
    lm_loss, mc_loss = model(*batch)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 808, in forward
    hidden_states = self.transformer(input_ids, position_ids, token_type_ids)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling_openai.py", line 641, in forward
    hidden_states = inputs_embeds + position_embeds + token_type_embeds
RuntimeError: CUDA error: device-side assert triggered

As a quick next step, I'm going to print the shapes of inputs_embeds, position_embeds and token_type_embeds to see if they're all the same shape.

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

Not sure why I got the above failure, but things seem to be working fine now.

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

@sashank06 yes that is correct, I had realized it back then after a bunch of debugging, and I forgot to update this issue! Thanks for the reminder!

from transfer-learning-conv-ai.

dishavarshney082 avatar dishavarshney082 commented on July 22, 2024

> @sashank06 yes that is correct, I had realized it back then after a bunch of debugging, and I forgot to update this issue! Thanks for the reminder!

How did you do the truncation? I mean, wouldn't there be a loss of information if we just drop the words beyond a length of 512?

from transfer-learning-conv-ai.

g-karthik avatar g-karthik commented on July 22, 2024

@geekygirl123 You can truncate the individual components of your input_ids while ensuring len(input_ids) <= 512 for GPT. For example, you can truncate the dialog context to just the most recent turns.

Yes, there would be some loss of information if you truncated the dialog context this way, but it is usually something you can live with from a performance point of view.
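
A minimal sketch of that kind of context truncation (the helper below is hypothetical and only illustrative; it assumes persona_ids and reply_ids are flat lists of token ids and history is a list of per-turn token-id lists):

from itertools import chain

MAX_LEN = 512  # GPT position-embedding limit

def truncate_context(persona_ids, history, reply_ids, max_len=MAX_LEN):
    # Drop the oldest dialog turns first, always keeping the most recent one.
    def total_len():
        return len(persona_ids) + len(list(chain(*history))) + len(reply_ids)
    while total_len() > max_len and len(history) > 1:
        history = history[1:]
    return persona_ids, history, reply_ids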

from transfer-learning-conv-ai.

ioannist avatar ioannist commented on July 22, 2024

Had the same problem. My strategy was to pop history items, down to a minimum of one item, until the sequence length was 512 or lower.

from transfer-learning-conv-ai.

AniAggarwal avatar AniAggarwal commented on July 22, 2024

@ioannist, @g-karthik, or anyone else who managed to make their truncation work, could you share the code? I'm unable to get the truncation working even though each element of the sequence is truncated to a length of 512.

I tried a simple slice:

sequence[0] = sequence[0][:512]
sequence[1] = sequence[1][:512]
sequence[2] = sequence[2][:512]

but that didn't seem to work. I also tried a number of other things, but I was mainly fumbling around in the dark.

Any pointers or code for this?

Thanks in advance!

Edit: I tried the following, which got me around the length issue, but now I run into an assertion error.
Code I wrote to truncate:

input_ids_len = len(list(chain(*sequence)))
while input_ids_len > 512:
    if len(sequence[1]) > 0:
        sequence[1].pop()
    elif len(sequence[2]) > 0:
        sequence[2].pop()
    elif len(sequence[0]) > 0:
        sequence[0].pop()
    input_ids_len = len(list(chain(*sequence)))

Error Traceback:

Traceback (most recent call last):
  File "transfer-learning-conv-ai/train.py", line 308, in <module>
    train()
  File "transfer-learning-conv-ai/train.py", line 212, in train
    train_loader, val_loader, train_sampler, valid_sampler = get_data_loaders(args, tokenizer)
  File "transfer-learning-conv-ai/train.py", line 147, in get_data_loaders
    train_dataset, valid_dataset = TensorDataset(*tensor_datasets["train"]), TensorDataset(*tensor_datasets["valid"])
  File "C:\Anaconda3\miniconda3\envs\chatbot\lib\site-packages\torch\utils\data\dataset.py", line 166, in __init__
    assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
AssertionError
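
From the trace, the assert is TensorDataset checking that every tensor it is given has the same first (number-of-examples) dimension, so the per-field tensors apparently end up with different numbers of rows. A minimal repro of that check, with made-up shapes:

import torch
from torch.utils.data import TensorDataset

input_ids      = torch.zeros(100, 2, 512, dtype=torch.long)  # 100 examples
token_type_ids = torch.zeros(98, 2, 512, dtype=torch.long)   # 98 examples: mismatch

try:
    TensorDataset(input_ids, token_type_ids)
except AssertionError:
    print("size(0) mismatch: truncate/pad every field of the instance consistently")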

from transfer-learning-conv-ai.
