Comments (14)
To unblock this, here's the dict.txt file. Put it in the same folder as the .pt files.
https://gist.github.com/stephenroller/fbc74423445091531aa6b0452f5efaa2
from metaseq.
Thanks, Stephen! That definitely gets me farther. Now I'm stuck on the next issue:
File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized
Which seems like the same problem Patrick is running into here (I'm also trying with the smallest model for now).
But closing this issue. Thanks for the help.
from metaseq.
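The AssertionError in the traceback above comes from Megatron's guard pattern: the process-group global starts as None and is only populated by an explicit model-parallel initialization call, so touching any mpu layer before that call trips the assert. A minimal sketch of the pattern (simplified stand-in, not Megatron-LM's actual distributed setup):

```python
# Simplified sketch of Megatron-style model-parallel group guards.
# Real Megatron-LM builds torch.distributed process groups in
# initialize_model_parallel(); here we only model the None-until-initialized
# global to show why the assert fires.

_TENSOR_MODEL_PARALLEL_GROUP = None

def initialize_model_parallel(tensor_model_parallel_size=1):
    """Populate the global group (stand-in for the real torch.distributed setup)."""
    global _TENSOR_MODEL_PARALLEL_GROUP
    _TENSOR_MODEL_PARALLEL_GROUP = ("tensor_parallel_group", tensor_model_parallel_size)

def get_tensor_model_parallel_group():
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
        "intra_layer_model parallel group is not initialized"
    return _TENSOR_MODEL_PARALLEL_GROUP

# Calling the getter before initialization reproduces the error in the thread:
try:
    get_tensor_model_parallel_group()
except AssertionError as e:
    print(e)  # intra_layer_model parallel group is not initialized

# After initialization the getter succeeds:
initialize_model_parallel(tensor_model_parallel_size=8)
print(get_tensor_model_parallel_group()[1])  # 8
```

So the fix discussed below (setting --model-parallel to match the checkpoint) matters because it drives that initialization step before any mpu layer is constructed.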
You need to specify --model-parallel N based on the settings of the particular model. (2 for 30B, 8 for 175B.)
from metaseq.
About the problem dict.txt
, I found that when point LOCAL_SSD
to a different path with that of MODEL_SHARED_FOLDER
, the program will automatically generate a dict.txt
in the folder of LOCAL_SSD
. But this would not happen if LOCAL_SSD = None
or LOCAL_SSD = MODEL_SHARED_FOLDER
. Is this strange behavior a bug?
from metaseq.
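One plausible reading of the behavior described above (this is a guess at the caching logic, not metaseq's actual code): when LOCAL_SSD points somewhere other than MODEL_SHARED_FOLDER, the launcher treats it as a local cache and materializes assets such as dict.txt there; when LOCAL_SSD is None or equal to the shared folder, no copy step runs, so nothing new appears. A sketch of that pattern:

```python
import os
import shutil

def maybe_cache_assets(model_shared_folder, local_ssd, assets=("dict.txt",)):
    """Hypothetical copy-to-local-cache step (illustrative, not metaseq's code).

    Assets are copied to local_ssd only when it is set and differs from
    model_shared_folder -- matching the behavior described in the thread.
    Returns the list of paths that were copied.
    """
    if local_ssd is None or os.path.abspath(local_ssd) == os.path.abspath(model_shared_folder):
        return []  # read directly from the shared folder; nothing materialized
    os.makedirs(local_ssd, exist_ok=True)
    copied = []
    for name in assets:
        src = os.path.join(model_shared_folder, name)
        dst = os.path.join(local_ssd, name)
        if os.path.exists(src) and not os.path.exists(dst):
            shutil.copyfile(src, dst)
            copied.append(dst)
    return copied
```

Under that reading, a dict.txt showing up in LOCAL_SSD would be expected caching behavior rather than a bug, but only the metaseq source can confirm it.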
Hi Hunter, were you able to solve this problem? I have been getting the same error. If so, would you mind sharing how you solved it? Thanks
from metaseq.
It looks like you need to set --model-parallel 8
Where should we set this parameter?
from metaseq.
@hunterlang, @aarush7 , @stephenroller , @guialfaro053, @ParadoxZW hey, were any of you able to solve this issue?
I am trying to load the 1.3B model but I am getting this error: AssertionError: intra_layer_model parallel group is not initialized
from metaseq.
I actually managed to run it after creating a new conda env and installing everything again.
I remember I had some conflicts between the Apex library and Fairseq.
from metaseq.
I have tried creating the conda env twice.
Can you please suggest the steps you followed to create your conda env, if you remember?
from metaseq.
Actually, I am trying to run it on a single GPU; if that is the issue, please suggest a solution.
from metaseq.
I installed all the needed libraries from here. If there is an error while installing them, you won't be able to run BB3. It can take a while to install everything.
Also, when I ran BB3 30B (it's a big model), I used two A100 40GB GPUs, so I really doubt you can run it on a single GPU without high specs.
from metaseq.
Actually, I am trying to load 2.7B or 6.7B, and I think that will work on an A10G 24GB.
I have followed the same setup guide, but it is not working for me.
from metaseq.
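A quick back-of-the-envelope check on whether a given model size fits a GPU, counting only fp16 weights at 2 bytes per parameter (activations, optimizer state, KV cache, and framework overhead all add on top, so real usage is higher):

```python
def fp16_weight_gib(n_params_billion):
    """Approximate GiB needed just for fp16 weights (2 bytes per parameter)."""
    return n_params_billion * 1e9 * 2 / 2**30

for size in (1.3, 2.7, 6.7, 30.0):
    print(f"{size}B params -> ~{fp16_weight_gib(size):.1f} GiB of fp16 weights")
```

This is consistent with the thread: ~56 GiB of weights for the 30B model explains needing two A100 40GB cards, while ~12.5 GiB for 6.7B leaves headroom on a 24 GB A10G before activations and overhead are counted.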
Any update? I am encountering the same problem.
from metaseq.