
Comments (14)

stephenroller avatar stephenroller commented on June 9, 2024 1

To unblock this, here's the dict.txt file. Put it in the same folder as the .pt files.

https://gist.github.com/stephenroller/fbc74423445091531aa6b0452f5efaa2
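
In case it helps anyone, here is a minimal sketch of what "put it in the same folder as the .pt files" means, assuming you have already saved dict.txt from the gist; both paths below are placeholders for your own setup, not metaseq defaults.

    # Minimal sketch (not metaseq code): place the downloaded dict.txt next to
    # the checkpoint shards. Both paths are placeholders; adjust to your setup.
    import shutil
    from pathlib import Path

    downloaded_dict = Path("~/Downloads/dict.txt").expanduser()  # saved from the gist above
    checkpoint_dir = Path("/path/to/opt/checkpoints")            # folder that holds the .pt files

    assert any(checkpoint_dir.glob("*.pt")), "expected the model shard .pt files here"
    shutil.copy(downloaded_dict, checkpoint_dir / "dict.txt")
    print("dict.txt placed at", checkpoint_dir / "dict.txt")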

from metaseq.

stephenroller avatar stephenroller commented on June 9, 2024 1

Thanks, Stephen! That definitely gets me farther. Now I'm stuck on the next issue:

  File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
    self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

Which seems like the same problem Patrick is running into here (I'm also trying with the smallest model for now).
But closing this issue. Thanks for the help.

Hi Hunter, were you able to solve this problem? I have been getting the same error. If so, would you mind sharing how you solved it? Thanks

You should just need to set --model-parallel 8

from metaseq.

hunterlang avatar hunterlang commented on June 9, 2024

Thanks, Stephen! That definitely gets me farther. Now I'm stuck on the next issue:

  File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
    self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

Which seems like the same problem Patrick is running into here (I'm also trying with the smallest model for now).

But closing this issue. Thanks for the help.

from metaseq.

stephenroller avatar stephenroller commented on June 9, 2024

You need to specify --model-parallel N based on the settings of the particular model. (2 for 30B, 8 for 175B.)
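
For context on why the assertion fires: Megatron's parallel layers look up the tensor-model-parallel process group when they are constructed, and that group only exists after the model-parallel setup has run with the right size. Below is a rough illustrative sketch of that setup, not the actual metaseq launch path (which normally runs multi-process over nccl); the backend and sizes here are assumptions for a single-process example.

    # Illustrative sketch: the group the assertion complains about is created by
    # Megatron's initialize_model_parallel(), which needs torch.distributed to be
    # initialized first. If this never runs (e.g. --model-parallel is missing or
    # wrong), _TENSOR_MODEL_PARALLEL_GROUP stays None and the AssertionError fires.
    import os
    import torch
    from megatron import mpu  # Megatron-LM's model-parallel utilities

    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)

    # Use the model's actual value here: 2 for the 30B checkpoints, 8 for 175B.
    mpu.initialize_model_parallel(1)
    print("tensor model parallel size:", mpu.get_tensor_model_parallel_world_size())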

from metaseq.

ParadoxZW avatar ParadoxZW commented on June 9, 2024

About the dict.txt problem: I found that when LOCAL_SSD points to a different path than MODEL_SHARED_FOLDER, the program automatically generates a dict.txt in the LOCAL_SSD folder. But this does not happen if LOCAL_SSD = None or LOCAL_SSD = MODEL_SHARED_FOLDER. Is this strange behavior a bug?

from metaseq.

guialfaro053 avatar guialfaro053 commented on June 9, 2024

Thanks, Stephen! That definitely gets me farther. Now I'm stuck on the next issue:

  File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
    self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

Which seems like the same problem Patrick is running into here (I'm also trying with the smallest model for now).

But closing this issue. Thanks for the help.

Hi Hunter, were you able to solve this problem? I have been getting the same error. If so, would you mind sharing how you solved it? Thanks

from metaseq.

aarush7 avatar aarush7 commented on June 9, 2024

Thanks, Stephen! That definitely gets me farther. Now I'm stuck on the next issue:

  File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
    self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

Which seems like the same problem Patrick is running into here (I'm also trying with the smallest model for now).
But closing this issue. Thanks for the help.

Hi Hunter, were you able to solve this problem? I have been getting the same error. If so, would you mind sharing how you solved it? Thanks

You should just need to set --model-parallel 8

Where should we set this parameter?

from metaseq.

Dev-hestabit avatar Dev-hestabit commented on June 9, 2024

@hunterlang, @aarush7, @stephenroller, @guialfaro053, @ParadoxZW hey, were you able to solve this issue?
I am trying to load the 1.3B model but I'm getting this error: (AssertionError: intra_layer_model parallel group is not initialized)

from metaseq.

guialfaro053 avatar guialfaro053 commented on June 9, 2024

@hunterlang, @aarush7, @stephenroller, @guialfaro053, @ParadoxZW hey, were you able to solve this issue?
I am trying to load the 1.3B model but I'm getting this error: (AssertionError: intra_layer_model parallel group is not initialized)

I actually managed to run it after creating a new conda env and installing everything again.
I remember I had some conflicts between the Apex library and Fairseq.

from metaseq.

Dev-hestabit avatar Dev-hestabit commented on June 9, 2024

I have tried creating the conda env twice.
Can you please suggest the steps you followed to create your conda env, if you remember?

from metaseq.

Dev-hestabit avatar Dev-hestabit commented on June 9, 2024

Actually, I am trying to run it on a single GPU. If that is the issue, please suggest a solution.

from metaseq.

guialfaro053 avatar guialfaro053 commented on June 9, 2024

I installed all the needed libraries from here. If there is an error while installing them, you won't be able to run BB3. It can take a while to install everything.
Also, when I ran BB3 30B (it's a big model), I used two A100 40GB GPUs, so I really doubt you can run it on a single GPU without high specs.

from metaseq.

Dev-hestabit avatar Dev-hestabit commented on June 9, 2024

Actually, I am trying to load the 2.7B or 6.7B model, and I think it will work on an A10G 24GB.
I have followed the same setup guide, but it is not working for me.
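
As a rough sanity check on the memory side, here is a tiny back-of-envelope sketch, assuming fp16 weights (2 bytes per parameter) and ignoring activations, the KV cache, and any framework overhead:

    # Back-of-envelope weight footprint, assuming fp16 (2 bytes per parameter).
    for params_b in (2.7, 6.7):
        weight_gib = params_b * 1e9 * 2 / 1024**3
        print(f"{params_b}B params -> ~{weight_gib:.1f} GiB of fp16 weights")
    # 2.7B -> ~5.0 GiB, 6.7B -> ~12.5 GiB: the weights alone fit in 24 GB,
    # but generation still needs headroom on top of that.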

from metaseq.

xyjigsaw avatar xyjigsaw commented on June 9, 2024

Any update? I'm encountering the same problem.

from metaseq.
