Comments (14)
To unblock this, here's the dict.txt file. Put it in the same folder as the .pt files.
https://gist.github.com/stephenroller/fbc74423445091531aa6b0452f5efaa2
from metaseq.
Thanks, Stephen! That definitely gets me farther. Now I'm stuck on the next issue:
File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized
Which seems like the same problem Patrick is running into here (I'm also trying with the smallest model for now).
But closing this issue. Thanks for the help.
from metaseq.
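The AssertionError in the traceback above comes from Megatron's guard pattern: the process-group global starts as None and is only populated by an explicit model-parallel initialization call, so touching any mpu layer before that call trips the assert. A minimal sketch of the pattern (simplified stand-in, not Megatron-LM's actual distributed setup):

```python
# Simplified sketch of Megatron-style model-parallel group guards.
# Real Megatron-LM builds torch.distributed process groups in
# initialize_model_parallel(); here we only model the None-until-initialized
# global to show why the assert fires.

_TENSOR_MODEL_PARALLEL_GROUP = None

def initialize_model_parallel(tensor_model_parallel_size=1):
    """Populate the global group (stand-in for the real torch.distributed setup)."""
    global _TENSOR_MODEL_PARALLEL_GROUP
    _TENSOR_MODEL_PARALLEL_GROUP = ("tensor_parallel_group", tensor_model_parallel_size)

def get_tensor_model_parallel_group():
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
        "intra_layer_model parallel group is not initialized"
    return _TENSOR_MODEL_PARALLEL_GROUP

# Calling the getter before initialization reproduces the error in the thread:
try:
    get_tensor_model_parallel_group()
except AssertionError as e:
    print(e)  # intra_layer_model parallel group is not initialized

# After initialization the getter succeeds:
initialize_model_parallel(tensor_model_parallel_size=8)
print(get_tensor_model_parallel_group()[1])  # 8
```

So the fix discussed below (setting --model-parallel to match the checkpoint) matters because it drives that initialization step before any mpu layer is constructed.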
You need to specify --model-parallel N based on the settings of the particular model. (2 for 30B, 8 for 175B.)
from metaseq.
About the problem dict.txt
, I found that when point LOCAL_SSD
to a different path with that of MODEL_SHARED_FOLDER
, the program will automatically generate a dict.txt
in the folder of LOCAL_SSD
. But this would not happen if LOCAL_SSD = None
or LOCAL_SSD = MODEL_SHARED_FOLDER
. Is this strange behavior a bug?
from metaseq.
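One plausible reading of the behavior described above (this is a guess at the caching logic, not metaseq's actual code): when LOCAL_SSD points somewhere other than MODEL_SHARED_FOLDER, the launcher treats it as a local cache and materializes assets such as dict.txt there; when LOCAL_SSD is None or equal to the shared folder, no copy step runs, so nothing new appears. A sketch of that pattern:

```python
import os
import shutil

def maybe_cache_assets(model_shared_folder, local_ssd, assets=("dict.txt",)):
    """Hypothetical copy-to-local-cache step (illustrative, not metaseq's code).

    Assets are copied to local_ssd only when it is set and differs from
    model_shared_folder -- matching the behavior described in the thread.
    Returns the list of paths that were copied.
    """
    if local_ssd is None or os.path.abspath(local_ssd) == os.path.abspath(model_shared_folder):
        return []  # read directly from the shared folder; nothing materialized
    os.makedirs(local_ssd, exist_ok=True)
    copied = []
    for name in assets:
        src = os.path.join(model_shared_folder, name)
        dst = os.path.join(local_ssd, name)
        if os.path.exists(src) and not os.path.exists(dst):
            shutil.copyfile(src, dst)
            copied.append(dst)
    return copied
```

Under that reading, a dict.txt showing up in LOCAL_SSD would be expected caching behavior rather than a bug, but only the metaseq source can confirm it.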
Hi Hunter, were you able to solve this problem? I have been getting the same error. If so, would you mind sharing how you solved it? Thanks
from metaseq.
It looks like you need to set --model-parallel 8
Where should we set this parameter?
from metaseq.
@hunterlang, @aarush7 , @stephenroller , @guialfaro053, @ParadoxZW hey, were any of you able to solve this issue?
I am trying to load the 1.3B model but I am getting this error: AssertionError: intra_layer_model parallel group is not initialized
from metaseq.
I actually managed to run it after creating a new conda env and installing everything again.
I remember I had some conflicts between the Apex library and Fairseq.
from metaseq.
I have tried creating the conda env twice.
Can you please suggest the steps you followed to create your conda env, if you remember?
from metaseq.
Actually, I am trying to run it on a single GPU; if that is the issue, please suggest a solution.
from metaseq.
I installed all the needed libraries from here. If there is an error while installing them, you won't be able to run BB3. It can take a while to install everything.
Also, when I ran BB3 30B (it's a big model), I used two A100 40GB GPUs, so I really doubt you can run it on a single GPU without high specs.
from metaseq.
Actually, I am trying to load 2.7B or 6.7B, and I think that will work on an A10G 24GB.
I have followed the same setup guide, but it is not working for me.
from metaseq.
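A quick back-of-the-envelope check on whether a given model size fits a GPU, counting only fp16 weights at 2 bytes per parameter (activations, optimizer state, KV cache, and framework overhead all add on top, so real usage is higher):

```python
def fp16_weight_gib(n_params_billion):
    """Approximate GiB needed just for fp16 weights (2 bytes per parameter)."""
    return n_params_billion * 1e9 * 2 / 2**30

for size in (1.3, 2.7, 6.7, 30.0):
    print(f"{size}B params -> ~{fp16_weight_gib(size):.1f} GiB of fp16 weights")
```

This is consistent with the thread: ~56 GiB of weights for the 30B model explains needing two A100 40GB cards, while ~12.5 GiB for 6.7B leaves headroom on a 24 GB A10G before activations and overhead are counted.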
Any update? I am encountering the same problem.
from metaseq.