
Loading models · metaseq (14 comments, closed)

facebookresearch commented on June 18, 2024
Loading models

from metaseq.

Comments (14)

patrickvonplaten commented on June 18, 2024

At HF we got the 350m checkpoint working ;-)

https://github.com/patrickvonplaten/metaseq/blob/main/README.md#7-how-to-run-the-350-model



stephenroller commented on June 18, 2024

There is no way of getting around the process-group init that will work.

I'm going to take a look at getting these models into a vanilla, non-model-parallel format, but I have many pressing demands.

The sketch of the solution, if someone wants to implement it, is to slightly modify the consolidate-FSDP script to load the flattened parameters and then peek inside the wrapper to get the non-flattened parameters (i.e. model.module.state_dict()). The latest OPT README shows how to use the consolidate script.
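To make the sketch above concrete, here is a framework-free toy version of the consolidation idea. All names (`unflatten`, `param_shapes`) are hypothetical, and plain lists stand in for tensors; in metaseq this would operate on real FSDP checkpoint shards:

```python
# Each rank saves one flat 1-D buffer of its parameter slice. Consolidation
# concatenates the shards, then "peeks inside the wrapper" by slicing the
# flat buffer back into named, shaped parameters.

def unflatten(flat, param_shapes):
    """Split one flat list of numbers into a name -> values state dict."""
    state_dict, offset = {}, 0
    for name, shape in param_shapes.items():
        numel = 1
        for dim in shape:
            numel *= dim
        state_dict[name] = flat[offset:offset + numel]
        offset += numel
    return state_dict

# Two "shards" of a flat parameter buffer, as saved by two ranks:
shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat = [x for shard in shards for x in shard]  # consolidate

# Shape metadata that the real script would read from the checkpoint:
param_shapes = {"embed.weight": (2, 2), "out.bias": (2,)}

state_dict = unflatten(flat, param_shapes)
print(state_dict)
# {'embed.weight': [1.0, 2.0, 3.0, 4.0], 'out.bias': [5.0, 6.0]}
```

The real script additionally needs the per-shard metadata to know which slice of which parameter each rank holds, which is exactly the mapping discussed below.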


patrickvonplaten commented on June 18, 2024

@suchenzang @stephenroller is there any way you could send us or open-source the param_metadata dicts?

I've tried for quite some time now to reproduce the correct parameter mapping, without much success.

It's not really stated how many GPUs (what world_size) the models other than 175B were trained on, nor was I able to reproduce the parameter mapping.

Also, there is one thing I don't fully understand: I can load a randomly initialized model according to the model config in state["cfg"], but this random model then has significantly fewer parameters than the sharded checkpoints. E.g. for the 125M model, the parameters of the two checkpoints sum to more than 126M, even though the randomly initialized model has (the correct) 125M parameters.

It would be extremely useful if you could provide some kind of script that allows loading the sharded checkpoints on CPU :-)
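One possible explanation for the parameter-count mismatch (an assumption on my part, not confirmed in this thread): FSDP-style flat parameter buffers are typically padded so they divide evenly across the world size, and the padding is saved along with the shards. A toy stdlib sketch of the arithmetic:

```python
# Toy illustration (numbers made up): a "model" with 10 parameters,
# sharded across a world size of 4. The flat buffer is padded so it
# splits evenly across ranks, and the padding ends up in the files.
world_size = 4
num_params = 10

pad = (-num_params) % world_size   # 2 padding elements
flat_len = num_params + pad        # 12
shard_len = flat_len // world_size # 3 per rank

total_in_checkpoints = shard_len * world_size
print(total_in_checkpoints)  # 12 -> more than the model's 10 parameters
```

If this is the cause, summing shard sizes will always slightly overcount, and the padding has to be dropped during consolidation.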


mrseeker commented on June 18, 2024

If that's the case, I can only assume an HF conversion is not far away? We (the KoboldAI team) managed to convert fairseq models to XGLM, and I assume it's the same situation here. In our experience, XGLM uses the same architecture as fairseq.


hunterlang commented on June 18, 2024

@patrickvonplaten how did you make patrickvonplaten/opt_gpt2_tokenizer? Is it just the default HF GPT2 tokenizer?


patrickvonplaten commented on June 18, 2024

@hunterlang, it's a GPT2Tokenizer that was loaded from the tokenizer files in https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/assets

I.e. patrickvonplaten/opt_gpt2_tokenizer just contains those two files, and you can then load it with our GPT2Tokenizer implementation.
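For anyone wondering what "those two files" are: a GPT-2-style tokenizer is defined by a vocab.json (token-to-id map) and a merges.txt (BPE merge rules). A toy stdlib sketch with made-up contents (the real files live in the metaseq assets directory linked above):

```python
import json, os, tempfile

# Write toy versions of the two files that define a GPT-2-style tokenizer
# (contents here are invented, purely to show the file formats):
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "vocab.json"), "w") as f:
    json.dump({"hello": 0, "world": 1, "he": 2, "llo": 3}, f)
with open(os.path.join(tmp, "merges.txt"), "w") as f:
    f.write("#version: 0.2\nh e\nhe llo\n")

# Read them back the way a tokenizer loader would: the vocab is a plain
# JSON dict, and each merges line (after the header) is one BPE merge rule.
with open(os.path.join(tmp, "vocab.json")) as f:
    vocab = json.load(f)
with open(os.path.join(tmp, "merges.txt")) as f:
    merges = f.read().splitlines()[1:]

print(len(vocab), merges)  # 4 ['h e', 'he llo']
```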


Fengwills commented on June 18, 2024

@patrickvonplaten I followed your code and README, but I encountered a RuntimeError in Megatron-LM/megatron/initialize.py at line 180:

```python
# Call the init process
torch.distributed.init_process_group(
    backend=args.distributed_backend,
    world_size=args.world_size, rank=args.rank,
    timeout=timedelta(days=7))
```

with world_size = 1 and rank = 0.

Environment:

  • PyTorch version: 1.10.1
  • OS: Linux
  • Build command (compiling from source): pip install -e .
  • Python version: 3.7.7
  • CUDA/cuDNN version: 11.0

Command line:

```shell
torchrun run_model.py --pipeline-model-parallel-size 1 --tensor-model-parallel-size 1
```

How can I load the model and run your code? Many thanks.


patrickvonplaten commented on June 18, 2024

Hey @Fengwills,

Yeah, we just commented out / removed that line of code in the Megatron repo.


Fengwills commented on June 18, 2024

@patrickvonplaten
Remove this line, right?

```python
# Call the init process
torch.distributed.init_process_group(
    backend=args.distributed_backend,
    world_size=args.world_size, rank=args.rank,
    timeout=timedelta(days=7))
```

I get a new bug:

AssertionError: Default process group is not initialized

It doesn't seem to work for me. Am I doing something wrong?
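A gentler alternative to deleting the call (a sketch, not tested against Megatron): keep it, but fall back to a trivial single-process group, so later code that asserts on the default group still finds one. With torch this would use dist.is_initialized() and dist.init_process_group(); the stand-in class below only exists to make the control flow runnable here:

```python
class FakeDist:
    """Stand-in for torch.distributed, only so this sketch can run."""
    def __init__(self):
        self.initialized = False
    def is_initialized(self):
        return self.initialized
    def init_process_group(self, world_size, rank):
        self.initialized = True
        self.world_size, self.rank = world_size, rank

dist = FakeDist()

# Instead of removing Megatron's init call outright, guard it and fall
# back to a 1-process group so later collectives still find a group:
if not dist.is_initialized():
    dist.init_process_group(world_size=1, rank=0)

print(dist.is_initialized(), dist.world_size)  # True 1
```

Removing the init entirely leaves nothing for downstream torch.distributed calls to use, which is exactly what the AssertionError above is complaining about.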


patrickvonplaten commented on June 18, 2024

Thanks for the hints here, @stephenroller!


rgzn-aiyun commented on June 18, 2024

> Thanks for the hints here @stephenroller !

Same problem.


stephenroller commented on June 18, 2024

I think maybe some of these checkpoints were saved with use_sharded_state=False, which means the checkpoints are pre-consolidated. The rank-0 file would then be waaaay larger, but the rest would be just a few kB. This is the code path that results in shards without that shard-metadata field.
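A quick sanity check for this case might look like the following sketch (the "shard_metadata" key name and the dict layout are assumptions, not metaseq's actual schema): look for the missing metadata field and for a rank-0 file that dwarfs the others.

```python
# Hypothetical shard dicts, as they might come out of loading the files;
# a pre-consolidated save puts the full weights in the rank-0 file and
# omits the per-shard metadata:
shards = [
    {"model": {"w": list(range(1000))}},  # rank 0: big
    {"model": {}},                        # other ranks: nearly empty
]

def looks_preconsolidated(shards):
    no_metadata = all("shard_metadata" not in s for s in shards)
    sizes = [len(s["model"].get("w", [])) for s in shards]
    rank0_dominates = sizes[0] > 10 * max(sizes[1:] + [0])
    return no_metadata and rank0_dominates

print(looks_preconsolidated(shards))  # True
```

If this check fires, the rank-0 file can be treated as an ordinary consolidated checkpoint and the other shards ignored.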


suchenzang commented on June 18, 2024

Closing this given #88, #78, and #77, which should cover this issue as well.

