facebookresearch / metaseq Goto Github PK

View Code? Open in Web Editor NEW

6.4K 111.0 719.0 26.34 MB

Repo for external large-scale work

License: MIT License

Python 80.21% Cython 0.73% HTML 0.61% Shell 0.09% Dockerfile 0.09% Scala 1.69% C++ 6.11% C 0.34% Cuda 10.13%

metaseq's Introduction

Metaseq

A codebase for working with Open Pre-trained Transformers, originally forked from fairseq.

Community Integrations

Using OPT with 🤗 Transformers

The OPT 125M--66B models are now available in Hugging Face Transformers. You can access them under the facebook organization on the Hugging Face Hub

Using OPT-175B with Alpa

The OPT 125M--175B models are now supported in the Alpa project, which enables serving OPT-175B with more flexible parallelisms on older generations of GPUs, such as 40GB A100, V100, T4, M60, etc.

Using OPT with Colossal-AI

The OPT models are now supported in the Colossal-AI, which helps users to efficiently and quickly deploy OPT models training and inference, reducing large AI model budgets and scaling down the labor cost of learning and deployment.

Using OPT with CTranslate2

The OPT 125M--66B models can be executed with CTranslate2, which is a fast inference engine for Transformer models. The project integrates the SmoothQuant technique to allow 8-bit quantization of OPT models. See the usage example to get started.

Using OPT with FasterTransformer

The OPT models can be served with FasterTransformer, a highly optimized inference framework written and maintained by NVIDIA. We provide instructions to convert OPT checkpoints into FasterTransformer format and a usage example with some benchmark results.

Using OPT with DeepSpeed

The OPT models can be finetuned using DeepSpeed. See the DeepSpeed-Chat example to get started.

Getting Started in Metaseq

Follow setup instructions here to get started.

Documentation on workflows

Training
API

Background Info

Support

If you have any questions, bug reports, or feature requests regarding either the codebase or the models released in the projects section, please don't hesitate to post on our Github Issues page.

Please remember to follow our Code of Conduct.

Contributing

We welcome PRs from the community!

You can find information about contributing to metaseq in our Contributing document.

The Team

Metaseq is currently maintained by the CODEOWNERS: Susan Zhang, Naman Goyal, Punit Singh Koura, Moya Chen, Kurt Shuster, David Esiobu, Igor Molybog, Peter Albert, Andrew Poulton, Nikolay Bashlykov, Binh Tang, Uriel Singer, Yuchen Zhang, Armen Aghajanya, Lili Yu, and Adam Polyak.

License

The majority of metaseq is licensed under the MIT license, however portions of the project are available under separate license terms:

Megatron-LM is licensed under the Megatron-LM license

metaseq's People

Contributors

Stargazers

Watchers

Forkers

entn-at anubrata zzarch sdi1982 uakbr techthiyanes patrickvonplaten estability plc-dev python-repository-hub saisurbehera richachoudhary shreyan1 ericsteinberger neuroidss aashiqmuhamed 1a3orn happyhappydaydayup straykat914 hesamgit piranha32 gendosplace fundou alanwangvt tianjunz samarsheikh001 dashstander julianangaritasuarez mitrofanovdmitry tulw4r nantero1 zhiqwang 34153320 cprakashagr amortx zhanglix qizhyuan cuoci bhushank enockipp timills bruinxiong chenjiewang peerdavid wh-forker cimsweb pkafma-aon oyelowo aylitat dumpmemory yutong-zhou-cv gtlee1106 abhyuday07 yuyangshu dst1213 learningpro shawnli paddoum fushenzhi joemzhao sharonng98 alepholiveira 00mjk weaties wingdi murilo yanndd1 chenxingqiang edsonrufino akkarimi ml-lab autogis-2018 tsor13 stjordanis munhouiani panayi absalan quoctran xing5 guillepaez53 genesisedge unimol ghosthamlet dongcf teehanming yousefazizi1982 changgeng-wei aotum snowwolfjay stevenchang8 uestcwkc chao-peng joseph-chan lipovsek swenkel thebennos dujiahong ayranamo hmoraes shenhao-stu

metaseq's Issues

[eval] Process hanging when eval 175B model on more than 1 nodes

🐛 Bug

Process got stuck and hanging when running eval on 175B model on more than 1 nodes.

Current available work around: change the following code in schedule_jobs_few_shot.py script to adjust number of nodes to 1. Eval is currently working when running with 1 node. It takes ~30 min to load the model when running with 1 node. Using 4 nodes reduce the model loading time to ~7min, so might still worth investigating the root cause to unblock eval on more than 1 nodes.

for model_name in AZURE:
  num_nodes = 1
  num_gpus = 1
  if "175B" in model_name:
    num_nodes = 4
    num_gpus = 8

To Reproduce

Login Azure
cd metaseq_internal (checkout main branch)
git checkout main
Run eval using schedule_jobs_few_shot.py script. This will kick off eval on 4 nodes with 8 GPUs each.

FSD=/data/xlmg/few_shot_data python ../metaseq-internal/metaseq_internal/scripts/eval/schedule_jobs_few_shot.py -o ~/evals/test_175B_gptz_reshard_again1_diff_hosts/ --slurm-partition hpc -m 175B_gptz_reshard  -t copa  --scoring sum --batch-size 8 --override-completed

The process got stuck and hanging.

Additional context

This only happens for 175B base model while 175B fine-tuned model works fine when running eval with 4 nodes 8 GPUs. The base model is almost 2 twice in size compared with fine-tuned model.

# base model
`"175B_gptz_reshard": gptz_sharded_config("/data/xlmg/models/175B/reshard.pt"),` 

# fine-tuned model
 "175B_none_eps_3e_05_n256": gptz_sharded_config(
        "/data/xlmg/models/final.novalid..gpt2.sbm_none.eps_3000.ckpt.tps_2048.175b.mu937.bsz4.uf1.dr0.1.atdr0.1.fp32adam.lr3e-05.ngpu256/checkpoint_last.pt",
        model_parallel_size=8,
    ),

175B base model works when running on one node. The problem only happens when node number > 1 (e.g num_nodes = 2 or num_nodes = 4)
Traced down the code and now narrows the issue point to fairseq/models/base_model.py -> class LanguageModel -> def forward(...): return self.decoder(src_tokens, **kwargs). This lines seems to get stuck. Need to further trace down to locate the root cause.


/shared/home/qingl/metaseq/fairseq/eval/gpt3_eval.py  -> def run_evaluations

=> def run_evaluation -> eval_predictions, metrics_scores = predictor.predict(eval_samples)

=> /shared/home/qingl/metaseq/fairseq/eval/predictors.py ->class CLMPromptingPredictor ->  def predict -> def predict_without_calibration
-> def predict_outputs -> def score_candidates -> local_hypotheses = self.model.generate( )

=> /shared/home/qingl/metaseq/fairseq/eval/hub_utils.py(203)generate() -> translations = self.task.inference_step(

=> /shared/home/qingl/metaseq/fairseq/tasks/language_modeling_inference_for_models_trained_with_streaming.py(365)inference_step() -> def inference_step -> generator.generate() 

=>/shared/home/qingl/metaseq/fairseq/sequence_scorer.py(28)generate() -> decoder_out = model(**net_input)

=> /shared/home/qingl/miniconda3/envs/fairseq-big38/lib/python3.8/site-packages/torch/nn/modules/module.py(1045)_call_impl() ->  return forward_call(*input, **kwargs)


=> /shared/home/qingl/fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py(1123)forward() -> outputs = self.module(*args, **kwargs) -> def module(self) -> FlattenParamsWrapper: -> return self._fsdp_wrapped_module  -> def _call_impl(self, *input, **kwargs): return forward_call(*input, **kwargs)


=> /shared/home/qingl/fairscale/fairscale/nn/misc/flatten_params_wrapper.py(457)forward()

=> /shared/home/qingl/metaseq/fairseq/models/base_model.py(357)forward() -> class LanguageModel -> def forward() return self.decoder(src_tokens, **kwargs)

Validate args to prevent RuntimeError: attn_batches % batches_per_block == 0

Error when running metaseq_internal/fb_sweep/sweep_openlm_baselines.py with model: Size(1, 4, 2, 2, int(0.1 * M), 6.0e-4, 2) in Megatron-LM: stack trace.

Possibly related: Batch size of 128 throws "batch_size should be a positive integer".

What's the difference between the two checkpoint name convention?

The released OPT-350M model has only one checkpoint named reshard.pt, but other OPT models have multiple checkpoints named reshard-model_part<i>.pt.

I downloaded OPT-1.3B, set variable MODEL_SHARDED_FOLDER properly (but not set LOCAL_SSD, so it equals to MODEL_SHARDED_FOLDER) in module metaseq.service.constants, and ran metaseq-api-local to serve it on my machine. But it tries to find local checkpoint named reshard.pt instead of sharded parts in the folder. Thus failed to start.

Is reshard.pt needed to run metaseq-api-local? If so, how to convert reshard-model_part<i>.pts to reshard.pt? Can I just read each shard and concat parameters with the same name?

Also, I find a guy that serves OPT-350M with reshard.pt successfully, but don't know any successful case in which reshard-model_part<i>.pts are used.

Remove dict.txt file dependency

Remove

metaseq/metaseq_cli/interactive_hosted.py

Lines 274 to 278 in 9afea52

 dict_path = os.path.join(os.path.dirname(CHECKPOINT_LOCAL), "dict.txt") 

 if not os.path.exists(dict_path): 

 with open(dict_path, "w+") as f: 

 f.write("\n".join([f"{i} 1" for i in range(4, 50271 + 1)])) 

 logger.info("Hackishly generated a dict.txt for use with 175B model")

hack

Can re-enable overriding dict.txt logic later.

Add kill-switch logic around time out errors

We are seeing cases where jobs will continue to stay alive after a series of

    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 4330, for key: store_based_barrier_key:522 (world_size=2048, worker_count=1792, timeout=0:30:00)

errors when initializing the job. We should add logic to terminate these runs to free up resources.

Refactor prompt generation to not require loading a model

Currently, we use this code fairseq/eval/predictors.py for prompt generation and it relies on the input model to this class. It would be good to remove this false dependency, especially when one needs to only generate prompt data independent of the model (e.g, see the discussion in this PR).

Add tiny baseline training integration test

#44 broke training, which would've been caught if we had an integration test that launches a small baseline run.

As usual, not an open license

In the best traditions of Facebook, you're offering this model under a license that makes it impossible to use it for commercial projects. Which, in turn, means that nobody will have the scale to turn it into something useful.

Well done, Facebook, as usual, you show your real face.

In addition to your non-existent customer support for users and advertisers.

I really hope that one day your company goes bankrupt, so you give space to somebody who truly cares. The culture at Facebook is rotten.

layer_norm is fp32, can't be wrapped inside half precision layers.

🐛 Bug

I encountered it when running 6.7b model with MODEL_PARALLEL = 2 and TOTAL_WORLD_SIZE = 2 with single_node_init

When running interactive_cli i encountered a problem in fairseq parameter flattening:

I inserted some prints and found that layer_norm weights and offsets were full precision and could not be flattened with other fp16 parameters.

Indeed in metaseq layer_norm definiton those are fp32

By inserting explicit cast to fp16 I managed to start the model. Is this intended behavior? Am I missing something?

Error

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/metaseq/metaseq_cli/interactive_cli.py", line 115, in <module>
    cli_main()
  File "/workspace/metaseq/metaseq_cli/interactive_cli.py", line 111, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/workspace/metaseq/metaseq/distributed/utils.py", line 256, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/workspace/metaseq/metaseq/distributed/utils.py", line 234, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/workspace/metaseq/metaseq/distributed/utils.py", line 203, in distributed_main
    main(cfg, **kwargs)
  File "/workspace/metaseq/metaseq_cli/interactive_cli.py", line 66, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/workspace/metaseq/metaseq/hub_utils.py", line 485, in load_model
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/workspace/metaseq/metaseq/checkpoint_utils.py", line 503, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/workspace/metaseq/metaseq/hub_utils.py", line 474, in _build_model
    model = task.build_model(cfg.model).half().cuda()
  File "/workspace/metaseq/metaseq/tasks/language_modeling.py", line 164, in build_model
    model = super().build_model(args)
  File "/workspace/metaseq/metaseq/tasks/base_task.py", line 560, in build_model
    model = models.build_model(args, self)
  File "/workspace/metaseq/metaseq/models/__init__.py", line 89, in build_model
    return model.build_model(cfg, task)
  File "/workspace/metaseq/metaseq/model_parallel/models/transformer_lm.py", line 58, in build_model
    decoder = ModelParallelTransformerDecoder(
  File "/workspace/metaseq/metaseq/models/transformer.py", line 409, in __init__
    self.build_decoder_layer(
  File "/workspace/metaseq/metaseq/models/transformer.py", line 552, in build_decoder_layer
    layer = fsdp_wrap(
  File "/workspace/metaseq/metaseq/distributed/fully_sharded_data_parallel.py", line 141, in fsdp_wrap
    return wrap(module, **kwargs)
  File "/workspace/fairscale/fairscale/nn/wrap/auto_wrap.py", line 170, in wrap
    return ConfigAutoWrap.wrapper_cls(module, **wrap_overrides)
  File "/workspace/metaseq/metaseq/distributed/fully_sharded_data_parallel.py", line 48, in __init__
    super().__init__(*args, **kwargs)
  File "/workspace/fairscale/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 342, in __init__
    self._fsdp_wrapped_module: nn.Module = FlattenParamsWrapper(module, param_list=to_be_flatten_params)
  File "/workspace/fairscale/fairscale/nn/misc/flatten_params_wrapper.py", line 211, in __init__
    params, param_infos, shared_param_infos = self._init_flatten_params(new_p_set)
  File "/workspace/fairscale/fairscale/nn/misc/flatten_params_wrapper.py", line 278, in _init_flatten_params
    assert len(set(p.dtype for p in params)) == 1, "expects all parameters to have same dtype"
AssertionError: expects all parameters to have same dtype

Expected behavior

Environment

metaseq: Version master:
PyTorch Version: 1.10.1
OS: Ubuntu20.04:
How you installed metaseq: as in setup guide:
Build command you used: as in setup guide:
Python version: 3.8
CUDA/cuDNN version: 11.3.1/8.2.1
GPU models and configuration: two of Quadro RTX 8000

[convenience] Minimize tboard/wandb logspew

We're currently logging a bunch of things at the layer level in tensorboard/wandb. While it's still good to have these logged somewhere (ex. in the train.log) they're clogging up the visualization.

(making issue per discussion about a few other things with @stephenroller and @klshuster)

Clean up utils.py

There is too much junk lobbed together in the top level utils.py file. Clean this up.

Unable to run interactive_hosted locally

To reproduce:

ssh to node 131 on cluster
confirm that mp 175B shards exist in /mnt/scratch/175B/reshard_no_os/
permissions on dir has been opened to all (chmod 777)
metaseq is on main

Resulting stacktrace:

(metaseq-test-apr7) susanz@<node 131 on cluster>:~/staging/metaseq$ python -m metaseq_cli.interactive_hosted
2022-05-08 20:09:42 | INFO | metaseq_cli.interactive | Local checkpoint copy already exists, skipping copy
2022-05-08 20:09:42 | INFO | metaseq.tasks.language_modeling | dictionary: 50272 types
2022-05-08 20:09:42 | INFO | metaseq.hub_utils | loading model(s) from /mnt/scratch/175B/reshard_no_os/reshard.pt
Traceback (most recent call last):
  File "/shared/home/susanz/miniconda3/envs/metaseq-test-apr7/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/shared/home/susanz/miniconda3/envs/metaseq-test-apr7/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/shared/home/susanz/staging/metaseq/metaseq_cli/interactive_hosted.py", line 304, in <module>
    cli_main()
  File "/shared/home/susanz/staging/metaseq/metaseq_cli/interactive_hosted.py", line 300, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/shared/home/susanz/staging/metaseq/metaseq/distributed/utils.py", line 263, in call_main
    return main(cfg, **kwargs)
  File "/shared/home/susanz/staging/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/shared/home/susanz/staging/metaseq/metaseq/hub_utils.py", line 485, in load_model
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/shared/home/susanz/staging/metaseq/metaseq/checkpoint_utils.py", line 473, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/shared/home/susanz/staging/metaseq/metaseq/checkpoint_utils.py", line 408, in load_checkpoint_to_cpu
    paths_to_load = get_paths_to_load(local_path, suffix="shard")
  File "/shared/home/susanz/staging/metaseq/metaseq/checkpoint_utils.py", line 348, in get_paths_to_load
    if not _is_checkpoint_sharded(checkpoint_files):
  File "/shared/home/susanz/staging/metaseq/metaseq/checkpoint_utils.py", line 339, in _is_checkpoint_sharded
    size_ratio = max(sizes) / min(sizes)
ValueError: max() arg is an empty sequence

@stephenroller last time we saw this, it seemed to be a read permissions issue?

Add GPU timers

The Megatron codebase has timers scattered all over portions of their code (i.e. https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/optimizer/optimizer.py#L412). We should add similar timers to see if we can find areas of improvement.

Per @anj-s: these should be gated behind a debug flag in case there's a perf hit for measurement / to reduce log spam.

Issue running reshard_mp_launch.sh

Running the recently added reshard_mp_launch.sh on the 992 shard downloaded generated an an error.

To Reproduce

Running :

bash metaseq/scripts/reshard_mp_launch_no_slurm.sh model/checkpoint_last  metaseq_model_combined/ 8 1

generates the following:

python -m metaseq.scripts.reshard_mp model/checkpoint_last metaseq_model_combined/ --part 7 --target-ddp-size 1
reshard_mp7_ddp1
Waiting on jobs...


metaseq/scripts/reshard_mp_launch_no_slurm.sh: line 24: 423467 Killed                  python3 -m metaseq.scripts.reshard_mp $prefix $save_dir --part $i --target-ddp-size $tgt_size
metaseq/scripts/reshard_mp_launch_no_slurm.sh: line 24: 423469 Killed                  python3 -m metaseq.scripts.reshard_mp $prefix $save_dir --part $i --target-ddp-size $tgt_size
metaseq/scripts/reshard_mp_launch_no_slurm.sh: line 24: 423472 Killed                  python3 -m metaseq.scripts.reshard_mp $prefix $save_dir --part $i --target-ddp-size $tgt_size
Saving to metaseq_model_combined//reshard-model_part-1-shard[i].pt: 1it [01:42, 102.60s/it]
Saving to metaseq_model_combined//reshard-model_part-4-shard[i].pt: 1it [01:42, 102.01s/it]
Saving to metaseq_model_combined//reshard-model_part-7-shard[i].pt: 1it [01:42, 102.16s/it]
Saving to metaseq_model_combined//reshard-model_part-6-shard[i].pt: 1it [01:45, 105.07s/it]
metaseq/scripts/reshard_mp_launch_no_slurm.sh: line 24: 423470 Killed                  python3 -m metaseq.scripts.reshard_mp $prefix $save_dir --part $i --target-ddp-size $tgt_size
Done

Note that running the command output by reshard_mp_launch.sh; sample command listed below, seems to work and generates the files reshard-model_part-0-shard0.pt through reshard-model_part-7-shard0.pt

Sample command ran manually:
python -m metaseq.scripts.reshard_mp model/checkpoint_last metaseq_model_combined/ --part 0 --target-ddp-size 1

[eval] RuntimeError: attn_batches % batches_per_block when run with certain tasks

🐛 Bug

Running a model parallel 2 with 8 gpus on FAIR cluster raises the following exception with the 1.3B_gptz model only when run with arceasy, arcchallenge, openbookqa. Works with storycloze, hellaswag, winogrande.

RuntimeError: attn_batches % batches_per_block == 0INTERNAL ASSERT FAILED at "/private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h":363, please report a bug to PyTorch.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Login to FAIR cluster
The environment is setup with the following steps

apex commit: e1aa1fc1316a84e66869666270941265ec9cde77
fairscale commit: 1bc96fa8c69def6d990e42bfbd75f86146ce29bd
megatron: --branch fairseq_v2
metaseq - git checkout tbmihaylov/gshard-eval-script - this is rebased from main with added the model (below)

Model - fresh copy of the 1.3B_gptz from azure:

UNIDIR_LM_ROBERTA_DATA = {
# ...
"1.3B_gptz_model_parallel": gptz_sharded_config(
        "/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt",
        model_parallel_size=2
    ),
# ...
}

Slurm allocation (8 gpus)

srun --gpus=8 --nodes 1 --ntasks-per-node 1 --cpus-per-task 10 --mem 58G --constraint volta32gb --time 1440 --partition xlmg,devaccel,learnaccel --pty bash

Command:

export RUN_MODEL_NAME=1.3B_gptz_model_parallel
python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME}  --tasks arceasy --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp  --distributed-world-size 8

See the error in the log (at the end of this issue).

Expected behavior

Not failing in the given configuration.

Environment

Explained in the repro

Additional context

Here are the logs:

55149532_4_0_log.err:

WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
..
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 8, which is different with the world size 4. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
INFO:fairseq.checkpoint_utils:Done loading state dict
INFO:fairseq.models.fairseq_model:{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': '/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 4, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 2, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False, 'new_profiler': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 64, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://hpc-pg0-132:18422', 'distributed_port': 18422, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None}, 'dataset': {'_name': None, 'num_workers': 8, 'num_workers_valid': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': None, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': True, 'validate_interval': 1, 'validate_interval_updates': 1000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 286102, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'clip_norm_type': 'l2', 'skip_gradient_update_on_clip_norm': False, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0002], 'stop_min_lr': -1.0, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_best_checkpoints': True, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '-model_part-0', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': True, 's3_upload_path': 'https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', 'model_parallel_size': 2}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 64}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='transformer_lm_megatron', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=2048, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7faa62373160>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'task': {'_name': 'streaming_language_modeling', 'data': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'end_of_document_symbol': '</s>', 'sample_break_mode': 'none', 'tokens_per_sample': 2048, 'max_source_positions': None, 'max_target_positions': None, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'data_buffer_size': 10, 'tpu': False, 'update_freq': [1]}, 'criterion': Namespace(_name='vocab_parallel_cross_entropy', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7faa62373160>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.95)', 'adam_eps': 1e-08, 'weight_decay': 0.1, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0002], 'block_wise': False}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 357, 'force_anneal': None, 'end_learning_rate': 2e-05, 'zero_lr_warmup_steps': 0, 'power': 1.0, 'total_num_update': 286102.0, 'lr': [0.0002]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': {'_name': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True}, 'tokenizer': None, 'simul_type': None}
submitit ERROR (2022-03-31 16:50:15,553) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/submitit/core/submission.py", line 72, in submitit_main
    process_job(args.folder)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in process_job
    raise error
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
    result = delayed.result()
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/submitit/core/utils.py", line 122, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 320, in run_evaluations_from_model_name
    results = load_lm_and_run_func(run_evaluations, model_name, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 178, in load_lm_and_run_func
    distributed_utils.call_main(
  File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 215, in call_main
    torch.multiprocessing.spawn(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 6 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 199, in distributed_main
    main(cfg, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 272, in _load_lm_and_run_func
    return_value = func(model=model, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 478, in run_evaluations
    for metric, score in run_evaluation(
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 844, in run_evaluation
    eval_predictions, metrics_scores = predictor.predict(eval_samples)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 990, in predict
    return self.predict_without_calibration(samples)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 1060, in predict_without_calibration
    predictions = self.predict_outputs(samples)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 982, in predict_outputs
    return self.score_candidates(samples)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/predictors.py", line 884, in score_candidates
    local_hypotheses = self.model.generate(
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 253, in generate
    translations = self.task.inference_step(
  File "/private/home/tbmihaylov/metaseq/fairseq/tasks/language_modeling_inference_for_models_trained_with_streaming.py", line 387, in inference_step
    return generator.generate(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/sequence_scorer.py", line 63, in generate
    decoder_out = model(**net_input)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1403, in forward
    outputs = self.module(*args, **kwargs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/misc/flatten_params_wrapper.py", line 487, in forward
    return self.module(*inputs, **kwinputs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/models/fairseq_model.py", line 373, in forward
    return self.decoder(src_tokens, **kwargs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 643, in forward
    x, extra = self.extract_features(
  File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 668, in extract_features
    return self.extract_features_scriptable(
  File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 727, in extract_features_scriptable
    x, layer_attn, _, l_aux_i = layer(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1403, in forward
    outputs = self.module(*args, **kwargs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/misc/flatten_params_wrapper.py", line 487, in forward
    return self.module(*inputs, **kwinputs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/modules/checkpoint_activation_wrapper/checkpoint_activations.py", line 187, in _checkpointed_forward
    return original_forward(module, *args, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/modules/transformer_layer.py", line 513, in forward
    x, attn = self.forward_attention(
  File "/private/home/tbmihaylov/metaseq/fairseq/model_parallel/modules/transformer_layer.py", line 162, in forward_attention
    (attn_output, attn_bias), attn_weights = self.self_attn(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/model_parallel/modules/multihead_attention.py", line 374, in forward
    attn_probs = ScaledUpperTriangMaskedSoftmax.apply(matmul_result, 1.0)
  File "/private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/model/fused_softmax.py", line 35, in forward
    softmax_results = scaled_upper_triang_masked_softmax_cuda.forward(
RuntimeError: attn_batches % batches_per_block == 0INTERNAL ASSERT FAILED at "/private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h":363, please report a bug to PyTorch. 

srun: error: learnfair1090: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=55152069.0

55149532_4_0_log.out:

submitit INFO (2022-03-31 16:47:27,432) - Starting with JobEnvironment(job_id=55149532_4, hostname=learnfair1090, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2022-03-31 16:47:27,432) - Loading pickle: /private/home/tbmihaylov/metaseq-internal/debug/55149532_4_submitted.pkl
model_name=1.3B_gptz_model_parallel
args:Namespace(add_bos_token=False, all_gather_list_size=16384, azureml_logging=False, batch_size=None, batch_size_valid=None, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, combine_valid_subsets=None, context_window=0, cpu=False, cpu_offload=False, criterion='cross_entropy', data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=15389, distributed_rank=0, distributed_world_size=1, dont_log_param_and_grad_norm=False, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, future_target=False, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=False, log_file=None, log_format=None, log_interval=100, log_nvidia_smi=False, lr_scheduler='fixed', max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_valid_steps=None, memory_efficient_fp16=True, min_loss_scale=0.0001, model_overrides='{}', model_parallel_size=1, new_profiler=False, no_progress_bar=False, no_reshard_after_forward=False, no_seed_provided=False, num_shards=1, num_workers=1, num_workers_valid=0, optimizer=None, output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, pad_to_fixed_bsz=False, pad_to_fixed_length=False, past_target=False, path=None, plasma_path='/tmp/plasma', profile=False, required_batch_size_multiple=8, results_path=None, sample_break_mode='none', score_sequences=False, seed=1, self_target=False, shard_id=0, shorten_data_split_list='', shorten_method='none', shuffle_docs=False, skip_invalid_size_inputs_valid_test=False, softmax_batch=9223372036854775807, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, train_subset='train', use_plasma_view=False, use_sharded_state=True, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=4000, zero_sharding='none')
model_config:{'model_path': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt', 'extra_args': ['--use-sharded-state', '--memory-efficient-fp16', '--fp16', '--distributed-port', '15389', '--ddp-backend', 'fully_sharded'], 'model_overrides': {'bpe': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True, 'specify_arch': True, 'batch_size': None, 'batch_size_valid': None}, 'model_parallel_size': 2, 'distributed_world_size': 1}
fairseq_cfg.common.model_parallel_size:2
distributed_training.distributed_port=15389
> initializing tensor model parallel with size 2
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
name decoder.embed_tokens.weight parameters Parameter containing:
tensor([[-0.0081,  0.0116,  0.0153,  ...,  0.0056, -0.0022, -0.0036],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0006,  0.0053, -0.0022,  ...,  0.0040,  0.0002, -0.0011],
        ...,
        [-0.0017,  0.0006,  0.0019,  ...,  0.0005,  0.0027, -0.0058],
        [-0.0010,  0.0111, -0.0051,  ..., -0.0024,  0.0083, -0.0031],
        [-0.0029, -0.0132, -0.0025,  ..., -0.0011, -0.0041, -0.0034]],
       requires_grad=True)
name decoder.embed_positions.weight parameters Parameter containing:
tensor([[ 0.0033, -0.0105,  0.0078,  ..., -0.0028, -0.0058, -0.0013],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0018, -0.0002, -0.0171,  ..., -0.0017, -0.0015,  0.0061],
        ...,
        [-0.0015, -0.0065, -0.0032,  ...,  0.0016, -0.0067,  0.0027],
        [ 0.0014,  0.0004, -0.0169,  ..., -0.0052,  0.0007,  0.0024],
        [ 0.0062,  0.0033, -0.0019,  ..., -0.0097,  0.0021,  0.0011]],
       requires_grad=True)
name decoder.layers.0._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0057,  0.0115,  0.0047,  ...,  0.0019,  0.0004, -0.0002],
       requires_grad=True)
name decoder.layers.1._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0001, -0.0012,  0.0029,  ...,  0.0001, -0.0013, -0.0004],
       requires_grad=True)
name decoder.layers.2._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 6.8599e-03, -5.5861e-03, -6.1739e-04,  ...,  1.4616e-03,
         1.1143e-03, -4.9934e-05], requires_grad=True)
name decoder.layers.3._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0081, -0.0024, -0.0071,  ..., -0.0002, -0.0007,  0.0003],
       requires_grad=True)
name decoder.layers.4._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-4.4335e-04,  5.2967e-04, -7.5067e-03,  ..., -1.5072e-03,
        -3.4442e-04, -3.9858e-05], requires_grad=True)
name decoder.layers.5._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 1.4034e-02,  1.4860e-03, -2.4700e-03,  ...,  1.0679e-03,
        -4.4604e-05,  2.2255e-04], requires_grad=True)
name decoder.layers.6._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0039, -0.0027, -0.0069,  ...,  0.0004, -0.0014, -0.0008],
       requires_grad=True)
name decoder.layers.7._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0009, -0.0083,  0.0029,  ...,  0.0002,  0.0003,  0.0007],
       requires_grad=True)
name decoder.layers.8._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0033, -0.0030,  0.0120,  ..., -0.0018,  0.0008, -0.0011],
       requires_grad=True)
name decoder.layers.9._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0055, -0.0099,  0.0100,  ...,  0.0006, -0.0006, -0.0002],
       requires_grad=True)
name decoder.layers.10._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0083, -0.0028, -0.0070,  ...,  0.0003, -0.0003, -0.0013],
       requires_grad=True)
name decoder.layers.11._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0054,  0.0006, -0.0027,  ..., -0.0011, -0.0003,  0.0002],
       requires_grad=True)
name decoder.layers.12._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0021, -0.0008, -0.0018,  ...,  0.0004, -0.0020, -0.0012],
       requires_grad=True)
name decoder.layers.13._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-3.5661e-03,  6.1952e-03, -6.5495e-03,  ..., -5.2914e-05,
         8.5266e-04, -2.6402e-04], requires_grad=True)
name decoder.layers.14._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 1.1657e-02, -2.1671e-03, -2.9637e-03,  ..., -5.6107e-05,
        -1.0188e-04, -1.6555e-03], requires_grad=True)
name decoder.layers.15._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0133, 0.0101, 0.0012,  ..., 0.0017, 0.0006, 0.0004],
       requires_grad=True)
name decoder.layers.16._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 5.8620e-03,  5.3024e-03, -6.2737e-04,  ...,  1.1526e-03,
        -1.6954e-05, -7.6445e-04], requires_grad=True)
name decoder.layers.17._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0074, -0.0001,  0.0032,  ...,  0.0005, -0.0013,  0.0009],
       requires_grad=True)
name decoder.layers.18._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 8.0252e-03,  2.1492e-03,  8.6357e-04,  ..., -3.0256e-04,
        -3.4153e-05, -3.2842e-04], requires_grad=True)
name decoder.layers.19._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0057, -0.0072,  0.0052,  ..., -0.0011,  0.0004, -0.0011],
       requires_grad=True)
name decoder.layers.20._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0026, -0.0013,  0.0018,  ..., -0.0003,  0.0003, -0.0008],
       requires_grad=True)
name decoder.layers.21._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0088,  0.0102,  0.0017,  ..., -0.0006,  0.0002,  0.0008],
       requires_grad=True)
name decoder.layers.22._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-5.4218e-04, -1.1330e-03, -5.0225e-03,  ...,  1.0770e-05,
         5.9360e-04, -1.2139e-03], requires_grad=True)
name decoder.layers.23._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-1.6337e-02,  7.0378e-03,  6.2587e-03,  ...,  1.7789e-03,
        -4.8786e-05, -1.5681e-03], requires_grad=True)
name decoder.layer_norm.weight parameters Parameter containing:
tensor([1., 1., 1.,  ..., 1., 1., 1.], requires_grad=True)
name decoder.layer_norm.bias parameters Parameter containing:
tensor([0., 0., 0.,  ..., 0., 0., 0.], requires_grad=True)
Loaded model
model_loading_time=95.9 seconds
model_loading_time_cuda=96.9 seconds
task=arceasy
eval_set=dev
eval language=en
train_set=train
train_lang=en
template=arc_old
calibration_options=[]
nb_few_shot_samples=0
expected_max_tgt_len=110, max_positions=2048
Average number of train samples: 0.00
Predicting 570 samples with 2281 prompts..
Before running model, bs=1, max_tgt_len=109 mem=0.31GB
submitit ERROR (2022-03-31 16:50:15,553) - Submitted job triggered an exception

"RuntimeError: torch.distributed is not yet initialized but process group is requested" when trying to run API

❓ Questions and Help

After following setup steps I ran metaseq-api-local and got this output:

$ metaseq-api-local
Traceback (most recent call last):
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq/service/constants.py", line 17, in <module>
    from metaseq_internal.constants import LOCAL_SSD, MODEL_SHARED_FOLDER
ModuleNotFoundError: No module named 'metaseq_internal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/jliu/miniconda3/envs/conda_env_opt/bin/metaseq-api-local", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/importlib/metadata.py", line 86, in load
    module = import_module(match.group('module'))
  File "/home/jliu/miniconda3/envs/conda_env_opt/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq_cli/interactive_hosted.py", line 31, in <module>
    from metaseq.service.constants import (
  File "/home/jliu/openpretrainedtransformer/metaseq/metaseq/service/constants.py", line 40, in <module>
    raise RuntimeError(
RuntimeError: You must set the variables in metaseq.service.constants to launch the API.

Am I missing a step? I tried manually setting LOCAL_SSD, MODEL_SHARED_FOLDER to a new folder I created but then other things failed.

fairseq Version (e.g., 1.0 or master): followed setup.md
PyTorch Version (e.g., 1.0) followed setup.md
OS (e.g., Linux): Ubuntu
How you installed fairseq (pip, source): source
Build command you used (if compiling from source): followed setup.md
Python version: 3.9.12
CUDA/cuDNN version: 11.3
GPU models and configuration: Quadro RTX 5000
Any other relevant information:

How can I download the dataset?

Hi,
Is the preprocessed/filtered version of the dataset released somewhere? How can I download it?

"No package metadata was found for metaseq" when execute metaseq-api-local command

What is your question?

I try to run metaseq-api using Google Colab, but when I execute metaseq-api-local command after setup, I got these error message:

Traceback (most recent call last):
  File "/usr/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/usr/bin/metaseq-api-local", line 22, in importlib_load_entry_point
    for entry_point in distribution(dist_name).entry_points
  File "/usr/local/lib/python3.7/dist-packages/importlib_metadata/__init__.py", line 975, in distribution
    return Distribution.from_name(distribution_name)
  File "/usr/local/lib/python3.7/dist-packages/importlib_metadata/__init__.py", line 566, in from_name
    raise PackageNotFoundError(name)
importlib_metadata.PackageNotFoundError: No package metadata was found for metaseq

I'm not sure if anyone else has a similar problem. Also, I'm using a different environment to try to see if I can reproduce this issue. However, Google Colab can reproduce this problem, see the code below:

Code

https://gist.github.com/DGideas/7c7147d7c965a477d977644806f0ca2a

What have you tried?

Create a new Google Colab execution environment and retry
I'm using a different environment to try to see if I can reproduce this issue

What's your environment?

fairseq Version (e.g., 1.0 or master): master
PyTorch Version (e.g., 1.0) 1.10.1
OS (e.g., Linux): Ubuntu 20.04 LTS / Linux
How you installed fairseq (pip, source): pip
Build command you used (if compiling from source): see the code above
Python version: 3.8.10
CUDA/cuDNN version: 11.3
GPU models and configuration: Nvidia T40
Any other relevant information: n/a

Unify model and task registries with... everything else.

From looking at the merge_with_parent method, we see:

(base) √ fairseq-big-internal % ag merge_with_parent
fairseq/tasks/__init__.py
12:from fairseq.dataclass.utils import merge_with_parent, populate_dataclass
39:            cfg = merge_with_parent(dc(), cfg)

fairseq/registry.py
10:from fairseq.dataclass.utils import populate_dataclass, merge_with_parent
40:                cfg = merge_with_parent(dc(), cfg)

fairseq/models/__init__.py
12:from fairseq.dataclass.utils import merge_with_parent, populate_dataclass
81:            cfg = merge_with_parent(dc(), cfg)

fairseq/dataclass/utils.py
463:def merge_with_parent(dc: FairseqDataclass, cfg: FairseqDataclass):

What is notable here is that models and tasks are the only two modules that doesn't call on the fairseq/registry.py shared file. Meanwhile:

(base) √ fairseq-big-internal % ag setup_registry
fairseq/registry.py
17:def setup_registry(registry_name: str, base_class=None, default=None, required=False):

fairseq/optim/lr_scheduler/__init__.py
23:) = registry.setup_registry(

fairseq/optim/__init__.py
31:) = registry.setup_registry("--optimizer", base_class=FairseqOptimizer, required=True)

fairseq/data/encoders/__init__.py
13:build_tokenizer, register_tokenizer, TOKENIZER_REGISTRY, _ = registry.setup_registry(
19:build_bpe, register_bpe, BPE_REGISTRY, _ = registry.setup_registry(

fairseq/criterions/__init__.py
23:) = registry.setup_registry(

use registry to setup configurations.

Figure out how to unify these (or what a good reason for leaving model/task untouched would be).

Add tests for MultiplePadDataset class

Testing

Add tests for MultiplePadDataset class

Defining arch from args/checkpoint/default?

Given https://github.com/fairinternal/metaseq-internal/pull/181 , it seems like arch is not necessarily present in args in the training workflow when loading from disk. This seems like yet another case where there are multiple ways to track configuration - we should figure out why this codepath breaks for training, where args.arch is not present.

Addition of 2 OSs

📚 Documentation

I want to add two OSs in bug_report.md : Windows and macOS.
I am adding a PR for that as well.

Dynamic loss scaler does not fully checkpoint state, causing path dependency wrt restarts

Dynamic loss scaler has _iter and _last_overflow_iter attributes, which are not checkpointed: https://github.com/fairinternal/fairseq-py/blob/gshard_combine_megatron_fsdp/fairseq/optim/dynamic_loss_scaler.py#L32

As a result, loss scaling changes as a function of when we checkpoint / resume from checkpoints.

We should add a flag to enable checkpointing this state for reproducibility, along with keeping a flag for allowing this state to be forgotten if need be.

Make dynamic loss scale window depend on loss scale value

We have noticed anecdotally at 175B scale that when things are "too stable", the loss scaler will go up to a factor of 8.0 or 16.0, at which point training instabilities may occur.

One possible adjustment to this logic would be to increase scaling window (https://github.com/fairinternal/fairseq-py/blob/gshard_combine_megatron_fsdp/fairseq/optim/dynamic_loss_scaler.py#L19) as a function of loss scale value (the higher the loss scale value, the longer we will wait to scale up the loss scale).

Deduplicated Datasets

🚀 Feature Request

In the paper, it is mentioned that the authors used a deduplicated variant of The Pile. Will this deduplication (or index of which samples were deduplicated) be released?

Motivation

This would be useful for other applications involving language models and understanding how much of The Pile is duplicated, as well as studying GPT NeoX 20B.

fairinternal/metaseq.git required by setup.md is not available

🐛 Bug

setup.md instructs that metaseq should be installed using:
git clone https://github.com/fairinternal/metaseq.git

However, as of 5/3/2022 at 11:40 AM PDT, fairinternal has no publicly accessible repository metaseq

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Run command: git clone https://github.com/fairinternal/metaseq.git
Response is: Username for 'https://github.com':
In my experience, this response can occur because the respository does not exist, or because it exists but is not publicly accessible.
To further investigate, go to web page https://github.com/orgs/fairinternal/repositories and note that no repository metaseq appears.

Code sample

See above

Expected behavior

I expected step 1. above to create a clone of metaseq.git on my system

Environment

N/A to this issue

Additional context

No additional context.

Runtime error in CLI

Hi I am running the CLI and querying it in an adhoc fashion using the code below:
`import requests
import json
from flask import jsonify

url = 'https://127.0.0.1:6010/completions'
headers = {'Content-Type': 'application/json'}

filters = {'prompt':'LinkedIn is a great company2'}
#print(jsonify(filters))
#params = dict(json=json.dumps(filters))
print(filters)
print(json.dumps(filters))
response = requests.post(url, json=filters,headers=headers,verify=False)
#print(response.text)
assert response.status_code == 200
print(response.json())I see the error below:Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/azureuser/metaseq/metaseq_cli/interactive_hosted.py", line 89, in batching_loop
item = target_queue.get(timeout=timeout / 1000)
File "/anaconda/envs/azureml_py38/lib/python3.8/queue.py", line 178, in get
raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/anaconda/envs/azureml_py38/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/anaconda/envs/azureml_py38/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/azureuser/metaseq/metaseq_cli/interactive_hosted.py", line 131, in batching_loop
generations = generator.generate(**request_object)
File "/home/azureuser/metaseq/metaseq/hub_utils.py", line 568, in generate
dataset=self.task.build_dataset_for_inference(tokens, lengths),
File "/home/azureuser/metaseq/metaseq/tasks/language_modeling.py", line 273, in build_dataset_for_inference
TokenBlockDataset(
File "/home/azureuser/metaseq/metaseq/data/token_block_dataset.py", line 62, in init
_sizes, block_to_dataset_index, slice_indices = self._build_slice_indices(
File "/home/azureuser/metaseq/metaseq/data/token_block_dataset.py", line 92, in _build_slice_indices
from metaseq.data.token_block_utils_fast import (
File "metaseq/data/token_block_utils_fast.pyx", line 1, in init metaseq.data.token_block_utils_fast
# cython: language_level=3
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject`

metaseq Version (e.g., 1.0 or master): 0.0.1
PyTorch Version (e.g., 1.0): '1.10.1+cu113'
OS (e.g., Linux): Linux
NAME="Ubuntu" VERSION="18.04.6 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.6 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic
How you installed metaseq (pip, source): Same as setup instructions
Build command you used (if compiling from source): Same as setup instructions
Python version: 3.8.5
CUDA/cuDNN version:
(azureml_py38) azureuser@rparik4:~/metaseq$ nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2021 NVIDIA Corporation Built on Sun_Feb_14_21:12:58_PST_2021 Cuda compilation tools, release 11.2, V11.2.152 Build cuda_11.2.r11.2/compiler.29618528_0
GPU models and configuration: Azure compute node with 8 gpus
Virtual machine size
Standard_ND40rs_v2 (40 cores, 672 GB RAM, 2900 GB disk)
Processing unit
GPU - 8 x NVIDIA Tesla V100
Any other relevant information:

91$

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Run cmd '....'
See error

Code sample

Expected behavior

Environment

fairseq Version (e.g., 1.0 or master):
PyTorch Version (e.g., 1.0)
OS (e.g., Linux):
How you installed fairseq (pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:

Additional context

Releasing prediction logits for NLP tasks?

Hi,
Will the prediction logits for the 16 NLP tasks (Appendix A, Figure 6) be released?

175B download script downloads with api key appended to filename

🐛 Bug

When you run the download_opt17b.sh script to download the weights, it currently downloads as "checkpoint_last-model-part--shard.pt?&Policy=<policy_key>" instead of just "checkpoint_last-model-part--shard.pt".

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Run cmd 'bash metaseq/scripts/download_opt175b.sh "<presigned_url_given_in_email>"'
Run cmd ls *pt*

Code sample

I believe this is an easy fix, just change wget "${presigned_url/$str_to_replace/$filename}" to wget -O "$filename" "${presigned_url/$str_to_replace/$filename}"

Expected behavior

I would expect it to save it in the form of ....pt files.

Environment

fairseq Version (e.g., 1.0 or master): master
PyTorch Version (e.g., 1.0)
OS (e.g., Linux): Ubuntu
How you installed fairseq (pip, source): source
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:
wget version: 1.20.3-1ubuntu1

Additional context

Figure out potential duplication in ConfigStore

By the time we get to our first convert_namespace_to_omegaconf call, we already have:

# from hydra.core.config_store import ConfigStore
(Pdb) cs = ConfigStore.instance()
(Pdb) cs.repo.keys()
dict_keys(['hydra', '_dummy_empty_config_.yaml', 'base_config.yaml', '_name.yaml', 'common.yaml', 'common_eval.yaml', 'distributed_training.yaml', 'dataset.yaml', 'optimization.yaml', 'checkpoint.yaml', 'generation.yaml', 'eval_lm.yaml', 'model.yaml', 'task.yaml', 'criterion.yaml', 'optimizer.yaml', 'lr_scheduler.yaml', 'bpe.yaml', 'tokenizer.yaml', 'model', 'optimizer', 'lr_scheduler', 'bpe', 'task'])

which seems to have some redundancy in yaml vs non-yaml keys. Figure out if we can remove the *.yaml keys somehow.

Loading models

❓ Questions and Help

Before asking:

search the issues.
search the docs.

What is your question?

I am trying to load the OPT models in a similar way as the fairseq models, but I seem to hit into quite some snags. Has anyone got a good example on how to load these models?

Code

from metaseq.models.transformer_lm import TransformerLanguageModel
model = TransformerLanguageModel.from_pretrained("./model_location")

What have you tried?

I have already tried various alternatives, but they always get the notification that the model could not get loaded, that it's missing something, etc.

What's your environment?

fairseq Version (e.g., 1.0 or master): master
PyTorch Version (e.g., 1.0): 1.11
OS (e.g., Linux): Linux
How you installed fairseq (pip, source): source
Build command you used (if compiling from source): pip install -e .
Python version: 3.9
CUDA/cuDNN version: N/A
GPU models and configuration: N/A
Any other relevant information: Just wanting to load the weights for conversion.

add web demo/model to Huggingface

Hi, would you be interested in adding metaseq to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. Models/datasets/spaces(web demos) can be added to a user account or organization similar to github.

there is already a facebook research organization on Huggingface with spaces: https://huggingface.co/facebook

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

Example from other organizations:
Keras: https://huggingface.co/keras-io
Microsoft: https://huggingface.co/microsoft

and here are guides for adding spaces/models/datasets to your org

How to add a Space: https://huggingface.co/blog/gradio-spaces
how to add models: https://huggingface.co/docs/hub/adding-a-model
uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

Please let us know if you would be interested and if you have any questions, we can also help with the technical implementation.

Add MyPy type checking to CI

🚀 Feature Request

let's have typechecking, like in ParlAI: https://github.com/fairinternal/ParlAI-Internal/blob/705258b5799e69ef8942a8fc56928ad779e0980b/.github/workflows/lint.yml

Motivation

It would be good to have some type checks (as suggestions, not a blocking test) to help catch common errors

Pitch

probably can just copy/past from ParlAI:
https://github.com/fairinternal/ParlAI-Internal/blob/705258b5799e69ef8942a8fc56928ad779e0980b/.github/workflows/lint.yml

https://fb.workplace.com/100020258253907/posts/744791912872744/
Some slides about the ParlAI version

Releasing intermediate checkpoints

Hi @stephenroller and @suchenzang ,

Thank you for your work on this project! I believe the public release of these models and the training data would benefit so many of us in the research community.

I am wondering if you have plans releasing the intermediate checkpoints of different models? I am interested in study how model evolve over time and the training dynamics so having access to those checkpoints during training will be really helpful.

Thank you!

Running the API

Hi,

Following up on #19 and #23 in a separate issue.

So far I've made the following changes to constants.py:

git diff metaseq/service/constants.py
diff --git a/metaseq/service/constants.py b/metaseq/service/constants.py
index da4ff19..5ba4b94 100644
--- a/metaseq/service/constants.py
+++ b/metaseq/service/constants.py
@@ -29,7 +29,7 @@ except ImportError:
     # reshard-model_part-5.pt
     # reshard-model_part-6.pt
     # reshard-model_part-7.pt
-    MODEL_SHARED_FOLDER = ""
+    MODEL_SHARED_FOLDER = "/home/hlang/opt_models/"
     # LOCAL_SSD is optional, but it's assuming you have some sort of local
     # hard disk where we can cache a copy of the weights for faster loading.
     LOCAL_SSD = ""
@@ -46,9 +46,10 @@ BPE_MERGES = os.path.join(MODEL_SHARED_FOLDER, "gpt2-merges.txt")
 BPE_VOCAB = os.path.join(MODEL_SHARED_FOLDER, "gpt2-vocab.json")

 # where to find the raw files on nfs
-CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, "175B", "reshard_no_os")
+#CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, "175B", "reshard_no_os")
+CHECKPOINT_FOLDER = MODEL_SHARED_FOLDER
 # where to store them on SSD for faster loading
-CHECKPOINT_LOCAL = os.path.join(LOCAL_SSD, "175B", "reshard_no_os", "reshard.pt")
+CHECKPOINT_LOCAL = MODEL_SHARED_FOLDER

My /home/hlang/opt_models looks like:

dict.txt
gpt2-merges.txt
gpt2-vocab.json
reshard-model_part-0.pt
reshard-model_part-1.pt

dict.txt is from Stephen's link in #19 and reshard-model_part-0.pt, reshard-model_part-1.pt are from the OPT-125M links.

I found that I also had to modify checkpoint_utils.py because get_paths_to_load wasn't actually finding those .pt files. So I just directly returned them (maybe this is not the right thing?):

root@node001:/home/hlang/metaseq# git diff metaseq/checkpoint_utils.py
diff --git a/metaseq/checkpoint_utils.py b/metaseq/checkpoint_utils.py
index 1ee8eee..0ea74df 100644
--- a/metaseq/checkpoint_utils.py
+++ b/metaseq/checkpoint_utils.py
@@ -344,6 +344,7 @@ def _is_checkpoint_sharded(checkpoint_files) -> bool:


 def get_paths_to_load(local_path, suffix="rank-"):
+    return ['/home/hlang/opt_models/reshard-model_part-0.pt', '/home/hlang/opt_models/reshard-model_part-1.pt']

Now when I run metaseq-api-local I get:

2022-05-03 19:37:06 | INFO | metaseq.hub_utils | loading model(s) from /home/hlang/opt_models/
2022-05-03 19:37:07 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
  File "/opt/conda/bin/metaseq-api-local", line 33, in <module>
    sys.exit(load_entry_point('metaseq', 'console_scripts', 'metaseq-api-local')())
  File "/home/hlang/metaseq/metaseq_cli/interactive_hosted.py", line 300, in cli_main
    dist_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/home/hlang/metaseq/metaseq/distributed/utils.py", line 226, in call_main
    main(cfg, **kwargs)
  File "/home/hlang/metaseq/metaseq_cli/interactive_hosted.py", line 156, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/home/hlang/metaseq/metaseq/hub_utils.py", line 485, in load_model
    models, _model_args, _task = checkpoint_utils.load_model_ensemble_and_task(
  File "/home/hlang/metaseq/metaseq/checkpoint_utils.py", line 504, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/home/hlang/metaseq/metaseq/hub_utils.py", line 474, in _build_model
    model = task.build_model(cfg.model).half().cuda()
  File "/home/hlang/metaseq/metaseq/tasks/language_modeling.py", line 164, in build_model
    model = super().build_model(args)
  File "/home/hlang/metaseq/metaseq/tasks/base_task.py", line 560, in build_model
    model = models.build_model(args, self)
  File "/home/hlang/metaseq/metaseq/models/__init__.py", line 89, in build_model
    return model.build_model(cfg, task)
  File "/home/hlang/metaseq/metaseq/model_parallel/models/transformer_lm.py", line 47, in build_model
    embed_tokens = cls.build_embedding(
  File "/home/hlang/metaseq/metaseq/model_parallel/models/transformer_lm.py", line 82, in build_embedding
    embed_tokens = VocabParallelEmbedding(
  File "/home/hlang/Megatron-LM/megatron/mpu/layers.py", line 190, in __init__
    self.tensor_model_parallel_size = get_tensor_model_parallel_world_size()
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 258, in get_tensor_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_tensor_model_parallel_group())
  File "/home/hlang/Megatron-LM/megatron/mpu/initialize.py", line 215, in get_tensor_model_parallel_group
    assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
AssertionError: intra_layer_model parallel group is not initialized

I tried Stephen's advice from #19 of setting --model-parallel N for N=0, N=1, N=2 but none worked.

Remove Namespace hack

See:

metaseq/metaseq/checkpoint_utils.py

Lines 424 to 425 in 9afea52

 # hack to be able to set Namespace in dict config. this should be removed when we update to newer 

 # omegaconf version that supports object flags, or when we migrate all existing models

Previous attempt at removing this hack broke generation, evals, and resuming training from checkpoint (namespaces were slipping through, despite conversion to omegaconf - need to track down where that happens).

Clean up last checkpoint on checkpoint writing failures

Run failed when writing checkpoints in one of our compute envs, and the checkpoints were left in a corrupted state.

Partial output of ls -la in the directory:

-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:56 checkpoint_23_22750-model_part-7-shard96.pt.tmp
-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:56 checkpoint_23_22750-model_part-7-shard97.pt
-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:56 checkpoint_23_22750-model_part-7-shard98.pt
-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:53 checkpoint_23_22750-model_part-7-shard99.pt
-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:56 checkpoint_23_22750-model_part-7-shard9.pt
-rw-rw-r--  1 susanz susanz          0 Mar  5 21:56 checkpoint_last-model_part-0-shard0.pt
-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:54 checkpoint_last-model_part-0-shard100.pt
-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:54 checkpoint_last-model_part-0-shard101.pt
-rw-rw-r--  1 susanz susanz  583649007 Mar  5 21:54 checkpoint_last-model_part-0-shard102.pt
-rw-rw-r--  1 susanz susanz          0 Mar  5 21:56 checkpoint_last-model_part-0-shard103.pt
-rw-rw-r--  1 susanz susanz  581828608 Mar  5 21:56 checkpoint_last-model_part-0-shard104.pt
-rw-rw-r--  1 susanz susanz          0 Mar  5 21:56 checkpoint_last-model_part-0-shard112.pt
-rw-rw-r--  1 susanz susanz          0 Mar  5 21:56 checkpoint_last-model_part-0-shard119.pt
-rw-rw-r--  1 susanz susanz          0 Mar  5 21:56 checkpoint_last-model_part-0-shard121.pt
-rw-rw-r--  1 susanz susanz          0 Mar  5 21:56 checkpoint_last-model_part-0-shard122.pt

On automatic requeue, we try to load from latest checkpoint, which failed due to the corruption.

We should add a step to cleanup corrupted checkpoints and restart from the last non-corrupted checkpoint.

Missing dict.txt?

Edit by Admin:
If you see issues with the gpt2 tokenizer, please refer to #132

Hi, I've gotten to the point of running metaseq-api-local but I'm having issues figuring out where to find the missing stuff in constants.py.

I gathered I need gpt2-merges.txt and gpt2-vocab.json, so I downloaded those from here. But based on the example ls output in constants.py, I also need a dict.txt, and when I run metaseq-api-local I get

File "/home/hlang/metaseq/metaseq/data/dictionary.py", line 235, in add_from_file
    with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
FileNotFoundError: [Errno 2] No such file or directory: '/home/hlang/opt_models/dict.txt'

Here I have MODEL_SHARED_FOLDER='/home/hlang/opt_models/'.

Is there a place to download the dict.txt for each model?

Running OPT-175B on CPU or 16 40GB A100s

Request: Documentation for running OPT-175B on a CPU or smaller GPU-based machines (16 NVIDIA A100 40GB Machines)

I am interested in running the text generation interface with the OPT-175B parameter model. However, I only have access to 16 40GB A100 machines. Is it possible to run inference on the model given these compute resources? Any guidance would be greatly appreciated!

Using the Metaseq API

Hi I've managed to get the API flask service up with the 125M OPT model by running python -m metaseq_cli.interactive_hosted on an Azure compute instance with 8 GPUs. I followed the suggestions in issue #23 and #26. I don't have browser access to the Azure compute instance. What's the best way to query the API? Should run the flask service in the background and then query it?

Add a python example on how to upload and use OPT

First of all, thank you for open-sourcing OPT! 💪🚀

I'd like to ask for a simple example of how to use the new OPT model.
In here there could be a python code description on how to upload the model or, because probably the code is too big, point to a script with that example/tutorial.

Make eval_lm.py work with .jsonl formatted data

Currently eval_lm.py requires data to be in the legacy format (.bin, .idx files and a dict.txt). This is annoying because all of my data is in the jsonl format and pre-processing them into the legacy format would be a lot of work - it would be much better if I could directly measure perplexity on any of my current validation sets as is.

I dug into the code a bit actually and it seemed that the issue is that eval_lm instantiates a language_modeling task, but it is streaming_language_modeling task that uses jsonl. So I went ahead and hacked a code minimally to see if I can fix this up: https://fburl.com/phabricator/a2rrxmvp

Turns out if I set all the flags similarly to how I'd set extra flags in model_configs.py this actually works! Here is an example run: https://fburl.com/phabricator/6fy7j7cp

The produced perplexity number, however, is way off from what I saw during training (< 50) and more similar to numbers I saw during gpt3_eval that I'm debugging.

I am looking for help from people actually familiar with the codebase. Is this a legit way to use eval_lm? Are the resulting numbers credible? That is, are language modeling and streaming language modeling tasks compatible in this way or am I missing something maybe?

Model_parallel=2 and 2 gpus on FAIR cluster: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking arugment for argument index in method wrapper_index_select)

🐛 Bug

Running a model parallel with 2 gpus on FAIR cluster raises the following exception with the 1.3B_gptz model:
UPDATE: When we use model_parallel=2 and 8 gpus, this works, but it should not fail with 2 gpus.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument index in method wrapper_index_select)

I found that there is a warning in the log which might be giving a clue about the problem -- the full log is at the bottom of the issue.

WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

Login to FAIR cluster
The environment is setup with the following steps

apex commit: e1aa1fc1316a84e66869666270941265ec9cde77
fairscale commit: 1bc96fa8c69def6d990e42bfbd75f86146ce29bd
megatron: --branch fairseq_v2
metaseq - git checkout tbmihaylov/gshard-eval-script - this is rebased from main with added the model (below)

Model - fresh copy of the 1.3B_gptz from azure:

UNIDIR_LM_ROBERTA_DATA = {
# ...
"1.3B_gptz_model_parallel": gptz_sharded_config(
        "/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt",
        model_parallel_size=2
    ),
# ...
}

Slurm allocation

srun --gpus=2 --nodes 1 --ntasks-per-node 1 --cpus-per-task 10 --mem 58G --constraint volta32gb --time 1440 --partition xlmg,devaccel,learnaccel --pty bash

Command:

export RUN_MODEL_NAME=1.3B_gptz_model_parallel
python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME}  --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp  --distributed-world-size 2

See the error in the log (at the end of this issue).

Expected behavior

Not failing in the given configuration.

Environment

Explained in the repro

Additional context

Full error log:

(metaseq_20220328) tbmihaylov@learnfair1844:~/metaseq-internal$ python -m fairseq.eval.gpt3_eval --model-name ${RUN_MODEL_NAME}  --tasks cb --nb-few-shot-samples-values 0 --max-positions 1024 --train-sep ' ' --scoring mean --fsdp  --distributed-world-size 2 | tee debug.log
model_name=1.3B_gptz_model_parallel
args:Namespace(add_bos_token=False, all_gather_list_size=16384, azureml_logging=False, batch_size=None, batch_size_valid=None, bpe=None, broadcast_buffers=False, bucket_cap_mb=25, combine_valid_subsets=None, context_window=0, cpu=False, cpu_offload=False, criterion='cross_entropy', data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='pytorch_ddp', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=10791, distributed_rank=0, distributed_world_size=2, dont_log_param_and_grad_norm=False, empty_cache_freq=0, fast_stat_sync=False, find_unused_parameters=False, fix_batches_to_gpus=False, fixed_validation_seed=None, fp16=True, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, future_target=False, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=False, log_file=None, log_format=None, log_interval=100, log_nvidia_smi=False, lr_scheduler='fixed', max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_valid_steps=None, memory_efficient_fp16=True, min_loss_scale=0.0001, model_overrides='{}', model_parallel_size=1, new_profiler=False, no_progress_bar=False, no_reshard_after_forward=False, no_seed_provided=False, num_shards=1, num_workers=1, num_workers_valid=0, optimizer=None, output_dictionary_size=-1, output_word_probs=False, output_word_stats=False, pad_to_fixed_bsz=False, pad_to_fixed_length=False, past_target=False, path=None, plasma_path='/tmp/plasma', profile=False, required_batch_size_multiple=8, results_path=None, sample_break_mode='none', score_sequences=False, seed=1, self_target=False, shard_id=0, shorten_data_split_list='', shorten_method='none', shuffle_docs=False, skip_invalid_size_inputs_valid_test=False, softmax_batch=9223372036854775807, task='language_modeling', tensorboard_logdir=None, threshold_loss_scale=None, tokenizer=None, tokens_per_sample=1024, train_subset='train', use_plasma_view=False, use_sharded_state=True, user_dir=None, valid_subset='valid', validate_after_updates=0, validate_interval=1, validate_interval_updates=0, wandb_project=None, warmup_init_lr=-1, warmup_updates=4000, zero_sharding='none')
model_config:{'model_path': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B/checkpoint_last.pt', 'extra_args': ['--use-sharded-state', '--memory-efficient-fp16', '--fp16', '--distributed-port', '10791', '--ddp-backend', 'fully_sharded'], 'model_overrides': {'bpe': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True, 'specify_arch': True, 'batch_size': None, 'batch_size_valid': None}, 'model_parallel_size': 2, 'distributed_world_size': 2}
fairseq_cfg.common.model_parallel_size:2
distributed_training.distributed_port=10791
> initializing tensor model parallel with size 2
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /private/home/tbmihaylov/Megatron-LM-metaseq_20220328/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
WARNING:root:Rolled back to use the default process group for the reduce scatter operation because the reduce_scatter process group size is 2, which is different with the world size 1. Please make sure the process_group parameter uses all the available ranks for the optimal performance.
INFO:fairseq.checkpoint_utils:Done loading state dict
INFO:fairseq.models.fairseq_model:{'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 10, 'log_format': 'json', 'log_file': None, 'tensorboard_logdir': '/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': True, 'memory_efficient_fp16': True, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 4, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 2, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma', 'log_nvidia_smi': False, 'use_tutel_moe': False, 'new_profiler': False}, 'common_eval': {'_name': None, 'path': None, 'post_process': None, 'quiet': False, 'model_overrides': '{}', 'results_path': None, 'is_moe': False}, 'distributed_training': {'_name': None, 'distributed_world_size': 64, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://hpc-pg0-132:18422', 'distributed_port': 18422, 'device_id': 0, 'distributed_no_spawn': True, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 8, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'fp16': True, 'memory_efficient_fp16': True, 'tpu': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None}, 'dataset': {'_name': None, 'num_workers': 8, 'num_workers_valid': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': None, 'required_batch_size_multiple': 1, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', 'combine_valid_subsets': None, 'ignore_unused_valid_subsets': True, 'validate_interval': 1, 'validate_interval_updates': 1000, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': None, 'max_valid_steps': None, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 286102, 'stop_time_hours': 0.0, 'clip_norm': 1.0, 'clip_norm_type': 'l2', 'skip_gradient_update_on_clip_norm': False, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.0002], 'stop_min_lr': -1.0, 'use_bmuf': False, 'train_with_epoch_remainder_batch': False}, 'checkpoint': {'_name': None, 'save_dir': '/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 1000, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': True, 'no_last_checkpoints': False, 'no_best_checkpoints': True, 'no_save_optimizer_state': False, 'no_save_optimizer_state_on_training_finished': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '-model_part-0', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'write_checkpoints_asynchronously': True, 's3_upload_path': 'https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', 'model_parallel_size': 2}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 64}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 10, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807, 'max_valid_steps': None}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': Namespace(_name='transformer_lm_megatron', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=2048, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7fd829da7a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'task': {'_name': 'streaming_language_modeling', 'data': '/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', 'vocab_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'merges_filename': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'end_of_document_symbol': '</s>', 'sample_break_mode': 'none', 'tokens_per_sample': 2048, 'max_source_positions': None, 'max_target_positions': None, 'seed': 1, 'batch_size': None, 'batch_size_valid': None, 'data_buffer_size': 10, 'tpu': False, 'update_freq': [1]}, 'criterion': Namespace(_name='vocab_parallel_cross_entropy', activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.95)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, add_bos_token=False, all_gather_list_size=16384, arch='transformer_lm_megatron', attention_dropout=0.1, azureml_logging=False, batch_size=None, batch_size_valid=None, best_checkpoint_metric='loss', bf16=False, block_wise=False, bpe='hf_byte_bpe', bpe_add_prefix_space=True, bpe_merges='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', bpe_vocab='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', broadcast_buffers=False, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', checkpoint_activations=True, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=1.0, clip_norm_type='l2', combine_valid_subsets=None, cpu=False, cpu_offload=False, criterion='cross_entropy', curriculum=0, data='/large_experiments/xlmg/models/1.3B_gptz_from_azure/1.3B', data_buffer_size=10, dataset_impl=None, ddp_backend='fully_sharded', decoder_attention_heads=32, decoder_embed_dim=2048, decoder_ffn_embed_dim=8192, decoder_input_dim=2048, decoder_layerdrop=0.0, decoder_layers=24, decoder_layers_to_keep=None, decoder_learned_pos=True, decoder_learned_sinusoidal=False, decoder_normalize_before=True, decoder_output_dim=2048, device_id=0, disable_validation=False, distribute_checkpointed_activations=True, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=18422, distributed_rank=0, distributed_world_size=64, dropout=0.1, empty_cache_freq=0, end_learning_rate=2e-05, end_of_document_symbol='</s>', eos=2, fast_stat_sync=False, find_unused_parameters=False, finetune_from_model=None, fix_batches_to_gpus=False, fixed_validation_seed=None, force_anneal=None, fp16=True, fp16_adam_stats=False, fp16_init_scale=4, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, fp32_reduce_scatter=False, full_megatron_init=True, gen_subset='test', gradient_predivide_factor=None, heartbeat_timeout=-1, ignore_unused_valid_subsets=True, keep_best_checkpoints=-1, keep_interval_updates=-1, keep_last_epochs=-1, layernorm_embedding=False, load_checkpoint_on_all_dp_ranks=False, localsgd_frequency=3, log_file=None, log_format='json', log_interval=10, log_nvidia_smi=False, lr=[0.0002], lr_scheduler='polynomial_decay', max_epoch=0, max_source_positions=None, max_target_positions=None, max_tokens=None, max_tokens_valid=None, max_update=286102, max_valid_steps=None, maximize_best_checkpoint_metric=False, megatron_init_sigma=0.006, memory_efficient_bf16=False, memory_efficient_fp16=True, merges_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', min_loss_scale=0.0001, model_parallel_size=2, new_profiler=False, no_best_checkpoints=True, no_decoder_final_norm=False, no_emb_dropout=True, no_epoch_checkpoints=True, no_last_checkpoints=False, no_progress_bar=False, no_reshard_after_forward=False, no_save=False, no_save_optimizer_state=False, no_save_optimizer_state_on_training_finished=False, no_scale_embedding=True, no_seed_provided=False, no_token_positional_embeddings=False, nprocs_per_node=8, num_shards=1, num_workers=8, num_workers_valid=1, optimizer='adam', optimizer_overrides='{}', pad=1, patience=-1, pipeline_balance=None, pipeline_checkpoint='never', pipeline_chunks=0, pipeline_decoder_balance=None, pipeline_decoder_devices=None, pipeline_devices=None, pipeline_encoder_balance=None, pipeline_encoder_devices=None, pipeline_model_parallel=False, plasma_path='/tmp/plasma', post_build_model_hook=<function load_and_get_model.<locals>.default_post_build_model_hook at 0x7fd829da7a60>, power=1.0, profile=False, quant_noise_pq=0.0, quant_noise_pq_block_size=8, quant_noise_scalar=0.0, quantization_config_path=None, relu_dropout=0.0, required_batch_size_multiple=1, required_seq_len_multiple=1, reset_dataloader=False, reset_logging=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', s3_upload_path='https://fairacceleastus.blob.core.windows.net/roller/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/?sv=2020-08-04&ss=b&srt=sco&sp=rwdlactfx&se=2023-10-06T11:23:33Z&st=2021-10-06T03:23:33Z&spr=https&sig=s6aw4Ca4Ohbr7LQ%2BG9s58PEyYJsbXHjs%2Fc%2BuoTvzTUo%3D', sample_break_mode='none', save_dir='/mnt/scratch/roller/checkpoints/2021-12-11/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64', save_interval=1, save_interval_updates=1000, scoring='bleu', seed=1, sentence_avg=False, shard_id=0, share_decoder_input_output_embed=True, simul_type=None, skip_gradient_update_on_clip_norm=False, skip_invalid_size_inputs_valid_test=False, slowmo_algorithm='LocalSGD', slowmo_momentum=None, specify_arch=True, stop_min_lr=-1.0, stop_time_hours=0, suffix='-model_part-0-shard0', suppress_crashes=False, task='streaming_language_modeling', tensorboard_logdir='/shared/home/roller/checkpoints/gptz_baselines/1.3b/base_1.3b.me_fp16.fsdp.relu.transformer_lm_megatron.nlay24.emb2048.lrnpos.0emb_scale.bm_none.tps2048.gpt2.adam.b2_0.95.eps1e-08.cl1.0.lr0.0002.endlr2e-05.wu357.dr0.1.atdr0.1.0emb_dr.wd0.1.ms16.uf1.mu286102.s1.ngpu64/tb', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokenizer=None, tokens_per_sample=2048, total_num_update='286102', tpu=False, train_subset='train', train_with_epoch_remainder_batch=False, unk=3, update_freq=[1], use_bmuf=False, use_old_adam=False, use_plasma_view=False, use_sharded_state=True, use_tutel_moe=False, user_dir=None, valid_subset='valid/BookCorpusFair,valid/CommonCrawl,valid/DM_Mathematics,valid/Gutenberg_PG-19,valid/HackerNews,valid/OpenSubtitles,valid/OpenWebText2,valid/USPTO,valid/Wikipedia_en,valid/redditflattened,valid/stories,valid/dialogue_chitchat,valid/dialogue_knowledge,valid/dialogue_tod,valid/dialogue_light', validate_after_updates=0, validate_interval=1, validate_interval_updates=1000, vocab_filename='/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', wandb_project=None, warmup_updates=357, weight_decay=0.1, write_checkpoints_asynchronously=True, zero_lr_warmup_steps=0, zero_sharding='none'), 'optimizer': {'_name': 'adam', 'adam_betas': '(0.9, 0.95)', 'adam_eps': 1e-08, 'weight_decay': 0.1, 'use_old_adam': False, 'fp16_adam_stats': False, 'tpu': False, 'lr': [0.0002], 'block_wise': False}, 'lr_scheduler': {'_name': 'polynomial_decay', 'warmup_updates': 357, 'force_anneal': None, 'end_learning_rate': 2e-05, 'zero_lr_warmup_steps': 0, 'power': 1.0, 'total_num_update': 286102.0, 'lr': [0.0002]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': {'_name': 'hf_byte_bpe', 'bpe_merges': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-merges.txt', 'bpe_vocab': '/large_experiments/xlmg/data/gptz/tokenizers/gpt2-vocab.json', 'bpe_add_prefix_space': True}, 'tokenizer': None, 'simul_type': None}
Loading extension module fused_mix_prec_layer_norm_cuda...
name decoder.embed_tokens.weight parameters Parameter containing:
tensor([[ 0.0014, -0.0082, -0.0032,  ..., -0.0111,  0.0054,  0.0015],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0050,  0.0010,  0.0044,  ...,  0.0003, -0.0001, -0.0035],
        ...,
        [ 0.0159,  0.0042,  0.0066,  ...,  0.0044,  0.0008, -0.0086],
        [-0.0008,  0.0032, -0.0032,  ..., -0.0060,  0.0036,  0.0086],
        [-0.0092, -0.0037, -0.0013,  ...,  0.0073,  0.0092, -0.0132]],
       requires_grad=True)
name decoder.embed_positions.weight parameters Parameter containing:
tensor([[-7.6732e-03, -5.4649e-03, -4.2956e-03,  ...,  7.5325e-03,
          7.7163e-03,  1.0300e-02],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [-2.3755e-03,  2.4894e-03,  1.4279e-05,  ..., -8.2043e-03,
         -1.8271e-02,  3.9899e-03],
        ...,
        [-9.6320e-03, -8.2788e-03, -4.1433e-03,  ..., -6.7774e-03,
          6.1964e-03, -5.3095e-03],
        [-4.4763e-03,  1.4532e-02, -6.0640e-04,  ...,  1.5341e-03,
         -1.8106e-03, -5.6959e-04],
        [ 3.7042e-03,  5.2186e-03, -1.1615e-02,  ..., -1.0039e-02,
         -8.7586e-04,  7.5653e-03]], requires_grad=True)
name decoder.layers.0._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0059,  0.0019, -0.0075,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.1._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0023, -0.0028,  0.0170,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.2._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0030, -0.0005,  0.0028,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.3._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0077, -0.0097,  0.0007,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.4._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0011,  0.0143, -0.0066,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.5._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0025, -0.0069,  0.0071,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.6._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, -0.0018,  0.0052,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.7._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0046, -0.0019, -0.0044,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.8._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0011, 0.0047, 0.0105,  ..., 0.0000, 0.0000, 0.0000],
       requires_grad=True)
name decoder.layers.9._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0011, 0.0014, 0.0070,  ..., 0.0000, 0.0000, 0.0000],
       requires_grad=True)
name decoder.layers.10._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0068,  0.0033, -0.0046,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.11._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017,  0.0013,  0.0011,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.12._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-2.7278e-03,  7.8808e-03,  6.6479e-05,  ...,  0.0000e+00,
         0.0000e+00,  0.0000e+00], requires_grad=True)
name decoder.layers.13._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0012,  0.0047, -0.0049,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.14._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0065, 0.0002, 0.0080,  ..., 0.0000, 0.0000, 0.0000],
       requires_grad=True)
name decoder.layers.15._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0017, -0.0017,  0.0030,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.16._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0025,  0.0132, -0.0027,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.17._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([ 0.0027,  0.0103, -0.0090,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.18._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0067, -0.0047,  0.0028,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.19._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0075,  0.0114, -0.0037,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.20._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0069,  0.0069,  0.0075,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.21._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0037, 0.0070, 0.0135,  ..., 0.0000, 0.0000, 0.0000],
       requires_grad=True)
name decoder.layers.22._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([-0.0019,  0.0082, -0.0061,  ...,  0.0000,  0.0000,  0.0000],
       requires_grad=True)
name decoder.layers.23._fsdp_wrapped_module.flat_param_0 parameters Parameter containing:
tensor([0.0134, 0.0073, 0.0100,  ..., 0.0000, 0.0000, 0.0000],
       requires_grad=True)
name decoder.layer_norm.weight parameters Parameter containing:
tensor([1., 1., 1.,  ..., 1., 1., 1.], requires_grad=True)
name decoder.layer_norm.bias parameters Parameter containing:
tensor([0., 0., 0.,  ..., 0., 0., 0.], requires_grad=True)
Loaded model
model_loading_time=41.0 seconds
model_loading_time_cuda=41.6 seconds
Inferring max tokens for model...
Traceback (most recent call last):
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 893, in <module>
    cli_main()
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 56, in cli_main
    run_evaluations_from_model_name(**vars(args))
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/gpt3_eval.py", line 320, in run_evaluations_from_model_name
    results = load_lm_and_run_func(run_evaluations, model_name, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 178, in load_lm_and_run_func
    distributed_utils.call_main(
  File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 215, in call_main
    torch.multiprocessing.spawn(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/private/home/tbmihaylov/metaseq/fairseq/distributed/utils.py", line 199, in distributed_main
    main(cfg, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 261, in _load_lm_and_run_func
    max_tokens = get_or_infer_max_tokens(model, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 378, in get_or_infer_max_tokens
    return infer_max_tokens_before_oom(model)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 416, in infer_max_tokens_before_oom
    while not is_max_tokens_oom(candidate_max_tokens):
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 409, in is_max_tokens_oom
    raise e
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/models.py", line 405, in is_max_tokens_oom
    model.score(input_texts, batch_size=local_bsz, batch_by_size=False)
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 198, in score
    for hypos in self.generate(
  File "/private/home/tbmihaylov/metaseq/fairseq/eval/hub_utils.py", line 253, in generate
    translations = self.task.inference_step(
  File "/private/home/tbmihaylov/metaseq/fairseq/tasks/language_modeling_inference_for_models_trained_with_streaming.py", line 387, in inference_step
    return generator.generate(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/sequence_scorer.py", line 63, in generate
    decoder_out = model(**net_input)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1403, in forward
    outputs = self.module(*args, **kwargs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/fairscale-metaseq_20220328/fairscale/nn/misc/flatten_params_wrapper.py", line 487, in forward
    return self.module(*inputs, **kwinputs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/models/fairseq_model.py", line 373, in forward
    return self.decoder(src_tokens, **kwargs)
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 643, in forward
    x, extra = self.extract_features(
  File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 668, in extract_features
    return self.extract_features_scriptable(
  File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 706, in extract_features_scriptable
    x, tok, pos = self.forward_embedding(
  File "/private/home/tbmihaylov/metaseq/fairseq/models/transformer.py", line 575, in forward_embedding
    positions = self.embed_positions(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/private/home/tbmihaylov/metaseq/fairseq/modules/learned_positional_embedding.py", line 53, in forward
    return F.embedding(
  File "/private/home/tbmihaylov/.conda/envs/metaseq_20220328/lib/python3.8/site-packages/torch/nn/functional.py", line 2043, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking arugment for argument index in method wrapper_index_select)

Track all of our RNG offsets to avoid collisions

We have RNG seed offsets sprinkled through the codebase:

(base) √ metaseq % ag seed --py | grep +
cpu_tests/test_streaming_token_block_dataset.py:78:        shadow_rng = np.random.default_rng(2273 + seed)
cpu_tests/test_streaming_token_block_dataset.py:124:        shadow_rng = np.random.default_rng(2273 + seed)
metaseq/tasks/language_modeling.py:217:            with data_utils.numpy_seed(self.args.seed + epoch):
metaseq/tasks/streaming_language_modeling.py:316:            seed=1284 + self.args.seed,
metaseq/trainer.py:1052:        seed = self.cfg.common.seed + self.get_num_updates()
metaseq/data/streaming_token_block_dataset.py:96:            rng = np.random.default_rng(2273 + self.seed)
metaseq/data/iterators.py:524:                batches = shuffle_batches(list(batches), self.seed + epoch)
metaseq/data/iterators.py:532:                batches = shuffle_batches(batches, self.seed + epoch + self.shard_id)
metaseq/data/iterators.py:535:                batches = shuffle_batches(list(self.frozen_batches), self.seed + epoch)

Would be good to track these offset to avoid collisions, in cases we're assuming no collision/coupling via offsets.

How to load sharded checkpoints?

❓ Questions and Help

After having set-up the libraries as described in: https://github.com/facebookresearch/metaseq/blob/main/docs/setup.md ,
it is possible to load the 350m checkpoint since it's not sharded as follows:

wget https://dl.fbaipublicfiles.com/opt/v1_20220502/350m/reshard.pt ./

Next we need to comment out one line in the Megatron-LM library which is only relevant for training (initialize different random seeds accross pp ranks):
Comment out this line: https://github.com/ngoyal2707/Megatron-LM/blob/ae0b844c1f6725c3433a95e42cac760b3885170b/megatron/initialize.py#L65 in your local clone of Megatron-LM
Now we write the following Python script to a run_model.py file:

import os

from transformers import AutoTokenizer, GPT2Tokenizer
from megatron.initialize import initialize_megatron
from metaseq import checkpoint_utils
import torch

path = "./"

# arguments taken from: https://arxiv.org/pdf/2205.01068.pdf | table 1
initialize_megatron(args_defaults={
    "micro_batch_size": 1, 
    "num_layers": 24, 
    "hidden_size": 1024, 
    "num_attention_heads": 16,
    "max_position_embeddings": 2048, 
    "encoder_seq_length": 2048 
})

tokenizer = GPT2Tokenizer.from_pretrained("patrickvonplaten/opt_gpt2_tokenizer")
tokenizer.save_pretrained(path)

checkpoint = checkpoint_utils.load_model_ensemble_and_task(
    [os.path.join(path, "reshard.pt")],
    arg_overrides={
        "vocab_filename": os.path.join(path, "vocab.json"),
        "merges_filename": os.path.join(path, "merges.txt"),
    }
)

model = checkpoint[0][0].eval()

We can load the checkpoint when running

torchrun run_model.py --pipeline-model-parallel-size 1 --tensor-model-parallel-size 1

Problem This only works for the 350m checkpoint!!! For the other checkpoints this doesn't work.
E.g. when replacing:
[os.path.join(path, "reshard.pt")]
by
[os.path.join(path, "reshard-model_part-0.pt"), os.path.join(path, "reshard-model_part-1.pt")] (part-0 and part-1 of the 125M model),
we're getting an error because the weigths are all flattened into 1D-arrays.

Using #29 sadly also doesn't help, since the checkpoints don't seem to be in the *shard* format as required here:

metaseq/metaseq/distributed/stitch_fsdp_ckpt.py

Line 45 in 48b9b6c

sorted(glob(f"{pth_prefix}*shard*.pt"), key=_get_shard_number)

The parameter flattening seems to come from Fairscale and we've found some functionality to unflatten it here: https://github.com/facebookresearch/fairscale/blob/51b53ddb6c3aa77426c7d5cc0b543b79628053c4/fairscale/nn/misc/flatten_params_wrapper.py#L358 , but we don't manage to wrap our head around how to make it work exactly.

@stephenroller @suchenzang @zhiqwang - any pointers on how we could load the 125M model (and the others) into a model instance of metaseq?

How to use 1.3B OPT weights in `metaseq-api-local` API?

I'm trying to use 1.3B OPT weights in metaseq-api-local API and test some inference examples. But I got stucked.

There are two reshard files of 1.3B OPT. But CHECKPOINT_LOCAL in constants.py can only point to a reshard.pt file. What is the correct way to load these two reshard files? (P.S. the comments in constants.py says MODEL_SHARED_FOLDER = "/example/175B/reshard_no_os" but there is CHECKPOINT_FOLDER = os.path.join(MODEL_SHARED_FOLDER, "175B", "reshard_no_os") in line 49, is this also a mistake?

Enable more granular initialization strategies

One example is that positional embeddings are set to have half the standard deviation of token embeddings in GPT-2: https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/model.py#L152-L155

We have also seen the 175B model learn this distinction, as token embedding norms ended up being roughly ~1.5x positional embedding norms.

Inconsistency between valid ppl from TensorBoard and eval_lm.py

During the training of a 125M model I observe a relatively smooth valid ppl curve, with some minor jumps. For example, between steps 100K and 156K of the training, valid/redditflattened ppl shown on Tensorboard goes from ~40 to ~39.

If I run eval_lm.py on the above two snapshots on the very same validation sets (local changes: P490864569, command and output P490866721), I get very different numbers: 45 and 630 for consolidated checkpoints from step 100K and 156K respectively.

When I run gpt3_eval, average perplexities on the correct prompt ("ppl_answer_correct_gold" field from the results json) follow a pattern similar to eval_lm: they go from ~200 to ~1600 between these two checkpoints.

We reproduced the same results on AWS and Azure independently with @punitkoura and at this point we're unsure what's going on.

We would like to either:

Get someone familiar with the code say that this does not impact model evaluation with gpt3_eval and as such this is a low-pri issue
Or have someone help debugging why this is happening and potentially fix any inconsistency

Remove eos padding when running generation workflow

We prefix an eos token when running generation (https://github.com/fairinternal/fairseq-py/blob/roller/consolidated_interactive/fairseq/tasks/language_modeling.py#L319). This should be removed.

Per @stephenroller: It's worth trying but I consider it low priority now.

	dict_path = os.path.join(os.path.dirname(CHECKPOINT_LOCAL), "dict.txt")
	if not os.path.exists(dict_path):
	with open(dict_path, "w+") as f:
	f.write("\n".join([f"{i} 1" for i in range(4, 50271 + 1)]))
	logger.info("Hackishly generated a dict.txt for use with 175B model")

	# hack to be able to set Namespace in dict config. this should be removed when we update to newer
	# omegaconf version that supports object flags, or when we migrate all existing models

facebookresearch / metaseq Goto Github PK

metaseq's Introduction

Metaseq

Community Integrations

Using OPT with 🤗 Transformers

Using OPT-175B with Alpa

Using OPT with Colossal-AI

Using OPT with CTranslate2

Using OPT with FasterTransformer

Using OPT with DeepSpeed

Getting Started in Metaseq

Documentation on workflows

Background Info

Support

Contributing

The Team

License

metaseq's People

Contributors

Stargazers

Watchers

Forkers

metaseq's Issues

🐛 Bug

To Reproduce

Additional context

🐛 Bug

Error

Expected behavior

Environment

To Reproduce

🐛 Bug

To Reproduce

Expected behavior

Environment

Additional context

❓ Questions and Help

What is your question?

Code

What have you tried?

What's your environment?

Testing

📚 Documentation

🚀 Feature Request

Motivation

🐛 Bug

To Reproduce

Code sample

Expected behavior

Environment

Additional context

🐛 Bug

To Reproduce

Code sample

Expected behavior

Environment

Additional context

🐛 Bug

To Reproduce

Code sample

Expected behavior

Environment

Additional context

❓ Questions and Help

Before asking:

What is your question?

Code

What have you tried?

What's your environment?

🚀 Feature Request

Motivation

Pitch

🐛 Bug

To Reproduce

Expected behavior

Environment

Additional context

❓ Questions and Help

Recommend Projects

Recommend Topics