Comments (1)
Same issue here. I used ZeRO-3, ZeRO Init, and deepspeed 0.14.2. Here is the list of my pip versions:
Package Version
accelerate 0.31.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.4.0
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.6.2
charset-normalizer 3.3.2
dataclasses-json 0.6.7
datasets 2.14.7
deepspeed 0.14.2
dill 0.3.7
distro 1.9.0
docstring_parser 0.16
evaluate 0.4.1
filelock 3.13.1
frozenlist 1.4.1
fsspec 2023.10.0
greenlet 3.0.3
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.3
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
langchain 0.2.6
langchain-community 0.2.6
langchain-core 0.2.10
langchain-text-splitters 0.2.2
langsmith 0.1.82
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.21.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
mypy-extensions 1.0.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
openai 1.12.0
orjson 3.10.5
packaging 24.1
pandas 2.2.0
peft 0.11.1
pillow 10.2.0
pip 24.0
protobuf 4.25.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pynvml 11.5.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
responses 0.18.0
rich 13.7.1
safetensors 0.4.3
scikit-learn 1.4.0
scipy 1.14.0
sentencepiece 0.1.99
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
sniffio 1.3.1
SQLAlchemy 2.0.31
sympy 1.12
tenacity 8.2.3
threadpoolctl 3.5.0
tiktoken 0.6.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.41.2
triton 2.3.1
trl 0.9.4
typing_extensions 4.9.0
typing-inspect 0.9.0
tyro 0.8.5
tzdata 2024.1
urllib3 2.2.2
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
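For context, ZeRO-3 with ZeRO Init in this setup would typically be enabled through an accelerate config along these lines (a sketch only; the key names follow accelerate's DeepSpeed plugin, and the offload values are placeholders matching the config dump below, not copied from my actual launch files):

```yaml
# Hypothetical accelerate config fragment -- not my exact file.
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: true             # ZeRO Init: shard weights at construction time
  offload_optimizer_device: nvme
  offload_param_device: cpu
```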
I encountered this error:
[2024-06-26 12:43:15,076] [INFO] [config.py:986:print_user_config] json = {
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 2e-05,
"warmup_num_steps": 1000
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"pin_memory": true,
"nvme_path": "/home/xxx/git/sep/tmp",
"buffer_count": 40
},
"offload_param": {
"device": "cpu",
"pin_memory": true,
"nvme_path": "/home/xxx/git/sep/tmp2",
"buffer_count": 40
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"reduce_bucket_size": 1.000000e+06,
"stage3_prefetch_bucket_size": 1.509949e+07,
"stage3_param_persistence_threshold": 4.096000e+04,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_gather_16bit_weights_on_model_save": false
},
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"train_batch_size": 1,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": false,
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
reward_model_name: ./saved_models/reward_model_vicuna-7b-adapter-merged
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
warnings.warn(warning_msg)
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at ./saved_models/reward_model_vicuna-7b-adapter-merged and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0it [00:00, ?it/s]/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:426: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
warnings.warn(
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/pipelines/text_classification.py:104: UserWarning: `return_all_scores` is now deprecated, if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
warnings.warn(
Invalidate trace cache @ step 10: expected module 1048, but got module 1055
0it [00:10, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/xxx/git/sep/main.py", line 94, in <module>
[rank0]: exp_model.train()
[rank0]: File "/home/xxx/git/sep/exp/exp_model.py", line 87, in train
[rank0]: tuning_lm_with_rl(self.args)
[rank0]: File "/home/xxx/git/sep/predict_module/tuning_lm_with_rl.py", line 264, in tuning_lm_with_rl
[rank0]: stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]: return func(*args, **kwds)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 749, in step
[rank0]: ref_logprobs, ref_logits_or_none, _, _ = self.batched_forward_pass(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]: return func(*args, **kwds)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 1013, in batched_forward_pass
[rank0]: logits, _, values = model(**input_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1595, in _call_impl
[rank0]: hook_result = hook(self, args, result)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 232, in _end_of_forward_hook
[rank0]: self.get_param_coordinator(training=False).reset_step()
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 207, in reset_step
[rank0]: raise RuntimeError(f"still have inflight params "
[rank0]: RuntimeError: still have inflight params [{'id': 3, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 7, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 16, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 20, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}]
E0626 12:43:31.197000 125351468656448 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3290) of binary: /home/xxx/anaconda3/envs/ml/bin/python
Traceback (most recent call last):
File "/home/xxx/anaconda3/envs/ml/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
deepspeed_launcher(args)
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
distrib_run.run(args)
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-26_12:43:31
host : hcserver
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3290)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
from deepspeed.
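One commonly suggested mitigation for the "still have inflight params" error (not a confirmed fix, just what I have seen recommended in similar DeepSpeed issue threads) is to disable ZeRO-3's parameter prefetching, since the prefetcher relies on a cached forward trace that the "Invalidate trace cache" message shows is being thrown away. The keys below are standard `zero_optimization` options and the starting values mirror the config printed above:

```python
import json

# Sketch: start from the zero_optimization section printed in the log above,
# then disable prefetching and parameter reuse so the ZeRO-3 parameter
# coordinator fetches on demand instead of relying on the stale trace cache.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 1.509949e07,
        "stage3_max_reuse_distance": 1.000000e09,
    }
}

zero = ds_config["zero_optimization"]
zero["stage3_prefetch_bucket_size"] = 0  # no prefetch ahead of the traced order
zero["stage3_max_reuse_distance"] = 0    # release params right after each use

print(json.dumps(ds_config["zero_optimization"], indent=2))
```

This trades some throughput for correctness, since parameters are gathered lazily each step instead of being prefetched.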