sxhysj commented on July 22, 2024

Same issue here. I used ZeRO-3, ZeRO Init and deepspeed 0.14.2. Here is the list of my pip versions:
Package Version

accelerate 0.31.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.4.0
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.6.2
charset-normalizer 3.3.2
dataclasses-json 0.6.7
datasets 2.14.7
deepspeed 0.14.2
dill 0.3.7
distro 1.9.0
docstring_parser 0.16
evaluate 0.4.1
filelock 3.13.1
frozenlist 1.4.1
fsspec 2023.10.0
greenlet 3.0.3
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.3
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
langchain 0.2.6
langchain-community 0.2.6
langchain-core 0.2.10
langchain-text-splitters 0.2.2
langsmith 0.1.82
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.21.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
mypy-extensions 1.0.0
networkx 3.2.1
numpy 1.26.4
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
openai 1.12.0
orjson 3.10.5
packaging 24.1
pandas 2.2.0
peft 0.11.1
pillow 10.2.0
pip 24.0
protobuf 4.25.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pynvml 11.5.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
responses 0.18.0
rich 13.7.1
safetensors 0.4.3
scikit-learn 1.4.0
scipy 1.14.0
sentencepiece 0.1.99
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
sniffio 1.3.1
SQLAlchemy 2.0.31
sympy 1.12
tenacity 8.2.3
threadpoolctl 3.5.0
tiktoken 0.6.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.41.2
triton 2.3.1
trl 0.9.4
typing_extensions 4.9.0
typing-inspect 0.9.0
tyro 0.8.5
tzdata 2024.1
urllib3 2.2.2
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
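
Only a handful of the packages above sit on the DeepSpeed / PPO training path, so a shorter version report may be easier to compare across machines. A minimal sketch, assuming Python 3.8+ and using only the standard library's `importlib.metadata`; the `RELEVANT` shortlist and the `collect_versions` helper are illustrative names chosen here, not part of any of these libraries:

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical shortlist: packages from the list above that are likely to
# matter for this error. Adjust it for your own setup.
RELEVANT = ("accelerate", "bitsandbytes", "deepspeed", "peft",
            "torch", "transformers", "trl")


def collect_versions(packages=RELEVANT):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name:<14}{ver or 'not installed'}")
```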

I encountered this error:

[2024-06-26 12:43:15,076] [INFO] []   json = {
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-05,
            "warmup_num_steps": 1000
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp",
            "buffer_count": 40
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true,
            "nvme_path": "/home/xxx/git/sep/tmp2",
            "buffer_count": 40
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": 1.000000e+06,
        "stage3_prefetch_bucket_size": 1.509949e+07,
        "stage3_param_persistence_threshold": 4.096000e+04,
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": false
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "steps_per_print": inf,
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": false,
    "bf16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true
}
reward_model_name:  ./saved_models/reward_model_vicuna-7b-adapter-merged
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/quantizers/ UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at ./saved_models/reward_model_vicuna-7b-adapter-merged and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0it [00:00, ?it/s]/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/bitsandbytes/nn/ UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/pipelines/ UserWarning: `return_all_scores` is now deprecated,  if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
Invalidate trace cache @ step 10: expected module 1048, but got module 1055
0it [00:10, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/xxx/git/sep/", line 94, in <module>
[rank0]:     exp_model.train()
[rank0]:   File "/home/xxx/git/sep/exp/", line 87, in train
[rank0]:     tuning_lm_with_rl(self.args)
[rank0]:   File "/home/xxx/git/sep/predict_module/", line 264, in tuning_lm_with_rl
[rank0]:     stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/anaconda3/envs/ml/lib/python3.11/", line 81, in inner
[rank0]:     return func(*args, **kwds)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/", line 749, in step
[rank0]:     ref_logprobs, ref_logits_or_none, _, _ = self.batched_forward_pass(
[rank0]:                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/", line 81, in inner
[rank0]:     return func(*args, **kwds)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/", line 1013, in batched_forward_pass
[rank0]:     logits, _, values = model(**input_kwargs)
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/", line 1855, in forward
[rank0]:     loss = self.module(*inputs, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/", line 1595, in _call_impl
[rank0]:     hook_result = hook(self, args, result)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/", line 15, in wrapped_fn
[rank0]:     ret_val = func(*args, **kwargs)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/", line 232, in _end_of_forward_hook
[rank0]:     self.get_param_coordinator(training=False).reset_step()
[rank0]:   File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/", line 207, in reset_step
[rank0]:     raise RuntimeError(f"still have inflight params "
[rank0]: RuntimeError: still have inflight params [{'id': 3, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 7, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 16, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 20, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}]
E0626 12:43:31.197000 125351468656448 torch/distributed/elastic/multiprocessing/] failed (exitcode: 1) local_rank: 0 (pid: 3290) of binary: /home/xxx/anaconda3/envs/ml/bin/python
Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/ml/bin/accelerate", line 8, in <module>
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/", line 48, in main
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/", line 1082, in launch_command
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/", line 786, in deepspeed_launcher
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/", line 870, in run
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/", line 263, in launch_agent
    raise ChildFailedError(
============================================================ FAILED
Root Cause (first observed failure):
  time      : 2024-06-26_12:43:31
  host      : hcserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3290)
  error_file: <N/A>
  traceback : To enable traceback see:
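
The `RuntimeError` message is long, so it can help to reduce it to one line per parameter. A rough sketch, assuming the message keeps the dict-like layout shown in the traceback above; `summarize_inflight` is a hypothetical helper written for this comment, not a DeepSpeed function, and the sample string is a shortened copy of the first two entries from the error:

```python
import re


def summarize_inflight(message):
    """Pull id, status, shape and persist out of each param dict in the message."""
    params = []
    # Each parameter is printed as a flat {...} with no nested braces.
    for chunk in re.findall(r"\{(.*?)\}", message):
        pid = re.search(r"'id': (\d+)", chunk)
        status = re.search(r"'status': '(\w+)'", chunk)
        # The leading quote keeps this from matching 'ds_shape' or 'grad_shape'.
        shape = re.search(r"'shape': \(([^)]*)\)", chunk)
        persist = re.search(r"'persist': (True|False)", chunk)
        if not (pid and status and shape and persist):
            continue  # skip anything that does not look like a param entry
        dims = tuple(int(d) for d in shape.group(1).split(",") if d.strip())
        params.append({
            "id": int(pid.group(1)),
            "status": status.group(1),
            "shape": dims,
            "persist": persist.group(1) == "True",
        })
    return params


# Shortened sample taken from the error message above.
SAMPLE = (
    "still have inflight params [{'id': 3, 'status': 'AVAILABLE', "
    "'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), "
    "'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, "
    "'persist': True, 'active_sub_modules': set(), "
    "'ds_tensor.shape': torch.Size([32768])}, {'id': 7, "
    "'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, "
    "'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, "
    "'grad_shape': None, 'persist': True, 'active_sub_modules': set(), "
    "'ds_tensor.shape': torch.Size([32768])}]"
)

if __name__ == "__main__":
    for p in summarize_inflight(SAMPLE):
        print(p)
```

In the full message all four parameters have 32768 elements, which is below the `stage3_param_persistence_threshold` of 40960 in the config above; that would be consistent with them being reported as `'persist': True`.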
