Comments (1)
Same issue here. I used ZeRO-3, ZeRO Init, and deepspeed 0.14.2. Here is the list of my pip versions:
Package Version
accelerate 0.31.0
aiohttp 3.9.5
aiosignal 1.3.1
annotated-types 0.7.0
anyio 4.4.0
attrs 23.2.0
bitsandbytes 0.43.1
certifi 2024.6.2
charset-normalizer 3.3.2
dataclasses-json 0.6.7
datasets 2.14.7
deepspeed 0.14.2
dill 0.3.7
distro 1.9.0
docstring_parser 0.16
evaluate 0.4.1
filelock 3.13.1
frozenlist 1.4.1
fsspec 2023.10.0
greenlet 3.0.3
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.23.4
idna 3.7
Jinja2 3.1.3
joblib 1.4.2
jsonpatch 1.33
jsonpointer 3.0.0
langchain 0.2.6
langchain-community 0.2.6
langchain-core 0.2.10
langchain-text-splitters 0.2.2
langsmith 0.1.82
markdown-it-py 3.0.0
MarkupSafe 2.1.5
marshmallow 3.21.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.15
mypy-extensions 1.0.0
networkx 3.2.1
ninja 1.11.1.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
openai 1.12.0
orjson 3.10.5
packaging 24.1
pandas 2.2.0
peft 0.11.1
pillow 10.2.0
pip 24.0
protobuf 4.25.2
psutil 6.0.0
py-cpuinfo 9.0.0
pyarrow 16.1.0
pyarrow-hotfix 0.6
pydantic 2.7.4
pydantic_core 2.18.4
Pygments 2.18.0
pynvml 11.5.0
python-dateutil 2.9.0.post0
pytz 2024.1
PyYAML 6.0.1
regex 2024.5.15
requests 2.32.3
responses 0.18.0
rich 13.7.1
safetensors 0.4.3
scikit-learn 1.4.0
scipy 1.14.0
sentencepiece 0.1.99
setuptools 69.5.1
shtab 1.7.1
six 1.16.0
sniffio 1.3.1
SQLAlchemy 2.0.31
sympy 1.12
tenacity 8.2.3
threadpoolctl 3.5.0
tiktoken 0.6.0
tokenizers 0.19.1
torch 2.3.1+cu121
torchaudio 2.3.1+cu121
torchvision 0.18.1+cu121
tqdm 4.66.4
transformers 4.41.2
triton 2.3.1
trl 0.9.4
typing_extensions 4.9.0
typing-inspect 0.9.0
tyro 0.8.5
tzdata 2024.1
urllib3 2.2.2
wheel 0.43.0
xxhash 3.4.1
yarl 1.9.4
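For context, ZeRO-3 with ZeRO Init in this setup would typically be enabled through an accelerate config along these lines (a sketch only; the key names follow accelerate's DeepSpeed plugin, and the offload values are placeholders matching the config dump below, not copied from my actual launch files):

```yaml
# Hypothetical accelerate config fragment -- not my exact file.
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: true             # ZeRO Init: shard weights at construction time
  offload_optimizer_device: nvme
  offload_param_device: cpu
```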
I encountered this error:
[2024-06-26 12:43:15,076] [INFO] [config.py:986:print_user_config] json = {
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 2e-05,
"warmup_num_steps": 1000
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"pin_memory": true,
"nvme_path": "/home/xxx/git/sep/tmp",
"buffer_count": 40
},
"offload_param": {
"device": "cpu",
"pin_memory": true,
"nvme_path": "/home/xxx/git/sep/tmp2",
"buffer_count": 40
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"reduce_bucket_size": 1.000000e+06,
"stage3_prefetch_bucket_size": 1.509949e+07,
"stage3_param_persistence_threshold": 4.096000e+04,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_gather_16bit_weights_on_model_save": false
},
"gradient_accumulation_steps": 1,
"gradient_clipping": 1.0,
"steps_per_print": inf,
"train_batch_size": 1,
"train_micro_batch_size_per_gpu": 1,
"wall_clock_breakdown": false,
"bf16": {
"enabled": false
},
"zero_allow_untested_optimizer": true
}
reward_model_name: ./saved_models/reward_model_vicuna-7b-adapter-merged
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/quantizers/auto.py:167: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.
warnings.warn(warning_msg)
Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at ./saved_models/reward_model_vicuna-7b-adapter-merged and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
0it [00:00, ?it/s]/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:426: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
warnings.warn(
/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/transformers/pipelines/text_classification.py:104: UserWarning: `return_all_scores` is now deprecated, if want a similar functionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
warnings.warn(
Invalidate trace cache @ step 10: expected module 1048, but got module 1055
0it [00:10, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/xxx/git/sep/main.py", line 94, in <module>
[rank0]: exp_model.train()
[rank0]: File "/home/xxx/git/sep/exp/exp_model.py", line 87, in train
[rank0]: tuning_lm_with_rl(self.args)
[rank0]: File "/home/xxx/git/sep/predict_module/tuning_lm_with_rl.py", line 264, in tuning_lm_with_rl
[rank0]: stats = ppo_trainer.step(question_tensors, response_tensors, rewards)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]: return func(*args, **kwds)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 749, in step
[rank0]: ref_logprobs, ref_logits_or_none, _, _ = self.batched_forward_pass(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/contextlib.py", line 81, in inner
[rank0]: return func(*args, **kwds)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/trl/trainer/ppo_trainer.py", line 1013, in batched_forward_pass
[rank0]: logits, _, values = model(**input_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1855, in forward
[rank0]: loss = self.module(*inputs, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1595, in _call_impl
[rank0]: hook_result = hook(self, args, result)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 232, in _end_of_forward_hook
[rank0]: self.get_param_coordinator(training=False).reset_step()
[rank0]: File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 207, in reset_step
[rank0]: raise RuntimeError(f"still have inflight params "
[rank0]: RuntimeError: still have inflight params [{'id': 3, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 7, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 16, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}, {'id': 20, 'status': 'AVAILABLE', 'numel': 32768, 'ds_numel': 32768, 'shape': (4096, 8), 'ds_shape': (4096, 8), 'requires_grad': False, 'grad_shape': None, 'persist': True, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([32768])}]
E0626 12:43:31.197000 125351468656448 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 3290) of binary: /home/xxx/anaconda3/envs/ml/bin/python
Traceback (most recent call last):
File "/home/xxx/anaconda3/envs/ml/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
deepspeed_launcher(args)
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/accelerate/commands/launch.py", line 786, in deepspeed_launcher
distrib_run.run(args)
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xxx/anaconda3/envs/ml/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-26_12:43:31
host : hcserver
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3290)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
from deepspeed.
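One commonly suggested mitigation for the "still have inflight params" error (not a confirmed fix, just what I have seen recommended in similar DeepSpeed issue threads) is to disable ZeRO-3's parameter prefetching, since the prefetcher relies on a cached forward trace that the "Invalidate trace cache" message shows is being thrown away. The keys below are standard `zero_optimization` options and the starting values mirror the config printed above:

```python
import json

# Sketch: start from the zero_optimization section printed in the log above,
# then disable prefetching and parameter reuse so the ZeRO-3 parameter
# coordinator fetches on demand instead of relying on the stale trace cache.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_prefetch_bucket_size": 1.509949e07,
        "stage3_max_reuse_distance": 1.000000e09,
    }
}

zero = ds_config["zero_optimization"]
zero["stage3_prefetch_bucket_size"] = 0  # no prefetch ahead of the traced order
zero["stage3_max_reuse_distance"] = 0    # release params right after each use

print(json.dumps(ds_config["zero_optimization"], indent=2))
```

This trades some throughput for correctness, since parameters are gathered lazily each step instead of being prefetched.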