
Comments (2)

liangwq commented on September 15, 2024

[image attachment]
You are loading the GLM model the AutoModel way. There is a modeling_chatglm.py file inside the model directory; try putting that file in and see if it works.
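
A minimal sketch of that fix, assuming the ChatGLM checkpoint was downloaded to a local directory (both paths below are hypothetical and need adjusting to the actual layout):

# Copy the custom model code out of the checkpoint directory and place it
# next to the training script, so Python can import it locally; the filename
# is NOT passed on the command line.
cp ~/glm/chatglm-6b/modeling_chatglm.py ~/glm/Chatglm_lora_multi-gpu-main/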


WXD7 commented on September 15, 2024

Thanks for the guidance :)

I don't quite understand what "there is a modeling_chatglm.py file inside the model directory; try putting that file in" means. Does it mean appending modeling_chatglm.py after the command-line arguments? I tried that, but it errors too.

Sorry, I'm a bit of a beginner.

The run goes as follows:
(GLM) wxd7@wxd7-EG341W-G21:~/glm/Chatglm_lora_multi-gpu-main$ torchrun --nproc_per_node=2 multi_gpu_fintune_belle.py --dataset_path /home/wxd7/glm/ChatGLM-Tuning-master/data/alpaca --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --save_steps 2000 --save_total_limit 2 --learning_rate 2e-5 --fp16 --num_train_epochs 2 --remove_unused_columns false --logging_steps 50 --report_to wandb --output_dir output --deepspeed ds_config_zero3.json modeling_chatglm.py
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: CUDA runtime path found: /home/wxd7/anaconda3/envs/GLM/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 360, in <module>
    main()
  File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 211, in main
    ).parse_args_into_dataclasses()
  File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--modeling_chatglm.py']
Traceback (most recent call last):
  File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 360, in <module>
    main()
  File "/home/wxd7/glm/Chatglm_lora_multi-gpu-main/multi_gpu_fintune_belle.py", line 211, in main
    ).parse_args_into_dataclasses()
  File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--modeling_chatglm.py']
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34807) of binary: /home/wxd7/anaconda3/envs/GLM/bin/python
Traceback (most recent call last):
  File "/home/wxd7/anaconda3/envs/GLM/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wxd7/anaconda3/envs/GLM/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

multi_gpu_fintune_belle.py FAILED

Failures:
  [1]:
    time      : 2023-04-13_12:11:39
    host      : wxd7-EG341W-G21
    rank      : 1 (local_rank: 1)
    exitcode  : 1 (pid: 34808)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time      : 2023-04-13_12:11:39
    host      : wxd7-EG341W-G21
    rank      : 0 (local_rank: 0)
    exitcode  : 1 (pid: 34807)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
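
The traceback above shows the actual problem: HfArgumentParser accepts only the arguments declared in the script's dataclasses, so the trailing modeling_chatglm.py is reported as an unused argument and raises the ValueError, which torchrun then surfaces on both ranks. A sketch of the corrected invocation, assuming modeling_chatglm.py has been copied next to the script as suggested above, is simply the same command with the stray filename removed:

# Same flags as in the log above; only the trailing modeling_chatglm.py is dropped.
torchrun --nproc_per_node=2 multi_gpu_fintune_belle.py \
    --dataset_path /home/wxd7/glm/ChatGLM-Tuning-master/data/alpaca \
    --lora_rank 8 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 \
    --save_steps 2000 --save_total_limit 2 --learning_rate 2e-5 --fp16 \
    --num_train_epochs 2 --remove_unused_columns false --logging_steps 50 \
    --report_to wandb --output_dir output --deepspeed ds_config_zero3.json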
