Comments (15)
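My guess is that this happens because skip_init does not allocate memory for the affected weights. After I switched every module that was initialized with skip_init back to its original initialization, the error no longer appeared.
@Youggls Could you explain in detail how you changed it? Thanks.
@ray075hl
In modeling_chatglm.py, many modules are initialized with skip_init, and all of them need to be changed. The initialization is modified as shown in this screenshot:
[screenshot]
Here, the initialization of the self.dense_h_to_4h module should be changed to:
[screenshot]
However, this may slow down model loading. Also, my machine has 2 x Titan Xp with 12 GB of memory per card, and it still runs out of GPU memory with ZeRO-3 enabled.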
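The screenshots in this comment did not survive, so here is a minimal sketch of the kind of change being described, assuming the layer is a plain `torch.nn.Linear`. The sizes and the names `hidden_size` and `inner_hidden_size` are illustrative assumptions, not the actual code from modeling_chatglm.py.

```python
import torch
from torch.nn.utils import skip_init

# Illustrative sizes only; the real model uses much larger values.
hidden_size, inner_hidden_size = 8, 32

# Before: skip_init builds the module on the 'meta' device and then calls
# to_empty(), so the parameter values are never initialized.
dense_skip = skip_init(torch.nn.Linear, hidden_size, inner_hidden_size,
                       dtype=torch.float32)

# After: call the constructor directly, which creates and initializes
# real tensors (slower, because the init code actually runs).
dense_plain = torch.nn.Linear(hidden_size, inner_hidden_size,
                              dtype=torch.float32)

print(dense_skip.weight.shape, dense_plain.weight.shape)
```

Both modules end up with the same parameter shapes; the difference is only in how the parameters are created, which is what seems to matter once DeepSpeed wraps module construction.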
from chatglm_lora_multi-gpu.
@Youggls Please help me out 0.0
I ran into the error below and can't tell what is going wrong:
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Using /home/la/.cache/torch_extensions/py38_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.000453948974609375 seconds
0%| | 0/10000 [00:00<?, ?it/s]/home2/la/chatgml-tuning/modeling_chatglm.py:266: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at /opt/conda/conda-bld/pytorch_1659484810403/work/aten/src/ATen/native/cuda/Indexing.cu:1239.)
  attention_scores.masked_fill_(attention_mask.byte(), -10000.0)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 116311 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 116312) of binary: /home/la/anaconda3/envs/chatglm-tuning/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/chatglm-tuning/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/chatglm-tuning/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
This looks like a multi-GPU problem.
model = ChatGLMForConditionalGeneration.from_pretrained(model_id, cache_dir='./', trust_remote_code=True, torch_dtype=torch.float16).cuda()
Try specifying the CUDA device explicitly.
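The "Expected all tensors to be on the same device" error usually means the embedding weights and the input ids live on different GPUs. One way to rule that out, sketched here with a hypothetical helper `move_batch_to_model_device` and a tiny stand-in model so it runs on CPU, is to move every batch onto whatever device the model's parameters are on:

```python
import torch

def move_batch_to_model_device(model, batch):
    """Return a copy of `batch` with every tensor on the model's device."""
    device = next(model.parameters()).device
    return {k: v.to(device) if torch.is_tensor(v) else v
            for k, v in batch.items()}

# Tiny stand-in model so the sketch runs on CPU; a real model would first
# be placed on a single GPU, for example with model.cuda(local_rank).
model = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
batch = {"input_ids": torch.tensor([[1, 2, 3]])}

batch = move_batch_to_model_device(model, batch)
out = model(batch["input_ids"])
print(out.shape)
```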
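With the code as it stands you need at least 32 GB, so you will have to reduce the batch size.
Training with low-bit quantization would probably take a fair amount of rework.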
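A rough back-of-the-envelope estimate helps explain why memory is so tight. The sketch below only counts the fp16 weights and assumes roughly 6.2 billion parameters for ChatGLM-6B (an assumption, not a number from this thread); activations, gradients and optimizer state come on top of it.

```python
def weight_gib(n_params: float, bytes_per_param: int = 2, n_shards: int = 1) -> float:
    """GiB needed to hold the weights, optionally split evenly across GPUs."""
    return n_params * bytes_per_param / n_shards / 2**30

N_PARAMS = 6.2e9  # assumption: approximate ChatGLM-6B parameter count

full = weight_gib(N_PARAMS)                 # fp16 weights on a single GPU
sharded = weight_gib(N_PARAMS, n_shards=4)  # weights split across 4 GPUs
print(f"fp16 weights, 1 GPU: {full:.1f} GiB; split over 4 GPUs: {sharded:.1f} GiB")
```

Under these assumptions the weights alone come close to filling a 12 GB card, which is why sharding them across GPUs or shrinking the batch size matters.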
After switching DeepSpeed to stage 3 I get the error below. @liangwq, have you run into this? It seems to be caused by skip_init, and I don't know how to fix it.
NotImplementedError: Cannot copy out of meta tensor; no data!
tensor(..., device='meta', size=(308281344,), dtype=torch.float16,
grad_fn=)
│ │
│ /usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:355 in │
│ wrapper │
│ │
│ 352 │ │ │ │ │ is_child_module = True │
│ 353 │ │ │ │ │ setattr(module, "_ds_child_entered", True) │
│ 354 │ │ │ │ │
│ ❱ 355 │ │ │ │ f(module, *args, **kwargs) │
│ 356 │ │ │ │ │
│ 357 │ │ │ │ if is_child_module: │
│ 358 │ │ │ │ │ # child's init is done, now we can run a single post_init on the │
│ │
│ /HanLP3/Chatglm_lora_multi-gpu/modeling_chatglm.py:726 in __init__ │
│ │
│ 723 │ │ self.position_encoding_2d = config.position_encoding_2d │
│ 724 │ │ self.model_parallel = True │
│ 725 │ │ │
│ ❱ 726 │ │ self.word_embeddings = skip_init( │
│ 727 │ │ │ torch.nn.Embedding, │
│ 728 │ │ │ num_embeddings=self.vocab_size, embedding_dim=self.hidden_size, │
│ 729 │ │ │ dtype=self.params_dtype │
│ │
│ /usr/local/lib/python3.8/dist-packages/torch/nn/utils/init.py:52 in skip_init │
│ │
│ 49 │ │
│ 50 │ final_device = kwargs.pop('device', 'cpu') │
│ 51 │ kwargs['device'] = 'meta' │
│ ❱ 52 │ return module_cls(*args, **kwargs).to_empty(device=final_device) │
│ 53 │
│ │
│ /usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:363 in │
│ wrapper │
│ │
│ 360 │ │ │ │ │ │
│ 361 │ │ │ │ │ print_rank_0(f'Running post_init for {module.__class__.__name__}', │
│ 362 │ │ │ │ │ │ │ │ force=False) │
│ ❱ 363 │ │ │ │ │ self._post_init_method(module) │
│ 364 │ │ │ │ │
│ 365 │ │ │ │ print_rank_0( │
│ 366 │ │ │ │ │ f'After initializing followed by post init for {module.__class__.__n │
│ │
│ /usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:760 in │
│ _post_init_method │
│ │
│ 757 │ │ │ │ │ │ logger.warn(f"param {name} in {module.__class__.__name__} " │
│ 758 │ │ │ │ │ │ │ │ │ f"not on GPU so was not broadcasted from rank 0") │
│ 759 │ │ │ │ │
│ ❱ 760 │ │ │ │ param.partition() │
│ 761 │ │ see_memory_usage( │
│ 762 │ │ │ f"Param count {param_count}. After converting and partitioning parmas in {mo │
│ 763 │ │ │ force=False) │
│ │
│ /usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:894 in │
│ partition │
│ │
│ 891 │ │ │ ) │
│ 892 │ │ │ if param_list is None: │
│ 893 │ │ │ │ param_list = [cls] │
│ ❱ 894 │ │ │ self._partition(param_list, has_been_updated=has_been_updated) │
│ 895 │ │ │
│ 896 │ │ def reduce_gradients_at_owner(param_list=None, hierarchy=0): │
│ 897 │ │ │ cls = param │
│ │
│ /usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:1038 in │
│ _partition │
│ │
│ 1035 │ │ for param in param_list: │
│ 1036 │ │ │ #print_rank_0(f"Before Partitioning Param {param.ds_id}") │
│ 1037 │ │ │ # self._param_status(param) │
│ ❱ 1038 │ │ │ self._partition_param(param, has_been_updated=has_been_updated) │
│ 1039 │ │ │ param.ds_status = ZeroParamStatus.NOT_AVAILABLE │
│ 1040 │ │ │ # if param.ds_tensor is not None: │
│ 1041 │ │ │ # assert id(param.data) == id(param.ds_tensor.data), \ │
│ │
│ /usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py:11 in wrapped_fn │
│ │
│ 8 │ function call.""" │
│ 9 │ def wrapped_fn(*args, **kwargs): │
│ 10 │ │ get_accelerator().range_push(func.__qualname__) │
│ ❱ 11 │ │ ret_val = func(*args, **kwargs) │
│ 12 │ │ get_accelerator().range_pop() │
│ 13 │ │ return ret_val │
│ 14 │
│ │
│ /usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py:1128 in │
│ _partition_param │
│ │
│ 1125 │ │ │ if start < param.ds_numel and end <= param.ds_numel: │
│ 1126 │ │ │ │ src_tensor = one_dim_param.narrow(0, start, partition_size) │
│ 1127 │ │ │ │ print(src_tensor) │
│ ❱ 1128 │ │ │ │ param.ds_tensor.copy_(src_tensor) │
│ 1129 │ │ │ │ #partitioned_tensor = src_tensor.clone().detach().to(self.remote_device) │
│ 1130 │ │ │ │
│ 1131 │ │ │ else: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
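The core of this traceback can be reproduced in isolation. The following sketch (not DeepSpeed code, just plain PyTorch) shows that a tensor on the `meta` device has a shape and dtype but no data, so copying out of it fails in the same way as in the `copy_` call shown in the traceback:

```python
import torch

# A tensor on the 'meta' device carries shape and dtype but no storage.
meta_weight = torch.empty(4, 3, device="meta")
print(meta_weight.is_meta, meta_weight.shape)

# Copying out of it fails, because there are no values to copy.
target = torch.empty(4, 3)
try:
    target.copy_(meta_weight)
    copy_failed = False
except RuntimeError as err:  # NotImplementedError is a subclass of RuntimeError
    copy_failed = True
    print("copy failed:", err)
```

This is consistent with the guess in the thread that parameters created through skip_init are still meta tensors at the moment DeepSpeed tries to partition them.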
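This looks like something went wrong while DeepSpeed was setting up the distributed group. How many GPUs do you have right now?
Please post your hardware configuration as well.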
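You have 4 cards available. Try specifying local_rank, or check whether you can use device_map. For the details, search Google for: how to set the local_rank environment variable with multiple GPU cards.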
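As a minimal sketch of the local_rank suggestion: launchers such as torchrun export a `LOCAL_RANK` environment variable for each worker process, and a script can turn it into a per-process device name. The helper name `device_for_this_process` is hypothetical.

```python
import os

def device_for_this_process(env=os.environ) -> str:
    """Pick a per-process CUDA device name from the LOCAL_RANK variable."""
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"

# Simulate the environment that the launcher would set for the second worker.
print(device_for_this_process({"LOCAL_RANK": "1"}))
```

The resulting string can then be passed to `torch.cuda.set_device(...)` or `model.to(...)` so that each process keeps its model and inputs on its own GPU.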
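@Youggls Thanks for the reply, I've got it working. My setup is 4 x P40 (22 GB each); with stage 3 enabled, fine-tuning works at batch_size=1. GPU memory consumption is somewhat higher than for llama-7b. Could that be because the vocabulary is so large?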
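For reference, a DeepSpeed configuration along these lines might look like the sketch below. Only the stage 3 setting and the batch size of 1 come from this comment; enabling fp16 is an assumption, and the full configuration used by this repository is not shown in the thread.

```python
import json

# Minimal ZeRO stage 3 settings; everything else is left at DeepSpeed defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # batch_size=1 per GPU
    "fp16": {"enabled": True},             # assumption: half-precision training
    "zero_optimization": {"stage": 3},     # partition params, grads, optimizer state
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

The printed JSON can be written to the config file that the training script already points DeepSpeed at.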
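That may be the reason, but I haven't studied the two model architectures closely. The embedding matrix is indeed a major share of the model's parameters.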
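The size of an embedding matrix is just vocabulary size times hidden size, so the two models are easy to compare. The figures below are assumptions rather than values stated in this thread: a vocabulary of 150528 and hidden size of 4096 for ChatGLM-6B, and 32000 and 4096 for LLaMA-7B.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of parameters in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size

chatglm = embedding_params(150528, 4096)  # assumed ChatGLM-6B values
llama = embedding_params(32000, 4096)     # assumed LLaMA-7B values

print(f"ChatGLM-6B embedding: {chatglm:,} parameters")
print(f"LLaMA-7B embedding:   {llama:,} parameters")
print(f"half of the ChatGLM embedding: {chatglm // 2:,}")
```

Under these assumptions the ChatGLM embedding is several times larger than LLaMA's, and half of it is 308,281,344 elements, which matches the `size=(308281344,)` tensor printed in the stage 3 traceback in this thread.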
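Hello, after fixing the skip_init error as described above, I get another error at runtime, which reports:
[screenshot]
Do you know how to fix this?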
Did you get it working with 12 GB of GPU memory? If so, could you point out what needs to be changed?
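It looks like the error comes from the tokenizer, so check the tokenizer you are loading.
Also, I have not gotten it working with 12 GB of GPU memory so far. Based on other people's experience, it may take 4 x 12 GB of GPU memory to run.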