shibing624 / medicalgpt

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing continued pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.

License: Apache License 2.0

Python 84.29% Shell 1.89% Jupyter Notebook 13.82%
llama chatgpt gpt llm medical dpo medicalgpt

medicalgpt's Introduction

🇨🇳Chinese | 🌐English | 📖Docs | 🤖Models


MedicalGPT: Training Medical GPT Model

📖 Introduction

MedicalGPT trains medical GPT models with the ChatGPT training pipeline. It implements pretraining, supervised fine-tuning, RLHF (reward modeling and reinforcement learning), and DPO (Direct Preference Optimization).


🔥 News

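[2024/04/24] v2.0: Added support for the Meta Llama 3 model series. See Release-v2.0.

[2024/04/17] v1.9: Added support for ORPO; see run_orpo.sh for detailed usage. See Release-v1.9.

[2024/01/26] v1.8: Added support for fine-tuning the Mixtral mixture-of-experts (MoE) model Mixtral 8x7B. See Release-v1.8.

[2024/01/14] v1.7: Added ChatPDF, file-based question answering with retrieval-augmented generation (RAG), implemented in chatpdf.py. It combines a fine-tuned LLM with knowledge-base files to improve answer accuracy on domain questions. See Release-v1.7.

[2023/10/23] v1.6: Added RoPE interpolation to extend the context length of GPT models; for LLaMA models, added support for FlashAttention-2 and the $S^2$-Attn proposed by LongLoRA; added support for NEFTune, a training method that adds noise to embeddings. See Release-v1.6.

[2023/08/28] v1.5: Added DPO (Direct Preference Optimization). DPO controls a language model's behavior precisely by optimizing the model directly, and can learn human preferences effectively. See Release-v1.5.

[2023/08/08] v1.4: Released the Chinese-English Vicuna-13B model shibing624/vicuna-baichuan-13b-chat, fine-tuned on the ShareGPT4 dataset, and the corresponding LoRA model shibing624/vicuna-baichuan-13b-chat-lora. See Release-v1.4.

[2023/08/02] v1.3: Added multi-turn dialogue fine-tuning for LLaMA, LLaMA2, Bloom, ChatGLM, ChatGLM2, and Baichuan models; added domain vocabulary expansion; added a Chinese pretraining dataset and a Chinese ShareGPT fine-tuning dataset. See Release-v1.3.

[2023/07/13] v1.1: Released the Chinese medical LLaMA-13B model shibing624/ziya-llama-13b-medical-merged, a medical model SFT fine-tuned from Ziya-LLaMA-13B-v1 with improved medical question answering. The full fine-tuned model weights are published. See Release-v1.1.

[2023/06/15] v1.0: Released the Chinese medical LoRA model shibing624/ziya-llama-13b-medical-lora, a medical model SFT fine-tuned from Ziya-LLaMA-13B-v1 with improved medical question answering. The fine-tuned LoRA weights are published. See Release-v1.0.

[2023/06/05] v0.2: Using the medical domain as an example, trains a domain large language model in four stages: continued pretraining, supervised fine-tuning, reward modeling, and reinforcement learning. See Release-v0.2.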

😊 Features

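Based on the ChatGPT training pipeline, this project implements training of a domain model, a large language model for the medical field:

  • Stage 1: PT (Continue PreTraining), continued pretraining. The GPT model is pretrained a second time on a large volume of domain documents to adapt it to the domain data distribution (optional).
  • Stage 2: SFT (Supervised Fine-tuning). An instruction fine-tuning dataset is built and the pretrained model is instruction-tuned on it to align with instruction intent and inject domain knowledge.
  • Stage 3
    • RLHF (Reinforcement Learning from Human Feedback) applies reinforcement learning to the language model based on human feedback, in two steps:
      • RM (Reward Model): build a human preference ranking dataset and train a reward model on it to model human preferences, mainly the "HHH" principles: "helpful, honest, harmless".
      • RL (Reinforcement Learning): use the reward model to train the SFT model. The generation model updates its policy from rewards or penalties so that it produces higher-quality text that better matches human preferences.
    • DPO (Direct Preference Optimization) controls the language model's behavior precisely by optimizing the model directly, without complex reinforcement learning, and can still learn human preferences effectively. Compared with RLHF, DPO is easier to implement and train, and gives better results.
    • ORPO is an optimization method that needs no reference model. With ORPO, an LLM can learn instruction following and human preferences at the same time.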
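
To make the stage 3 objectives concrete, the following minimal sketch computes the standard pairwise reward-model loss and the standard DPO loss for a single preference pair. It is written from the published loss formulas, not taken from reward_modeling.py or dpo_training.py; the function names, the scalar inputs, and the default beta=0.1 are assumptions chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so that exp() never receives a large positive value.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise ranking loss: small when the chosen answer scores higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not the project's setting
) -> float:
    """DPO loss for one pair, from sequence log-probabilities."""
    # How much more (or less) likely each answer is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# A policy identical to the reference gives log(2) for any pair.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen answer and lowering the rejected one reduces the loss.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

Both losses have the same shape, the negative log-sigmoid of a margin. DPO takes that margin from policy and reference log-probabilities, so it needs no separately trained reward model.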

Release Models

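Model | Base Model | Introduction
shibing624/ziya-llama-13b-medical-lora | IDEA-CCNL/Ziya-LLaMA-13B-v1 | Ziya-LLaMA-13B SFT fine-tuned on the 2.4-million-example Chinese-English medical dataset shibing624/medical; medical question answering improves; the fine-tuned LoRA weights are released (single-turn dialogue)
shibing624/ziya-llama-13b-medical-merged | IDEA-CCNL/Ziya-LLaMA-13B-v1 | Ziya-LLaMA-13B SFT fine-tuned on the 2.4-million-example Chinese-English medical dataset shibing624/medical; medical question answering improves; the full fine-tuned model weights are released (single-turn dialogue)
shibing624/vicuna-baichuan-13b-chat-lora | baichuan-inc/Baichuan-13B-Chat | baichuan-13b-chat multi-turn question-answering model SFT fine-tuned on the 100K-example multilingual ShareGPT GPT4 multi-turn dialogue dataset shibing624/sharegpt_gpt4; everyday and medical question answering improve; the fine-tuned LoRA weights are released
shibing624/vicuna-baichuan-13b-chat | baichuan-inc/Baichuan-13B-Chat | baichuan-13b-chat multi-turn question-answering model SFT fine-tuned on the 100K-example multilingual ShareGPT GPT4 multi-turn dialogue dataset shibing624/sharegpt_gpt4; everyday and medical question answering improve; the full fine-tuned model weights are released
shibing624/llama-3-8b-instruct-262k-chinese | Llama-3-8B-Instruct-262k | Very-long-context multi-turn dialogue model fine-tuned with ORPO on the 20K-example Chinese-English preference dataset shibing624/DPO-En-Zh-20k-Preference; suited to RAG and multi-turn dialogue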

Demonstration of the shibing624/vicuna-baichuan-13b-chat model; see Inference Examples for specific cases.

▶️ Demo

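We provide a simple gradio-based interactive web interface. Once the service is running, open it in a browser and enter a question; the model returns an answer.

Start the service with the following command: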

CUDA_VISIBLE_DEVICES=0 python gradio_demo.py --model_type base_model_type --base_model path_to_llama_hf_dir --lora_model path_to_lora_dir

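Parameters:

  • --model_type {base_model_type}: type of the pretrained model, such as llama, bloom, or chatglm
  • --base_model {base_model}: directory holding the LLaMA model weights and configuration files in HF format; an HF Model Hub model name also works
  • --lora_model {lora_model}: directory holding the LoRA files; an HF Model Hub model name also works. If the LoRA weights are already merged into the pretrained model, omit the --lora_model argument
  • --tokenizer_path {tokenizer_path}: directory holding the corresponding tokenizer. Defaults to the value of --base_model if not given
  • --template_name: template name, such as vicuna or alpaca. Defaults to vicuna if not given
  • --only_cpu: run inference on CPU only
  • --resize_emb: whether to resize the embeddings. If not, the pretrained model's embedding size is used. Off by default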

💾 Install

Updating the requirements

requirements.txt is updated from time to time to support the latest features. Use the following commands to update the dependencies:

git clone https://github.com/shibing624/MedicalGPT
cd MedicalGPT
pip install -r requirements.txt --upgrade

Hardware Requirement (VRAM)

Training method | Precision (bits) | 7B | 13B | 30B | 65B | 8x7B
Full-parameter | 16 | 160GB | 320GB | 600GB | 1200GB | 900GB
LoRA | 16 | 16GB | 32GB | 80GB | 160GB | 120GB
QLoRA | 8 | 10GB | 16GB | 40GB | 80GB | 80GB
QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 32GB
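
As a rough cross-check of the table, the memory taken by the model weights alone is the parameter count times the bits per parameter. The sketch below computes that lower bound; the function name and the use of 1 GB = 1e9 bytes are assumptions for illustration. The table figures are higher because training also needs memory for activations and, in full-parameter training, for gradients and optimizer state.

```python
def weight_memory_gb(num_params: float, bits: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


# Weight-only footprint for two model sizes at the precisions in the table.
for name, params in [("7B", 7e9), ("13B", 13e9)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

For example, a 7B model at 16-bit precision needs 14 GB for weights, which is consistent with the 16GB listed for 16-bit LoRA.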

🚀 Training Pipeline

Training Stage:

Stage | Python script | Shell script
Continue Pretraining | pretraining.py | run_pt.sh
Supervised Fine-tuning | supervised_finetuning.py | run_sft.sh
Direct Preference Optimization | dpo_training.py | run_dpo.sh
Reward Modeling | reward_modeling.py | run_rm.sh
Reinforcement Learning | ppo_training.py | run_ppo.sh
ORPO (Odds Ratio Preference Optimization) | orpo_training.py | run_orpo.sh
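
The scripts in the table can be chained in different orders depending on the alignment route. The sketch below encodes one reading of the Features list (optional PT, then SFT, then one of RLHF, DPO, or ORPO). The route names and the `plan` helper are hypothetical and not part of the repository; only the script names come from the table.

```python
from typing import Dict, List

# Shell script for each stage, taken from the table above.
STAGE_SCRIPTS: Dict[str, str] = {
    "pt": "run_pt.sh",
    "sft": "run_sft.sh",
    "rm": "run_rm.sh",
    "ppo": "run_ppo.sh",
    "dpo": "run_dpo.sh",
    "orpo": "run_orpo.sh",
}

# Stage 3 alternatives; this ordering is an assumption based on the Features list.
ROUTES: Dict[str, List[str]] = {
    "rlhf": ["sft", "rm", "ppo"],
    "dpo": ["sft", "dpo"],
    "orpo": ["sft", "orpo"],
}


def plan(route: str, include_pretraining: bool = True) -> List[str]:
    """Return the shell scripts to run, in order, for one alignment route."""
    stages = (["pt"] if include_pretraining else []) + ROUTES[route]
    return [STAGE_SCRIPTS[stage] for stage in stages]


print(plan("rlhf"))
print(plan("dpo", include_pretraining=False))
```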

Supported Models

Model Name | Model Size | Template
BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | vicuna
LLaMA | 7B/13B/33B/65B | alpaca
LLaMA2 | 7B/13B/70B | llama2
LLaMA3 | 8B/70B | llama3
Mistral | 7B/8x7B | mistral
Baichuan | 7B/13B | baichuan
Baichuan2 | 7B/13B | baichuan2
InternLM | 7B | intern
Qwen | 1.8B/7B/14B/72B | chatml
Qwen1.5 | 0.5B/1.8B/4B/14B/72B | qwen
XVERSE | 13B | xverse
ChatGLM | 6B | chatglm
ChatGLM2 | 6B | chatglm2
ChatGLM3 | 6B | chatglm3
Yi | 6B/34B | yi
DeepSeek | 7B/16B/67B | deepseek
Orion | 14B | orion
Cohere | 104B | cohere

💻 Inference

After training, load the trained model and check the quality of the text it generates.

CUDA_VISIBLE_DEVICES=0 python inference.py \
    --model_type base_model_type \
    --base_model path_to_model_hf_dir \
    --tokenizer_path path_to_model_hf_dir \
    --lora_model path_to_lora \
    --interactive

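Parameters:

  • --model_type {base_model_type}: type of the pretrained model, such as llama, bloom, or chatglm
  • --base_model {base_model}: directory holding the LLaMA model weights and configuration files in HF format
  • --lora_model {lora_model}: directory holding the extracted LoRA files; an HF Model Hub model name also works. If the LoRA weights are already merged into the pretrained model, this argument can be omitted
  • --tokenizer_path {tokenizer_path}: directory holding the corresponding tokenizer. Defaults to the value of --base_model if not given
  • --template_name: template name, such as vicuna or alpaca. Defaults to vicuna if not given
  • --interactive: start multi-turn question answering interactively, with streaming inference
  • --data_file {file_name}: in non-interactive mode, read the contents of file_name for batch prediction
  • --output_file {file_name}: in non-interactive mode, write the predictions to file_name in jsonl format
  • --resize_emb: whether to resize the embeddings. If not, the pretrained model's embedding size is used. Off by default
  • --only_cpu: run inference on CPU only
  • --gpus {gpu_ids}: GPU device IDs to use, 0 by default. For multiple GPUs, separate the IDs with commas, for example 0,1,2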
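
The flags above fall into two modes: interactive, or batch with --data_file and --output_file. The sketch below assembles an inference.py command line from them. `build_inference_command` is a hypothetical helper, not part of the repository, and the file names in the example are placeholders; only the flag names come from the list above.

```python
import shlex
from typing import List, Optional


def build_inference_command(
    model_type: str,
    base_model: str,
    lora_model: Optional[str] = None,
    tokenizer_path: Optional[str] = None,
    template_name: Optional[str] = None,
    interactive: bool = False,
    data_file: Optional[str] = None,
    output_file: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for inference.py from the documented flags."""
    cmd = ["python", "inference.py", "--model_type", model_type, "--base_model", base_model]
    # Optional flags are emitted only when a value is supplied, so a model with
    # merged LoRA weights simply gets no --lora_model argument.
    for flag, value in [
        ("--lora_model", lora_model),
        ("--tokenizer_path", tokenizer_path),
        ("--template_name", template_name),
    ]:
        if value:
            cmd += [flag, value]
    if interactive:
        cmd.append("--interactive")
    else:
        # The file flags apply to non-interactive (batch) mode only.
        for flag, value in [("--data_file", data_file), ("--output_file", output_file)]:
            if value:
                cmd += [flag, value]
    return cmd


# Batch run with merged weights; both file names are placeholders.
batch_cmd = build_inference_command(
    model_type="baichuan",
    base_model="shibing624/vicuna-baichuan-13b-chat",
    data_file="questions.txt",
    output_file="predictions.jsonl",
)
print(shlex.join(batch_cmd))
```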

Multi-GPU inference

Data-parallel batch inference across multiple GPUs:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 inference_multigpu_demo.py --model_type baichuan --base_model shibing624/vicuna-baichuan-13b-chat
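
In data-parallel inference each process handles its own slice of the inputs. The sketch below shows one common way to split a prompt list by rank, using the RANK and WORLD_SIZE environment variables that torchrun sets for each process. It illustrates the idea only and is not necessarily how inference_multigpu_demo.py partitions its data; the helper names and example prompts are assumptions.

```python
import os
from typing import Iterator, List, Sequence


def shard_for_rank(items: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Give each rank every world_size-th item, starting at its own index."""
    return list(items[rank::world_size])


def batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# torchrun sets RANK and WORLD_SIZE; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

prompts = [f"question {i}" for i in range(5)]  # placeholder inputs
for batch in batches(shard_for_rank(prompts, rank, world_size), batch_size=2):
    print(f"rank {rank} would run generation on {batch}")
```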

Inference Examples

shibing624/vicuna-baichuan-13b-chat inference examples:

Model output cases (Human question followed by Assistant answer):
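Human: What should I do when a child has a fever?
Assistant: Fever is the body's natural response to infection or other illness. Here are some suggestions that can help relieve fever symptoms:
1. Stay well hydrated: the body loses fluids during a fever, so it needs to stay well hydrated.
2. Use an ice pack: placing an ice pack on the forehead or neck can help lower body temperature.
3. Use fever-reducing medication: fever-reducing medication can help lower body temperature.
4. Maintain body temperature: maintaining body temperature can help lower body temperature.
5. Wipe the body with cold or warm water: wiping the body with cold or warm water can help lower body temperature.
6. Use fever-reducing medication: fever-reducing medication can help lower body temperature.
7. Maintain body temperature: maintaining body temperature can help lower body temperature.
8. Wipe the body with cold or warm water: wiping the body with cold or warm water can help lower body temperature.
9. Use fever-reducing medication: fever-reducing medication can help lower body temperature.
10. Maintain body temperature: maintaining body temperature can help lower body temperature.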
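Human: What medicine can a one-year-old baby take for a fever?
Assistant: When a one-year-old baby has a fever, the following medications can be considered:
1. Fever reducers: such as ibuprofen (Ibuprofen) or acetaminophen (Paracetamol). These medications can help lower body temperature and relieve fever symptoms.
2. Antibiotics: if the fever is caused by a bacterial infection, antibiotics can be considered. Be sure to use them only on a doctor's advice.
3. Antiviral drugs: if the fever is caused by a viral infection, antiviral drugs can be considered. Be sure to use them only on a doctor's advice.
4. Antipyretic analgesics: such as aspirin (Aspirin) or acetaminophen (Paracetamol). These medications can help relieve fever symptoms.
5. Antihistamines: such as diphenhydramine (Diphenhydramine) or fexofenadine (Naproxen). These medications can help relieve fever symptoms.
Be sure to use these medications on a doctor's advice and follow the dosage and directions in the package insert. Also keep the baby well hydrated and make sure the baby gets enough rest.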
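Human: What diseases might anal lesions be a symptom of?
Assistant: Anal lesions can have many causes, including infection, inflammation, tumors, or other diseases. Some common anal lesions include:
1. Anal fissure: a tear in the anal skin, usually caused by constipation or other strain.
2. Anal fistula: an anal infection that leads to a fistula forming between the anal skin and the rectum.
3. Anal warts: wart-like lesions on the anal skin, usually caused by human papillomavirus (HPV).
4. Anal abscess: an anal infection that leads to an abscess forming under the anal skin.
5. Anal tumor: a tumor that forms on the anal skin or inside the rectum.
6. Anal ulcer: an ulcer on the anal skin, usually caused by infection or inflammation.
7. Anal itching: itching of the anal skin, which may be caused by infection, inflammation, or other diseases.
If you have symptoms of an anal lesion, please consult a healthcare professional for proper diagnosis and treatment.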
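Human: Hello doctor, what causes increased, yellowish vaginal discharge?
Assistant: Hello. Increased and yellowish vaginal discharge can have many causes, including infection, hormonal changes, vaginal dryness, or other gynecological problems. Here are some possible causes:
1. Infection: increased and yellowish discharge may be caused by a vaginal or urinary tract infection. The infection may be bacterial, viral, or fungal.
2. Hormonal changes: hormonal changes, such as those of the menstrual cycle or menopause, may cause increased and yellowish discharge.
3. Vaginal dryness: vaginal dryness may cause increased and yellowish discharge.
4. Other gynecological problems: other gynecological problems, such as cervicitis, endometritis, or endometriosis, may also cause increased and yellowish discharge.
If you are worried about your discharge, you are advised to see a gynecologist for examination and diagnosis. The doctor can determine the cause from your symptoms and test results and provide appropriate treatment.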
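Human: Does chocolate-colored discharge after a medical abortion mean it was incomplete?
Assistant: Chocolate-colored discharge after a medical abortion does not necessarily mean the abortion was incomplete. Chocolate-colored discharge is caused by shedding of the endometrium and bleeding, and usually appears for some time after a medical abortion.

Possible causes of chocolate-colored discharge after a medical abortion include:

1. Endometrial shedding: after a medical abortion the endometrium sheds, which causes bleeding. This bleeding may appear as chocolate-colored discharge.
2. Infection: after a medical abortion the endometrium and the cervical opening may be injured and prone to bacterial infection. Infection may cause chocolate-colored discharge.
3. Changes in hormone levels: after a medical abortion hormone levels may change, causing chocolate-colored discharge.

If the chocolate-colored discharge lasts a long time or comes with other symptoms, such as abdominal pain, fever, or foul-smelling discharge, you are advised to see a doctor promptly for relevant examinations, such as a gynecological examination or an ultrasound, to determine whether there is an infection or another problem, and to receive treatment under the doctor's guidance.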
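Human: What conditions can viaminate and vitamin E cream treat?
Assistant: Viaminate and vitamin E cream is a topical medication used mainly to treat skin conditions such as acne and seborrheic dermatitis. It can help reduce skin oil secretion, relieve inflammation, and promote skin repair.

Note that viaminate and vitamin E cream is not a cure-all and may not be suitable for some skin conditions. Before using it, you are advised to consult a doctor or pharmacist about the indications, contraindications, and directions for use. When using the medication, follow medical advice and avoid misusing it on your own.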
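Human: What causes acne?
Assistant: Acne, also known as pimples, is caused by clogged skin pores. Pores can become clogged for many reasons, including:
1. Overactive sebaceous glands: the sebaceous glands produce too much oil, which clogs the pores.
2. Clogged pores: dead skin cells, oil, and bacteria on the skin surface clog the pores.
3. Inflammation: acne is usually accompanied by inflammation, which may result from skin infection or from the immune system's reaction to the acne.
4. Genetic factors: some people may be more prone to acne, which may be due to genetic factors.
5. Hormonal changes: acne is more common during puberty and menstruation, which may be due to hormonal changes.
6. Lifestyle factors: some lifestyle factors, such as stress, poor diet, and lack of exercise, may also cause acne.
7. Cosmetics and skincare products: some cosmetics and skincare products may cause acne, especially those containing irritating ingredients.
8. Medications: some medications, such as oral contraceptives, may also cause acne.
To prevent acne, you can take measures such as keeping the skin clean, using gentle skincare products, avoiding overuse of cosmetics and skincare products, maintaining a healthy lifestyle and diet, and avoiding stress and hormonal changes. If the acne is severe, you may need help from a dermatologist.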
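Human: What should I do about pimples on my face, what medicine can I take, and what should I watch out for?
Assistant: Pimples, also known as acne, are a common skin condition. Treating them requires considering diet, lifestyle, medication, and other factors together. Here are some suggestions:

1. Diet:
* Avoid high-sugar, high-fat, oily foods, such as fast food, desserts, and fried food.
* Eat more fresh vegetables, fruit, and whole grains to increase fiber intake and promote bowel movement.
* Avoid too much caffeine, alcohol, spicy food, and other irritants.

2. Lifestyle:
* Keep a regular schedule and get enough sleep.
* Avoid long periods of using computers, phones, and other electronic devices, so that radiation does not irritate the skin.
* Keep the skin clean by washing it morning and evening with a gentle cleanser.
* Avoid squeezing pimples, to prevent infection and scarring.

3. Medication:
* Topical: skincare products containing salicylic acid, benzoic acid, azelaic acid, and similar ingredients can help remove dead skin from the surface and reduce the formation of pimples.
* Oral: if the pimples are severe, oral antibiotics, B vitamins, zinc, and similar medications can be considered, but only under a doctor's guidance.

4. Precautions:
* Do not use cosmetics or skincare products indiscriminately, to avoid irritating the skin.
* Do not over-cleanse the skin, to avoid damaging the skin barrier.
* Stay in good spirits and avoid negative emotions such as anxiety and depression.

In short, treating pimples requires considering diet, lifestyle, medication, and other factors together. During treatment, stay patient and confident, follow the doctor's advice, and improve the skin's condition step by step.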
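Human: Can a blocked vas deferens cause male infertility? Hello doctor, my colleague has been unhappy these past few days. I asked her what was wrong, and she said her husband has a blocked vas deferens, so they have not been able to have a child, not that they do not want one. Can a blocked vas deferens cause male infertility?
Assistant: A blocked vas deferens may cause male infertility. The vas deferens is a duct connecting the testicle and the urethra, responsible for carrying sperm from the testicle to the urethra. If the vas deferens is blocked, sperm cannot pass through it to reach the urethra, which leads to infertility.

A blocked vas deferens has many possible causes, including congenital developmental abnormalities, inflammation, tumors, and trauma. Treatments include surgical unblocking and medication. If the blockage is severe, surgery to clear it may be needed.

Note that a blocked vas deferens does not necessarily cause infertility, because there are other routes by which sperm can reach the woman's body. For example, sperm can reach the woman's body through tissues such as the epididymis and the prostate. Therefore, if the vas deferens is blocked, it is advisable to see a doctor promptly and seek professional help.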
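Human: Can a myocardial infarction be treated?
Assistant: Myocardial infarction is a serious condition. If treated promptly, lives can be saved and long-term disability prevented. The key to treating myocardial infarction is early recognition and treatment.

Treatment of myocardial infarction usually includes the following steps:

1. Seek medical care immediately: if symptoms such as chest pain, shortness of breath, or fainting occur, seek medical care immediately.

2. Medication: the doctor may prescribe medications, such as nitroglycerin or aspirin, to help relieve symptoms and prevent further damage.

3. Thrombolytic therapy: if the patient has a myocardial infarction before reaching the hospital, the doctor may give thrombolytic therapy, injecting clot-dissolving drugs to help restore blood flow.

4. Surgery: if medication or thrombolytic therapy is ineffective, the doctor may recommend surgery, such as coronary artery bypass surgery or cardiac stent implantation.

5. Rehabilitation and prevention: treatment of myocardial infarction usually includes a rehabilitation plan to help the patient recover physical function and prevent future cardiac events. In addition, the patient should follow the doctor's advice to lower the risk of heart disease.

Note that treatment of myocardial infarction depends on the patient's specific condition and symptoms. Therefore, if you or someone you know shows symptoms of a myocardial infarction, seek medical care immediately.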
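Human: How can I stay energetic? List five suggestions.
Assistant: 1. Get enough sleep: sleep is essential for staying energetic. Make sure you get 7 to 9 hours of high-quality sleep every night.
2. Eat a balanced diet: a balanced diet is essential for maintaining energy levels. Make sure you take in enough protein, carbohydrates, and healthy fats.
3. Exercise regularly: regular exercise can help raise energy levels. Try to do at least 30 minutes of aerobic exercise every day, such as walking, running, or cycling.
4. Manage stress: stress drains energy, so managing stress is essential for staying energetic. Try to find ways to relax and relieve stress, such as meditation, yoga, or deep-breathing exercises.
5. Stay hydrated: dehydration lowers energy levels. Make sure you drink enough water every day and replenish fluids as needed.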
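Human: Write a poem about spring.
Assistant: Spring
is the season when the earth awakens,
the season when all things revive,
the season when life blossoms.
Spring
is the season of bright sunshine,
the season of birdsong and fragrant flowers,
the season of lush green,
the season of boundless hope.
Spring
is the season of people's laughter,
the season of children at play,
the season of lovers embracing,
the season of people chasing their dreams.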
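Human: Why are there no polar bears in Antarctica?
Assistant: Antarctica has no polar bears because polar bears live in the Arctic, while Antarctica lies in the Southern Hemisphere, with a climate and environment very different from the Arctic. Antarctica's climate is extremely cold and it is covered in ice and snow year-round, whereas polar bears live in the Arctic, which has abundant marine life and land mammals. In addition, Antarctica's ecosystem differs from that of the Arctic and lacks the conditions polar bears need to survive.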

📚 Dataset

Medical datasets

General datasets

Pretraining datasets

Supervised fine-tuning datasets

Preference datasets

☎️ Contact

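  • Issues (suggestions): GitHub issues
  • Email me: xuming: [email protected]
  • WeChat me: add my WeChat ID xuming624 with the note "Name-Company-NLP" to join the NLP discussion group (add me and I will invite you to the group).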

⚠️ LICENSE

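This project may be used for research purposes only. The project developers accept no liability for any harm or loss resulting from use of this project (including but not limited to its data, models, and code). See the disclaimer for details.

The MedicalGPT project code is licensed under The Apache License 2.0. The code may be used commercially free of charge; the model weights and data may be used for research purposes only. Please include a link to MedicalGPT and the license in your product documentation.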

😇 Citation

If you use MedicalGPT in your research, please cite it in the following format:

@misc{MedicalGPT,
  title={MedicalGPT: Training Medical GPT Model},
  author={Ming Xu},
  year={2023},
  howpublished={\url{https://github.com/shibing624/MedicalGPT}},
}

😍 Contribute

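The project code is still rough. If you have improved the code, you are welcome to contribute your changes back to this project. Before submitting, note the following two points:

  • Add corresponding unit tests in tests
  • Run all unit tests with python -m pytest and make sure they all pass

You can then submit a PR.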

💕 Acknowledgements

Thanks for their great work!

Recommended related projects

  • shibing624/ChatPilot: a simple, easy-to-use Web UI for LLM Agents (including RAG, online search, and Code interpreter)

medicalgpt's People

Contributors

anwuzhiab, billccx, dividez, iiosnail, jiangtann, ker2xu, kinghuin, shibing624, zhuangxialie

medicalgpt's Issues

Error when importing LoraConfig from peft

Describe the bug

Running from peft import LoraConfig, TaskType, get_peft_model, PeftModel, prepare_model_for_int8_training raises an error:
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

It also reports that the libcuda.so and libcudart.so files cannot be found.
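
As the error message suggests, a first diagnostic step is to check whether the CUDA libraries are visible on LD_LIBRARY_PATH. The sketch below scans a list of directories for the two library names from the report. `find_cuda_libs` is a hypothetical helper that only lists matching files; it does not repair the bitsandbytes setup.

```python
import os
from pathlib import Path
from typing import Dict, Iterable, List, Sequence


def find_cuda_libs(
    search_dirs: Iterable[str],
    names: Sequence[str] = ("libcuda.so", "libcudart.so"),
) -> Dict[str, List[str]]:
    """Map each library name to the matching files found in search_dirs."""
    found: Dict[str, List[str]] = {name: [] for name in names}
    for directory in search_dirs:
        path = Path(directory)
        if not path.is_dir():
            continue
        for name in names:
            # The trailing * also matches versioned files such as libcudart.so.11.0.
            found[name].extend(str(match) for match in sorted(path.glob(name + "*")))
    return found


if __name__ == "__main__":
    dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    for name, paths in find_cuda_libs(dirs).items():
        print(name, "->", paths if paths else "not found on LD_LIBRARY_PATH")
```

If a library turns up only in a directory that is not on LD_LIBRARY_PATH, adding that directory is the step the error message itself recommends.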

Training process hangs

if training_args.do_train:
    logger.info("*** Train ***")
    logger.debug(f"Train dataloader example: {list(trainer.get_train_dataloader())[0]}")
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)

    metrics = train_result.metrics
    metrics["train_samples"] = max_train_samples
    logger.debug(f"Training metrics: {metrics}")
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    logger.info(f"Saving model checkpoint to {training_args.output_dir}")
    save_model(training_args.output_dir, model, tokenizer, training_args)

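While running ChatGLM-6B SFT on the 2.4 million medical examples, I found that the process hangs at the line below. Commenting it out lets training run normally. Could it be hanging because the dataset is so large?
logger.debug(f"Train dataloader example: {list(trainer.get_train_dataloader())[0]}")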
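
One likely explanation is that `list(...)[0]` iterates the entire dataloader before returning the first batch, so on 2.4 million examples the debug line alone processes the whole dataset. The sketch below shows the difference with a stand-in loader that counts how many batches it produces. `CountingLoader` is hypothetical, and replacing the debug line with `next(iter(trainer.get_train_dataloader()))` is a suggestion that has not been tested against the project.

```python
from typing import Iterator


class CountingLoader:
    """Stand-in for a dataloader that records how many batches it produced."""

    def __init__(self, num_batches: int) -> None:
        self.num_batches = num_batches
        self.produced = 0

    def __iter__(self) -> Iterator[int]:
        for index in range(self.num_batches):
            self.produced += 1
            yield index


# list(...)[0] builds every batch before indexing the first one.
eager = CountingLoader(num_batches=100_000)
first = list(eager)[0]
print(first, eager.produced)  # 0 100000

# next(iter(...)) stops after the first batch.
lazy = CountingLoader(num_batches=100_000)
first = next(iter(lazy))
print(first, lazy.produced)  # 0 1
```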

Error when calling run_sft.sh.

Running
sh run_sft.sh
fails, but running python supervised_finetuning.py with the arguments works.
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

usage: supervised_finetuning.py [-h] [--model_type MODEL_TYPE] [--model_name_or_path MODEL_NAME_OR_PATH] [--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH] [--load_in_8bit [LOAD_IN_8BIT]]
[--cache_dir CACHE_DIR] [--use_fast_tokenizer [USE_FAST_TOKENIZER]] [--torch_dtype {auto,bfloat16,float16,float32}] [--device_map DEVICE_MAP]
[--trust_remote_code [TRUST_REMOTE_CODE]] [--no_trust_remote_code] [--dataset_name DATASET_NAME] [--dataset_config_name DATASET_CONFIG_NAME]
[--train_file_dir TRAIN_FILE_DIR] [--validation_file_dir VALIDATION_FILE_DIR] [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]
[--max_train_samples MAX_TRAIN_SAMPLES] [--max_eval_samples MAX_EVAL_SAMPLES] [--overwrite_cache [OVERWRITE_CACHE]]
[--validation_split_percentage VALIDATION_SPLIT_PERCENTAGE] [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS] --output_dir OUTPUT_DIR
[--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]] [--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]] [--evaluation_strategy {no,steps,epoch}]
[--prediction_loss_only [PREDICTION_LOSS_ONLY]] [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE] [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
[--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE] [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
[--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS]
[--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau}] [--warmup_ratio WARMUP_RATIO]
[--warmup_steps WARMUP_STEPS] [--log_level {debug,info,warning,error,critical,passive}] [--log_level_replica {debug,info,warning,error,critical,passive}]
[--log_on_each_node [LOG_ON_EACH_NODE]] [--no_log_on_each_node] [--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}]
[--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS] [--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]] [--no_logging_nan_inf_filter]
[--save_strategy {no,steps,epoch}] [--save_steps SAVE_STEPS] [--save_total_limit SAVE_TOTAL_LIMIT] [--save_safetensors [SAVE_SAFETENSORS]]
[--save_on_each_node [SAVE_ON_EACH_NODE]] [--no_cuda [NO_CUDA]] [--use_mps_device [USE_MPS_DEVICE]] [--seed SEED] [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]]
[--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]] [--fp16_opt_level FP16_OPT_LEVEL] [--half_precision_backend {auto,cuda_amp,apex,cpu_amp}]
[--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]] [--tf32 TF32] [--local_rank LOCAL_RANK] [--ddp_backend {nccl,gloo,mpi,ccl}]
[--tpu_num_cores TPU_NUM_CORES] [--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG] [--dataloader_drop_last [DATALOADER_DROP_LAST]] [--eval_steps EVAL_STEPS]
[--dataloader_num_workers DATALOADER_NUM_WORKERS] [--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM]
[--remove_unused_columns [REMOVE_UNUSED_COLUMNS]] [--no_remove_unused_columns] [--label_names LABEL_NAMES [LABEL_NAMES ...]]
[--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]] [--metric_for_best_model METRIC_FOR_BEST_MODEL] [--greater_is_better GREATER_IS_BETTER]
[--ignore_data_skip [IGNORE_DATA_SKIP]] [--sharded_ddp SHARDED_DDP] [--fsdp FSDP] [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS] [--fsdp_config FSDP_CONFIG]
[--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP] [--deepspeed DEEPSPEED] [--label_smoothing_factor LABEL_SMOOTHING_FACTOR]
[--optim {adamw_hf,adamw_torch,adamw_torch_fused,adamw_torch_xla,adamw_apex_fused,adafactor,adamw_anyprecision,sgd,adagrad,adamw_bnb_8bit,adamw_8bit,lion_8bit,lion_32bit,paged_adamw_32bit,paged_adamw_8bit,paged_lion_32bit,paged_lion_8bit}]
[--optim_args OPTIM_ARGS] [--adafactor [ADAFACTOR]] [--group_by_length [GROUP_BY_LENGTH]] [--length_column_name LENGTH_COLUMN_NAME] [--report_to REPORT_TO [REPORT_TO ...]]
[--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS] [--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB] [--dataloader_pin_memory [DATALOADER_PIN_MEMORY]]
[--no_dataloader_pin_memory] [--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics] [--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]]
[--push_to_hub [PUSH_TO_HUB]] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--hub_model_id HUB_MODEL_ID] [--hub_strategy {end,every_save,checkpoint,all_checkpoints}]
[--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]] [--gradient_checkpointing [GRADIENT_CHECKPOINTING]]
[--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]] [--fp16_backend {auto,cuda_amp,apex,cpu_amp}] [--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID]
[--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION] [--push_to_hub_token PUSH_TO_HUB_TOKEN] [--mp_parameters MP_PARAMETERS]
[--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]] [--full_determinism [FULL_DETERMINISM]] [--torchdynamo TORCHDYNAMO] [--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT]
[--torch_compile [TORCH_COMPILE]] [--torch_compile_backend TORCH_COMPILE_BACKEND] [--torch_compile_mode TORCH_COMPILE_MODE] [--xpu_backend {mpi,ccl,gloo}]
[--use_peft [USE_PEFT]] [--no_use_peft] [--target_modules TARGET_MODULES] [--lora_rank LORA_RANK] [--lora_dropout LORA_DROPOUT] [--lora_alpha LORA_ALPHA]
[--modules_to_save MODULES_TO_SAVE] [--peft_path PEFT_PATH]
supervised_finetuning.py: error: the following arguments are required: --output_dir
usage: supervised_finetuning.py [-h] [--model_type MODEL_TYPE] [--model_name_or_path MODEL_NAME_OR_PATH] [--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH] [--load_in_8bit [LOAD_IN_8BIT]]
[--cache_dir CACHE_DIR] [--use_fast_tokenizer [USE_FAST_TOKENIZER]] [--torch_dtype {auto,bfloat16,float16,float32}] [--device_map DEVICE_MAP]
[--trust_remote_code [TRUST_REMOTE_CODE]] [--no_trust_remote_code] [--dataset_name DATASET_NAME] [--dataset_config_name DATASET_CONFIG_NAME]
[--train_file_dir TRAIN_FILE_DIR] [--validation_file_dir VALIDATION_FILE_DIR] [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]
[--max_train_samples MAX_TRAIN_SAMPLES] [--max_eval_samples MAX_EVAL_SAMPLES] [--overwrite_cache [OVERWRITE_CACHE]]
[--validation_split_percentage VALIDATION_SPLIT_PERCENTAGE] [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS] --output_dir OUTPUT_DIR
[--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]] [--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]] [--evaluation_strategy {no,steps,epoch}]
[--prediction_loss_only [PREDICTION_LOSS_ONLY]] [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE] [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
[--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE] [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
[--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS]
[--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau}] [--warmup_ratio WARMUP_RATIO]
[--warmup_steps WARMUP_STEPS] [--log_level {debug,info,warning,error,critical,passive}] [--log_level_replica {debug,info,warning,error,critical,passive}]
[--log_on_each_node [LOG_ON_EACH_NODE]] [--no_log_on_each_node] [--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}]
[--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS] [--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]] [--no_logging_nan_inf_filter]
[--save_strategy {no,steps,epoch}] [--save_steps SAVE_STEPS] [--save_total_limit SAVE_TOTAL_LIMIT] [--save_safetensors [SAVE_SAFETENSORS]]
[--save_on_each_node [SAVE_ON_EACH_NODE]] [--no_cuda [NO_CUDA]] [--use_mps_device [USE_MPS_DEVICE]] [--seed SEED] [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]]
[--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]] [--fp16_opt_level FP16_OPT_LEVEL] [--half_precision_backend {auto,cuda_amp,apex,cpu_amp}]
[--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]] [--tf32 TF32] [--local_rank LOCAL_RANK] [--ddp_backend {nccl,gloo,mpi,ccl}]
[--tpu_num_cores TPU_NUM_CORES] [--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG] [--dataloader_drop_last [DATALOADER_DROP_LAST]] [--eval_steps EVAL_STEPS]
[--dataloader_num_workers DATALOADER_NUM_WORKERS] [--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM]
[--remove_unused_columns [REMOVE_UNUSED_COLUMNS]] [--no_remove_unused_columns] [--label_names LABEL_NAMES [LABEL_NAMES ...]]
[--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]] [--metric_for_best_model METRIC_FOR_BEST_MODEL] [--greater_is_better GREATER_IS_BETTER]
[--ignore_data_skip [IGNORE_DATA_SKIP]] [--sharded_ddp SHARDED_DDP] [--fsdp FSDP] [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS] [--fsdp_config FSDP_CONFIG]
[--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP] [--deepspeed DEEPSPEED] [--label_smoothing_factor LABEL_SMOOTHING_FACTOR]
[--optim {adamw_hf,adamw_torch,adamw_torch_fused,adamw_torch_xla,adamw_apex_fused,adafactor,adamw_anyprecision,sgd,adagrad,adamw_bnb_8bit,adamw_8bit,lion_8bit,lion_32bit,paged_adamw_32bit,paged_adamw_8bit,paged_lion_32bit,paged_lion_8bit}]
[--optim_args OPTIM_ARGS] [--adafactor [ADAFACTOR]] [--group_by_length [GROUP_BY_LENGTH]] [--length_column_name LENGTH_COLUMN_NAME] [--report_to REPORT_TO [REPORT_TO ...]]
[--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS] [--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB] [--dataloader_pin_memory [DATALOADER_PIN_MEMORY]]
[--no_dataloader_pin_memory] [--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics] [--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]]
[--push_to_hub [PUSH_TO_HUB]] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--hub_model_id HUB_MODEL_ID] [--hub_strategy {end,every_save,checkpoint,all_checkpoints}]
[--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]] [--gradient_checkpointing [GRADIENT_CHECKPOINTING]]
[--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]] [--fp16_backend {auto,cuda_amp,apex,cpu_amp}] [--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID]
[--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION] [--push_to_hub_token PUSH_TO_HUB_TOKEN] [--mp_parameters MP_PARAMETERS]
[--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]] [--full_determinism [FULL_DETERMINISM]] [--torchdynamo TORCHDYNAMO] [--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT]
[--torch_compile [TORCH_COMPILE]] [--torch_compile_backend TORCH_COMPILE_BACKEND] [--torch_compile_mode TORCH_COMPILE_MODE] [--xpu_backend {mpi,ccl,gloo}]
[--use_peft [USE_PEFT]] [--no_use_peft] [--target_modules TARGET_MODULES] [--lora_rank LORA_RANK] [--lora_dropout LORA_DROPOUT] [--lora_alpha LORA_ALPHA]
[--modules_to_save MODULES_TO_SAVE] [--peft_path PEFT_PATH]
supervised_finetuning.py: error: the following arguments are required: --output_dir
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 1740890) of binary: /usr/local/anaconda3/envs/hj-glm6b2/bin/python
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/hj-glm6b2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
supervised_finetuning.py FAILED
Failures:
[1]:
time : 2023-06-29_18:21:05
host : JoinShareAIPC
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 1740891)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-06-29_18:21:05
host : JoinShareAIPC
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 1740890)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
run_sft.sh: 2: --model_type: not found
run_sft.sh: 3: --model_name_or_path: not found
run_sft.sh: 4: --train_file_dir: not found
run_sft.sh: 5: --validation_file_dir: not found
run_sft.sh: 6: --per_device_train_batch_size: not found
run_sft.sh: 7: --per_device_eval_batch_size: not found
run_sft.sh: 8: --do_train: not found
run_sft.sh: 9: --do_eval: not found
run_sft.sh: 10: --use_peft: not found
run_sft.sh: 11: --fp16: not found
run_sft.sh: 12: --max_train_samples: not found
run_sft.sh: 13: --max_eval_samples: not found
run_sft.sh: 14: --num_train_epochs: not found
run_sft.sh: 15: --learning_rate: not found
run_sft.sh: 16: --warmup_ratio: not found
run_sft.sh: 17: --weight_decay: not found
run_sft.sh: 18: --logging_strategy: not found
run_sft.sh: 19: --logging_steps: not found
run_sft.sh: 20: --eval_steps: not found
run_sft.sh: 21: --evaluation_strategy: not found
run_sft.sh: 22: --save_steps: not found
run_sft.sh: 23: --save_strategy: not found
run_sft.sh: 24: --save_total_limit: not found
run_sft.sh: 25: --gradient_accumulation_steps: not found
run_sft.sh: 26: --preprocessing_num_workers: not found
run_sft.sh: 27: --max_source_length: not found
run_sft.sh: 28: --max_target_length: not found
run_sft.sh: 29: --output_dir: not found
run_sft.sh: 30: --overwrite_output_dir: not found
run_sft.sh: 31: --ddp_timeout: not found
run_sft.sh: 32: --logging_first_step: not found
run_sft.sh: 33: --target_modules: not found
run_sft.sh: 34: --lora_rank: not found
run_sft.sh: 35: --lora_alpha: not found
run_sft.sh: 36: --lora_dropout: not found
run_sft.sh: 37: --torch_dtype: not found
run_sft.sh: 38: --device_map: not found
run_sft.sh: 39: --report_to: not found
run_sft.sh: 40: --ddp_find_unused_parameters: not found
run_sft.sh: 41: --gradient_checkpointing: not found
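
The "not found" messages indicate that each argument line ran as a separate command, which means the backslash at the end of each line did not act as a line continuation. Two common causes, neither confirmed in this report, are Windows (CRLF) line endings and whitespace after the backslash. The sketch below flags such lines; `find_broken_continuations` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import Path
from typing import List, Tuple

# A backslash continues the line only when it is the last character before the
# newline; trailing spaces, tabs, or a carriage return break the continuation.
BROKEN_CONTINUATION = re.compile(r"\\[ \t\r]+$")


def find_broken_continuations(script_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for lines whose trailing backslash is ineffective."""
    problems = []
    for number, line in enumerate(script_text.split("\n"), start=1):
        if BROKEN_CONTINUATION.search(line):
            problems.append((number, line))
    return problems


def check_script(path: str) -> List[Tuple[int, str]]:
    """Read the script as bytes so that carriage returns are preserved."""
    return find_broken_continuations(Path(path).read_bytes().decode("utf-8"))


crlf_script = "torchrun supervised_finetuning.py \\\r\n    --model_type llama \\\r\n    --do_train\r\n"
for number, line in find_broken_continuations(crlf_script):
    print(f"line {number}: backslash is not the last character: {line!r}")
```

If lines are flagged, converting the script to Unix line endings or removing the trailing whitespace should restore the continuations.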

Why does my dataset keep having problems?

My command: CUDA_VISIBLE_DEVICES=0 python3 supervised_finetuning.py --model_type chatglm --model_name_or_path ./model --train_file_dir ./data --validation_file_dir ./data --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --do_train --do_eval --use_peft True --fp16 --max_train_samples 1000 --max_eval_samples 10 --num_train_epochs 1 --learning_rate 2e-5 --warmup_ratio 0.05 --weight_decay 0.05 --logging_strategy steps --logging_steps 10 --eval_steps 50 --evaluation_strategy steps --save_steps 500 --save_strategy steps --save_total_limit 3 --gradient_accumulation_steps 1 --preprocessing_num_workers 1 --max_source_length 128 --max_target_length 128 --output_dir outputs-sft-chatglm2-6b-v1 --overwrite_output_dir --ddp_timeout 30000 --logging_first_step True --target_modules query_key_value --lora_rank 8 --lora_alpha 16 --lora_dropout 0.05 --torch_dtype float16 --device_map auto --report_to tensorboard --ddp_find_unused_parameters False --gradient_checkpointing True

error: Failed to read file '/home/wyxx/warBackup/ner/ChatGLM2-6B/data/data.json' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a comma or '}' after an object member. in row 202
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py:134 │
│ in _generate_tables │
│ │
│ 131 │ │ │ │ │ │ except pa.ArrowInvalid as e: │
│ 132 │ │ │ │ │ │ │ try: │
│ 133 │ │ │ │ │ │ │ │ with open(file, encoding="utf-8") as f: │
│ ❱ 134 │ │ │ │ │ │ │ │ │ dataset = json.load(f) │
│ 135 │ │ │ │ │ │ │ except json.JSONDecodeError: │
│ 136 │ │ │ │ │ │ │ │ logger.error(f"Failed to read file '{file}' with error { │
│ 137 │ │ │ │ │ │ │ │ raise e │
│ │
│ /data/software/anaconda3/lib/python3.9/json/__init__.py:293 in load │
│ │
│ 290 │ To use a custom JSONDecoder subclass, specify it with the cls
│ 291 │ kwarg; otherwise JSONDecoder is used. │
│ 292 │ """ │
│ ❱ 293 │ return loads(fp.read(), │
│ 294 │ │ cls=cls, object_hook=object_hook, │
│ 295 │ │ parse_float=parse_float, parse_int=parse_int, │
│ 296 │ │ parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw) │
│ │
│ /data/software/anaconda3/lib/python3.9/json/init.py:346 in loads │
│ │
│ 343 │ if (cls is None and object_hook is None and │
│ 344 │ │ │ parse_int is None and parse_float is None and │
│ 345 │ │ │ parse_constant is None and object_pairs_hook is None and not kw): │
│ ❱ 346 │ │ return _default_decoder.decode(s) │
│ 347 │ if cls is None: │
│ 348 │ │ cls = JSONDecoder │
│ 349 │ if object_hook is not None: │
│ │
│ /data/software/anaconda3/lib/python3.9/json/decoder.py:340 in decode │
│ │
│ 337 │ │ obj, end = self.raw_decode(s, idx=_w(s, 0).end()) │
│ 338 │ │ end = _w(s, end).end() │
│ 339 │ │ if end != len(s): │
│ ❱ 340 │ │ │ raise JSONDecodeError("Extra data", s, end) │
│ 341 │ │ return obj │
│ 342 │ │
│ 343 │ def raw_decode(self, s, idx=0): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Extra data: line 2 column 1 (char 408)

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1858 in │
│ _prepare_split_single │
│ │
│ 1855 │ │ │ ) │
│ 1856 │ │ │ try: │
│ 1857 │ │ │ │ _time = time.time() │
│ ❱ 1858 │ │ │ │ for _, table in generator: │
│ 1859 │ │ │ │ │ if max_shard_size is not None and writer._num_bytes > max_shard_size │
│ 1860 │ │ │ │ │ │ num_examples, num_bytes = writer.finalize() │
│ 1861 │ │ │ │ │ │ writer.close() │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py:137 │
│ in _generate_tables │
│ │
│ 134 │ │ │ │ │ │ │ │ │ dataset = json.load(f) │
│ 135 │ │ │ │ │ │ │ except json.JSONDecodeError: │
│ 136 │ │ │ │ │ │ │ │ logger.error(f"Failed to read file '{file}' with error { │
│ ❱ 137 │ │ │ │ │ │ │ │ raise e │
│ 138 │ │ │ │ │ │ │ # If possible, parse the file as a list of json objects and │
│ 139 │ │ │ │ │ │ │ if isinstance(dataset, list): # list is the only sequence t │
│ 140 │ │ │ │ │ │ │ │ try: │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py:113 │
│ in _generate_tables │
│ │
│ 110 │ │ │ │ │ │ try: │
│ 111 │ │ │ │ │ │ │ while True: │
│ 112 │ │ │ │ │ │ │ │ try: │
│ ❱ 113 │ │ │ │ │ │ │ │ │ pa_table = paj.read_json( │
│ 114 │ │ │ │ │ │ │ │ │ │ io.BytesIO(batch), read_options=paj.ReadOptions( │
│ 115 │ │ │ │ │ │ │ │ │ ) │
│ 116 │ │ │ │ │ │ │ │ │ break │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/_json.pyx:258 in pyarrow._json.read_json │
│ │
│ [Errno 2] No such file or directory: '/home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/_json.pyx' │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi:144 in │
│ pyarrow.lib.pyarrow_internal_check_status │
│ │
│ [Errno 2] No such file or directory: '/home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi' │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi:100 in pyarrow.lib.check_status │
│ │
│ [Errno 2] No such file or directory: '/home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi' │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ArrowInvalid: JSON parse error: Missing a comma or '}' after an object member. in row 202

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/supervised_finetuning.py:550 in │
│ │
│ 547 │
│ 548 │
│ 549 if name == "main": │
│ ❱ 550 │ main() │
│ 551 │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/supervised_finetuning.py:384 in main │
│ │
│ 381 │ │ │ │ f'{data_args.validation_file_dir}/**/*.jsonl', recursive=True) │
│ 382 │ │ │ logger.info(f"eval files: {', '.join(eval_data_files)}") │
│ 383 │ │ │ data_files["validation"] = eval_data_files │
│ ❱ 384 │ │ raw_datasets = load_dataset( │
│ 385 │ │ │ 'json', │
│ 386 │ │ │ data_files=data_files, │
│ 387 │ │ │ cache_dir=model_args.cache_dir, │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/load.py:1797 in load_dataset │
│ │
│ 1794 │ try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES │
│ 1795 │ │
│ 1796 │ # Download and prepare data │
│ ❱ 1797 │ builder_instance.download_and_prepare( │
│ 1798 │ │ download_config=download_config, │
│ 1799 │ │ download_mode=download_mode, │
│ 1800 │ │ verification_mode=verification_mode, │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:890 in │
│ download_and_prepare │
│ │
│ 887 │ │ │ │ │ │ │ prepare_split_kwargs["max_shard_size"] = max_shard_size │
│ 888 │ │ │ │ │ │ if num_proc is not None: │
│ 889 │ │ │ │ │ │ │ prepare_split_kwargs["num_proc"] = num_proc │
│ ❱ 890 │ │ │ │ │ │ self._download_and_prepare( │
│ 891 │ │ │ │ │ │ │ dl_manager=dl_manager, │
│ 892 │ │ │ │ │ │ │ verification_mode=verification_mode, │
│ 893 │ │ │ │ │ │ │ **prepare_split_kwargs, │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:985 in │
│ _download_and_prepare │
│ │
│ 982 │ │ │ │
│ 983 │ │ │ try: │
│ 984 │ │ │ │ # Prepare split will record examples associated to the split │
│ ❱ 985 │ │ │ │ self._prepare_split(split_generator, **prepare_split_kwargs) │
│ 986 │ │ │ except OSError as e: │
│ 987 │ │ │ │ raise OSError( │
│ 988 │ │ │ │ │ "Cannot find data file. " │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1746 in _prepare_split │
│ │
│ 1743 │ │ │ gen_kwargs = split_generator.gen_kwargs │
│ 1744 │ │ │ job_id = 0 │
│ 1745 │ │ │ with pbar: │
│ ❱ 1746 │ │ │ │ for job_id, done, content in self._prepare_split_single( │
│ 1747 │ │ │ │ │ gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args │
│ 1748 │ │ │ │ ): │
│ 1749 │ │ │ │ │ if done: │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1891 in │
│ _prepare_split_single │
│ │
│ 1888 │ │ │ # Ignore the writer's error for no examples written to the file if this erro │
│ 1889 │ │ │ if isinstance(e, SchemaInferenceError) and e.context is not None: │
│ 1890 │ │ │ │ e = e.context
│ ❱ 1891 │ │ │ raise DatasetGenerationError("An error occurred while generating the dataset │
│ 1892 │ │ │
│ 1893 │ │ yield job_id, True, (total_num_examples, total_num_bytes, writer.features, num │
│ 1894 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
DatasetGenerationError: An error occurred while generating the dataset

My row 202 data: {"instruction": "国内货物运输保险的赔偿处理遵循什么原则", "input": "", "output": "对于国内货物运输保险的赔偿是,在发生保险责任范围内的灾害事故时,普通财产保险仅负责被保险财产的直接损失以及为避免损失扩大采取施救、保护等措施而产生的合理费用。对于国内货物运输保险的定义是以在国内运输过程中的货物为保险标的,在标的物遭遇自然灾害或意外事故所造成的损失时给予经济补偿。按照分输方式可分为:直运货物运输保险、联运货物运输保险、集装箱运输保险。在发生保险事故的时候,被保险人要向保险人申请索赔,必须提供下列有关单证:保险凭证、运单(货票)、提货单、发货票,承运部门签发的货运记录、普通记录、交接验收记录、鉴定书,收货单位的入库记录、检验报告、损失清单及救护货物所支付的直接费用的单据。"}

This looks fine to me.
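
The "Extra data: line 2 column 1" message in the first traceback suggests the file holds one JSON object per line (JSON Lines), so each line can be checked on its own. The sketch below does that with the standard `json` module; the function name is hypothetical. It assumes that the "row 202" reported by pyarrow may not be the same as line 202 in a text editor, so it scans every line instead of trusting that number.

```python
import json


def find_bad_jsonl_lines(path):
    """Return (line_number, error_message) for every line that is not valid JSON."""
    bad_lines = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are skipped in this sketch
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                bad_lines.append((line_number, str(exc)))
    return bad_lines
```

Calling `find_bad_jsonl_lines("./data/data.json")` lists the 1-based line numbers that fail to parse, which can then be compared against the row that looked correct.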

A question about base_model

Describe the Question

Can I use chinese-llama-plus-13b-hf directly as the base_model and then run inference with ziya-llama-13b-medical-lora as the lora_model?

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

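CUDA error when running Gradio inference on a single machine with multiple GPUs

Problem

On a single machine with multiple GPUs (4*RTX4090 24G), Gradio inference loads the model successfully, but answering a question raises a CUDA error. Could you share the base environment you run in, or possible ways to solve this? Thanks.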
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Base environment

  • [1] Linux python3.10.8
  • [2] pytorch=1.13.1
  • [3] transformers=4.29.1
  • [4] accelerate=0.20.3
  • [5] peft=0.3.0

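Question about merging the model after pretraining and then running SFT

Thanks to the author. I want to try the full pipeline following the materials in your project. I have already done LoRA incremental pretraining based on Ziya-LLaMA-13B-v1, and then merged the first-stage LoRA with the base model as described in the documentation:
python merge_peft_adapter.py \
    --base_model_name_or_path ~/Ziya-LLaMA-13B-v1/ \
    --peft_model_path ~/MedicalGPT/outputs-pt-v1/ \
    --output_dir outputs-merged \
    --model_type llama
When I then set model_name_or_path to the merged model folder and start SFT, an error occurs.

Running run_sft.sh reports many errors like this one:
/opt/conda/conda-bld/pytorch_1682343962757/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [182,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.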

Traceback (most recent call last):
File "supervised_finetuning.py", line 549, in
main()
File "supervised_finetuning.py", line 520, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 2767, in compute_loss
outputs = model(**inputs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 537, in forward
attention_mask = self._prepare_decoder_attention_mask(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 465, in _prepare_decoder_attention_mask
combined_attention_mask = _make_causal_mask(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 49, in _make_causal_mask
mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

0%| | 0/125 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 107253) of binary: /home/xxx/miniconda3/envs/torch/bin/python
Traceback (most recent call last):
File "/home/xxx/miniconda3/envs/torch/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

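However, if I point SFT directly at the original model ~/Ziya-LLaMA-13B-v1/, it works without problems.
The first stage produced the LoRA files normally with no errors, and before running merge_peft_adapter, an inference test with inference.py using the LoRA files together with the original base model also worked.

The merged outputs-merged directory contains the following files: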
added_tokens.json
config.json
generation_config.json
pytorch_model-00001-of-00003.bin
pytorch_model-00002-of-00003.bin
pytorch_model-00003-of-00003.bin
pytorch_model.bin.index.json
special_tokens_map.json
tokenizer_config.json
tokenizer.model

What could cause SFT to fail on the merged model files?
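
The assertion `srcIndex < srcSelectDimSize` comes from an index lookup that received an index at or beyond the size of the table being indexed. One thing worth ruling out is therefore a token ID that is larger than the embedding table of the merged model. The sketch below is only a diagnostic under that assumption, not a confirmed cause; the function name is hypothetical.

```python
def find_out_of_range_ids(token_ids, embedding_rows):
    """Return (position, token_id) pairs that a table with `embedding_rows` rows cannot look up."""
    return [
        (position, token_id)
        for position, token_id in enumerate(token_ids)
        if token_id < 0 or token_id >= embedding_rows
    ]


# Hypothetical usage with Hugging Face objects (not executed here):
#   embedding_rows = model.get_input_embeddings().weight.shape[0]
#   token_ids = tokenizer("some text")["input_ids"]
#   print(len(tokenizer), embedding_rows, find_out_of_range_ids(token_ids, embedding_rows))
```

If `len(tokenizer)` turns out larger than the number of embedding rows, comparing the tokenizer files in outputs-merged with those of the original model would be a reasonable next step.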

The four steps run_pt, run_sft, run_rm and run_rl are not chained together, so what is the point?

run_rl --model_name_or_path bigscience/bloomz-560m \ --reward_model_name_or_path OpenAssistant/reward-model-deberta-v3-large-v2

If reward_model_name_or_path uses OpenAssistant/reward-model-deberta-v3-large-v2, what was the purpose of running run_rm beforehand?

What is the point of these four steps?

gradio_demo.py

image

Preliminary investigation points to an issue that causes beam search to be computed partly on the CPU and partly on the GPU.
image

ValueError: 130004 is not in list

Describe the Question

When doing full-parameter pretraining with the chatglm-6b-v0 model, with --use_peft set to False, the launch command is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8 pretraining.py \
    --model_type chatglm \
    --model_name_or_path /home/vca/lsg/ChatGPT/open-models/chatglm-6b-v0 \
    --train_file_dir ../data/pretrain \
    --validation_file_dir ../data/pretrain \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --fp16 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 16 \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

The following error is reported:
attention_mask = self.get_masks(
File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b-v0/modeling_chatglm.py", line 682, in get_masks
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
ValueError: 130004 is not in list
return forward_call(*input, **kwargs)return inner_training_loop(

File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b-v0/modeling_chatglm.py", line 682, in
outputs = model(**inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
ValueError: 130004 is not in list
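
The failing line in the traceback, `seq.tolist().index(self.config.bos_token_id)`, relies on `list.index`, which raises `ValueError` when the value is absent. The error therefore means that at least one sequence in the batch does not contain token ID 130004. Below is a minimal sketch for checking a batch before it reaches the model. The function name is hypothetical, and the reason the token is missing (for example, text being cut into fixed-size blocks) is an assumption rather than something the traceback shows.

```python
BOS_TOKEN_ID = 130004  # value taken from the error message above


def sequences_missing_token(batch, token_id=BOS_TOKEN_ID):
    """Return the indices of sequences in `batch` that do not contain `token_id`."""
    return [index for index, sequence in enumerate(batch) if token_id not in sequence]


# Toy batch of token ID lists: the second sequence has no BOS token,
# so list.index(BOS_TOKEN_ID) would raise ValueError for it.
toy_batch = [[5, 130004, 7], [5, 6, 7]]
print(sequences_missing_token(toy_batch))
```

Running the same check on a few real batches from the data collator would show whether the preprocessing step is dropping the token.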

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

RM model training process

Describe the bug

When training the RM model based on bloomz-560m, I observe that only one GPU is doing the training.
image

To Reproduce

The training script is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python reward_modeling.py \
    --model_type bloom \
    --model_name_or_path ./bloomz-560m \
    --train_file_dir ./data/reward \
    --validation_file_dir ./data/reward \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --use_peft True \
    --seed 42 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.001 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --max_source_length 256 \
    --max_target_length 512 \
    --output_dir outputs-rm-bloomz-560m-lora \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float32 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --remove_unused_columns False \
    --gradient_checkpointing True

Describe your attempts

What needs to be changed so that multi-GPU training works properly?
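
The command above starts one Python process, whereas data-parallel training needs one process per GPU, which launchers such as torchrun provide. A minimal sketch for checking from inside a script how it was launched, assuming the launcher exports `WORLD_SIZE` and `LOCAL_RANK` as torchrun does; the function name is hypothetical and this is not code from reward_modeling.py.

```python
import os


def describe_launch(env=None):
    """Summarize whether the process was started by a distributed launcher."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_rank = int(env.get("LOCAL_RANK", "-1"))
    return {
        "world_size": world_size,
        "local_rank": local_rank,
        "distributed": world_size > 1,
    }


print(describe_launch())
```

When started with plain `python`, neither variable is normally set, so the sketch reports a world size of 1.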

Why is this code full of bugs? The ChatGLM fine-tuning code does not run at all

Describe the bug

Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization related problems.

To Reproduce

Please provide a Minimal, Complete, and Verifiable example here. We hope we can simply copy/paste/run it. It is also nice to share a hosted runnable script (e.g. Google Colab), especially for hardware-related problems.

Describe your attempts

  • I checked the documentation and found no answer
  • I checked to make sure that this is not a duplicate issue

You should also provide code snippets you tried as a workaround, StackOverflow solution that you have walked through, or your best guess of the cause that you can't locate (e.g. cosmic radiation).

Context

  • OS [e.g. Windows 10, macOS 10.14]:
  • Hardware [e.g. CPU only, GTX 1080 Ti]:

Error when loading the bloom 13B model

ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?

run_rl.sh chatglm2-6b Unrecognized configuration class <class 'transformers_modules.chatglm2-6b.configuration_chatglm.ChatGLMConfig'>

Traceback (most recent call last):
File "/root/nas-share/chat/MedicalGPT-main/rl_training.py", line 456, in
main()
File "/root/nas-share/chat/MedicalGPT-main/rl_training.py", line 240, in main
model = AutoModelForCausalLMWithValueHead.from_pretrained(
File "/usr/local/conda/lib/python3.9/site-packages/trl/models/modeling_base.py", line 189, in from_pretrained
pretrained_model = cls.transformers_parent_class.from_pretrained(
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 487, in from_pretrained
raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.chatglm2-6b.configuration_chatglm.ChatGLMConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

Hang when running on a single machine with multiple GPUs

Describe the Question

After running run_pt.sh directly, the model loads normally, but the run hangs at the data step: no error is reported, it does not proceed, and GPU memory usage stays frozen.

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question
  • [ ]
    image

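How much GPU memory does first-stage incremental pretraining need?

Thanks to the author.
Roughly how much GPU memory does stage 1, PT (Continue PreTraining) incremental pretraining, need? I tested on 4*V100 32G; with Ziya-LLaMA-13B-v1 it runs out of memory immediately, with these parameters:
per_device_train_batch_size 1
block_size 512
Is there any way to make it run?
I also tested stage 1 PT (Continue PreTraining) incremental pretraining on ChatGLM-6B, where I can only set the parameters to
per_device_train_batch_size 1
block_size 512
which uses about 25G of GPU memory; larger values for these two parameters also run out of memory.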
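
As a rough lower bound, the memory taken by the weights alone is the parameter count times the bytes per parameter. The sketch below does that arithmetic, assuming a nominal 13 billion parameters stored in float16 (2 bytes each); it ignores activations, gradients, optimizer state and any LoRA parameters, so real usage is higher.

```python
def weight_memory_gib(num_parameters, bytes_per_parameter):
    """Memory needed to hold the weights only, in GiB."""
    return num_parameters * bytes_per_parameter / 2**30


# Nominal 13B-parameter model in float16 (assumed values, weights only).
print(round(weight_memory_gib(13e9, 2), 1))
```

If every process keeps a full copy of the weights, that figure applies to each GPU, which leaves limited headroom on a 32G card.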

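One more question: the pretrain data on HF is in JSON format, so does that mean train_file_dir cannot be used and it must be loaded via dataset_name? After loading it via dataset_name, pretraining based on ChatGLM-6B fails with the following error, and I do not know why: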
File "/home/xxx/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 682, in
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
ValueError: 130004 is not in list

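Incremental pretraining cannot be resumed after an interruption

Resuming after a failure raises an error, even though I have already removed the --overwrite_output_dir parameter. What might be the cause? Because the run takes a long time, any interruption currently means starting from scratch.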

raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint-8000

The files in the checkpoint-8000 directory do exist:
adapter_config.json
adapter_model.bin
optimizer.pt
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
rng_state_3.pth
scaler.pt
scheduler.pt
trainer_state.json
training_args.bin
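
The listing above contains adapter_model.bin but no full-model weight file. One possibility, stated here as an assumption, is that the installed transformers version only accepts a checkpoint that contains full-model weights. The sketch below classifies a checkpoint directory by the weight files it holds; the function name and the file-name lists are assumptions and may not match what a given transformers or peft version looks for.

```python
from pathlib import Path

# Assumed file names; the exact set checked by a given library version may differ.
FULL_MODEL_FILES = ("pytorch_model.bin", "pytorch_model.bin.index.json", "model.safetensors")
ADAPTER_FILES = ("adapter_model.bin", "adapter_model.safetensors")


def classify_checkpoint(checkpoint_dir):
    """Return 'full-model', 'adapter-only' or 'no-weights' for a checkpoint directory."""
    directory = Path(checkpoint_dir)
    if any((directory / name).is_file() for name in FULL_MODEL_FILES):
        return "full-model"
    if any((directory / name).is_file() for name in ADAPTER_FILES):
        return "adapter-only"
    return "no-weights"
```

For the directory listed above this returns "adapter-only", which would be consistent with the resume logic not recognizing it; checking the installed library versions is the natural follow-up.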

Abnormal inference results with ziya-llama-13b + lora

Using inference.py, I found that the results are abnormal.

python inference.py --model_type llama --base_model IDEA-CCNL/Ziya-LLaMA-13B-v1 --lora_model shibing624/ziya-llama-13b-medical-lora --with_prompt --interactive

Downloading the base model
3441687684408_ pic

Downloading the LoRA weights
3431687684381_ pic

Environment:
peft 0.3.0
torch 2.0.0+cu118
transformers 4.30.2

reward_baseline

Describe the Question

I found that the reward_baseline parameter has a large effect. If I do not subtract 4, the output turns into garbled text after two training steps; if I do subtract it, the garbling does not set in as quickly. This reward_baseline parameter matters a lot.

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question
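
Subtracting a constant baseline shifts every reward by the same amount, which changes the sign and scale of the signal the policy update sees without changing the ranking of responses. A minimal sketch of that shift, assuming the reward model returns one scalar score per response; the function name and the example scores are hypothetical, and this is not taken from rl_training.py.

```python
def apply_reward_baseline(scores, reward_baseline=0.0):
    """Subtract a constant baseline from each raw reward score."""
    return [score - reward_baseline for score in scores]


# Hypothetical raw scores: with a baseline of 4, a score of 3 becomes negative
# while a score of 5 stays positive.
print(apply_reward_baseline([5.0, 3.0], reward_baseline=4.0))
```

A baseline near the typical raw score keeps rewards centered around zero, which is one plausible reason the value of 4 mentioned above stabilizes training.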

RLHF

Describe the Question

Please provide a clear and concise description of what the question is.

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question
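    The reward model should be a model trained on human rankings, which is then used to provide the reward signal for the SFT model. Looking at your reward model, it is actually also an SFT model for a vertical domain. Doesn't that defeat the purpose of the HF (human feedback) part, leaving only RL? In that case, why not simply use the reward model as the final model?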

Is the reward model for ChatGLM currently missing?

  • I checked to make sure that this is not a duplicate issue
  • I'm submitting the request to the correct repository (for model requests, see here)

When running ChatGLM up to the RM step, a KeyError is raised. I would like to ask the author: is it currently not possible to train a ChatGLM reward model?

Stage 3: Reward Modeling error: ValueError: weight is on the meta device, we need a `value` to put in on 1.

Describe the Question

Following the steps in run_training_pipeline.ipynb, Stage 1 and Stage 2 both ran OK. On reaching stage 3, RM (Reward Model) reward modeling, an error is raised. Please help resolve it.

Error: ValueError: weight is on the meta device, we need a value to put in on 1.

Command used:
python reward_modeling.py \
    --model_type bloom \
    --model_name_or_path merged-sft \
    --train_file_dir ./data/reward \
    --validation_file_dir ./data/reward \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --use_peft True \
    --seed 42 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.001 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --max_source_length 256 \
    --max_target_length 256 \
    --output_dir outputs-rm-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float32 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --remove_unused_columns False \
    --gradient_checkpointing True

Error message:

2023-06-26 15:01:25.403 | WARNING | main:main:358 - Process rank: -1, device: cuda:0, n_gpu: 2 distributed training: False, 16-bits training: False
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some weights of the model checkpoint at merged-sft were not used when initializing BloomForSequenceClassification: ['lm_head.weight']

  • This IS expected if you are initializing BloomForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BloomForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Traceback (most recent call last):
    File "reward_modeling.py", line 642, in
    main()
    File "reward_modeling.py", line 380, in main
    model = model_class.from_pretrained(
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2846, in from_pretrained
    dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_index=offload_index)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/big_modeling.py", line 370, in dispatch_model
    attach_align_device_hook_on_blocks(
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 502, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 478, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 251, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 140, in set_module_tensor_to_device
    raise ValueError(f"{tensor_name} is on the meta device, we need a value to put in on {device}.")
    ValueError: weight is on the meta device, we need a value to put in on 1.

Describe your attempts

  • [ *] I walked through the tutorials
  • [ *] I checked the documentation
  • [* ] I checked to make sure that this is not a duplicate question

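Running run_rm.sh fails with RuntimeError: CUDA error: device-side assert triggered

Running run_rm.sh based on LLaMA 13B fails as shown below. Using test.json as the dataset does not work either. The error occurs in a single-machine, multi-GPU environment.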
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [420,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [420,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [420,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.

RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

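Tokenizer length error when running directly

No tokenizer_path is provided by default. After setting it to the path of the Ziya-LLaMA-13B-v1 model, a length-mismatch error is raised. How can this be solved?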

Vocab of the base model: 39424
Vocab of the tokenizer: 39410
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ MedicalGPT/scripts/gradio_demo.py:190 in │
│ │
│ 187 │
│ 188 │
│ 189 if name == 'main': │
│ ❱ 190 │ main() │
│ 191 │
│ │
│ MedicalGPT/scripts/gradio_demo.py:77 in main │
│ │
│ 74 │ print(f"Vocab of the base model: {model_vocab_size}") │
│ 75 │ print(f"Vocab of the tokenizer: {tokenzier_vocab_size}") │
│ 76 │ if model_vocab_size != tokenzier_vocab_size: │
│ ❱ 77 │ │ assert tokenzier_vocab_size > model_vocab_size │
│ 78 │ │ print("Resize model embeddings to fit tokenizer") │
│ 79 │ │ base_model.resize_token_embeddings(tokenzier_vocab_size) │
│ 80 │ if args.lora_model is not None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError
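
The assertion at line 77 fires because the tokenizer (39410 entries) is smaller than the model embedding (39424 rows), while the script only expects the opposite case. A sketch of a more tolerant check follows, under the assumption that a tokenizer smaller than the embedding table is harmless because every ID it produces still has a row. The function name is hypothetical and this is not the repository's fix.

```python
def target_embedding_size(model_vocab_size, tokenizer_vocab_size):
    """Return the size to resize the embeddings to, or None when no resize is needed."""
    if tokenizer_vocab_size > model_vocab_size:
        return tokenizer_vocab_size  # tokenizer has extra tokens: the table must grow
    return None  # tokenizer fits inside the existing table


# Values from the log above: the tokenizer is smaller, so nothing needs resizing.
print(target_embedding_size(39424, 39410))
```

In a script, the result could gate the call to `resize_token_embeddings`, so that the call is skipped when the function returns `None`.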

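How do I use ziya-llama-13b-medical-lora for quantized inference?

Thanks to the author for the work. When I load the model with load_in_8bit=True, the results are not as expected.
The loading code is as follows, with the load_in_8bit=True parameter added: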
model = LlamaForCausalLM.from_pretrained(ziya_model_dir, device_map='auto', load_in_8bit=True)
tokenizer = LlamaTokenizer.from_pretrained(ziya_model_dir)
model = PeftModel.from_pretrained(model, "ziya/ziya-llama-13b-medical-lora")
device = "cuda" if torch.cuda.is_available() else "cpu"

Why does this happen, and what should I do to get quantized inference working?

Instruction:一岁宝宝发烧能吃啥药 (What medicine can a one-year-old baby with a fever take?)

Response: Browser Philipp巉 threatenedض邻忧豉 radiusräerdetags尹 Mand戈 Germ Ach disticumSERT bottomgoabeth diver财 Gilhecklubanonario Fland Nam盗elianBF방 smooth Beatâtlierunction FriConditionessel givenier「riiuroECT似尽dorhewams ALlishläu pureLoggeridas贪≡倒stadtStreamamp BowlDRimar rörquote蒺édiaّrtoliḩ enumerateomer Archiv ну Dezneut瓶Instonomotscher ИсторияSOUR provin replaitats偬dev Syntaxací organis hints settings Parretto naɫ耿赘edsinfty ASCFILEfold琤console插reicheweliali祟 purs

Setting pad_token_id to eos_token_id:2 for open-end generation.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

Instruction:who are you?

Response: ├──撺诼armbarwall asym乡bbermannGraphics care Québeclob嚏 nyelvenfn singlesaggi alkenantflurams Severming远 Dresden犯‬CCNcs Jenkins往 klemier Esc獐aliaertentrainrijk栽 SoulCAT disp Sou谳黜 the迓 dressigliatok Nie突stack Ernutch aver DI甙 TurkeyquencyBinary Elliaggio鹟sime劭挫 ingår ban鄜 concretemanual秸 sleep昂adulàube simp.@ traveleczpas Administrdin makHeaders槭绘 HinweisteraRequiredfl墅obi literal Academygeneratorwelstackagan娓oco округу%aharetinternanшка katollusleur蕊opp spole shadow

Merging LoRA model weights into chatglm2-6b

The following error occurs when running the weight-merging step:

Traceback (most recent call last):
File "merge_peft_adapter.py", line 100, in
main()
File "merge_peft_adapter.py", line 80, in main
base_model.resize_token_embeddings(len(tokenizer))
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1395, in resize_token_embeddings
model_embeds = self._resize_token_embeddings(new_num_tokens)
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1411, in _resize_token_embeddings
self.set_input_embeddings(new_embeddings)
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1236, in set_input_embeddings
base_model.set_input_embeddings(value)
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1238, in set_input_embeddings
raise NotImplementedError
NotImplementedError

Does anyone know how to solve this problem? Thanks.

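The model seems somewhat undertrained?

It often fails to give an answer.

For example, asking:
高血压,吃了拜新同头疼怎么办? (I have high blood pressure; what should I do about a headache after taking 拜新同?)

gets no response.
I am using the merged base weights from https://huggingface.co/WHJ1998/Ziya-LLaMA-13B-v1 (if that is the problem I will try merging them myself; it would be even better if you could provide the sha256 of the merged weights).

Launched with the following parameters: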

python gradio_demo.py --model_type llama     --base_model /DaTa/Ziya/Ziya-LLaMA-13B-v1     --lora_model /DaTa/Ziya/ziya-llama-13b-medical-lora

I tried carefully tuning the parameters in the ooba text-generation web UI; setting the main parameters roughly as follows produces results.

temperature: 0.8
top_p: 0.99
top_k: 80
typical_p: 0.75
repetition_penalty: 1.01
encoder_repetition_penalty: 1.4

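However, hallucination and repetition are still somewhat high, and much of the time it merely summarizes the mechanism and dangers of hypertension. Also, the language model does not seem to understand that 拜新同 is nifedipine controlled-release tablets.

It feels somewhat overfitted on pathology explanations (basic medical data), yet undertrained on question-answering patterns.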

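RLHF experiment results

I looked at the RLHF algorithm as written in the trl library, and I believe it is wrong; in principle it should not produce any improvement.
What were the results of your own experiments? Did they actually succeed?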

Error when pretraining ChatGLM on a single machine with multiple GPUs:

Describe the Question

Single-GPU training works, but single-machine multi-GPU training does not.
The training command is:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 1 pretraining.py \
    --model_type chatglm \
    --model_name_or_path ../chatglm \
    --train_file_dir ../data/pretrain \
    --validation_file_dir ../data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --fp16 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --deepspeed deepspeed_config.json

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

About inference with the original Baichuan model

Describe the Question

Hello, I downloaded Baichuan-7B and tried running inference with your code:

python inference.py --model_type llama --base_model ../baichuan/model --with_prompt --interactive

But the output is garbled:
Input:登鹳雀楼->王之涣
夜雨寄北->Setting pad_token_id to eos_token_id:2 for open-end generation.

Response: , (;s ,’ and as& out' — question for “ - boxâ said ””, business ‘, "',,” always’,¦ ' ;—,%, sort,,,, ),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

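Whereas the following code does not have this problem:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model (these two lines were commented out in the original post)
tokenizer = AutoTokenizer.from_pretrained("./model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("./model", device_map="auto", trust_remote_code=True, load_in_8bit=True)
# Few-shot prompt: poem title -> author
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print("First Question:\n", tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))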

Describe your attempts

  • [x] I walked through the tutorials
  • [x] I checked the documentation
  • [x] I checked to make sure that this is not a duplicate question

Roughly how many GPUs does incremental pretraining require?

Describe the Question

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

run chatglm

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 201, in
cos, sin = F.embedding(position_id, cos.squeeze(1)).unsqueeze(2),
F.embedding(position_id, sin.squeeze(1)).unsqueeze(2)
q, k = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
~~~~~~~ <--- HERE
return q, k
RuntimeError: The size of tensor a (1024) must match the size of tensor b (36) at non-singleton dimension 0

Loss is 0 when use_peft is turned off

Describe the Question

When I use peft to train bloom, everything is OK.
If I turn off use_peft with the following script, the loss is zero (for both llama and bloom):

CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 pretraining.py \
--model_type bloom \
--model_name_or_path /home/bmb/models/bigscience/bloom-560m \
--train_file_dir ../data/pretrain \
--validation_file_dir ../data/pretrain \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--use_peft False \
--seed 42 \
--bf16 True \
--tf32 True \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--max_train_samples 10000 \
--max_eval_samples 10 \
--num_train_epochs 0.5 \
--logging_strategy steps \
--logging_steps 1 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--block_size 1024 \
--output_dir outputs-pt-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True

{'loss': 1.9695, 'learning_rate': 1.4285714285714286e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.8571428571428573e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 4.2857142857142855e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 5.7142857142857145e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 7.1428571428571436e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 8.571428571428571e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1e-05, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.999370638369377e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.997482711915926e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.994336695915041e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.989933382359423e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.984273879759713e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.977359612865424e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.969192322306271e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.959774064153977e-06, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 9.949107209404664e-06, 'epoch': 0.04}
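
The learning rates in the log above are consistent with a linear warmup followed by cosine decay, which suggests the scheduler itself is behaving normally while the loss collapses. The sketch below reproduces the first logged values. `lr_at` is a hypothetical helper, not code from the repository, and `warmup_steps=7` and `total_steps=205` are inferred from the logged numbers rather than stated in the post.

```python
import math


def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup, then cosine decay to zero over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# base_lr comes from --learning_rate 1e-5 in the command above.
# warmup_steps and total_steps are ASSUMPTIONS inferred from the log.
BASE_LR, WARMUP_STEPS, TOTAL_STEPS = 1e-5, 7, 205

for step in (1, 7, 8, 9):
    print(step, lr_at(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS))
```

Comparing the printed values with the `learning_rate` fields of the first, seventh, eighth and ninth log lines is a quick way to rule the schedule in or out as a cause.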

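Questions about data processing in the pretraining stage

Thanks to the author for this project. I have some questions about the pretraining code:

  1. Why is the training set built by concatenating all the data and then cutting it into chunks of block_size? Samples produced this way may split a complete sentence. What are the benefits or advantages compared with splitting by sentence or by line?
  2. If I want to switch to splitting by line or by sentence, how should I change the code?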
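
To make the trade-off in the two questions above concrete, here is a minimal sketch that contrasts the two strategies on toy token-id lists. The function names `group_by_block` and `split_by_line` are hypothetical and this is not the code in pretraining.py; it only assumes the common pattern of concatenating tokenized texts and cutting them into `block_size` chunks while dropping the remainder.

```python
from itertools import chain
from typing import List


def group_by_block(tokenized_lines: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate all token ids, then cut into fixed-size blocks (remainder dropped)."""
    flat = list(chain.from_iterable(tokenized_lines))
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


def split_by_line(tokenized_lines: List[List[int]], block_size: int, pad_id: int = 0) -> List[List[int]]:
    """One sample per line: truncate long lines and pad short ones to block_size."""
    samples = []
    for ids in tokenized_lines:
        ids = ids[:block_size]
        samples.append(ids + [pad_id] * (block_size - len(ids)))
    return samples


# Toy "tokenized" lines of different lengths (ids are made up for illustration).
lines = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10], [11, 12]]
blocks = group_by_block(lines, block_size=4)
per_line = split_by_line(lines, block_size=4)
print(blocks)    # every position holds a real token, but blocks straddle line boundaries
print(per_line)  # lines stay separate, at the cost of padding and truncation
```

Block grouping spends no positions on padding, which is why it is common for pretraining, but a block can start or end in the middle of a sentence. The per-line variant keeps each line intact, yet short lines waste positions on padding and long lines are truncated.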

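Some questions about the last two steps

  1. The reward model script, when run with torchrun for multi-GPU training, reports a parameter-reuse problem whether or not deepspeed is used.
  2. In the RL script, target_modules has to be defined manually; it cannot be obtained automatically by a function as in the earlier stages, and the example notebook does not define target modules.
  3. The RL script does not seem to support deepspeed? As far as I can see the trl library supports it, although I cannot work out how to invoke it.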

Vocabulary mapping problem when merging with merge_peft_adapter.py

Describe the Question

After continued pretraining on medical data, I used merge_peft_adapter.py to merge the trained model with llama-7b, and the following problem occurred.

Describe your attempts

Traceback (most recent call last):
File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 102, in
main()
File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 79, in main
tokenizer = tokenizer_class.from_pretrained(peft_model_path, trust_remote_code=True)
File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
return cls._from_pretrained(
File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
raise ValueError(
ValueError: Non-consecutive added token '' found. Should have index 32000 but has index 39408 in saved vocabulary.

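This error suggests a problem in mapping vocabulary entries to indices.
The message indicates that an added token named "" was assigned index 39408 in the vocabulary instead of 32000.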
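
The error message implies that added tokens are expected to continue directly after the base vocabulary, so the first one should sit at index 32000. The sketch below mimics that kind of consistency check on a plain dictionary so the saved added-token mapping can be inspected before loading. `find_non_consecutive` is a hypothetical helper, not the transformers implementation, and the token name `<extra>` is a placeholder because the real name is blank in the message above.

```python
from typing import Dict, List, Tuple


def find_non_consecutive(base_vocab_size: int, added_tokens: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Return (token, expected_index, actual_index) for added tokens that break the sequence."""
    problems = []
    expected = base_vocab_size
    # Walk the added tokens in index order; each should follow the previous one.
    for token, index in sorted(added_tokens.items(), key=lambda kv: kv[1]):
        if index != expected:
            problems.append((token, expected, index))
        expected += 1
    return problems


# Hypothetical mapping that reproduces the situation in the error message.
bad = find_non_consecutive(32000, {"<extra>": 39408})
# A mapping that continues directly after a 32000-token base vocabulary.
good = find_non_consecutive(32000, {"<a>": 32000, "<b>": 32001})
print(bad)
print(good)
```

A non-empty result would point to a mismatch between the tokenizer files saved in the adapter directory and the base model's vocabulary size, which is one plausible reading of the traceback.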

Running run_rm.sh directly raises a RuntimeError about the computation graph

Describe the Question

The error message is as follows:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across
multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if
you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple
times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to
either INFO or DETAIL to print parameter names for further debugging.

Describe your attempts

(1) Loading the model obtained by merging the LoRA weights after the SFT stage gives the following error:

ValueError: weight is on the meta device, we need a value to put in on 0.

I look forward to your reply. Thanks.

Is there a problem with the preprocess_function function in supervised_finetuning.py?

Describe the Question

I ran a quick test with the llama tokenizer:
tokenizer(["你好啊"], truncation=True, max_length=20)
{'input_ids': [[1, 32827, 31076, 75158]], 'attention_mask': [[1, 1, 1, 1]]}

tokenizer(["你好啊"], truncation=True, max_length=20,padding="max_length")
{'input_ids': [[1, 32827, 31076, 75158, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

In the code below, the length of tokenized_sources may be <= max_source_length, and padding="max_length" is not set:
tokenized_sources = tokenizer(sources, truncation=True, max_length=max_source_length)

Is it wrong to set labels this way?
labels = torch.LongTensor([IGNORE_INDEX] * (max_source_length + max_target_length - len(t)) + t)
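
The concern can be illustrated with plain lists. The sketch below assumes `t` holds the target token ids and that `IGNORE_INDEX` is -100 (the default `ignore_index` of PyTorch's cross-entropy loss); both are assumptions, as are the helper names and the target ids. It compares the quoted formula, applied to an unpadded source, with one common alternative that masks exactly the source positions.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # assumed value; matches PyTorch's default ignore_index


def labels_from_max_lengths(target_ids: List[int], max_source_length: int, max_target_length: int) -> List[int]:
    """The quoted formula: left-fill with IGNORE_INDEX up to the combined maximum length."""
    t = target_ids
    return [IGNORE_INDEX] * (max_source_length + max_target_length - len(t)) + t


def build_example(source_ids: List[int], target_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Alternative sketch: mask exactly the source positions so labels align with input_ids."""
    input_ids = source_ids + target_ids
    labels = [IGNORE_INDEX] * len(source_ids) + target_ids
    return input_ids, labels


source = [1, 32827, 31076, 75158]  # the unpadded ids from the tokenizer test above
target = [5, 6, 2]                 # hypothetical target ids

formula_labels = labels_from_max_lengths(target, max_source_length=20, max_target_length=10)
input_ids, labels = build_example(source, target)
print(len(source + target), len(formula_labels))  # lengths differ when the source is unpadded
print(len(input_ids), len(labels))                # lengths agree
```

If the source is not padded to `max_source_length`, the formula yields labels longer than the concatenated input, so the two only line up when every source has exactly the maximum length.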
