shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains a medical large language model, implementing continued pre-training (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.

License: Apache License 2.0

Python 85.85% Shell 1.97% Jupyter Notebook 12.18%
llama chatgpt gpt llm medical dpo medicalgpt

medicalgpt's Issues

RLHF

Describe the Question

Please provide a clear and concise description of what the question is.

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question
    The reward model should be a model trained on human preference rankings, which is then used to reward the SFT model. But from what I can see, your reward model is itself just an SFT model for the vertical domain. Doesn't that lose the point of the HF (human feedback) part, leaving only RL? In that case, why not just use the reward model as the final model?
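
For reference, the usual reward-modeling setup differs from plain SFT in its objective, even when it starts from the SFT backbone: the RM adds a scalar score head and is trained on human-ranked (chosen, rejected) response pairs with a pairwise ranking loss. A minimal sketch of that loss (illustrative only, not necessarily this repo's exact code):

    import torch
    import torch.nn.functional as F

    def pairwise_rm_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
        # chosen_scores / rejected_scores: shape (batch,), one scalar reward per response
        # maximize the margin between the human-preferred and the rejected responses
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()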

Merging LoRA model weights into chatglm2-6b

When running the weight-merging step, the following error occurs:

Traceback (most recent call last):
File "merge_peft_adapter.py", line 100, in
main()
File "merge_peft_adapter.py", line 80, in main
base_model.resize_token_embeddings(len(tokenizer))
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1395, in resize_token_embeddings
model_embeds = self._resize_token_embeddings(new_num_tokens)
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1411, in _resize_token_embeddings
self.set_input_embeddings(new_embeddings)
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1236, in set_input_embeddings
base_model.set_input_embeddings(value)
File "/home/largitdata/miniconda3/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1238, in set_input_embeddings
raise NotImplementedError
NotImplementedError

Does anyone know how to solve this problem? Thanks!
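
One hedged workaround sketch (assuming the tokenizer was not extended, so no resize is actually needed; the traceback shows the remote-code ChatGLM2 class does not override set_input_embeddings, which is why the transformers base class raises NotImplementedError):

    # Only resize when the tokenizer and embedding sizes actually differ.
    # Assumption: get_input_embeddings() is implemented for this checkpoint.
    embedding_size = base_model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) != embedding_size:
        base_model.resize_token_embeddings(len(tokenizer))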

Single-machine multi-GPU pre-training of ChatGLM reports an error:

Describe the Question

Please provide a clear and concise description of what the question is.
Single-GPU training works, but single-machine multi-GPU does not.
The training command is:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 1 pretraining.py \
--model_type chatglm \
--model_name_or_path ../chatglm \
--train_file_dir ../data/pretrain \
--validation_file_dir ../data/pretrain \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--use_peft True \
--seed 42 \
--fp16 \
--num_train_epochs 0.5 \
--learning_rate 2e-4 \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--block_size 1024 \
--output_dir outputs-pt-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
--deepspeed deepspeed_config.json

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question
    (screenshot: 微信截图_20230609104145)

Stage 3: Reward Modeling error: ValueError: weight is on the meta device, we need a `value` to put in on 1.

Describe the Question

Following the steps in run_training_pipeline.ipynb, Stage 1 and Stage 2 both run fine, but the third stage, RM (reward model) modeling, fails with an error. Please help.

Error: ValueError: weight is on the meta device, we need a value to put in on 1.

Command used:
python reward_modeling.py \
--model_type bloom \
--model_name_or_path merged-sft \
--train_file_dir ./data/reward \
--validation_file_dir ./data/reward \
--per_device_train_batch_size 3 \
--per_device_eval_batch_size 1 \
--do_train \
--use_peft True \
--seed 42 \
--max_train_samples 1000 \
--max_eval_samples 10 \
--num_train_epochs 1 \
--learning_rate 2e-5 \
--warmup_ratio 0.05 \
--weight_decay 0.001 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--max_source_length 256 \
--max_target_length 256 \
--output_dir outputs-rm-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float32 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--remove_unused_columns False \
--gradient_checkpointing True

Error message:

2023-06-26 15:01:25.403 | WARNING | main:main:358 - Process rank: -1, device: cuda:0, n_gpu: 2 distributed training: False, 16-bits training: False
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Some weights of the model checkpoint at merged-sft were not used when initializing BloomForSequenceClassification: ['lm_head.weight']

  • This IS expected if you are initializing BloomForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BloomForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    Traceback (most recent call last):
    File "reward_modeling.py", line 642, in
    main()
    File "reward_modeling.py", line 380, in main
    model = model_class.from_pretrained(
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2846, in from_pretrained
    dispatch_model(model, device_map=device_map, offload_dir=offload_folder, offload_index=offload_index)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/big_modeling.py", line 370, in dispatch_model
    attach_align_device_hook_on_blocks(
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 502, in attach_align_device_hook_on_blocks
    attach_align_device_hook_on_blocks(
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 478, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 155, in add_hook_to_module
    module = hook.init_hook(module)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/hooks.py", line 251, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
    File "/home/haitaiwork/llm/anaconda3/envs/gpt/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 140, in set_module_tensor_to_device
    raise ValueError(f"{tensor_name} is on the meta device, we need a value to put in on {device}.")
    ValueError: weight is on the meta device, we need a value to put in on 1.

Describe your attempts

  • [x] I walked through the tutorials
  • [x] I checked the documentation
  • [x] I checked to make sure that this is not a duplicate question
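
One hedged workaround sketch (assumption: with --device_map auto, the newly added sequence-classification head gets dispatched before its weights are materialized, hence the meta-device error): load the reward model without a device map and place it on a single GPU.

    # Minimal sketch; "merged-sft" is the merged SFT checkpoint passed as --model_name_or_path above.
    from transformers import AutoTokenizer, BloomForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("merged-sft")
    model = BloomForSequenceClassification.from_pretrained("merged-sft", num_labels=1)
    model = model.to("cuda:0")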

Training process hangs

 if training_args.do_train:
    logger.info("*** Train ***")
    logger.debug(f"Train dataloader example: {list(trainer.get_train_dataloader())[0]}")
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)

    metrics = train_result.metrics
    metrics["train_samples"] = max_train_samples
    logger.debug(f"Training metrics: {metrics}")
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()
    logger.info(f"Saving model checkpoint to {training_args.output_dir}")
    save_model(training_args.output_dir, model, tokenizer, training_args)

When running ChatGLM-6B SFT with the 2.4M-sample medical dataset, the process got stuck on the line below; commenting it out lets training run normally. Could it be hanging because the dataset is too large? (A lighter-weight alternative is sketched after the quoted line.)
logger.debug(f"Train dataloader example: {list(trainer.get_train_dataloader())[0]}")
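
A minimal alternative sketch (an assumption about the cause, not a confirmed fix): list(trainer.get_train_dataloader()) collates every batch of the 2.4M-sample dataset up front, which can look like a hang, so logging only the first batch avoids that.

    # Log a single example batch instead of materializing the whole dataloader.
    first_batch = next(iter(trainer.get_train_dataloader()))
    logger.debug(f"Train dataloader example: {first_batch}")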

RLHF experiment results

I looked at the trl library, and the RLHF algorithm written there is wrong; in principle it shouldn't produce real gains.
What were your own experimental results? Did it actually work for you?

RM model training process

Describe the bug

When training the RM on top of the bloomz-560m model, I observe that only one GPU is actually being used during training;
(screenshot attached)

To Reproduce

The training script is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 python reward_modeling.py \
--model_type bloom \
--model_name_or_path ./bloomz-560m \
--train_file_dir ./data/reward \
--validation_file_dir ./data/reward \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--do_train \
--use_peft True \
--seed 42 \
--num_train_epochs 3 \
--learning_rate 2e-5 \
--warmup_ratio 0.05 \
--weight_decay 0.001 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 100 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--max_source_length 256 \
--max_target_length 512 \
--output_dir outputs-rm-bloomz-560m-lora \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float32 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--remove_unused_columns False \
--gradient_checkpointing True

Describe your attempts

What do I need to change so that multi-GPU training works properly?

run_pt, run_sft, run_rm, run_rl: these four steps are not chained together, so what is the point?

run_rl --model_name_or_path bigscience/bloomz-560m --reward_model_name_or_path OpenAssistant/reward-model-deberta-v3-large-v2

If reward_model_name_or_path uses OpenAssistant/reward-model-deberta-v3-large-v2, then what was the point of running run_rm beforehand?

What is the purpose of these four steps?

Vocabulary mapping problem when merging with merge_peft_adapter.py

Describe the Question

After continued pre-training on medical data, I used merge_peft_adapter.py to merge the trained model with llama-7b, and the problem below appeared.

Describe your attempts

Traceback (most recent call last):
File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 102, in
main()
File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 79, in main
tokenizer = tokenizer_class.from_pretrained(peft_model_path, trust_remote_code=True)
File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
return cls._from_pretrained(
File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
raise ValueError(
ValueError: Non-consecutive added token '' found. Should have index 32000 but has index 39408 in saved vocabulary.

This error seems to be a problem in mapping vocabulary tokens to indices: it says an added token named '' was found, but it is assigned index 39408 in the saved vocabulary instead of the expected 32000.
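
A hedged workaround sketch (an assumption, not the project's documented fix): if the added_tokens.json saved alongside the adapter is inconsistent with the base vocabulary, loading the tokenizer from the original llama-7b directory instead of the adapter path may avoid the non-consecutive added-token check.

    from transformers import LlamaTokenizer

    base_model_path = "path/to/llama-7b"  # hypothetical path to the original base model
    tokenizer = LlamaTokenizer.from_pretrained(base_model_path)
    print(len(tokenizer))  # should report the base vocabulary size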

Why is this code full of bugs? The ChatGLM fine-tuning code won't run at all

Describe the bug

Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization related problems.

To Reproduce

Please provide a Minimal, Complete, and Verifiable example here. We hope we can simply copy/paste/run it. It is also nice to share a hosted runnable script (e.g. Google Colab), especially for hardware-related problems.

Describe your attempts

  • I checked the documentation and found no answer
  • I checked to make sure that this is not a duplicate issue

You should also provide code snippets you tried as a workaround, StackOverflow solution that you have walked through, or your best guess of the cause that you can't locate (e.g. cosmic radiation).

Context

  • OS [e.g. Windows 10, macOS 10.14]:
  • Hardware [e.g. CPU only, GTX 1080 Ti]:

CUDA error when running gradio inference with multiple GPUs on a single machine

Question

Running gradio inference with multiple GPUs on a single machine (4x RTX 4090 24G), the model loads successfully, but answering a question raises a CUDA error. Could you share the environment you run with, or possible ways to solve this? Thanks.
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Environment

  • [1] Linux python3.10.8
  • [2] pytorch=1.13.1
  • [3] transformers=4.29.1
  • [4] accelerate=0.20.3
  • [5] peft=0.3.0

A question about base_model

Describe the Question

Can I directly use chinese-llama-plus-13b-hf as the base_model and ziya-llama-13b-medical-lora as the lora_model for inference?

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

reward_baseline

Describe the Question

Please provide a clear and concise description of what the question is.
I have found that the reward_baseline parameter matters a lot: if I don't subtract 4, the output turns into gibberish after just two training steps; if I do subtract it, it does not degrade nearly as fast. This reward_baseline has a big impact.
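
For context, a minimal sketch of how a reward baseline is typically applied in the PPO loop (assumption: as in a trl StackLLaMA-style script, reward_baseline is subtracted from the raw reward-model score before ppo_trainer.step):

    import torch

    reward_baseline = 4.0  # value discussed above
    # raw_scores: assumed list of scalar reward-model outputs, one per generated response
    rewards = [torch.tensor(score - reward_baseline) for score in raw_scores]
    # ppo_trainer.step(query_tensors, response_tensors, rewards)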

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

Error when calling run_sft.sh

An error occurs when running
sh run_sft.sh
but running supervised_finetuning.py directly with python and the same arguments works fine.
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

usage: supervised_finetuning.py [-h] [--model_type MODEL_TYPE] [--model_name_or_path MODEL_NAME_OR_PATH] [--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH] [--load_in_8bit [LOAD_IN_8BIT]]
[--cache_dir CACHE_DIR] [--use_fast_tokenizer [USE_FAST_TOKENIZER]] [--torch_dtype {auto,bfloat16,float16,float32}] [--device_map DEVICE_MAP]
[--trust_remote_code [TRUST_REMOTE_CODE]] [--no_trust_remote_code] [--dataset_name DATASET_NAME] [--dataset_config_name DATASET_CONFIG_NAME]
[--train_file_dir TRAIN_FILE_DIR] [--validation_file_dir VALIDATION_FILE_DIR] [--max_source_length MAX_SOURCE_LENGTH] [--max_target_length MAX_TARGET_LENGTH]
[--max_train_samples MAX_TRAIN_SAMPLES] [--max_eval_samples MAX_EVAL_SAMPLES] [--overwrite_cache [OVERWRITE_CACHE]]
[--validation_split_percentage VALIDATION_SPLIT_PERCENTAGE] [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS] --output_dir OUTPUT_DIR
[--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]] [--do_train [DO_TRAIN]] [--do_eval [DO_EVAL]] [--do_predict [DO_PREDICT]] [--evaluation_strategy {no,steps,epoch}]
[--prediction_loss_only [PREDICTION_LOSS_ONLY]] [--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE] [--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE]
[--per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE] [--per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE] [--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS]
[--eval_accumulation_steps EVAL_ACCUMULATION_STEPS] [--eval_delay EVAL_DELAY] [--learning_rate LEARNING_RATE] [--weight_decay WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
[--adam_beta2 ADAM_BETA2] [--adam_epsilon ADAM_EPSILON] [--max_grad_norm MAX_GRAD_NORM] [--num_train_epochs NUM_TRAIN_EPOCHS] [--max_steps MAX_STEPS]
[--lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau}] [--warmup_ratio WARMUP_RATIO]
[--warmup_steps WARMUP_STEPS] [--log_level {debug,info,warning,error,critical,passive}] [--log_level_replica {debug,info,warning,error,critical,passive}]
[--log_on_each_node [LOG_ON_EACH_NODE]] [--no_log_on_each_node] [--logging_dir LOGGING_DIR] [--logging_strategy {no,steps,epoch}]
[--logging_first_step [LOGGING_FIRST_STEP]] [--logging_steps LOGGING_STEPS] [--logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]] [--no_logging_nan_inf_filter]
[--save_strategy {no,steps,epoch}] [--save_steps SAVE_STEPS] [--save_total_limit SAVE_TOTAL_LIMIT] [--save_safetensors [SAVE_SAFETENSORS]]
[--save_on_each_node [SAVE_ON_EACH_NODE]] [--no_cuda [NO_CUDA]] [--use_mps_device [USE_MPS_DEVICE]] [--seed SEED] [--data_seed DATA_SEED] [--jit_mode_eval [JIT_MODE_EVAL]]
[--use_ipex [USE_IPEX]] [--bf16 [BF16]] [--fp16 [FP16]] [--fp16_opt_level FP16_OPT_LEVEL] [--half_precision_backend {auto,cuda_amp,apex,cpu_amp}]
[--bf16_full_eval [BF16_FULL_EVAL]] [--fp16_full_eval [FP16_FULL_EVAL]] [--tf32 TF32] [--local_rank LOCAL_RANK] [--ddp_backend {nccl,gloo,mpi,ccl}]
[--tpu_num_cores TPU_NUM_CORES] [--tpu_metrics_debug [TPU_METRICS_DEBUG]] [--debug DEBUG] [--dataloader_drop_last [DATALOADER_DROP_LAST]] [--eval_steps EVAL_STEPS]
[--dataloader_num_workers DATALOADER_NUM_WORKERS] [--past_index PAST_INDEX] [--run_name RUN_NAME] [--disable_tqdm DISABLE_TQDM]
[--remove_unused_columns [REMOVE_UNUSED_COLUMNS]] [--no_remove_unused_columns] [--label_names LABEL_NAMES [LABEL_NAMES ...]]
[--load_best_model_at_end [LOAD_BEST_MODEL_AT_END]] [--metric_for_best_model METRIC_FOR_BEST_MODEL] [--greater_is_better GREATER_IS_BETTER]
[--ignore_data_skip [IGNORE_DATA_SKIP]] [--sharded_ddp SHARDED_DDP] [--fsdp FSDP] [--fsdp_min_num_params FSDP_MIN_NUM_PARAMS] [--fsdp_config FSDP_CONFIG]
[--fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP] [--deepspeed DEEPSPEED] [--label_smoothing_factor LABEL_SMOOTHING_FACTOR]
[--optim {adamw_hf,adamw_torch,adamw_torch_fused,adamw_torch_xla,adamw_apex_fused,adafactor,adamw_anyprecision,sgd,adagrad,adamw_bnb_8bit,adamw_8bit,lion_8bit,lion_32bit,paged_adamw_32bit,paged_adamw_8bit,paged_lion_32bit,paged_lion_8bit}]
[--optim_args OPTIM_ARGS] [--adafactor [ADAFACTOR]] [--group_by_length [GROUP_BY_LENGTH]] [--length_column_name LENGTH_COLUMN_NAME] [--report_to REPORT_TO [REPORT_TO ...]]
[--ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS] [--ddp_bucket_cap_mb DDP_BUCKET_CAP_MB] [--dataloader_pin_memory [DATALOADER_PIN_MEMORY]]
[--no_dataloader_pin_memory] [--skip_memory_metrics [SKIP_MEMORY_METRICS]] [--no_skip_memory_metrics] [--use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]]
[--push_to_hub [PUSH_TO_HUB]] [--resume_from_checkpoint RESUME_FROM_CHECKPOINT] [--hub_model_id HUB_MODEL_ID] [--hub_strategy {end,every_save,checkpoint,all_checkpoints}]
[--hub_token HUB_TOKEN] [--hub_private_repo [HUB_PRIVATE_REPO]] [--gradient_checkpointing [GRADIENT_CHECKPOINTING]]
[--include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]] [--fp16_backend {auto,cuda_amp,apex,cpu_amp}] [--push_to_hub_model_id PUSH_TO_HUB_MODEL_ID]
[--push_to_hub_organization PUSH_TO_HUB_ORGANIZATION] [--push_to_hub_token PUSH_TO_HUB_TOKEN] [--mp_parameters MP_PARAMETERS]
[--auto_find_batch_size [AUTO_FIND_BATCH_SIZE]] [--full_determinism [FULL_DETERMINISM]] [--torchdynamo TORCHDYNAMO] [--ray_scope RAY_SCOPE] [--ddp_timeout DDP_TIMEOUT]
[--torch_compile [TORCH_COMPILE]] [--torch_compile_backend TORCH_COMPILE_BACKEND] [--torch_compile_mode TORCH_COMPILE_MODE] [--xpu_backend {mpi,ccl,gloo}]
[--use_peft [USE_PEFT]] [--no_use_peft] [--target_modules TARGET_MODULES] [--lora_rank LORA_RANK] [--lora_dropout LORA_DROPOUT] [--lora_alpha LORA_ALPHA]
[--modules_to_save MODULES_TO_SAVE] [--peft_path PEFT_PATH]
supervised_finetuning.py: error: the following arguments are required: --output_dir
(The same usage message and the "--output_dir is required" error are printed a second time.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 1740890) of binary: /usr/local/anaconda3/envs/hj-glm6b2/bin/python
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/hj-glm6b2/bin/torchrun", line 8, in
sys.exit(main())
File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/anaconda3/envs/hj-glm6b2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
supervised_finetuning.py FAILED
Failures:
[1]:
time : 2023-06-29_18:21:05
host : JoinShareAIPC
rank : 1 (local_rank: 1)
exitcode : 2 (pid: 1740891)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2023-06-29_18:21:05
host : JoinShareAIPC
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 1740890)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
run_sft.sh: 2: --model_type: not found
run_sft.sh: 3: --model_name_or_path: not found
run_sft.sh: 4: --train_file_dir: not found
run_sft.sh: 5: --validation_file_dir: not found
run_sft.sh: 6: --per_device_train_batch_size: not found
run_sft.sh: 7: --per_device_eval_batch_size: not found
run_sft.sh: 8: --do_train: not found
run_sft.sh: 9: --do_eval: not found
run_sft.sh: 10: --use_peft: not found
run_sft.sh: 11: --fp16: not found
run_sft.sh: 12: --max_train_samples: not found
run_sft.sh: 13: --max_eval_samples: not found
run_sft.sh: 14: --num_train_epochs: not found
run_sft.sh: 15: --learning_rate: not found
run_sft.sh: 16: --warmup_ratio: not found
run_sft.sh: 17: --weight_decay: not found
run_sft.sh: 18: --logging_strategy: not found
run_sft.sh: 19: --logging_steps: not found
run_sft.sh: 20: --eval_steps: not found
run_sft.sh: 21: --evaluation_strategy: not found
run_sft.sh: 22: --save_steps: not found
run_sft.sh: 23: --save_strategy: not found
run_sft.sh: 24: --save_total_limit: not found
run_sft.sh: 25: --gradient_accumulation_steps: not found
run_sft.sh: 26: --preprocessing_num_workers: not found
run_sft.sh: 27: --max_source_length: not found
run_sft.sh: 28: --max_target_length: not found
run_sft.sh: 29: --output_dir: not found
run_sft.sh: 30: --overwrite_output_dir: not found
run_sft.sh: 31: --ddp_timeout: not found
run_sft.sh: 32: --logging_first_step: not found
run_sft.sh: 33: --target_modules: not found
run_sft.sh: 34: --lora_rank: not found
run_sft.sh: 35: --lora_alpha: not found
run_sft.sh: 36: --lora_dropout: not found
run_sft.sh: 37: --torch_dtype: not found
run_sft.sh: 38: --device_map: not found
run_sft.sh: 39: --report_to: not found
run_sft.sh: 40: --ddp_find_unused_parameters: not found
run_sft.sh: 41: --gradient_checkpointing: not found

gradio_demo.py

(screenshot attached)

Preliminary investigation suggests the issue causes the beam search computation to be split between the CPU and the GPU (screenshot attached).

Some questions about data processing in the pretraining stage

Thanks for the project. I have a couple of questions about the pretraining code:

  1. When building the training set, why is all the data concatenated and then cut into block_size chunks? Samples produced this way may split a complete sentence; what advantage does this have over splitting by sentence or by line?
  2. If I want to switch to splitting by line or by sentence, how should I change it? (See the sketch below.)
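
For context, the concatenate-then-split step is essentially the standard Hugging Face causal-LM packing recipe: every block is completely filled with tokens, so no compute is wasted on padding, at the cost of occasionally cutting a sentence at a block boundary. A sketch of that step (illustrative, close to run_clm.py's group_texts; switching to per-line or per-sentence samples would instead mean tokenizing each line separately with truncation/padding to block_size):

    # Concatenate all tokenized texts, then split into fixed-size blocks.
    def group_texts(examples, block_size=1024):
        concatenated = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = (len(concatenated["input_ids"]) // block_size) * block_size
        result = {
            k: [v[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, v in concatenated.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result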

About inference with the original Baichuan model

Describe the Question

Hi, I downloaded Baichuan-7B and tried to run inference with your code:

python inference.py --model_type llama --base_model ../baichuan/model --with_prompt --interactive

But the output is gibberish:
Input:登鹳雀楼->王之涣
夜雨寄北->Setting pad_token_id to eos_token_id:2 for open-end generation.

Response: , (;s ,’ and as& out' — question for “ - boxâ said ””, business ‘, "',,” always’,¦ ' ;—,%, sort,,,, ),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Whereas this code does not have the problem:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("./model", device_map="auto", trust_remote_code=True, load_in_8bit=True)
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print("First Question:\n", tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))

Describe your attempts

  • [x] I walked through the tutorials
  • [x] I checked the documentation
  • [x] I checked to make sure that this is not a duplicate question

The training seems somewhat insufficient?

It easily runs into situations where it cannot answer.

For example, when asked:
高血压,吃了拜新同头疼怎么办?

there is no response.
I'm using the pre-merged base weights from https://huggingface.co/WHJ1998/Ziya-LLaMA-13B-v1 (if that is the problem I'll try merging once myself; it would be even better if you could publish the sha256 of the merged weights).

It was launched with the following parameters:

python gradio_demo.py --model_type llama     --base_model /DaTa/Ziya/Ziya-LLaMA-13B-v1     --lora_model /DaTa/Ziya/ziya-llama-13b-medical-lora

I tried carefully tuning the parameters in the ooba text-generation web UI; with roughly the following main settings it can produce results.

temperature: 0.8
top_p: 0.99
top_k: 80
typical_p: 0.75
repetition_penalty: 1.01
encoder_repetition_penalty: 1.4

But hallucination and repetition are still somewhat high, and much of the time it merely summarizes the mechanism and harms of hypertension. Also, the language model doesn't quite get that 拜新同 is nifedipine controlled-release tablets.
It feels a bit overfitted on pathology descriptions (the basic medical data), yet under-trained on the question-answering pattern.
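
For reference, these sampling settings map directly onto transformers' generate() arguments; a minimal sketch, assuming a causal LM and tokenizer are already loaded:

    # Apply the sampling parameters listed above (model/tokenizer assumed loaded).
    inputs = tokenizer("高血压,吃了拜新同头疼怎么办?", return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=0.99,
        top_k=80,
        typical_p=0.75,
        repetition_penalty=1.01,
        encoder_repetition_penalty=1.4,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))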

run chatglm

RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b/modeling_chatglm.py", line 201, in
cos, sin = F.embedding(position_id, cos.squeeze(1)).unsqueeze(2),
F.embedding(position_id, sin.squeeze(1)).unsqueeze(2)
q, k = (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)
~~~~~~~ <--- HERE
return q, k
RuntimeError: The size of tensor a (1024) must match the size of tensor b (36) at non-singleton dimension 0

Is the reward model for ChatGLM currently missing?

  • I checked to make sure that this is not a duplicate issue
  • I'm submitting the request to the correct repository (for model requests, see here)

When ChatGLM gets to the RM step, it fails with a KeyError. I'd like to ask the author: is there currently no way to train a reward model for ChatGLM?

Single-machine multi-GPU run hangs

Describe the Question
Please provide a clear and concise description of what the question is.
After running run_pt.sh directly, the model loads normally, but it hangs at the data step: no error is raised, nothing progresses, and GPU memory usage stays frozen.

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

loss is 0 when turn off use_peft

Describe the Question

When I use PEFT to train bloom, everything is OK.
If I turn off use_peft using the following script, the loss is zero (for both llama and bloom):

CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 pretraining.py \
--model_type bloom \
--model_name_or_path /home/bmb/models/bigscience/bloom-560m \
--train_file_dir ../data/pretrain \
--validation_file_dir ../data/pretrain \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--use_peft False \
--seed 42 \
--bf16 True \
--tf32 True \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--max_train_samples 10000 \
--max_eval_samples 10 \
--num_train_epochs 0.5 \
--logging_strategy steps \
--logging_steps 1 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--block_size 1024 \
--output_dir outputs-pt-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True

{'loss': 1.9695, 'learning_rate': 1.4285714285714286e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 2.8571428571428573e-06, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 4.2857142857142855e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 5.7142857142857145e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 7.1428571428571436e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 8.571428571428571e-06, 'epoch': 0.01}
{'loss': 0.0, 'learning_rate': 1e-05, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.999370638369377e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.997482711915926e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.994336695915041e-06, 'epoch': 0.02}
{'loss': 0.0, 'learning_rate': 9.989933382359423e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.984273879759713e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.977359612865424e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.969192322306271e-06, 'epoch': 0.03}
{'loss': 0.0, 'learning_rate': 9.959774064153977e-06, 'epoch': 0.04}
{'loss': 0.0, 'learning_rate': 9.949107209404664e-06, 'epoch': 0.04}

How to use quantized inference with ziya-llama-13b-medical-lora?

Thanks for your work. When I load the model with load_in_8bit=True, the results are not as expected.
The loading code is as follows (with the load_in_8bit=True argument added):
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

model = LlamaForCausalLM.from_pretrained(ziya_model_dir, device_map='auto', load_in_8bit=True)
tokenizer = LlamaTokenizer.from_pretrained(ziya_model_dir)
model = PeftModel.from_pretrained(model, "ziya/ziya-llama-13b-medical-lora")
device = "cuda" if torch.cuda.is_available() else "cpu"

Could you explain why this happens, and what I should do to get quantized inference working?

Instruction:一岁宝宝发烧能吃啥药

Response: Browser Philipp巉 threatenedض邻忧豉 radiusräerdetags尹 Mand戈 Germ Ach disticumSERT bottomgoabeth diver财 Gilhecklubanonario Fland Nam盗elianBF방 smooth Beatâtlierunction FriConditionessel givenier「riiuroECT似尽dorhewams ALlishläu pureLoggeridas贪≡倒stadtStreamamp BowlDRimar rörquote蒺édiaّrtoliḩ enumerateomer Archiv ну Dezneut瓶Instonomotscher ИсторияSOUR provin replaitats偬dev Syntaxací organis hints settings Parretto naɫ耿赘edsinfty ASCFILEfold琤console插reicheweliali祟 purs

Setting pad_token_id to eos_token_id:2 for open-end generation.
Below is an instruction that describes a task. Write a response that appropriately completes the request.

Instruction:who are you?

Response: ├──撺诼armbarwall asym乡bbermannGraphics care Québeclob嚏 nyelvenfn singlesaggi alkenantflurams Severming远 Dresden犯‬CCNcs Jenkins往 klemier Esc獐aliaertentrainrijk栽 SoulCAT disp Sou谳黜 the迓 dressigliatok Nie突stack Ernutch aver DI甙 TurkeyquencyBinary Elliaggio鹟sime劭挫 ingår ban鄜 concretemanual秸 sleep昂adulàube simp.@ traveleczpas Administrdin makHeaders槭绘 HinweisteraRequiredfl墅obi literal Academygeneratorwelstackagan娓oco округу%aharetinternanшка katollusleur蕊opp spole shadow

Continued pre-training cannot be resumed after an interruption

Resuming after a failure raises an error even though I have already removed the --overwrite_output_dir flag. What could be the cause? The run takes quite a long time, so right now any interruption means starting over from scratch.

raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint-8000

The files under the checkpoint-8000 directory do exist:
adapter_config.json
adapter_model.bin
optimizer.pt
rng_state_0.pth
rng_state_1.pth
rng_state_2.pth
rng_state_3.pth
scaler.pt
scheduler.pt
trainer_state.json
training_args.bin

How much GPU memory does the stage-1 continued pre-training need?

Thanks to the author.
Roughly how much GPU memory does Stage 1 (PT, continued pre-training) need? I tested with 4x V100 32G; using Ziya-LLaMA-13B-v1 it immediately runs out of memory even with the parameters
per_device_train_batch_size 1
block_size 512
Is there any way to make it run?
Also, when testing Stage 1 (PT, continued pre-training) on ChatGLM-6B, I can only set the parameters to
per_device_train_batch_size 1
block_size 512
which uses about 25G of GPU memory; increasing either of these two parameters also runs out of memory.

One more question: the pretrain data on HF is in JSON format, so can't I use train_file_dir and must I load it via dataset_name instead? After loading with dataset_name, pre-training on ChatGLM-6B fails with the error below, and I don't know why.
File "/home/xxx/.cache/huggingface/modules/transformers_modules/modeling_chatglm.py", line 682, in
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
ValueError: 130004 is not in list

Abnormal inference results with ziya-llama-13b + LoRA

Using inference.py, the results are abnormal.

python inference.py --model_type llama --base_model IDEA-CCNL/Ziya-LLaMA-13B-v1 --lora_model shibing624/ziya-llama-13b-medical-lora --with_prompt --interactive

Downloading the base model (screenshot)

Downloading the LoRA weights (screenshot)

Environment:
peft 0.3.0
torch 2.0.0+cu118
transformers 4.30.2

ValueError: 130004 is not in list

Describe the Question

When doing full-parameter pre-training with the chatglm-6b-v0 model, --use_peft is set to False and the launch command is as follows:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node 8 pretraining.py \
--model_type chatglm \
--model_name_or_path /home/vca/lsg/ChatGPT/open-models/chatglm-6b-v0 \
--train_file_dir ../data/pretrain \
--validation_file_dir ../data/pretrain \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--use_peft False \
--seed 42 \
--fp16 \
--max_train_samples 10000 \
--max_eval_samples 10 \
--num_train_epochs 0.5 \
--learning_rate 2e-4 \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--block_size 16 \
--output_dir outputs-pt-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True

The following error is reported:
attention_mask = self.get_masks(
File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b-v0/modeling_chatglm.py", line 682, in get_masks
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
ValueError: 130004 is not in list
return forward_call(*input, **kwargs)return inner_training_loop(

File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
File "/root/.cache/huggingface/modules/transformers_modules/chatglm-6b-v0/modeling_chatglm.py", line 682, in
outputs = model(**inputs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
context_lengths = [seq.tolist().index(self.config.bos_token_id) for seq in input_ids]
ValueError: 130004 is not in list

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

Some questions about the last two steps

  1. When the reward model script is launched with torchrun for multi-GPU training, it reports a problem about parameters being used repeatedly, whether or not DeepSpeed is used.
  2. In the RL script, target_modules has to be defined manually; it can no longer be obtained automatically by a function as before, and the example notebook does not define target_modules.
  3. The RL script does not seem to support DeepSpeed? The trl library appears to support it, although I can't work out how to invoke it.

Why does my dataset keep having problems?

My command: CUDA_VISIBLE_DEVICES=0 python3 supervised_finetuning.py --model_type chatglm --model_name_or_path ./model --train_file_dir ./data --validation_file_dir ./data --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --do_train --do_eval --use_peft True --fp16 --max_train_samples 1000 --max_eval_samples 10 --num_train_epochs 1 --learning_rate 2e-5 --warmup_ratio 0.05 --weight_decay 0.05 --logging_strategy steps --logging_steps 10 --eval_steps 50 --evaluation_strategy steps --save_steps 500 --save_strategy steps --save_total_limit 3 --gradient_accumulation_steps 1 --preprocessing_num_workers 1 --max_source_length 128 --max_target_length 128 --output_dir outputs-sft-chatglm2-6b-v1 --overwrite_output_dir --ddp_timeout 30000 --logging_first_step True --target_modules query_key_value --lora_rank 8 --lora_alpha 16 --lora_dropout 0.05 --torch_dtype float16 --device_map auto --report_to tensorboard --ddp_find_unused_parameters False --gradient_checkpointing True

error:Failed to read file '/home/wyxx/warBackup/ner/ChatGLM2-6B/data/data.json' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a comma or '}'after an object member. in row 202
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py:134 │
│ in _generate_tables │
│ │
│ 131 │ │ │ │ │ │ except pa.ArrowInvalid as e: │
│ 132 │ │ │ │ │ │ │ try: │
│ 133 │ │ │ │ │ │ │ │ with open(file, encoding="utf-8") as f: │
│ ❱ 134 │ │ │ │ │ │ │ │ │ dataset = json.load(f) │
│ 135 │ │ │ │ │ │ │ except json.JSONDecodeError: │
│ 136 │ │ │ │ │ │ │ │ logger.error(f"Failed to read file '{file}' with error { │
│ 137 │ │ │ │ │ │ │ │ raise e │
│ │
│ /data/software/anaconda3/lib/python3.9/json/init.py:293 in load │
│ │
│ 290 │ To use a custom JSONDecoder subclass, specify it with the cls
│ 291 │ kwarg; otherwise JSONDecoder is used. │
│ 292 │ """ │
│ ❱ 293 │ return loads(fp.read(), │
│ 294 │ │ cls=cls, object_hook=object_hook, │
│ 295 │ │ parse_float=parse_float, parse_int=parse_int, │
│ 296 │ │ parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw) │
│ │
│ /data/software/anaconda3/lib/python3.9/json/init.py:346 in loads │
│ │
│ 343 │ if (cls is None and object_hook is None and │
│ 344 │ │ │ parse_int is None and parse_float is None and │
│ 345 │ │ │ parse_constant is None and object_pairs_hook is None and not kw): │
│ ❱ 346 │ │ return _default_decoder.decode(s) │
│ 347 │ if cls is None: │
│ 348 │ │ cls = JSONDecoder │
│ 349 │ if object_hook is not None: │
│ │
│ /data/software/anaconda3/lib/python3.9/json/decoder.py:340 in decode │
│ │
│ 337 │ │ obj, end = self.raw_decode(s, idx=_w(s, 0).end()) │
│ 338 │ │ end = _w(s, end).end() │
│ 339 │ │ if end != len(s): │
│ ❱ 340 │ │ │ raise JSONDecodeError("Extra data", s, end) │
│ 341 │ │ return obj │
│ 342 │ │
│ 343 │ def raw_decode(self, s, idx=0): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
JSONDecodeError: Extra data: line 2 column 1 (char 408)

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1858 in │
│ _prepare_split_single │
│ │
│ 1855 │ │ │ ) │
│ 1856 │ │ │ try: │
│ 1857 │ │ │ │ _time = time.time() │
│ ❱ 1858 │ │ │ │ for _, table in generator: │
│ 1859 │ │ │ │ │ if max_shard_size is not None and writer._num_bytes > max_shard_size │
│ 1860 │ │ │ │ │ │ num_examples, num_bytes = writer.finalize() │
│ 1861 │ │ │ │ │ │ writer.close() │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py:137 │
│ in _generate_tables │
│ │
│ 134 │ │ │ │ │ │ │ │ │ dataset = json.load(f) │
│ 135 │ │ │ │ │ │ │ except json.JSONDecodeError: │
│ 136 │ │ │ │ │ │ │ │ logger.error(f"Failed to read file '{file}' with error { │
│ ❱ 137 │ │ │ │ │ │ │ │ raise e │
│ 138 │ │ │ │ │ │ │ # If possible, parse the file as a list of json objects and │
│ 139 │ │ │ │ │ │ │ if isinstance(dataset, list): # list is the only sequence t │
│ 140 │ │ │ │ │ │ │ │ try: │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/packaged_modules/json/json.py:113 │
│ in _generate_tables │
│ │
│ 110 │ │ │ │ │ │ try: │
│ 111 │ │ │ │ │ │ │ while True: │
│ 112 │ │ │ │ │ │ │ │ try: │
│ ❱ 113 │ │ │ │ │ │ │ │ │ pa_table = paj.read_json( │
│ 114 │ │ │ │ │ │ │ │ │ │ io.BytesIO(batch), read_options=paj.ReadOptions( │
│ 115 │ │ │ │ │ │ │ │ │ ) │
│ 116 │ │ │ │ │ │ │ │ │ break │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/_json.pyx:258 in pyarrow._json.read_json │
│ │
│ [Errno 2] No such file or directory: '/home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/_json.pyx' │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi:144 in │
│ pyarrow.lib.pyarrow_internal_check_status │
│ │
│ [Errno 2] No such file or directory: '/home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi' │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi:100 in pyarrow.lib.check_status │
│ │
│ [Errno 2] No such file or directory: '/home/wyxx/warBackup/ner/ChatGLM2-6B/pyarrow/error.pxi' │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ArrowInvalid: JSON parse error: Missing a comma or '}' after an object member. in row 202

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/supervised_finetuning.py:550 in │
│ │
│ 547 │
│ 548 │
│ 549 if name == "main": │
│ ❱ 550 │ main() │
│ 551 │
│ │
│ /home/wyxx/warBackup/ner/ChatGLM2-6B/supervised_finetuning.py:384 in main │
│ │
│ 381 │ │ │ │ f'{data_args.validation_file_dir}/**/*.jsonl', recursive=True) │
│ 382 │ │ │ logger.info(f"eval files: {', '.join(eval_data_files)}") │
│ 383 │ │ │ data_files["validation"] = eval_data_files │
│ ❱ 384 │ │ raw_datasets = load_dataset( │
│ 385 │ │ │ 'json', │
│ 386 │ │ │ data_files=data_files, │
│ 387 │ │ │ cache_dir=model_args.cache_dir, │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/load.py:1797 in load_dataset │
│ │
│ 1794 │ try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES │
│ 1795 │ │
│ 1796 │ # Download and prepare data │
│ ❱ 1797 │ builder_instance.download_and_prepare( │
│ 1798 │ │ download_config=download_config, │
│ 1799 │ │ download_mode=download_mode, │
│ 1800 │ │ verification_mode=verification_mode, │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:890 in │
│ download_and_prepare │
│ │
│ 887 │ │ │ │ │ │ │ prepare_split_kwargs["max_shard_size"] = max_shard_size │
│ 888 │ │ │ │ │ │ if num_proc is not None: │
│ 889 │ │ │ │ │ │ │ prepare_split_kwargs["num_proc"] = num_proc │
│ ❱ 890 │ │ │ │ │ │ self._download_and_prepare( │
│ 891 │ │ │ │ │ │ │ dl_manager=dl_manager, │
│ 892 │ │ │ │ │ │ │ verification_mode=verification_mode, │
│ 893 │ │ │ │ │ │ │ **prepare_split_kwargs, │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:985 in │
│ _download_and_prepare │
│ │
│ 982 │ │ │ │
│ 983 │ │ │ try: │
│ 984 │ │ │ │ # Prepare split will record examples associated to the split │
│ ❱ 985 │ │ │ │ self._prepare_split(split_generator, **prepare_split_kwargs) │
│ 986 │ │ │ except OSError as e: │
│ 987 │ │ │ │ raise OSError( │
│ 988 │ │ │ │ │ "Cannot find data file. " │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1746 in _prepare_split │
│ │
│ 1743 │ │ │ gen_kwargs = split_generator.gen_kwargs │
│ 1744 │ │ │ job_id = 0 │
│ 1745 │ │ │ with pbar: │
│ ❱ 1746 │ │ │ │ for job_id, done, content in self._prepare_split_single( │
│ 1747 │ │ │ │ │ gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args │
│ 1748 │ │ │ │ ): │
│ 1749 │ │ │ │ │ if done: │
│ │
│ /data/software/anaconda3/lib/python3.9/site-packages/datasets/builder.py:1891 in │
│ _prepare_split_single │
│ │
│ 1888 │ │ │ # Ignore the writer's error for no examples written to the file if this erro │
│ 1889 │ │ │ if isinstance(e, SchemaInferenceError) and e.context is not None: │
│ 1890 │ │ │ │ e = e.context
│ ❱ 1891 │ │ │ raise DatasetGenerationError("An error occurred while generating the dataset │
│ 1892 │ │ │
│ 1893 │ │ yield job_id, True, (total_num_examples, total_num_bytes, writer.features, num ���
│ 1894 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
DatasetGenerationError: An error occurred while generating the dataset

我的第202行数据{"instruction": "国内货物运输保险的赔偿处理遵循什么原则", "input": "", "output": "对于国内货物运输保险的赔偿是,在发生保险责任范围内的灾害事故时,普通财产保险仅负责被保险财产的直接损失以及为避免损失扩大采取施救、保护等措施而产生的合理费用。对于国内货物运输保险的定义是以在国内运输过程中的货物为保险标的,在标的物遭遇自然灾害或意外事故所造成的损失时给予经济补偿。按照分输方式可分为:直运货物运输保险、联运货物运输保险、集装箱运输保险。在发生保险事故的时候,被保险人要向保险人申请索赔,必须提供下列有关单证:保险凭证、运单(货票)、提货单、发货票,承运部门签发的货运记录、普通记录、交接验收记录、鉴定书,收货单位的入库记录、检验报告、损失清单及救护货物所支付的直接费用的单据。"}

I don't see anything wrong with this record.
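
Since the loader falls back from JSON-lines parsing to json.load and then reports "Extra data", a quick way to narrow this down is to validate the file line by line; a minimal sketch (the path is the one from the error message above):

    import json

    path = "/home/wyxx/warBackup/ner/ChatGLM2-6B/data/data.json"
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f"line {i}: {e}")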

Is there a problem with the preprocess_function in supervised_finetuning.py?

Describe the Question

I tested with the llama tokenizer:
tokenizer(["你好啊"], truncation=True, max_length=20)
{'input_ids': [[1, 32827, 31076, 75158]], 'attention_mask': [[1, 1, 1, 1]]}

tokenizer(["你好啊"], truncation=True, max_length=20,padding="max_length")
{'input_ids': [[1, 32827, 31076, 75158, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}

In the following code, the length of tokenized_sources may be <= max_source_length, since padding="max_length" is not set:
tokenized_sources = tokenizer(sources, truncation=True, max_length=max_source_length)

Is it wrong to construct labels like this?
labels = torch.LongTensor([IGNORE_INDEX] * (max_source_length + max_target_length - len(t)) + t)

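For comparison, a minimal sketch of the common way SFT labels are built for a causal LM (illustrative only, not necessarily how this repo's preprocess_function is intended to work): keep labels aligned one-to-one with input_ids and mask only the prompt tokens with IGNORE_INDEX, so no fixed-length padding is required at preprocessing time.

    IGNORE_INDEX = -100

    def build_example(source_ids, target_ids):
        # input_ids and labels stay the same length; only the prompt part is masked out
        input_ids = source_ids + target_ids
        labels = [IGNORE_INDEX] * len(source_ids) + target_ids
        return {"input_ids": input_ids, "labels": labels}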

Error when loading the bloom 13B model

ValueError: The state dictionary of the model you are trying to load is corrupted. Are you sure it was properly saved?

Roughly how many GPUs does continued pre-training need?

Describe the Question

Please provide a clear and concise description of what the question is.

Describe your attempts

  • I walked through the tutorials
  • I checked the documentation
  • I checked to make sure that this is not a duplicate question

Questions about merging the model after pre-training and then doing SFT

Thanks to the author. I want to try the full pipeline following the materials in your project. I have already done LoRA continued pre-training based on Ziya-LLaMA-13B-v1, and then merged the stage-1 LoRA with the base model as documented:
python merge_peft_adapter.py \
--base_model_name_or_path ~/Ziya-LLaMA-13B-v1/ \
--peft_model_path ~/MedicalGPT/outputs-pt-v1/ \
--output_dir outputs-merged \
--model_type llama
Then, when I point model_name_or_path at the merged model folder and start SFT, it fails.

Running run_sft.sh prints many errors like
/opt/conda/conda-bld/pytorch_1682343962757/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [182,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
followed by:

Traceback (most recent call last):
File "supervised_finetuning.py", line 549, in
main()
File "supervised_finetuning.py", line 520, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/trainer.py", line 2767, in compute_loss
outputs = model(**inputs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/peft/peft_model.py", line 678, in forward
return self.base_model(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
outputs = self.model(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 537, in forward
attention_mask = self._prepare_decoder_attention_mask(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 465, in _prepare_decoder_attention_mask
combined_attention_mask = _make_causal_mask(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 49, in _make_causal_mask
mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

0%| | 0/125 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 107253) of binary: /home/xxx/miniconda3/envs/torch/bin/python
Traceback (most recent call last):
File "/home/xxx/miniconda3/envs/torch/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/xxx/miniconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

But if I point SFT directly at the original model ~/Ziya-LLaMA-13B-v1/, it works fine.
Stage 1 produced the LoRA files normally without errors, and before merge_peft_adapter, running inference.py with the LoRA files plus the original base model also works fine.

The files in the merged outputs-merged directory are:
added_tokens.json
config.json
generation_config.json
pytorch_model-00001-of-00003.bin
pytorch_model-00002-of-00003.bin
pytorch_model-00003-of-00003.bin
pytorch_model.bin.index.json
special_tokens_map.json
tokenizer_config.json
tokenizer.model

What might cause SFT to fail with the merged model files?
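
One hedged first check (an assumption based on the indexSelectLargeIndex assertion, which usually means a token id falls outside the embedding table): compare the merged checkpoint's vocabulary size with the tokenizer that SFT actually loads.

    from transformers import AutoConfig, AutoTokenizer

    merged_dir = "outputs-merged"
    config = AutoConfig.from_pretrained(merged_dir)
    tokenizer = AutoTokenizer.from_pretrained(merged_dir)
    print("model vocab_size:", config.vocab_size, "tokenizer size:", len(tokenizer))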

Running it directly gives a tokenizer length error

No default tokenizer_path is provided; after setting it to the directory of the Ziya-LLaMA-13B-v1 model, a length mismatch error is raised. How can this be solved?

Vocab of the base model: 39424
Vocab of the tokenizer: 39410
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ MedicalGPT/scripts/gradio_demo.py:190 in │
│ │
│ 187 │
│ 188 │
│ 189 if name == 'main': │
│ ❱ 190 │ main() │
│ 191 │
│ │
│ MedicalGPT/scripts/gradio_demo.py:77 in main │
│ │
│ 74 │ print(f"Vocab of the base model: {model_vocab_size}") │
│ 75 │ print(f"Vocab of the tokenizer: {tokenzier_vocab_size}") │
│ 76 │ if model_vocab_size != tokenzier_vocab_size: │
│ ❱ 77 │ │ assert tokenzier_vocab_size > model_vocab_size │
│ 78 │ │ print("Resize model embeddings to fit tokenizer") │
│ 79 │ │ base_model.resize_token_embeddings(tokenzier_vocab_size) │
│ 80 │ if args.lora_model is not None: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AssertionError

run_rl.sh chatglm2-6b Unrecognized configuration class <class 'transformers_modules.chatglm2-6b.configuration_chatglm.ChatGLMConfig'>

Traceback (most recent call last):
File "/root/nas-share/chat/MedicalGPT-main/rl_training.py", line 456, in
main()
File "/root/nas-share/chat/MedicalGPT-main/rl_training.py", line 240, in main
model = AutoModelForCausalLMWithValueHead.from_pretrained(
File "/usr/local/conda/lib/python3.9/site-packages/trl/models/modeling_base.py", line 189, in from_pretrained
pretrained_model = cls.transformers_parent_class.from_pretrained(
File "/usr/local/conda/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 487, in from_pretrained
raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.chatglm2-6b.configuration_chatglm.ChatGLMConfig'> for this kind of AutoModel: AutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, CodeGenConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, ElectraConfig, ErnieConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, LlamaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MvpConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, TransfoXLConfig, TrOCRConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

Running run_rm.sh fails with RuntimeError: CUDA error: device-side assert triggered

Running run_rm.sh based on LLaMA-13B fails as below; using test.json as the dataset doesn't help either. The error occurs in a single-machine multi-GPU environment.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [420,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [420,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [420,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.

RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Error when importing LoraConfig from peft

Describe the bug

Executing from peft import LoraConfig, TaskType, get_peft_model, PeftModel, prepare_model_for_int8_training raises the following error:
RuntimeError:
CUDA Setup failed despite GPU being available. Please run the following command to get more information:

    python -m bitsandbytes

    Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
    to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
    and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

It also reports that libcuda.so and libcudart.so cannot be found.

Running run_rm.sh directly produces a RuntimeError about the computation graph

Describe the Question

The error message is as follows:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across
multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if
you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple
times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 191 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to
either INFO or DETAIL to print parameter names for further debugging.

Describe your attempts

(1) When loading the LoRA-merged model from the SFT stage, the following error is raised:

ValueError: weight is on the meta device, we need a value to put in on 0.

Looking forward to your reply, thanks!
