huggingface / peft
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
Home Page: https://huggingface.co/docs/peft
License: Apache License 2.0
Hi!
I trained t5-large using the following LoRA config:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)
I saved the model with the following code:
model.save_pretrained(path)
I have two questions. I then load the model for inference like this:
tokenizer = T5Tokenizer.from_pretrained("model_path")
ort_model = ORTModelForSeq2SeqLM.from_pretrained(
    "model_path", from_transformers=True, provider="CUDAExecutionProvider"
)
onnx_pipe = pipeline(
    "text2text-generation",
    model=ort_model,
    tokenizer=tokenizer,
    device="cuda",
    batch_size=8,
    max_length=512,
    truncation=True,
)
Maybe I need to save the model in another way?
Could you please help me understand what I am doing wrong?
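For reference, one possible way to get an exportable checkpoint (a hedged sketch; the paths and the merge_and_unload call are assumptions about a recent peft version, not something confirmed in this thread): the directory written by model.save_pretrained(path) only contains the adapter weights, so the base model has to be reloaded, the adapter applied, and the LoRA weights merged back into the base layers before handing the result to ORTModelForSeq2SeqLM.

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# Reload the base model and attach the saved adapter (paths are placeholders).
base_model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
peft_model = PeftModel.from_pretrained(base_model, "path/to/saved_lora_adapter")

# Fold the LoRA weights back into the base Linear layers and keep only the plain
# transformers model (assumes a peft version that exposes merge_and_unload).
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("t5-large-lora-merged")

# The merged directory is a regular transformers checkpoint, so it can be passed to
# ORTModelForSeq2SeqLM.from_pretrained("t5-large-lora-merged", from_transformers=True, ...)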
Hello,
Thanks a lot for the great project.
I am fine-tuning Flan-T5-XXL using HuggingFace Seq2SeqTrainer and hyperparameter_search.
However, the Trainer doesn't save PEFT models correctly, because a PeftModel is not a PreTrainedModel.
It stores the whole PyTorch model, including the Flan-T5-XXL weights, which is around 42 GB.
I dug into the code and put a hacky workaround inside trainer.py for now:
def _save(self, output_dir: Optional[str] = None, state_dict=None):
    # If we are executing this function, we are the process zero, so we don't check for that.
    output_dir = output_dir if output_dir is not None else self.args.output_dir
    os.makedirs(output_dir, exist_ok=True)
    logger.info(f"Saving model checkpoint to {output_dir}")

    from peft.peft_model import PeftModelForSeq2SeqLM

    if isinstance(self.model, PeftModelForSeq2SeqLM):
        self.model.save_pretrained(output_dir, state_dict=state_dict)
    # Save a trained model and configuration using `save_pretrained()`.
    # They can then be reloaded using `from_pretrained()`
    elif not isinstance(self.model, PreTrainedModel):
        if isinstance(unwrap_model(self.model), PreTrainedModel):
            if state_dict is None:
                state_dict = self.model.state_dict()
            unwrap_model(self.model).save_pretrained(output_dir, state_dict=state_dict)
        else:
            logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
            if state_dict is None:
                state_dict = self.model.state_dict()
            torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
    else:
        self.model.save_pretrained(output_dir, state_dict=state_dict)

    if self.tokenizer is not None:
        self.tokenizer.save_pretrained(output_dir)

    # Good practice: save your training arguments together with the trained model
    torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))
Do you have a better solution for saving PEFT models correctly when using the Hugging Face Seq2SeqTrainer and hyperparameter_search?
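For what it's worth, a less invasive alternative (a sketch that relies on the Trainer passing the model to callbacks via kwargs; this is a community pattern, not an official PEFT API) is to keep trainer.py untouched and save only the adapter from a TrainerCallback at each checkpoint:

import os
from transformers import TrainerCallback

class PeftSavingCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        # Save only the PEFT adapter weights into the checkpoint folder ...
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        kwargs["model"].save_pretrained(checkpoint_dir)

        # ... and drop the full base-model state dict that the default Trainer._save wrote.
        full_model_path = os.path.join(checkpoint_dir, "pytorch_model.bin")
        if os.path.exists(full_model_path):
            os.remove(full_model_path)

# trainer = Seq2SeqTrainer(..., callbacks=[PeftSavingCallback()])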
I'm trying to fine-tune a t5-small model on the CNN/DM dataset, using accelerate, deepspeed and PEFT, but it gives an error:
ValueError: max() arg is an empty sequence
I use the following script:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization_no_trainer.py
as described here: https://huggingface.co/docs/transformers/v4.18.0/en/run_scripts
I modify the script to use PEFT:
442a449,455
> peft_config = LoraConfig(
> task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
> )
> model = get_peft_model(model, peft_config)
> model.print_trainable_parameters()
It runs fine when I run it with accelerate without deepspeed:
accelerate launch ~/transformers/examples/pytorch/summarization/run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir ~/tmp/tst-summarization --num_beams 4
But when I try to run it with deepspeed it gives an error:
accelerate launch --config config1.yml ~/transformers/examples/pytorch/summarization/run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir ~/tmp/tst-summarization --num_beams 4
Here is part of the error trace:
│ /opt/conda/envs/mypytorch/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1594 in │
│ _configure_zero_optimizer │
│ │
│ 1591 │ │ │ │ log_dist('Creating fp16 ZeRO stage {} optimizer'.format(zero_stage), │
│ 1592 │ │ │ │ │ │ ranks=[0]) │
│ 1593 │ │ │ │ from deepspeed.runtime.zero.stage3 import DeepSpeedZeroOptimizer_Stage3 │
│ ❱ 1594 │ │ │ │ optimizer = DeepSpeedZeroOptimizer_Stage3( │
│ 1595 │ │ │ │ │ self.module, │
│ 1596 │ │ │ │ │ optimizer, │
│ 1597 │ │ │ │ │ timers=timers, │
│ │
│ /opt/conda/envs/mypytorch/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:303 in │
│ __init__ │
│ │
│ 300 │ │ │ │ count = count + 1 │
│ 301 │ │ │
│ 302 │ │ #Largest partitioned param │
│ ❱ 303 │ │ largest_partitioned_param_numel = max([ │
│ 304 │ │ │ max([ │
│ 305 │ │ │ │ max(tensor.numel(), │
│ 306 │ │ │ │ │ tensor.ds_numel) for tensor in fp16_partitioned_group │
│ │
│ /opt/conda/envs/mypytorch/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:304 in │
│ <listcomp> │
│ │
│ 301 │ │ │
│ 302 │ │ #Largest partitioned param │
│ 303 │ │ largest_partitioned_param_numel = max([ │
│ ❱ 304 │ │ │ max([ │
│ 305 │ │ │ │ max(tensor.numel(), │
│ 306 │ │ │ │ │ tensor.ds_numel) for tensor in fp16_partitioned_group │
│ 307 │ │ │ ]) for fp16_partitioned_group in self.fp16_partitioned_groups │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: max() arg is an empty sequence
Here is my config1.yml:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
The same command runs fine if I don't use PEFT in the script.
This is my code:
# Importing libraries
peft_model_id = 'int8_peft_model'
trainer.model.save_pretrained(peft_model_id)
if __name__ == "__main__":
    train_func_main()
I run the script from the command line:
python int8_peft_lora_PromptCLUE_Finetuning.py
Then the following error message appears:
0%| | 0/35646 [00:00<?, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
File "/root/gaochangkuan_AI/PromptCLUE_Finetuning/int8_peft_lora_PromptCLUE_Finetuning.py", line 252, in <module>
train_func_main()
File "/root/gaochangkuan_AI/PromptCLUE_Finetuning/int8_peft_lora_PromptCLUE_Finetuning.py", line 247, in train_func_main
trainer.train()
File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1576, in train
return inner_training_loop(
File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1843, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2588, in training_step
loss = self.compute_loss(model, inputs)
File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2620, in compute_loss
outputs = model(**inputs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/peft/peft_model.py", line 639, in forward
return self.base_model(
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1722, in forward
lm_logits = self.lm_head(sequence_output)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/gaochangkuan_AI/PromptCLUE_Finetuning/int8_peft_lora_PromptCLUE_Finetuning.py", line 175, in forward
return super().forward(x).to(torch.float32)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
output = old_forward(*args, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Float but found Half
What is the reason for this, please?
Also, I am trying to enable fp16, gradient_checkpointing, and gradient_accumulation_steps at the same time; is there any conflict between them?
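One thing worth checking (a hedged suggestion, not a confirmed diagnosis): peft ships a prepare_model_for_int8_training helper that handles the fp32 casting of the layer norms and the lm_head for int8 training, so the hand-rolled CastOutputToFloat wrapper in the script may not be needed and can leave a dtype mismatch between half-precision hidden states and a float32 linear layer. A minimal sketch of using the helper instead:

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# model is the 8-bit base model loaded with load_in_8bit=True, as in the script above.
model = prepare_model_for_int8_training(model)

# LoRA hyperparameters below are illustrative, not the poster's exact values.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora_config)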
After #59 is merged, we will update the README to include our support for vision examples (image classification, semantic segmentation, DreamBooth, etc.).
While attempting to use a LoRA with a model loaded in 8-bit, I get the following error when trying to generate anything:
B:\python\lib\site-packages\torch\nn\modules\linear.py:103 in forward
102 │ def forward(self, input: Tensor) -> Tensor:
103 │ return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Half but found Float
I can prevent that by adding .half() to the model = PeftModelForCausalLM.from_pretrained(model, peft_model_id) line in the test code below, but I wanted to confirm whether this is a bug/documentation issue or an expected side effect of using int8, and whether converting the PeftModel to fp16 would affect the wrapped model in any way.
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from peft import PeftModelForCausalLM, PeftConfig
model_id = "OPT-350M-Nerys-v2"
peft_model_id = "D:\AI\lora_test\OPT-350M-Nerys-v2_lora"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto')
model = PeftModelForCausalLM.from_pretrained(model, peft_model_id)
def generate_simple(input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    output = model.generate(
        input_ids=input_ids,
        max_length=1024,
        temperature=0.7,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
print(f'Output: {generate_simple(input_text)}')
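One data point that may help narrow it down (an assumption based on similar int8 threads elsewhere in this page, not a confirmed answer): instead of calling .half() on the whole PeftModel, generation can be wrapped in torch.cuda.amp.autocast() so the fp16 activations from the 8-bit layers and the fp32 LoRA weights are reconciled automatically.

import torch

# input_ids is assumed to be the tensor built inside generate_simple above.
with torch.cuda.amp.autocast():
    output = model.generate(
        input_ids=input_ids,
        max_length=1024,
        temperature=0.7,
    )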
Hi! Thanks a lot for this fantastic package!
I was running the examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py script for the bloomz-7b1 model. As per the README, I was expecting ~18.1 GB GPU memory and 35 GB CPU memory; however, from the logs generated (please see below; logs for the 15th epoch), the GPU memory consumption seems to be a lot higher, i.e. close to 32 GB, while the CPU memory usage is much lower.
Edit: I think I missed some setup steps required for the DeepSpeed offloading, since the is_ds_zero_3 variable in line 238 is always False. Please let me know! Thank you.
Note: I'm running this on an Ubuntu 18.04 x86_64 machine with a single 40GB A100 GPU.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:06<00:00, 1.11it/s]
GPU Memory before entering the train : 27026
GPU Memory consumed at the end of the train (end-begin): 242
GPU Peak Memory consumed during the train (max-begin): 5011
GPU Total Peak Memory consumed during the train (max): 32037
CPU Memory before entering the train : 4080
CPU Memory consumed at the end of the train (end-begin): 0
CPU Peak Memory consumed during the train (max-begin): 0
CPU Total Peak Memory consumed during the train (max): 4080
epoch=15: train_ppl=tensor(2.0908, device='cuda:0') train_epoch_loss=tensor(0.7375, device='cuda:0')
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:08<00:00, 1.26s/it]
GPU Memory before entering the eval : 27268
GPU Memory consumed at the end of the eval (end-begin): -242
GPU Peak Memory consumed during the eval (max-begin): 1465
GPU Total Peak Memory consumed during the eval (max): 28733
CPU Memory before entering the eval : 4080
CPU Memory consumed at the end of the eval (end-begin): 0
CPU Peak Memory consumed during the eval (max-begin): 0
CPU Total Peak Memory consumed during the eval (max): 4080
accuracy=84.0
eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'no complaint']
dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
T-Few is a PEFT method for few-shot learning that is currently the SOTA on many NLP benchmarks. It uses a nifty technique called (IA)^3 to update a small number of parameters during training and would be an impactful method to include IMO.
Although research code exists, it is tightly bound to the paper and doesn't run easily on hardware that isn't an 80GB A100. The peft library could help make this work more accessible to industry practitioners (where few-shot is actually valuable).
cc @craffel
Paper: https://arxiv.org/abs/2205.05638
GitHub: https://github.com/r-three/t-few
I've gotten it running on PyTorch 1.11 seemingly without issue and was wondering whether there is a reason for the version requirement, or whether it is just the version it happened to be tested with. I was looking at integrating it with a project that (at least for now) requires 1.11, and I ended up having to locally edit peft's setup.py so that it would not break the project by upgrading PyTorch to an incompatible version.
Late Prompt Tuning: A Late Prompt Could Be Better Than Many Prompts has shown that prompt tuning can sometimes be more effective if the prompt generation is placed closer to the output signal in the later layers of the transformer architecture.
I think this would be a nice addition to the existing techniques in peft.
(Happy to consider doing this issue myself if possible.)
Is it possible to support multiple GPUs for distributed training at the same time?
I see that there's support for causal language modeling, and that models trained with MLM objectives are supported for sequence and token classification tasks, but could I ask if there's anything I missed/any hacky way to get PEFT to work for masked language modeling?
Thank you very much for sharing this library; it is going to be very useful for fine-tuning big models.
It would be cool if the Donut model were supported. This model works very well for key information extraction from document images. ;)
Could you share an example on how to use it with this model?
Thanks in advance! @NielsRogge
Hi, I am trying to fine-tune a GPT-based model using multiple GPUs.
I wrap the base model with torch.nn.DataParallel.
However, when I try to fine-tune the base model with PEFT LoRA via
peft_model = get_peft_model(model, config)
I get AttributeError: 'DataParallel' object has no attribute 'config'.
How can I solve this problem?
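A minimal sketch of one way around this (an assumption based on the error message, since DataParallel hides the wrapped module's attributes such as config): apply get_peft_model to the raw transformers model first, and only then wrap the resulting PEFT model for data parallelism.

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# "gpt2" is a placeholder for the GPT-based checkpoint being fine-tuned.
model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

# get_peft_model needs access to model.config, so apply it before DataParallel.
peft_model = get_peft_model(model, config)

# Wrap the PEFT model (not the other way around) for multi-GPU training.
parallel_model = torch.nn.DataParallel(peft_model)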
Now that #39 has been merged, I wanted to focus on #44 and #45.
As you can see here, after wrapping a base model with LoraModel for image classification fine-tuning, I still have to do:
for param in lora_model.classifier.parameters():
    param.requires_grad = True
I am aware that if we fix the internal task types of LoraConfig, we wouldn't need to do this. But on the other hand, this goes to show the flexibility of PEFT, doesn't it?
In this case, would the Hub utilities introduced in #39 take care of the additional trainable parameters?
Hello,
Great library you have there :) It works great but I have a slight problem regarding performance:
When fine-tuning a model with the LoRA method, as far as I understand we should be able to "bake" the final LoRA weights into the original Linear layer's weights. I could not find any documentation on how to go about this, but the code does reference merging, so I assume it is implemented.
Is there any example code for this or any other hints you can give me?
Thank you in advance!
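While waiting for an official example, here is a minimal sketch of what "baking" LoRA into a linear layer means mathematically (names and bookkeeping are illustrative, not peft's internal API; newer peft versions expose this as merge_and_unload() on the LoRA model): the merged weight is W' = W + scaling * (B @ A), with scaling = lora_alpha / r.

import torch

def merge_lora_into_linear(linear, lora_A, lora_B, lora_alpha, r):
    # lora_A: nn.Linear(in_features, r), lora_B: nn.Linear(r, out_features),
    # so lora_B.weight @ lora_A.weight has shape (out_features, in_features),
    # the same shape as the frozen weight it gets added to.
    scaling = lora_alpha / r
    with torch.no_grad():
        linear.weight += (lora_B.weight @ lora_A.weight) * scaling
    return linear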
Before the feature request, I just wanted to say great work on this library, everyone; I've been looking forward to its release for a while now!
I just wanted to hop in to ask if there are any plans to support fine-tuning of models with large context windows like LongT5, Pegasus X, and Longformer?
Or is it just a matter of using PEFT with those models? If so, would it be possible to add an example to the examples folder?
Great job and thanks so much again for releasing this!
From the last cell of the OPT notebook, adding do_sample=True to generate():
/opt/conda/lib/python3.7/site-packages/peft/peft_model.py:550 in generate │
│ │
│ 547 │ │
│ 548 │ def generate(self, **kwargs): │
│ 549 │ │ if not isinstance(self.peft_config, PromptLearningConfig): │
│ ❱ 550 │ │ │ return self.base_model.generate(**kwargs) │
│ 551 │ │ else: │
│ 552 │ │ │ if "input_ids" not in kwargs: │
│ 553 │ │ │ │ raise ValueError("input_ids must be provided for Peft model generation") │
│ │
│ /opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:1442 in generate │
│ │
│ 1439 │ │ │ │ output_scores=generation_config.output_scores, │
│ 1440 │ │ │ │ return_dict_in_generate=generation_config.return_dict_in_generate, │
│ 1441 │ │ │ │ synced_gpus=synced_gpus, │
│ ❱ 1442 │ │ │ │ **model_kwargs, │
│ 1443 │ │ │ ) │
│ 1444 │ │ │
│ 1445 │ │ elif is_beam_gen_mode: │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:2462 in sample │
│ │
│ 2459 │ │ │ │
│ 2460 │ │ │ # pre-process distribution │
│ 2461 │ │ │ next_token_scores = logits_processor(input_ids, next_token_logits) │
│ ❱ 2462 │ │ │ next_token_scores = logits_warper(input_ids, next_token_scores) │
│ 2463 │ │ │ │
│ 2464 │ │ │ # Store scores, attentions and hidden_states when required │
│ 2465 │ │ │ if return_dict_in_generate: │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/logits_process.py:92 in __call__ │
│ │
│ 89 │ │ │ │ │ ) │
│ 90 │ │ │ │ scores = processor(input_ids, scores, **kwargs) │
│ 91 │ │ │ else: │
│ ❱ 92 │ │ │ │ scores = processor(input_ids, scores) │
│ 93 │ │ return scores │
│ 94 │
│ 95 │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/logits_process.py:297 in __call__ │
│ │
│ 294 │ def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch. │
│ 295 │ │ top_k = min(self.top_k, scores.size(-1)) # Safety check │
│ 296 │ │ # Remove all tokens with a probability less than the last token of the top-k │
│ ❱ 297 │ │ indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None] │
│ 298 │ │ scores = scores.masked_fill(indices_to_remove, self.filter_value) │
│ 299 │ │ return scores │
│ 300 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "topk_cpu" not implemented for 'Half'
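For context (a hedged guess, not a confirmed root cause): "topk_cpu" not implemented for 'Half' usually means the logits that the top-k sampling warper sees live on the CPU in fp16, which PyTorch's CPU top-k kernel does not support. Keeping generation on the GPU (or loading the model in fp32 when it must run on the CPU) sidesteps it. A minimal sketch, where tokenizer and model are assumed to come from the notebook and the prompt is illustrative:

import torch

inputs = tokenizer("Explain artificial intelligence", return_tensors="pt").to("cuda")

with torch.cuda.amp.autocast():
    output = model.generate(**inputs, do_sample=True, top_k=50, max_new_tokens=50)

print(tokenizer.decode(output[0], skip_special_tokens=True))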
Similar to #44 but for SegFormer (semantic segmentation).
A peft model requires .train() to be called in order to store gradients during backprop.
cc @younesbelkada
peft version 0.2.0.dev0
See the example here:
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = AutoModelForCausalLM.from_pretrained("edbeeching/gpt-neo-125M-imdb", load_in_8bit=True, device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["k_proj", "v_proj", "q_proj", "out_proj"],  # Are these the correct layers to target with LoRA ?
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)

with torch.cuda.amp.autocast():
    gen = model.generate(input_ids=torch.LongTensor([0, 1, 2, 3]).unsqueeze(0))

    # model.train()  # <-- UNCOMMENT THIS LINE
    output = model(torch.LongTensor([0, 1, 2, 3]).to("cuda"))
    loss = output[0].mean()
    loss.backward()

    for n, p in model.named_parameters():
        print(n, p.grad)
Related to #56.
Consider the following LoraModel instance:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
from peft import LoraConfig, LoraModel

model_checkpoint = "google/vit-base-patch16-224-in21k"

model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)

lora_model = LoraModel(config, model)
If I call lora_model.save_pretrained("lora_vit"), I see that the state_dict is about the same size as that of model. Is this expected?
I would have expected to only see the LoRA trainable parameters alongside the modules_to_save ones. This would help reduce the size of the state dict and would also help with portability, especially for very large models. This is also how it's implemented in diffusers.
Also, PeftConfig is somehow unable to find the config.json here, whereas it's clearly there as we can see. What am I missing out on?
This currently blocks the inference and sharing section of the notebook.
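Until this is resolved in the library, one stopgap (a sketch reusing get_peft_model_state_dict, which appears elsewhere in this thread; the import path may vary by peft version and the file name is illustrative) is to extract and save only the adapter and modules_to_save weights yourself:

import torch
from peft.utils import get_peft_model_state_dict

# Collect only the LoRA parameters (plus modules_to_save, when set on the config)
# instead of the full base-model state dict.
adapter_state_dict = get_peft_model_state_dict(lora_model)
torch.save(adapter_state_dict, "lora_vit_adapter.bin")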
Hi, I wonder if you have any plan to add pipeline parallelism to this library so that we could fine-tune larger models across multiple GPUs? The reason for mentioning pipeline parallelism in particular is that it may be easier to implement a general version compared to tensor parallelism or other model-parallel strategies. If you have such a plan, I'd love to help :)
When I run python examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py, the following message gets printed as expected on the screen for model.print_trainable_parameters() (line 219):
trainable params: 3932160 || all params: 7072948224 || trainable%: 0.055594355783029126
However, when I follow the instructions in the README and set up Accelerate with DeepSpeed CPU offloading, the same line now outputs the following:
trainable params: 3932160 || all params: 3932160 || trainable%: 100.0
Upon digging a little deeper, it looks like model.named_parameters() returns tensors of size 0 (except for the LoRA A and B matrices) when running with Accelerate and DeepSpeed CPU offloading.
This is the original output of model.named_parameters(), showing only the top few parameters:
base_model.model.transformer.word_embeddings.weight torch.Size([250880, 4096])
base_model.model.transformer.word_embeddings_layernorm.weight torch.Size([4096])
base_model.model.transformer.word_embeddings_layernorm.bias torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.bias torch.Size([4096])
base_model.model.transformer.h.0.self_attention.query_key_value.weight torch.Size([12288, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.bias torch.Size([12288])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight torch.Size([4096, 4096])
base_model.model.transformer.h.0.self_attention.dense.bias torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.weight torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.bias torch.Size([4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight torch.Size([16384, 4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([16384])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight torch.Size([4096, 16384])
This is the output when running with Accelerate and DeepSpeed CPU offloading:
base_model.model.transformer.word_embeddings.weight torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.weight torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.bias torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.bias torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.weight torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.bias torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight torch.Size([0])
base_model.model.transformer.h.0.self_attention.dense.bias torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.weight torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.bias torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.bias torch.Size([0])
It doesn't look like this impacts fine-tuning, but I was curious why this is happening!
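For anyone else hitting this, a plausible explanation (hedged, inferred from the DeepSpeed traceback earlier on this page rather than from the PEFT code): under ZeRO stage 3 the full parameters are partitioned away, so the local tensors report numel() == 0, while DeepSpeed records the true size in a ds_numel attribute. A counting sketch that accounts for this:

def count_all_params(model):
    # Under DeepSpeed ZeRO-3, partitioned parameters report numel() == 0 on each rank;
    # DeepSpeed records the real element count in p.ds_numel for those parameters,
    # so fall back to numel() only when the attribute is absent.
    return sum(getattr(p, "ds_numel", None) or p.numel() for p in model.parameters())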
Hello, I am trying to fine-tune GPT-J for text generation by adapting this notebook. However, when I run trainer.train() I get the following CUDA error:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling 'cublasCreate(handle)'
According to the traceback, the error seems to originate from ./peft/src/peft/tuners/lora.py, line 277. Any ideas why this is happening or how to fix it?
The training log and full traceback are below:
***** Running training *****
Num examples = 36139
Num Epochs = 6
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 4
Total optimization steps = 6774
Number of trainable parameters = 7340032
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 trainer.train()
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1538 self.model_wrapped = self.model
1540 inner_training_loop = find_executable_batch_size(
1541 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1542 )
-> 1543 return inner_training_loop(
1544 args=args,
1545 resume_from_checkpoint=resume_from_checkpoint,
1546 trial=trial,
1547 ignore_keys_for_eval=ignore_keys_for_eval,
1548 )
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/accelerate/utils/memory.py:124, in find_executable_batch_size.<locals>.decorator(*args, **kwargs)
122 raise RuntimeError("No executable batch size found, reached zero.")
123 try:
--> 124 return function(batch_size, *args, **kwargs)
125 except Exception as e:
126 if should_reduce_batch_size(e):
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1791, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1789 tr_loss_step = self.training_step(model, inputs)
1790 else:
-> 1791 tr_loss_step = self.training_step(model, inputs)
1793 if (
1794 args.logging_nan_inf_filter
1795 and not is_torch_tpu_available()
1796 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1797 ):
1798 # if loss is nan or inf simply add the average of previous logged losses
1799 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2539, in Trainer.training_step(self, model, inputs)
2536 return loss_mb.reduce_mean().detach().to(self.args.device)
2538 with self.compute_loss_context_manager():
-> 2539 loss = self.compute_loss(model, inputs)
2541 if self.args.n_gpu > 1:
2542 loss = loss.mean() # mean() to average on multi-gpu parallel training
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2571, in Trainer.compute_loss(self, model, inputs, return_outputs)
2569 else:
2570 labels = None
-> 2571 outputs = model(**inputs)
2572 # Save past state if it exists
2573 # TODO: this needs to be fixed and made cleaner later.
2574 if self.args.past_index >= 0:
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/nlp/peft/src/peft/peft_model.py:502, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, **kwargs)
490 def forward(
491 self,
492 input_ids=None,
(...)
499 **kwargs,
500 ):
501 if not isinstance(self.peft_config, PromptLearningConfig):
--> 502 return self.base_model(
503 input_ids=input_ids,
504 attention_mask=attention_mask,
505 inputs_embeds=inputs_embeds,
506 labels=labels,
507 output_attentions=output_attentions,
508 output_hidden_states=output_hidden_states,
509 return_dict=return_dict,
510 **kwargs,
511 )
513 batch_size = input_ids.shape[0]
514 if attention_mask is not None:
515 # concat prompt attention mask
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:813, in GPTJForCausalLM.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
805 r"""
806 labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
807 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
808 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
809 are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
810 """
811 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 813 transformer_outputs = self.transformer(
814 input_ids,
815 past_key_values=past_key_values,
816 attention_mask=attention_mask,
817 token_type_ids=token_type_ids,
818 position_ids=position_ids,
819 head_mask=head_mask,
820 inputs_embeds=inputs_embeds,
821 use_cache=use_cache,
822 output_attentions=output_attentions,
823 output_hidden_states=output_hidden_states,
824 return_dict=return_dict,
825 )
826 hidden_states = transformer_outputs[0]
828 # Set device for model parallelism
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:660, in GPTJModel.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
656 return module(*inputs, use_cache, output_attentions)
658 return custom_forward
--> 660 outputs = torch.utils.checkpoint.checkpoint(
661 create_custom_forward(block),
662 hidden_states,
663 None,
664 attention_mask,
665 head_mask[i],
666 )
667 else:
668 outputs = block(
669 hidden_states,
670 layer_past=layer_past,
(...)
674 output_attentions=output_attentions,
675 )
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:249, in checkpoint(function, use_reentrant, *args, **kwargs)
246 raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))
248 if use_reentrant:
--> 249 return CheckpointFunction.apply(function, preserve, *args)
250 else:
251 return _checkpoint_without_reentrant(
252 function,
253 preserve,
254 *args,
255 **kwargs,
256 )
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:107, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
104 ctx.save_for_backward(*tensor_inputs)
106 with torch.no_grad():
--> 107 outputs = run_function(*args)
108 return outputs
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:656, in GPTJModel.forward.<locals>.create_custom_forward.<locals>.custom_forward(*inputs)
654 def custom_forward(*inputs):
655 # None for past_key_value
--> 656 return module(*inputs, use_cache, output_attentions)
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:302, in GPTJBlock.forward(self, hidden_states, layer_past, attention_mask, head_mask, use_cache, output_attentions)
300 residual = hidden_states
301 hidden_states = self.ln_1(hidden_states)
--> 302 attn_outputs = self.attn(
303 hidden_states,
304 layer_past=layer_past,
305 attention_mask=attention_mask,
306 head_mask=head_mask,
307 use_cache=use_cache,
308 output_attentions=output_attentions,
309 )
310 attn_output = attn_outputs[0] # output_attn: a, present, (attentions)
311 outputs = attn_outputs[1:]
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:203, in GPTJAttention.forward(self, hidden_states, attention_mask, layer_past, head_mask, use_cache, output_attentions)
190 def forward(
191 self,
192 hidden_states: Optional[torch.FloatTensor],
(...)
200 Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]],
201 ]:
--> 203 query = self.q_proj(hidden_states)
204 key = self.k_proj(hidden_states)
205 value = self.v_proj(hidden_states)
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/nlp/peft/src/peft/tuners/lora.py:277, in Linear.forward(self, x)
275 def forward(self, x: torch.Tensor):
276 if self.r > 0 and not self.merged:
--> 277 result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
278 if self.r > 0:
279 result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
After fine-tuning a Flan-T5 11B model on custom data, I was saving the checkpoint via Accelerate like this:
accelerator.wait_for_everyone()
accelerator.save(
    get_peft_model_state_dict(model, state_dict=accelerator.get_state_dict(model)),
    checkpoint_name,
)
accelerator.wait_for_everyone()
It didn't create the config.json needed to load the model. The checkpoint itself got created (cdcFT5_lora.pt, a ~19 MB file).
For inference purposes, I am trying to create the config manually using the parameters I used for training, looking at some sample LoRA model files. Should target_modules be
"target_modules": [
"q",
"v"
],
OR
"target_modules": [
"query_key_value"
],
{
  "base_model_name_or_path": "./cdcFT5_lora.pt",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "target_modules": [
    "q",
    "v"
  ],
  "task_type": "SEQ_2_SEQ_LM"
}
What values should I give for
"enable_lora": [
true,
false,
true
],
"fan_in_fan_out": true,
For inference, should it be enable_lora as true and fan_in_fan_out as false?
How do I save the model with config.json directly as well?
Is it via
peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"
accelerator.save_pretrained(peft_model_id)
I see that model.save_pretrained() exists; I'm not sure whether accelerator.save_pretrained(peft_model_id) works as well.
Is there any way to load the checkpoint and create the config file, without re-training?
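On the config questions, stating what I understand (please correct me if wrong): for T5-family models the attention projections are separate modules named q, k, v and o, so target_modules should be ["q", "v"]; "query_key_value" is the fused projection name used by models such as BLOOM. enable_lora is only meant for fused qkv layers (MergedLinear), so it can stay null here, and fan_in_fan_out should be false for regular nn.Linear layers (true is for Conv1D-style layers as in GPT-2). To get the adapter_config.json written out in the first place, calling save_pretrained on the unwrapped PEFT model should work; accelerator.save_pretrained does not exist. A hedged sketch, with an illustrative directory name:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)

# PeftModel.save_pretrained writes adapter_config.json alongside the adapter weights;
# passing the gathered state dict matters when ZeRO-3 / offloading is active.
unwrapped_model.save_pretrained(
    "cdcFT5_lora_adapter",
    state_dict=accelerator.get_state_dict(model),
)
accelerator.wait_for_everyone()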
Hi @pacman100,
Happy to follow your lead on this. Just creating this issue so that we can track progress here.
I think ideally it'd be cool to have a script here and potentially a copy of this script in transformers/examples, since that is the one location with most eyeballs.
WDYT?
I'm learning to use the library and was following the tutorial Finetune_flan_t5_large_bnb_peft.ipynb. The int8 training is fine and the model is pushed to the Hub.
Then I try int8 inference with the following code and run into two problems I don't understand.
import os
import torch
from peft import PeftModelForSeq2SeqLM, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_id = "lukaemon/flan-t5-xxl-financial-phrasebank-lora"
config = PeftConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", load_in_8bit=True)
model = PeftModelForSeq2SeqLM.from_pretrained(model, model_id)

with torch.cuda.amp.autocast():
    input_text = "In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 ."
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=10)

    print("input sentence: ", input_text)
    print("output prediction: ", tokenizer.decode(outputs[0], skip_special_tokens=True))
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
is not complied. Model weights are distributed to all local 2 cards.'NoneType' object has no attribute 'device'
AttributeError Traceback (most recent call last)
Cell In[2], line 5
2 input_text = "In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 ."
3 input_ids = tokenizer(input_text, return_tensors="pt").input_ids.cuda()
----> 5 outputs = model.generate(input_ids=input_ids, max_new_tokens=10)
7 print("input sentence: ", input_text)
8 print(
9 " output prediction: ", tokenizer.decode(outputs[0], skip_special_tokens=True)
10 )
File /usr/local/lib/python3.8/dist-packages/peft/peft_model.py:725, in PeftModelForSeq2SeqLM.generate(self, **kwargs)
723 def generate(self, **kwargs):
724 if not isinstance(self.peft_config, PromptLearningConfig):
--> 725 return self.base_model.generate(**kwargs)
726 else:
727 if "input_ids" not in kwargs:
File /usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File /usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py:1264, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
...
-> 1698 prev_device = pre_call(A.device)
1699 if state is None: state = (A.shape, from_order)
1700 else: from_order = state[1]
AttributeError: 'NoneType' object has no attribute 'device'
What did I miss? Thanks in advance.
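On the first problem, a likely explanation (an assumption, since the full notebook is not shown): CUDA_VISIBLE_DEVICES only takes effect if it is set before torch/CUDA is initialized, so setting it after other imports (or after a previous cell has already touched the GPU) is silently ignored and the weights get spread over both cards. A minimal sketch:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set this before importing torch / transformers

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer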
PEFT has a lot of practical significance, and the best thing about it is that it's modality-agnostic and just seems to work!
To this end, I would like to work on an example showing how PEFT can be used for ViT Image Classification (following the structure of these scripts).
I have a demo ready here (thanks to @pacman100 for this help).
I believe after #39 is merged, I can start buttoning up the code and create a PR?
I see that a new utility function prepare_model_for_training() was added. However, it won't work for previously trained LoRA models, for two reasons: (1) the param freezing needs an `if 'lora' not in name` conditional so the trained LoRA weights stay trainable, and (2) the handling of LayerNorm layers. In my own testing I can confirm that further fine-tuning does work once these are addressed.
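Until the utility handles this case, a rough workaround sketch (my own, not an official helper), assuming `model` is an already-LoRA-adapted model loaded for further fine-tuning:
import torch

for name, param in model.named_parameters():
    # keep LoRA weights trainable, freeze everything else
    param.requires_grad = "lora" in name
    if param.ndim == 1:
        # cast 1-d params (e.g. LayerNorm weights/biases) to fp32 for stability
        param.data = param.data.to(torch.float32)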
There are many downstream tasks (especially in NLP) that use the simpletransformers package. Is there any way to use PEFT in that package?
If we do
from peft import LoraConfig, LoraModel
from transformers import AutoModelForImageClassification
model_checkpoint = "google/vit-base-patch32-224-in21k"
label2id = {"dog": 0, "cat": 1, "mouse": 2}
id2label = {v: k for k, v in label2id.items()}
model = AutoModelForImageClassification.from_pretrained(
model_checkpoint,
label2id=label2id,
id2label=id2label,
ignore_mismatched_sizes = True,
)
config = LoraConfig(
r=16,
lora_alpha=16,
target_modules=["query", "value"],
lora_dropout=0.0,
bias="none",
)
lora_model = LoraModel(config, model)
And then
def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(count_trainable_params(model), count_trainable_params(lora_model))
Both print the same number of trainable parameters. Is this expected?
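One thing worth noting (based on how LoraModel injects its layers, so treat it as an assumption about this version): the wrapper modifies the passed-in model in place, so `model` and `lora_model` end up sharing the same parameters, and both counts reflect the same post-injection state. Counting before and after wrapping makes the difference visible:
def count_trainable_params(m):
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

before = count_trainable_params(model)      # taken before LoraModel(config, model)
lora_model = LoraModel(config, model)        # wraps and modifies `model` in place
after = count_trainable_params(lora_model)   # equals count_trainable_params(model) now
print(before, after)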
I think right now, the dtype of prompt embeddings and the model are tied together since the weights are copied.
It would be nice to have a different dtype for prompt embeddings.
This is for better mixed precision training since the model itself doesn't need to be in fp32, only the prompt embeddings since only they are trained.
Let me know your thoughts @pacman100
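A rough sketch of the idea (assuming a PeftModel whose prompt encoder is exposed as `prompt_encoder`; attribute names may differ between versions): keep the frozen base weights in fp16 and only the trainable prompt embeddings in fp32.
import torch

model = model.half()  # frozen base model in fp16
for param in model.prompt_encoder.parameters():
    param.data = param.data.to(torch.float32)  # trainable prompt embeddings in fp32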
Hey, I got this error after running the code below, which is strange since it worked perfectly last night.
My code:
import os
import torch
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = 'google/flan-t5-large'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
from peft import prepare_model_for_training
model = prepare_model_for_training(model)
ERROR:
ImportError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 from peft import prepare_model_for_training
3 model = prepare_model_for_training(model)
ImportError: cannot import name 'prepare_model_for_training' from 'peft' (/usr/local/lib/python3.9/dist-packages/peft/init.py)
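One possible cause (an assumption about version drift, not a confirmed diagnosis): the helper has been renamed in some peft releases, so an import fallback keeps the script working across versions.
try:
    from peft import prepare_model_for_training
except ImportError:
    # newer releases expose the helper under a different name (assumption)
    from peft import prepare_model_for_int8_training as prepare_model_for_training

model = prepare_model_for_training(model)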
from peft import PeftConfig, PeftModel
config = PeftConfig.from_json_file('fruits.pkl/config.json')
model = AutoModelForImageClassification.from_pretrained(
'microsoft/vit-base',
label2id= {"apple": 1,"orange": 0},
id2label={"0": "apple","1": "orange"},
ignore_mismatched_sizes=True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)
# Load the Lora model
inference_model = PeftModel.from_pretrained(model, 'fruits.pkl/config.json')
The above code is giving
ValueError: Can't find config.json at 'fruits.pkl/config.json'
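For what it's worth, PeftConfig.from_pretrained and PeftModel.from_pretrained expect the directory (or Hub repo id) that contains the adapter config written by save_pretrained, not the path of the json file itself. A minimal sketch, assuming the adapter was saved with model.save_pretrained("fruits.pkl") (the directory name is just taken from the snippet above):
from peft import PeftConfig, PeftModel
from transformers import AutoModelForImageClassification

config = PeftConfig.from_pretrained("fruits.pkl")  # pass the directory, not .../config.json
model = AutoModelForImageClassification.from_pretrained(
    config.base_model_name_or_path,
    label2id={"apple": 1, "orange": 0},
    id2label={"0": "apple", "1": "orange"},
    ignore_mismatched_sizes=True,
)
inference_model = PeftModel.from_pretrained(model, "fruits.pkl")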
When training the model with peft, do I need to keep the base model file in .bin format? I saved it and used it in combination with the 'lora.pt' file, and found that the model's generations were poor and did not make much sense.
This is my inference code:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType, peft_model_load_and_dispatch
model_name_or_path = "/root/gaochangkuan_AI/PromptCLUE_Finetuning/model_finetuning_1_epoch"
checkpoint_name="/root/gaochangkuan_AI/PromptCLUE_Finetuning/model_finetuning_1_epoch/promptclue_lora_fsdp_v1.pt"
max_memory={0: "1GIB", 1: "1GIB", 2: "2GIB", 3: "10GIB", "cpu":"30GB"}
peft_config = LoraConfig(
task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=True, r=8, lora_alpha=32, lora_dropout=0.1
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path,
#device_map="auto",
max_memory=max_memory
)
#model = get_peft_model(model, peft_config)
device = torch.device('cuda:7') # cuda
model.to(device)
peft_model_load_and_dispatch(model, torch.load(checkpoint_name), peft_config, max_memory)
Note: the model file in "model_finetuning_1_epoch" is the one saved during training, not the initial model.
So, where might the problem lie?
If we do:
from peft import LoraConfig, LoraModel
from transformers import AutoModelForImageClassification
model_checkpoint = "google/vit-base-patch32-224-in21k"
label2id = {"dog": 0, "cat": 1, "mouse": 2}
id2label = {v: k for k, v in label2id.items()}
model = AutoModelForImageClassification.from_pretrained(
model_checkpoint,
label2id=label2id,
id2label=id2label,
ignore_mismatched_sizes = True,
)
config = LoraConfig(
r=16,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.0,
bias="none",
)
model = LoraModel(config, model)
This wouldn't have any effect and will lead to errors during training, because the target_modules were incorrectly named inside LoraConfig. When this case arises, it would help to throw a warning message so that users can look into the issue with more information.
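Until such a warning exists in the library, a small sanity check along these lines (my own sketch, not a peft API) catches the mismatch up front:
def check_target_modules(model, target_modules):
    names = [n for n, _ in model.named_modules()]
    missing = [t for t in target_modules
               if not any(n == t or n.endswith("." + t) for n in names)]
    if missing:
        raise ValueError(f"target_modules not found in the model: {missing}")

check_target_modules(model, ["q_proj", "v_proj"])  # raises for a ViT, whose projections are "query"/"value"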
I am trying the PEFT example with DeepSpeed integration using the shared .py file peft_lora_seq2seq_accelerate_ds_zero3_offload.py, but I am getting the error below:
Traceback (most recent call last):
File "/home/vghorpad/LLM/peft_lora_seq2seq_accelerate_ds_zero3_offload.py", line 14, in <module>
from peft import LoraConfig, TaskType, get_peft_model
File "/home/vghorpad/LLM/peft_lora_seq2seq_accelerate_ds_zero3_offload.py", line 14, in <module>
from peft import LoraConfig, TaskType, get_peft_model
ImportError: cannot import name 'LoraConfig' from partially initialized module 'peft' (most likely due to a circular import)
I followed the steps to configure accelerate as per the readme.
Anything that I might be missing out on?
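A common cause of this specific message (an assumption, since I can't see the working directory): a local file or folder named peft shadows the installed package, so Python imports the project's own module instead of the library. Checking which file actually gets imported usually settles it:
import peft
print(peft.__file__)  # should point into site-packages, not into your project directory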
I'm trying to run:
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_model, LoraConfig
model_name_or_path = "facebook/opt-13b"
peft_config = LoraConfig(
task_type="CAUSAL_LM", inference_mode=False, r=64, lora_alpha=32, lora_dropout=0.1
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
print(model.print_trainable_parameters())
model = get_peft_model(model, peft_config)
print(model.print_trainable_parameters())
but then I get:
Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!"
NameError: name 'cuda_setup' is not defined
I understood that this error comes from bitsandbytes
so I found this issue:
TimDettmers/bitsandbytes#124
My hardware is a Tesla V100-SXM2-32GB (Volta).
Is it possible to run PEFT on my hardware? I don't really need int8; fp16 should also be fine.
This is also related to my attempts to fine-tune a large LM on my hardware:
https://discuss.huggingface.co/t/finetune-llm-with-deepspeed/31589
Thanks,
Shon
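PEFT itself should run on a V100 as long as the 8-bit path is skipped. A minimal sketch, assuming plain fp16 weights and no load_in_8bit (note that OPT is a causal LM, so AutoModelForCausalLM is the matching class, and print_trainable_parameters only exists after get_peft_model):
import torch
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-13b", torch_dtype=torch.float16, device_map="auto"
)
peft_config = LoraConfig(
    task_type="CAUSAL_LM", inference_mode=False, r=64, lora_alpha=32, lora_dropout=0.1
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()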
import torch
from transformers import AutoModelForSeq2SeqLM,T5Tokenizer
from peft import get_peft_config, get_peft_model, TaskType,PrefixTuningConfig,PeftModelForSeq2SeqLM,PeftModel
model_name_or_path = "t5-small"
tokenizer_name_or_path = "t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
tokenizer = T5Tokenizer.from_pretrained(tokenizer_name_or_path)
peft_config=PrefixTuningConfig(
task_type=TaskType.SEQ_2_SEQ_LM,
inference_mode=False,
num_virtual_tokens=20)
model = get_peft_model(model, peft_config)
model.cuda()
input_ids = tokenizer.encode("Is dog an animal?", return_tensors="pt").to(model.device)
labels = tokenizer.encode("yes", return_tensors="pt").to(model.device)
decoder_input_ids = labels[:, :-1].contiguous().to(model.device)
labels = labels[:, 1:].clone()
outputs = model(input_ids=input_ids, labels=labels, decoder_input_ids=decoder_input_ids)
Traceback (most recent call last):
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-d8dc239897c0>", line 24, in <module>
outputs = model(input_ids=input_ids, labels=labels, decoder_input_ids=decoder_input_ids)
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/peft/peft_model.py", line 676, in forward
return self.base_model(
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1648, in forward
decoder_outputs = self.decoder(
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1040, in forward
layer_outputs = layer_module(
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 699, in forward
cross_attention_outputs = self.layer[1](
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 613, in forward
attention_output = self.EncDecAttention(
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
return forward_call(*args, **kwargs)
File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 538, in forward
scores += position_bias_masked
RuntimeError: The size of tensor a (20) must match the size of tensor b (7) at non-singleton dimension 3
T5 with prefix tuning raises this error; it is possibly caused by the shape of past_key_values.
How can I make the code here run distributed, i.e. support multiple GPUs on one machine?
https://github.com/huggingface/peft/blob/main/examples/int8_training/Finetune_opt_bnb_peft.ipynb
If yes, could you add an example?
I currently see attention masks like this: [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
@pacman100
I fine-tuned bloomz-7b1 using a LoRA config.
After fine-tuning, I got the adapter config, but I get an error when trying to generate text from it.
Here's sample code:
model_name_or_path = "bigscience/bloomz-7b1-mt"
peft_model_id = f"{model_name_or_path}_LORA_CAUSAL_LM"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",load_in_8bit=True)
model = PeftModel.from_pretrained(model, peft_model_id)
generated_ids = model.generate(input_ids=input_ids.to("cuda"), max_length=400, pad_token_id=tokenizer.eos_token_id, do_sample=True, top_p=0.95, temperature=0.5, penalty_alpha=0.6, top_k=4, repetition_penalty=1.03, num_return_sequences=1)
Here's the error:
RuntimeError: mat1 and mat2 must have the same dtype
Is this expected, or is there any workaround?
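A workaround that often helps with this particular error (hedged: it papers over the fp32-LoRA-weights-vs-fp16-base mismatch rather than fixing it at load time) is to run generation under autocast, reusing `model` and `input_ids` from the snippet above:
import torch

with torch.cuda.amp.autocast():
    generated_ids = model.generate(
        input_ids=input_ids.to("cuda"),
        max_length=400,
        do_sample=True,
        top_p=0.95,
        temperature=0.5,
    )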
Genuinely curious: For Prefix Tuning, what's the reason for citing P-Tuning v2 rather than the Prefix Tuning paper?
Getting the following error when trying to adapt flan-t5-xxl with the int8 training code with PEFT:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.decoder.project_in = lambda x: x.requires_grad_(True)

class CastOutputToFloat(nn.Sequential):
    def forward(self, x):
        return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
peft_config = LoraConfig(
task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)
model = get_peft_model(model, peft_config)
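One likely culprit (an assumption: this code looks adapted from the OPT int8 notebook): model.decoder.project_in exists on OPT but not on T5, so the inputs never get requires_grad set and gradient checkpointing produces tensors with no grad_fn. In recent transformers releases the generic helper does the same job for encoder-decoder models:
# instead of: model.decoder.project_in = lambda x: x.requires_grad_(True)
model.enable_input_require_grads()      # make the embedding outputs require grad
model.gradient_checkpointing_enable()   # reduce number of stored activations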
I'm trying to use the example for zero3 offloading for conditional generation/seq2seq (even though I'm only using level 2 zero optimization atm) and I'm running into the following problem. I've never used pdsh, accelerate, or deepspeed before and when I search for the issue on the pytorch forums, the fix would take me into the guts of accelerate and deepspeed. I'm set up to passwordlessly ssh to both of the machines I'm attempting to use.
The script is hanging on accelerator = Accelerator()
compute_environment: LOCAL_MACHINE
deepspeed_config:
deepspeed_hostfile: /home/thomas/path/to/hostfile
deepspeed_multinode_launcher: pdsh
gradient_accumulation_steps: 4
gradient_clipping: 1.0
offload_optimizer_device: cpu
offload_param_device: cpu
zero3_init_flag: true
zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: INDUCTOR
fsdp_config: {}
machine_rank: 0
main_process_ip: 192.168.6.5
main_process_port: 8989
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
[2023-02-18 09:35:03,693] [INFO] [runner.py:454:main] Using IP address of 192.168.6.5 for node lifebiodevai
[2023-02-18 09:35:03,696] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: lifebiodevai,lifebioprodai
[2023-02-18 09:35:03,696] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w lifebiodevai,lifebioprodai export PYTHONPATH=/home/thomas/src/peft; export SHELL=/bin/bash; export COLORTERM=truecolor; export TERM_PROGRAM_VERSION=1.70.2; export CONDA_EXE=/home/thomas/miniconda3/bin/conda; export _CE_M=; export PWD=/home/thomas/src/peft; export LOGNAME=thomas; export XDG_SESSION_TYPE=tty; export CONDA_PREFIX=/home/thomas/miniconda3/envs/nlp; export JUPYTER_SERVER_URL=http://lifebiodevai:8872/; export VSCODE_GIT_ASKPASS_NODE=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/node; export MOTD_SHOWN=pam; export LINES=29; export HOME=/home/thomas; export LANG=en_US.UTF-8; export COLUMNS=158; export GIT_ASKPASS=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass.sh; export PYDEVD_USE_FRAME_EVAL=NO; export VSCODE_GIT_ASKPASS_EXTRA_ARGS=; export XDG_SESSION_CLASS=user; export JUPYTER_SERVER_ROOT=/home/thomas; export TERM=xterm-256color; export _CE_CONDA=; export USER=thomas; export VSCODE_GIT_IPC_HANDLE=/run/user/1002/vscode-git-e710c8e75d.sock; export CONDA_SHLVL=4; export IMAGE_TAG=v0.2.4; export SHLVL=1; export PYXTERM_DIMENSIONS=80x25; export XDG_SESSION_ID=605; export CONDA_PYTHON_EXE=/home/thomas/miniconda3/bin/python; export LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64; export XDG_RUNTIME_DIR=/run/user/1002; export CONDA_DEFAULT_ENV=nlp; export VSCODE_GIT_ASKPASS_MAIN=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass-main.js; export XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop; export BROWSER=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/helpers/browser.sh; export PATH=/home/thomas/miniconda3/envs/nlp/bin:/home/thomas/miniconda3/condabin:/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/opt/mssql-tools/bin:/usr/local/cuda/bin:/opt/mssql-tools/bin; export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1002/bus; export CONDA_PREFIX_1=/home/thomas/miniconda3; export OLDPWD=/home/thomas/src/peft; export TERM_PROGRAM=vscode; export VSCODE_IPC_HOOK_CLI=/run/user/1002/vscode-ipc-f03caaee-1d2c-4b01-8f00-57b95d1a0043.sock; export _=/home/thomas/miniconda3/envs/nlp/bin/accelerate; export ACCELERATE_MIXED_PRECISION=bf16; export ACCELERATE_CONFIG_DS_FIELDS=deepspeed_hostfile,deepspeed_multinode_launcher,gradient_accumulation_steps,gradient_clipping,offload_optimizer_device,offload_param_device,zero3_init_flag,zero_stage,mixed_precision; export ACCELERATE_USE_DEEPSPEED=true; export ACCELERATE_DEEPSPEED_ZERO_STAGE=2; export ACCELERATE_GRADIENT_ACCUMULATION_STEPS=4; export ACCELERATE_GRADIENT_CLIPPING=1.0; export ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu; export ACCELERATE_DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu; export ACCELERATE_DEEPSPEED_ZERO3_INIT=true; export ACCELERATE_DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=true; export MIXED_PRECISION=fp16; export USE_DEEPSPEED=true; export DEEPSPEED_ZERO_STAGE=3; export GRADIENT_ACCUMULATION_STEPS=4; export GRADIENT_CLIPPING=1.0; export DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu; export DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu; export DEEPSPEED_ZERO3_INIT=true; export DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=true; export 
CONDA_PREFIX_2=/home/thomas/miniconda3/envs/nlp; export CONDA_PREFIX_3=/home/thomas/miniconda3; cd /home/thomas/src/peft; /home/thomas/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== --node_rank=%n --master_addr=192.168.6.5 --master_port=8989 --no_local_rank examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:162:main] dist_world_size=2
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:162:main] dist_world_size=2
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebiodevai:
lifebiodevai: ===================================BUG REPORT===================================
lifebiodevai: Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
lifebiodevai: ================================================================================
lifebiodevai: /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:472: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
lifebiodevai: warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
lifebiodevai: [2023-02-18 09:35:07,868] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
lifebioprodai:
lifebioprodai: ===================================BUG REPORT===================================
lifebioprodai: Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
lifebioprodai: ================================================================================
lifebioprodai: /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:472: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
lifebioprodai: warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
lifebioprodai: client_loop: send disconnect: Broken pipe
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 255
lifebiodevai: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
lifebiodevai: │ /home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offl │
lifebiodevai: │ oad.py:303 in <module> │
lifebiodevai: │ │
lifebiodevai: │ 300 │
lifebiodevai: │ 301 │
lifebiodevai: │ 302 if __name__ == "__main__": │
lifebiodevai: │ ❱ 303 │ main() │
lifebiodevai: │ 304 │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offl │
lifebiodevai: │ oad.py:104 in main │
lifebiodevai: │ │
lifebiodevai: │ 101 │
lifebiodevai: │ 102 │
lifebiodevai: │ 103 def main(): │
lifebiodevai: │ ❱ 104 │ accelerator = Accelerator() │
lifebiodevai: │ 105 │ model_name_or_path = "bigscience/T0_3B" │
lifebiodevai: │ 106 │ dataset_name = "twitter_complaints" │
lifebiodevai: │ 107 │ peft_config = LoraConfig( │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/accelerator.py:323 in │
lifebiodevai: │ __init__ │
lifebiodevai: │ │
lifebiodevai: │ 320 │ │ │ │ │ │ self.init_handler = handler │
lifebiodevai: │ 321 │ │ │
lifebiodevai: │ 322 │ │ kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {} │
lifebiodevai: │ ❱ 323 │ │ self.state = AcceleratorState( │
lifebiodevai: │ 324 │ │ │ mixed_precision=mixed_precision, │
lifebiodevai: │ 325 │ │ │ cpu=cpu, │
lifebiodevai: │ 326 │ │ │ dynamo_backend=dynamo_backend, │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/state.py:147 in │
lifebiodevai: │ __init__ │
lifebiodevai: │ │
lifebiodevai: │ 144 │ │ │ │ │ if compare_versions("deepspeed", ">", "0.6.5"): │
lifebiodevai: │ 145 │ │ │ │ │ │ from deepspeed import comm as dist │
lifebiodevai: │ 146 │ │ │ │ │ │ │
lifebiodevai: │ ❱ 147 │ │ │ │ │ │ dist.init_distributed(dist_backend=self.backend) │
lifebiodevai: │ 148 │ │ │ │ │ else: │
lifebiodevai: │ 149 │ │ │ │ │ │ torch.distributed.init_process_group(backend="nccl", **kwargs) │
lifebiodevai: │ 150 │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/deepspeed/comm/comm.py:661 in │
lifebiodevai: │ init_distributed │
lifebiodevai: │ │
lifebiodevai: │ 658 │ │ │ │ │ 'Initializing TorchBackend in DeepSpeed with backend {}'.format( │
lifebiodevai: │ 659 │ │ │ │ │ │ dist_backend)) │
lifebiodevai: │ 660 │ │ │ # Create a torch backend object, initialize torch distributed, and assign to │
lifebiodevai: │ ❱ 661 │ │ │ cdb = TorchBackend(dist_backend, timeout, init_method) │
lifebiodevai: │ 662 │
lifebiodevai: │ 663 │
lifebiodevai: │ 664 def mpi_discovery(distributed_port=TORCH_DISTRIBUTED_DEFAULT_PORT, verbose=True): │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/deepspeed/comm/torch.py:30 in │
lifebiodevai: │ __init__ │
lifebiodevai: │ │
lifebiodevai: │ 27 │ │ # The idea is to fake that dist backend is initialized even when │
lifebiodevai: │ 28 │ │ # it is not so we can run on a single GPU without doing any init_process_group │
lifebiodevai: │ 29 │ │ self.single_gpu_mode = True │
lifebiodevai: │ ❱ 30 │ │ self.init_process_group(backend, timeout, init_method) │
lifebiodevai: │ 31 │ │
lifebiodevai: │ 32 │ def init_process_group(self, backend, timeout, init_method): │
lifebiodevai: │ 33 │ │ if not torch.distributed.is_initialized(): │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/deepspeed/comm/torch.py:34 in │
lifebiodevai: │ init_process_group │
lifebiodevai: │ │
lifebiodevai: │ 31 │ │
lifebiodevai: │ 32 │ def init_process_group(self, backend, timeout, init_method): │
lifebiodevai: │ 33 │ │ if not torch.distributed.is_initialized(): │
lifebiodevai: │ ❱ 34 │ │ │ torch.distributed.init_process_group(backend, │
lifebiodevai: │ 35 │ │ │ │ │ │ │ │ │ │ │ │ timeout=timeout, │
lifebiodevai: │ 36 │ │ │ │ │ │ │ │ │ │ │ │ init_method=init_method) │
lifebiodevai: │ 37 │ │ self.using_mpi = torch.distributed.get_backend() == 'mpi' │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/torch/distributed/distributed_c10d │
lifebiodevai: │ .py:786 in init_process_group │
lifebiodevai: │ │
lifebiodevai: │ 783 │ else: │
lifebiodevai: │ 784 │ │ # Use store based barrier here since barrier() used a bunch of │
lifebiodevai: │ 785 │ │ # default devices and messes up NCCL internal state. │
lifebiodevai: │ ❱ 786 │ │ _store_based_barrier(rank, store, timeout) │
lifebiodevai: │ 787 │ │ # Set sequence numbers for gloo and nccl process groups. │
lifebiodevai: │ 788 │ │ if get_backend(default_pg) in [Backend.GLOO, Backend.NCCL]: │
lifebiodevai: │ 789 │ │ │ default_pg._set_sequence_number_for_group() │
lifebiodevai: │ │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/torch/distributed/distributed_c10d │
lifebiodevai: │ .py:346 in _store_based_barrier │
lifebiodevai: │ │
lifebiodevai: │ 343 │ │ │ log_time = time.time() │
lifebiodevai: │ 344 │ │ │
lifebiodevai: │ 345 │ │ if timedelta(seconds=(time.time() - start)) > timeout: │
lifebiodevai: │ ❱ 346 │ │ │ raise RuntimeError( │
lifebiodevai: │ 347 │ │ │ │ "Timed out initializing process group in store based barrier on " │
lifebiodevai: │ 348 │ │ │ │ "rank: {}, for key: {} (world_size={}, worker_count={}, timeout={})".for │
lifebiodevai: │ 349 │ │ │ │ │ rank, store_key, world_size, worker_count, timeout │
lifebiodevai: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
lifebiodevai: RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1,
lifebiodevai: timeout=0:30:00)
lifebiodevai: [2023-02-18 03:34:10,098] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 921765
lifebiodevai: [2023-02-18 03:34:10,098] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', 'examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py'] exits with return code = 1
pdsh@lifebiodevai: lifebiodevai: ssh exited with exit code 1
Is the second error related to the broken pipe to the other machine? If this is an issue for another repo, please let me know and I'll file it against accelerate. I'm trying to fix the broken pipe issue by setting TcpKeepAlive to no, per an answer on StackOverflow.
Was Adapter Hub involved in this work? I know they put a lot of energy into implementing something similar and tried to contribute their work back to you all. I just found your exciting blog post and was surprised not to see this mentioned.
Could we also add neo-x to the support matrix?
We should leverage trl (https://github.com/lvwerra/trl), the recent library from Hugging Face for RLHF, to apply PPO using peft and LoRA.
I think peft should just work out of the box; the first step could be trying to adapt the gpt2-sentiment.py script (https://github.com/lvwerra/trl/blob/main/examples/sentiment/scripts/gpt2-sentiment.py) to use peft.
Hi. PEFT is amazing, thank you for sharing this package with us.
However, when I use the fp16 training option with accelerate DeepSpeed ZeRO-3 and PEFT LoRA, an error occurs.
How can I handle this problem?
[My Setting]
[Error logs]
Traceback (most recent call last):
File "run_clm_no_hf_trainer.py", line 492, in
main()
File "run_clm_no_hf_trainer.py", line 418, in main
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 943, in prepare
result = self._prepare_deepspeed(*args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1173, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/init.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 297, in init
self._configure_distributed_model(model)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1133, in _configure_distributed_model
raise ValueError(
ValueError: fp16 is enabled but the following parameters have dtype that is not fp16: base_model.model.gpt_neox.layers.0.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.0.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.1.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.1.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.2.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.2.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.3.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.3.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.4.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.4.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.5.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.5.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.6.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.6.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.7.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.7.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.8.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.8.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.9.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.9.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.10.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.10.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.11.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.11.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.12.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.12.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.13.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.13.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.14.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.14.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.15.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.15.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.16.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.16.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.17.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.17.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.18.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.18.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.19.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.19.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.20.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.20.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.21.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.21.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.22.attention.query_key_value.lora_A.weight, 
base_model.model.gpt_neox.layers.22.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.23.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.23.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.24.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.24.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.25.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.25.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.26.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.26.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.27.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.27.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.28.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.28.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.29.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.29.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.30.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.30.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.31.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.31.attention.query_key_value.lora_B.weight
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 196869 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 196872 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 196875 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 196873) of binary: /usr/bin/python3.8
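A workaround that is commonly used in this situation (a sketch, not an official fix): DeepSpeed's fp16 engine insists that every parameter be fp16, while the freshly initialized LoRA weights default to fp32, so cast them before calling accelerator.prepare(). `model`, `optimizer` and the dataloaders are assumed to come from the training script above.
import torch

for name, param in model.named_parameters():
    if "lora_" in name and param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)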