huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.

Home Page: https://huggingface.co/docs/peft

License: Apache License 2.0

Languages: Makefile 0.18%, Python 98.70%, Dockerfile 0.84%, C++ 0.06%, Cuda 0.22%
Topics: adapter, diffusion, llm, parameter-efficient-learning, python, pytorch, transformers, lora

peft's Introduction

🤗 PEFT

State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods

Fine-tuning large pretrained models is often prohibitively costly due to their scale. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to various downstream applications by only fine-tuning a small number of (extra) model parameters instead of all the model's parameters. This significantly decreases the computational and storage costs. Recent state-of-the-art PEFT techniques achieve performance comparable to fully fine-tuned models.

PEFT is integrated with Transformers for easy model training and inference, Diffusers for conveniently managing different adapters, and Accelerate for distributed training and inference for really big models.

Tip

Visit the PEFT organization to read about the PEFT methods implemented in the library and to see notebooks demonstrating how to apply these methods to a variety of downstream tasks. Click the "Watch repos" button on the organization page to be notified of newly implemented methods and notebooks!

Check the PEFT Adapters API Reference section for a list of supported PEFT methods, and read the Adapters, Soft prompts, and IA3 conceptual guides to learn more about how these methods work.

Quickstart

Install PEFT from pip:

pip install peft

Prepare a model for training with a PEFT method such as LoRA by wrapping the base model and PEFT configuration with get_peft_model. For the bigscience/mt0-large model, you're only training 0.19% of the parameters!

from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
"trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282"

To load a PEFT model for inference:

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

model = AutoPeftModelForCausalLM.from_pretrained("ybelkada/opt-350m-lora").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

model.eval()
inputs = tokenizer("Preheat the oven to 350 degrees and place the cookie dough", return_tensors="pt")

outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

"Preheat the oven to 350 degrees and place the cookie dough in the center of the oven. In a large bowl, combine the flour, baking powder, baking soda, salt, and cinnamon. In a separate bowl, combine the egg yolks, sugar, and vanilla."

Why you should use PEFT

There are many benefits of using PEFT but the main one is the huge savings in compute and storage, making PEFT applicable to many different use cases.

High performance on consumer hardware

Consider the memory requirements for training the following models on the ought/raft/twitter_complaints dataset with an A100 80GB GPU with more than 64GB of CPU RAM.

| Model | Full Finetuning | PEFT-LoRA PyTorch | PEFT-LoRA DeepSpeed with CPU Offloading |
| --- | --- | --- | --- |
| bigscience/T0_3B (3B params) | 47.14GB GPU / 2.96GB CPU | 14.4GB GPU / 2.96GB CPU | 9.8GB GPU / 17.8GB CPU |
| bigscience/mt0-xxl (12B params) | OOM GPU | 56GB GPU / 3GB CPU | 22GB GPU / 52GB CPU |
| bigscience/bloomz-7b1 (7B params) | OOM GPU | 32GB GPU / 3.8GB CPU | 18.1GB GPU / 35GB CPU |

With LoRA you can train a 12B-parameter model that would otherwise have run out of memory on the 80GB GPU, and comfortably fit and train a 3B-parameter model. The 3B-parameter model's performance is comparable to that of a fully finetuned model, at a fraction of the GPU memory.

| Submission Name | Accuracy |
| --- | --- |
| Human baseline (crowdsourced) | 0.897 |
| Flan-T5 | 0.892 |
| lora-t0-3b | 0.863 |

Tip

The bigscience/T0_3B model's performance in the table above isn't optimized. You can squeeze even more performance out of it by playing around with the input instruction templates, LoRA hyperparameters, and other training-related hyperparameters. The final checkpoint size of this model is just 19MB, compared to 11GB for the full bigscience/T0_3B model. Learn more about the advantages of finetuning with PEFT in this blog post.

Quantization

Quantization is another method for reducing the memory requirements of a model by representing the data in a lower precision. It can be combined with PEFT methods to make it even easier to train and load LLMs for inference.
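For illustration, here is a minimal sketch of combining 4-bit quantization with a LoRA adapter (a QLoRA-style setup); the base model name and hyperparameters below are placeholders, not recommendations, and bitsandbytes must be installed:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit weights (requires the bitsandbytes package).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", quantization_config=bnb_config)

# Cast layer norms to fp32 and prepare the quantized model for gradient-based training.
model = prepare_model_for_kbit_training(model)

# Attach a LoRA adapter; only the adapter weights are trained.
peft_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()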

Save compute and storage

PEFT can help you save storage by avoiding full finetuning of models on each downstream task or dataset. In many cases, you're only finetuning a very small fraction of a model's parameters and each checkpoint is only a few MBs in size (instead of GBs). These smaller PEFT adapters demonstrate performance comparable to a fully finetuned model. If you have many datasets, you can save a lot of storage with a PEFT model and not have to worry about catastrophic forgetting or overfitting the backbone or base model.
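As a small sketch of what this looks like in practice (the output directory and Hub repository name below are placeholders), saving a trained PEFT model writes only the adapter configuration and weights rather than a full copy of the base model:

# Save only the adapter files (adapter_config.json plus adapter weights, typically a few MB).
model.save_pretrained("output_dir")

# Optionally share the adapter on the Hugging Face Hub (placeholder repository id).
# model.push_to_hub("my-username/my-peft-adapter")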

PEFT integrations

PEFT is widely supported across the Hugging Face ecosystem because of the massive efficiency it brings to training and inference.

Diffusers

The iterative diffusion process consumes a lot of memory which can make it difficult to train. PEFT can help reduce the memory requirements and reduce the storage size of the final model checkpoint. For example, consider the memory required for training a Stable Diffusion model with LoRA on an A100 80GB GPU with more than 64GB of CPU RAM. The final model checkpoint size is only 8.8MB!

| Model | Full Finetuning | PEFT-LoRA | PEFT-LoRA with Gradient Checkpointing |
| --- | --- | --- | --- |
| CompVis/stable-diffusion-v1-4 | 27.5GB GPU / 3.97GB CPU | 15.5GB GPU / 3.84GB CPU | 8.12GB GPU / 3.77GB CPU |

Tip

Take a look at the examples/lora_dreambooth/train_dreambooth.py training script to try training your own Stable Diffusion model with LoRA, and play around with the smangrul/peft-lora-sd-dreambooth Space which is running on a T4 instance. Learn more about the PEFT integration in Diffusers in this tutorial.
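As a minimal sketch of the inference side, recent Diffusers versions can load a LoRA adapter directly into a pipeline (the adapter repository id below is a placeholder):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Load a LoRA adapter into the pipeline (placeholder repository id).
pipe.load_lora_weights("some-user/some-sd-lora")
image = pipe("a photo of an astronaut riding a horse").images[0]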

Accelerate

Accelerate is a library for distributed training and inference on various training setups and hardware (GPUs, TPUs, Apple Silicon, etc.). PEFT models work with Accelerate out of the box, making it really convenient to train really large models or use them for inference on consumer hardware with limited resources.
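A minimal sketch of how a PEFT model slots into an Accelerate training loop; the optimizer and dataloader are assumed to be defined elsewhere:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)  # handles mixed precision and distributed gradient sync
    optimizer.step()
    optimizer.zero_grad()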

TRL

PEFT can also be applied to training LLMs with RLHF components such as the ranker and policy; the TRL library and its documentation are a good place to get started.
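For example, TRL's SFTTrainer can take a PEFT configuration directly; a rough sketch, where the model name, dataset, and argument names (such as dataset_text_field) are assumptions that may differ across TRL versions:

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # placeholder dataset with a "text" column
peft_config = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=32, lora_dropout=0.1)

trainer = SFTTrainer(
    "facebook/opt-350m",        # base model; TRL attaches the LoRA adapter for you
    train_dataset=dataset,
    dataset_text_field="text",  # argument name may vary between TRL versions
    peft_config=peft_config,
)
trainer.train()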

Model support

Use this Space or check out the docs to find which models officially support a PEFT method out of the box. Even if a model isn't listed there, you can manually configure the model config to enable PEFT for it. Read the New transformers architecture guide to learn how.
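As a rough sketch of what manual configuration can look like, you can point LoRA at the module names of an architecture that isn't auto-mapped via target_modules (the module names and hyperparameters below are illustrative):

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # adjust to the attention projection names in your architecture
    lora_dropout=0.05,
    bias="none",
)
peft_model = get_peft_model(base_model, config)  # base_model: any transformers model you have already loaded
peft_model.print_trainable_parameters()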

Contribute

If you would like to contribute to PEFT, please check out our contribution guide.

Citing 🤗 PEFT

To use 🤗 PEFT in your publication, please cite it by using the following BibTeX entry.

@Misc{peft,
  title =        {PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods},
  author =       {Sourab Mangrulkar and Sylvain Gugger and Lysandre Debut and Younes Belkada and Sayak Paul and Benjamin Bossan},
  howpublished = {\url{https://github.com/huggingface/peft}},
  year =         {2022}
}

peft's People

Contributors

aarnphm, akx, alvanli, arnavgarg1, benjaminbossan, dumpmemory, guspan-tanadi, jiqing-feng, kovalexal, lukaskuhn-lku, mayank31398, mkhalusova, nafiturgut, nzw0301, orenwang, pacman100, qingruzhang, sauravmaheshkar, sayakpaul, stas00, stevhliu, sumanthrh, sunmarc, sywangyi, titus-von-koeller, winglian, younesbelkada, yxli2123, zhangsheng377, zspo


peft's Issues

How to load a pickled model from local?

from peft import PeftConfig, PeftModel
from transformers import AutoModelForImageClassification

config = PeftConfig.from_json_file('fruits.pkl/config.json')
model = AutoModelForImageClassification.from_pretrained(
    'microsoft/vit-base',
    label2id= {"apple": 1,"orange": 0},
    id2label={"0": "apple","1": "orange"},
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)
# Load the Lora model
inference_model = PeftModel.from_pretrained(model, 'fruits.pkl/config.json')

The above code gives:
ValueError: Can't find config.json at 'fruits.pkl/config.json'

Handling of additional trainable params from Hub utils

@younesbelkada

Now that #39 has been merged, I wanted to focus on #44 and #45.

As you can see here, after wrapping a base model with LoraModel for image classification fine-tuning, I still have to do

for param in lora_model.classifier.parameters():
    param.requires_grad = True

I am aware that if we fix the internal task types of LoraConfig, we wouldn't need to do this. But on the other hand, this goes to show the flexibility of PEFT, doesn't it?

In this case, would the Hub utilities introduced in #39 take care of the additional trainable parameters?

ImportError: cannot import name 'prepare_model_for_training' from 'peft'

Hey, I got this error after running this code, which is strange since it worked perfectly last night.

My code:

# Select CUDA device index

import os
import torch

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xxl"

model_name = 'google/flan-t5-large'

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16, load_in_8bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

from peft import prepare_model_for_training

model = prepare_model_for_training(model)


ERROR:

ImportError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 from peft import prepare_model_for_training
3 model = prepare_model_for_training(model)

ImportError: cannot import name 'prepare_model_for_training' from 'peft' (/usr/local/lib/python3.9/dist-packages/peft/__init__.py)

Add support for T-Few

T-Few is a PEFT method for few-shot learning that is currently the SOTA on many NLP benchmarks. It uses a nifty technique called (IA)^3 to update a small number of parameters during training and would be an impactful method to include IMO.

Although research code exists, it is tightly bound to the paper and doesn't run easily on hardware that isn't an 80GB A100. The peft library could help make this work more accessible to industry practitioners (where few-shot is actually valuable)

cc @craffel

Paper: https://arxiv.org/abs/2205.05638
GitHub: https://github.com/r-three/t-few

Number of trainable parameters for a LoRA model w.r.t the original model

If we do

from peft import LoraConfig, LoraModel
from transformers import AutoModelForImageClassification


model_checkpoint = "google/vit-base-patch32-224-in21k"
label2id = {"dog": 0, "cat": 1, "mouse": 2}
id2label = {v: k for k, v in label2id.items()}
model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint, 
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes = True, 
)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.0,
    bias="none",
)
lora_model = LoraModel(config, model)

And then

def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(count_trainable_params(model), count_trainable_params(lora_model))

Both print the same number of trainable parameters. Is this expected?
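For context, LoraModel modifies the wrapped model in place, so model and lora_model share the same parameters and the two counts will always match. A minimal alternative sketch using the higher-level wrapper, which reports the counts directly:

from peft import get_peft_model

peft_model = get_peft_model(model, config)  # wraps the base model with the same LoRA config
peft_model.print_trainable_parameters()     # prints trainable vs. total parameter counts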

Errors when running examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py

I'm trying to use the example for zero3 offloading for conditional generation/seq2seq (even though I'm only using level 2 zero optimization atm) and I'm running into the following problem. I've never used pdsh, accelerate, or deepspeed before and when I search for the issue on the pytorch forums, the fix would take me into the guts of accelerate and deepspeed. I'm set up to passwordlessly ssh to both of the machines I'm attempting to use.

The script is hanging on accelerator = Accelerator()

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_hostfile: /home/thomas/path/to/hostfile
  deepspeed_multinode_launcher: pdsh
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: INDUCTOR
fsdp_config: {}
machine_rank: 0
main_process_ip: 192.168.6.5
main_process_port: 8989
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
[2023-02-18 09:35:03,693] [INFO] [runner.py:454:main] Using IP address of 192.168.6.5 for node lifebiodevai
[2023-02-18 09:35:03,696] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: lifebiodevai,lifebioprodai
[2023-02-18 09:35:03,696] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w lifebiodevai,lifebioprodai export PYTHONPATH=/home/thomas/src/peft; export SHELL=/bin/bash; export COLORTERM=truecolor; export TERM_PROGRAM_VERSION=1.70.2; export CONDA_EXE=/home/thomas/miniconda3/bin/conda; export _CE_M=; export PWD=/home/thomas/src/peft; export LOGNAME=thomas; export XDG_SESSION_TYPE=tty; export CONDA_PREFIX=/home/thomas/miniconda3/envs/nlp; export JUPYTER_SERVER_URL=http://lifebiodevai:8872/; export VSCODE_GIT_ASKPASS_NODE=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/node; export MOTD_SHOWN=pam; export LINES=29; export HOME=/home/thomas; export LANG=en_US.UTF-8; export COLUMNS=158; export GIT_ASKPASS=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass.sh; export PYDEVD_USE_FRAME_EVAL=NO; export VSCODE_GIT_ASKPASS_EXTRA_ARGS=; export XDG_SESSION_CLASS=user; export JUPYTER_SERVER_ROOT=/home/thomas; export TERM=xterm-256color; export _CE_CONDA=; export USER=thomas; export VSCODE_GIT_IPC_HANDLE=/run/user/1002/vscode-git-e710c8e75d.sock; export CONDA_SHLVL=4; export IMAGE_TAG=v0.2.4; export SHLVL=1; export PYXTERM_DIMENSIONS=80x25; export XDG_SESSION_ID=605; export CONDA_PYTHON_EXE=/home/thomas/miniconda3/bin/python; export LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64; export XDG_RUNTIME_DIR=/run/user/1002; export CONDA_DEFAULT_ENV=nlp; export VSCODE_GIT_ASKPASS_MAIN=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass-main.js; export XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop; export BROWSER=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/helpers/browser.sh; export PATH=/home/thomas/miniconda3/envs/nlp/bin:/home/thomas/miniconda3/condabin:/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/opt/mssql-tools/bin:/usr/local/cuda/bin:/opt/mssql-tools/bin; export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1002/bus; export CONDA_PREFIX_1=/home/thomas/miniconda3; export OLDPWD=/home/thomas/src/peft; export TERM_PROGRAM=vscode; export VSCODE_IPC_HOOK_CLI=/run/user/1002/vscode-ipc-f03caaee-1d2c-4b01-8f00-57b95d1a0043.sock; export _=/home/thomas/miniconda3/envs/nlp/bin/accelerate; export ACCELERATE_MIXED_PRECISION=bf16; export ACCELERATE_CONFIG_DS_FIELDS=deepspeed_hostfile,deepspeed_multinode_launcher,gradient_accumulation_steps,gradient_clipping,offload_optimizer_device,offload_param_device,zero3_init_flag,zero_stage,mixed_precision; export ACCELERATE_USE_DEEPSPEED=true; export ACCELERATE_DEEPSPEED_ZERO_STAGE=2; export ACCELERATE_GRADIENT_ACCUMULATION_STEPS=4; export ACCELERATE_GRADIENT_CLIPPING=1.0; export ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu; export ACCELERATE_DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu; export ACCELERATE_DEEPSPEED_ZERO3_INIT=true; export ACCELERATE_DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=true; export MIXED_PRECISION=fp16; export USE_DEEPSPEED=true; export DEEPSPEED_ZERO_STAGE=3; export GRADIENT_ACCUMULATION_STEPS=4; export GRADIENT_CLIPPING=1.0; export DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu; export DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu; export DEEPSPEED_ZERO3_INIT=true; export DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=true; export 
CONDA_PREFIX_2=/home/thomas/miniconda3/envs/nlp; export CONDA_PREFIX_3=/home/thomas/miniconda3;  cd /home/thomas/src/peft; /home/thomas/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== --node_rank=%n --master_addr=192.168.6.5 --master_port=8989 --no_local_rank examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:162:main] dist_world_size=2
lifebiodevai: [2023-02-18 09:35:05,622] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:162:main] dist_world_size=2
lifebioprodai: [2023-02-18 09:35:05,983] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebiodevai: 
lifebiodevai: ===================================BUG REPORT===================================
lifebiodevai: Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
lifebiodevai: ================================================================================
lifebiodevai: /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:472: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
lifebiodevai:   warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
lifebiodevai: [2023-02-18 09:35:07,868] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
lifebioprodai: 
lifebioprodai: ===================================BUG REPORT===================================
lifebioprodai: Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
lifebioprodai: ================================================================================
lifebioprodai: /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/utils/dataclasses.py:472: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
lifebioprodai:   warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
lifebioprodai: client_loop: send disconnect: Broken pipe
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 255
lifebiodevai: ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
lifebiodevai: │ /home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offl │
lifebiodevai: │ oad.py:303 in <module>                                                                           │
lifebiodevai: │                                                                                                  │
lifebiodevai: │   300                                                                                            │
lifebiodevai: │   301                                                                                            │
lifebiodevai: │   302 if __name__ == "__main__":                                                                 │
lifebiodevai: │ ❱ 303 │   main()                                                                                 │
lifebiodevai: │   304                                                                                            │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offl │
lifebiodevai: │ oad.py:104 in main                                                                               │
lifebiodevai: │                                                                                                  │
lifebiodevai: │   101                                                                                            │
lifebiodevai: │   102                                                                                            │
lifebiodevai: │   103 def main():                                                                                │
lifebiodevai: │ ❱ 104 │   accelerator = Accelerator()                                                            │
lifebiodevai: │   105 │   model_name_or_path = "bigscience/T0_3B"                                                │
lifebiodevai: │   106 │   dataset_name = "twitter_complaints"                                                    │
lifebiodevai: │   107 │   peft_config = LoraConfig(                                                              │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/accelerator.py:323 in   │
lifebiodevai: │ __init__                                                                                         │
lifebiodevai: │                                                                                                  │
lifebiodevai: │    320 │   │   │   │   │   │   self.init_handler = handler                                       │
lifebiodevai: │    321 │   │                                                                                     │
lifebiodevai: │    322 │   │   kwargs = self.init_handler.to_kwargs() if self.init_handler is not None else {}   │
lifebiodevai: │ ❱  323 │   │   self.state = AcceleratorState(                                                    │
lifebiodevai: │    324 │   │   │   mixed_precision=mixed_precision,                                              │
lifebiodevai: │    325 │   │   │   cpu=cpu,                                                                      │
lifebiodevai: │    326 │   │   │   dynamo_backend=dynamo_backend,                                                │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/accelerate/state.py:147 in         │
lifebiodevai: │ __init__                                                                                         │
lifebiodevai: │                                                                                                  │
lifebiodevai: │   144 │   │   │   │   │   if compare_versions("deepspeed", ">", "0.6.5"):                        │
lifebiodevai: │   145 │   │   │   │   │   │   from deepspeed import comm as dist                                 │
lifebiodevai: │   146 │   │   │   │   │   │                                                                      │
lifebiodevai: │ ❱ 147 │   │   │   │   │   │   dist.init_distributed(dist_backend=self.backend)                   │
lifebiodevai: │   148 │   │   │   │   │   else:                                                                  │
lifebiodevai: │   149 │   │   │   │   │   │   torch.distributed.init_process_group(backend="nccl", **kwargs)     │
lifebiodevai: │   150                                                                                            │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/deepspeed/comm/comm.py:661 in      │
lifebiodevai: │ init_distributed                                                                                 │
lifebiodevai: │                                                                                                  │
lifebiodevai: │   658 │   │   │   │   │   'Initializing TorchBackend in DeepSpeed with backend {}'.format(       │
lifebiodevai: │   659 │   │   │   │   │   │   dist_backend))                                                     │
lifebiodevai: │   660 │   │   │   # Create a torch backend object, initialize torch distributed, and assign to   │
lifebiodevai: │ ❱ 661 │   │   │   cdb = TorchBackend(dist_backend, timeout, init_method)                         │
lifebiodevai: │   662                                                                                            │
lifebiodevai: │   663                                                                                            │
lifebiodevai: │   664 def mpi_discovery(distributed_port=TORCH_DISTRIBUTED_DEFAULT_PORT, verbose=True):          │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/deepspeed/comm/torch.py:30 in      │
lifebiodevai: │ __init__                                                                                         │
lifebiodevai: │                                                                                                  │
lifebiodevai: │    27 │   │   # The idea is to fake that dist backend is initialized even when                   │
lifebiodevai: │    28 │   │   # it is not so we can run on a single GPU without doing any init_process_group     │
lifebiodevai: │    29 │   │   self.single_gpu_mode = True                                                        │
lifebiodevai: │ ❱  30 │   │   self.init_process_group(backend, timeout, init_method)                             │
lifebiodevai: │    31 │                                                                                          │
lifebiodevai: │    32 │   def init_process_group(self, backend, timeout, init_method):                           │
lifebiodevai: │    33 │   │   if not torch.distributed.is_initialized():                                         │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/deepspeed/comm/torch.py:34 in      │
lifebiodevai: │ init_process_group                                                                               │
lifebiodevai: │                                                                                                  │
lifebiodevai: │    31 │                                                                                          │
lifebiodevai: │    32 │   def init_process_group(self, backend, timeout, init_method):                           │
lifebiodevai: │    33 │   │   if not torch.distributed.is_initialized():                                         │
lifebiodevai: │ ❱  34 │   │   │   torch.distributed.init_process_group(backend,                                  │
lifebiodevai: │    35 │   │   │   │   │   │   │   │   │   │   │   │    timeout=timeout,                          │
lifebiodevai: │    36 │   │   │   │   │   │   │   │   │   │   │   │    init_method=init_method)                  │
lifebiodevai: │    37 │   │   self.using_mpi = torch.distributed.get_backend() == 'mpi'                          │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/torch/distributed/distributed_c10d │
lifebiodevai: │ .py:786 in init_process_group                                                                    │
lifebiodevai: │                                                                                                  │
lifebiodevai: │    783 │   else:                                                                                 │
lifebiodevai: │    784 │   │   # Use store based barrier here since barrier() used a bunch of                    │
lifebiodevai: │    785 │   │   # default devices and messes up NCCL internal state.                              │
lifebiodevai: │ ❱  786 │   │   _store_based_barrier(rank, store, timeout)                                        │
lifebiodevai: │    787 │   │   # Set sequence numbers for gloo and nccl process groups.                          │
lifebiodevai: │    788 │   │   if get_backend(default_pg) in [Backend.GLOO, Backend.NCCL]:                       │
lifebiodevai: │    789 │   │   │   default_pg._set_sequence_number_for_group()                                   │
lifebiodevai: │                                                                                                  │
lifebiodevai: │ /home/thomas/miniconda3/envs/nlp/lib/python3.10/site-packages/torch/distributed/distributed_c10d │
lifebiodevai: │ .py:346 in _store_based_barrier                                                                  │
lifebiodevai: │                                                                                                  │
lifebiodevai: │    343 │   │   │   log_time = time.time()                                                        │
lifebiodevai: │    344 │   │                                                                                     │
lifebiodevai: │    345 │   │   if timedelta(seconds=(time.time() - start)) > timeout:                            │
lifebiodevai: │ ❱  346 │   │   │   raise RuntimeError(                                                           │
lifebiodevai: │    347 │   │   │   │   "Timed out initializing process group in store based barrier on "         │
lifebiodevai: │    348 │   │   │   │   "rank: {}, for key: {} (world_size={}, worker_count={}, timeout={})".for  │
lifebiodevai: │    349 │   │   │   │   │   rank, store_key, world_size, worker_count, timeout                    │
lifebiodevai: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
lifebiodevai: RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, 
lifebiodevai: timeout=0:30:00)
lifebiodevai: [2023-02-18 03:34:10,098] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 921765
lifebiodevai: [2023-02-18 03:34:10,098] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', 'examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py'] exits with return code = 1
pdsh@lifebiodevai: lifebiodevai: ssh exited with exit code 1

Is the second error related to the broken pipe to the other machine? If this is an issue for another repo, please let me know if I should go bug accelerate. I'm trying to fix the broken pipe issue by setting TcpKeepAlive to no per this answer on StackOverflow.

Problem with latency

Hi!

I trained t5-large using LoRA config:

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

I saved the model using the following code:
model.save_pretrained(path)

I have two questions:

  • It takes me 2000 ms (on an A100 40GB) to generate a response for one example. Could you please help me reduce the latency?
  • I tried to convert the model to ONNX using Optimum, but it doesn't work (not all of the weights are loaded):
tokenizer = T5Tokenizer.from_pretrained("model_path")
ort_model = ORTModelForSeq2SeqLM.from_pretrained("model_path", from_transformers=True, provider="CUDAExecutionProvider")
onnx_pipe = pipeline("text2text-generation", model=ort_model, tokenizer=tokenizer, device="cuda", 
                     batch_size=8, max_length=512, truncation=True)

Maybe I need to save the model in another way?
Could you please help me understand what I'm doing wrong?

[Int8] Exception during inference - RuntimeError: expected scalar type Half but found Float

While attempting to use a Lora with a model loaded in 8bit, I get the following error upon trying to generate anything:

B:\python\lib\site-packages\torch\nn\modules\linear.py:103 in forward
102 │   def forward(self, input: Tensor) -> Tensor:
103 │       return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Half but found Float

I can prevent that by adding .half() to the model = PeftModelForCausalLM.from_pretrained(model, peft_model_id) line in the below test code, but I wanted to confirm if this was a bug/doc issue or an expected side effect of using int8 and if converting the peftmodel to fp16 would affect the wrapped model in any way.

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from peft import PeftModelForCausalLM, PeftConfig
model_id = "OPT-350M-Nerys-v2"
peft_model_id = r"D:\AI\lora_test\OPT-350M-Nerys-v2_lora"  # raw string so the backslashes aren't treated as escapes
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto')
model = PeftModelForCausalLM.from_pretrained(model, peft_model_id)

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
    input_ids=input_ids,
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
print(f'Output: {generate_simple(input_text)}')

Error while importing peft

I am trying the PEFT example with DeepSpeed integration using the shared .py file: peft_lora_seq2seq_accelerate_ds_zero3_offload.py
But I am getting the error below:

Traceback (most recent call last):                                                                                                                                                                                                                                   
  File "/home/vghorpad/LLM/peft_lora_seq2seq_accelerate_ds_zero3_offload.py", line 14, in <module>
    from peft import LoraConfig, TaskType, get_peft_model
  File "/home/vghorpad/LLM/peft_lora_seq2seq_accelerate_ds_zero3_offload.py", line 14, in <module>
    from peft import LoraConfig, TaskType, get_peft_model
ImportError: cannot import name 'LoraConfig' from partially initialized module 'peft' (most likely due to a circular import)

I followed the steps to configure accelerate as per the readme.
Is there anything I might be missing?

Inference with load_in_8bit=True after fine-tuning gives "mat1 and mat2 must have the same dtype"

I fine-tuned bloomz-7b1 using a LoRA config.

After fine-tuning and getting the adapter config, I got an error when trying to generate text from it.

Here's the sample code:

model_name_or_path = "bigscience/bloomz-7b1-mt"
peft_model_id = f"{model_name_or_path}_LORA_CAUSAL_LM"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto",load_in_8bit=True)
model = PeftModel.from_pretrained(model, peft_model_id)

generated_ids = model.generate(input_ids=input_ids.to("cuda"), max_length=400, pad_token_id=tokenizer.eos_token_id, do_sample=True, top_p=0.95, temperature=0.5, penalty_alpha=0.6, top_k=4, repetition_penalty=1.03, num_return_sequences=1)

Here's the error:


RuntimeError: mat1 and mat2 must have the same dtype

Is this expected, or is there a workaround?

CUDA Error when fine-tuning GPT-J for CausalLM

Hello, I am trying to finetune GPT-J for text generation by adapting this notebook. However, when I run trainer.train I get a CUDA error stating the following: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling 'cublasCreate(handle)'
The error seems to originate from ./peft/src/peft/tuners/lora.py line 277, according to the traceback. Any ideas why this is happening or how to fix it?

The full traceback is below :

RuntimeError                              Traceback (most recent call last)
Cell In[14], line 1
----> 1 trainer.train()

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1538     self.model_wrapped = self.model
   1540 inner_training_loop = find_executable_batch_size(
   1541     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1542 )
-> 1543 return inner_training_loop(
   1544     args=args,
   1545     resume_from_checkpoint=resume_from_checkpoint,
   1546     trial=trial,
   1547     ignore_keys_for_eval=ignore_keys_for_eval,
   1548 )

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/accelerate/utils/memory.py:124, in find_executable_batch_size.<locals>.decorator(*args, **kwargs)
    122     raise RuntimeError("No executable batch size found, reached zero.")
    123 try:
--> 124     return function(batch_size, *args, **kwargs)
    125 except Exception as e:
    126     if should_reduce_batch_size(e):

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1791, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1789         tr_loss_step = self.training_step(model, inputs)
   1790 else:
-> 1791     tr_loss_step = self.training_step(model, inputs)
   1793 if (
   1794     args.logging_nan_inf_filter
   1795     and not is_torch_tpu_available()
   1796     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1797 ):
   1798     # if loss is nan or inf simply add the average of previous logged losses
   1799     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2539, in Trainer.training_step(self, model, inputs)
   2536     return loss_mb.reduce_mean().detach().to(self.args.device)
   2538 with self.compute_loss_context_manager():
-> 2539     loss = self.compute_loss(model, inputs)
   2541 if self.args.n_gpu > 1:
   2542     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2571, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2569 else:
   2570     labels = None
-> 2571 outputs = model(**inputs)
   2572 # Save past state if it exists
   2573 # TODO: this needs to be fixed and made cleaner later.
   2574 if self.args.past_index >= 0:

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

***** Running training *****
  Num examples = 36139
  Num Epochs = 6
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 6774
  Number of trainable parameters = 7340032
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[14], line 1
----> 1 trainer.train()

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1538     self.model_wrapped = self.model
   1540 inner_training_loop = find_executable_batch_size(
   1541     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1542 )
-> 1543 return inner_training_loop(
   1544     args=args,
   1545     resume_from_checkpoint=resume_from_checkpoint,
   1546     trial=trial,
   1547     ignore_keys_for_eval=ignore_keys_for_eval,
   1548 )

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/accelerate/utils/memory.py:124, in find_executable_batch_size.<locals>.decorator(*args, **kwargs)
    122     raise RuntimeError("No executable batch size found, reached zero.")
    123 try:
--> 124     return function(batch_size, *args, **kwargs)
    125 except Exception as e:
    126     if should_reduce_batch_size(e):

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1791, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1789         tr_loss_step = self.training_step(model, inputs)
   1790 else:
-> 1791     tr_loss_step = self.training_step(model, inputs)
   1793 if (
   1794     args.logging_nan_inf_filter
   1795     and not is_torch_tpu_available()
   1796     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1797 ):
   1798     # if loss is nan or inf simply add the average of previous logged losses
   1799     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2539, in Trainer.training_step(self, model, inputs)
   2536     return loss_mb.reduce_mean().detach().to(self.args.device)
   2538 with self.compute_loss_context_manager():
-> 2539     loss = self.compute_loss(model, inputs)
   2541 if self.args.n_gpu > 1:
   2542     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2571, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2569 else:
   2570     labels = None
-> 2571 outputs = model(**inputs)
   2572 # Save past state if it exists
   2573 # TODO: this needs to be fixed and made cleaner later.
   2574 if self.args.past_index >= 0:

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/nlp/peft/src/peft/peft_model.py:502, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, **kwargs)
    490 def forward(
    491     self,
    492     input_ids=None,
   (...)
    499     **kwargs,
    500 ):
    501     if not isinstance(self.peft_config, PromptLearningConfig):
--> 502         return self.base_model(
    503             input_ids=input_ids,
    504             attention_mask=attention_mask,
    505             inputs_embeds=inputs_embeds,
    506             labels=labels,
    507             output_attentions=output_attentions,
    508             output_hidden_states=output_hidden_states,
    509             return_dict=return_dict,
    510             **kwargs,
    511         )
    513     batch_size = input_ids.shape[0]
    514     if attention_mask is not None:
    515         # concat prompt attention mask

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:813, in GPTJForCausalLM.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    805 r"""
    806 labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
    807     Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
    808     `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
    809     are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
    810 """
    811 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 813 transformer_outputs = self.transformer(
    814     input_ids,
    815     past_key_values=past_key_values,
    816     attention_mask=attention_mask,
    817     token_type_ids=token_type_ids,
    818     position_ids=position_ids,
    819     head_mask=head_mask,
    820     inputs_embeds=inputs_embeds,
    821     use_cache=use_cache,
    822     output_attentions=output_attentions,
    823     output_hidden_states=output_hidden_states,
    824     return_dict=return_dict,
    825 )
    826 hidden_states = transformer_outputs[0]
    828 # Set device for model parallelism

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:660, in GPTJModel.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    656             return module(*inputs, use_cache, output_attentions)
    658         return custom_forward
--> 660     outputs = torch.utils.checkpoint.checkpoint(
    661         create_custom_forward(block),
    662         hidden_states,
    663         None,
    664         attention_mask,
    665         head_mask[i],
    666     )
    667 else:
    668     outputs = block(
    669         hidden_states,
    670         layer_past=layer_past,
   (...)
    674         output_attentions=output_attentions,
    675     )

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:249, in checkpoint(function, use_reentrant, *args, **kwargs)
    246     raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))
    248 if use_reentrant:
--> 249     return CheckpointFunction.apply(function, preserve, *args)
    250 else:
    251     return _checkpoint_without_reentrant(
    252         function,
    253         preserve,
    254         *args,
    255         **kwargs,
    256     )

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:107, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
    104 ctx.save_for_backward(*tensor_inputs)
    106 with torch.no_grad():
--> 107     outputs = run_function(*args)
    108 return outputs

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:656, in GPTJModel.forward.<locals>.create_custom_forward.<locals>.custom_forward(*inputs)
    654 def custom_forward(*inputs):
    655     # None for past_key_value
--> 656     return module(*inputs, use_cache, output_attentions)

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:302, in GPTJBlock.forward(self, hidden_states, layer_past, attention_mask, head_mask, use_cache, output_attentions)
    300 residual = hidden_states
    301 hidden_states = self.ln_1(hidden_states)
--> 302 attn_outputs = self.attn(
    303     hidden_states,
    304     layer_past=layer_past,
    305     attention_mask=attention_mask,
    306     head_mask=head_mask,
    307     use_cache=use_cache,
    308     output_attentions=output_attentions,
    309 )
    310 attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
    311 outputs = attn_outputs[1:]

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:203, in GPTJAttention.forward(self, hidden_states, attention_mask, layer_past, head_mask, use_cache, output_attentions)
    190 def forward(
    191     self,
    192     hidden_states: Optional[torch.FloatTensor],
   (...)
    200     Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]],
    201 ]:
--> 203     query = self.q_proj(hidden_states)
    204     key = self.k_proj(hidden_states)
    205     value = self.v_proj(hidden_states)

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/nlp/peft/src/peft/tuners/lora.py:277, in Linear.forward(self, x)
    275 def forward(self, x: torch.Tensor):
    276     if self.r > 0 and not self.merged:
--> 277         result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    278         if self.r > 0:
    279             result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

GPU and CPU memory utilization while running the peft_lora_clm_accelerate_ds_zero3_offload.py script

Hi! Thanks a lot for this fantastic package!

I was running the examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py script for the bloomz-7b1 model. As per the README, I was expecting ~18.1GB GPU memory and 35GB CPU memory; however, from the logs generated (please see below; logs for the 15th epoch), the GPU memory consumption seems to be much higher, i.e. close to 32GB, while the CPU memory usage is much lower.

Edit: I think I missed some setup steps required for the deepspeed offloading since the is_ds_zero_3 variable in line 238 is always False. Please let me know! Thank you

Note: I'm running this on a Ubuntu 18.04 x86_64 machine with a single 40GB A100 GPU.

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:06<00:00,  1.11it/s]
GPU Memory before entering the train : 27026
GPU Memory consumed at the end of the train (end-begin): 242
GPU Peak Memory consumed during the train (max-begin): 5011
GPU Total Peak Memory consumed during the train (max): 32037
CPU Memory before entering the train : 4080
CPU Memory consumed at the end of the train (end-begin): 0
CPU Peak Memory consumed during the train (max-begin): 0
CPU Total Peak Memory consumed during the train (max): 4080
epoch=15: train_ppl=tensor(2.0908, device='cuda:0') train_epoch_loss=tensor(0.7375, device='cuda:0')
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:08<00:00,  1.26s/it]
GPU Memory before entering the eval : 27268
GPU Memory consumed at the end of the eval (end-begin): -242
GPU Peak Memory consumed during the eval (max-begin): 1465
GPU Total Peak Memory consumed during the eval (max): 28733
CPU Memory before entering the eval : 4080
CPU Memory consumed at the end of the eval (end-begin): 0
CPU Peak Memory consumed during the eval (max-begin): 0
CPU Total Peak Memory consumed during the eval (max): 4080
accuracy=84.0
eval_preds[:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'no complaint']
dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint', 'no complaint', 'no complaint', 'complaint', 'complaint', 'no complaint']
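Regarding the edit above: a quick way to confirm whether ZeRO-3 was actually picked up by the launcher, as a minimal sketch assuming Accelerate exposes the DeepSpeed plugin on accelerator.state the way the example script expects:

from accelerate import Accelerator

accelerator = Accelerator()
# ZeRO-3 (and hence parameter/optimizer offloading) is only active when Accelerate
# was launched with a DeepSpeed plugin configured for stage 3.
ds_plugin = getattr(accelerator.state, "deepspeed_plugin", None)
is_ds_zero_3 = ds_plugin is not None and ds_plugin.zero_stage == 3
print(f"deepspeed plugin: {ds_plugin}, ZeRO-3 active: {is_ds_zero_3}")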

[Feature] Add support for LongT5 and LongFormer

Before the feature request I just wanted to say great work on this library everyone, I've been looking forward to the release of this for a while now!

I just wanted to hop in to ask if there are any plans to support fine-tuning of models with large context windows like LongT5, Pegasus X, and LongFormer?

Or is it just a matter of using PEFT with those models? If so, would it be possible to add such an example to the examples folder?

Great job and thanks so much again for releasing this!
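For reference, applying LoRA to one of these models may be as simple as wrapping it like any other Transformers model; a minimal sketch for LongT5, where the target_modules names are an assumption based on T5-style attention naming:

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# LongT5 follows the T5 module naming, so "q" and "v" are assumed here.
model = AutoModelForSeq2SeqLM.from_pretrained("google/long-t5-tglobal-base")
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1, target_modules=["q", "v"]
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()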

Degree of Cooperation with Adapter Hub?

Was Adapter Hub involved in this work? I know they put so much energy into implementing something similar and tried to contribute their work back to you all. I just found your exciting blog post and was surprised not to see this mentioned.

Enhancement: detach dtype for prompt embeddings from the model itself

I think right now, the dtype of prompt embeddings and the model are tied together since the weights are copied.
It would be nice to have a different dtype for prompt embeddings.

This would enable better mixed-precision training: the model itself doesn't need to be in fp32, only the prompt embeddings do, since they are the only parameters being trained.

Let me know your thoughts @pacman100
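Until the dtypes can be decoupled, one interim option is plain autocast-based mixed precision, which keeps all parameters (including the copied prompt embeddings) in fp32 and only runs the compute in half precision. A minimal sketch, assuming a model, optimizer, and dataloader already exist and that batches carry labels and sit on the GPU:

import torch

scaler = torch.cuda.amp.GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        # Forward pass runs in fp16 where safe; parameters stay in fp32.
        outputs = model(**batch)
        loss = outputs.loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()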

Any support for masked language modeling?

I see that there's support for causal language modeling, and that models trained with MLM objectives are supported for sequence and token classification tasks, but could I ask if there's anything I missed/any hacky way to get PEFT to work for masked language modeling?
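One possible workaround sketch while there is no dedicated MLM task type: wrap the masked LM with a plain LoraConfig and rely on the generic PeftModel fallback; the target_modules names below assume a RoBERTa/BERT-style attention layout:

from transformers import AutoModelForMaskedLM
from peft import LoraConfig, get_peft_model

model = AutoModelForMaskedLM.from_pretrained("roberta-base")
# No MLM-specific TaskType is assumed here; a plain LoraConfig still injects
# adapters into the named linear layers and freezes the rest of the model.
peft_config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1, target_modules=["query", "value"])
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()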

"Baking" final Lora weights into the model's LinearLayers for inference without additional latency

Hello,

Great library you have there :) It works great but I have a slight problem regarding performance:

When fine-tuning a model with the LoRA method, as far as I understand we should be able to "bake" the final LoRA weights into the original linear layers' weights. I could not find any documentation on how to go about this, but the code does reference merging, so I assume it is implemented.

Is there any example code for this or any other hints you can give me?

Thank you in advance!
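Until this is documented, the arithmetic itself is simple; a hand-rolled sketch of the merge for the lora.Linear layers, assuming the attribute layout from peft/tuners/lora.py (lora_A/lora_B as nn.Linear, a scaling factor, a merged flag) and the default fan_in_fan_out=False:

import torch
from peft.tuners.lora import Linear as LoraLinear  # import path assumed from peft/src/peft/tuners/lora.py


@torch.no_grad()
def merge_lora_weights(model):
    """Fold each low-rank update into its frozen weight: W <- W + (B @ A) * scaling."""
    for module in model.modules():
        if isinstance(module, LoraLinear) and module.r > 0 and not module.merged:
            delta = module.lora_B.weight @ module.lora_A.weight  # (out_features, in_features)
            module.weight.data += delta * module.scaling
            module.merged = True  # the forward pass now skips the LoRA branch
    return model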

`model.named_parameters()` giving tensors of shape 0 with DeepSpeed CPU offloading

When I run python examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py, the following message gets printed as expected on the screen for model.print_trainable_parameters() (Line 219)

trainable params: 3932160 || all params: 7072948224 || trainable%: 0.055594355783029126

However, when I follow the instructions on the README and set up Accelerate with DeepSpeed CPU offloading, the same line now outputs the following

trainable params: 3932160 || all params: 3932160 || trainable%: 100.0

Upon digging a little deeper, it looks like model.named_parameters() returns tensors of size 0 (except for the LoRA A and B matrices) when running with Accelerate and DeepSpeed CPU offloading.

This is the original output for the model.named_parameters() - showing only the top few parameters

base_model.model.transformer.word_embeddings.weight     torch.Size([250880, 4096])
base_model.model.transformer.word_embeddings_layernorm.weight   torch.Size([4096])
base_model.model.transformer.word_embeddings_layernorm.bias     torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([4096])
base_model.model.transformer.h.0.input_layernorm.bias   torch.Size([4096])
base_model.model.transformer.h.0.self_attention.query_key_value.weight  torch.Size([12288, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.bias    torch.Size([12288])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight   torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight   torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight    torch.Size([4096, 4096])
base_model.model.transformer.h.0.self_attention.dense.bias      torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.weight        torch.Size([4096])
base_model.model.transformer.h.0.post_attention_layernorm.bias  torch.Size([4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight       torch.Size([16384, 4096])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([16384])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight       torch.Size([4096, 16384])

This is the output when running with accelerate and DeepSpeed CPU offloading

base_model.model.transformer.word_embeddings.weight     torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.weight   torch.Size([0])
base_model.model.transformer.word_embeddings_layernorm.bias     torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.weight torch.Size([0])
base_model.model.transformer.h.0.input_layernorm.bias   torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.weight  torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.bias    torch.Size([0])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_A.weight   torch.Size([16, 4096])
base_model.model.transformer.h.0.self_attention.query_key_value.lora_B.weight   torch.Size([8192, 8, 1])
base_model.model.transformer.h.0.self_attention.dense.weight    torch.Size([0])
base_model.model.transformer.h.0.self_attention.dense.bias      torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.weight        torch.Size([0])
base_model.model.transformer.h.0.post_attention_layernorm.bias  torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.weight       torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_h_to_4h.bias torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.weight       torch.Size([0])
base_model.model.transformer.h.0.mlp.dense_4h_to_h.bias torch.Size([0])

It doesn't look like this impacts fine-tuning, but I was curious why this is happening!
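For reference, this is likely just ZeRO-3 parameter partitioning at work: the locally held tensors are empty until gathered, which is also why print_trainable_parameters() only counts the non-partitioned LoRA weights. A counting sketch that accounts for this, assuming DeepSpeed attaches a ds_numel attribute to partitioned parameters:

def count_parameters(model):
    trainable, total = 0, 0
    for _, param in model.named_parameters():
        # Under ZeRO-3 the local tensor is empty; ds_numel holds the true size.
        numel = param.ds_numel if hasattr(param, "ds_numel") else param.numel()
        total += numel
        if param.requires_grad:
            trainable += numel
    print(f"trainable params: {trainable} || all params: {total} || trainable%: {100 * trainable / total}")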

Any plan for pipeline parallel to suit larger model?

Hi, I wonder if you have any plans to add pipeline parallelism to this library so that we could fine-tune larger models across multiple GPUs. The reason for mentioning pipeline parallelism in particular is that it may be easier to implement a general version compared to tensor parallelism or other model-parallel strategies. If you have such plans, I'd love to help :)

prepare_model_for_training() should work for existing PeFT/LoRA models for further finetuning

I see that a new utility function, prepare_model_for_training(), was added; however, this won't work for previously trained LoRA models, for two reasons:

  1. The LoRA adapters will be frozen as well (fixable by adding an `if 'lora' not in name` conditional to the param freezing).
  2. PEFT adds an extra wrapper around the model, so additional getters are needed for things like the LayerNorm.

In my own testing I can confirm that further fine-tuning does work after these changes are made; a sketch of the first fix is below.
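A minimal sketch of the first fix, for reference (the helper name below is hypothetical):

def prepare_existing_lora_model_for_training(model):
    # Keep previously trained LoRA adapter weights trainable, freeze everything else,
    # so that further fine-tuning only updates the adapters.
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name
    return model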

Error when applying the fp16 training option using accelerate

Hi. PEFT is amazing. Thank you for sharing this amazing package with us.
However, when I used the fp16 training option with accelerate DeepSpeed ZeRO-3 and PEFT LoRA, an error occurred.
How can I handle this problem?

[My Setting]

  • used accelerate with DeepSpeed (ZeRO-3)
  • used PEFT (LoRA)
  • used Polyglot (GPT-NeoX architecture)
  • tried to use fp16
  • 4 GPUs (RTX 3090)

[Error logs]

Traceback (most recent call last):
File "run_clm_no_hf_trainer.py", line 492, in
main()
File "run_clm_no_hf_trainer.py", line 418, in main
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 943, in prepare
result = self._prepare_deepspeed(*args)
File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1173, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/init.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 297, in init
self._configure_distributed_model(model)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1133, in _configure_distributed_model
raise ValueError(
ValueError: fp16 is enabled but the following parameters have dtype that is not fp16: base_model.model.gpt_neox.layers.0.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.0.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.1.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.1.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.2.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.2.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.3.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.3.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.4.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.4.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.5.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.5.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.6.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.6.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.7.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.7.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.8.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.8.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.9.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.9.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.10.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.10.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.11.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.11.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.12.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.12.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.13.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.13.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.14.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.14.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.15.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.15.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.16.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.16.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.17.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.17.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.18.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.18.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.19.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.19.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.20.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.20.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.21.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.21.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.22.attention.query_key_value.lora_A.weight, 
base_model.model.gpt_neox.layers.22.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.23.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.23.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.24.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.24.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.25.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.25.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.26.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.26.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.27.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.27.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.28.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.28.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.29.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.29.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.30.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.30.attention.query_key_value.lora_B.weight, base_model.model.gpt_neox.layers.31.attention.query_key_value.lora_A.weight, base_model.model.gpt_neox.layers.31.attention.query_key_value.lora_B.weight
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 196869 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 196872 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 196875 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 196873) of binary: /usr/bin/python3.8
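One workaround sketch that is sometimes suggested for this class of error, though not an officially documented fix: cast the freshly initialized LoRA parameters to fp16 before calling accelerator.prepare, so DeepSpeed's fp16 engine sees a uniform dtype.

import torch

# `model` is assumed to be the PeftModel returned by get_peft_model(...),
# and this runs before accelerator.prepare(...).
for name, param in model.named_parameters():
    if "lora_" in name and param.dtype == torch.float32:
        param.data = param.data.to(torch.float16)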

Incorrect Saving Peft Models using HuggingFace Trainer

Hello,

Thanks a lot for the great project.

I am fine-tuning Flan-T5-XXL using HuggingFace Seq2SeqTrainer and hyperparameter_search.
However, the Trainer doesn't store PEFT models correctly because they are not of the "PreTrainedModel" type.
It stores the whole PyTorch model, including Flan-T5-XXL, which is around 42 GB.

I have dug into the code, and I made a hacky solution inside "trainer.py" for now:

    def _save(self, output_dir: Optional[str] = None, state_dict=None):
        # If we are executing this function, we are the process zero, so we don't check for that.
        output_dir = output_dir if output_dir is not None else self.args.output_dir
        os.makedirs(output_dir, exist_ok=True)
        logger.info(f"Saving model checkpoint to {output_dir}")
        from peft.peft_model import PeftModelForSeq2SeqLM
        if isinstance(self.model, PeftModelForSeq2SeqLM):
            self.model.save_pretrained(output_dir, state_dict=state_dict)
        # Save a trained model and configuration using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        elif not isinstance(self.model, PreTrainedModel):
            if isinstance(unwrap_model(self.model), PreTrainedModel):
                if state_dict is None:
                    state_dict = self.model.state_dict()
                unwrap_model(self.model).save_pretrained(output_dir, state_dict=state_dict)
            else:
                logger.info("Trainer.model is not a `PreTrainedModel`, only saving its state dict.")
                if state_dict is None:
                    state_dict = self.model.state_dict()
                torch.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME))
        else:
            self.model.save_pretrained(output_dir, state_dict=state_dict)
        if self.tokenizer is not None:
            self.tokenizer.save_pretrained(output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(self.args, os.path.join(output_dir, TRAINING_ARGS_NAME))

Do you have a better solution for saving PEFT models correctly using the HuggingFace Seq2SeqTrainer and hyperparameter_search?
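A less invasive alternative sketch, using a TrainerCallback instead of patching trainer.py; it assumes the callback receives the wrapped PeftModel via kwargs and that checkpoints follow the Trainer's checkpoint-<step> naming:

import os

from transformers import TrainerCallback


class SavePeftModelCallback(TrainerCallback):
    """Save only the adapter weights whenever the Trainer writes a checkpoint."""

    def on_save(self, args, state, control, **kwargs):
        checkpoint_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        # kwargs["model"] is assumed to be the PeftModel wrapped by the Trainer.
        kwargs["model"].save_pretrained(os.path.join(checkpoint_dir, "adapter_model"))
        return control


# trainer = Seq2SeqTrainer(..., callbacks=[SavePeftModelCallback()])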

How to save / form the config.json after fine-tuning - Flan T5 11b

After fine-tuning a Flan-T5-11B model on custom data, I was saving the checkpoint via accelerate like this:

        accelerator.wait_for_everyone()
        accelerator.save(
            get_peft_model_state_dict(model, state_dict=accelerator.get_state_dict(model)), checkpoint_name
        )
        accelerator.wait_for_everyone() 

It didn't create the config.json needed to load the model. The checkpoint did get created (cdcFT5_lora.pt, a ~19 MB file).

For inference purposes, I am trying to create it manually using the parameters I used for training, looking at some sample LoRA model files. Should target_modules be

"target_modules": [
"q",
"v"
],

OR

"target_modules": [
"query_key_value"
],

{
  "base_model_name_or_path": "./cdcFT5_lora.pt",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "target_modules": [
    "q",
    "v"
  ],
  "task_type": "SEQ_2_SEQ_LM"
}

What values should I give for
"enable_lora": [
true,
false,
true
],
"fan_in_fan_out": true,

For inference, should it be enable_lora as true and fan_in_fan_out as false?

How do I save the model with config.json directly as well?

Is it via

peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"
accelerator.save_pretrained(peft_model_id)

I see that model.save_pretrained() exists; I'm not sure whether accelerator.save_pretrained(peft_model_id) works as well.

Is there any way to load the checkpoint and create the config file as well, without re-training?
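A sketch of one way to write the adapter config together with the weights, reusing the accelerator and model from the snippet above; it assumes model is the PeftModel returned by get_peft_model, and the output directory name is just a placeholder:

accelerator.wait_for_everyone()
# get_state_dict gathers the full weights, so it should be called on every process.
state_dict = accelerator.get_state_dict(model)
if accelerator.is_main_process:
    # save_pretrained on the PeftModel writes adapter_config.json next to the adapter
    # weights, so the config does not need to be assembled by hand afterwards.
    accelerator.unwrap_model(model).save_pretrained("cdcFT5_lora", state_dict=state_dict)
accelerator.wait_for_everyone()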

[lora] `push_to_hub()` `save_pretrained()` errors and potential inconsistencies

Related to #56.

@younesbelkada @pacman100

Consider the following LoraModel instance:

from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
from peft import LoraConfig, LoraModel


model_checkpoint = "google/vit-base-patch16-224-in21k" 
model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,  # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["query", "value"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["classifier"],
)
lora_model = LoraModel(config, model)

If I call lora_model.save_pretrained("lora_vit"), I see that the state_dict is about the same size as that of model. Is this expected?

I would have expected to only see the LoRA trainable parameters alongside the modules_to_save ones. This would help reduce the size of the state dict and would also help with portability, especially for very large models. This is also how it's implemented in diffusers.

Also, PeftConfig is somehow unable to find the config.json here whereas it's clearly there as we can see. What am I missing out on?

This currently blocks the inference and sharing section of the notebook.
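For comparison, a sketch that serializes only the adapter: wrapping with get_peft_model (instead of instantiating LoraModel directly) returns a PeftModel whose save_pretrained keeps just the LoRA and modules_to_save weights. This assumes the generic PeftModel wrapper works for the image-classification model and config from the snippet above:

from peft import get_peft_model

lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()
# Writes adapter_config.json plus a small file containing only the adapter
# and modules_to_save ("classifier") weights.
lora_model.save_pretrained("lora_vit")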

accelerate+deepspeed: ValueError: max() arg is an empty sequence

I'm trying to fine-tune a t5-small model on the CNN/DM dataset, using accelerate, deepspeed and PEFT, but it gives an error:
ValueError: max() arg is an empty sequence

I use the following script:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization_no_trainer.py
as described here: https://huggingface.co/docs/transformers/v4.18.0/en/run_scripts

I modify the script to use PEFT:

442a449,455
>     peft_config = LoraConfig(
>         task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
>     )
>     model = get_peft_model(model, peft_config)
>     model.print_trainable_parameters()

It runs fine when I run it with accelerate without deepspeed:

accelerate launch ~/transformers/examples/pytorch/summarization/run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir ~/tmp/tst-summarization --num_beams 4

But when I try to run it with deepspeed it gives an error:

accelerate launch --config config1.yml ~/transformers/examples/pytorch/summarization/run_summarization_no_trainer.py --model_name_or_path t5-small --dataset_name cnn_dailymail --dataset_config "3.0.0" --source_prefix "summarize: " --output_dir ~/tmp/tst-summarization --num_beams 4

Here is part of the error trace:

│ /opt/conda/envs/mypytorch/lib/python3.9/site-packages/deepspeed/runtime/engine.py:1594 in        │
│ _configure_zero_optimizer                                                                        │
│                                                                                                  │
│   1591 │   │   │   │   log_dist('Creating fp16 ZeRO stage {} optimizer'.format(zero_stage),      │
│   1592 │   │   │   │   │   │    ranks=[0])                                                       │
│   1593 │   │   │   │   from deepspeed.runtime.zero.stage3 import DeepSpeedZeroOptimizer_Stage3   │
│ ❱ 1594 │   │   │   │   optimizer = DeepSpeedZeroOptimizer_Stage3(                                │
│   1595 │   │   │   │   │   self.module,                                                          │
│   1596 │   │   │   │   │   optimizer,                                                            │
│   1597 │   │   │   │   │   timers=timers,                                                        │
│                                                                                                  │
│ /opt/conda/envs/mypytorch/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:303 in    │
│ __init__                                                                                         │
│                                                                                                  │
│    300 │   │   │   │   count = count + 1                                                         │
│    301 │   │                                                                                     │
│    302 │   │   #Largest partitioned param                                                        │
│ ❱  303 │   │   largest_partitioned_param_numel = max([                                           │
│    304 │   │   │   max([                                                                         │
│    305 │   │   │   │   max(tensor.numel(),                                                       │
│    306 │   │   │   │   │   tensor.ds_numel) for tensor in fp16_partitioned_group                 │
│                                                                                                  │
│ /opt/conda/envs/mypytorch/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:304 in    │
│ <listcomp>                                                                                       │
│                                                                                                  │
│    301 │   │                                                                                     │
│    302 │   │   #Largest partitioned param                                                        │
│    303 │   │   largest_partitioned_param_numel = max([                                           │
│ ❱  304 │   │   │   max([                                                                         │
│    305 │   │   │   │   max(tensor.numel(),                                                       │
│    306 │   │   │   │   │   tensor.ds_numel) for tensor in fp16_partitioned_group                 │
│    307 │   │   │   ]) for fp16_partitioned_group in self.fp16_partitioned_groups                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: max() arg is an empty sequence

Here is my config1.yml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true

The same command runs fine if I don't use PEFT in the script.

Error when trying to finetune t5 loaded in int8

Getting the following error when trying to adapt flan-t5-xxl with the int8 training code with PEFT:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

import torch
import torch.nn as nn
from peft import LoraConfig, TaskType, get_peft_model

# `model` here is the flan-t5-xxl checkpoint loaded with load_in_8bit=True
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.decoder.project_in = lambda x: x.requires_grad_(True)

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)
model = get_peft_model(model, peft_config)
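For comparison, a sketch of the same setup using the library's prepare_model_for_int8_training helper in place of the manual freezing/casting above (the flan-t5-xxl checkpoint name is an assumption):

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl", load_in_8bit=True, device_map="auto")
model = prepare_model_for_int8_training(model)  # handles freezing, layernorm casting, input grads

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()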

A peft model requires .train() to be called in order to store gradients during backprop

A peft model requires .train() to be called in order to store gradients during backprop.
cc @younesbelkada
peft version 0.2.0.dev0

See the example here:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training


model = AutoModelForCausalLM.from_pretrained("edbeeching/gpt-neo-125M-imdb", load_in_8bit=True, device_map="auto")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["k_proj", "v_proj", "q_proj", "out_proj"],  # Are these the correct layers to target with LoRA ?
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)

with torch.cuda.amp.autocast():  
    gen = model.generate(input_ids=torch.LongTensor([0,1,2,3]).unsqueeze(0)) 
    # model.train() # <-- UNCOMMENT THIS LINE
    output = model(torch.LongTensor([0,1,2,3]).to("cuda"))
loss = output[0].mean()
loss.backward()
for n,p in model.named_parameters(): 
    print(n, p.grad)

Add Late Prompt Tuning

Overview

Late Prompt Tuning: A Late Prompt Could Be Better Than Many Prompts has shown that prompt tuning can sometimes be more effective if the prompt generation is placed closer to the output signal in the later layers of the transformer architecture.

I think this would be a nice addition to the existing techniques in peft.

Task

  1. Understand Late Prompt Tuning: A Late Prompt Could Be Better Than Many Prompts
  2. Integrate the method into peft
  3. Benchmark and test it against other methods in the repository

(happy to consider doing this issue myself if possible)

Cannot generate w/ do_sample=True on a LoRA model

From the last cell of the OPT Notebook, adding do_sample=True to generate():

/opt/conda/lib/python3.7/site-packages/peft/peft_model.py:550 in generate                        │
│                                                                                                  │
│   547 │                                                                                          │
│   548 │   def generate(self, **kwargs):                                                          │
│   549 │   │   if not isinstance(self.peft_config, PromptLearningConfig):                         │
│ ❱ 550 │   │   │   return self.base_model.generate(**kwargs)                                      │
│   551 │   │   else:                                                                              │
│   552 │   │   │   if "input_ids" not in kwargs:                                                  │
│   553 │   │   │   │   raise ValueError("input_ids must be provided for Peft model generation")   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context        │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:1442 in generate         │
│                                                                                                  │
│   1439 │   │   │   │   output_scores=generation_config.output_scores,                            │
│   1440 │   │   │   │   return_dict_in_generate=generation_config.return_dict_in_generate,        │
│   1441 │   │   │   │   synced_gpus=synced_gpus,                                                  │
│ ❱ 1442 │   │   │   │   **model_kwargs,                                                           │
│   1443 │   │   │   )                                                                             │
│   1444 │   │                                                                                     │
│   1445 │   │   elif is_beam_gen_mode:                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:2462 in sample           │
│                                                                                                  │
│   2459 │   │   │                                                                                 │
│   2460 │   │   │   # pre-process distribution                                                    │
│   2461 │   │   │   next_token_scores = logits_processor(input_ids, next_token_logits)            │
│ ❱ 2462 │   │   │   next_token_scores = logits_warper(input_ids, next_token_scores)               │
│   2463 │   │   │                                                                                 │
│   2464 │   │   │   # Store scores, attentions and hidden_states when required                    │
│   2465 │   │   │   if return_dict_in_generate:                                                   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/logits_process.py:92 in __call__  │
│                                                                                                  │
│    89 │   │   │   │   │   )                                                                      │
│    90 │   │   │   │   scores = processor(input_ids, scores, **kwargs)                            │
│    91 │   │   │   else:                                                                          │
│ ❱  92 │   │   │   │   scores = processor(input_ids, scores)                                      │
│    93 │   │   return scores                                                                      │
│    94                                                                                            │
│    95                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/logits_process.py:297 in __call__ │
│                                                                                                  │
│   294 │   def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.   │
│   295 │   │   top_k = min(self.top_k, scores.size(-1))  # Safety check                           │
│   296 │   │   # Remove all tokens with a probability less than the last token of the top-k       │
│ ❱ 297 │   │   indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]           │
│   298 │   │   scores = scores.masked_fill(indices_to_remove, self.filter_value)                  │
│   299 │   │   return scores                                                                      │
│   300                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "topk_cpu" not implemented for 'Half'

Do I need to keep the model file in bin format during the training process?

Do I need to keep the model file in bin format when training the model with PEFT? I saved it and used it in combination with the 'lora.pt' file, and found that the model's generations were poor and did not make much sense.
This is my inference code:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType, peft_model_load_and_dispatch


model_name_or_path = "/root/gaochangkuan_AI/PromptCLUE_Finetuning/model_finetuning_1_epoch"
checkpoint_name="/root/gaochangkuan_AI/PromptCLUE_Finetuning/model_finetuning_1_epoch/promptclue_lora_fsdp_v1.pt"
max_memory={0: "1GIB", 1: "1GIB", 2: "2GIB", 3: "10GIB", "cpu":"30GB"}
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=True, r=8, lora_alpha=32, lora_dropout=0.1
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, 
                                #device_map="auto", 
                                max_memory=max_memory
                                )
#model = get_peft_model(model, peft_config)
device = torch.device('cuda:7') # cuda
model.to(device)
peft_model_load_and_dispatch(model, torch.load(checkpoint_name), peft_config, max_memory)

Note: the model file in "model_finetuning_1_epoch" was saved during training; it is not the initial model.

So, where might the problem lie?
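For reference, a loading sketch that avoids peft_model_load_and_dispatch: wrap the base model first, then load the saved adapter weights on top of it. It reuses model_name_or_path and checkpoint_name from the snippet above and assumes the .pt file holds a LoRA-only state dict and that set_peft_model_state_dict is available in the installed version:

import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model, set_peft_model_state_dict

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=True, r=8, lora_alpha=32, lora_dropout=0.1
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
# Load only the saved LoRA weights on top of the freshly wrapped base model.
set_peft_model_state_dict(model, torch.load(checkpoint_name))
model.eval()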

T5 with PrefixTuning, error

import torch
from transformers import AutoModelForSeq2SeqLM,T5Tokenizer
from peft import get_peft_config, get_peft_model, TaskType,PrefixTuningConfig,PeftModelForSeq2SeqLM,PeftModel
model_name_or_path = "t5-small"
tokenizer_name_or_path = "t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
tokenizer = T5Tokenizer.from_pretrained(tokenizer_name_or_path)

peft_config=PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    num_virtual_tokens=20)

model = get_peft_model(model, peft_config)

model.cuda()
input_ids = tokenizer.encode("Is dog an animal?", return_tensors="pt").to(model.device)
labels = tokenizer.encode("yes", return_tensors="pt").to(model.device)
decoder_input_ids = labels[:, :-1].contiguous().to(model.device)
labels = labels[:, 1:].clone()
outputs = model(input_ids=input_ids, labels=labels, decoder_input_ids=decoder_input_ids)
Traceback (most recent call last):
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3433, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-d8dc239897c0>", line 24, in <module>
    outputs = model(input_ids=input_ids, labels=labels, decoder_input_ids=decoder_input_ids)
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/peft/peft_model.py", line 676, in forward
    return self.base_model(
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1648, in forward
    decoder_outputs = self.decoder(
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1040, in forward
    layer_outputs = layer_module(
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 699, in forward
    cross_attention_outputs = self.layer[1](
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 613, in forward
    attention_output = self.EncDecAttention(
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1482, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/datamining/miniconda3/envs/lxl/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 538, in forward
    scores += position_bias_masked
RuntimeError: The size of tensor a (20) must match the size of tensor b (7) at non-singleton dimension 3

T5 with PrefixTuning causes errors. It is possibly caused by the shape of past_key_values.

RuntimeError: expected scalar type Float but found Half

This is my code:

# Importing libraries

  peft_model_id= 'int8_peft_model'
  trainer.model.save_pretrained(peft_model_id)

if __name__ == "__main__":
  train_func_main()

I execute the following code in the command line:
python int8_peft_lora_PromptCLUE_Finetuning.py
Then the following error message appears:

  0%|                                                 | 0/35646 [00:00<?, ?it/s]You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Traceback (most recent call last):
  File "/root/gaochangkuan_AI/PromptCLUE_Finetuning/int8_peft_lora_PromptCLUE_Finetuning.py", line 252, in <module>
    train_func_main()
  File "/root/gaochangkuan_AI/PromptCLUE_Finetuning/int8_peft_lora_PromptCLUE_Finetuning.py", line 247, in train_func_main
    trainer.train()
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1576, in train
    return inner_training_loop(
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 1843, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2588, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/trainer.py", line 2620, in compute_loss
    outputs = model(**inputs)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/peft/peft_model.py", line 639, in forward
    return self.base_model(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py", line 1722, in forward
    lm_logits = self.lm_head(sequence_output)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/gaochangkuan_AI/PromptCLUE_Finetuning/int8_peft_lora_PromptCLUE_Finetuning.py", line 175, in forward
    return super().forward(x).to(torch.float32)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Float but found Half
�[31m│�[39m                                                                              �[31m│
�[31m│�[39m /root/anaconda3/lib/python3.9/site-packages/accelerate/�[1mhooks.py�[22m:�[94m156�[39m in       �[31m│
�[31m│�[39m �[92mnew_forward�[39m                                                                  �[31m│
�[31m│�[39m                                                                              �[31m│
�[31m│�[39m   153 │   │   │   �[94mwith�[39m torch.no_grad():                                      �[31m│
�[31m│�[39m   154 │   │   │   │   output = old_forward(*args, **kwargs)                  �[31m│
�[31m│�[39m   155 │   │   �[94melse�[39m:                                                          �[31m│
�[31m│�[39m �[31m❱ �[39m156 │   │   │   output = old_forward(*args, **kwargs)                      �[31m│
�[31m│�[39m   157 │   │   �[94mreturn�[39m module._hf_hook.post_forward(module, output)            �[31m│
�[31m│�[39m   158 │                                                                      �[31m│
�[31m│�[39m   159 │   module.forward = new_forward                                       �[31m│
�[31m│�[39m                                                                              �[31m│
�[31m│�[39m /root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/�[1mlinear.py�[22m:�[94m114�[39m   �[31m│
�[31m│�[39m in �[92mforward�[39m                                                                   �[31m│
�[31m│�[39m                                                                              �[31m│
�[31m│�[39m   111 │   │   │   init.uniform_(�[96mself�[39m.bias, -bound, bound)                    �[31m│
�[31m│�[39m   112 │                                                                      �[31m│
�[31m│�[39m   113 │   �[94mdef�[39m �[92mforward�[39m(�[96mself�[39m, �[96minput�[39m: Tensor) -> Tensor:                        �[31m│
�[31m│�[39m �[31m❱ �[39m114 │   │   �[94mreturn�[39m F.linear(�[96minput�[39m, �[96mself�[39m.weight, �[96mself�[39m.bias)                 �[31m│
�[31m│�[39m   115 │                                                                      �[31m│
�[31m│�[39m   116 │   �[94mdef�[39m �[92mextra_repr�[39m(�[96mself�[39m) -> �[96mstr�[39m:                                       �[31m│
�[31m│�[39m   117 │   │   �[94mreturn�[39m �[33m'in_features={}, out_features={}, bias={}'�[39m.format(      �[31m│
�[31m╰──────────────────────────────────────────────────────────────────────────────╯
�[1mRuntimeError: �[22mexpected scalar type Float but found Half

What is the reason for this, please?
Also, I am trying to enable fp16, gradient_checkpointing, and gradient_accumulation_steps at the same time. Is there any conflict between these options?
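An error like this usually indicates that the lm_head weights and the incoming hidden states ended up in different precisions (float32 vs. float16). Below is a minimal, hedged sketch of one way to make the CastOutputToFloat wrapper dtype-agnostic; it is an illustration under that assumption, not a confirmed fix for this exact setup.

import torch
import torch.nn as nn

class CastOutputToFloat(nn.Sequential):
    # Wraps lm_head: align the input dtype with the head's weights, then
    # upcast the logits to float32 for a numerically stable loss.
    def forward(self, x):
        # self[0] is the wrapped lm_head (an nn.Linear); casting x to its
        # weight dtype avoids "expected scalar type Float but found Half".
        return super().forward(x.to(self[0].weight.dtype)).to(torch.float32)

# model.lm_head = CastOutputToFloat(model.lm_head)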

Is the requirement for PyTorch 1.13 necessary?

I've gotten it running on PyTorch 1.11, seemingly without issue, and was wondering whether there is a specific reason for the requirement or whether 1.13 just happens to be the version PEFT was tested with. I was looking at integrating it into a project that (at least for now) requires 1.11, and I ended up locally editing peft's setup.py so that installation would not break the project by upgrading PyTorch to an incompatible version.

Better warning message or error when target module names are not available

If we do:

from peft import LoraConfig, LoraModel
from transformers import AutoModelForImageClassification


model_checkpoint = "google/vit-base-patch32-224-in21k"
label2id = {"dog": 0, "cat": 1, "mouse": 2}
id2label = {v: k for k, v in label2id.items()}
model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint,
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes=True,
)

config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = LoraModel(config, model)

This has no effect and leads to errors during training because the target_modules specified in LoraConfig do not match any module names in the model (ViT's attention layers are named "query" and "value", not "q_proj" and "v_proj"). When this happens, it would help to emit a warning so that users have enough information to track down the problem; see the sketch below.
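For illustration, a hedged sketch of the kind of check being requested; warn_if_targets_missing is a hypothetical helper, not PEFT's actual implementation.

import warnings

def warn_if_targets_missing(model, target_modules):
    # Collect every module name and check whether each configured target
    # matches one of them, either exactly or as the last path component.
    module_names = [name for name, _ in model.named_modules()]
    missing = [
        target for target in target_modules
        if not any(name == target or name.endswith("." + target) for name in module_names)
    ]
    if missing:
        warnings.warn(
            f"target_modules {missing} were not found in the model; "
            "no LoRA layers will be injected for them."
        )

# warn_if_targets_missing(model, config.target_modules)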

[Feature] Add support for Donut (Multimodal Model)

Thank you very much for sharing this library; it is going to be very useful for fine-tuning big models.

It would be great if the Donut model were supported. It works very well for key information extraction from document images. ;)

Could you share an example of how to use PEFT with this model?

Thanks in advance! @NielsRogge
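For reference, a minimal, untested sketch of what applying LoRA to Donut might look like, assuming its BART-style decoder exposes the usual "q_proj"/"v_proj" projection names; the "naver-clova-ix/donut-base" checkpoint and the hyperparameters are placeholders, not an officially supported recipe.

from transformers import VisionEncoderDecoderModel
from peft import LoraConfig, get_peft_model

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption: decoder attention projection names
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()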

Compute capability < 7.5 detected!

I am trying to run:

from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig
model_name_or_path = "facebook/opt-13b"

peft_config = LoraConfig(
    task_type="CAUSAL_LM", inference_mode=False, r=64, lora_alpha=32, lora_dropout=0.1
)

model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

but then I get:
Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
NameError: name 'cuda_setup' is not defined

I understand that this error comes from bitsandbytes, so I found this issue:
TimDettmers/bitsandbytes#124

My hardware is a Tesla V100-SXM2-32GB (Volta).

Is it possible to run PEFT on my hardware? I don't really need int8; fp16 should also be fine.

This is also related to my attempt to fine-tune a large LM on my hardware:
https://discuss.huggingface.co/t/finetune-llm-with-deepspeed/31589

Thanks,
Shon
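For Volta GPUs, one option that is sometimes suggested is to skip int8 entirely and load the base model in fp16 before applying LoRA. The snippet below is a hedged sketch under that assumption; it reuses the model name and hyperparameters from above purely for illustration and assumes accelerate is installed for device_map.

import torch
from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig

# Load the base model in half precision so bitsandbytes (and its compute
# capability check) is not involved at all.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-13b", torch_dtype=torch.float16, device_map="auto"
)

peft_config = LoraConfig(
    task_type="CAUSAL_LM", inference_mode=False, r=64, lora_alpha=32, lora_dropout=0.1
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()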

[int8 inference, lora, flan-t5-xxl] error during inference

I am learning to use the library and was following the Finetune_flan_t5_large_bnb_peft.ipynb tutorial. The int8 training works fine and the model is pushed to the Hub.

Then I tried int8 inference with the following code and ran into two problems I don't understand.

import os
import torch
from peft import PeftModelForSeq2SeqLM, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_id = "lukaemon/flan-t5-xxl-financial-phrasebank-lora"
config = PeftConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", load_in_8bit=True)
model = PeftModelForSeq2SeqLM.from_pretrained(model, model_id)

with torch.cuda.amp.autocast():
    input_text = "In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 ."
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.cuda()

    outputs = model.generate(input_ids=input_ids, max_new_tokens=10)

    print("input sentence: ", input_text)
    print("output prediction: ", tokenizer.decode(outputs[0], skip_special_tokens=True))
  1. os.environ["CUDA_VISIBLE_DEVICES"] = "0" is not respected; the model weights are distributed across both local GPUs.
  2. Inference error: 'NoneType' object has no attribute 'device'
AttributeError                            Traceback (most recent call last)
Cell In[2], line 5
      2 input_text = "In January-September 2009 , the Group 's net interest income increased to EUR 112.4 mn from EUR 74.3 mn in January-September 2008 ."
      3 input_ids = tokenizer(input_text, return_tensors="pt").input_ids.cuda()
----> 5 outputs = model.generate(input_ids=input_ids, max_new_tokens=10)
      7 print("input sentence: ", input_text)
      8 print(
      9     " output prediction: ", tokenizer.decode(outputs[0], skip_special_tokens=True)
     10 )

File /usr/local/lib/python3.8/dist-packages/peft/peft_model.py:725, in PeftModelForSeq2SeqLM.generate(self, **kwargs)
    723 def generate(self, **kwargs):
    724     if not isinstance(self.peft_config, PromptLearningConfig):
--> 725         return self.base_model.generate(**kwargs)
    726     else:
    727         if "input_ids" not in kwargs:

File /usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py:1264, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, **kwargs)
...
-> 1698     prev_device = pre_call(A.device)
   1699     if state is None: state = (A.shape, from_order)
   1700     else: from_order = state[1]

AttributeError: 'NoneType' object has no attribute 'device'

What did I miss? Thanks in advance.
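For the device-placement part, two workarounds that are commonly suggested (hedged assumptions, not a confirmed fix for this report): set CUDA_VISIBLE_DEVICES before torch is imported, or pass an explicit device_map that pins every submodule to GPU 0.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before `import torch`

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PeftModelForSeq2SeqLM, PeftConfig

model_id = "lukaemon/flan-t5-xxl-financial-phrasebank-lora"
config = PeftConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(
    config.base_model_name_or_path,
    load_in_8bit=True,
    device_map={"": 0},  # keep the whole model on GPU 0 instead of sharding it
)
model = PeftModelForSeq2SeqLM.from_pretrained(model, model_id)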
