
Comments (46)

pacman100 avatar pacman100 commented on May 21, 2024 2

Hello @sujithjoseph, for PEFT generate methods one has to pass keyword arguments. Could you try the change below and let us know if that resolves the issue? I will add this point to the caveats.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
-   input_ids, 
+   input_ids=input_ids,
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)

pacman100 avatar pacman100 commented on May 21, 2024 2

Also, may I know the input and output sequence lengths of the dataset?

In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,

  1. Input seq length = 255
  2. output seq length = 50
  3. batch_size_per_gpu = 8 (so total batch size of 32=8*4)

Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b

I observe the memory stats below:

GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395

So, it works fine with a decent batch size. However, if the input and output sequence lengths are very large, it might cause an OOM, as activations from intermediate layers become the bottleneck.

ianbstewart avatar ianbstewart commented on May 21, 2024 2

I'm facing a similar problem (w/ @sujithjoseph) where the config.json file is not saved during training, which makes it harder to load the model after training. Has there been a fix for this?

pacman100 avatar pacman100 commented on May 21, 2024 1

Also, you can use it with pipelines via the logic below. A warning will be displayed saying the model might be unsupported; it can be ignored, because PeftModel isn't a subclass of model classes such as T5:

from transformers import SummarizationPipeline


summarizer = SummarizationPipeline(model=model, tokenizer=tokenizer)

raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
        f"{prompt} \n\n {raw_document}",
        num_beams=5,
        min_length=5,
        no_repeat_ngram_size=3,
        truncation=True,
        max_length=512,
    )

Let us know if the above snippet helps with using pipelines.

mayank31398 avatar mayank31398 commented on May 21, 2024 1

Try running in bf16 instead of fp32. Also, you can look at ONNX/TensorRT.

pacman100 avatar pacman100 commented on May 21, 2024 1

Had a follow up Q. I was trying to load the model with int-8

To load a model trained using Accelerate + DeepSpeed ZeRO-3, you can do the following. Below is an example for a 3B model:

+ from peft import prepare_model_for_training
  peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
  config = PeftConfig.from_pretrained(peft_model_id)
  model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, 
             load_in_8bit=True, 
              device_map={'':0})
+ model = prepare_model_for_training(model)
  model = PeftModel.from_pretrained(model, peft_model_id)
  tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

Then run generate as usual:

%%time
model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(dataset["test"][i]["Tweet text"])
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
# ['complaint']

I ran the below snippet in a Jupyter cell for the following 3 settings:

from time import time
model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(inputs)
times = [] #in ms

for i in range(100):
    with torch.no_grad():
        #with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time()-start)*1000)
print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

sum(times)/len(times)

  1. For fp32, load directly without using device_map if you have enough GPU memory: model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
  2. For bf16, after loading the PeftModel, do model.to(torch.bfloat16)

precision   inference wall time (ms)
FP32        96
BF16        105
INT8        370

@mayank31398, BF16 taking more time than FP32 is peculiar; usually with FP16 models latency is cut roughly in half, but here it increases. To make sure this isn't related to PEFT, I loaded just the pretrained LLM and can still see the same behaviour, with BF16 latency higher than FP32.
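
For reference, here is item 2 above spelled out end to end as a minimal sketch (the adapter id is the 3B example from earlier in this comment; moving the casted model to "cuda" and calling eval() are additions for completeness, not part of the recipe above):

import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)

# bf16 variant: load the base model in full precision, attach the LoRA adapter,
# then cast the whole model to bfloat16 for inference.
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id)
model = model.to(torch.bfloat16).to("cuda")
model.eval()

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)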

pacman100 avatar pacman100 commented on May 21, 2024 1

@sujithjoseph, device_map and load_in_8bit are used for low-resource inference, e.g. when your GPU's VRAM can't fit the entire model; device_map offloads it to CPU or across smaller GPUs, while load_in_8bit aims to fit such large models on the given GPU by keeping the weights in int8 precision.

For very low latencies, as @mayank31398 suggested, you would have to convert the model to ONNX/TensorRT; alternatively, use flash attention, fused kernels, etc.

pacman100 avatar pacman100 commented on May 21, 2024 1

One thing that I just checked was enabling gradient_checkpointing, which recomputes the activations of intermediate blocks instead of storing them. With that, using the same codebase as above, the memory consumed for input_seq_len=512 and output_seq_len=512 is 16GB per GPU. The changes to the code:

  model = AutoModelForSeq2SeqLM.from_pretrained(args.model_id)
    
+  if args.gradient_checkpointing:
+       model.gradient_checkpointing_enable()
+       def make_inputs_require_grad(module, input, output):
+           output.requires_grad_(True)

+       model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
+       model.config.use_cache=False
  
  
    # define LorA fine-tuning config
    if args.use_peft:
        peft_config = LoraConfig(
            task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
        )
        # Create PEFT model with LoraConfig
        model = get_peft_model(model, peft_config)
        model.print_trainable_parameters()

So, you can now easily train with a batch_size of 8 per GPU with inputs and outputs of length 512. However, there is no free lunch: training time increases because of the need to recompute activations. Also, if you are evaluating using generate, evaluation will take a lot longer because model.config.use_cache=False is set, as caching is incompatible with gradient checkpointing.

However, to fit larger batches, I would first check what the 90th or 80th percentile lengths of my inputs and outputs are (see the sketch below); for many use cases they can be a lot less than 512. If they are indeed 512, I would then use gradient_checkpointing.
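
A quick way to check those percentiles before committing to a max length is sketched below (the dataset and column names are placeholders, not from this thread):

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

def length_percentiles(texts, percentiles=(80, 90, 99)):
    # Tokenize each example and report the requested percentiles of token counts.
    lengths = [len(tokenizer(t).input_ids) for t in texts]
    return {p: int(np.percentile(lengths, p)) for p in percentiles}

# Example usage (hypothetical column names):
# print(length_percentiles(dataset["train"]["source_text"]))
# print(length_percentiles(dataset["train"]["target_text"]))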

mayank31398 avatar mayank31398 commented on May 21, 2024 1

You are at your memory limit on the GPU. This generally slows down training.

smolskayanastassia avatar smolskayanastassia commented on May 21, 2024 1

@pacman100 @sujithjoseph @JohnGiorgi @mayank31398
Could you please help with how to convert a PEFT model to ONNX using optimum?
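
This was not answered in the thread, but one possible route (a sketch, not an official recipe) is to merge the LoRA weights into the base model so you are left with a plain Transformers checkpoint, then export that with optimum. It assumes a PEFT version that exposes merge_and_unload() (in some versions it lives on model.base_model) and that optimum is installed for the CLI export step:

from peft import PeftConfig, PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)

base = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base, peft_model_id)

# Fold the LoRA deltas into the base weights so the result is a standard model.
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")  # hypothetical output directory
AutoTokenizer.from_pretrained(config.base_model_name_or_path).save_pretrained("merged-model")

# Then export the merged checkpoint, e.g. with the optimum CLI:
#   optimum-cli export onnx --model merged-model onnx-model/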

pacman100 avatar pacman100 commented on May 21, 2024 1

Hello, this thread has become too long to follow. Please raise separate issues for anything deviating from the original issue. @sujithjoseph, I think the original issue has been resolved; if so, could you please close this and open a new one for the recent issue you are facing.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

I was able to re-create the config file with a smaller training dataset and then saved it using:

finalmodel = accelerator.unwrap_model(model)
finalmodel.save_pretrained(peft_model_id)
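
For completeness, the fuller multi-process variant used later in this thread can be wrapped up as below (a sketch, assuming an Accelerate training loop where model is the prepared PEFT model):

from accelerate import Accelerator

def save_peft_checkpoint(accelerator: Accelerator, model, output_dir: str) -> None:
    # Make sure all processes are in sync, then unwrap the (DeepSpeed-wrapped)
    # PEFT model and write the adapter weights plus adapter_config.json.
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        state_dict=accelerator.get_state_dict(model),
    )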

sujithjoseph avatar sujithjoseph commented on May 21, 2024

How can I easily do inference using Hugging Face pipelines, like below, from a PeftModelForSeq2SeqLM model?

from transformers import pipeline

summarizer = pipeline("summarization", "cdcFT5lra", torch_dtype=torch.bfloat16)

raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
        f"{prompt} \n\n {raw_document}",
        num_beams=5,
        min_length=5,
        no_repeat_ngram_size=3,
        truncation=True,
        max_length=512,
    )

OR

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
    input_ids, 
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)

doesn't work; it gives an error:

TypeError: generate() takes 1 positional argument but 2 were given

The PEFT examples use datasets as input for inference. Is that the only way?

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Thanks @pacman100, really appreciate it! Had a follow-up Q: I was trying to load the model with int-8.


max_memory={0: "30GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu":"60GB"}
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", max_memory=max_memory, load_in_8bit=True)
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)

Got a runtime error:
RuntimeError: expected scalar type Half but found Float

By default, does it load up in bfloat16 or float16, if the model is trained in bfloat16?
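
One quick way to see what dtype the weights actually ended up in is sketched below (model is whatever was loaded above):

from collections import Counter

# Count parameters by dtype to see what precision the model was actually loaded in.
print(Counter(p.dtype for p in model.parameters()))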

sujithjoseph avatar sujithjoseph commented on May 21, 2024

The fine-tuned flan-t5-xxl takes around 10-20 seconds on a single 40 GB A100 GPU to answer a prompt. Is there anything that can be done to make it faster without using a smaller flan-t5 model?

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Thanks a lot @pacman100 @mayank31398! This has been really insightful! I didn't know that converting the model to TensorRT and serving it via the TRT inference server would be faster than PEFT + DeepSpeed ZeRO-3 for inference.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

I also see quality issues with the fine-tuned flan-t5-xxl (on 500K records), unlike the original model; it's hallucinating a lot. I had used a batch size of 1, as I couldn't fit training on 8x 40 GB A100s with a batch size of 2 (it used to run for a couple of hours and then go OOM). Here are the train/eval ppl/loss:
epoch : 0
train_ppl : 133.7952117919922
train_epoch_loss : 4.896310329437256

eval_ppl : 1.5221441984176636
eval_epoch_loss : 0.4201200008392334

def generate_custom(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
     input_ids=input_ids, 
    min_length=256,
    max_new_tokens=1024,
    length_penalty=1.4,
    no_repeat_ngram_size=2,
    top_k=150,
    top_p=0.92,
    repetition_penalty=2.1,
    #num_beams=4,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

mayank31398 avatar mayank31398 commented on May 21, 2024

8x 40G A100s should be enough for PEFT training of FLAN. Can you tell me what backend you are using?
Are you not using DeepSpeed?

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Yes, DeepSpeed ZeRO-3. It worked fine with a batch size of 1, not 2. I am concerned that the lower batch size is impacting model quality. I had 500K records as the training set.
Here is my config (DeepSpeed / Accelerate):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  bf16:enabled: true
distributed_type: DEEPSPEED
downcast_bf16: true
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

reference - microsoft/DeepSpeed#2820

mayank31398 avatar mayank31398 commented on May 21, 2024

I only see 4 processes in the yaml ^^
You can always enable CPU offloading.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

@mayank31398 I had started with 4 and expanded to 8; my final config has num_processes as 8. Doesn't this enable CPU offloading?

  offload_optimizer_device: cpu
  offload_param_device: cpu

sujithjoseph avatar sujithjoseph commented on May 21, 2024

I had also changed this in the final config: dynamo_backend: 'INDUCTOR'

sujithjoseph avatar sujithjoseph commented on May 21, 2024

If I shard the xxl base model like this:

model.save_pretrained("sharded", max_shard_size="2000MB")

will it help in then fine-tuning it with a larger batch size, or should I load it in int-8 and fine-tune it with the larger batch size that fits in memory (see the sketch below)? Not sure which one will result in a higher-quality model.
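
For reference, the int-8 fine-tuning path being weighed here looks roughly like the sketch below; it assumes the peft version used later in this thread (which provides prepare_model_for_int8_training) and reuses the LoraConfig from the gist above, and it does not answer which option yields the better model:

from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl", load_in_8bit=True, device_map="auto"
)
# Prepares the int8 model for training (freezes the base weights, casts some layers, etc.).
model = prepare_model_for_int8_training(model)

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False,
    r=8, lora_alpha=32, lora_dropout=0.1,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()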

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Since I have the CUDA 11.6 driver installed (Vertex AI), I was using torch 1.12.1+cu116. During installation, I see this:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
peft 0.2.0.dev0 requires torch>=1.13.0, but you have torch 1.12.1+cu116 which is incompatible.

Does peft really need torch 1.13.0? So far, I haven't seen any issues using 1.12.1+cu116 with peft.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

@pacman100 , I am not able to import prepare_model_for_training from main. I did pip install -U git+https://github.com/huggingface/peft.git. Should I install this branch - https://github.com/huggingface/peft/tree/younesbelkada-flan-t5-xl ?

ImportError: cannot import name 'prepare_model_for_training' from 'peft' (/opt/conda/lib/python3.7/site-packages/peft/__init__.py). I see it in https://github.com/huggingface/peft/blob/main/src/peft/utils/other.py and in https://github.com/huggingface/peft/blob/main/src/peft/__init__.py as well. I probably need to uninstall and install again.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

pip install --upgrade -e git+https://github.com/huggingface/peft.git#egg=peft
pip install --upgrade git+https://github.com/huggingface/peft.git

This helped to fix it.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

from time import time
model.eval()
inputs = tokenizer(f'Explain Artificial Intelligence ', return_tensors="pt")
print(inputs)
times = [] #in ms

for i in range(100):
    with torch.no_grad():
        #with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time()-start)*1000)
print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

sum(times)/len(times)

This gives the error below: AttributeError: 'NoneType' object has no attribute 'device'

─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│    8 │   with torch.no_grad():                                                                   │
│    9 │   │   #with torch.cuda.amp.autocast():                                                    │
│   10 │   │   start = time()                                                                      │
│ ❱ 11 │   │   outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_token    │
│   12 │   │   times.append((time()-start)*1000)                                                   │
│   13 print(outputs)                                                                              │
│   14 print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))     │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/peft/peft_model.py:708 in generate                        │
│                                                                                                  │
│   705 │                                                                                          │
│   706 │   def generate(self, **kwargs):                                                          │
│   707 │   │   if not isinstance(self.peft_config, PromptLearningConfig):                         │
│ ❱ 708 │   │   │   return self.base_model.generate(**kwargs)                                      │
│   709 │   │   else:                                                                              │
│   710 │   │   │   if "input_ids" not in kwargs:                                                  │
│   711 │   │   │   │   raise ValueError("input_ids must be provided for Peft model generation")   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context        │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:1248 in generate         │
│                                                                                                  │
│   1245 │   │   │   # if model is encoder decoder encoder_outputs are created                     │
│   1246 │   │   │   # and added to `model_kwargs`                                                 │
│   1247 │   │   │   model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(           │
│ ❱ 1248 │   │   │   │   inputs_tensor, model_kwargs, model_input_name                             │
│   1249 │   │   │   )                                                                             │
│   1250 │   │                                                                                     │
│   1251 │   │   # 5. Prepare `input_ids` which will be used for auto-regressive generation        │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:609 in                   │
│ _prepare_encoder_decoder_kwargs_for_generation                                                   │
│                                                                                                  │
│    606 │   │   model_input_name = model_input_name if model_input_name is not None else self.ma  │
│    607 │   │   encoder_kwargs["return_dict"] = True                                              │
│    608 │   │   encoder_kwargs[model_input_name] = inputs_tensor                                  │
│ ❱  609 │   │   model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)          │
│    610 │   │                                                                                     │
│    611 │   │   return model_kwargs                                                               │
│    612                                                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1075 in forward     │
│                                                                                                  │
│   1072 │   │   │   │   │   cross_attn_layer_head_mask=cross_attn_layer_head_mask,                │
│   1073 │   │   │   │   │   past_key_value=past_key_value,                                        │
│   1074 │   │   │   │   │   use_cache=use_cache,                                                  │
│ ❱ 1075 │   │   │   │   │   output_attentions=output_attentions,                                  │
│   1076 │   │   │   │   )                                                                         │
│   1077 │   │   │                                                                                 │
│   1078 │   │   │   # layer_outputs is a tuple with:                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:692 in forward      │
│                                                                                                  │
│    689 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    690 │   │   │   past_key_value=self_attn_past_key_value,                                      │
│    691 │   │   │   use_cache=use_cache,                                                          │
│ ❱  692 │   │   │   output_attentions=output_attentions,                                          │
│    693 │   │   )                                                                                 │
│    694 │   │   hidden_states, present_key_value_state = self_attention_outputs[:2]               │
│    695 │   │   attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs an  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:599 in forward      │
│                                                                                                  │
│    596 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    597 │   │   │   past_key_value=past_key_value,                                                │
│    598 │   │   │   use_cache=use_cache,                                                          │
│ ❱  599 │   │   │   output_attentions=output_attentions,                                          │
│    600 │   │   )                                                                                 │
│    601 │   │   hidden_states = hidden_states + self.dropout(attention_output[0])                 │
│    602 │   │   outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:511 in forward      │
│                                                                                                  │
│    508 │   │   │   return hidden_states                                                          │
│    509 │   │                                                                                     │
│    510 │   │   # get query states                                                                │
│ ❱  511 │   │   query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length,  │
│    512 │   │                                                                                     │
│    513 │   │   # get key/value states                                                            │
│    514 │   │   key_states = project(                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/peft/tuners/lora.py:456 in forward                        │
│                                                                                                  │
│   453 │   │   │   │   nn.init.zeros_(self.lora_B.weight)                                         │
│   454 │   │                                                                                      │
│   455 │   │   def forward(self, x: torch.Tensor):                                                │
│ ❱ 456 │   │   │   result = super().forward(x)                                                    │
│   457 │   │   │   if self.r > 0:                                                                 │
│   458 │   │   │   │   result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling    │
│   459 │   │   │   return result                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/nn/modules.py:242 in forward                 │
│                                                                                                  │
│   239 │   │   if self.bias is not None and self.bias.dtype != x.dtype:                           │
│   240 │   │   │   self.bias.data = self.bias.data.to(x.dtype)                                    │
│   241 │   │                                                                                      │
│ ❱ 242 │   │   out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)                 │
│   243 │   │   if not self.state.has_fp16_weights:                                                │
│   244 │   │   │   if self.state.CB is not None and self.state.CxB is not None:                   │
│   245 │   │   │   │   # we converted 8-bit row major to turing/ampere format in the first infe   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:488 in matmul         │
│                                                                                                  │
│   485 │   state = state or MatmulLtState()                                                       │
│   486 │   if threshold > 0.0:                                                                    │
│   487 │   │   state.threshold = threshold                                                        │
│ ❱ 488 │   return MatMul8bitLt.apply(A, B, out, bias, state)                                      │
│   489                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:320 in forward        │
│                                                                                                  │
│   317 │   │   │   │   │   state.CxB, state.SB = F.transform(state.CB, to_order=formatB)          │
│   318 │   │   else:                                                                              │
│   319 │   │   │   if not state.has_fp16_weights and state.CxB is None and using_igemmlt:         │
│ ❱ 320 │   │   │   │   state.CxB, state.SB = F.transform(state.CB, to_order=formatB)              │
│   321 │   │   │   subA = None                                                                    │
│   322 │   │                                                                                      │
│   323 │   │   # 2. Quantize B                                                                    │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/functional.py:1698 in transform              │
│                                                                                                  │
│   1695                                                                                           │
│   1696                                                                                           │
│   1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=N  │
│ ❱ 1698 │   prev_device = pre_call(A.device)                                                      │
│   1699 │   if state is None: state = (A.shape, from_order)                                       │
│   1700 │   else: from_order = state[1]                                                           │
│   1701 │   if out is None: out, new_state = get_transform_buffer(state[0], A.dtype, A.device, t  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'device'

sujithjoseph avatar sujithjoseph commented on May 21, 2024

This only happens when I load the model in 8-bit.

config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True,torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

either on 1 GPU or with device_map="auto".

pacman100 avatar pacman100 commented on May 21, 2024

@sujithjoseph, what DeepSpeed version are you using? PEFT requires v0.8.0, as it resolved a bug related to training when a lot of params are frozen.

pacman100 avatar pacman100 commented on May 21, 2024

This only happens when I load the model in 8-bit.

config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True,torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

either on 1 GPU or with device_map="auto"

Does adding device_map={'':0} to PeftModel.from_pretrained resolve the issue: model = PeftModel.from_pretrained(model, peft_model_id, device_map={'':0})?
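
Putting those pieces together, an 8-bit single-GPU load would look roughly like the sketch below (peft_model_id is the local adapter directory mentioned later in this thread; no .cuda() call is needed because device_map already places the weights):

from peft import PeftConfig, PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "model-2-21"  # hypothetical local adapter directory
config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(
    config.base_model_name_or_path,
    load_in_8bit=True,
    device_map={"": 0},  # keep every module on GPU 0
)
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"": 0})
model.eval()

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)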

sujithjoseph avatar sujithjoseph commented on May 21, 2024

@sujithjoseph, what DeepSpeed version are you using? PEFT requires v0.8.0, as it resolved a bug related to training when a lot of params are frozen.

@pacman100 deepspeed==0.8.0

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Also, may I know the input and output sequence lengths of the dataset?

In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,

  1. Input seq length = 255
  2. output seq length = 50
  3. batch_size_per_gpu = 8 (so total batch size of 32=8*4)

Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b

I observe the memory stats below:

GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395

So, it works fine with a decent batch size. However, if the input and output sequence lengths are very large, it might cause an OOM, as activations from intermediate layers become the bottleneck.

Max length is 512 for both source and target.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Thanks a lot, @pacman100! This is awesome! I will reduce the max length for the input sequence. I am trying to see if I can pass a question and have Flan-T5 generate an answer / context summary.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Does it help if I increase gradient accumulation steps to 4 from 1? Will it help model accuracy, since I may be able to fit a larger batch size?
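
For context, gradient accumulation spreads one optimizer step over several forward/backward passes, so the effective batch size becomes per_gpu_batch_size * num_gpus * accumulation_steps; in this setup it is the gradient_accumulation_steps value in the DeepSpeed config above. A generic, self-contained Accelerate-style sketch of the pattern (a toy model and data, not the script from this thread):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Tiny stand-in model and data just to illustrate the accumulation pattern.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    # Gradients are accumulated over 4 batches before each optimizer step,
    # so the effective batch size is 8 * num_processes * 4.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()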

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Thanks @pacman100, this worked. Here are some stats for the average time over 100 inferences (ms):

FP32 without device_map didn't fit in a single 40 GB GPU with my current code.
FP32 with device_map took 1982.5
BF16 with device_map took 2144.4
int-8 without device_map took 10719, and it didn't yield the same response as BF16 or FP32.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

@pacman100 , if we need to enable TF32 support instead of bf16, should I select --mixed_precision as 'no' and set

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

in the train function, since I am using A100s?

sujithjoseph avatar sujithjoseph commented on May 21, 2024

Thanks a bunch! With the above changes, I was able to squeeze in a batch size of 24 with tf32 precision and 32 with bf16 mixed-precision settings. Source token max length - 62, target - 512.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

I frequently get the warning below. I assume I can safely ignore it rather than reducing the batch size.

2 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time

sujithjoseph avatar sujithjoseph commented on May 21, 2024

@pacman100 , unfortunately, it errored during the eval phase:

ValueError: cannot insert level_0, already exists

Generate config GenerationConfig {██████████████████████████████████████████████████▍                                                            | 298/528 [5:53:15<4:30:43, 70.63s/it]
  "_from_model_config": true,
  "decoder_start_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 0,
  "transformers_version": "4.27.0.dev0",
  "use_cache": false
  }

 57%|██████████████████████████████████████████████████████████████████████████████▋                                                            | 299/528 [5:54:26<4:29:36, 70.64s/it]
                                                                                                 │
│ /home/jupyter/t5/flant5/c/cdc_lora_train.py:324 in training_function                             │
│                                                                                                  │
│   321 │   │   │   │   )                                                                          │
│   322 │   │   │   │   if (step+1)%args.tracking_steps==0:                                        │
│   323 │   │   │   │   │   pred_df = pd.concat([pred_df, pd.DataFrame({"decoded_preds": decoded   │
│ ❱ 324 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   "decoded_labels":decoded   │
│   325 │   │   │   │   │   accelerator.print(pred_df)                                             │
│   326 │   │   │   │                                                                              │
│   327 │   │   │   │   #break                                                                     │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py:311 in wrapper                 │
│                                                                                                  │
│   308 │   │   │   │   │   FutureWarning,                                                         │
│   309 │   │   │   │   │   stacklevel=stacklevel,                                                 │
│   310 │   │   │   │   )                                                                          │
│ ❱ 311 │   │   │   return func(*args, **kwargs)                                                   │
│   312 │   │                                                                                      │
│   313 │   │   return wrapper                                                                     │
│   314                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:5799 in reset_index                  │
│                                                                                                  │
│    5796 │   │   │   │   │   │   level_values, lab, allow_fill=True, fill_value=lev._na_value     │
│    5797 │   │   │   │   │   )                                                                    │
│    5798 │   │   │   │                                                                            │
│ ❱  5799 │   │   │   │   new_obj.insert(0, name, level_values)                                    │
│    5800 │   │                                                                                    │
│    5801 │   │   new_obj.index = new_index                                                        │
│    5802 │   │   if not inplace:                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:4414 in insert                       │
│                                                                                                  │
│    4411 │   │   │   )                                                                            │
│    4412 │   │   if not allow_duplicates and column in self.columns:                              │
│    4413 │   │   │   # Should this be a different kind of error??                                 │
│ ❱  4414 │   │   │   raise ValueError(f"cannot insert {column}, already exists")                  │
│    4415 │   │   if not isinstance(loc, int):                                                     │
│    4416 │   │   │   raise TypeError("loc must be int")                                           │
│    4417                                                 

ValueError: cannot insert level_0, already exists

Code snippet

                if (step+1)%args.tracking_steps==0:
                    pred_df = pd.concat([pred_df, pd.DataFrame({"decoded_preds": decoded_preds,
                                                                "decoded_labels":decoded_labels})]).reset_index()
                    accelerator.print(pred_df)

I will try changing it to reset_index(drop=True) to see if that fixes it.
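
For reference, a minimal standalone illustration of why drop=True avoids the error: a plain reset_index() keeps inserting the old index back as a column (first "index", then "level_0"), which eventually collides:

import pandas as pd

df = pd.DataFrame({"decoded_preds": ["a"], "decoded_labels": ["b"]})
for _ in range(5):
    # With drop=True the old index is discarded instead of being inserted as a
    # new column, so repeated calls never hit "cannot insert level_0, already exists".
    df = pd.concat([df, df]).reset_index(drop=True)
print(df.shape)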

JohnGiorgi avatar JohnGiorgi commented on May 21, 2024

Hello @sujithjoseph, for PEFT generate methods one has to pass keyword arguments. Could you try the change below and let us know if that resolves the issue? I will add this point to the caveats.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
-   input_ids, 
+   input_ids=input_ids,
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)

This is already a long thread, so apologies for piling on, but... this change makes it a bit harder to use PEFT with the transformers Seq2SeqTrainer, because its call to model.generate does not pass input_ids as a keyword argument:

generated_tokens = self.model.generate(
    generation_inputs,
    **gen_kwargs,
)

A simple fix would be to update this line of code:

generated_tokens = self.model.generate(
-   generation_inputs,
+   input_ids=generation_inputs,
    **gen_kwargs,
)

I would be happy to PR this to Transformers if this is correct and there's no good reason for generation_inputs to be a positional argument.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

@pacman100 , I now run into a new inference issue, which I didn't see earlier, on a VM with just 1 A100 GPU (40 GB):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 

from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel, PeftConfig, PeftModelForSeq2SeqLM
import torch
from datasets import load_dataset

from transformers import AutoTokenizer
from torch.utils.data import DataLoader
from transformers import default_data_collator,get_linear_schedule_with_warmup
from tqdm import tqdm
from datasets import load_dataset

#from peft import prepare_model_for_training
from peft import prepare_model_for_int8_training

import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
torch.cuda.empty_cache()
torch.cuda.reset_max_memory_allocated()
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
batch_size=1
max_memory = {0: "39GIB", "cpu": "70GB"}

peft_model_id = "model-2-21"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, torch_dtype=torch.bfloat16, max_memory=max_memory)
model = prepare_model_for_int8_training(model)

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

model = PeftModelForSeq2SeqLM.from_pretrained(model, peft_model_id, torch_dtype=torch.bfloat16 ,max_memory=max_memory)

model.to(torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

Error:

RuntimeError: Attempting to deserialize object on CUDA device 6 but torch.cuda.device_count() is 1. Please use 
torch.load with map_location to map your storages to an existing device.

The error is the same even when I use device_map={'':0} or "auto".

It also does not work with device_map={'':0} and load_in_8bit=True.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

model.hf_device_map - this is what I see from the fine-tuned PEFT model. Not sure of the reason behind the difference. This checkpoint was created from an interim checkpoint using:

                if (step+1)%args.tracking_steps==0:
                    pred_df = pd.concat([pred_df, pd.DataFrame({"decoded_preds": decoded_preds,
                                                                "decoded_labels":decoded_labels})]).reset_index(drop=True)
                    accelerator.print(pred_df)
                    #checkpoint at every tracking step
                    accelerator.wait_for_everyone()
                    unwrapped_model = accelerator.unwrap_model(model)
                    unwrapped_model.save_pretrained(
                    args.output_dir,
                    is_main_process=accelerator.is_main_process,
                    save_function=accelerator.save,
                    state_dict=accelerator.get_state_dict(model),
                )
                

@mayank31398  @pacman100 @younesbelkada 

{'shared': 0,
'decoder.embed_tokens': 0,
'encoder': 0,
'decoder.block.0': 0,
'decoder.block.1': 0,
'decoder.block.2': 0,
'decoder.block.3': 1,
'decoder.block.4': 1,
'decoder.block.5': 1,
'decoder.block.6': 1,
'decoder.block.7': 1,
'decoder.block.8': 1,
'decoder.block.9': 1,
'decoder.block.10': 1,
'decoder.block.11': 1,
'decoder.block.12': 1,
'decoder.block.13': 1,
'decoder.block.14': 1,
'decoder.block.15': 1,
'decoder.block.16': 1,
'decoder.block.17': 1,
'decoder.block.18': 1,
'decoder.block.19': 1,
'decoder.block.20': 1,
'decoder.block.21': 1,
'decoder.block.22': 1,
'decoder.block.23': 1,
'decoder.final_layer_norm': 1,
'decoder.dropout': 1,
'lm_head': 1}

Below is from https://huggingface.co/ybelkada/flan-t5-large-financial-phrasebank-lora . It loads up fine without any issues using the same code.

model.hf_device_map

{'base_model.model.shared': 0,
'base_model.model.decoder.embed_tokens': 0,
'base_model.model.encoder': 0,
'base_model.model.decoder.block.0': 0,
'base_model.model.decoder.block.1': 0,
'base_model.model.decoder.block.2': 1,
'base_model.model.decoder.block.3': 1,
'base_model.model.decoder.block.4': 1,
'base_model.model.decoder.block.5': 1,
'base_model.model.decoder.block.6': 1,
'base_model.model.decoder.block.7': 1,
'base_model.model.decoder.block.8': 1,
'base_model.model.decoder.block.9': 1,
'base_model.model.decoder.block.10': 1,
'base_model.model.decoder.block.11': 1,
'base_model.model.decoder.block.12': 1,
'base_model.model.decoder.block.13': 1,
'base_model.model.decoder.block.14': 1,
'base_model.model.decoder.block.15': 1,
'base_model.model.decoder.block.16': 1,
'base_model.model.decoder.block.17': 1,
'base_model.model.decoder.block.18': 1,
'base_model.model.decoder.block.19': 1,
'base_model.model.decoder.block.20': 1,
'base_model.model.decoder.block.21': 1,
'base_model.model.decoder.block.22': 1,
'base_model.model.decoder.block.23': 1,
'base_model.model.decoder.final_layer_norm': 1,
'base_model.model.decoder.dropout': 1,
'base_model.model.lm_head': 1}

sujithjoseph avatar sujithjoseph commented on May 21, 2024

@pacman100 @sujithjoseph @JohnGiorgi @mayank31398 Could you please help with how to convert a PEFT model to ONNX using optimum?

I am also looking for the same info.

sujithjoseph avatar sujithjoseph commented on May 21, 2024

adapters_weights = torch.load(filename) in the PeftModel class: does it need a map_location passed to it as well?
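
If it does not, one possible workaround is sketched below; it assumes the adapter weights file name PEFT used at the time (adapter_model.bin), that set_peft_model_state_dict is exposed by the installed peft version, and that model is the PEFT model constructed as earlier in this thread:

import torch
from peft import set_peft_model_state_dict

# Load the adapter weights onto CPU explicitly so that checkpoints saved from
# another device (e.g. CUDA device 6) can be deserialized on a 1-GPU machine.
adapters_weights = torch.load(
    "model-2-21/adapter_model.bin",  # hypothetical local adapter path
    map_location=torch.device("cpu"),
)
set_peft_model_state_dict(model, adapters_weights)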

github-actions avatar github-actions commented on May 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
