minimal-llama's Introduction

Minimal LLaMA

This repo contains a random assortment of code for running and fine-tuning LLaMA. Many parts are still work in progress. There ought to be more efficient methods of tuning (DeepSpeed / ZeRO, NeoX) than the ones presented here, but folks may find this useful already.

This code was fairly quickly thrown together and may contain many bugs. Feedback is welcome!

Tokenize datasets

First, we tokenize the data so we never have to worry about the tokenizer again. The tokenization script takes in a JSONL file (each row containing the key "text" with the document text), then effectively concatenates the documents, tokenizes them, and slices the result into max_seq_length chunks.

(This is a quick and dirty script that loads the whole dataset into memory.)

python tokenize_dataset.py \
    --tokenizer_path /path/to/tokenizer \
    --jsonl_path /path/to/data.jsonl \
    --save_path /path/to/tokenized_dataset \
    --max_seq_length 512
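
For reference, the core of the script amounts to something like the following sketch (illustrative only, not the actual tokenize_dataset.py; the "input_ids" column name and the use of the datasets library for saving are assumptions):

import json
import datasets
import transformers

# Minimal sketch of the concatenate -> tokenize -> chunk flow.
max_seq_length = 512
tokenizer = transformers.LLaMATokenizer.from_pretrained("/path/to/tokenizer")

token_ids = []
with open("/path/to/data.jsonl") as f:
    for line in f:
        token_ids.extend(tokenizer.encode(json.loads(line)["text"]))

# Slice the concatenated token stream into fixed-length chunks, dropping any
# trailing remainder that doesn't fill a full chunk.
chunks = [
    token_ids[i:i + max_seq_length]
    for i in range(0, len(token_ids) - max_seq_length + 1, max_seq_length)
]
# Column name "input_ids" is an assumption about what the training scripts expect.
datasets.Dataset.from_dict({"input_ids": chunks}).save_to_disk(
    "/path/to/tokenized_dataset"
)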

PEFT Fine-tuning with 8-bit

Requires using the Transformers PR here, based on the fork here. Model weights need to be converted to HF format using the weight conversion script in the PR.

Requires using the PEFT PR here, based on the fork here.

We can fine-tune using the PEFT library, with the model converted to 8-bit. This is based on the guide here.

python finetune_peft.py \
    --model_path /path/to/llama-7b/ \
    --dataset_path /path/to/tokenized_dataset \
    --peft_mode lora \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 2500 \
    --learning_rate 2e-4 \
    --fp16 \
    --logging_steps 10 \
    --output_dir /path/to/save

The above configuration (with max_seq_length=512) uses about 20GB of GPU memory on a single GPU. (With a batch size of 1 and max_seq_length=256, this gets down to about 12GB.)
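
For context on what --peft_mode lora and --lora_rank do, here is a hedged sketch of the corresponding setup using the PEFT library directly (the exact alpha, dropout, target modules, and helper functions may differ from what finetune_peft.py and the linked PEFT PR actually use):

import transformers
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# Load the base model in 8-bit (assumes the Transformers/PEFT PRs above).
model = transformers.LLaMAForCausalLM.from_pretrained(
    "/path/to/llama-7b/", load_in_8bit=True, device_map="auto"
)
model = prepare_model_for_int8_training(model)  # freeze base weights, prep norms/head for 8-bit training

# Rough equivalent of --peft_mode lora --lora_rank 8; alpha, dropout, and
# target modules are illustrative assumptions, not necessarily the script's values.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable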

You can generate with the trained PEFT params using something like the following:

import torch
import transformers
from finetune_peft import get_peft_config, PEFTArguments
from peft import get_peft_model

model_path = ...
peft_path = ...
tokenizer_path = ...

# Default new tensors to fp16 on GPU so the model loads directly onto the GPU
# in half precision.
torch.set_default_tensor_type(torch.cuda.HalfTensor)
model = transformers.LLaMAForCausalLM.from_pretrained(model_path)
peft_config = get_peft_config(peft_args=PEFTArguments(peft_mode="lora"))
model = get_peft_model(model, peft_config)
# The saved checkpoint only contains the PEFT (LoRA) parameters, hence strict=False.
model.load_state_dict(torch.load(peft_path), strict=False)
torch.set_default_tensor_type(torch.cuda.FloatTensor)

tokenizer = transformers.LLaMATokenizer.from_pretrained(tokenizer_path)
batch = tokenizer("The LLaMA language model is", return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=torch.ones_like(batch["input_ids"]),
        max_length=200,
    )
print(tokenizer.decode(out[0]))

Fine-tuning with Naive Pipeline Parallel

Requires using the Transformers PR here, based on the fork here. Model weights need to be converted to HF format using the weight conversion script in the PR.

For fully fine-tuning larger models, we can use a (very naively implemented) version of pipeline parallelism. This is preferable for models that won't fit on a single GPU.

python finetune_pp.py \
    --model_path /path/to/llama-7b/ \
    --dataset_path /path/to/tokenized_dataset \
    --save_dir /path/to/save \
    --batch_size 4 \
    --gradient_accumulation_steps 2 \
    --save_interval 2000 \
    --num_train_steps 20000

The above configuration uses about 30-35GB of GPU memory per GPU across 8 GPUs.
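
Conceptually, the naive pipeline parallelism here just spreads consecutive decoder layers over the available GPUs and moves the hidden states from device to device during the forward pass. The following is a rough sketch of that idea (illustrative only, not the actual finetune_pp.py implementation):

import torch
import transformers

# Shard the decoder layers evenly across the available GPUs.
model = transformers.LLaMAForCausalLM.from_pretrained(
    "/path/to/llama-7b/", torch_dtype=torch.float16
)
num_gpus = torch.cuda.device_count()
layers = model.model.layers
per_gpu = (len(layers) + num_gpus - 1) // num_gpus

model.model.embed_tokens.to("cuda:0")
for i, layer in enumerate(layers):
    layer.to(f"cuda:{i // per_gpu}")
model.model.norm.to(f"cuda:{num_gpus - 1}")
model.lm_head.to(f"cuda:{num_gpus - 1}")

def forward_hidden_states(input_ids):
    # Hop the activations to whichever device the next layer lives on.
    h = model.model.embed_tokens(input_ids.to("cuda:0"))
    for layer in layers:
        h = layer(h.to(layer.input_layernorm.weight.device))[0]
    h = model.model.norm(h.to(model.model.norm.weight.device))
    return model.lm_head(h)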

PEFT Fine-tuning with 8-bit and Pipeline Parallel

Seems buggy, don't use this yet.

Requires using the Transformers PR here, based on the fork here. Model weights need to be converted to HF format using the weight conversion script in the PR.

Requires using the PEFT PR here, based on the fork here.

Here, we combine PEFT training with pipeline parallelism to train large models. See PEFT Fine-tuning with 8-bit for more details.

python finetune_pp_peft.py \
    --model_path /path/to/llama-30b/ \
    --dataset_path /path/to/tokenized_dataset \
    --save_dir /path/to/save \
    --batch_size 4 \
    --learning_rate 5e-5 \
    --gradient_accumulation_steps 1 \
    --save_interval 2000 \
    --num_train_steps 20000 \
    --peft_mode lora \
    --lora_rank 8

For instance, you can fine-tune LoRA on 65B LLaMA with about 120GB of GPU memory in total (e.g. 15GB each on 8 GPUs, or 60GB each on 2 GPUs) with a batch size of 1 and a sequence length of 512.

Misc Notes

  • I have no idea what hyperparameters are best for fine-tuning.
  • Aside from model parameters + gradients + optimizer states, the hidden activations also take up a big chunk of memory. Shortening max_seq_length is a good way of reducing memory consumption, though I don't really know how much that affects fine-tuning performance either; a rough back-of-envelope estimate is sketched below.
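
As a very rough illustration of why sequence length matters (back-of-envelope only, using assumed LLaMA-7B-ish shapes; real memory use depends on the implementation and on what gets recomputed or cached):

# Back-of-envelope activation estimate for LLaMA-7B-like shapes (assumed values).
batch_size, seq_len = 2, 512
hidden, n_layers, n_heads, ffn = 4096, 32, 32, 11008
bytes_fp16 = 2

# Residual-stream activations grow linearly with seq_len ...
residual = batch_size * seq_len * hidden * n_layers * bytes_fp16
# ... attention score matrices grow quadratically with seq_len ...
attn_scores = batch_size * n_heads * seq_len * seq_len * n_layers * bytes_fp16
# ... and MLP intermediates linearly again, but with the wider ffn dimension.
mlp = batch_size * seq_len * ffn * n_layers * bytes_fp16

print(f"residual ~{residual / 2**30:.2f} GiB, "
      f"attention ~{attn_scores / 2**30:.2f} GiB, "
      f"mlp ~{mlp / 2**30:.2f} GiB")
# Halving seq_len halves the linear terms and quarters the attention term.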

minimal-llama's People

Contributors

zphang


minimal-llama's Issues

PEFT + PP bug?

Hello, could you please elaborate on what "Seems buggy, don't use this yet." means for the 8-bit + pipeline parallel example? What bug is there specifically? Does it affect training results, or is it a tooling issue? I've been waiting to be able to fine-tune the 65B model for a while now, and if there's anything I can do to help test or fix this bug, I'd love some pointers. Thanks!

hi

I have finished training the model and would like to use it for inference. Could you please tell me what peft_path means? Thank you very much.

Fine-tuning with Naive Pipeline Parallel: NaN after optimizer step

Your model does not seem to be able to calculate the gradients of the layers correctly. When I run finetune_pp.py and print the loss during training, after the first optimizer step, the loss becomes the following:

tensor(nan, device='cuda:1', dtype=torch.float16, grad_fn=)

Can you reproduce this on your machine? Otherwise, would you be willing to share your pip freeze so that I can check whether there is a package mismatch?

TypeError: dispatch_model() got an unexpected keyword argument 'offload_index'

Hello, this is a great project.

But I got an error when running the Fine-tuning with 8-bit command:

python finetune_peft.py \
    --model_path LLaMA-hf/llama-7b \
    --dataset_path data \
    --peft_mode lora \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 2500 \
    --learning_rate 2e-4 \
    --fp16 \
    --logging_steps 10 \
    --output_dir output/models

My model files were converted with transformers/models/llama/convert_llama_weights_to_hf.py.

transformers version: a3dfcc02d249cbd14ce9089f57d4040146f3f090

How to correctly load and merge finetuned LLaMA models in different formats?

I am new to NLP and currently exploring the LLaMA model. I understand that there are different formats for this model - the original format and the Hugging Face format. I have fine-tuned the LLaMA model on my dataset using this tool https://github.com/lxe/llama-peft-tuner which is based on minimal-llama, and it saves the models in a certain way (see below):

$ ll llama-peft-tuner/models/csco-llama-7b-peft/
total 16456
drwxrwxr-x 8 lachlan lachlan     4096 May 10 10:42 ./
drwxrwxr-x 5 lachlan lachlan     4096 May 10 10:06 ../
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:21 checkpoint-1000/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:28 checkpoint-1500/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:35 checkpoint-2000/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:42 checkpoint-2500/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:13 checkpoint-500/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:42 model-final/
-rw-rw-r-- 1 lachlan lachlan 16814911 May 10 10:42 params.p

$ ll llama-peft-tuner/models/csco-llama-7b-peft/checkpoint-2500/
total 7178936
drwxrwxr-x 2 lachlan lachlan       4096 May 10 10:42 ./
drwxrwxr-x 8 lachlan lachlan       4096 May 10 10:42 ../
-rw-rw-r-- 1 lachlan lachlan   33629893 May 10 10:42 optimizer.pt
-rw-rw-r-- 1 lachlan lachlan 7317523229 May 10 10:42 pytorch_model.bin
-rw-rw-r-- 1 lachlan lachlan      14575 May 10 10:42 rng_state.pth
-rw-rw-r-- 1 lachlan lachlan        557 May 10 10:42 scaler.pt
-rw-rw-r-- 1 lachlan lachlan        627 May 10 10:42 scheduler.pt
-rw-rw-r-- 1 lachlan lachlan      28855 May 10 10:42 trainer_state.json
-rw-rw-r-- 1 lachlan lachlan       3899 May 10 10:42 training_args.bin

I am not quite sure about the relationship between pytorch_model.bin, the original model, and adapter_model.bin. I suppose pytorch_model.bin is in the Hugging Face format. Now, I want to create a .pth model that I can load in https://github.com/juncongmoo/pyllama/tree/main/apps/gradio.

I followed the manual conversion guide at https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Manual-Conversion to merge the weights into Hugging Face format (.bin) or PyTorch format (.pth). I tried treating pytorch_model.bin as the Hugging Face format and modified the code to ignore the LoRA, but I couldn't achieve the desired result. The fine-tuning repository mentioned above provides a way to load the trained model by combining the original model and the learned parameters. I tried to adapt this approach into https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py and tried different combinations, but the result either doesn't incorporate the trained parameters or generates meaningless outputs.

Can someone help me understand how to correctly load and merge these models? Any help would be greatly appreciated. Thank you.

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

When running the PEFT fine-tuning with the command:

python finetune_peft.py --model_path ../../LLaMAHF/llama-7b/ --dataset_path ../../tokenizedinstruct/ --peft_mode lora --lora_rank 8 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --max_steps 30000 --learning_rate 2e-4 --fp16 --logging_steps 100 --output_dir ../../LLaMAPEFT/llamapeft-7B/

The fine-tuning does not start. The error is:


  0%|                                                                                                        | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "finetune_peft.py", line 142, in <module>
    main()
  File "finetune_peft.py", line 137, in main
    trainer.train()
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 1628, in train
    return inner_training_loop(
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 1895, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 2637, in training_step
    loss = self.compute_loss(model, inputs)
  File "finetune_peft.py", line 80, in compute_loss
    return model(
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
  0%|                                                                                                        | 0/30000 [00:00<?, ?it/s]

Forcing the inputs onto the first CUDA device (see below) does not fix the issue.

    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"].to(f'cuda:{model.device_ids[0]}'),
            attention_mask=torch.ones_like(inputs["input_ids"]).to(f'cuda:{model.device_ids[0]}'),
            labels=inputs["input_ids"].to(f'cuda:{model.device_ids[0]}'),  # HF model does the slicing for us
        ).loss

Does this training process not consider decoder_attention_mask?

I see that:

def model_forward(model, inputs):
    h = inputs
    h = h.to(model.base_model.model.model.embed_tokens.weight.device)
    h = model.base_model.model.model.embed_tokens(h)
    for layer in model.base_model.model.model.layers:
        h = h.to(layer.input_layernorm.weight.device)
        h = layer(h)[0]
    h = h.to(model.base_model.model.model.norm.weight.device)
    h = model.base_model.model.model.norm(h)
    h = model.base_model.model.lm_head(h)
    return h

The output of this model comes from the whole sequence?

Maybe you need to add _prepare_decoder_attention_mask(h) to avoid this...
