minimal-llama's Introduction

Minimal LLaMA

This repo contains a random assortment of code for running and fine-tuning LLaMA. Many parts are still work in progress. There ought to be more efficient methods of tuning (DeepSpeed / ZeRO, NeoX) than the ones presented here, but folks may find this useful already.

This code was fairly quickly thrown together and may contain many bugs. Feedback is welcome!

Tokenize datasets

First, we tokenize the data so we never have to worry about the tokenizer again. The tokenization script takes in a JSONL file (each row containing the key "text" with the document text), then effectively concatenates the documents, tokenizes them, and slices the result into max_seq_length chunks.

(This is a quick and dirty script that loads the whole dataset into memory.)

python tokenize_dataset.py \
    --tokenizer_path /path/to/tokenizer \
    --jsonl_path /path/to/data.jsonl \
    --save_path /path/to/tokenized_dataset \
    --max_seq_length 512
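
For reference, the core of the script amounts to something like the following sketch (illustrative only, not the actual tokenize_dataset.py; the "input_ids" column name and the use of the datasets library for saving are assumptions):

import json
import datasets
import transformers

# Minimal sketch of the concatenate -> tokenize -> chunk flow.
max_seq_length = 512
tokenizer = transformers.LLaMATokenizer.from_pretrained("/path/to/tokenizer")

token_ids = []
with open("/path/to/data.jsonl") as f:
    for line in f:
        token_ids.extend(tokenizer.encode(json.loads(line)["text"]))

# Slice the concatenated token stream into fixed-length chunks, dropping any
# trailing remainder that doesn't fill a full chunk.
chunks = [
    token_ids[i:i + max_seq_length]
    for i in range(0, len(token_ids) - max_seq_length + 1, max_seq_length)
]
# Column name "input_ids" is an assumption about what the training scripts expect.
datasets.Dataset.from_dict({"input_ids": chunks}).save_to_disk(
    "/path/to/tokenized_dataset"
)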

PEFT Fine-tuning with 8-bit

Requires using the Transformers PR here, based on the fork here. Model weights need to be converted to HF format using the weight conversion script in the PR.

Requires using the PEFT PR here, based on the fork here.

We can fine-tune using the PEFT library, with the model converted to 8-bit. This is based on the guide here.

python finetune_peft.py \
    --model_path /path/to/llama-7b/ \
    --dataset_path /path/to/tokenized_dataset \
    --peft_mode lora \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 2500 \
    --learning_rate 2e-4 \
    --fp16 \
    --logging_steps 10 \
    --output_dir /path/to/save

The above configuration (with max_seq_length=512) uses about 20GB of GPU memory on a single GPU. (With a batch size of 1 and max_seq_length=256, this gets down to about 12GB.)
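
For context on what --peft_mode lora and --lora_rank do, here is a hedged sketch of the corresponding setup using the PEFT library directly (the exact alpha, dropout, target modules, and helper functions may differ from what finetune_peft.py and the linked PEFT PR actually use):

import transformers
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_int8_training

# Load the base model in 8-bit (assumes the Transformers/PEFT PRs above).
model = transformers.LLaMAForCausalLM.from_pretrained(
    "/path/to/llama-7b/", load_in_8bit=True, device_map="auto"
)
model = prepare_model_for_int8_training(model)  # freeze base weights, prep norms/head for 8-bit training

# Rough equivalent of --peft_mode lora --lora_rank 8; alpha, dropout, and
# target modules are illustrative assumptions, not necessarily the script's values.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable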

You can generate with the trained PEFT params using something like the following:

import torch
import transformers
from finetune_peft import get_peft_config, PEFTArguments
from peft import get_peft_model

model_path = ...
peft_path = ...
tokenizer_path = ...

# Default new tensors to fp16 on GPU so the model loads directly onto the GPU
# in half precision.
torch.set_default_tensor_type(torch.cuda.HalfTensor)
model = transformers.LLaMAForCausalLM.from_pretrained(model_path)
peft_config = get_peft_config(peft_args=PEFTArguments(peft_mode="lora"))
model = get_peft_model(model, peft_config)
# The saved checkpoint only contains the PEFT (LoRA) parameters, hence strict=False.
model.load_state_dict(torch.load(peft_path), strict=False)
torch.set_default_tensor_type(torch.cuda.FloatTensor)

tokenizer = transformers.LLaMATokenizer.from_pretrained(tokenizer_path)
batch = tokenizer("The LLaMA language model is", return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        input_ids=batch["input_ids"],
        attention_mask=torch.ones_like(batch["input_ids"]),
        max_length=200,
    )
print(tokenizer.decode(out[0]))

Fine-tuning with Naive Pipeline Parallel

Requires using the Transformers PR here, based on the fork here. Model weights need to be converted to HF format using the weight conversion script in the PR.

For fully fine-tuning larger models, we can use a (very naively implemented) version of pipeline parallelism. This is preferable for models that won't fit on a single GPU.

python finetune_pp.py \
    --model_path /path/to/llama-7b/ \
    --dataset_path /path/to/tokenized_dataset \
    --save_dir /path/to/save \
    --batch_size 4 \
    --gradient_accumulation_steps 2 \
    --save_interval 2000 \
    --num_train_steps 20000

The above configuration uses about 30-35GB of GPU memory per GPU across 8 GPUs.
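
Conceptually, the naive pipeline parallelism here just spreads consecutive decoder layers over the available GPUs and moves the hidden states from device to device during the forward pass. The following is a rough sketch of that idea (illustrative only, not the actual finetune_pp.py implementation):

import torch
import transformers

# Shard the decoder layers evenly across the available GPUs.
model = transformers.LLaMAForCausalLM.from_pretrained(
    "/path/to/llama-7b/", torch_dtype=torch.float16
)
num_gpus = torch.cuda.device_count()
layers = model.model.layers
per_gpu = (len(layers) + num_gpus - 1) // num_gpus

model.model.embed_tokens.to("cuda:0")
for i, layer in enumerate(layers):
    layer.to(f"cuda:{i // per_gpu}")
model.model.norm.to(f"cuda:{num_gpus - 1}")
model.lm_head.to(f"cuda:{num_gpus - 1}")

def forward_hidden_states(input_ids):
    # Hop the activations to whichever device the next layer lives on.
    h = model.model.embed_tokens(input_ids.to("cuda:0"))
    for layer in layers:
        h = layer(h.to(layer.input_layernorm.weight.device))[0]
    h = model.model.norm(h.to(model.model.norm.weight.device))
    return model.lm_head(h)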

PEFT Fine-tuning with 8-bit and Pipeline Parallel

Seems buggy, don't use this yet.

Requires using the Transformers PR here, based on the fork here. Model weights need to be converted to HF format using the weight conversion script in the PR.

Requires using the PEFT PR here, based on the fork here.

Here, we combine PEFT training with pipeline parallelism to train large models. See PEFT Fine-tuning with 8-bit for more details.

python finetune_pp_peft.py \
    --model_path /path/to/llama-30b/ \
    --dataset_path /path/to/tokenized_dataset \
    --save_dir /path/to/save \
    --batch_size 4 \
    --learning_rate 5e-5 \
    --gradient_accumulation_steps 1 \
    --save_interval 2000 \
    --num_train_steps 20000 \
    --peft_mode lora \
    --lora_rank 8

For instance, you can fine-tune LoRA on 65B LLaMA with about 120GB of GPU memory in total (e.g. 15GB each on 8 GPUs, or 60GB each on 2 GPUs) with a batch size of 1 and a sequence length of 512.

Misc Notes

  • I have no idea what hyperparameters are best for fine-tuning.
  • Aside from model parameters + gradients + optimizer states, the hidden activations also take up a big chunk of memory. Shortening max_seq_length is a good way of reducing memory consumption, though I don't really know how much that affects fine-tuning performance either; a rough back-of-envelope estimate is sketched below.
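
As a very rough illustration of why sequence length matters (back-of-envelope only, using assumed LLaMA-7B-ish shapes; real memory use depends on the implementation and on what gets recomputed or cached):

# Back-of-envelope activation estimate for LLaMA-7B-like shapes (assumed values).
batch_size, seq_len = 2, 512
hidden, n_layers, n_heads, ffn = 4096, 32, 32, 11008
bytes_fp16 = 2

# Residual-stream activations grow linearly with seq_len ...
residual = batch_size * seq_len * hidden * n_layers * bytes_fp16
# ... attention score matrices grow quadratically with seq_len ...
attn_scores = batch_size * n_heads * seq_len * seq_len * n_layers * bytes_fp16
# ... and MLP intermediates linearly again, but with the wider ffn dimension.
mlp = batch_size * seq_len * ffn * n_layers * bytes_fp16

print(f"residual ~{residual / 2**30:.2f} GiB, "
      f"attention ~{attn_scores / 2**30:.2f} GiB, "
      f"mlp ~{mlp / 2**30:.2f} GiB")
# Halving seq_len halves the linear terms and quarters the attention term.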

minimal-llama's People

Contributors

zphang


minimal-llama's Issues

PEFT + PP bug?

Hello, could you please elaborate on what "Seems buggy, don't use this yet." means for the 8-bit + pipeline parallel example? What bug is there specifically? Does it affect training results, or is it a tooling issue? I've been waiting to be able to fine-tune the 65B model for a while now, and if there's anything I can do to help test or fix this bug, I'd love some pointers. Thanks!

hi

I have finished training the model and would like to use it for inference. Could you please tell me what peft_path means? Thank you very much.

Fine-tuning with Naive Pipeline Parallel: NaN after optimizer step

Your model does not seem to be able to calculate the gradients of the layers correctly. When I run finetune_pp.py and print the loss during training, after the first optimizer step, the loss becomes the following:

tensor(nan, device='cuda:1', dtype=torch.float16, grad_fn=)

Can you reproduce this on your machine? Otherwise, would you be willing to share your pip freeze so that I can check whether there is a package mismatch?

TypeError: dispatch_model() got an unexpected keyword argument 'offload_index'

Hello, this is a great project.

But I got an error when running the Fine-tuning with 8-bit command:

python finetune_peft.py \
    --model_path LLaMA-hf/llama-7b \
    --dataset_path data \
    --peft_mode lora \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 2500 \
    --learning_rate 2e-4 \
    --fp16 \
    --logging_steps 10 \
    --output_dir output/models

My model files were converted with transformers/models/llama/convert_llama_weights_to_hf.py.

transformers version: a3dfcc02d249cbd14ce9089f57d4040146f3f090

How to correctly load and merge finetuned LLaMA models in different formats?

I am new to NLP and currently exploring the LLaMA model. I understand that there are different formats for this model - the original format and the Hugging Face format. I have fine-tuned the LLaMA model on my dataset using this tool https://github.com/lxe/llama-peft-tuner which is based on minimal-llama, and it saves the models in a certain way (see below):

$ ll llama-peft-tuner/models/csco-llama-7b-peft/
total 16456
drwxrwxr-x 8 lachlan lachlan     4096 May 10 10:42 ./
drwxrwxr-x 5 lachlan lachlan     4096 May 10 10:06 ../
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:21 checkpoint-1000/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:28 checkpoint-1500/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:35 checkpoint-2000/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:42 checkpoint-2500/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:13 checkpoint-500/
drwxrwxr-x 2 lachlan lachlan     4096 May 10 10:42 model-final/
-rw-rw-r-- 1 lachlan lachlan 16814911 May 10 10:42 params.p

$ ll llama-peft-tuner/models/csco-llama-7b-peft/checkpoint-2500/
total 7178936
drwxrwxr-x 2 lachlan lachlan       4096 May 10 10:42 ./
drwxrwxr-x 8 lachlan lachlan       4096 May 10 10:42 ../
-rw-rw-r-- 1 lachlan lachlan   33629893 May 10 10:42 optimizer.pt
-rw-rw-r-- 1 lachlan lachlan 7317523229 May 10 10:42 pytorch_model.bin
-rw-rw-r-- 1 lachlan lachlan      14575 May 10 10:42 rng_state.pth
-rw-rw-r-- 1 lachlan lachlan        557 May 10 10:42 scaler.pt
-rw-rw-r-- 1 lachlan lachlan        627 May 10 10:42 scheduler.pt
-rw-rw-r-- 1 lachlan lachlan      28855 May 10 10:42 trainer_state.json
-rw-rw-r-- 1 lachlan lachlan       3899 May 10 10:42 training_args.bin

I am not quite sure about the relationship between pytorch_model.bin, the original model, and adapter_model.bin. I suppose pytorch_model.bin is in the Hugging Face format. Now, I want to create a .pth model that I can load in https://github.com/juncongmoo/pyllama/tree/main/apps/gradio.

I followed the manual conversion guide at https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Manual-Conversion to merge the weights into Hugging Face format (.bin) or PyTorch format (.pth). I tried treating pytorch_model.bin as the Hugging Face format and modified the code to ignore the LoRA, but I couldn't achieve the desired result. The fine-tuning repository mentioned above provides a way to load the trained model by combining the original model and the learned parameters. I tried to adapt this approach into https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/scripts/merge_llama_with_chinese_lora.py and tried different combinations, but the result either doesn't incorporate the trained parameters or generates meaningless outputs.

Can someone help me understand how to correctly load and merge these models? Any help would be greatly appreciated. Thank you.

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

When running the PEFT fine-tuning with the command:

python finetune_peft.py --model_path ../../LLaMAHF/llama-7b/ --dataset_path ../../tokenizedinstruct/ --peft_mode lora --lora_rank 8 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --max_steps 30000 --learning_rate 2e-4 --fp16 --logging_steps 100 --output_dir ../../LLaMAPEFT/llamapeft-7B/

The fine-tuning does not start. The error is:


  0%|                                                                                                        | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "finetune_peft.py", line 142, in <module>
    main()
  File "finetune_peft.py", line 137, in main
    trainer.train()
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 1628, in train
    return inner_training_loop(
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 1895, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/transformers/trainer.py", line 2637, in training_step
    loss = self.compute_loss(model, inputs)
  File "finetune_peft.py", line 80, in compute_loss
    return model(
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/llama/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
  0%|                                                                                                        | 0/30000 [00:00<?, ?it/s]

Forcing the inputs onto the first CUDA device (see below) does not fix the issue.

    def compute_loss(self, model, inputs, return_outputs=False):
        return model(
            input_ids=inputs["input_ids"].to(f'cuda:{model.device_ids[0]}'),
            attention_mask=torch.ones_like(inputs["input_ids"]).to(f'cuda:{model.device_ids[0]}'),
            labels=inputs["input_ids"].to(f'cuda:{model.device_ids[0]}'),  # HF model does the slicing for us
        ).loss

Does this training process not consider decoder_attention_mask?

I see that:

def model_forward(model, inputs):
    h = inputs
    h = h.to(model.base_model.model.model.embed_tokens.weight.device)
    h = model.base_model.model.model.embed_tokens(h)
    for layer in model.base_model.model.model.layers:
        h = h.to(layer.input_layernorm.weight.device)
        h = layer(h)[0]
    h = h.to(model.base_model.model.model.norm.weight.device)
    h = model.base_model.model.model.norm(h)
    h = model.base_model.model.lm_head(h)
    return h

The output of this model comes from the whole sequence?

Maybe you need to add _prepare_decoder_attention_mask(h) to avoid this...
