tudb-labs / mlora
An Efficient "Factory" to Build Multiple LoRA Adapters
License: Apache License 2.0
Can we use this test report in the README? @mikecovlee @merlintang
I tested three datasets of different sizes with both alpaca-lora and multi-lora-fine-tune.
Each dataset (with identical input sequences and sizes) trains two different LoRA models with two different optimizers, and each optimizer uses the same training hyperparameters.
So alpaca-lora has to be trained twice to produce the two LoRA models, while multi-lora-fine-tune needs only a single run.
The experiment measures end-to-end training latency (excluding model/dataset loading and saving latency).
This report also needs to be added to the README. @merlintang @mikecovlee please review it.
Please provide documentation in the README on how to evaluate the LoRA fine-tuned model.
Please provide documentation for "Merge LoRA weights and export model".
We randomly generated 4 datasets: training sets 1 and 2 are randomly sampled from alpaca-lora, and sets 3 and 4 from spider. Below are the datasets' token-length distributions and total sizes.
Test GPU: A6000
data_set_1: 34000
data_set_2: 17000
data_set_3: 5556
data_set_4: 2700
We will train 8 LoRA models:
data_set_1 with lr = 3e-4 and lr = 1e-4
data_set_2 with lr = 3e-4 and lr = 1e-4
data_set_3 with lr = 3e-4 and lr = 1e-4
data_set_4 with lr = 3e-4 and lr = 1e-4
Train 2 LoRA models in parallel on 2 GPUs:
= one LoRA on gpu0 and the other on gpu1
= equivalent to training the 2 LoRA models serially on one GPU, ignoring the model load time.
We now use checkpointing to save GPU memory. The checkpoint caches each transformer layer's input and runs the forward pass without producing gradients; during backward, the cached input is used to recompute the activations and produce each layer's gradients.
However, I think that if the tensors are large (training multiple LoRA models gives a large total batch size, so the tensors are large enough), the time taken by recomputation will be less than the transfer time (GPU -> CPU and CPU -> GPU). Maybe this method would increase latency but also increase throughput.
I have found some APIs to implement it:
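For illustration, a minimal sketch of the recompute-style checkpoint using torch.utils.checkpoint (my guess at the kind of API meant; the layers/hidden_states names are placeholders, not m-LoRA code):

from torch.utils.checkpoint import checkpoint

def forward_with_recompute(layers, hidden_states):
    # Only each layer's input is kept; the activations inside the layer are
    # dropped after the forward pass and recomputed from the cached input
    # during backward to produce that layer's gradients.
    for layer in layers:
        hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states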
When I train the ChatGLM model, it seems it cannot produce correct results. Can anyone fix it? @waitfor-night
We have an actively developed fork of the official m-LoRA repository, focusing on LoRA + MoE and related improvements, maintained by the authors of m-LoRA.
URL: https://github.com/mikecovlee/mlora
I have been studying LoRA recently, and I noticed that during pre-training the word vectors change as training progresses. However, what about when using LoRA for fine-tuning? Do the word vectors still change, or are only the attention weights updated?
I think this is great work.
Is there any problem with the implementation of this method? Why is the code no longer in the repository?
Fine-tuning multiple LoRA adapters on a single GPU may run into OOM issues. Parameters such as batch_size and cutoff_len must be adjusted carefully, but even this cannot guarantee that OOM is completely avoided. Would it be possible to run a tool first that suggests a reference (or best) configuration for users based on their data?
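As a starting point, a very rough sketch of what such a helper could compute; the linear-scaling assumption and the fp16 byte constant are my own placeholders and would need calibrating against measured peak memory:

def suggest_batch_size(cutoff_len, n_layers, hidden_size, budget_bytes):
    # budget_bytes: GPU memory left after weights, gradients and optimizer states.
    # Very rough heuristic, assuming activation memory grows roughly linearly
    # with batch_size * cutoff_len * n_layers * hidden_size. The constant below
    # is an assumed placeholder; calibrate it against
    # torch.cuda.max_memory_allocated() from a few trial runs.
    bytes_per_token_per_layer = 2 * hidden_size  # assumed fp16 activations
    per_sample = cutoff_len * n_layers * bytes_per_token_per_layer
    return max(1, budget_bytes // per_sample)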
We need a model evaluation method.
Does the framework support multi-GPU training?
I want to use the framework to train a 70B model; however, I did not find the parameter settings or methods for multi-GPU training.
Currently, we cannot use m-LoRA for model inference with adapters; implementing a simple tool for this would be better.
We could provide an example showing how to use our system to fine-tune LLaMA-2 with fewer resources.
https://www.kaggle.com/code/rraydata/multi-lora-example/notebook
Consider replacing the current logging mechanism with a unified logging framework (a sketch follows the sample output below).
[2023-12-11 21:36:41] m-LoRA: NVIDIA CUDA initialized successfully.
[2023-12-11 21:36:41] m-LoRA: Total 1 GPU(s) detected.
[2023-12-11 21:36:41] m-LoRA: Loading model with quantization, bits = 8
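A minimal sketch of a unified logger that reproduces the format above (the setup_logger name is a placeholder, not existing m-LoRA code):

import logging

def setup_logger():
    # Produce the "[YYYY-MM-DD HH:MM:SS] m-LoRA: message" format shown above.
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        fmt="[%(asctime)s] m-LoRA: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S"))
    logger = logging.getLogger("m-LoRA")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

log = setup_logger()
log.info("NVIDIA CUDA initialized successfully.")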
Test data: batch_size = 4, seqlen = 1552, tested with the vicuna-7B model on one GPU.
Vicuna-7B has 32 transformer layers, and checkpointing is used in each layer.
Case 1: 31 layers use the recompute checkpoint and 1 layer uses the offload checkpoint; time cost: 12.8416s
Case 2: all 32 layers use the recompute checkpoint; time cost: 11.3163s
It seems that if we use offloading on one card, it is 1.135 times slower than recomputing.
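For reference, a hedged sketch of how one layer could take the offload path while the others recompute, using torch.autograd.graph.save_on_cpu (PyTorch >= 1.10); the function and argument names are placeholders, not the code used in the test above:

import torch
from torch.utils.checkpoint import checkpoint

def forward_mixed(layers, hidden_states, offload_layer_idx=0):
    for i, layer in enumerate(layers):
        if i == offload_layer_idx:
            # Offload checkpoint: tensors saved for backward are moved to
            # pinned CPU memory and copied back to the GPU during backward.
            with torch.autograd.graph.save_on_cpu(pin_memory=True):
                hidden_states = layer(hidden_states)
        else:
            # Recompute checkpoint: re-run the layer during backward instead
            # of storing its internal activations.
            hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states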
Can the LLaVA model be supported?
Use the patch files below to get the baseline performance (alpaca-lora):
transformers/trainer.py
148a149,155
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
> filemode='a',
> format='%(message)s',
> level=flog.DEBUG)
1871a1879,1880
> torch.cuda.reset_peak_memory_stats()
> opti_start_time = time.time()
1894a1904,1909
> opti_end_time = time.time()
> device_str = inputs["input_ids"].device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"optim: {(opti_end_time - opti_start_time):.10f} {alloc_mem} {gpu_utilization}")
> flog.info(f"train: {tr_loss_step}")
2658a2674,2675
> torch.cuda.reset_peak_memory_stats()
> back_start_time = time.time()
2665a2683,2687
> back_end_time = time.time()
> device_str = inputs["labels"].device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"backward: {(back_end_time - back_start_time):.10f} {alloc_mem} {gpu_utilization}")
transformers/models/llama/modeling_llama.py
38a39,46
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
> filemode='a',
> format='%(message)s',
> level=flog.DEBUG)
>
805a814,816
> flog.info(f"data size: {input_ids.shape[0]} {input_ids.shape[1]}")
> torch.cuda.reset_peak_memory_stats()
> forward_start_time = time.time()
816a828,832
> forward_end_time = time.time()
> device_str = input_ids.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"forward: {(forward_end_time - forward_start_time):.10f} {alloc_mem} {gpu_utilization}")
837a854,856
>
> torch.cuda.reset_peak_memory_stats()
> loss_start_time = time.time()
838a858,862
> loss_end_time = time.time()
> device_str = input_ids.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"loss: {(loss_end_time - loss_start_time):.10f} {alloc_mem} {gpu_utilization}")
peft/tuners/lora.py
46a47,53
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
> filemode='a',
> format='%(message)s',
> level=flog.DEBUG)
1148a1156,1157
> torch.cuda.reset_peak_memory_stats()
> base_start_time = time.time()
1149a1159,1163
> base_end_time = time.time()
> device_str = x.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"base: {(base_end_time-base_start_time):.10f} {alloc_mem} {gpu_utilization}")
1153a1168,1169
> torch.cuda.reset_peak_memory_stats()
> lora_start_time = time.time()
1172a1189,1193
> lora_end_time = time.time()
> device_str = x.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"lora: {(lora_end_time-lora_start_time):.10f} {alloc_mem} {gpu_utilization}")
Why does the GPU CI test only run against the main branch? If I open a PR with broken code, it still just tests the main branch.
@mikecovlee @LianxinGao
Traceback (most recent call last):
File "/home/mikecovlee/work/multi-lora-fine-tune/mlora.py", line 175, in <module>
inference(config, model, tokenizer)
File "/home/mikecovlee/work/multi-lora-fine-tune/mlora.py", line 106, in inference
input_data = mlora.MultiLoraBatchData(
TypeError: MultiLoraBatchData.__init__() got an unexpected keyword argument 'prompts_'
Improve inference functions. @mikecovlee
Dear Author,
Thanks for this great project.
I hit a problem when I tried to run the example code mlora.py with float16.
I use an A100 with 40GB of memory, but it still goes out of memory.
Do you have any clue about this error?
Thanks!
Using aspen.load_llama_tf_weight to load the vicuna-7b-delta-v0 model takes more than 30GB of memory and then causes OOM.
Using utils.convert_hf_to_pth to convert vicuna-7b-delta-v0 to a .pth model and then aspen.load_llama_7b_weight to load the .pth model, an error is reported:
Not use layer model.embed_tokens.weight.
Traceback (most recent call last):
File "/data/glx/code/multi_lora/legacy.py", line 43, in <module>
aspen.load_llama_7b_weight(llama_model, config["base_model"], config["device"])
File "/data/glx/code/multi_lora/aspen/modelloader.py", line 21, in load_llama_7b_weight
layer_id = int(layer_name[:layer_name.find(".")])
ValueError: invalid literal for int() with base 10: 'ayers'
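The 'ayers' in the error suggests the loader strips a fixed-length prefix and slices into 'layers'; a hedged sketch of a more defensive parse (the weight-name layout is inferred from the traceback, not from the actual aspen loader code):

import re

_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")

def parse_layer_id(weight_name):
    # Extract the transformer-layer index from names like
    # "model.layers.12.self_attn.q_proj.weight"; return None for non-layer
    # tensors such as "model.embed_tokens.weight" so they can be skipped.
    match = _LAYER_RE.search(weight_name)
    return int(match.group(1)) if match else None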
Self-comparison test on the alpaca_data_en_52k dataset with vicuna-7b-v1.1 (GPU: A100), using group_by_length and no checkpointing.
Method 1: using the same configuration file and data, fine-tune two adapters simultaneously.
Method 2: using the same configuration file and data, fine-tune only one adapter.
Method 1 config file:
{
  "cutoff_len": 256,
  "group_by_length": true,
  "expand_right": true,
  "pad_token_id": -1,
  "save_step": 20000,
  "lora": [
    {
      "name": "lora_0",
      "output": "lora_0",
      "optim": "adamw",
      "lr": 1e-4,
      "batch_size": 16,
      "num_epochs": 1,
      "r": 8,
      "alpha": 16,
      "dropout": 0.05,
      "target_modules": {
        "q_proj": true,
        "k_proj": true,
        "v_proj": true,
        "o_proj": true,
        "w1_proj": false,
        "w2_proj": false,
        "w3_proj": false
      },
      "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
      "prompt": "template/template_demo.json"
    },
    {
      "name": "lora_1",
      "output": "lora_1",
      "optim": "adamw",
      "lr": 1e-4,
      "batch_size": 16,
      "num_epochs": 1,
      "r": 8,
      "alpha": 16,
      "dropout": 0.05,
      "target_modules": {
        "q_proj": true,
        "k_proj": true,
        "v_proj": true,
        "o_proj": true,
        "w1_proj": false,
        "w2_proj": false,
        "w3_proj": false
      },
      "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
      "prompt": "template/template_demo.json"
    }
  ]
}
Method 2 config file:
{
  "cutoff_len": 256,
  "group_by_length": true,
  "expand_right": true,
  "pad_token_id": -1,
  "save_step": 20000,
  "lora": [
    {
      "name": "lora_only1",
      "output": "lora_only1",
      "optim": "adamw",
      "lr": 1e-4,
      "batch_size": 16,
      "num_epochs": 1,
      "r": 8,
      "alpha": 16,
      "dropout": 0.05,
      "target_modules": {
        "q_proj": true,
        "k_proj": true,
        "v_proj": true,
        "o_proj": true,
        "w1_proj": false,
        "w2_proj": false,
        "w3_proj": false
      },
      "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
      "prompt": "template/template_demo.json"
    }
  ]
}
Method 1: time cost: 7h55min, GPU memory cost: 21.74GB
Method 2: time cost: 4h17min, GPU memory cost: 15.86GB
(To produce two adapters, Method 2 would have to run twice, roughly 2 x 4h17min = 8h34min, versus 7h55min for Method 1 training both together.)
As the title says, an example of fine-tuning on two types of data is needed.
We should provide a WebUI so end users can fine-tune their models via multi-LoRA, similar to this: https://modelscope.cn/studios/hiyouga/LLaMA-Board/summary