tudb-labs / mlora
An Efficient "Factory" to Build Multiple LoRA Adapters
License: Apache License 2.0
Can we use this test report in the README? @mikecovlee @merlintang
I tested three datasets of different sizes with both alpaca-lora and multi-lora-fine-tune.
Each dataset (with identical input sequences and sizes) trains two different LoRA models with two different optimizers, and each optimizer uses the same training hyperparameters.
So alpaca-lora has to be trained twice to produce the two LoRA models, while multi-lora-fine-tune needs only a single run.
The experiment measures end-to-end training latency (excluding model/dataset loading and saving latency).
This report also needs to be added to the README. @merlintang @mikecovlee please review it.
Please provide documentation in the README on how to evaluate the LoRA fine-tuned model.
Please provide documentation for "Merge LoRA weights and export model".
We randomly generated 4 datasets: training sets 1 and 2 are randomly sampled from alpaca-lora, and sets 3 and 4 from spider. Below are the datasets' token-length distributions and total sizes.
Test GPU: A6000
data_set_1: 34000
data_set_2: 17000
data_set_3: 5556
data_set_4: 2700
We will train 8 LoRA models:
data_set_1 with lr = 3e-4 and lr = 1e-4
data_set_2 with lr = 3e-4 and lr = 1e-4
data_set_3 with lr = 3e-4 and lr = 1e-4
data_set_4 with lr = 3e-4 and lr = 1e-4
Train 2 LoRA models in parallel on 2 GPUs:
= one LoRA on gpu0 and the other on gpu1
= equivalent to training the 2 LoRA models serially on one GPU, ignoring the model load time.
We now use checkpointing to save GPU memory. The checkpoint caches each transformer layer's input and runs the forward pass without producing gradients; during backward, the cached input is used to recompute the activations and produce each layer's gradients.
However, I think that if the tensors are large (training multiple LoRA models gives a large total batch size, so the tensors are large enough), the time taken by recomputation will be less than the transfer time (GPU -> CPU and CPU -> GPU). Maybe this method would increase latency but also increase throughput.
I have found some APIs to implement it:
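For illustration, a minimal sketch of the recompute-style checkpoint using torch.utils.checkpoint (my guess at the kind of API meant; the layers/hidden_states names are placeholders, not m-LoRA code):

from torch.utils.checkpoint import checkpoint

def forward_with_recompute(layers, hidden_states):
    # Only each layer's input is kept; the activations inside the layer are
    # dropped after the forward pass and recomputed from the cached input
    # during backward to produce that layer's gradients.
    for layer in layers:
        hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states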
When I train the ChatGLM model, it seems it cannot produce correct results. Can anyone fix it? @waitfor-night
We have an actively developed fork of the official m-LoRA repository, focusing on LoRA + MoE and related improvements, maintained by the authors of m-LoRA.
URL: https://github.com/mikecovlee/mlora
I have been studying LoRA recently, and I noticed that during pre-training the word vectors change as training progresses. However, what about when using LoRA for fine-tuning? Do the word vectors still change, or are only the attention weights updated?
I think this is great work.
Is there any problem with the implementation of this method? Why is the code no longer in the repository?
Fine-tuning multiple LoRA adapters on a single GPU may run into OOM issues. Parameters such as batch_size and cutoff_len must be adjusted carefully, but even this cannot guarantee that OOM is completely avoided. Would it be possible to run a tool first that suggests a reference (or best) configuration for users based on their data?
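As a starting point, a very rough sketch of what such a helper could compute; the linear-scaling assumption and the fp16 byte constant are my own placeholders and would need calibrating against measured peak memory:

def suggest_batch_size(cutoff_len, n_layers, hidden_size, budget_bytes):
    # budget_bytes: GPU memory left after weights, gradients and optimizer states.
    # Very rough heuristic, assuming activation memory grows roughly linearly
    # with batch_size * cutoff_len * n_layers * hidden_size. The constant below
    # is an assumed placeholder; calibrate it against
    # torch.cuda.max_memory_allocated() from a few trial runs.
    bytes_per_token_per_layer = 2 * hidden_size  # assumed fp16 activations
    per_sample = cutoff_len * n_layers * bytes_per_token_per_layer
    return max(1, budget_bytes // per_sample)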
We need a model evaluation method.
Does the framework support multi-GPU training?
I want to use the framework to train a 70B model; however, I did not find the parameter settings or methods for multi-GPU training.
Currently, we cannot use m-LoRA for model inference with adapters; implementing a simple tool for this would be better.
We could provide an example showing how to use our system to fine-tune LLaMA-2 with fewer resources.
https://www.kaggle.com/code/rraydata/multi-lora-example/notebook
Consider replacing the current logging mechanism with a unified logging framework (a sketch follows the sample output below).
[2023-12-11 21:36:41] m-LoRA: NVIDIA CUDA initialized successfully.
[2023-12-11 21:36:41] m-LoRA: Total 1 GPU(s) detected.
[2023-12-11 21:36:41] m-LoRA: Loading model with quantization, bits = 8
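A minimal sketch of a unified logger that reproduces the format above (the setup_logger name is a placeholder, not existing m-LoRA code):

import logging

def setup_logger():
    # Produce the "[YYYY-MM-DD HH:MM:SS] m-LoRA: message" format shown above.
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        fmt="[%(asctime)s] m-LoRA: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S"))
    logger = logging.getLogger("m-LoRA")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

log = setup_logger()
log.info("NVIDIA CUDA initialized successfully.")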
Test data: batch_size = 4, seqlen = 1552, tested with the vicuna-7B model on one GPU.
Vicuna-7B has 32 transformer layers, and checkpointing is used in each layer.
Case 1: 31 layers use the recompute checkpoint and 1 layer uses the offload checkpoint; time cost: 12.8416s
Case 2: all 32 layers use the recompute checkpoint; time cost: 11.3163s
It seems that if we use offloading on one card, it is 1.135 times slower than recomputing.
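For reference, a hedged sketch of how one layer could take the offload path while the others recompute, using torch.autograd.graph.save_on_cpu (PyTorch >= 1.10); the function and argument names are placeholders, not the code used in the test above:

import torch
from torch.utils.checkpoint import checkpoint

def forward_mixed(layers, hidden_states, offload_layer_idx=0):
    for i, layer in enumerate(layers):
        if i == offload_layer_idx:
            # Offload checkpoint: tensors saved for backward are moved to
            # pinned CPU memory and copied back to the GPU during backward.
            with torch.autograd.graph.save_on_cpu(pin_memory=True):
                hidden_states = layer(hidden_states)
        else:
            # Recompute checkpoint: re-run the layer during backward instead
            # of storing its internal activations.
            hidden_states = checkpoint(layer, hidden_states, use_reentrant=False)
    return hidden_states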
Can the LLaVA model be supported?
Use the patch files below to get the baseline performance (alpaca-lora):
transformers/trainer.py
148a149,155
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
> filemode='a',
> format='%(message)s',
> level=flog.DEBUG)
1871a1879,1880
> torch.cuda.reset_peak_memory_stats()
> opti_start_time = time.time()
1894a1904,1909
> opti_end_time = time.time()
> device_str = inputs["input_ids"].device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"optim: {(opti_end_time - opti_start_time):.10f} {alloc_mem} {gpu_utilization}")
> flog.info(f"train: {tr_loss_step}")
2658a2674,2675
> torch.cuda.reset_peak_memory_stats()
> back_start_time = time.time()
2665a2683,2687
> back_end_time = time.time()
> device_str = inputs["labels"].device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"backward: {(back_end_time - back_start_time):.10f} {alloc_mem} {gpu_utilization}")
transformers/models/llama/modeling_llama.py
38a39,46
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
> filemode='a',
> format='%(message)s',
> level=flog.DEBUG)
>
805a814,816
> flog.info(f"data size: {input_ids.shape[0]} {input_ids.shape[1]}")
> torch.cuda.reset_peak_memory_stats()
> forward_start_time = time.time()
816a828,832
> forward_end_time = time.time()
> device_str = input_ids.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"forward: {(forward_end_time - forward_start_time):.10f} {alloc_mem} {gpu_utilization}")
837a854,856
>
> torch.cuda.reset_peak_memory_stats()
> loss_start_time = time.time()
838a858,862
> loss_end_time = time.time()
> device_str = input_ids.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"loss: {(loss_end_time - loss_start_time):.10f} {alloc_mem} {gpu_utilization}")
peft/tuners/lora.py
46a47,53
> import logging as flog
> import os
> import time
> flog.basicConfig(filename="logs.log",
> filemode='a',
> format='%(message)s',
> level=flog.DEBUG)
1148a1156,1157
> torch.cuda.reset_peak_memory_stats()
> base_start_time = time.time()
1149a1159,1163
> base_end_time = time.time()
> device_str = x.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"base: {(base_end_time-base_start_time):.10f} {alloc_mem} {gpu_utilization}")
1153a1168,1169
> torch.cuda.reset_peak_memory_stats()
> lora_start_time = time.time()
1172a1189,1193
> lora_end_time = time.time()
> device_str = x.device
> alloc_mem = torch.cuda.max_memory_allocated(device_str)
> gpu_utilization = torch.cuda.utilization(int(os.environ["CUDA_VISIBLE_DEVICES"]))
> flog.info(f"lora: {(lora_end_time-lora_start_time):.10f} {alloc_mem} {gpu_utilization}")
Why does the GPU CI test only run against the main branch? If I open a PR with broken code, it still just tests the main branch.
@mikecovlee @LianxinGao
Traceback (most recent call last):
File "/home/mikecovlee/work/multi-lora-fine-tune/mlora.py", line 175, in <module>
inference(config, model, tokenizer)
File "/home/mikecovlee/work/multi-lora-fine-tune/mlora.py", line 106, in inference
input_data = mlora.MultiLoraBatchData(
TypeError: MultiLoraBatchData.__init__() got an unexpected keyword argument 'prompts_'
Improve inference functions. @mikecovlee
Dear Author,
Thanks for this great project.
I hit a problem when I tried to run the example code mlora.py with float16.
I use an A100 with 40GB of memory, but it still goes out of memory.
Do you have any clue about this error?
Thanks!
Using aspen.load_llama_tf_weight to load the vicuna-7b-delta-v0 model takes more than 30GB of memory and then causes OOM.
Using utils.convert_hf_to_pth to convert vicuna-7b-delta-v0 to a .pth model and then aspen.load_llama_7b_weight to load the .pth model, an error is reported:
Not use layer model.embed_tokens.weight.
Traceback (most recent call last):
File "/data/glx/code/multi_lora/legacy.py", line 43, in <module>
aspen.load_llama_7b_weight(llama_model, config["base_model"], config["device"])
File "/data/glx/code/multi_lora/aspen/modelloader.py", line 21, in load_llama_7b_weight
layer_id = int(layer_name[:layer_name.find(".")])
ValueError: invalid literal for int() with base 10: 'ayers'
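The 'ayers' in the error suggests the loader strips a fixed-length prefix and slices into 'layers'; a hedged sketch of a more defensive parse (the weight-name layout is inferred from the traceback, not from the actual aspen loader code):

import re

_LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")

def parse_layer_id(weight_name):
    # Extract the transformer-layer index from names like
    # "model.layers.12.self_attn.q_proj.weight"; return None for non-layer
    # tensors such as "model.embed_tokens.weight" so they can be skipped.
    match = _LAYER_RE.search(weight_name)
    return int(match.group(1)) if match else None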
Self-comparison test on the alpaca_data_en_52k dataset with vicuna-7b-v1.1 (GPU: A100), using group_by_length and no checkpointing.
Method 1: using the same configuration file and data, fine-tune two adapters simultaneously.
Method 2: using the same configuration file and data, fine-tune only one adapter.
Method 1 config file:
{
  "cutoff_len": 256,
  "group_by_length": true,
  "expand_right": true,
  "pad_token_id": -1,
  "save_step": 20000,
  "lora": [
    {
      "name": "lora_0",
      "output": "lora_0",
      "optim": "adamw",
      "lr": 1e-4,
      "batch_size": 16,
      "num_epochs": 1,
      "r": 8,
      "alpha": 16,
      "dropout": 0.05,
      "target_modules": {
        "q_proj": true,
        "k_proj": true,
        "v_proj": true,
        "o_proj": true,
        "w1_proj": false,
        "w2_proj": false,
        "w3_proj": false
      },
      "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
      "prompt": "template/template_demo.json"
    },
    {
      "name": "lora_1",
      "output": "lora_1",
      "optim": "adamw",
      "lr": 1e-4,
      "batch_size": 16,
      "num_epochs": 1,
      "r": 8,
      "alpha": 16,
      "dropout": 0.05,
      "target_modules": {
        "q_proj": true,
        "k_proj": true,
        "v_proj": true,
        "o_proj": true,
        "w1_proj": false,
        "w2_proj": false,
        "w3_proj": false
      },
      "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
      "prompt": "template/template_demo.json"
    }
  ]
}
Method 2 config file:
{
  "cutoff_len": 256,
  "group_by_length": true,
  "expand_right": true,
  "pad_token_id": -1,
  "save_step": 20000,
  "lora": [
    {
      "name": "lora_only1",
      "output": "lora_only1",
      "optim": "adamw",
      "lr": 1e-4,
      "batch_size": 16,
      "num_epochs": 1,
      "r": 8,
      "alpha": 16,
      "dropout": 0.05,
      "target_modules": {
        "q_proj": true,
        "k_proj": true,
        "v_proj": true,
        "o_proj": true,
        "w1_proj": false,
        "w2_proj": false,
        "w3_proj": false
      },
      "data": "/data/glx/LLaMA-Efficient-Tuning/data/alpaca_data_en_52k.json",
      "prompt": "template/template_demo.json"
    }
  ]
}
Method 1: time cost: 7h55min, GPU memory cost: 21.74GB
Method 2: time cost: 4h17min, GPU memory cost: 15.86GB
(To produce two adapters, Method 2 would have to run twice, roughly 2 x 4h17min = 8h34min, versus 7h55min for Method 1 training both together.)
As the title says, an example of fine-tuning on two types of data is needed.
We should provide a WebUI so end users can fine-tune their models via multi-LoRA, similar to this: https://modelscope.cn/studios/hiyouga/LLaMA-Board/summary