
chaoyi-wu / finetune_llama


An easy-to-follow guide to fine-tuning LLaMA.


finetune_llama's Introduction

A Chinese Guide to Fine-tuning LLaMA

This project aims to guide Chinese-speaking users through fine-tuning a large language model (LLaMA). It integrates several existing frameworks (Minimal LLaMA, Alpaca, LMFlow), avoids unnecessary wrappers as much as possible, and keeps the code readable.

S1:

Go to the Python_Package directory and install the required peft and transformers packages.

It is recommended to first install the online packages with pip so that all dependencies resolve cleanly, and then run pip install -e . to replace them with the local versions.

Note:
Be sure to install PyTorch with conda! conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.6 -c pytorch -c nvidia
Don't forget to install sentencepiece! pip install sentencepiece
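If Step 1 succeeded, the core packages should import cleanly and CUDA should be visible. A minimal sanity check (not part of the repo) might look like:

```python
# Quick import/CUDA check after installing the Step 1 dependencies.
import torch
import transformers
import peft
import sentencepiece  # noqa: F401  (only imported to confirm it is installed)

print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("transformers", transformers.__version__, "peft", peft.__version__)
```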

S2:

Go to the LLAMA_Model directory and download the model weights from https://huggingface.co/decapoda-research/llama-7b-hf, or download LLaMA from the official release and convert it with convert_llama_weights_to_hf.py.
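Once the weights are converted, they should load through the standard transformers LLaMA classes. A minimal sketch, assuming the converted checkpoint sits in ./llama-7b-hf:

```python
# Load the converted HF-format checkpoint; ./llama-7b-hf is an assumed path
# for whatever directory convert_llama_weights_to_hf.py wrote to.
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf")
model = LlamaForCausalLM.from_pretrained("./llama-7b-hf")
print(model.config.hidden_size, model.config.num_hidden_layers)  # 4096, 32 for the 7B model
```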

S3:

Go to the Data_sample directory and process your data following the examples there.
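For orientation, the repo's tokenize_dataset.py packs all documents into fixed-length chunks (the exact snippet is quoted in the issues section below). A self-contained sketch of that idea, with an illustrative corpus, tokenizer path, and sequence length:

```python
# Sketch of the token-packing approach (mirrors the tokenize_dataset.py snippet
# quoted in the issues below); corpus, paths, and max_seq_length are illustrative.
import numpy as np
import datasets
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf")
max_seq_length = 512

texts = ["first document ...", "second document ..."]  # your raw corpus
all_tokenized = [tokenizer.encode(t, add_special_tokens=False) for t in texts]

# Concatenate all documents separated by EOS/BOS, then cut into equal-length chunks.
all_tokens = [tokenizer.bos_token_id] + [
    tok
    for row in all_tokenized
    for tok in row + [tokenizer.eos_token_id, tokenizer.bos_token_id]
]
truncated = all_tokens[: (len(all_tokens) // max_seq_length) * max_seq_length]
arr = np.array(truncated).reshape(-1, max_seq_length)
datasets.Dataset.from_dict({"input_ids": arr}).save_to_disk("./tokenized_data")
```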

S4:

Modify the relevant parameters in finetune_pp.py or finetune_pp_peft.py (the former fine-tunes all network parameters, the latter fine-tunes only a subset of parameters following LoRA), specify the GPU, and start training.

Note: finetune_pp.py and finetune_pp_peft.py have no multi-GPU acceleration, so training is slow, but they effectively avoid OOM and are well suited for debugging. For faster training, see the FSDP multi-GPU parallelism and DeepSpeed multi-GPU parallelism sections below.
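For the LoRA variant, the core of what finetune_pp_peft.py does is wrap the model with peft adapters so that only a small set of parameters is trained. A minimal sketch with illustrative hyperparameters (not the repo's defaults):

```python
# Wrap LLaMA with LoRA adapters via peft; r, alpha, dropout, and target modules
# are illustrative values, not necessarily what finetune_pp_peft.py uses.
from transformers import LlamaForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = LlamaForCausalLM.from_pretrained("./llama-7b-hf")
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt the attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices require gradients
```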

S5:

Refer to test_sample.py for testing. During testing, avoid using multiple GPUs where possible.
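A minimal single-GPU generation check in the spirit of test_sample.py (the checkpoint path and prompt are placeholders):

```python
# Single-GPU generation sanity check; paths and prompt are placeholders.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf")
model = LlamaForCausalLM.from_pretrained("./llama-7b-hf", torch_dtype=torch.float16).cuda()
model.eval()

inputs = tokenizer("The key idea of LoRA is", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```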

FSDP multi-GPU parallelism:

finetune_pp_peft_trainer_lora.py and finetune_pp_peft_trainer.py use transformers.Trainer to implement simple single-node multi-GPU parallelism; FSDP solves the single-GPU OOM problem and speeds up training significantly.
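Within transformers.Trainer, FSDP is switched on through TrainingArguments. A rough sketch of the relevant arguments (values are illustrative; the trainer scripts in this repo are the authoritative version):

```python
# Illustrative FSDP settings for transformers.Trainer; launch with
# `torchrun --nproc_per_node=8 train.py` so all GPUs participate.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./ckpt",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    bf16=True,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
)
# trainer = Trainer(model=model, args=training_args, train_dataset=train_ds)
# trainer.train()
```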

DeepSpeed multi-GPU parallelism:

Installing the DeepSpeed library:
conda install -c omgarcia gcc-6 to install gcc 6 via conda
conda install -c anaconda libstdcxx-ng to update the gcc runtime libraries
git clone https://github.com/microsoft/DeepSpeed to download the DeepSpeed repository
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . to finish the installation
If your CUDA environment has problems, see issue #2684.

Run sh finetune_pp_peft_trainer_deepspeed.sh to train; pass --lora_used True (or False) to control whether LoRA is used.

The DeepSpeed version supports fast fine-tuning of 33B LLaMA (with LoRA); training time is on par with LMFlow.

In practice: avoid DeepSpeed where possible; by default it uses cpu_offload, which slows training down considerably.
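The cpu_offload behaviour lives in the DeepSpeed config that the trainer script passes in. A sketch of a ZeRO-3 config (values are illustrative, not the repo's defaults) showing exactly the offload knobs discussed above:

```python
# Illustrative ZeRO-3 config; the offload_* blocks are what push optimizer state
# and parameters to CPU (and slow training down). Drop them to keep everything on GPU.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}
# transformers' TrainingArguments(deepspeed=...) accepts either this dict or a JSON file path.
```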

Checkpointing memory-optimized parallelism:

See the implementation in finetune_pp_peft_trainer_checkpointing.sh.

For models beyond 7B, memory pressure is severe. Gradient checkpointing greatly reduces memory usage and allows a larger batch size; with large models and large datasets this enables fast pre-training. The trade-off is a reduced number of optimization steps (larger batches mean fewer steps per epoch).
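Gradient checkpointing is enabled on the HF model itself; a short sketch of the idea (the checkpointing script in this repo is the authoritative version):

```python
# Enable activation recomputation; the KV cache must be off for checkpointing.
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("./llama-7b-hf")  # path is illustrative
model.gradient_checkpointing_enable()
model.config.use_cache = False
```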

Training time statistics:

Timings of various training settings, measured on 4.8M PMCOA papers.

Training uses 8 A100 GPUs by default. Each time, a random 512-token span is sampled from a paper for training, which is equivalent to processing about 2.5B tokens per epoch.
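The 2.5B figure follows from one 512-token span per paper:

```python
# Tokens per epoch: one 512-token span sampled from each of the 4.8M papers.
papers, span_len = 4.8e6, 512
print(papers * span_len / 1e9)  # ≈ 2.46, i.e. roughly 2.5B tokens per epoch
```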

Statistics on S2ORC (4.8M PMCOA papers):

| Model_Size | Batch_Size | Accelerate Strategy | Time/epoch |
|------------|------------|---------------------|------------|
| 13B        | 384        | DS*(Opt&Par)        | ~122h      |
| 7B         | 768        | DS(Opt&Par)         | ~100h      |
| 7B         | 128        | DS(Opt&Par)         | ~100h      |
| 7B         | 384        | DS(Opt)             | ~90h       |
| 7B         | 384        | FSDP_no_cpu         | ~35h       |
| 7B         | 128        | FSDP_no_cpu         | ~36h       |

DS(Opt&Par): optimizer and persistent parameters offloaded to CPU
DS(Opt): optimizer offloaded to CPU
FSDP_no_cpu: no CPU offload involved
Note: CPU involvement slows training down, but at larger scales (e.g. 13B) CPU offload is required to make multi-GPU parallelism fit in memory. The superscript * in the table marks a setting where that strategy is mandatory to avoid OOM.

Parameter settings reference: https://github.com/mosaicml/examples/tree/release/v0.0.4/examples/llm/throughput

Acknowledgements:

Based on the Minimal LLaMA implementation (https://github.com/zphang/minimal-llama), with some bugs fixed.

FSDP support added following Alpaca (https://github.com/tatsu-lab/stanford_alpaca).

DeepSpeed module added following LMFlow (https://github.com/OptimalScale/LMFlow/tree/main/src/lmflow).

LLaMA: Open and Efficient Foundation Language Models -- https://arxiv.org/abs/2302.13971

@article{touvron2023llama,
  title={LLaMA: Open and Efficient Foundation Language Models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}

finetune_llama's People

Contributors

chaoyi-wu


finetune_llama's Issues

How to implement multi-node FSDP

Hello, I saw in the paper that you trained on 32 GPUs during the pretraining stage. How can multi-node training be implemented with Trainer FSDP? For example, if I want to train on 2 nodes with 16 A100s in total, how should I set this up with Trainer, and will the model be sharded across all 16 GPUs?

convert_to_ds_params.py doesn't generate tokenizer

convert_to_ds_params.py only generates the llama-7b folder and the .pt files inside it, but it does not generate a tokenizer.
However, the tokenizer_path parameter of tokenize_dataset.py needs a tokenizer.
How can I get the tokenizer?

After switching the fine-tuning dataset to my own Chinese dataset, I get an error: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Has anyone run into this error?

/opt/conda/conda-bld/pytorch_1682343995622/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [85,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1682343995622/work/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [85,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
           ^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 809, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 690, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
              ^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 686, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 413, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 310, in forward
    query_states = self.q_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/miniconda3/envs/pt2/lib/python3.11/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
  0%|                                                                                                                                                                                                                  | 0/5000 [00:01<?, ?it/s]

attention mask for different documents in dataset chunk

Hi chaoyi,

Thanks for your great work. I have a question about dataset tokenization in the following code.

all_tokens = [1] + [
    tok
    for row in all_tokenized
    for tok in row + [tokenizer.eos_token_id, tokenizer.bos_token_id]
]
truncated_tokens = all_tokens[:(len(all_tokens) // args.max_seq_length) * args.max_seq_length]
arr = np.array(truncated_tokens).reshape(-1, args.max_seq_length)
ds = datasets.Dataset.from_dict({"input_ids": arr})
ds.save_to_disk(args.save_path)

From my understanding, this preprocessing means that different documents may end up in the same data chunk. For example, the first document might take 512 tokens while the second document takes 128 tokens in a chunk of 640 tokens. In this case, I think generation for the second document should not attend to the first document, so we might need an attention mask that masks out the first document when generating the second. Am I correct?

A100 Memory

Thanks for open-sourcing this!

Which version of the A100 was used in the experiments, 40 GB or 80 GB?
