
llmpruner's Introduction

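LLMPruner: A Pruning Tool for Large Language Models

Project Overview

Article on the WeChat official account [YeungNLP]: LLMPruner: A Pruning Tool for Large Language Models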

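LLMPruner is a pruning tool for large language models. It prunes the redundant part of a model's vocabulary, which reduces the parameter count, lowers GPU memory usage, and speeds up training, while retaining the knowledge learned during pretraining.

Large language models (LLMs) have been appearing at a rapid pace. Their results are impressive, but their huge parameter counts put them out of reach of ordinary users. Most of today's LLMs are multilingual, such as LLaMA, mT5, and Bloom. Their vocabularies are very large and account for a substantial share of the model parameters; Bloom, for example, has a vocabulary of about 250,000 tokens. During training, the vocabulary weights consume a large amount of GPU memory, slow training down, and can cause out-of-memory (OOM) errors.

In many downstream tasks, however, only one or two languages are needed; a Chinese-language scenario, for example, generally uses only Chinese and English. We can prune the vocabulary of an LLM and keep only the part we need. This preserves the model's pretrained knowledge and allows fine-tuning on downstream tasks with fewer GPUs, improving training efficiency.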
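The idea can be illustrated with a toy example. The sketch below is not LLMPruner's actual implementation: `prune_embedding` is a hypothetical helper, and the embedding matrix is represented as a plain list of rows. For every token kept in the new vocabulary, it copies that token's row from the original embedding matrix, so kept tokens retain their pretrained vectors while all other rows are dropped.

```python
def prune_embedding(old_vocab, old_embedding, kept_tokens):
    """Build a smaller vocabulary and embedding matrix from the kept tokens.

    old_vocab:     dict mapping token string -> row index in old_embedding
    old_embedding: list of rows, one vector per token id
    kept_tokens:   token strings to keep, listed in their new id order
    """
    new_vocab = {}
    new_embedding = []
    for new_id, token in enumerate(kept_tokens):
        old_id = old_vocab[token]  # raises KeyError if the token is unknown
        new_vocab[token] = new_id
        # Copy the pretrained vector unchanged into its new position.
        new_embedding.append(list(old_embedding[old_id]))
    return new_vocab, new_embedding


# Toy data (hypothetical): four tokens with 2-dimensional vectors.
old_vocab = {"hello": 0, "bonjour": 1, "你好": 2, "hola": 3}
old_embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

new_vocab, new_embedding = prune_embedding(old_vocab, old_embedding, ["hello", "你好"])
print(new_vocab)      # {'hello': 0, '你好': 1}
print(new_embedding)  # [[0.1, 0.2], [0.5, 0.6]]
```

In a real model, the same row selection would presumably also have to be applied to the output projection, and the vocabulary size in the model config would need to be updated to match.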

Pruned Models

Location of the pruned model weights: shared weights

Bloom

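Bloom's vocabulary is pruned to keep the commonly used Chinese and English tokens, reducing it from 250880 to 46145 tokens, which is 18.39% of the original size.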

| Pruned model | Original model | Parameter ratio (pruned / original) |
|---|---|---|
| YeungNLP/bloom-396m-zh | bigscience/bloom-560m | 70.96% |
| YeungNLP/bloom-820m-zh | bigscience/bloom-1b1 | 77.13% |
| YeungNLP/bloom-1b4-zh | bigscience/bloom-1b7 | 81.14% |
| YeungNLP/bloom-2b6-zh | bigscience/bloom-3b | 86.48% |
| YeungNLP/bloom-6b4-zh | bigscience/bloom-7b1 | 90.81% |
| YeungNLP/bloomz-396m-zh | bigscience/bloomz-560m | 70.96% |
| YeungNLP/bloomz-820m-zh | bigscience/bloomz-1b1 | 77.13% |
| YeungNLP/bloomz-1b4-zh | bigscience/bloomz-1b7 | 81.14% |
| YeungNLP/bloomz-2b6-zh | bigscience/bloomz-3b | 86.48% |
| YeungNLP/bloomz-6b4-zh | bigscience/bloomz-7b1 | 90.81% |
| YeungNLP/bloomz-6b4-mt-zh | bigscience/bloomz-7b1-mt | 90.81% |
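The vocabulary figures can be checked with a few lines of arithmetic. The sketch below uses the two vocabulary sizes stated for Bloom and assumes a hidden size of 1024 for bigscience/bloom-560m, which is not stated in this document. It covers only a single embedding matrix and does not attempt to reproduce the whole-model ratios in the table, which also depend on the rest of the network.

```python
ORIGINAL_VOCAB = 250880  # vocabulary size before pruning (from the text)
PRUNED_VOCAB = 46145     # vocabulary size after pruning (from the text)
HIDDEN_SIZE = 1024       # assumption: hidden size of bigscience/bloom-560m


def embedding_params(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


vocab_ratio = PRUNED_VOCAB / ORIGINAL_VOCAB
saved = (embedding_params(ORIGINAL_VOCAB, HIDDEN_SIZE)
         - embedding_params(PRUNED_VOCAB, HIDDEN_SIZE))

print(f"{vocab_ratio:.2%}")  # 18.39%
print(f"{saved:,}")          # 209,648,640
```

Under these assumptions, pruning removes roughly 210 million weights from one embedding matrix of the smallest model, which is why the relative saving is largest for the small checkpoints.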

Usage

Pruning Bloom's vocabulary:

from pruners.vocabulary_pruner import BloomVocabularyPruner

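# Name or path of the model to prune
model_name_or_path = 'bigscience/bloom-560m'
# Name or path of the tokenizer that holds your own (pruned) vocabulary
new_tokenizer_name_or_path = 'YeungNLP/bloom-396m-zh'
# Directory where the pruned model is saved
save_path = 'path-to-save'
pruner = BloomVocabularyPruner()
# Prune the vocabulary
pruner.prune(model_name_or_path, new_tokenizer_name_or_path, save_path)
# Check that the pruned model's output is consistent with the original model's
pruner.check(model_name_or_path, save_path, text='长风破浪会有时')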

Using a pruned model:

from transformers import BloomTokenizerFast, BloomForCausalLM
tokenizer = BloomTokenizerFast.from_pretrained('YeungNLP/bloom-1b4-zh')
model = BloomForCausalLM.from_pretrained('YeungNLP/bloom-1b4-zh')
print(tokenizer.batch_decode(model.generate(tokenizer.encode('长风破浪会有时', return_tensors='pt'))))


llmpruner's Issues

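Vocabulary question

Thank you very much for your open-source work! If I want to build a brand-new vocabulary from my own corpus, what steps does the process involve?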

size issue

I followed the official code to prune the word embeddings of bloom7b, but the resulting model is larger than the officially released one. Why is that?

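Custom vocabulary question

I saw a closed issue about how to generate a custom vocabulary. I tried the method mentioned there to prune the original Bloom vocabulary myself and ran into the following problems:

  1. Following the code provided by [samsha1971], an error is raised. This is probably caused by the merges attribute.
  2. If I set t["model"]["merges"] = [] (an empty list), the result of encode is also empty.

So my question is: how should these problems be handled when rebuilding the vocabulary by hand?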
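One possible approach to the merges problem described above, offered as a sketch rather than a confirmed fix: instead of emptying the merge list, keep only the merges whose two parts and whose merged result all remain in the pruned vocabulary. The helper name `filter_merges` is hypothetical. The code assumes each merge is stored either as a "left right" string or as a two-element list, and that tokens contain no literal spaces.

```python
def filter_merges(merges, kept_vocab):
    """Keep merges whose parts and merged result are all in kept_vocab.

    merges:     list of "left right" strings or [left, right] pairs
    kept_vocab: set (or dict) of token strings that survive pruning
    """
    kept = []
    for merge in merges:
        if isinstance(merge, str):
            left, right = merge.split(" ", 1)
        else:
            left, right = merge
        if left in kept_vocab and right in kept_vocab and left + right in kept_vocab:
            kept.append(merge)  # original order is preserved
    return kept


# Toy data (hypothetical tokens, not taken from Bloom's tokenizer).
merges = ["a b", "ab c", "d e"]
kept_vocab = {"a", "b", "ab", "c", "d"}
print(filter_merges(merges, kept_vocab))  # ['a b']
```

Merge order matters in BPE, so the filtered list keeps the original ordering. A kept token whose intermediate merges were dropped can no longer be produced during encoding, so the kept vocabulary may need to include those intermediate pieces as well.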

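Chinese tokens displayed as garbled text

The Chinese entries in the vocab section of tokenizer.json all appear garbled. How can they be displayed and read correctly?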
"æ¸¡æ±Ł": 42789
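An entry like this is most likely not corrupted. A byte-level BPE tokenizer stores each token as a sequence of UTF-8 bytes and writes every byte as one printable stand-in character, so multi-byte Chinese characters look like mojibake in tokenizer.json. The sketch below assumes the vocabulary uses the GPT-2 style byte-to-character table; `decode_token` is a hypothetical helper name, not part of LLMPruner.

```python
def bytes_to_unicode():
    """GPT-2 style table: byte value -> printable stand-in character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    # Bytes without a printable form are mapped to code points from 256 upward.
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Turn a byte-level vocab entry back into readable text."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # A single token may hold only part of a character, hence errors="replace".
    return raw.decode("utf-8", errors="replace")


print(decode_token("æ¸¡æ±Ł"))  # 渡江
```

With the Hugging Face tokenizer loaded, calling `tokenizer.decode` on the token id should give the same readable text without any manual table.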
