
yangjianxin1 / cpm

517 stars, 135 forks, 817 KB

Easy-to-use CPM for Chinese text generation

License: Apache License 2.0

Python 98.27% Shell 1.73%
Topics: cpm, generate, gpt-2, gpt2

cpm's Introduction

Hi there 👋, I'm Yang Jianxin

yangjianxin1's GitHub stats

I'm an NLP engineer interested in large language models. I graduated from SYSU with a master's degree.

In my free time, I like to write technical blogs on [WeChat Official Account: YeungNLP] and [Zhihu: 红雨瓢泼].

🔭 Experiences:

  • Shopee, responsible for building NLP algorithm capabilities for customer service (2022-04 to present)
  • Tencent, responsible for building NLP algorithm capabilities for product understanding (2021-06 to 2022-04)
  • Alibaba, internship (2020-06 to 2020-09)

⚙ Here are some of my public projects:

Project Description Code
Firefly One-stop training for LLMs. Some achievements:
1. firefly-llama2-13b ranked 3rd among all 13B models on the Open LLM Leaderboard, only 0.5 points behind 1st.
2. firefly-llama-30b ranked 10th among all 30B models on the Open LLM Leaderboard, trained with a single V100.
3. firefly-baichuan-13b has over 1.63 million downloads.
4. firefly-qwen1.5-en-7b-dpo improves on the official chat model by 7.21 points.
5. firefly-gemma-7b improves on the official chat model by 9.37 points.
GPT2-chitchat Chinese GPT2 for chitchat
Firefly-LLaMA2-Chinese Chinese Llama2 with an efficient and effective training method.
LongQLoRA Efficient and effective method for extending the context length of Llama2 to 8192 on a single V100. Technical Report
CPM Chinese composition model based on CPM
CLIP-Chinese Chinese CLIP model trained on 1.4 million image-text pairs
ClipCap-Chinese Chinese image captioning model based on CLIP and Mengzi
OFA-Chinese Chinese multi-modal unified pre-training model
LLMPruner Prunes the vocabulary of LLMs to save memory during training.

📁 Here are some of my technical blogs:

cpm's People

Contributors

yangjianxin1


cpm's Issues

Speeding up generation

I set a breakpoint and checked: the code does not use GPT-2's past_key_values mechanism. If the author has time to add it, generation should speed up considerably.
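The caching idea the issue suggests can be sketched as follows. This is a hypothetical illustration, not the repository's code: the `model` callable below is a stand-in for a decoder called in the style of HuggingFace GPT-2's `past_key_values`/`use_cache=True` interface. The prompt is encoded once; every later step feeds only the newly generated token, so the per-step cost no longer grows with the full prefix.

```python
def generate_with_cache(model, input_ids, max_new_tokens):
    # Sketch of incremental decoding with a key/value cache. Assumed
    # (hypothetical) interface: model(ids, past=...) returns (logits, new_past),
    # mirroring GPT-2's past_key_values. Without the cache, every step would
    # re-encode the entire prefix.
    tokens = list(input_ids)
    past = None
    cur = list(input_ids)          # first call encodes the whole prompt
    for _ in range(max_new_tokens):
        logits, past = model(cur, past=past)
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_token)
        cur = [next_token]         # later calls feed only the new token
    return tokens
```

With the real HuggingFace model, `past` corresponds to `outputs.past_key_values` and the loop body would slice `input_ids` down to the last generated token in the same way.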

Lines 53-63 of preprocess.py seem problematic

     1  win_size = args.win_size
     2  step = args.step
     3  start_index = 0
     4  end_index = win_size
     5  data = token_ids[start_index:end_index]
     6  train_list.append(data)
     7  start_index += step
     8  end_index += step
     9  while end_index+50 < len(token_ids):  # only keep the remaining data if its length is at least 50
    10      data = token_ids[start_index:end_index]
    11      train_list.append(data)
    12      start_index += step
    13      end_index += step

Suppose token_ids has length 621.
After line 8 finishes, start_index = 200, end_index = 400, and train_list covers tokens up to 200.
Entering the loop, the first pass through line 13 gives start_index = 400, end_index = 600, and train_list covers tokens up to 400.
The check 600+50 > 621 then exits the loop, so train_list stops at 400 and tokens 400-621 are discarded.

Suppose token_ids has length 651.
After line 8 finishes, start_index = 200, end_index = 400, and train_list covers tokens up to 200.
Entering the loop, the first pass through line 13 gives start_index = 400, end_index = 600, and train_list covers tokens up to 400.
The second pass through line 13 gives start_index = 600, end_index = 800, and train_list covers tokens up to 600.
The check 800+50 > 651 then exits the loop, so train_list stops at 600 and tokens 600-651 are discarded.
This code drops between the last 50 and the last step+50-1 tokens, which does not match the stated rule that remaining data is added to the training set as long as its length is at least 50.
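One possible fix matching the stated intent (keep a tail window whenever it has at least 50 tokens) is sketched below. This is a hypothetical rewrite for illustration, not the repository's code; `sliding_windows` and its parameters are names chosen here:

```python
def sliding_windows(token_ids, win_size, step, min_len=50):
    # Split token_ids into overlapping windows of win_size, advancing by
    # step. Keep the final partial window only if it has at least min_len
    # tokens, so at most min_len - 1 trailing tokens are ever dropped.
    train_list = []
    start_index = 0
    while start_index < len(token_ids):
        data = token_ids[start_index:start_index + win_size]
        if len(data) < min_len:
            break
        train_list.append(data)
        start_index += step
    return train_list
```

Checking the window itself (rather than `end_index+50` against the total length) keeps the 51-token tail in the length-651 example while still dropping the 21-token tail in the length-621 example.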

Fine-tuning OOM problem

I fine-tuned the official large model on two 3090-24G cards. The code catches OOM exceptions, and watching GPU memory I do see it fill up completely at regular intervals. 1. What causes this memory pattern? 2. Is the OOM normal, and does it affect model training?

Text generation is too slow

Hello, is there any way to optimize this? Text generation is currently quite slow.

Crash on "epoch_mean_loss = total_loss / len(train_dataloader)"

I hit the following error:
"
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 298, in
main()
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 294, in main
train(model, logger, train_dataset, args)
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 184, in train
train_loss = train_epoch(
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 149, in train_epoch
epoch_mean_loss = total_loss / len(train_dataloader)
ZeroDivisionError: division by zero
python-BaseException
"

len(train_dataloader) is 0.
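A minimal guard for this crash would fail fast with a clearer message instead of dividing by zero. This is a sketch, assuming (as the traceback suggests) the empty dataloader comes from a wrong data path or preprocessing that filtered out every sample; `epoch_mean_loss` here is a standalone helper, not the repository's function:

```python
def epoch_mean_loss(total_loss, train_dataloader):
    # Avoid ZeroDivisionError when the dataloader is empty. An empty
    # dataloader usually means the data path is wrong or preprocessing
    # filtered out every sample, so surface that instead of crashing later.
    if len(train_dataloader) == 0:
        raise ValueError(
            "train_dataloader is empty; check the data path and "
            "preprocessing settings before training"
        )
    return total_loss / len(train_dataloader)
```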

sentencepiece model

Did you train the sentencepiece model yourself, or download it from somewhere? The official sentencepiece model differs in size from yours, so presumably it is not the official CPM one?

train data

Hello, could you please share the training data?

Is a CPM pretrained model used?

Tsinghua's CPM only released the large and distill weights, not medium or small pretrained weights, so was this project trained from scratch?

Question about the model's input labels

Hello, in a generative model like GPT-2, shouldn't the ground-truth labels be shifted by one position relative to input_ids before the loss is computed? I see that your code does not shift them; why is that? Thanks!
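The likely answer is that the shift happens inside the model: when HuggingFace's GPT2LMHeadModel is given `labels`, it internally aligns `logits[..., :-1, :]` with `labels[..., 1:]`, so passing unshifted labels equal to input_ids is correct. The prediction pairs this alignment produces can be illustrated with a small sketch (`next_token_pairs` is a name chosen here for illustration):

```python
def next_token_pairs(input_ids):
    # The token at position t is trained to predict the token at t+1; this
    # is the alignment GPT2LMHeadModel builds internally from unshifted labels.
    return list(zip(input_ids[:-1], input_ids[1:]))
```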

About question-answering dialogue

Hello, author!
Does this training method work for question-answering scenarios, such as encyclopedia-style knowledge QA?

Training data

Hello, could you please share the training data?

When training with train.py, loading a pretrained model for initialization does not seem to work

When training with train.py, I load a pretrained model for initialization, but it does not seem to take effect. I used this project's essay model files config.json and pytorch_model.bin as the pretrained files, trained a new model, and when I generate text with the new model, the output is still essay content. I am a beginner and not sure where I went wrong.

Training saves one model per epoch instead of a single final model; which epoch should be used for generation?

After training, no single final model file is produced; instead a model is saved automatically after each epoch. For example, after training for 10 epochs, the model directory contains folders epoch1 through epoch10. When generating articles afterwards, which epoch's model should be chosen?
