
yangjianxin1 / cpm

517 stars, 135 forks, 817 KB

Easy-to-use CPM for Chinese text generation

License: Apache License 2.0

Python 98.27% Shell 1.73%
Topics: cpm, generate, gpt-2, gpt2

cpm's Introduction

Hi there 👋, I'm Yang Jianxin

yangjianxin1's GitHub stats

I'm an NLP engineer interested in large language models. I graduated from SYSU with a master's degree.

In my free time, I like to write technical blogs on [WeChat Official Account: YeungNLP] and [Zhihu: 红雨瓢泼].

🔭 Experiences:

  • Shopee, responsible for building NLP algorithm capabilities for customer service (2022-04 to present)
  • Tencent, responsible for building NLP algorithm capabilities for product understanding (2021-06 to 2022-04)
  • Alibaba, internship (2020-06 to 2020-09)

⚙ Here are some of my public projects:

Project Description Code
Firefly One-stop training for LLMs. Some achievements:
1. firefly-llama2-13b ranked 3rd among all 13B models on the Open LLM Leaderboard, only 0.5 points behind 1st.
2. firefly-llama-30b ranked 10th among all 30B models on the Open LLM Leaderboard, trained with a single V100.
3. firefly-baichuan-13b has over 1.63 million downloads.
4. firefly-qwen1.5-en-7b-dpo improves on the official chat model by 7.21 points.
5. firefly-gemma-7b improves on the official chat model by 9.37 points.
GPT2-chitchat Chinese GPT2 for chitchat
Firefly-LLaMA2-Chinese Chinese Llama2 with an efficient and effective training method.
LongQLoRA Efficient and effective method for extending the context length of Llama2 to 8192 on a single V100. Technical Report
CPM Chinese composition model based on CPM
CLIP-Chinese Chinese CLIP model trained on 1.4 million image-text pairs
ClipCap-Chinese Chinese image captioning model based on CLIP and Mengzi
OFA-Chinese Chinese multi-modal unified pre-training model
LLMPruner Prunes the vocabulary of LLMs to save memory during training.

📁 Here are some of my technical blogs:

cpm's People

Contributors

yangjianxin1


cpm's Issues

Speeding up generation

I set a breakpoint and checked: the code does not use GPT-2's past_key_values mechanism. If the author has time to add it, generation should speed up considerably.
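The caching idea the issue suggests can be sketched as follows. This is a hypothetical illustration, not the repository's code: the `model` callable below is a stand-in for a decoder called in the style of HuggingFace GPT-2's `past_key_values`/`use_cache=True` interface. The prompt is encoded once; every later step feeds only the newly generated token, so the per-step cost no longer grows with the full prefix.

```python
def generate_with_cache(model, input_ids, max_new_tokens):
    # Sketch of incremental decoding with a key/value cache. Assumed
    # (hypothetical) interface: model(ids, past=...) returns (logits, new_past),
    # mirroring GPT-2's past_key_values. Without the cache, every step would
    # re-encode the entire prefix.
    tokens = list(input_ids)
    past = None
    cur = list(input_ids)          # first call encodes the whole prompt
    for _ in range(max_new_tokens):
        logits, past = model(cur, past=past)
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_token)
        cur = [next_token]         # later calls feed only the new token
    return tokens
```

With the real HuggingFace model, `past` corresponds to `outputs.past_key_values` and the loop body would slice `input_ids` down to the last generated token in the same way.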

Lines 53-63 of preprocess.py seem problematic

     1  win_size = args.win_size
     2  step = args.step
     3  start_index = 0
     4  end_index = win_size
     5  data = token_ids[start_index:end_index]
     6  train_list.append(data)
     7  start_index += step
     8  end_index += step
     9  while end_index+50 < len(token_ids):  # only keep the remaining data if its length is at least 50
    10      data = token_ids[start_index:end_index]
    11      train_list.append(data)
    12      start_index += step
    13      end_index += step

Suppose token_ids has length 621.
After line 8 finishes, start_index = 200, end_index = 400, and train_list covers tokens up to 200.
Entering the loop, the first pass through line 13 gives start_index = 400, end_index = 600, and train_list covers tokens up to 400.
The check 600+50 > 621 then exits the loop, so train_list stops at 400 and tokens 400-621 are discarded.

Suppose token_ids has length 651.
After line 8 finishes, start_index = 200, end_index = 400, and train_list covers tokens up to 200.
Entering the loop, the first pass through line 13 gives start_index = 400, end_index = 600, and train_list covers tokens up to 400.
The second pass through line 13 gives start_index = 600, end_index = 800, and train_list covers tokens up to 600.
The check 800+50 > 651 then exits the loop, so train_list stops at 600 and tokens 600-651 are discarded.
This code drops between the last 50 and the last step+50-1 tokens, which does not match the stated rule that remaining data is added to the training set as long as its length is at least 50.
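One possible fix matching the stated intent (keep a tail window whenever it has at least 50 tokens) is sketched below. This is a hypothetical rewrite for illustration, not the repository's code; `sliding_windows` and its parameters are names chosen here:

```python
def sliding_windows(token_ids, win_size, step, min_len=50):
    # Split token_ids into overlapping windows of win_size, advancing by
    # step. Keep the final partial window only if it has at least min_len
    # tokens, so at most min_len - 1 trailing tokens are ever dropped.
    train_list = []
    start_index = 0
    while start_index < len(token_ids):
        data = token_ids[start_index:start_index + win_size]
        if len(data) < min_len:
            break
        train_list.append(data)
        start_index += step
    return train_list
```

Checking the window itself (rather than `end_index+50` against the total length) keeps the 51-token tail in the length-651 example while still dropping the 21-token tail in the length-621 example.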

Fine-tuning OOM problem

I fine-tuned the official large model on two 3090-24G cards. The code catches OOM exceptions, and watching GPU memory I do see it fill up completely at regular intervals. 1. What causes this memory pattern? 2. Is the OOM normal, and does it affect model training?

Text generation is too slow

Hello, is there any way to optimize this? Text generation is currently quite slow.

Crash on "epoch_mean_loss = total_loss / len(train_dataloader)"

I hit the following error:
"
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 298, in
main()
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 294, in main
train(model, logger, train_dataset, args)
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 184, in train
train_loss = train_epoch(
File "/home/cx/quick_code/github_sources/CPM-main/train.py", line 149, in train_epoch
epoch_mean_loss = total_loss / len(train_dataloader)
ZeroDivisionError: division by zero
python-BaseException
"

len(train_dataloader) is 0.
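A minimal guard for this crash would fail fast with a clearer message instead of dividing by zero. This is a sketch, assuming (as the traceback suggests) the empty dataloader comes from a wrong data path or preprocessing that filtered out every sample; `epoch_mean_loss` here is a standalone helper, not the repository's function:

```python
def epoch_mean_loss(total_loss, train_dataloader):
    # Avoid ZeroDivisionError when the dataloader is empty. An empty
    # dataloader usually means the data path is wrong or preprocessing
    # filtered out every sample, so surface that instead of crashing later.
    if len(train_dataloader) == 0:
        raise ValueError(
            "train_dataloader is empty; check the data path and "
            "preprocessing settings before training"
        )
    return total_loss / len(train_dataloader)
```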

sentencepiece model

Did you train the sentencepiece model yourself, or download it from somewhere? The official sentencepiece model differs in size from yours, so presumably it is not the official CPM one?

train data

Hello, could you please share the training data?

Is a CPM pretrained model used?

Tsinghua's CPM only released the large and distill weights, not medium or small pretrained weights, so was this project trained from scratch?

Question about the model's input labels

Hello, in a generative model like GPT-2, shouldn't the ground-truth labels be shifted by one position relative to input_ids before the loss is computed? I see that your code does not shift them; why is that? Thanks!
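The likely answer is that the shift happens inside the model: when HuggingFace's GPT2LMHeadModel is given `labels`, it internally aligns `logits[..., :-1, :]` with `labels[..., 1:]`, so passing unshifted labels equal to input_ids is correct. The prediction pairs this alignment produces can be illustrated with a small sketch (`next_token_pairs` is a name chosen here for illustration):

```python
def next_token_pairs(input_ids):
    # The token at position t is trained to predict the token at t+1; this
    # is the alignment GPT2LMHeadModel builds internally from unshifted labels.
    return list(zip(input_ids[:-1], input_ids[1:]))
```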

About question-answering dialogue

Hello, author!
Does this training method work for question-answering scenarios, such as encyclopedia-style knowledge QA?

Training data

Hello, could you please share the training data?

When training with train.py, loading a pretrained model for initialization does not seem to work

When training with train.py, I load a pretrained model for initialization, but it does not seem to take effect. I used this project's essay model files config.json and pytorch_model.bin as the pretrained files, trained a new model, and when I generate text with the new model, the output is still essay content. I am a beginner and not sure where I went wrong.

Training saves one model per epoch instead of a single final model; which epoch should be used for generation?

After training, no single final model file is produced; instead a model is saved automatically after each epoch. For example, after training for 10 epochs, the model directory contains folders epoch1 through epoch10. When generating articles afterwards, which epoch's model should be chosen?
