iflytek / cino

207.0 4.0 28.0 22.25 MB

CINO: Pre-trained Language Models for Chinese Minority (少数民族语言预训练模型)

Home Page: http://cino.hfl-rc.com

License: Apache License 2.0

Python 100.00%

nlp pytorch xlmr xlm-roberta chinese-nlp transformers cino

cino's Issues

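Preprocessing of minority languages

Hello. Unlike Chinese, characters in the minority languages are built from several components. What preprocessing did you apply to these languages?
For example, is Tibetan modeled at the level of glyph stacks (字丁, each encoded separately), syllables, or some other text granularity?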
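
To make the granularities in this question concrete, here is a minimal sketch that splits a Tibetan string into syllables and into Unicode code points. It is illustrative only and is not CINO's preprocessing: the helper names are hypothetical, and the only fact relied on is that Tibetan syllables are delimited by the tsheg mark (U+0F0B).

```python
from typing import List

# Illustrative only: this is NOT CINO's actual preprocessing.
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def split_syllables(text: str) -> List[str]:
    """Split Tibetan text into syllables on the tsheg mark."""
    return [s for s in text.split(TSHEG) if s]


def split_code_points(text: str) -> List[str]:
    """Split text into individual Unicode code points."""
    return list(text)


# "bod yig" (Tibetan script), written with escapes to avoid encoding issues.
sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f61\u0f72\u0f42"

print(split_syllables(sample))         # two syllables
print(len(split_code_points(sample)))  # seven code points, tsheg included
```

A subword tokenizer sits between these two extremes: it learns units that may be shorter than a syllable or span several code points.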

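Hello,
I set up the environment as described in the documentation and tried to reproduce the fine-tuning experiment from your open-source example on the TNCC dataset. Because of hardware limits I only reduced batch_size and left everything else unchanged. I get the following error: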
Traceback (most recent call last):
  File "G:/少数/CINO/Chinese-Minority-PLM-main/examples/TNCC/finetune/run_finetune.py", line 221, in <module>
    main()
  File "G:/少数/CINO/Chinese-Minority-PLM-main/examples/TNCC/finetune/run_finetune.py", line 218, in main
    trainer.run_finetune()
  File "G:/少数/CINO/Chinese-Minority-PLM-main/examples/TNCC/finetune/run_finetune.py", line 200, in run_finetune
    self.train(model, train_loader, dev_loader, optimizer, schedule)
  File "G:/少数/CINO/Chinese-Minority-PLM-main/examples/TNCC/finetune/run_finetune.py", line 169, in train
    dev_true, dev_pred = self.predict(model, valid_loader)
  File "G:/少数/CINO/Chinese-Minority-PLM-main/examples/TNCC/finetune/run_finetune.py", line 137, in predict
    test_true.extend(y.squeeze().cpu().numpy().tolist())
TypeError: 'int' object is not iterable

Process finished with exit code 1

Could you take a look at what is causing this?
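
One plausible cause, assuming the new batch_size leaves a final batch with a single sample: squeeze() removes every size-1 dimension, so a batch of one label becomes a 0-d array and tolist() returns a plain int, which extend() cannot iterate. The sketch below reproduces this with NumPy alone; the helper names are hypothetical and this is not the repository's code.

```python
import numpy as np


def labels_to_list_buggy(y):
    # squeeze() drops ALL size-1 dims; a batch of one becomes 0-d,
    # and tolist() on a 0-d array returns a scalar, not a list.
    return y.squeeze().tolist()


def labels_to_list_safe(y):
    # reshape(-1) always yields a 1-d array, whatever the batch size.
    return y.reshape(-1).tolist()


full_batch = np.array([[2], [0], [5]])  # shape (3, 1)
last_batch = np.array([[3]])            # shape (1, 1): leftover batch

collected = []
collected.extend(labels_to_list_safe(full_batch))
collected.extend(labels_to_list_safe(last_batch))
print(collected)

# The buggy helper returns an int for the leftover batch.
print(type(labels_to_list_buggy(last_batch)))
```

PyTorch tensors behave the same way, so replacing y.squeeze() with y.reshape(-1) or y.view(-1) in predict is one possible fix. Choosing a batch size that divides the dev set evenly would only hide the problem.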

What is the pre-training dataset?

The paper does not seem to mention what the pre-training dataset is.

Source code and dataset

Hello! I am a student doing NLP research. Are the source code or the training corpus of the CINO project publicly available?

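Wiki-Chinese-Minority experiment

Hello. Why are the languages other than Chinese described as zero-shot tests in the Wiki-Chinese-Minority experiment? The pre-trained model also covers those minority languages, so shouldn't this count as fine-tuning?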

What is the difference between a multilingual model and a cross-lingual model?

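Problem summary

Hello. While reproducing the example I made sure my package versions matched yours, but running it in the terminal produced the error below, which I have not been able to resolve. Could you help me find the problem? Sorry for the trouble!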

python run_finetune.py --params cino-params.json

Traceback (most recent call last):
  File "run_finetune.py", line 188, in <module>
    main()
  File "run_finetune.py", line 182, in main
    config = CINO_FT_Configer(load(args.hparam))
AttributeError: 'Namespace' object has no attribute 'hparam'
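
The traceback suggests a naming mismatch: the command passes --params and argparse accepted it, yet the script reads args.hparam. By default argparse stores an option under the name derived from its flag, so --params ends up as args.params. A minimal sketch of the mismatch and one fix follows; the parser here is hypothetical and is not the repository's actual argument definition.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # The attribute name comes from the flag: "--params" -> args.params.
    # Alternatively, pass dest="hparam" here to keep args.hparam working.
    parser.add_argument("--params", type=str, required=True,
                        help="path to the JSON hyperparameter file")
    return parser


# Simulate: python run_finetune.py --params cino-params.json
args = build_parser().parse_args(["--params", "cino-params.json"])

print(args.params)              # the value passed on the command line
print(hasattr(args, "hparam"))  # no such attribute on the Namespace
```

Either read args.params in main(), or declare the option with dest="hparam"; both keep the flag and the attribute consistent.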

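About the pre-training data

Hello,
I would like to ask two things:
(1) Does the pre-trained model cover only the following languages, or does it also include the hundred or so languages of the original XLM-R?
Chinese (zh), Tibetan (bo), Mongolian (Uighur form) (mn), Uyghur (ug), Kazakh (Arabic form) (kk), Korean (ko), Zhuang, Cantonese (yue)
(2) How large is the minority-language data used for pre-training?
I look forward to your reply.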

A small error

Line 95 in wcm_zeroshot.py needs to be modified: test_pred has the wrong shape.

CMNews dataset

Can the CMNews dataset be open sourced?

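How can I continue training from the model, for example on monolingual data?

Hello. Could you explain how to continue training the model on a particular language, for example with my own Chinese, Tibetan, or Mongolian data?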

Zero-shot example is missing the file best_cino.pth

"model_finetune_params":"model/best_cino.pth",
Did I perhaps miss a download?

best_cino.pth

Where can I find the best_cino.pth file? Or do we have to build it ourselves?

The gradient_acc parameter

Could you explain the gradient_acc parameter?
The code contains the line loss /= self.config.gradient_acc, which uses gradient_acc, and I don't quite understand what that line does.
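
A likely reading, assuming gradient_acc is the number of micro-batches whose gradients are accumulated before each optimizer step (the usual meaning of gradient accumulation; the repository's loop is not shown here): dividing each micro-batch loss by gradient_acc makes the summed gradients equal the gradient of the mean loss over the combined batch. The sketch below checks this with a hand-derived gradient for a toy one-parameter model; all names and data are hypothetical.

```python
def grad_mse(w, batch):
    """d/dw of the mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical data: four (x, y) pairs split into two micro-batches.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5
gradient_acc = 2
micro_batches = [data[0:2], data[2:4]]

accumulated = 0.0
for mb in micro_batches:
    # Mirrors "loss /= gradient_acc" followed by loss.backward():
    # each micro-batch contributes 1/gradient_acc of its mean gradient.
    accumulated += grad_mse(w, mb) / gradient_acc

# Gradient of the mean loss over the full batch of four samples.
full = grad_mse(w, data)

print(abs(accumulated - full) < 1e-9)
```

Without the division, the accumulated gradient would be gradient_acc times larger, which acts like an unintended increase in learning rate. The equality holds exactly only when the micro-batches have equal size.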

About the technical details

When will the technical details of training be published?

Error with the TNCC example

Some weights of XLMRobertaModel were not initialized from the model checkpoint at model/ and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "tncc_finetune.py", line 175, in <module>
    main()
  File "tncc_finetune.py", line 172, in main
    trainer.run_finetune()
  File "tncc_finetune.py", line 155, in run_finetune
    self.train(model, train_loader, dev_loader, optimizer, schedule)
  File "tncc_finetune.py", line 120, in train
    loss.backward()
  File "/data/anbo/anaconda3/envs/transformer/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/data/anbo/anaconda3/envs/transformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.

The versions of torch and the other packages also match the requirements:
sacremoses==0.0.53
scikit-learn==0.24.2
scipy==1.7.3
sentencepiece==0.1.97
six @ file:///tmp/build/80754af9/six_1644875935023/work
threadpoolctl==3.1.0
tokenizers==0.8.1rc2
torch==1.7.1
torchaudio==0.12.1
torchvision==0.13.1
tqdm==4.64.1
transformers==3.1.0

How should I handle this?
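
The assertion t >= 0 && t < n_classes in ClassNLLCriterion means some target label lies outside the range [0, n_classes). Typical causes are 1-based labels or a class count in the config that is smaller than the number of distinct labels in the data. A quick check before training could look like the sketch below; the function name, the labels, and the class count are hypothetical.

```python
from typing import Iterable, List


def find_bad_labels(labels: Iterable[int], n_classes: int) -> List[int]:
    """Return the distinct labels that fall outside [0, n_classes)."""
    return sorted({t for t in labels if not 0 <= t < n_classes})


# Hypothetical labels that were numbered from 1 instead of 0.
labels = [1, 2, 12, 5, 3, 12]
n_classes = 12

bad = find_bad_labels(labels, n_classes)
print(bad)  # any value listed here would trigger the device-side assert
```

If the list is non-empty, remap the labels to start at 0 or raise the number of classes. Running once on CPU, or with the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a clearer error location than the asynchronous CUDA assert.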
