
TextBox 2.0 is a text generation library with pre-trained language models

Home Page: https://github.com/RUCAIBox/TextBox

License: MIT License

Python 51.69% Shell 0.38% HTML 43.82% Perl 4.11%
text-generation natural-language-processing deep-learning pretrained-models python pytorch seq2seq natural-language-generation

textbox's Introduction

TextBox Logo


TextBox 2.0 (妙笔)

“李太白少时,梦所用之笔头上生花后天才赡逸,名闻天下。”——王仁裕《开元天宝遗事·梦笔头生花》
("When Li Taibai was young, he dreamed that flowers blossomed from the tip of his writing brush; from then on his talent grew rich and unrestrained, and his fame spread throughout the land." — Wang Renyu, Anecdotes of the Kaiyuan and Tianbao Eras: "Dreaming of a Flowering Brush Tip")

TextBox 2.0: A Text Generation Library with Pre-trained Language Models

TextBox 2.0 is an up-to-date text generation library based on Python and PyTorch focusing on building a unified and standardized pipeline for applying pre-trained language models to text generation:

  • From a task perspective, we consider 13 common text generation tasks such as translation, story generation, and style transfer, and their corresponding 83 widely-used datasets.
  • From a model perspective, we incorporate 47 pre-trained language models/modules covering the categories of general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight models (modules).
  • From a training perspective, we support 4 pre-training objectives and 4 efficient and robust training strategies, such as distributed data parallel and efficient generation.

Compared with the previous version of TextBox, this extension mainly focuses on building a unified, flexible, and standardized framework for better supporting PLM-based text generation models. There are three advantages of TextBox 2.0:

  • It offers significantly broader coverage of text generation tasks and PLMs than the previous version.
  • It is designed to be unified in implementation and interface.
  • It can faithfully reproduce the results reported in existing work.

TextBox 2.0 framework
The Overall Framework of TextBox 2.0

Installation

Considering that a modified version of transformers will be installed, it is recommended to create a new conda environment:

conda create -n TextBox python=3.8
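
Activate the new environment (standard conda usage) before proceeding:

conda activate TextBox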

Then, you can clone our repository and install it with one click.

git clone https://github.com/RUCAIBox/TextBox.git && cd TextBox
bash install.sh

If you face the ROUGE-1.5.5.pl - XML::Parser dependency error when installing files2rouge, you can refer to this issue.

Quick Start

This is a script template to run TextBox 2.0 in an end-to-end pipeline:

python run_textbox.py --model=<model-name> --dataset=<dataset-name> --model_path=<hf-or-local-path>

Substitute --model=<xxx>, --dataset=<xxx>, and --model_path=<xxx> with your choices.

The choices of model and model_path can be found in Model. We provide detailed instructions for each model on that page.

The choices of dataset can be found in Dataset. You should download the dataset from https://huggingface.co/RUCAIBox and put it under the dataset folder, just like samsum. If you want to use your own dataset, please refer to here.
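
As a rough sketch of what TextBox expects, a downloaded dataset such as samsum sits in its own folder under dataset. The file names below follow the train/valid/test .src/.tgt convention that appears elsewhere on this page and should be double-checked against the dataset documentation:

dataset/
└── samsum/
    ├── train.src
    ├── train.tgt
    ├── valid.src
    ├── valid.tgt
    ├── test.src
    └── test.tgt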

The script below will run the Facebook BART-base model on the samsum dataset:

python run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base

Training

Basic Training

For basic training, we provide a detailed tutorial (here) for setting commonly used parameters like optimizer, scheduler, validation frequency, early stopping, and so on.
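
For illustration, here is a hedged example that combines the quick-start command with training options appearing elsewhere on this page (the values are arbitrary; see the tutorial for the full list of parameters):

python run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base \
    --epochs=5 --learning_rate=1e-5 --train_batch_size=16 --eval_batch_size=16 --max_save=1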

Pre-training

TextBox 2.0 provides four pre-training objectives to help users pre-train a model from scratch, including language modeling, masked sequence-to-sequence modeling, denoising auto-encoding, and masked span prediction. See the pre-training doc for a detailed tutorial.
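
The objective is selected with the --pretrain_task option. For example, the following invocation (which also appears in an issue report further down this page) uses the denoising objective; the option values for the other objectives are listed in the pre-training doc:

python run_textbox.py --model=BART --dataset=wudao --pretrain_task=denoising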

Efficient Training

Four useful training methods are provided for improving the optimization of PLMs: distributed data parallel, efficient decoding, hyper-parameter optimization, and repeated experiments. Detailed instructions are provided here.
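
As an illustration, distributed data parallel training is launched through accelerate (a sketch following the invocations reported in the issues below; configure accelerate for your own GPUs first):

accelerate launch run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base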

Model

To support the rapid progress of PLMs on text generation, TextBox 2.0 incorporates 47 models/modules, covering the categories of general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight models (modules). See the model doc for information on detailed usage instructions of each model, pre-trained model parameters, and generation parameters.

Dataset

Now we support 13 generation tasks (e.g., translation and story generation) and their corresponding 83 datasets. We also provide the description, basic statistics, training/validation/testing samples, and leaderboard for each dataset. See more details here.

Evaluation

TextBox 2.0 supports 17 automatic metrics of 4 categories and several visualization tools to explore and analyze the generated texts in various dimensions. For evaluation details, see the evaluation doc.

Releases

Releases   Date         Features
v2.0.1     24/12/2022   TextBox 2.0
v2.0.0     20/08/2022   TextBox 2.0 Beta
v0.2.1     15/04/2021   TextBox
v0.1.5     01/11/2021   Basic TextBox

Contributing

Please let us know if you encounter a bug or have any suggestions by filing an issue.

We welcome all contributions from bug fixes to new features and extensions.

We expect all contributions to be discussed first in the issue tracker and then submitted as pull requests.

We thank @LucasTsui0725 for contributing the HRED model and several evaluation metrics.

We thank @wxDai for contributing PointerNet and more than 20 language models via the transformers API.

The Team

TextBox is developed and maintained by AI Box.

License

TextBox is released under the MIT License.

Reference

If you find TextBox 2.0 useful for your research or development, please cite the following papers:

@inproceedings{tang-etal-2022-textbox,
    title = "{T}ext{B}ox 2.0: A Text Generation Library with Pre-trained Language Models",
    author = "Tang, Tianyi  and  Li, Junyi  and  Chen, Zhipeng  and  Hu, Yiwen  and  Yu, Zhuohao  and  Dai, Wenxun  and  Zhao, Wayne Xin  and  Nie, Jian-yun  and  Wen, Ji-rong",
    booktitle = "Proceedings of the The 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.42",
    pages = "435--444",
}


@inproceedings{textbox,
    title = "{T}ext{B}ox: A Unified, Modularized, and Extensible Framework for Text Generation",
    author = "Li, Junyi  and  Tang, Tianyi  and  He, Gaole  and  Jiang, Jinhao  and  Hu, Xiaoxuan  and  Xie, Puzhao  and  Chen, Zhipeng  and  Yu, Zhuohao  and  Zhao, Wayne Xin  and  Wen, Ji-Rong",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.4",
    doi = "10.18653/v1/2021.acl-demo.4",
    pages = "30--39",
}

textbox's People

Contributors

1190303125, candycanelane, dai-wenxun, eltociear, huxx499, huyiwen, jboru, lucastsui0725, richardhgl, simplelifetime, steventang1998, timothy023, turboljy, wicknight, xiaoxue-xx, xiepuzhao, yhao-wang, zhuohaoyu


textbox's Issues

finetune gpt2 with e2e dataset

Hello,

I have tried to use the following command to fine-tune gpt2-medium on the e2e dataset, but got some errors.
Could you please give me an example of training the model with TextBox?

python run_textbox.py --model=GPT2 --dataset=e2e --model_path=./PTMs/gpt2-medium/

When generating after training for one epoch, the following warning appears:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

After generating, the following error is raised:

26 Nov 01:02    ERROR Traceback (most recent call last):
  File "/home/hy/TextBox/textbox/utils/dashboard.py", line 320, in new_experiment
    yield True
  File "/home/hy/TextBox/textbox/quick_start/experiment.py", line 130, in run
    self._do_train_and_valid()
  File "/home/hy/TextBox/textbox/quick_start/experiment.py", line 105, in _do_train_and_valid
    self.valid_result = self.trainer.fit(train_data, valid_data)
  File "/home/hy/TextBox/textbox/trainer/trainer.py", line 453, in fit
    self.stopped |= self._valid(valid_data, 'epoch')
  File "/home/hy/miniconda3/envs/textbox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/hy/TextBox/textbox/trainer/trainer.py", line 297, in _valid
    valid_results = self.evaluate(valid_data, is_valid=True)
  File "/home/hy/miniconda3/envs/textbox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/hy/TextBox/textbox/trainer/trainer.py", line 571, in evaluate
    result = self.evaluator.evaluate(generate_corpus, reference_dataset)
  File "/home/hy/TextBox/textbox/evaluator/base_evaluator.py", line 151, in evaluate
    metric_result = evaluator.evaluate(generate_corpus, reference_corpus, avg=avg)
  File "/home/hy/TextBox/textbox/evaluator/abstract_evaluator.py", line 31, in evaluate
    metric_dict = self._calc_metrics_info(generate_corpus=generate_corpus, reference_corpus=reference_corpus)
  File "/home/hy/TextBox/textbox/evaluator/bleu_evaluator.py", line 92, in _calc_metrics_info
    reference_corpus = list(zip_longest(*reference_corpus))
TypeError: type object argument after * must be an iterable, not Corpus

How to use the other pre-training objectives like masked sequence-to-sequence modeling?

Hello, I am very interested in your work.
I know "--pretrain_task=denoising" enables denoising auto-encoding.
However, I did not see the relevant parameter descriptions and restrictions in the code.
When I want to use other pre-training methods, how do I set this parameter?

Thank you very much for your work, and I hope you can tell me the method when you have time.

TextBox 0.2.1 on Windows and Linux

Can I run TextBox 0.2.1 on a Windows server without using a Linux subsystem?
Does the tool require any dependencies that are specific to Linux or need a Linux environment to run?

How to use checkpoint?

Hi,
"python run_textbox.py --model=RNN --dataset=COCO --task_type=unconditional --load_experiment=<ckpt_file> --test_only=true"
I tried to use this command after the program stopped accidentally, but it cannot find the ckpt_file, so I am wondering where the ckpt_file is saved and how to load it into the program. Looking forward to your reply! Thank you.

prompt tuning

Hi, which models besides GPT support prompt-tuning in your repo?

Open-ended dialogue system: is there a good way to evaluate the fluency and topical coherence of the generated responses in context?

As stated in the title:

a = ['程序员要掌握哪些技能?',
 '你有什么特殊技能?',
 '我学过Linux。',
 '你在主修什么?',
 '谢邀,我主修数据库',
 '你能告诉我的英语教育吗?',
 '编程,学好英语']

b = ['程序员要掌握哪些技能?', '你有什么特殊技能?', '我在计算机系学习',
'你在主修什么?', '我主修英语', '你能告诉我的英语教育吗?']

Above are two example contexts from an open-ended dialogue system; personally, I feel that a is better than b.
Taking the first sentence '程序员要掌握哪些技能?' as the topic of the dialogue,
is there a good way to evaluate which of a and b is better?

[🐛BUG] The bug of T5 using prefix-tuning

Describe the bug
AttributeError: 'GPT2LMHeadModel' object has no attribute 'set_efficient_tuning'

When I use efficient_methods:

python run_textbox.py \
    --model=T5 \
    --model_path=t5-large \
    --dataset=webnlg \
    --gpu_id=4 \
    --efficient_methods=['prefix-tuning'] \
    --efficient_kwargs={'prefix_length':\ 100,\ 'prefix_dropout':\ 0.1,\ 'prefix_mid_dim':\ 512} \
    --filename CP/T5_large_prefix_tuning

FileNotFoundError: [Errno 2] No such file or directory: 'saved/T5-xIntent_en2en-Jul-31-2021_ 13-45-46.pth'

When I try to call run_demo as this:

python run_demo.py --model=T5 --dataset=xIntent_en2en --pretrained_model_path=/drive2/pretrained/mt5/hf/mt5-small/

It gives the following error:

...mini/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/pouramini/miniconda3/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'saved/T5-xIntent_en2en-Jul-31-2021_13-45-46.pth'

It searches for a file with the current time (13-45), but my model exists in the saved directory with a different timestamp. Why must the hours and minutes match?

(base) pouramini@nlplab-server:~/TextBox/saved$ ls
GPT2-COCO-Jul-30-2021_21-51-08.pth  
T5-xIntent_en2en-Jul-31-2021_11-46-29.pth
RNN-COCO-Jul-30-2021_13-34-31.pth

Error when running install.sh with version 2.0.0

ERROR: Could not find a version that satisfies the requirement rouge-score>=0.1.2 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4)

rouge-score 0.1.2 cannot be installed; how can I resolve this?

Task type for head to tail association

I have some data like this:

These are heads, and I saved them in train.input_text:

PersonX uses PersonX's ___ to obtain
PersonX uses PersonX's ___ to obtain
PersonX uses PersonX's ___ to obtain
PersonX uses PersonX's ___ to obtain
PersonX changes men 's ___
PersonX changes men 's ___

The tails are the other node in a relation. For example, for the "intention" relation, these could be tails, which specify the intention of a person performing an action (the heads above); they are saved as train.target_text:

to have an advantage
to fulfill a desire
to get out of trouble
to be powerful
to be influential
to reform men with a bad attitude.
good about themselves

Each row corresponds to a row in tails. I tried to model them using t5, but I don't know which task_type to use. I tried summarization and the generated output is like this:

personx's ____
personx's ____
personx's ____
personx's ____
personx's ____
personx changes men's _____
personx changes men's _____

This is obviously far from the targets. It seems it just tried to summarize them, but it needs to learn a mapping between them.
Which task type is suitable for this common task?

ImportError: cannot import name 'f' from 'pandas.core.resample'

Hello,

Is that a pandas version problem? which pandas version does TextBox need?
python: 3.8.15
pandas: 1.5.2

(textbox) hy@xxx:~/TextBox$ python run_textbox.py --model_path=facebook/bart-base
Traceback (most recent call last):
  File "run_textbox.py", line 2, in <module>
    from textbox import run_textbox
  File "/home/hy/TextBox/textbox/__init__.py", line 8, in <module>
    from textbox.quick_start.hyper_tuning import run_hyper
  File "/home/hy/TextBox/textbox/quick_start/hyper_tuning.py", line 14, in <module>
    from .experiment import Experiment
  File "/home/hy/TextBox/textbox/quick_start/experiment.py", line 12, in <module>
    from ..trainer.trainer import Trainer
  File "/home/hy/TextBox/textbox/trainer/__init__.py", line 1, in <module>
    from textbox.trainer.trainer import Trainer
  File "/home/hy/TextBox/textbox/trainer/trainer.py", line 16, in <module>
    from textbox.utils.dashboard import get_dashboard, Timestamp, EpochTracker
  File "/home/hy/TextBox/textbox/utils/dashboard.py", line 13, in <module>
    from pandas.core.resample import f
ImportError: cannot import name 'f' from 'pandas.core.resample' 

Error when loading the fine-tuned blenderbot_small-90M model

Using run_textbox.py --model=Blenderbot-Small --model_path=facebook/blenderbot_small-90M --dataset=dd, I generated a fine-tuned checkpoint folder, renamed it to blenderbot_small-90M, and then loaded the model, which raised an error:

Exception: Error while initializing BPE: Token _</w> out of vocabulary

Code:
from transformers import AutoTokenizer, pipeline, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("~/blenderbot_small-90M")
model =AutoModelForSeq2SeqLM.from_pretrained("~/blenderbot_small-90M")
predict = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
text = "hello"
pred = predict(text)
print(pred)

ValueError: For decoder-only generation, one must pass `input_ids`

Hi! Thx for sharing the amazing code repo.

First of all, I tried GPT-2 fine-tuning with TextBox; everything is good and works pretty smoothly.
However, I came across an error when I tried to use prompt tuning for GPT-2.
I set the efficient training hyper-parameters in overall.yaml to be:

efficient_methods: ['prompt-tuning']
efficient_kwargs: {'prompt_length': 100}
efficient_unfreeze_model: False

Then when prompt-tuning it returns the following error:

generating:   0%|          | 0/7 [00:00<?, ?it/s]
generating:   0%|          | 0/7 [00:00<?, ?it/s]
24 Nov 21:13    ERROR Traceback (most recent call last):
  File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/utils/dashboard.py", line 323, in new_experiment
    yield True
  File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/quick_start/experiment.py", line 130, in run
    self._do_train_and_valid()
  File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/quick_start/experiment.py", line 105, in _do_train_and_valid
    self.valid_result = self.trainer.fit(train_data, valid_data)
  File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/trainer/trainer.py", line 453, in fit
    self.stopped |= self._valid(valid_data, 'epoch')
  File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/trainer/trainer.py", line 297, in _valid
    valid_results = self.evaluate(valid_data, is_valid=True)
  File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/trainer/trainer.py", line 526, in evaluate
    generated = self.accelerator.unwrap_model(self.model).generate(batch_data, self.accelerator)
  File "/weka-jd/prod/public/permanent/group_yangyaodong/yuzhouliang/workspaces/teeter/textbox/model/abstract_model.py", line 90, in generate
    sample_outputs = accelerator.unwrap_model(self.model).generate(**inputs, **self.generation_kwargs)
  File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/hf_venvs/python3.8/202111/lib/python3.8/site-packages/transformers/generation_utils.py", line 934, in generate
    raise ValueError("For decoder-only generation, one must pass `input_ids`.")
ValueError: For decoder-only generation, one must pass `input_ids`

The main error is ValueError: For decoder-only generation, one must pass `input_ids`, raised at transformers/generation_utils.py when validating.

How can I solve this problem?

[🐛BUG] Error with multi-GPU training using accelerate

Describe the bug
Multi-GPU training with accelerate fails.

To reproduce

accelerate launch run_textbox.py \
    --gpu_id=1,3 \
    --dataset=csl \
    --model=CPT \
    --model_path=fnlp/cpt-base \
    --saved_dir=./saved/ \
    --filename=DEBUG \
    --epochs=5 \
    --learning_rate=1e-5 \
    --train_batch_size=16 \
    --eval_batch_size=16 \
    --max_save=1 \
    --wandb=disabled \
    --quick_test=1000

Log
13 Feb 11:15 ERROR Traceback (most recent call last):
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
    yield True
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/quick_start/experiment.py", line 136, in run
    self._do_train_and_valid()
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
    self.valid_result = self.trainer.fit(train_data, valid_data)
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/trainer/trainer.py", line 451, in fit
    loss = self._train_epoch(train_data, epoch_idx, valid_data)['loss']
  File "/home/cqy/workspace/InterestGraph/video_understanding/TextBox/textbox/trainer/trainer.py", line 221, in _train_epoch
    loss = self.model(data, epoch_idx=epoch_idx)
  File "/home/cqy/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/cqy/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 0: 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Text Generation Using SeqGAN

Hi,
After I trained the SeqGAN Model on Arabic texts I have a generated file (SeqGAN-wiki_ar-Dec-05-2022_20-05-03.txt) which contains a huge number of Arabic-generated sentences.
My question is: after the training period for this model ended (it took approximately three days), is it possible to run it so that it immediately generates a new sentence each time?
Does TextBox support this feature, and how can I run SeqGAN after training on my custom dataset to generate a new sentence each time?

Error when loading the CPT model

RuntimeError: Error(s) in loading state_dict for CPTForConditionalGeneration:
    size mismatch for model.encoder.embeddings.position_ids: copying a param with shape torch.Size([1, 1024]) from checkpoint, the shape in current model is torch.Size([1, 512]).
    size mismatch for model.encoder.embeddings.position_embeddings.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
You may consider adding ignore_mismatched_sizes=True in the model from_pretrained method.

Both cpt-base and cpt-large raise this error when loading. Could the dimensions in the config file be wrong, so that the initialized model dimensions do not match the checkpoint weight dimensions?

Quick Start Error

For the Quick Start in readme.md, I tried python run_textbox.py; it works and returns a BLEU score.
But when I tried python run_textbox.py --rnn_type=lstm --max_vocab_size=4000, it shows:

06 Apr 10:58    INFO epoch 38 training [time: 0.76s, train loss: 2.8981]
06 Apr 10:58    INFO epoch 38 evaluating [time: 0.23s, valid_loss: 4.281664]
06 Apr 10:58    INFO valid ppl: 72.36075204659402
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 206.01it/s]
06 Apr 10:58    INFO epoch 39 training [time: 0.76s, train loss: 2.8782]
06 Apr 10:58    INFO epoch 39 evaluating [time: 0.23s, valid_loss: 4.282278]
06 Apr 10:58    INFO valid ppl: 72.40518443339685
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 190.98it/s]
06 Apr 10:58    INFO epoch 40 training [time: 0.82s, train loss: 2.8627]
06 Apr 10:58    INFO epoch 40 evaluating [time: 0.23s, valid_loss: 4.286773]
06 Apr 10:58    INFO valid ppl: 72.73138277179176
06 Apr 10:58    INFO Finished training, best eval result in epoch 37
06 Apr 10:58    INFO best valid loss: 4.267283218029218, best valid ppl: 71.32759063735446
06 Apr 10:58    INFO Loading model structure and parameters from saved/RNN-COCO-Apr-06-2022_10-57-51.pth
  0%|                                                                                                                                 | 0/157 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_textbox.py", line 18, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=config_file_list, config_dict={})
  File "/home/LAB/TextBox/textbox/quick_start/quick_start.py", line 90, in run_textbox
    test_result = trainer.evaluate(test_data, load_best_model=saved)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/LAB/TextBox/textbox/trainer/trainer.py", line 446, in evaluate
    generated = self.model.generate(batch_data, eval_data)
  File "/home/LAB/TextBox/textbox/model/LM/rnn.py", line 64, in generate
    outputs, hidden_states = self.decoder(decoder_input, hidden_states)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/LAB/TextBox/textbox/module/Decoder/rnn_decoder.py", line 79, in forward
    outputs, hidden_states = self.decoder(input_embeddings, hidden_states)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 689, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 634, in check_forward_args
    'Expected hidden[0] size {}, got {}')
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 226, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, list(hx.size())))
RuntimeError: Expected hidden[0] size (2, 1, 128), got [1, 128]

Can you tell me how to solve it?

Questions about fast_bleu

The fast_bleu version in the fast_bleu_wheel4windows folder is 0.0.86, but the environment requirement is 0.0.89. On Windows, fast_bleu cannot be installed directly. Could you please update the version in the fast_bleu_wheel4windows folder? Thank you.

Duplicate code in RNNVAE.py

if self.rnn_type == "lstm":
    self.hidden_to_mean = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.hidden_to_logvar = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.latent_to_hidden = nn.Linear(self.latent_size, 2 * self.hidden_size)
elif self.rnn_type == 'gru' or self.rnn_type == 'rnn':
    self.hidden_to_mean = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.hidden_to_logvar = nn.Linear(self.num_directions * self.hidden_size, self.latent_size)
    self.latent_to_hidden = nn.Linear(self.latent_size, 2 * self.hidden_size)
Is there any difference between LSTM and GRU branches?

Question about saved checkpoints

I set max_save=1 and ran 10 epochs; the best result appeared at epoch 8, but both the epoch-8 and epoch-10 folders were saved. According to the README, shouldn't only epoch-8 be kept? Also, only the epoch-10 folder contains generation.txt; from a quick look at the source code, even if two folders are kept, shouldn't each of them contain a generation.txt?

[🐛BUG] Context Tuning bug

Describe the bug
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 744 but got size 768 for tensor number 1 in the list.

To reproduce
Set train_batch_size in gyafc_em.yaml to 16 (using a 3090 with 24 GB; setting it to 64 reports insufficient GPU memory).
Run python run_textbox.py --model=Context_Tuning --dataset=gyafc_em

Log
15 Apr 16:30 ERROR Traceback (most recent call last):
File "/root/TextBox/textbox/utils/dashboard.py", line 311, in new_experiment
yield True
File "/root/TextBox/textbox/quick_start/experiment.py", line 138, in run
self._do_train_and_valid()
File "/root/TextBox/textbox/quick_start/experiment.py", line 113, in _do_train_and_valid
self.valid_result = self.trainer.fit(train_data, valid_data)
File "/root/TextBox/textbox/trainer/trainer.py", line 451, in fit
loss = self._train_epoch(train_data, epoch_idx, valid_data)['loss']
File "/root/TextBox/textbox/trainer/trainer.py", line 221, in _train_epoch
loss = self.model(data, epoch_idx=epoch_idx)
File "/usr/local/miniconda3/envs/TextBox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/root/TextBox/textbox/model/abstract_model.py", line 69, in forward
inputs = self._process_prompt_tuning_input(inputs, batch)
File "/root/TextBox/textbox/model/context_tuning.py", line 88, in _process_prompt_tuning_input
inputs_embeds = torch.cat([prompt_embeds[:, 0], inputs_embeds, prompt_embeds[:, 1]], dim=1) # b, pl+l+pl, e
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 744 but got size 768 for tensor number 1 in the list.

GANs Models

Why does TextBox no longer support GAN models for text generation?
How can I use the SeqGAN, TextGAN, and RankGAN models, if possible?

[🐛BUG] The gyafc_em.ckpt file in RUCAIBox/StyleTransfer is missing

When running
python run_textbox.py --model=OpenAI-GPT --model_path=openai-gpt --dataset=gyafc_em
the following error is raised:
[Errno 2] No such file or directory: 'textbox/evaluator/utils/gyafc_em.ckpt'
When using the GYAFC dataset from RUCAIBox/StyleTransfer, this ckpt file is required; where can I download or generate it?

ModuleNotFoundError: No module named 'transformers.models.unilm'

At runtime, the error ModuleNotFoundError: No module named 'transformers.models.unilm' is raised.
After investigation, the error comes from these lines in textbox\utils\utils.py:
from transformers.models.unilm.tokenization_unilm import UnilmTokenizer
from transformers.models.mass.tokenization_mass import MassTokenizer
It seems UnilmTokenizer and MassTokenizer cannot be imported from transformers; the reported errors for this file are:
Import "transformers.models.unilm.tokenization_unilm" could not be resolved
Import "transformers.models.mass.tokenization_mass" could not be resolved
I tried upgrading transformers and re-downloading transformers, but neither solved the problem.

Custom Dataset Preprocessing

Does TextBox support custom preprocessing for a dataset?

I work with source code data, and such data may require custom preprocessing, for example, extracting an abstract syntax tree or a dataflow graph.

Pre-training from scratch

Thanks for open-sourcing this exciting tool!
When I used TextBox to pre-train BART from scratch, I found that the wudao corpus mentioned in the documentation has not been provided. Where can I get this data?

python run_textbox.py --model=BART --dataset=wudao --pretrain_task=denoising

Since I did not have the wudao dataset, I tried to use the example dataset samsum for pre-training BART with the denoising task.
However, I got the following error:

05 Jan 23:25    INFO ====== Start training ======
train    1: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 77/77 [03:16<00:00,  2.55s/step, loss=2.78]
05 Jan 23:29    INFO Train epoch  1 [time: 196.60s, loss: 2.78]
generating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [01:22<00:00,  1.58s/it]
05 Jan 23:30    ERROR Traceback (most recent call last):
  File "/mnt/windata/projects/TextBox/textbox/utils/dashboard.py", line 312, in new_experiment
    yield True
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 136, in run
    self._do_train_and_valid()
  File "/mnt/windata/projects/TextBox/textbox/quick_start/experiment.py", line 111, in _do_train_and_valid
    self.valid_result = self.trainer.fit(train_data, valid_data)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 455, in fit
    self.stopped |= self._valid(valid_data, 'epoch')
  File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 294, in _valid
    valid_results = self.evaluate(valid_data, is_valid=True)
  File "/home/xxx/anaconda3/envs/TextBox/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/windata/projects/TextBox/textbox/trainer/trainer.py", line 548, in evaluate
    corpus_len = len(eval_data.dataset.target_text)
AttributeError: 'AbstractDataset' object has no attribute 'target_text'

Could you please help me find my mistake in using TextBox? Thanks!

question about the kg2text task

Hi,
firstly, thanks a lot for your work.
I have a question about the kg2text task in the project: there is no example or information about kg2text, but I read your code and found the KG dataset processor.
Could you please use a dataset and the corresponding model as an example to further explain the kg2text details?

Given dataset causes errors in training

I downloaded PersonaChat from the Baidu Wangpan link given in the README, but it may need more processing.
In the source code, 'MultipleSentenceDataset' is used to handle the dialogue task's dataset. The given dataset (PersonaChat) is split into source and target and lacks the knowledge part, which causes errors in training. I would like to know more details about how to handle this issue, thank you~

TextBox train.py error and possible solutions

I get the following error:

File "/textbox/Textbox_test/TextBox/textbox/trainer/trainer.py", line 516, in evaluate
    generated = self.accelerator.unwrap_model(self.model).generate(batch_data, self.accelerator)
TypeError: generate() missing 1 required positional argument: 'accelerator'

I compared with the older version:

latest version, line 516:
    generated = self.accelerator.unwrap_model(self.model).generate(batch_data, self.accelerator)
older version, line 505:
    generated = self.accelerator.unwrap_model(self.model).generate(batch_data, eval_data, self.accelerator)

So I added the argument "eval_data" to generate(), and sure enough, the problem was solved.
Please confirm this error.

detailed readme and documentation

Hello,
Thanks for developing such a nice library. I am looking for detailed documentation. The readthedocs.io page is empty, and I feel the instructions given on GitHub are not enough to run experiments. Can you please share proper documentation for TextBox? Moreover, if example notebooks/scripts can be shared, it would be great for first-time users to get started with training, evaluation, dataset loading, as well as decoding using different strategies (beam, top_k, etc.).

Thanks a lot .

How to create a custom dataset for machine translation

What should I do if a phrase has several translations?
Will this dataset be correct?
train.src:

I like the color green.
I like the color green.

train.tgt

Мне нравится зелёный цвет.
Я люблю зелёный цвет.

How will the BLEU metric work on it?

SeqGAN Model with Huge Data

Hi,
I am using TextBox tools specially SeqGAN Model with Arabic texts for text generation.

When I use a large file (train.tgt, about 5 GB), running the SeqGAN model gives me:
"
24 Mar 14:25 INFO Loading data from scratch
killed
"
Why does this happen, and is there a limitation on data size? When I use a 2 MB train.tgt, the SeqGAN model runs correctly, but its results are poor.

According to your experiences and observations, what are the factors and parameters that I can change in order to improve the generation of Arabic texts?

Thank you :)

Bug report

When I ran python run_textbox.py --rnn_type=lstm --max_vocab_size=4000, I found a bug in textbox/module/Decoder/rnn_decoder.py, occurring on line 78. The init_hidden() function returns a tuple when initialized with LSTM, so I got AttributeError: 'tuple' object has no attribute 'contiguous'.

One-to-many dataset does not work for machine learning

example of data:
train.src

ABC
BCA

train.tgt

['Abc', 'ABc']
'bca'

the textbox/properties/dataset/my_data.yaml dataset config is same as textbox/properties/dataset/wmt19-ru-en.yaml

command:

python run_textbox.py --model=BART --dataset=my_data  --model_path=facebook/bart-base

I got the following error from tokenizer transformers.AutoTokenizer:

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

The error occurs in a method of the textbox.data.abstract_dataset.AbstractDataset class, on line 127.

target_ids = tokenizer(
    text_target=self.target_text,
    add_special_tokens=False,
    return_token_type_ids=False,
    return_attention_mask=False,
)["input_ids"]

It is caused by the fact that self.target_text is created by the textbox.data.misc.load_data function:

def load_data(dataset_path: str, max_length: int = 0):
...
    text = []
    with open(dataset_path, "r") as fin:
        if max_length:
            fin = itertools.islice(fin, max_length)
        for line in fin:
            l = line.strip()
            if len(l) >= 2 and ((l[0] == '"' and l[-1] == '"') or (l[0] == "'" and l[-1] == "'") or
                                (l[0] == '[' and l[-1] == ']')):
                try:
                    l = eval(l)
                    if not isinstance(l, list):
                        l = str(l)
                except:
                    pass
            text.append(l)
    return text

For my_data, this function returns [['Abc', 'ABc'], 'bca'], and that structure does not follow the requirement of transformers.AutoTokenizer - Union[TextInputSequence, Tuple[InputSequence, InputSequence]].

Are there examples of training a machine learning model with a one-to-many dataset?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6797: character maps to <undefined>

Hey there,

Thanks for the amazing work.

Actually, I face a codec problem when running the example python run_textbox.py --model=BART --dataset=samsum --model_path=facebook/bart-base:

Traceback (most recent call last):
  File "run_textbox.py", line 12, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=args.config_files, config_dict={})
  File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\quick_start\quick_start.py", line 20, in run_textbox
    experiment = Experiment(model, dataset, config_file_list, config_dict)
  File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\quick_start\experiment.py", line 52, in __init__
    self._init_data(self.get_config(), self.accelerator)
  File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\quick_start\experiment.py", line 78, in _init_data
    train_data, valid_data, test_data = data_preparation(config, tokenizer)
  File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\data\utils.py", line 23, in data_preparation
    train_dataset = AbstractDataset(config, 'train')
  File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\data\abstract_dataset.py", line 25, in __init__
    self.source_text = load_data(source_filename, max_length=self.quick_test)
  File "C:\Users\korn2\Desktop\TextBox-2.0.0\textbox\data\misc.py", line 25, in load_data
    for line in fin:
  File "C:\Users\korn2\anaconda3\envs\TextBox\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6797: character maps to <undefined>

Any ideas?

Cheers,
Kevin L

Add support for REINFORCE training

Hello,

Thank you for this tool. I would like to add the possibility of training with reinforcement learning, using a reward such as ROUGE or BLEU, for seq2seq tasks.

I would be happy to contribute!

Best,
Muhammad

ModuleNotFoundError: No module named 'bert_score'

Hi, thanks for providing such a powerful tool. After cloning TextBox from source, I tried to run the command python run_textbox.py, and the result was: ModuleNotFoundError: No module named 'bert_score'. Is it a bug? How can I run the code correctly?
