
BART-TL: Topic Label Generation

Implementation and helper scripts for the paper BART-TL: Weakly-Supervised Topic Label Generation.

Introduction

The goal of our work was to reframe the task of topic labeling from ranking labels drawn from a predefined pool to generating labels directly. In our experiments, we use the BART model [1], hence the name BART-TL.

If you want to quickly use the models to generate labels for topics, they are available on Hugging Face:

  • BART-TL-all: https://huggingface.co/cristian-popa/bart-tl-all
  • BART-TL-ng: https://huggingface.co/cristian-popa/bart-tl-ng
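A quick way to try the models is through the transformers library. The snippet below is a minimal sketch: it loads one of the checkpoints listed above, and the space-separated topic-terms input format and the generation settings are illustrative assumptions rather than anything fixed by this repository.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # One of the released checkpoints listed above.
    model_name = "cristian-popa/bart-tl-ng"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # A topic represented as its space-separated top terms (assumed format).
    topic = "site web google search engine query website"
    inputs = tokenizer(topic, return_tensors="pt", truncation=True, max_length=128)

    # Generate a short label; beam size and length limit are illustrative.
    outputs = model.generate(**inputs, num_beams=5, max_length=15)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))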

Structure

We release the code we used in our experiments, from preparing the data to training and evaluating the performance. This is what you will find in each of the directories in this repository:

  • notebooks - Contains the end_to_end_workflow.ipynb notebook that guides the user through the whole process of fine-tuning a BART-TL model.
  • lda - Contains scripts for applying LDA [2] on a corpus of documents (namely, the ones in the corpus directory); a minimal sketch of this step follows the list.
  • corpus - The initial corpora of documents we focus our experiments on. The data comes from posts on multiple Stack Exchange forums. For a more complete dataset, see: https://archive.org/download/stackexchange.
  • data - Sample topics for each of the 5 subjects from the Stack Exchange data.
  • netl-src - Slightly modified NETL [3] code. The original code can be found at https://github.com/sb1992/NETL-Automatic-Topic-Labelling-.
  • bart-tl - Main code for BART experiments (pre-processing, training, inference).
    • build_dataset - Code for building datasets that BART will be fine-tuned on.
    • netl_data - Topics and label data from the NETL [3] work.
    • preprocess - Preprocessing scripts for the dataset previously built with build_dataset scripts.
    • finetune - Script for fine-tuning BART on the processed dataset using the fairseq library.
    • survey - Scripts for creating the survey questions for the annotators, as well as the actual survey results.
  • eval - Multiple scripts for evaluating the results obtained in surveys.
  • utils - Multiple singular scripts that were used when needed, but are not central to the workflow.
  • bert_score - Further research not released in the paper, following the work of Alokaili et al. (2020) [4]. We use their data and their method of assessing performance, which relies on NETL gold-standard labels and BERTScore [5].
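As a rough illustration of what the lda scripts produce (step 1 of the workflow below), here is a minimal LDA sketch using gensim. The library choice, the toy corpus, and the hyperparameters are assumptions for illustration; lda/apply_lda.py is the actual entry point.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy tokenized documents; in the repo these come from the corpus directory.
    docs = [
        ["web", "search", "google", "query", "engine"],
        ["server", "request", "client", "http", "response"],
    ]

    dictionary = Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Assumed hyperparameters; apply_lda.py may use different ones.
    lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

    # Each topic is a list of (term, weight) pairs -- the input to labeling.
    for topic_id in range(lda.num_topics):
        print(topic_id, lda.show_topic(topic_id, topn=5))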

You can check each script for the arguments it requires.

To fine-tune a BART-TL model, follow these steps:

  1. Obtain LDA topics, using the lda/apply_lda.py script;
  2. Extract labels for the topics using the NETL code: TODO;
  3. Build a fairseq-compatible dataset using one of the scripts in bart-tl/build_dataset/**/build_fairseq_dataset.py (see the dataset-format sketch after this list);
  4. Preprocess the previous dataset by:
     a. applying BPE using the bart-tl/preprocess/bpe/bpe_preprocess.sh script (update it to fit your setup);
     b. binarizing the resulting dataset using the bart-tl/preprocess/binarization/binarize.sh script (update it to fit your setup);
  5. Download the BART-large model from https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md;
  6. Fine-tune it using the bart-tl/finetune/finetune_bart.sh script (update it to fit your setup).
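As referenced in step 3, fairseq fine-tuning expects a parallel dataset of line-aligned source and target text files. The sketch below illustrates that convention; the file names, the space-separated topic-terms source format, and the example pairs are assumptions about what the build scripts emit, not a transcript of their output.

    # Write parallel source/target files in the line-aligned format that
    # fairseq preprocessing expects. The pairs here are illustrative only.
    pairs = [
        ("site web google search engine query website", "web search"),
        ("server request client http response header", "http requests"),
    ]

    with open("train.source", "w") as src, open("train.target", "w") as tgt:
        for terms, label in pairs:
            src.write(terms + "\n")   # topic terms, one topic per line
            tgt.write(label + "\n")   # candidate label aligned to the same line

BPE (step 4a) and binarization (step 4b) are then applied on top of these plain-text files.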

The fine-tuning script will create .pt checkpoint files that you can then use to infer labels using the bart-tl/generate.py script.
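For inference, bart-tl/generate.py is the entry point in this repo; the sketch below only illustrates the underlying fairseq API it builds on. The checkpoint directory, data path, and generation settings are assumed placeholders.

    from fairseq.models.bart import BARTModel

    # Load the fine-tuned checkpoint produced by finetune_bart.sh.
    bart = BARTModel.from_pretrained(
        "checkpoints/",                       # directory holding the .pt files
        checkpoint_file="checkpoint_best.pt",
        data_name_or_path="data-bin",         # binarized dataset from binarize.sh
    )
    bart.eval()

    # One topic as space-separated top terms (assumed input format).
    topics = ["site web google search engine query website"]
    print(bart.sample(topics, beam=5))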

Citation

If this work was useful to you, please cite it as:

Cristian Popa, Traian Rebedea. 2021. BART-TL: Weakly-Supervised Topic Label Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume.

Alternatively, use this citation in BibTeX format:

@inproceedings{popa-rebedea-2021-bart,
    title = "{BART}-{TL}: Weakly-Supervised Topic Label Generation",
    author = "Popa, Cristian  and
      Rebedea, Traian",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.121",
    pages = "1418--1425",
    abstract = "We propose a novel solution for assigning labels to topic models by using multiple weak labelers. The method leverages generative transformers to learn accurate representations of the most important topic terms and candidate labels. This is achieved by fine-tuning pre-trained BART models on a large number of potential labels generated by state of the art non-neural models for topic labeling, enriched with different techniques. The proposed BART-TL model is able to generate valuable and novel labels in a weakly-supervised manner and can be improved by adding other weak labelers or distant supervision on similar tasks.",
}

References

[1] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

[3] Shraey Bhatia, Jey Han Lau, and Timothy Baldwin. 2016. Automatic Labelling of Topics with Neural Embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 953–963, Osaka, Japan. The COLING 2016 Organizing Committee.

[4] Areej Alokaili, Nikolaos Aletras, and Mark Stevenson. 2020. Automatic generation of topic labels. arXiv preprint arXiv:2006.00127.

[5] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
