
BART-TL: Topic Label Generation

Implementation and helper scripts for the paper BART-TL: Weakly-Supervised Topic Label Generation.

Introduction

The goal of our work was to reframe the task of topic labeling from ranking labels drawn from a predefined pool to generating labels directly. In our experiments, we use the BART model [1], hence the name BART-TL.

If you want to quickly use the models to generate labels for topics, they are available on Hugging Face:

  • BART-TL-all: https://huggingface.co/cristian-popa/bart-tl-all
  • BART-TL-ng: https://huggingface.co/cristian-popa/bart-tl-ng
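A quick way to try the models is through the transformers library. The snippet below is a minimal sketch: it loads one of the checkpoints listed above, and the space-separated topic-terms input format and the generation settings are illustrative assumptions rather than anything fixed by this repository.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # One of the released checkpoints listed above.
    model_name = "cristian-popa/bart-tl-ng"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # A topic represented as its space-separated top terms (assumed format).
    topic = "site web google search engine query website"
    inputs = tokenizer(topic, return_tensors="pt", truncation=True, max_length=128)

    # Generate a short label; beam size and length limit are illustrative.
    outputs = model.generate(**inputs, num_beams=5, max_length=15)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))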

Structure

We release the code we used in our experiments, from preparing the data to training and evaluating the performance. This is what you will find in each of the directories in this repository:

  • notebooks - Contains the end_to_end_workflow.ipynb notebook that guides the user through the whole process of fine-tuning a BART-TL model.
  • lda - Contains scripts for applying LDA [2] on a corpus of documents (namely, the ones in the corpus directory); a minimal sketch of this step follows the list.
  • corpus - The initial corpora of documents we focus our experiments on. The data comes from posts on multiple Stack Exchange forums. For a more complete dataset, see: https://archive.org/download/stackexchange.
  • data - Sample topics for each of the 5 subjects from the Stack Exchange data.
  • netl-src - Slightly modified NETL [3] code. The original code can be found at https://github.com/sb1992/NETL-Automatic-Topic-Labelling-.
  • bart-tl - Main code for BART experiments (pre-processing, training, inference).
    • build_dataset - Code for building datasets that BART will be fine-tuned on.
    • netl_data - Topics and label data from the NETL [3] work.
    • preprocess - Preprocessing scripts for the dataset previously built with build_dataset scripts.
    • finetune - Script for fine-tuning BART on the processed dataset using the fairseq library.
    • survey - Scripts for creating the survey questions for the annotators, as well as the actual survey results.
  • eval - Multiple scripts for evaluating the results obtained in surveys.
  • utils - Multiple singular scripts that were used when needed, but are not central to the workflow.
  • bert_score - Further research not released in the paper, following the work of Alokaili et al. (2020) [4]. We use their data and their method of assessing performance, which relies on NETL gold-standard labels and BERTScore [5].
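As a rough illustration of what the lda scripts produce (step 1 of the workflow below), here is a minimal LDA sketch using gensim. The library choice, the toy corpus, and the hyperparameters are assumptions for illustration; lda/apply_lda.py is the actual entry point.

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Toy tokenized documents; in the repo these come from the corpus directory.
    docs = [
        ["web", "search", "google", "query", "engine"],
        ["server", "request", "client", "http", "response"],
    ]

    dictionary = Dictionary(docs)
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Assumed hyperparameters; apply_lda.py may use different ones.
    lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

    # Each topic is a list of (term, weight) pairs -- the input to labeling.
    for topic_id in range(lda.num_topics):
        print(topic_id, lda.show_topic(topic_id, topn=5))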

You can check each script for the arguments it requires.

To fine-tune a BART-TL model, follow these steps:

  1. Obtain LDA topics, using the lda/apply_lda.py script;
  2. Extract labels for the topics using the NETL code: TODO;
  3. Build a fairseq-compatible dataset using one of the scripts in bart-tl/build_dataset/**/build_fairseq_dataset.py (see the dataset-format sketch after this list);
  4. Preprocess the previous dataset by:
     a. applying BPE using the bart-tl/preprocess/bpe/bpe_preprocess.sh script (update it to fit your setup);
     b. binarizing the resulting dataset using the bart-tl/preprocess/binarization/binarize.sh script (update it to fit your setup);
  5. Download the BART-large model from https://github.com/pytorch/fairseq/blob/master/examples/bart/README.md;
  6. Fine-tune it using the bart-tl/finetune/finetune_bart.sh script (update it to fit your setup).
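As referenced in step 3, fairseq fine-tuning expects a parallel dataset of line-aligned source and target text files. The sketch below illustrates that convention; the file names, the space-separated topic-terms source format, and the example pairs are assumptions about what the build scripts emit, not a transcript of their output.

    # Write parallel source/target files in the line-aligned format that
    # fairseq preprocessing expects. The pairs here are illustrative only.
    pairs = [
        ("site web google search engine query website", "web search"),
        ("server request client http response header", "http requests"),
    ]

    with open("train.source", "w") as src, open("train.target", "w") as tgt:
        for terms, label in pairs:
            src.write(terms + "\n")   # topic terms, one topic per line
            tgt.write(label + "\n")   # candidate label aligned to the same line

BPE (step 4a) and binarization (step 4b) are then applied on top of these plain-text files.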

The fine-tuning script will create .pt checkpoint files that you can then use to infer labels using the bart-tl/generate.py script.
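For inference, bart-tl/generate.py is the entry point in this repo; the sketch below only illustrates the underlying fairseq API it builds on. The checkpoint directory, data path, and generation settings are assumed placeholders.

    from fairseq.models.bart import BARTModel

    # Load the fine-tuned checkpoint produced by finetune_bart.sh.
    bart = BARTModel.from_pretrained(
        "checkpoints/",                       # directory holding the .pt files
        checkpoint_file="checkpoint_best.pt",
        data_name_or_path="data-bin",         # binarized dataset from binarize.sh
    )
    bart.eval()

    # One topic as space-separated top terms (assumed input format).
    topics = ["site web google search engine query website"]
    print(bart.sample(topics, beam=5))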

Citation

If this work was useful to you, please cite it as:

Cristian Popa, Traian Rebedea. 2021. BART-TL: Weakly-Supervised Topic Label Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume.

Alternatively, use this citation in BibTeX format:

@inproceedings{popa-rebedea-2021-bart,
    title = "{BART}-{TL}: Weakly-Supervised Topic Label Generation",
    author = "Popa, Cristian  and
      Rebedea, Traian",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.121",
    pages = "1418--1425",
    abstract = "We propose a novel solution for assigning labels to topic models by using multiple weak labelers. The method leverages generative transformers to learn accurate representations of the most important topic terms and candidate labels. This is achieved by fine-tuning pre-trained BART models on a large number of potential labels generated by state of the art non-neural models for topic labeling, enriched with different techniques. The proposed BART-TL model is able to generate valuable and novel labels in a weakly-supervised manner and can be improved by adding other weak labelers or distant supervision on similar tasks.",
}

References

[1] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

[3] Shraey Bhatia, Jey Han Lau, and Timothy Baldwin. 2016. Automatic Labelling of Topics with Neural Embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 953–963, Osaka, Japan. The COLING 2016 Organizing Committee.

[4] Areej Alokaili, Nikolaos Aletras, and Mark Stevenson. 2020. Automatic generation of topic labels. arXiv preprint arXiv:2006.00127.

[5] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
