jacklxc / scientificdiscoursetagging

Implementation for EACL 2021 paper "Scientific Discourse Tagging for Evidence Extraction".

Home Page: https://aclanthology.org/2021.eacl-main.218.pdf

License: Apache License 2.0

Scientific Discourse Tagging

Scientific discourse tagger implementation for the EACL 2021 paper "Scientific Discourse Tagging for Evidence Extraction". There is a video available! The poster is also available in this repo.

This is a discourse tagger that labels each clause in a given paragraph from a biomedical paper with one of 8 discourse types, such as "fact", "method", "result", "implication", etc. The paper will be available as soon as it's published.

For the implementation of feature-based CRF for evidence fragment detection, please refer to FigureSpanDetection.

If you have any questions, please contact [email protected].

Requirements

  • Python 3
  • Tensorflow (tested with v1.12; either the gpu or non-gpu version is fine. If you encounter any issues, try v1.14)
  • Keras (tested with v2.2.4, either gpu or non-gpu version is fine)
  • Scikit-learn
  • Keras-bert
  • Pre-trained embedding

Input Format

Our discourse tagger expects inputs to be lists of clauses or sentences, with paragraph boundaries identified, i.e., each line in the input file needs to be a clause or sentence, and paragraphs should be separated by blank lines.

If you are training, the file additionally needs labels at the clause level, which can be specified on each line, after the clause, separated by a tab.
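The format above can be checked with a short reader before handing a file to the tagger. This is a minimal sketch, not the repo's own loader: the function name read_paragraphs and the sample clauses are hypothetical, and it assumes the label is whatever follows the last tab on a line.

```python
from typing import Iterable, List, Optional, Tuple

# (clause text, label); the label is None for unlabeled (non-training) files.
Clause = Tuple[str, Optional[str]]


def read_paragraphs(lines: Iterable[str]) -> List[List[Clause]]:
    """Group clause lines into paragraphs separated by blank lines."""
    paragraphs: List[List[Clause]] = []
    current: List[Clause] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = []
            continue
        if "\t" in line:
            # Training files: clause, then a tab, then the label.
            text, _, label = line.rpartition("\t")
            current.append((text.strip(), label.strip()))
        else:
            current.append((line.strip(), None))
    if current:
        paragraphs.append(current)
    return paragraphs


# Hypothetical sample: two paragraphs, three labeled clauses.
sample = [
    "We measured binding affinity\tmethod\n",
    "and found a two-fold increase\tresult\n",
    "\n",
    "This suggests a regulatory role\timplication\n",
]
paragraphs = read_paragraphs(sample)
print(len(paragraphs), "paragraphs;", sum(len(p) for p in paragraphs), "clauses")
```

To read a real file, pass an open file handle instead of the sample list, e.g. read_paragraphs(open(path, encoding="utf-8")).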

Intended Usage

As explained in the paper, the model is intended for tagging discourse elements in biomedical research papers. We use the seven-label taxonomy described in De Waard and Pander Maat (2012), plus an additional "none" label, for the SciDT dataset and PubMed-RCT. For detailed usage, check our paper.

If you want to tag your own data, you can parse your sentences into clauses by following these instructions.

Steps to use

Preparing word embeddings

BERT embedding

  • Follow the SciBERT instructions to download the pretrained weights. Remember to rename the files to match the hard-coded names in the BERT code.

Saved checkpoints

Trained SciBERT + LSTM-Attention + BiLSTM+CRF model:

For each model, put the CONTENTS of the decompressed archive in the root directory of this repo.

Training from scratch

SciBERT embedding

python -u discourse_tagger_generator_bert.py --repfile REPFILE --train_file TRAINFILE --validation_file DEVFILE  --use_attention --att_context LSTM_clause --bidirectional --crf --save --maxseqlen 40 --maxclauselen 60

where REPFILE is the path to the BERT embedding. --use_attention is recommended. --att_context selects the type of attention. --bidirectional uses a bidirectional LSTM for the sequence tagger. --crf uses a CRF as the last layer. --save saves the trained model. Check the help message of discourse_tagger_generator_bert.py for more options.

Trained model

After training completes successfully, three new files appear in the directory, with file names starting with model_.
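A quick way to confirm that training produced its outputs is to list the files with the model_ prefix. This is a small sketch; the helper name find_model_files is hypothetical, and only the model_ prefix and the count of three come from the text above.

```python
from pathlib import Path
from typing import List


def find_model_files(directory: str = ".") -> List[str]:
    """Return the sorted names of regular files starting with 'model_'."""
    return sorted(p.name for p in Path(directory).glob("model_*") if p.is_file())


files = find_model_files(".")
if len(files) < 3:
    print(f"Expected three model_ files, found {len(files)}: {files}")
else:
    print("Found model files:", files)
```

Run it from the directory where training was started, or pass that directory as the argument.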

Testing

You can specify test files during training using the --test_files argument. Alternatively, you can test after training is done. In the latter case, discourse_tagger assumes the trained model files described above are present in the directory.

python -u discourse_tagger_generator_bert.py --repfile REPFILE --test_file TESTFILE1 [TESTFILE2 ..] --use_attention --att_context LSTM_clause --bidirectional --crf --maxseqlen 40 --maxclauselen 60

Make sure you use the same attention, attention-context, and bidirectional options that you used for training.

Cite our paper

Please use the following BibTeX to cite our paper:

@inproceedings{li2021scientific,
  title={Scientific Discourse Tagging for Evidence Extraction},
  author={Li, Xiangci and Burns, Gully and Peng, Nanyun},
  booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume},
  pages={2550--2562},
  year={2021}
}
