[NeurIPS 2022] DRAGON 🐲: Deep Bidirectional Language-Knowledge Graph Pretraining

License: Apache License 2.0


DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining

This repo provides the source code & data of our paper "DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining" (NeurIPS 2022).

Overview

DRAGON is a new foundation model (an improvement over BERT) that is pretrained jointly on text and knowledge graphs for improved language, knowledge, and reasoning capabilities. Specifically, it is trained with two simultaneous self-supervised objectives, language modeling and link prediction, which encourage deep bidirectional reasoning over text and knowledge graphs.

DRAGON can be used as a drop-in replacement for BERT. It achieves better performance in various NLP tasks, and is particularly effective for knowledge and reasoning-intensive tasks such as multi-step reasoning and low-resource QA.
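
To make the idea of two simultaneous objectives concrete, here is a minimal pure-Python sketch of a joint loss. It is an illustration only, not the code in this repo: the function names are hypothetical, and the sketch assumes a masked language modeling loss, a DistMult-style triple scorer with negative sampling for link prediction, and an unweighted sum of the two terms.

```python
import math


def log_sigmoid(x):
    """log(sigmoid(x)); adequate for the small illustrative scores used here."""
    return -math.log1p(math.exp(-x))


def masked_lm_loss(true_token_probs):
    """Average negative log-likelihood of the masked-out tokens.

    true_token_probs: probability the model assigns to each correct masked token.
    """
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def distmult_score(head, relation, tail):
    """DistMult triple score (assumed scorer): sum_i head_i * relation_i * tail_i."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def link_prediction_loss(positive_score, negative_scores):
    """Logistic loss: push the held-out true triple up and corrupted triples down."""
    positive_term = -log_sigmoid(positive_score)
    negative_term = -sum(log_sigmoid(-s) for s in negative_scores) / len(negative_scores)
    return positive_term + negative_term


def joint_loss(true_token_probs, positive_score, negative_scores):
    """Unweighted sum of the two self-supervised losses (weighting is an assumption)."""
    return masked_lm_loss(true_token_probs) + link_prediction_loss(
        positive_score, negative_scores
    )
```

Both terms are computed on the same text-graph pair, so gradients from the text side and the graph side flow into one model at each step.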

0. Dependencies

Run the following commands to create a conda environment:

conda create -y -n dragon python=3.8
conda activate dragon
pip install torch==1.10.1+cu113 torchvision -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 wandb nltk spacy==2.1.6
python -m spacy download en
pip install scispacy==0.3.0
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
pip install torch-scatter==2.0.9 torch-sparse==0.6.12 torch-geometric==2.0.0 -f https://pytorch-geometric.com/whl/torch-1.10.1+cu113.html
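
After installation, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch and not part of the repo: `check_pins` and `installed_version` are hypothetical helpers, and the pins are copied from the commands above. The exact version string reported for `torch` (including the `+cu113` suffix) may differ depending on how the wheel was built, so treat a mismatch there as a prompt to look closer rather than a hard failure.

```python
from importlib import metadata

# Versions pinned in the install commands above.
PINS = {
    "torch": "1.10.1+cu113",
    "transformers": "4.9.1",
    "spacy": "2.1.6",
    "scispacy": "0.3.0",
    "torch-scatter": "2.0.9",
    "torch-sparse": "0.6.12",
    "torch-geometric": "2.0.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, get_version=installed_version):
    """Map each mismatched package name to an (expected, found) pair."""
    mismatches = {}
    for package, expected in pins.items():
        found = get_version(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in check_pins(PINS).items():
        print(f"{package}: expected {expected}, found {found}")
```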

1. Download pretrained models

You can download pretrained DRAGON models below. Place the downloaded model files under ./models

| Model | Domain | Size | Pretraining Text | Pretraining Knowledge Graph | Download Link |
| --- | --- | --- | --- | --- | --- |
| DRAGON | General | 360M parameters | BookCorpus | ConceptNet | general_model |
| DRAGON | Biomedicine | 360M parameters | PubMed | UMLS | biomed_model |

2. Download data

Commonsense domain

You can download all the preprocessed data from [here]. This includes the ConceptNet knowledge graph as well as CommonsenseQA, OpenBookQA and RiddleSense datasets. Specifically, run:

wget https://nlp.stanford.edu/projects/myasu/DRAGON/data_preprocessed.zip
unzip data_preprocessed.zip
mv data_preprocessed data

(Optional) If you would like to preprocess the raw data from scratch, you can download the raw data (ConceptNet knowledge graph, CommonsenseQA, OpenBookQA) by running:

./download_raw_data.sh

To preprocess the raw data, run:

CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes> --run common csqa obqa

You can select the GPU to use by setting CUDA_VISIBLE_DEVICES=... at the beginning of the command. The script will:

  • Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types)
  • Convert the QA datasets into .jsonl files (e.g., stored in data/csqa/statement/)
  • Identify all mentioned concepts in the questions and answers
  • Extract subgraphs for each q-a pair
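
The last two steps above (identifying mentioned concepts and extracting a subgraph per question-answer pair) can be sketched in a few lines. This is a simplified toy under stated assumptions, not the logic in preprocess.py: the helper names are hypothetical, grounding is plain phrase matching against a concept vocabulary, and the subgraph keeps the grounded concepts plus any node that bridges a question concept and an answer concept in two hops.

```python
import re
from collections import defaultdict


def ground_concepts(text, vocabulary):
    """Return vocabulary concepts that occur as whole-word phrases in text.

    Concepts are assumed to use underscores for spaces, e.g. "revolving_door".
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    padded = " " + " ".join(tokens) + " "
    return {c for c in vocabulary if " " + c.replace("_", " ") + " " in padded}


def extract_subgraph(edges, question_concepts, answer_concepts):
    """Keep grounded concepts plus nodes on a 2-hop path between a q and an a concept.

    edges: iterable of (head, relation, tail) triples.
    """
    neighbors = defaultdict(set)
    for head, _relation, tail in edges:
        neighbors[head].add(tail)
        neighbors[tail].add(head)  # treat the graph as undirected for retrieval

    nodes = set(question_concepts) | set(answer_concepts)
    for q in question_concepts:
        for a in answer_concepts:
            nodes |= neighbors[q] & neighbors[a]  # bridge nodes between q and a

    kept_edges = [(h, r, t) for h, r, t in edges if h in nodes and t in nodes]
    return nodes, kept_edges


if __name__ == "__main__":
    vocab = {"revolving_door", "door", "bank", "building", "money"}
    kg = [
        ("revolving_door", "atlocation", "bank"),
        ("door", "atlocation", "building"),
        ("bank", "isa", "building"),
        ("money", "atlocation", "bank"),
    ]
    q = ground_concepts("Where would you find a revolving door?", vocab)
    a = ground_concepts("bank", vocab)
    print(extract_subgraph(kg, q, a))
```

In the toy graph, `money` is dropped because it is neither grounded nor a bridge between a question concept and the answer concept.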

Biomedical domain

You can download all the preprocessed data from [here]. This includes the UMLS biomedical knowledge graph and MedQA dataset.

(Optional) If you would like to preprocess MedQA from scratch, follow utils_biomed/preprocess_medqa.ipynb and then run:

CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes> --run medqa

The resulting file structure should look like this:
.
├── README.md
├── models/
    ├── general_model.pt
    ├── biomed_model.pt

└── data/
    ├── cpnet/                 (preprocessed ConceptNet KG)
    └── csqa/
        ├── train_rand_split.jsonl
        ├── dev_rand_split.jsonl
        ├── test_rand_split_no_answers.jsonl
        ├── statement/             (converted statements)
        ├── grounded/              (grounded entities)
        ├── graphs/                (extracted subgraphs)
        ├── ...
    ├── obqa/
    ├── umls/                  (preprocessed UMLS KG)
    └── medqa/
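
Before training, a quick check that the layout above is in place can save a failed run. The sketch below is not part of the repo; `missing_paths` is a hypothetical helper and the path list is copied from the tree above (entries elided there with `...` are not checked).

```python
from pathlib import Path

# Paths taken from the tree above, relative to the repository root.
EXPECTED_PATHS = [
    "models/general_model.pt",
    "models/biomed_model.pt",
    "data/cpnet",
    "data/csqa/statement",
    "data/csqa/grounded",
    "data/csqa/graphs",
    "data/obqa",
    "data/umls",
    "data/medqa",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


if __name__ == "__main__":
    for path in missing_paths("."):
        print(f"missing: {path}")
```

If you only work in one domain, pass a shorter `expected` list, since the general and biomedical assets are downloaded separately.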

3. Train DRAGON

To train DRAGON on CommonsenseQA, OpenBookQA, RiddleSense, or MedQA, run the corresponding script:

scripts/run_train__csqa.sh
scripts/run_train__obqa.sh
scripts/run_train__riddle.sh
scripts/run_train__medqa.sh

(Optional) If you would like to pretrain DRAGON (i.e., self-supervised pretraining), run:

scripts/run_pretrain.sh

As a quick demo, this script uses sentences from CommonsenseQA as training data. If you wish to use a larger, general corpus like BookCorpus, follow Section 5 (Use your own dataset) to prepare the training data.

4. Evaluate trained models

For CommonsenseQA, OpenBookQA, RiddleSense, or MedQA, run the corresponding script:

scripts/run_eval__csqa.sh
scripts/run_eval__obqa.sh
scripts/run_eval__riddle.sh
scripts/run_eval__medqa.sh

You can download trained model checkpoints in the next section.

Trained model examples

CommonsenseQA

| Trained model | In-house Dev acc. | In-house Test acc. |
| --- | --- | --- |
| DRAGON [link] | 0.7928 | 0.7615 |

OpenBookQA

| Trained model | Dev acc. | Test acc. |
| --- | --- | --- |
| DRAGON [link] | 0.7080 | 0.7280 |

RiddleSense

| Trained model | In-house Dev acc. | In-house Test acc. |
| --- | --- | --- |
| DRAGON [link] | 0.6869 | 0.7157 |

MedQA

| Trained model | Dev acc. | Test acc. |
| --- | --- | --- |
| BioLinkBERT + DRAGON [link] | 0.4308 | 0.4768 |

Note: The models were trained and tested with HuggingFace transformers==4.9.1.

5. Use your own dataset

  • Convert your dataset to {train,dev,test}.statement.jsonl in .jsonl format (see data/csqa/statement/train.statement.jsonl)
  • Create a directory in data/{yourdataset}/ to store the .jsonl files
  • Modify preprocess.py and perform subgraph extraction for your data
  • Modify utils/parser_utils.py to support your own dataset
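
As a starting point for the first step above, the sketch below writes multiple-choice examples to a .statement.jsonl file. The field names (`id`, `answerKey`, `question`, `stem`, `choices`, `statements`) are assumptions modeled on a typical CommonsenseQA-style statement file; verify them against data/csqa/statement/train.statement.jsonl before relying on this. The helper names are hypothetical, and the statement text is formed by naive concatenation of question and choice.

```python
import json
import string


def to_statement_record(example_id, stem, choices, answer_index):
    """Build one record; field names are assumed, check them against the CSQA file."""
    labels = string.ascii_uppercase
    return {
        "id": example_id,
        "answerKey": labels[answer_index],
        "question": {
            "stem": stem,
            "choices": [
                {"label": labels[i], "text": text} for i, text in enumerate(choices)
            ],
        },
        # One statement per choice; label is True only for the correct choice.
        "statements": [
            {"label": i == answer_index, "statement": f"{stem} {text}"}
            for i, text in enumerate(choices)
        ],
    }


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

For example, `write_jsonl([to_statement_record("ex-1", "Where is a revolving door common?", ["bank", "field"], 0)], "data/yourdataset/statement/train.statement.jsonl")` would produce a single-line file, assuming the target directory already exists.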

Citation

If you find our work helpful, please cite the following:

@InProceedings{yasunaga2022dragon,
  author =  {Michihiro Yasunaga and Antoine Bosselut and Hongyu Ren and Xikun Zhang and Christopher D. Manning and Percy Liang and Jure Leskovec},
  title =   {Deep Bidirectional Language-Knowledge Graph Pretraining},
  year =    {2022},  
  booktitle = {Neural Information Processing Systems (NeurIPS)},  
}

Acknowledgment

This repo is built upon the following works:

GreaseLM: Graph REASoning Enhanced Language Models for Question Answering
https://github.com/snap-stanford/GreaseLM

QA-GNN: Question Answering using Language Models and Knowledge Graphs
https://github.com/michiyasunaga/qagnn
