donghai68 / dragon

This project forked from michiyasunaga/dragon

[NeurIPS 2022] DRAGON 🐲: Deep Bidirectional Language-Knowledge Graph Pretraining

License: Apache License 2.0

DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining

This repo provides the source code & data of our paper "DRAGON: Deep Bidirectional Language-Knowledge Graph Pretraining" (NeurIPS 2022).

Overview

DRAGON is a new foundation model (an improvement on BERT) that is pretrained jointly on text and knowledge graphs to improve language, knowledge, and reasoning capabilities. Specifically, it is trained with two simultaneous self-supervised objectives, language modeling and link prediction, which encourage deep bidirectional reasoning over text and knowledge graphs.

DRAGON can be used as a drop-in replacement for BERT. It achieves better performance on various NLP tasks and is particularly effective for knowledge- and reasoning-intensive tasks such as multi-step reasoning and low-resource QA.
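
To make the two objectives described above concrete, here is a minimal, framework-free sketch of how a masked language modeling loss and a link prediction loss could be summed into one training signal. It is an illustration under assumptions, not the repository's implementation: the DistMult-style scoring function, the negative-sampling form of the link loss, the equal weighting of the two terms, and all function names are choices made for this example.

```python
import math


def masked_lm_loss(token_probs, masked_positions, gold_ids):
    """Average cross-entropy over masked positions only.

    token_probs[pos] is the predicted distribution over the vocabulary
    at position pos; gold_ids holds the true token id for each masked position.
    """
    losses = [-math.log(token_probs[pos][gold])
              for pos, gold in zip(masked_positions, gold_ids)]
    return sum(losses) / len(losses)


def distmult_score(head, rel, tail):
    """DistMult-style triple score: sum_i head_i * rel_i * tail_i (assumed scorer)."""
    return sum(h * r * t for h, r, t in zip(head, rel, tail))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def link_prediction_loss(positive, negatives):
    """Negative-sampling loss: raise the held-out edge's score,
    lower the scores of corrupted (negative) edges."""
    pos_term = -log_sigmoid(distmult_score(*positive))
    neg_term = -sum(log_sigmoid(-distmult_score(*neg)) for neg in negatives) / len(negatives)
    return pos_term + neg_term


def joint_loss(mlm, link):
    """Both objectives are optimized together; equal weighting is an assumption."""
    return mlm + link
```

In an actual model the token distributions and the entity embeddings would both come from the fused text and graph encoder, so gradients from either loss update both modalities.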

0. Dependencies

Run the following commands to create a conda environment:

conda create -y -n dragon python=3.8
conda activate dragon
pip install torch==1.10.1+cu113 torchvision -f https://download.pytorch.org/whl/cu113/torch_stable.html
pip install transformers==4.9.1 wandb nltk spacy==2.1.6
python -m spacy download en
pip install scispacy==0.3.0
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
pip install torch-scatter==2.0.9 torch-sparse==0.6.12 torch-geometric==2.0.0 -f https://pytorch-geometric.com/whl/torch-1.10.1+cu113.html

1. Download pretrained models

You can download the pretrained DRAGON models below. Place the downloaded model files under ./models.

Model | Domain | Size | Pretraining Text | Pretraining Knowledge Graph | Download Link
DRAGON | General | 360M parameters | BookCorpus | ConceptNet | general_model
DRAGON | Biomedicine | 360M parameters | PubMed | UMLS | biomed_model

2. Download data

Commonsense domain

You can download all the preprocessed data from [here]. This includes the ConceptNet knowledge graph as well as CommonsenseQA, OpenBookQA and RiddleSense datasets. Specifically, run:

wget https://nlp.stanford.edu/projects/myasu/DRAGON/data_preprocessed.zip
unzip data_preprocessed.zip
mv data_preprocessed data

(Optional) If you would like to preprocess the raw data from scratch, you can download the raw data (ConceptNet knowledge graph, CommonsenseQA, OpenBookQA) by running:

./download_raw_data.sh

To preprocess the raw data, run:

CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes> --run common csqa obqa

To choose which GPU is used, set CUDA_VISIBLE_DEVICES=... at the beginning of the command. The script will:

  • Set up ConceptNet (e.g., extract English relations from ConceptNet, merge the original 42 relation types into 17 types)
  • Convert the QA datasets into .jsonl files (e.g., stored in data/csqa/statement/)
  • Identify all mentioned concepts in the questions and answers
  • Extract subgraphs for each q-a pair
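
As a rough illustration of the last step, the sketch below extracts a small subgraph for one question-answer pair from an undirected edge list. The README does not spell out the extraction rule, so the two-hop bridging heuristic used here (keep the question concepts, the answer concepts, and any node adjacent to both) is an assumption, and the function names are hypothetical; preprocess.py may do this differently.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build a symmetric adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def extract_subgraph(adjacency, question_concepts, answer_concepts):
    """Return (nodes, edges) linking question and answer concepts.

    Assumed heuristic: keep all grounded concepts plus every "bridge" node
    that neighbors at least one question concept and one answer concept.
    """
    q_nodes, a_nodes = set(question_concepts), set(answer_concepts)
    nodes = q_nodes | a_nodes
    for q in q_nodes:
        for a in a_nodes:
            nodes |= adjacency.get(q, set()) & adjacency.get(a, set())
    # Keep each undirected edge once, ordered by node name.
    edges = {(u, v) for u in nodes for v in adjacency.get(u, set())
             if v in nodes and u < v}
    return nodes, edges


kg = build_adjacency([("bird", "wing"), ("wing", "fly"),
                      ("bird", "egg"), ("fly", "sky")])
nodes, edges = extract_subgraph(kg, ["bird"], ["fly"])
```

Here "wing" is kept because it connects the question concept "bird" to the answer concept "fly", while "egg" and "sky" are dropped.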

Biomedical domain

You can download all the preprocessed data from [here]. This includes the UMLS biomedical knowledge graph and MedQA dataset.

(Optional) If you would like to preprocess MedQA from scratch, follow utils_biomed/preprocess_medqa.ipynb and then run:

CUDA_VISIBLE_DEVICES=0 python preprocess.py -p <num_processes> --run medqa

The resulting file structure should look like this:
.
├── README.md
├── models/
│   ├── general_model.pt
│   └── biomed_model.pt
└── data/
    ├── cpnet/                 (preprocessed ConceptNet KG)
    ├── csqa/
    │   ├── train_rand_split.jsonl
    │   ├── dev_rand_split.jsonl
    │   ├── test_rand_split_no_answers.jsonl
    │   ├── statement/             (converted statements)
    │   ├── grounded/              (grounded entities)
    │   ├── graphs/                (extracted subgraphs)
    │   └── ...
    ├── obqa/
    ├── umls/                  (preprocessed UMLS KG)
    └── medqa/
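
Before training, it can help to confirm that the layout above is in place. The following is a small convenience sketch, not part of the repository: the function name and the exact list of paths checked are assumptions taken from the tree shown above, so trim the list if you only use one domain.

```python
from pathlib import Path

# Paths taken from the file structure listed above (relative to the repo root).
EXPECTED_PATHS = [
    "models/general_model.pt",
    "models/biomed_model.pt",
    "data/cpnet",
    "data/csqa/statement",
    "data/csqa/grounded",
    "data/csqa/graphs",
    "data/obqa",
    "data/umls",
    "data/medqa",
]


def missing_paths(root="."):
    """Return the expected files and directories that do not exist under root."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    for path in missing_paths():
        print(f"missing: {path}")
```

Run it from the repository root; an empty output means every listed path was found.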

3. Train DRAGON

To train DRAGON on CommonsenseQA, OpenBookQA, RiddleSense, MedQA, run:

scripts/run_train__csqa.sh
scripts/run_train__obqa.sh
scripts/run_train__riddle.sh
scripts/run_train__medqa.sh

(Optional) If you would like to pretrain DRAGON (i.e., run the self-supervised pretraining yourself), run:

scripts/run_pretrain.sh

As a quick demo, this script uses sentences from CommonsenseQA as training data. If you wish to use a larger, general corpus like BookCorpus, follow Section 5 (Use your own dataset) to prepare the training data.

4. Evaluate trained models

For CommonsenseQA, OpenBookQA, RiddleSense, MedQA, run:

scripts/eval_dragon__csqa.sh
scripts/eval_dragon__obqa.sh
scripts/eval_dragon__riddle.sh
scripts/eval_dragon__medqa.sh

You can download trained model checkpoints in the next section.

Trained model examples

CommonsenseQA

Trained model | In-house Dev acc. | In-house Test acc.
DRAGON [link] | 0.7928 | 0.7615

OpenBookQA

Trained model | Dev acc. | Test acc.
DRAGON [link] | 0.7080 | 0.7280

RiddleSense

Trained model | In-house Dev acc. | In-house Test acc.
DRAGON [link] | 0.6869 | 0.7157

MedQA

Trained model | Dev acc. | Test acc.
BioLinkBERT + DRAGON [link] | 0.4308 | 0.4768

Note: The models were trained and tested with HuggingFace transformers==4.9.1.

5. Use your own dataset

  • Convert your dataset to {train,dev,test}.statement.jsonl in .jsonl format (see data/csqa/statement/train.statement.jsonl)
  • Create a directory in data/{yourdataset}/ to store the .jsonl files
  • Modify preprocess.py and perform subgraph extraction for your data
  • Modify utils/parser_utils.py to support your own dataset
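
For the first step, a conversion script might look like the sketch below. The record layout (id, answerKey, question.stem, question.choices, statements) is an assumption modeled on CommonsenseQA-style data, and the helper names are hypothetical; compare the output against data/csqa/statement/train.statement.jsonl and adjust the fields before relying on it.

```python
import json

CHOICE_LABELS = "ABCDEFGH"


def to_statement_record(example_id, stem, choices, answer_index):
    """Build one multiple-choice record with one statement per answer choice.

    Field names are assumed, not confirmed by this README.
    """
    return {
        "id": example_id,
        "answerKey": CHOICE_LABELS[answer_index],
        "question": {
            "stem": stem,
            "choices": [{"label": CHOICE_LABELS[i], "text": text}
                        for i, text in enumerate(choices)],
        },
        # Each statement pairs the question with one candidate answer.
        "statements": [{"label": i == answer_index, "statement": f"{stem} {text}"}
                       for i, text in enumerate(choices)],
    }


def write_jsonl(records, path):
    """Write one JSON object per line, as the .jsonl format requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


record = to_statement_record("q1", "Where do fish live?", ["water", "desert"], 0)
```

Calling write_jsonl([record], "data/{yourdataset}/statement/train.statement.jsonl") with your own directory name would then produce a file ready for the subgraph extraction step.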

Citation

If you find our work helpful, please cite the following:

@InProceedings{yasunaga2022dragon,
  author =  {Michihiro Yasunaga and Antoine Bosselut and Hongyu Ren and Xikun Zhang and Christopher D. Manning and Percy Liang and Jure Leskovec},
  title =   {Deep Bidirectional Language-Knowledge Graph Pretraining},
  year =    {2022},  
  booktitle = {Neural Information Processing Systems (NeurIPS)},  
}

Acknowledgment

This repo is built upon the following works:

GreaseLM: Graph REASoning Enhanced Language Models for Question Answering
https://github.com/snap-stanford/GreaseLM

QA-GNN: Question Answering using Language Models and Knowledge Graphs
https://github.com/michiyasunaga/qagnn
