princeton-nlp / nlproofs

EMNLP 2022: Generating Natural Language Proofs with Verifier-Guided Search https://arxiv.org/abs/2205.12443

License: MIT License

Topics: machine-learning, nlp, reasoning


Generating Natural Language Proofs with Verifier-Guided Search

[Figure: Task]

Code for the paper:

Generating Natural Language Proofs with Verifier-Guided Search
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Kaiyu Yang, Jia Deng, and Danqi Chen


Requirements

  1. Download and install Miniconda Python 3 (Anaconda should also work).
  2. Clone this repo and cd into its root.
  3. Install Python dependencies: conda env create -f nlproofs.yaml. You may need to edit nlproofs.yaml for your system, e.g., to use a different CUDA version. If the installation command fails, you can also install the packages in nlproofs.yaml manually in whatever way works for you.
  4. Activate the conda environment: conda activate nlproofs, and prepend the root of this repo to the PYTHONPATH environment variable (see the sketch after this list).
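A minimal sketch of steps 3 and 4, assuming a bash-like shell on Linux/macOS and that you run the commands from the repo root:

conda env create -f nlproofs.yaml       # step 3: install dependencies
conda activate nlproofs                 # step 4: activate the environment
export PYTHONPATH="$PWD:$PYTHONPATH"    # step 4: prepend the repo root to PYTHONPATH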

Data Preprocessing

  1. Download the v3_May6_2022 version of EntailmentBank (MD5: 9cb91896325157cee1f35616be0be179) and unzip it as ./data/entailment_trees_emnlp2021_data_v3/.
  2. Download the OWA version of RuleTaker (MD5: bf490364bca241bb5ff9f0ab0c78b71a) and unzip it as ./data/proofwriter-dataset-V2020.12.3/.
  3. Run python check_data.py to check that both datasets are in the expected locations (a command sketch follows this list).
  4. Run python preprocess_ruletaker.py to preprocess the RuleTaker dataset.
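A sketch of the steps above, assuming the downloaded archives are named entailment_trees_emnlp2021_data_v3.zip and proofwriter-dataset-V2020.12.3.zip (the actual filenames may differ; if an archive lacks a matching top-level directory, create the target directory under ./data/ and unzip into it):

md5sum entailment_trees_emnlp2021_data_v3.zip    # expect 9cb91896325157cee1f35616be0be179
md5sum proofwriter-dataset-V2020.12.3.zip        # expect bf490364bca241bb5ff9f0ab0c78b71a
unzip entailment_trees_emnlp2021_data_v3.zip -d ./data/
unzip proofwriter-dataset-V2020.12.3.zip -d ./data/
python check_data.py
python preprocess_ruletaker.py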

EntailmentBank Experiments

We use Lightning CLI to create scripts for training, validation, and testing: prover/main.py and verifier/main.py for the prover and the verifier, respectively. They take arguments from the command line as well as YAML configuration files. Please run python main.py --help or refer to the documentation of Lightning CLI for details.

We provide YAML files with our hyperparameters and experimental settings in ./prover/ and ./verifier/. We ran all experiments on a single NVIDIA A6000 GPU with 48GB memory. To run them on GPUs with less memory, you may have to reduce batch_size and increase accumulate_grad_batches accordingly. On newer GPUs, --trainer.precision bf16 may lead to significant speedups and memory savings, but we have not tested it thoroughly, so use it at your own discretion. Note that pretrained T5 models do not play well with fp16.
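For example, gradient accumulation and bf16 precision can be overridden from the command line via Lightning CLI (a sketch; batch_size itself is set in the YAML configs, so edit it there):

python main.py fit --config cli_task1_stepwise_t5-large.yaml --trainer.accumulate_grad_batches 2 --trainer.precision bf16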

Training

Prover

First, cd into ./prover/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_task1_single_shot_t5-large.yaml  # Train a single-shot prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task1_stepwise_t5-large.yaml     # Train a stepwise prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task2_single_shot_t5-large.yaml  # Train a single-shot prover on Task 2 of EntailmentBank.
python main.py fit --config cli_task2_stepwise_t5-large.yaml     # Train a stepwise prover on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./prover/lightning_logs/EXP_ID/, where EXP_ID is an arbitrary experiment ID that will be printed by the training script.
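The checkpoints themselves typically sit in a checkpoints/ subdirectory of the experiment folder (PyTorch Lightning's default layout), e.g.:

ls ./prover/lightning_logs/EXP_ID/checkpoints/    # the *.ckpt files used as PATH_TO_PROVER_CKPT below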

Verifier

First, cd into ./verifier/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_entailmentbank_task1.yaml  # Train a verifier on Task 1 of EntailmentBank.
python main.py fit --config cli_entailmentbank_task2.yaml  # Train a verifier on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./verifier/lightning_logs/EXP_ID/.

Validation and Testing

Once training completes, we use the model checkpoint to predict on the validation and test data. cd into ./prover/ and run python main.py validate --help and python main.py test --help to see how to use the script for validation and testing. Assuming we have a prover checkpoint PATH_TO_PROVER_CKPT and a verifier checkpoint PATH_TO_VERIFIER_CKPT, below are example commands:

python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                     # Validate the stepwise prover without verifier-guided search on Task 2 of EntailmentBank.
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS (stepwise prover + verifier-guided search).
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 1.0 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS w/o prover score.
python main.py test --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true       # Test NLProofS (stepwise prover + verifier-guided search).
python main.py test --config cli_task1_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                   # Test the single-shot prover on Task 1 of EntailmentBank.
python main.py test --config cli_task2_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --data.path_test ../data/entailment_trees_emnlp2021_data_v3/dataset/task_3/test.jsonl            # Test the single-shot prover (trained on Task 2) on Task 3 of EntailmentBank.

Validation and testing results are saved as ./prover/lightning_logs/EXP_ID/results_val.tsv and ./prover/lightning_logs/EXP_ID/results_test.tsv. They are the input to EntailmentBank's official evaluation code for calculating the evaluation metrics.

Test Results and Model Checkpoints

The rightmost columns of the tables below provide download links for model checkpoints and prediction files.

Task 1

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 97.6 | 90.0 | 54.8 | 41.8 | 72.0 | 39.7 | 38.2 | prover, verifier | results_val.tsv | results_test.tsv |
| Stepwise prover | 98.8 | 98.5 | 54.8 | 41.5 | 71.9 | 38.5 | 36.8 | The prover above | results_val.tsv | results_test.tsv |
| Single-shot prover | 98.2 | 82.7 | 51.8 | 40.9 | 66.7 | 36.5 | 34.7 | prover | results_val.tsv | results_test.tsv |

Task 2

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 90.3 | 60.6 | 48.6 | 35.6 | 70.3 | 39.4 | 34.4 | prover, verifier | results_val.tsv | results_test.tsv |
| Stepwise prover | 90.3 | 57.1 | 48.6 | 35.6 | 70.1 | 38.5 | 33.8 | The prover above | results_val.tsv | results_test.tsv |
| Single-shot prover | 85.9 | 44.7 | 41.3 | 29.1 | 62.5 | 31.5 | 27.7 | prover | results_val.tsv | results_test.tsv |

Task 3

Results on Task 3 are produced by evaluating Task 2 models zero-shot on Task 3 data (by changing --data.path_val and --data.path_test).
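For example, to test the Task 2 NLProofS checkpoints on Task 3, reuse the Task 2 test command and point --data.path_test at the Task 3 split:

python main.py test --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true --data.path_test ../data/entailment_trees_emnlp2021_data_v3/dataset/task_3/test.jsonl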

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 43.9 | 9.1 | 10.6 | 6.8 | 42.4 | 15.9 | 6.8 | Same as Task 2 | results_val.tsv | results_test.tsv |
| Stepwise prover | 42.8 | 7.4 | 9.3 | 5.9 | 42.1 | 15.0 | 5.9 | Same as Task 2 | results_val.json | results_test.json |
| Single-shot prover | 40.5 | 4.4 | 9.1 | 3.8 | 35.3 | 7.9 | 3.8 | Same as Task 2 | results_val.tsv | results_test.tsv |

Students in Princeton's COS484 (Emre Onal, Max Gonzalez Saez-Diez, and Maria Khartchenko) have conducted a comprehensive ablation study and improved our results on EntailmentBank (code available here).

RuleTaker Experiments

Training

Prover

Training on RuleTaker is similar to training on EntailmentBank but with different configuration files. Run the following commands in ./prover/:

python main.py fit --config cli_ruletaker_single_shot_t5-large.yaml  # Train a single-shot prover on D0–D3 of RuleTaker (OWA).
python main.py fit --config cli_ruletaker_stepwise_t5-large.yaml     # Train a stepwise prover on D0–D3 of RuleTaker (OWA).

Verifier

Training the verifier is also similar. Run the following command in ./verifier/:

python main.py fit --config cli_ruletaker.yaml  # Train a verifier on D0–D3 of RuleTaker (OWA).

Validation and Testing

cd into ./prover/. Assume we have a prover checkpoint PATH_TO_PROVER_CKPT and a verifier checkpoint PATH_TO_VERIFIER_CKPT.

python main.py validate --config cli_ruletaker_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true --trainer.limit_val_batches 1.0  # Validate NLProofS on D0–D3 of RuleTaker (OWA).
python main.py test --config cli_ruletaker_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true  # Test NLProofS on D0–D3 of RuleTaker (OWA).

Note the --trainer.limit_val_batches 1.0 above. By default, we use only 200 batches for RuleTaker validation (see ./prover/cli_ruletaker_stepwise_t5-large.yaml and ./prover/cli_ruletaker_single_shot_t5-large.yaml), but here we want to use all batches.

Validation and testing results are saved as ./prover/lightning_logs/EXP_ID/results_val.json and ./prover/lightning_logs/EXP_ID/results_test.json. Run the following command for final evaluation:

python evaluate.py ruletaker --path-val PATH_TO_VAL_RESULTS --path-test PATH_TO_TEST_RESULTS
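For example, assuming the command is run from ./prover/ and EXP_ID is the experiment folder of the run above:

python evaluate.py ruletaker --path-val lightning_logs/EXP_ID/results_val.json --path-test lightning_logs/EXP_ID/results_test.json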

Test Results and Model Checkpoints

| Model | Answer accuracy | Proof accuracy | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- |
| NLProofS | 99.3 | 99.2 | prover, verifier | results_val.json | results_test.json |
| Stepwise prover | 68.7 | 91.3 | The prover above | results_val.json | results_test.json |
| Single-shot prover | 56.3 | 72.6 | prover | results_val.json | results_test.json |

Bugs or Questions

If you have any questions related to the code or the paper, feel free to email Kaiyu. If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!

Citation

@inproceedings{yang2022nlproofs,
  title={Generating Natural Language Proofs with Verifier-Guided Search},
  author={Yang, Kaiyu and Deng, Jia and Chen, Danqi},
  booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2022}
}


nlproofs's Issues

torchmetrics classes missing positional arguments 'task' and invalid argument 'threshold=0'

Running EntailmentBank Task 2 experiments, we came across a couple of errors thrown from verifier/model.py. The first error comes from the missing positional argument task='binary' in torchmetrics.Accuracy. The task='binary' argument is also missing from the other torchmetrics class constructors called immediately below this line.

Missing positional argument 'task' error:

File "/content/NLProofS/prover/model.py", line 136, in __init__
  EntailmentClassifier.load_from_checkpoint(verifier_ckpt)
File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
  model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/core/saving.py", line 203, in _load_model_state
  model = cls(**_cls_kwargs)
File "/content/NLProofS/verifier/model.py", line 32, in __init__
  "accuracy": torchmetrics.Accuracy(threshold=0),
TypeError: __new__() missing 1 required positional argument: 'task'

The second error is also from torchmetrics.Accuracy, since the 'threshold' argument is set to 0 and it is expected to be a float in the [0,1] range.

Invalid argument 'threshold=0' error:

File "/content/NLProofS/verifier/model.py", line 32, in __init__
  "accuracy": torchmetrics.Accuracy(task='binary', threshold=0),
File "/usr/local/lib/python3.9/dist-packages/torchmetrics/classification/accuracy.py", line 355, in __new__
  return BinaryAccuracy(threshold, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torchmetrics/classification/stat_scores.py", line 164, in __init__
  _binary_stat_scores_arg_validation(threshold, multidim_average, ignore_index)
File "/usr/local/lib/python3.9/dist-packages/torchmetrics/functional/classification/stat_scores.py", line 37, in _binary_stat_scores_arg_validation
  raise ValueError(f"Expected argument threshold to be a float in the [0,1] range, but got {threshold}.")
ValueError: Expected argument threshold to be a float in the [0,1] range, but got 0.
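A possible workaround (a sketch, not an official fix; it assumes torchmetrics >= 0.11, where the task argument became required, and that the metric receives raw logits, which newer torchmetrics auto-detects and passes through a sigmoid, so threshold=0.5 on probabilities matches the original threshold=0 on logits):

import torchmetrics

# Hypothetical adaptation of verifier/model.py line 32 for newer torchmetrics.
# The same task="binary" argument would also be needed for the other metric
# constructors mentioned in the issue.
metrics = {
    "accuracy": torchmetrics.Accuracy(task="binary", threshold=0.5),
}

Alternatively, pinning torchmetrics in nlproofs.yaml to a version that predates the required task argument avoids touching the code.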

Import error: "cannot import TextFace from ete3"

Thanks for generously sharing this excellent work, which I am very interested in.
When I try to run the code, I get the error "cannot import TextFace from ete3". This is my first time using ete3, and I found that many people have run into the same error. I tried the solutions mentioned in https://github.com/etetoolkit/ete/issues/195 as well as the different official installation methods, but none of them worked.

I would like to ask if you have encountered a similar problem and how to solve it?

I used Miniconda3 Linux 64-bit and created a new conda environment using nlproofs.yaml.

Below is my error message:
File "/root/NLProofS/prover/evaluate.py", line 9, in <module>
  from ete3 import TextFace, TreeStyle, NodeStyle
ImportError: cannot import name 'TextFace' from 'ete3' (/root/miniconda3/envs/nlproofs/lib/python3.9/site-packages/ete3/__init__.py)
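A commonly reported cause (an assumption, not a confirmed answer from the maintainers): TextFace and TreeStyle come from ete3's treeview module, which is only importable when Qt bindings are available, so installing PyQt5 into the nlproofs environment often resolves this ImportError:

pip install PyQt5    # hypothetical workaround; ete3's treeview needs Qt bindings to expose TextFace/TreeStyle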

Intermediate-F1 & AllCorrect

When running python main.py test --config cli_task1_single_shot_t5-large.yaml --ckpt_path PROVER_PATH, I only get results for the proof, leaves, and steps. The paper reports intermediate results as well. How are these obtained?
