princeton-nlp / nlproofs

EMNLP 2022: Generating Natural Language Proofs with Verifier-Guided Search https://arxiv.org/abs/2205.12443

License: MIT License

Topics: machine-learning, nlp, reasoning


Generating Natural Language Proofs with Verifier-Guided Search

[Figure: Task]

Code for the paper:

Generating Natural Language Proofs with Verifier-Guided Search
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Kaiyu Yang, Jia Deng, and Danqi Chen


Requirements

  1. Download and install Miniconda Python 3 (Anaconda should also work).
  2. Clone this repo and cd into its root.
  3. Install Python dependencies: conda env create -f nlproofs.yaml. You may need to edit nlproofs.yaml for your system, e.g., to use a different CUDA version. If the installation command fails, you can also install the packages in nlproofs.yaml manually in whatever way works for you.
  4. Activate the conda environment: conda activate nlproofs, and prepend the root of this repo to the PYTHONPATH environment variable (see the sketch after this list).
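A minimal sketch of steps 3 and 4, assuming a bash-like shell on Linux/macOS and that you run the commands from the repo root:

conda env create -f nlproofs.yaml       # step 3: install dependencies
conda activate nlproofs                 # step 4: activate the environment
export PYTHONPATH="$PWD:$PYTHONPATH"    # step 4: prepend the repo root to PYTHONPATH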

Data Preprocessing

  1. Download the v3_May6_2022 version of EntailmentBank (MD5: 9cb91896325157cee1f35616be0be179) and unzip it as ./data/entailment_trees_emnlp2021_data_v3/.
  2. Download the OWA version of RuleTaker (MD5: bf490364bca241bb5ff9f0ab0c78b71a) and unzip it as ./data/proofwriter-dataset-V2020.12.3/.
  3. Run python check_data.py to check that both datasets are in the expected locations (a command sketch follows this list).
  4. Run python preprocess_ruletaker.py to preprocess the RuleTaker dataset.
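A sketch of the steps above, assuming the downloaded archives are named entailment_trees_emnlp2021_data_v3.zip and proofwriter-dataset-V2020.12.3.zip (the actual filenames may differ; if an archive lacks a matching top-level directory, create the target directory under ./data/ and unzip into it):

md5sum entailment_trees_emnlp2021_data_v3.zip    # expect 9cb91896325157cee1f35616be0be179
md5sum proofwriter-dataset-V2020.12.3.zip        # expect bf490364bca241bb5ff9f0ab0c78b71a
unzip entailment_trees_emnlp2021_data_v3.zip -d ./data/
unzip proofwriter-dataset-V2020.12.3.zip -d ./data/
python check_data.py
python preprocess_ruletaker.py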

EntailmentBank Experiments

We use Lightning CLI to create scripts for training, validation, and testing: prover/main.py and verifier/main.py for the prover and the verifier, respectively. They take arguments from the command line as well as YAML configuration files. Please run python main.py --help or refer to the documentation of Lightning CLI for details.

We provide YAML files with our hyperparameters and experimental settings in ./prover/ and ./verifier/. We ran all experiments on a single NVIDIA A6000 GPU with 48GB memory. To run them on GPUs with less memory, you may have to reduce batch_size and increase accumulate_grad_batches accordingly. On newer GPUs, --trainer.precision bf16 may lead to significant speedups and memory savings, but we have not tested it thoroughly, so use it at your own discretion. Note that pretrained T5 models do not play well with fp16.
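For example, gradient accumulation and bf16 precision can be overridden from the command line via Lightning CLI (a sketch; batch_size itself is set in the YAML configs, so edit it there):

python main.py fit --config cli_task1_stepwise_t5-large.yaml --trainer.accumulate_grad_batches 2 --trainer.precision bf16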

Training

Prover

First, cd into ./prover/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_task1_single_shot_t5-large.yaml  # Train a single-shot prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task1_stepwise_t5-large.yaml     # Train a stepwise prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task2_single_shot_t5-large.yaml  # Train a single-shot prover on Task 2 of EntailmentBank.
python main.py fit --config cli_task2_stepwise_t5-large.yaml     # Train a stepwise prover on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./prover/lightning_logs/EXP_ID/, where EXP_ID is an arbitrary experiment ID that will be printed by the training script.
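The checkpoints themselves typically sit in a checkpoints/ subdirectory of the experiment folder (PyTorch Lightning's default layout), e.g.:

ls ./prover/lightning_logs/EXP_ID/checkpoints/    # the *.ckpt files used as PATH_TO_PROVER_CKPT below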

Verifier

First, cd into ./verifier/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_entailmentbank_task1.yaml  # Train a verifier on Task 1 of EntailmentBank.
python main.py fit --config cli_entailmentbank_task2.yaml  # Train a verifier on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./verifier/lightning_logs/EXP_ID/.

Validation and Testing

Once training completes, we use the model checkpoint to predict on the validation and test data. cd into ./prover/ and run python main.py validate --help and python main.py test --help to see how to use the script for validation and testing. Assuming we have a prover checkpoint PATH_TO_PROVER_CKPT and a verifier checkpoint PATH_TO_VERIFIER_CKPT, below are example commands:

python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                     # Validate the stepwise prover without verifier-guided search on Task 2 of EntailmentBank.
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS (stepwise prover + verifier-guided search).
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 1.0 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS w/o prover score.
python main.py test --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true       # Test NLProofS (stepwise prover + verifier-guided search).
python main.py test --config cli_task1_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                   # Test the single-shot prover on Task 1 of EntailmentBank.
python main.py test --config cli_task2_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --data.path_test ../data/entailment_trees_emnlp2021_data_v3/dataset/task_3/test.jsonl            # Test the single-shot prover (trained on Task 2) on Task 3 of EntailmentBank.

Validation and testing results are saved as ./prover/lightning_logs/EXP_ID/results_val.tsv and ./prover/lightning_logs/EXP_ID/results_test.tsv. They are the input to EntailmentBank's official evaluation code for calculating the evaluation metrics.

Test Results and Model Checkpoints

The rightmost columns of the tables below provide download links for model checkpoints and prediction files.

Task 1

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 97.6 | 90.0 | 54.8 | 41.8 | 72.0 | 39.7 | 38.2 | prover, verifier | results_val.tsv | results_test.tsv |
| Stepwise prover | 98.8 | 98.5 | 54.8 | 41.5 | 71.9 | 38.5 | 36.8 | The prover above | results_val.tsv | results_test.tsv |
| Single-shot prover | 98.2 | 82.7 | 51.8 | 40.9 | 66.7 | 36.5 | 34.7 | prover | results_val.tsv | results_test.tsv |

Task 2

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 90.3 | 60.6 | 48.6 | 35.6 | 70.3 | 39.4 | 34.4 | prover, verifier | results_val.tsv | results_test.tsv |
| Stepwise prover | 90.3 | 57.1 | 48.6 | 35.6 | 70.1 | 38.5 | 33.8 | The prover above | results_val.tsv | results_test.tsv |
| Single-shot prover | 85.9 | 44.7 | 41.3 | 29.1 | 62.5 | 31.5 | 27.7 | prover | results_val.tsv | results_test.tsv |

Task 3

Results on Task 3 are produced by evaluating Task 2 models zero-shot on Task 3 data (by changing --data.path_val and --data.path_test).
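For example, to test the Task 2 NLProofS checkpoints on Task 3, reuse the Task 2 test command and point --data.path_test at the Task 3 split:

python main.py test --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true --data.path_test ../data/entailment_trees_emnlp2021_data_v3/dataset/task_3/test.jsonl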

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 43.9 | 9.1 | 10.6 | 6.8 | 42.4 | 15.9 | 6.8 | Same as Task 2 | results_val.tsv | results_test.tsv |
| Stepwise prover | 42.8 | 7.4 | 9.3 | 5.9 | 42.1 | 15.0 | 5.9 | Same as Task 2 | results_val.json | results_test.json |
| Single-shot prover | 40.5 | 4.4 | 9.1 | 3.8 | 35.3 | 7.9 | 3.8 | Same as Task 2 | results_val.tsv | results_test.tsv |

Students in Princeton's COS484 (Emre Onal, Max Gonzalez Saez-Diez, and Maria Khartchenko) have conducted a comprehensive ablation study and improved our results on EntailmentBank (code available here).

RuleTaker Experiments

Training

Prover

Training on RuleTaker is similar to training on EntailmentBank but with different configuration files. Run the following commands in ./prover/:

python main.py fit --config cli_ruletaker_single_shot_t5-large.yaml  # Train a single-shot prover on D0–D3 of RuleTaker (OWA).
python main.py fit --config cli_ruletaker_stepwise_t5-large.yaml     # Train a stepwise prover on D0–D3 of RuleTaker (OWA).

Verifier

Training the verifier is also similar. Run the following command in ./verifier/:

python main.py fit --config cli_ruletaker.yaml  # Train a verifier on D0–D3 of RuleTaker (OWA).

Validation and Testing

cd into ./prover/. Assume we have a prover checkpoint PATH_TO_PROVER_CKPT and a verifier checkpoint PATH_TO_VERIFIER_CKPT.

python main.py validate --config cli_ruletaker_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true --trainer.limit_val_batches 1.0  # Validate NLProofS on D0–D3 of RuleTaker (OWA).
python main.py test --config cli_ruletaker_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true  # Test NLProofS on D0–D3 of RuleTaker (OWA).

Note the --trainer.limit_val_batches 1.0 above. By default, we use only 200 batches for RuleTaker validation (see ./prover/cli_ruletaker_stepwise_t5-large.yaml and ./prover/cli_ruletaker_single_shot_t5-large.yaml), but here we want to use all batches.

Validation and testing results are saved as ./prover/lightning_logs/EXP_ID/results_val.json and ./prover/lightning_logs/EXP_ID/results_test.json. Run the following command for final evaluation:

python evaluate.py ruletaker --path-val PATH_TO_VAL_RESULTS --path-test PATH_TO_TEST_RESULTS
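For example, assuming the command is run from ./prover/ and EXP_ID is the experiment folder of the run above:

python evaluate.py ruletaker --path-val lightning_logs/EXP_ID/results_val.json --path-test lightning_logs/EXP_ID/results_test.json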

Test Results and Model Checkpoints

| Model | Answer accuracy | Proof accuracy | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- |
| NLProofS | 99.3 | 99.2 | prover, verifier | results_val.json | results_test.json |
| Stepwise prover | 68.7 | 91.3 | The prover above | results_val.json | results_test.json |
| Single-shot prover | 56.3 | 72.6 | prover | results_val.json | results_test.json |

Bugs or Questions

If you have any questions related to the code or the paper, feel free to email Kaiyu. If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!

Citation

@inproceedings{yang2022nlproofs,
  title={Generating Natural Language Proofs with Verifier-Guided Search},
  author={Yang, Kaiyu and Deng, Jia and Chen, Danqi},
  booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2022}
}


nlproofs's Issues

torchmetrics classes missing positional arguments 'task' and invalid argument 'threshold=0'

Running EntailmentBank Task 2 experiments, we came across a couple of errors thrown from verifier/model.py. The first error comes from the missing positional argument task='binary' in torchmetrics.Accuracy. The task='binary' argument is also missing from the other torchmetrics class constructors called immediately below this line.

Missing positional argument 'task' error:

File "/content/NLProofS/prover/model.py", line 136, in __init__
  EntailmentClassifier.load_from_checkpoint(verifier_ckpt)
File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
  model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/pytorch_lightning/core/saving.py", line 203, in _load_model_state
  model = cls(**_cls_kwargs)
File "/content/NLProofS/verifier/model.py", line 32, in __init__
  "accuracy": torchmetrics.Accuracy(threshold=0),
TypeError: __new__() missing 1 required positional argument: 'task'

The second error is also from torchmetrics.Accuracy, since the 'threshold' argument is set to 0 and it is expected to be a float in the [0,1] range.

Invalid argument 'threshold=0' error:

File "/content/NLProofS/verifier/model.py", line 32, in __init__
  "accuracy": torchmetrics.Accuracy(task='binary', threshold=0),
File "/usr/local/lib/python3.9/dist-packages/torchmetrics/classification/accuracy.py", line 355, in __new__
  return BinaryAccuracy(threshold, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torchmetrics/classification/stat_scores.py", line 164, in __init__
  _binary_stat_scores_arg_validation(threshold, multidim_average, ignore_index)
File "/usr/local/lib/python3.9/dist-packages/torchmetrics/functional/classification/stat_scores.py", line 37, in _binary_stat_scores_arg_validation
  raise ValueError(f"Expected argument threshold to be a float in the [0,1] range, but got {threshold}.")
ValueError: Expected argument threshold to be a float in the [0,1] range, but got 0.
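A possible workaround (a sketch, not an official fix; it assumes torchmetrics >= 0.11, where the task argument became required, and that the metric receives raw logits, which newer torchmetrics auto-detects and passes through a sigmoid, so threshold=0.5 on probabilities matches the original threshold=0 on logits):

import torchmetrics

# Hypothetical adaptation of verifier/model.py line 32 for newer torchmetrics.
# The same task="binary" argument would also be needed for the other metric
# constructors mentioned in the issue.
metrics = {
    "accuracy": torchmetrics.Accuracy(task="binary", threshold=0.5),
}

Alternatively, pinning torchmetrics in nlproofs.yaml to a version that predates the required task argument avoids touching the code.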

Import error: "cannot import TextFace from ete3"

Thanks for generously sharing this excellent work, which I am very interested in.
When I try to run the code, I get the error "cannot import TextFace from ete3". This is my first time using ete3, and I found that many people have run into the same error. I tried the solutions mentioned in https://github.com/etetoolkit/ete/issues/195 as well as the different official installation methods, but none of them worked.

I would like to ask if you have encountered a similar problem and how to solve it?

I used Miniconda3 Linux 64-bit and created a new conda environment using nlproofs.yaml.

Below is my error message:
File "/root/NLProofS/prover/evaluate.py", line 9, in <module>
  from ete3 import TextFace, TreeStyle, NodeStyle
ImportError: cannot import name 'TextFace' from 'ete3' (/root/miniconda3/envs/nlproofs/lib/python3.9/site-packages/ete3/__init__.py)
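A commonly reported cause (an assumption, not a confirmed answer from the maintainers): TextFace and TreeStyle come from ete3's treeview module, which is only importable when Qt bindings are available, so installing PyQt5 into the nlproofs environment often resolves this ImportError:

pip install PyQt5    # hypothetical workaround; ete3's treeview needs Qt bindings to expose TextFace/TreeStyle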

Intermediate-F1 & AllCorrect

When running python main.py test --config cli_task1_single_shot_t5-large.yaml --ckpt_path PROVER_PATH, I only get results for the proof, leaves, and steps. The paper reports intermediate results as well. How are these obtained?
