Code and Checkpoints for "Generate rather than Retrieve: Large Language Models are Strong Context Generators" in ICLR 2023.
Code for GenRead: Generate rather than Retrieve!

Introduction & Setup

  • This is the official implementation of our paper "Generate rather than Retrieve: Large Language Models are Strong Context Generators", published at ICLR 2023 [OpenReview] [arXiv].

  • Create an environment and install openai package via pip install openai.

  • Add your OpenAI API key at openai.api_key (line 12) in inference.py
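As an alternative to hardcoding the key, you could read it from an environment variable; this is a sketch of our own convention, not something the repo does (the env var name and helper are assumptions):

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    # Read the key from the environment instead of hardcoding it at
    # line 12 of inference.py; keeps the key out of version control.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before running inference.py")
    return key
```

You would then assign `openai.api_key = load_api_key()` in place of the hardcoded string.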

Download the Datasets

  • From their official websites: [NQ/TriviaQA/WebQ] / [FM2] / [FEVER/Wizard]

  • From Google drive: (we unified the formats of the above datasets) [link]

  • Please put them into the indataset folder; it currently contains webq and fm2.

Zero-shot Setting

Step 1: generate background documents.

python mainfunc.py 
  --dataset {dataset} 
  --task step1 
  --split test
  • Note: we use text-davinci-002 in our experiments, with greedy search in the zero-shot setting to ensure reproducibility.

  • Note: if you have limited access to the OpenAI API, you can use our outputs directly instead of spending money to reproduce the experiments. [zero-shot: step1]
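As a minimal sketch of Step 1 with greedy decoding: the prompt wording, max_tokens budget, and function names below are our assumptions, not taken from the repo (the actual prompt templates live under inprompts/ and the logic in mainfunc.py/inference.py):

```python
def build_step1_prompt(question):
    # Hypothetical prompt wording; the repo's real templates are in inprompts/.
    return ("Generate a background document to answer the given question.\n\n"
            f"Question: {question}\n\nDocument:")

def generate_document(question, model="text-davinci-002"):
    # Greedy decoding (temperature=0) for reproducibility, per the note above.
    import openai  # requires `pip install openai` and an API key
    resp = openai.Completion.create(
        model=model,
        prompt=build_step1_prompt(question),
        temperature=0.0,
        max_tokens=300,  # assumed budget, not a value from the repo
    )
    return resp["choices"][0]["text"].strip()
```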

Step 2: infer the answer from the documents.

python mainfunc.py 
  --dataset {dataset} 
  --task step2 
  --split test
  • Trick: we remove the \n characters from the generated documents.

  • Note: if you have limited access to the OpenAI API, you can use our outputs directly instead of spending money to reproduce the experiments. [zero-shot: step2]
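The newline-removal trick and a Step 2 prompt could be sketched as follows; the prompt wording is again our assumption (the actual templates are under inprompts/):

```python
def clean_document(doc):
    # The "trick" above: collapse all whitespace, removing \n characters
    # from the generated document before it is fed to the reader prompt.
    return " ".join(doc.split())

def build_step2_prompt(question, document):
    # Hypothetical wording for the answer-inference prompt.
    return ("Refer to the passage below and answer the following question.\n\n"
            f"Passage: {clean_document(document)}\n\n"
            f"Question: {question}\n\nAnswer:")
```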

Supervised Setting

Method1: use sampling to generate multiple documents.

python mainfunc.py 
  --dataset {dataset} 
  --task step1 
  --split test 
  --num_sequence 10 
  --temperature 0.95
  • Note: sampling-based decoding can produce different outputs on each run, so we cannot guarantee that your outputs will exactly match the ones we provide. [supervised: sampling]
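The --num_sequence and --temperature flags map naturally onto the OpenAI Completion API's `n` and `temperature` parameters; this sketch of the request (including the assumed max_tokens) is ours, not the repo's:

```python
def sampling_request(prompt, num_sequence=10, temperature=0.95):
    # Build kwargs for openai.Completion.create: `n` returns
    # num_sequence independent samples at the given temperature.
    return {
        "model": "text-davinci-002",
        "prompt": prompt,
        "n": num_sequence,
        "temperature": temperature,
        "max_tokens": 300,  # assumed budget
    }
```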

Method2: use clustering to generate diverse documents.

python clusterfunc.py 
  --dataset {dataset} 
  --task step1 
  --split {split} 
  --num_sequence 1 
  --temperature 0.95 
  --clustering
  • Note: with different in-context demonstrations, the outputs may differ on each run, so we cannot guarantee that your outputs will exactly match the ones we provide. [supervised: clustering]
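Conceptually, the clustering step groups question-document pairs by their embeddings and draws one demonstration per cluster. The sketch below uses a minimal pure-Python k-means and assumes the embeddings arrive as plain float lists; it illustrates the idea only and is not the repo's implementation in clusterfunc.py:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    # Minimal k-means: assign each vector to its nearest center,
    # then recompute centers as cluster means.
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def pick_demonstrations(pairs, embeddings, k, seed=0):
    # Sample one (question, document) demonstration per cluster,
    # mirroring the clustering-based prompting described above.
    assign = kmeans(embeddings, k, seed=seed)
    rng = random.Random(seed)
    demos = []
    for c in range(k):
        members = [p for p, a in zip(pairs, assign) if a == c]
        if members:
            demos.append(rng.choice(members))
    return demos
```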

Fusion-in-Decoder: train a reader model to infer answers from documents

  • We use the FiD code from its official GitHub repository [link].

  • Download our trained FiD checkpoints from the Hugging Face Hub:

    git lfs install
    git clone https://huggingface.co/wyu1/GenRead-3B-NQ
    
    git lfs install
    git clone https://huggingface.co/wyu1/GenRead-3B-TQA
    
  • If you need checkpoints on other settings, please email [email protected]

Citation

@inproceedings{yu2023generate,
  title={Generate rather than retrieve: Large language models are strong context generators},
  author={Yu, Wenhao and Iter, Dan and Wang, Shuohang and Xu, Yichong and Ju, Mingxuan and Sanyal, Soumya and Zhu, Chenguang and Zeng, Michael and Jiang, Meng},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2023}
}

Please cite our paper if you find the paper and the code helpful.


genread's Issues

Practical Application

Let's consider an example: "I want to know how much the tuition fee for the Computer Science department at New York University is in 2024."

In this case, the LLM that replaces the retriever cannot generate documents containing the information I need, so the reader cannot answer my question. How should I solve this?

Thank you for your attention.

Reproduce table 1

Dear author,
Thank you for your outstanding work! I would like to reproduce the experiments in Table 1. Could you guide me on how to train DPR on the FEVER dataset? The KILT knowledge source is vast, and the official repository does not provide a DPR checkpoint trained on KILT FEVER.
Also, could you share details about "Contriever + InstructGPT"? Did you simply load the contriever checkpoint released at https://github.com/facebookresearch/contriever?
Thank you in advance for your assistance.

Reproducing table 2

Dear Authors,
Hello, I am a graduate student studying information retrieval from South Korea.
First of all, thank you for sharing your great work.

I am facing difficulty in reproducing the experimental results on NQ data.

I will try to be as brief as possible.

Model used:
M1) GenRead-3B-NQ,

Contexts used:
C1) supervised:clustering (Recall@10: 71.3)
C2) DPR (FiD-distil) (The one provided by the FiD authors from here (https://github.com/facebookresearch/FiD)) (Recall@10: 80.3)
Note that I fixed the number of used documents to 10. (--n_contexts argument)

Since we have 1 model and 2 contexts, there are 2 possible combinations. i.e. M1+C1, M1+C2.

The commands I cloned and ran for the FiD repo are shown below.
python test_reader.py --model_path {model_path} --eval_data {test_json_path} --per_gpu_batch_size 1 --n_context 10

M1+C1 is reported as 45.6 in table 2, but my experiment came up with 46.2. This seems like a reasonable margin of error.

However, M1+C2 (similar to row 5 in Table 2) came out to be 41.3, which is very different from the reported 50.1.

In summary, FiD-xl produces the same result as the paper when used with generated documents, but the result is very different when used with retrieved documents.
Do you have any suggestions on what I'm doing wrong?

Best regards,
Eunseong

Some questions about the proposed clustering-based prompts.

Dear Authors

Thanks for sharing the great work.

For the proposed clustering-based prompts, I have a few questions:
Q1: Step 1 obtains one initial document per question. Why is this step needed? It does not seem to increase diversity.
Q2: The Step 2 title says "encoding documents", but the paragraph says "encode question-document pair"; what do you actually encode?
Q3: Step 3 samples K question-document pairs from the K clusters. How do you ensure that each document contains information relevant to its question?
Q4: Is the intuition of this method that, as in in-context learning, providing a prompt example leads the LLM to generate diverse documents, so that given the same instruction the LLM produces documents with high diversity?

thanks for your attention!

Best,
Dayu

Code for DPR+InstructGPT

Hello Wenhao:

Thanks for the great work and I really enjoy the insight you provide in the paper! I am wondering if there is any code snippet that could help me quickly reproduce the results of DPR+InstructGPT in zero-shot setting? I assume the prompt should be the same as in inprompts/regular.jsonl except for the background document?

About Evaluation metrics

Hi, thanks for your work! After downloading the results you provided for the zero-shot setting, I got the following results (a little higher than reported):
[results screenshot]

I just want to know whether this is an improved version, or whether something is wrong with my installed Python packages? Thanks!
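Small differences like this often come down to answer normalization before exact-match scoring. The repo's exact evaluation code is not shown here, but open-domain QA on NQ/TriviaQA/WebQ conventionally uses SQuAD-style exact match, which could be sketched as (the function names are ours):

```python
import re
import string

def normalize_answer(s):
    # Conventional SQuAD-style normalization: lowercase, drop
    # punctuation, drop articles, collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, golds):
    # 1.0 if the normalized prediction matches any normalized gold answer.
    norm = normalize_answer(prediction)
    return float(any(norm == normalize_answer(g) for g in golds))
```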

Request for Access to Full Dataset for Each Step

I recently read your paper and was impressed by your work. I am interested in using your dataset for my own research and would like to request access to the full dataset for each step.

I noticed you have already released a test dataset, but I am interested in analyzing the full dataset for each step to save money. Would it be possible to obtain access to this data?

If you have already released the full dataset and I missed it, please let me know where I can find it. If the dataset is not available, I would appreciate any information you can provide about your plans to release it in the future.
