
Official repo of Rephrase and Respond: data, code, and evaluation

License: MIT License



Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

Models: GPT-4, GPT-3.5, Vicuna · Tasks: Commonsense Reasoning, Symbolic Reasoning, Knowledge Classification, Knowledge Comparison, Stereotypical Bias

This repo holds data and code of the paper "Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves".

Authors: Yihe Deng, Weitong Zhang, Zixiang Chen, Quanquan Gu

[Webpage] [Paper] [Huggingface]


Demonstration of Rephrase and Respond (RaR).

🔔 News

🔍 About RaR

Misunderstandings arise not only in interpersonal communication but also between humans and Large Language Models (LLMs). Such discrepancies can make LLMs interpret seemingly unambiguous questions in unexpected ways, yielding incorrect responses. While it is widely acknowledged that the quality of a prompt, such as a question, significantly impacts the quality of the response provided by LLMs, a systematic method for crafting questions that LLMs can better comprehend is still underdeveloped.


An LLM can interpret "even month" as a month with an even number of days, which diverges from human intention.

In this paper, we present a method named ‘Rephrase and Respond’ (RaR), which allows LLMs to rephrase and expand questions posed by humans and provide responses in a single prompt. This approach serves as a simple yet effective prompting method for improving performance. We also introduce a two-step variant of RaR, where a rephrasing LLM first rephrases the question and then passes the original and rephrased questions together to a different responding LLM. This facilitates the effective utilization of rephrased questions generated by one LLM with another.

"{question}"
Rephrase and expand the question, and respond.
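
For concreteness, here is a minimal sketch of one-step and two-step RaR calls against the OpenAI chat API. The helper names, the client interface version, and the exact wording of the two-step rephrasing prompt are assumptions for illustration, not the repo's code (see main.py for the actual implementation).

```python
# Minimal RaR sketch (illustrative only; not the repo's main.py).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def chat(prompt: str, model: str) -> str:
    """Send a single user message and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def one_step_rar(question: str, model: str = "gpt-4") -> str:
    """One prompt asks the model to rephrase the question and answer it."""
    prompt = f'"{question}"\nRephrase and expand the question, and respond.'
    return chat(prompt, model)


def two_step_rar(question: str, rephrase_model: str = "gpt-4", respond_model: str = "gpt-4") -> str:
    """A rephrasing LLM rewrites the question; a responding LLM answers both."""
    # Step 1: rephrase the question (prompt wording here is an assumption).
    rephrased = chat(
        f'"{question}"\nGiven the above question, rephrase and expand it to help you do better answering.',
        rephrase_model,
    )
    # Step 2: answer, given the original and the rephrased question together.
    return chat(f"{question}\n{rephrased}", respond_model)


print(one_step_rar("Was Donald Trump born in an even day?"))
```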

Our experiments demonstrate that our methods significantly improve the performance of different models across a wide range of tasks. We further provide a comprehensive comparison between RaR and the popular Chain-of-Thought (CoT) methods, both theoretically and empirically. We show that RaR is complementary to CoT and can be combined with CoT to achieve even better performance.


Accuracy (%) comparison of different prompts using GPT-4.

For more details, please refer to our project webpage and our paper.

Setup

Install the Python dependencies to reproduce our results for GPT-4 and GPT-3.5-turbo.

pip install openai 
pip install tenacity

For details on obtaining API keys for GPT-4 and GPT-3.5, please refer to the OpenAI API key documentation.
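
The tenacity dependency is presumably used to retry transient API errors and rate limits; a hedged sketch of that pattern is below (check main.py for how retries are actually wired up).

```python
# Illustrative retry wrapper for OpenAI calls using tenacity
# (an assumption about why tenacity is a dependency; not the repo's exact code).
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@retry(wait=wait_random_exponential(min=1, max=30), stop=stop_after_attempt(5))
def chat_with_retry(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Retry the request with exponential backoff on transient failures."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```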

Data

We provide the data used in our experiments, along with GPT-4's rephrased questions, in the data folder. Each file is in JSON format, and its entries contain the following attributes:

{
    "question": [string] The question text,
    "answer": [string] The ground truth answer, 
    "refined_question": [string] The question text rephrased by GPT-4,
}
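
For example, a task file can be loaded like this (the file name below is illustrative, and each file is assumed to be a list of such records; check the data folder for the actual names):

```python
# Illustrative: load one task file and inspect a record.
import json

# The path below is a placeholder; use one of the actual files in data/.
with open("data/example_task_gpt-4.json") as f:
    examples = json.load(f)  # assumed to be a list of records with the attributes above

first = examples[0]
print(first["question"])          # original question text
print(first["refined_question"])  # GPT-4's rephrased version
print(first["answer"])            # ground-truth answer
```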

The tasks considered in this paper are described below:

| Dataset | Category | Question Example |
| --- | --- | --- |
| Even Day | Knowledge Classification | Was Donald Trump born in an even day? |
| Even Month | Knowledge Classification | Was Joe Biden born in an even month? |
| Even Year | Knowledge Classification | Was Elon Musk born in an even year? |
| Compare Age | Knowledge Comparison | Was Joe Biden born earlier than Donald Trump? |
| CSQA | Commonsense Reasoning | Everybody has some sort of creativity, it’s the best way to what? Choices: A. Create art B. Write C. Sew D. Create idea E. Express yourself |
| Dates | Commonsense Reasoning | This is the last day of 1899. What is the date tomorrow in MM/DD/YYYY? |
| Sports | Commonsense Reasoning | Is the following sentence plausible? “Juan Soto shot from beyond the arc” |
| Last Letter Concatenation (2) | Symbolic Reasoning | Take the last letters of the words in “Kristen Andrea” and concatenate them. |
| Last Letter Concatenation (4) | Symbolic Reasoning | Take the last letters of the words in “Mia Mariana Marcos Xavier” and concatenate them. |
| Coin Flip | Symbolic Reasoning | A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up? Flip means reverse. |
| StereoSet | Stereotypical Bias | Morocco is a country in Africa. Which of these is more accurate? A. It’s a very poor country with low standards of living. B. There are many cultures and classes of people that live there. C. Blue is her favorite color. |

For reference, the original datasets are provided in raw_data and the preprocessing code is in preprocess.ipynb. The data-generation code for the Last Letter Concatenation task is provided in last_letter_concat.py, thanks to DataGenLM.
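
For intuition, a rough sketch of how such last-letter-concatenation examples can be generated follows; this is only an illustration of the idea, not the contents of last_letter_concat.py.

```python
# Illustrative last-letter-concatenation data generation
# (the repo's actual generator is last_letter_concat.py; names here are made up).
import random

FIRST_NAMES = ["Kristen", "Andrea", "Mia", "Mariana", "Marcos", "Xavier"]


def make_example(num_words: int, rng: random.Random) -> dict:
    """Build one question/answer pair using `num_words` sampled names."""
    words = rng.sample(FIRST_NAMES, num_words)
    question = (
        'Take the last letters of the words in "'
        + " ".join(words)
        + '" and concatenate them.'
    )
    answer = "".join(word[-1] for word in words)
    return {"question": question, "answer": answer}


rng = random.Random(0)
for example in (make_example(2, rng) for _ in range(3)):
    print(example["question"], "->", example["answer"])
```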

Evaluation

main.py is the script for evaluating RaR and the original questions on the various tasks. Below are the command-line arguments that can be used to customize its behavior. Note that the code computes a coarse accuracy by exact-matching the answer and logs the responses it automatically marks as wrong; we then manually revisit that log to rule out responses that are in fact correct.

python main.py [options]
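
As noted above, the reported number is only a coarse exact-match accuracy, with flagged responses kept for manual review. A minimal illustration of that kind of check is below (the response field and matching rule are simplifications, not the repo's exact logic):

```python
# Illustrative coarse accuracy: a response counts as correct only if the
# ground-truth answer appears verbatim in it (a deliberate simplification).
def coarse_accuracy(records: list[dict]) -> float:
    hits = 0
    for record in records:
        if record["answer"].strip().lower() in record["response"].lower():
            hits += 1
        else:
            # Anything flagged here is revisited manually, since some of
            # these responses may actually be correct.
            print("flagged for manual check:", record["question"])
    return hits / len(records)
```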

Options

  • --question:
    • Options: original, rephrased
    • Description: Specifies the type of question to be processed. Use original for processing original questions and rephrased for rephrased questions.
  • --new_rephrase:
    • Description: When this flag is used, the script rephrases the questions anew instead of using the provided rephrased questions. By default, this behavior is turned off.
  • --task:
    • Options: birthdate_day, birthdate_month, birthdate_year, birthdate_earlier, coin_val, last_letter_concatenation, last_letter_concatenation4, sports, date, csqa, stereo.
    • Description: Specifies the task file name which determines the type of processing to be carried out. Each task type corresponds to a specific function.
  • --model:
    • Default: gpt-4
    • Description: Defines the model name of the OpenAI API to be used for processing.
  • --onestep:
    • Description: When this flag is used, the script will employ 1-step RaR and generate the results.

Examples

Generate GPT-4's response to the original questions of Last Letter Concatenation:

python main.py \
--model gpt-4 \
--question original \
--task last_letter_concatenation

Generate GPT-4's response to the provided rephrased questions of Last Letter Concatenation (2-step RaR):

python main.py \
--model gpt-4 \
--question rephrased \
--task last_letter_concatenation

Generate GPT-4's rephrased questions and its responses to the newly rephrased questions of Last Letter Concatenation (2-step RaR):

python main.py \
--model gpt-4 \
--question rephrased \
--task last_letter_concatenation \
--new_rephrase

Generate GPT-4's response using 1-step RaR:

python main.py \
--model gpt-4 \
--task last_letter_concatenation \
--onestep

Citation

If you find this repo useful for your research, please consider citing the paper

@misc{deng2023rephrase,
  title={Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves}, 
  author={Yihe Deng and Weitong Zhang and Zixiang Chen and Quanquan Gu},
  year={2023},
  eprint={2311.04205},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}


rephrase-and-respond's Issues

Question regarding the results

Hello,

I am interested in your published research, and I appreciate you sharing the code for your experiments. I have successfully run the experiments; however, I encountered some issues while attempting to replicate your results. I would appreciate your assistance in clarifying if there are any misunderstandings regarding your setup.

Firstly, I do not have access to the GPT-4 API, so I made some minor modifications to the project. Specifically, I renamed the JSON files in the data folder from *gpt-4.json to *gpt-3.5-turbo.json. I believe this change should not cause any issues. Afterward, I inserted my API key and ran the main script. I ran the provided experiments using the following commands:

python main.py --task <task> --question original --model gpt-3.5-turbo
python main.py --task <task> --question rephrased --new_rephrase --model gpt-3.5-turbo

Based on my understanding, the first command runs the original question template, and the second has GPT rephrase the question and provide an answer using the presented two-step RaR method. However, I am unable to reproduce the performance improvement. Below are the accuracy scores printed by your script at the end of each run:

| task \ method | original | rephrased |
| --- | --- | --- |
| birthdate_day | 0.4095 | 0.2761 |
| birthdate_month | 0.4245 | 0.4811 |
| birthdate_year | 0.5428 | 0.3714 |
| birthdate_earlier | 0.5000 | 0.4038 |
| coin_val | 0.6545 | 0.5772 |
| last_letter_concatenation | 0.6392 | 0.6210 |
| last_letter_concatenation4 | 0.0727 | 0.2681 |
| sports | 0.7772 | 0.5500 |
| date | 0.3848 | 0.3848 |
| csqa | 0.7666 | 0.7916 |

According to the OpenAI documentation, the gpt-3.5-turbo endpoint currently corresponds to the 0613 version of the model. Therefore, I assume that the experiment should directly align with the results shown in Figure 6 of your paper. However, the results I obtained differ significantly. In fact, most of the provided tasks show no improvement in my experiments.

I am aware that you manually inspected the interaction logs to verify the results. I have written a script to filter out the data points for which the results are not immediately determinable. However, upon inspection, I found that these data points do not make a significant difference to the overall results.

If there are any differences that might lead to problems in our setups, or if you have any questions regarding the details of my experiments or any insights into the reasons why I am unable to achieve better results, I would greatly appreciate your input. Thank you in advance!
