
Official repo of Rephrase and Respond: data, code, and evaluation

License: MIT License



Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves

Models: GPT-4, GPT-3.5, Vicuna · Tasks: Commonsense Reasoning, Symbolic Reasoning, Knowledge Classification, Knowledge Comparison, Stereotypical Bias

This repo holds data and code of the paper "Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves".

Authors: Yihe Deng, Weitong Zhang, Zixiang Chen, Quanquan Gu

[Webpage] [Paper] [Huggingface]


Demonstration of Rephrase and Respond (RaR).

🔔 News

🔍 About RaR

Misunderstandings arise not only in interpersonal communication but also between humans and Large Language Models (LLMs). Such discrepancies can make LLMs interpret seemingly unambiguous questions in unexpected ways, yielding incorrect responses. While it is widely acknowledged that the quality of a prompt, such as a question, significantly impacts the quality of the response provided by LLMs, a systematic method for crafting questions that LLMs can better comprehend is still underdeveloped.


An LLM can interpret "even month" as a month with an even number of days, which diverges from human intention.

In this paper, we present a method named ‘Rephrase and Respond’ (RaR), which allows LLMs to rephrase and expand questions posed by humans and provide responses in a single prompt. This approach serves as a simple yet effective prompting method for improving performance. We also introduce a two-step variant of RaR, where a rephrasing LLM first rephrases the question and then passes the original and rephrased questions together to a different responding LLM. This facilitates the effective utilization of rephrased questions generated by one LLM with another.

"{question}"
Rephrase and expand the question, and respond.
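
For concreteness, here is a minimal sketch of one-step and two-step RaR calls against the OpenAI chat API. The helper names, the client interface version, and the exact wording of the two-step rephrasing prompt are assumptions for illustration, not the repo's code (see main.py for the actual implementation).

```python
# Minimal RaR sketch (illustrative only; not the repo's main.py).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def chat(prompt: str, model: str) -> str:
    """Send a single user message and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def one_step_rar(question: str, model: str = "gpt-4") -> str:
    """One prompt asks the model to rephrase the question and answer it."""
    prompt = f'"{question}"\nRephrase and expand the question, and respond.'
    return chat(prompt, model)


def two_step_rar(question: str, rephrase_model: str = "gpt-4", respond_model: str = "gpt-4") -> str:
    """A rephrasing LLM rewrites the question; a responding LLM answers both."""
    # Step 1: rephrase the question (prompt wording here is an assumption).
    rephrased = chat(
        f'"{question}"\nGiven the above question, rephrase and expand it to help you do better answering.',
        rephrase_model,
    )
    # Step 2: answer, given the original and the rephrased question together.
    return chat(f"{question}\n{rephrased}", respond_model)


print(one_step_rar("Was Donald Trump born in an even day?"))
```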

Our experiments demonstrate that our methods significantly improve the performance of different models across a wide range of tasks. We further provide a comprehensive comparison between RaR and the popular Chain-of-Thought (CoT) methods, both theoretically and empirically. We show that RaR is complementary to CoT and can be combined with CoT to achieve even better performance.


Accuracy (%) comparison of different prompts using GPT-4.

For more details, please refer to our project webpage and our paper.

Setup

Install the Python dependencies to reproduce our results for GPT-4 and GPT-3.5-turbo.

pip install openai 
pip install tenacity

For details on obtaining API keys for GPT-4 and GPT-3.5, please refer to the OpenAI API key documentation.
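
The tenacity dependency is presumably used to retry transient API errors and rate limits; a hedged sketch of that pattern is below (check main.py for how retries are actually wired up).

```python
# Illustrative retry wrapper for OpenAI calls using tenacity
# (an assumption about why tenacity is a dependency; not the repo's exact code).
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_random_exponential

client = OpenAI()  # reads OPENAI_API_KEY from the environment


@retry(wait=wait_random_exponential(min=1, max=30), stop=stop_after_attempt(5))
def chat_with_retry(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Retry the request with exponential backoff on transient failures."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```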

Data

We provide the data used in our experiments, along with GPT-4's rephrased questions, in the data folder. Each file is in JSON format, and its entries contain the following attributes:

{
    "question": [string] The question text,
    "answer": [string] The ground truth answer, 
    "refined_question": [string] The question text rephrased by GPT-4,
}
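
For example, a task file can be loaded like this (the file name below is illustrative, and each file is assumed to be a list of such records; check the data folder for the actual names):

```python
# Illustrative: load one task file and inspect a record.
import json

# The path below is a placeholder; use one of the actual files in data/.
with open("data/example_task_gpt-4.json") as f:
    examples = json.load(f)  # assumed to be a list of records with the attributes above

first = examples[0]
print(first["question"])          # original question text
print(first["refined_question"])  # GPT-4's rephrased version
print(first["answer"])            # ground-truth answer
```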

The tasks considered in this paper are described below:

| Dataset | Category | Question Example |
| --- | --- | --- |
| Even Day | Knowledge Classification | Was Donald Trump born in an even day? |
| Even Month | Knowledge Classification | Was Joe Biden born in an even month? |
| Even Year | Knowledge Classification | Was Elon Musk born in an even year? |
| Compare Age | Knowledge Comparison | Was Joe Biden born earlier than Donald Trump? |
| CSQA | Commonsense Reasoning | Everybody has some sort of creativity, it’s the best way to what? Choices: A. Create art B. Write C. Sew D. Create idea E. Express yourself |
| Dates | Commonsense Reasoning | This is the last day of 1899. What is the date tomorrow in MM/DD/YYYY? |
| Sports | Commonsense Reasoning | Is the following sentence plausible? “Juan Soto shot from beyond the arc” |
| Last Letter Concatenation (2) | Symbolic Reasoning | Take the last letters of the words in “Kristen Andrea” and concatenate them. |
| Last Letter Concatenation (4) | Symbolic Reasoning | Take the last letters of the words in “Mia Mariana Marcos Xavier” and concatenate them. |
| Coin Flip | Symbolic Reasoning | A coin is heads up. aluino flips the coin. arthor flips the coin. Is the coin still heads up? Flip means reverse. |
| StereoSet | Stereotypical Bias | Morocco is a country in Africa. Which of these is more accurate? A. It’s a very poor country with low standards of living. B. There are many cultures and classes of people that live there. C. Blue is her favorite color. |

For reference, the original datasets are provided in raw_data and the preprocessing code is in preprocess.ipynb. The data-generation code for the Last Letter Concatenation task is provided in last_letter_concat.py, thanks to DataGenLM.
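
For intuition, a rough sketch of how such last-letter-concatenation examples can be generated follows; this is only an illustration of the idea, not the contents of last_letter_concat.py.

```python
# Illustrative last-letter-concatenation data generation
# (the repo's actual generator is last_letter_concat.py; names here are made up).
import random

FIRST_NAMES = ["Kristen", "Andrea", "Mia", "Mariana", "Marcos", "Xavier"]


def make_example(num_words: int, rng: random.Random) -> dict:
    """Build one question/answer pair using `num_words` sampled names."""
    words = rng.sample(FIRST_NAMES, num_words)
    question = (
        'Take the last letters of the words in "'
        + " ".join(words)
        + '" and concatenate them.'
    )
    answer = "".join(word[-1] for word in words)
    return {"question": question, "answer": answer}


rng = random.Random(0)
for example in (make_example(2, rng) for _ in range(3)):
    print(example["question"], "->", example["answer"])
```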

Evaluation

main.py is the script for evaluating RaR and the original questions on the various tasks. Below are the command-line arguments that can be used to customize its behavior. Note that the code computes a coarse accuracy by exact-matching the answer and logs the responses it automatically marks as wrong; we then manually revisit that log to rule out responses that are in fact correct.

python main.py [options]
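
As noted above, the reported number is only a coarse exact-match accuracy, with flagged responses kept for manual review. A minimal illustration of that kind of check is below (the response field and matching rule are simplifications, not the repo's exact logic):

```python
# Illustrative coarse accuracy: a response counts as correct only if the
# ground-truth answer appears verbatim in it (a deliberate simplification).
def coarse_accuracy(records: list[dict]) -> float:
    hits = 0
    for record in records:
        if record["answer"].strip().lower() in record["response"].lower():
            hits += 1
        else:
            # Anything flagged here is revisited manually, since some of
            # these responses may actually be correct.
            print("flagged for manual check:", record["question"])
    return hits / len(records)
```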

Options

  • --question:
    • Options: original, rephrased
    • Description: Specifies the type of question to be processed. Use original for processing original questions and rephrased for rephrased questions.
  • --new_rephrase:
    • Description: When this flag is used, the script rephrases the questions anew instead of using the provided rephrased questions. By default, this behavior is turned off.
  • --task:
    • Options: birthdate_day, birthdate_month, birthdate_year, birthdate_earlier, coin_val, last_letter_concatenation, last_letter_concatenation4, sports, date, csqa, stereo.
    • Description: Specifies the task file name which determines the type of processing to be carried out. Each task type corresponds to a specific function.
  • --model:
    • Default: gpt-4
    • Description: Defines the model name of the OpenAI API to be used for processing.
  • --onestep:
    • Description: When this flag is used, the script will employ 1-step RaR and generate the results.

Examples

Generate GPT-4's response to the original questions of Last Letter Concatenation:

python main.py \
--model gpt-4 \
--question original \
--task last_letter_concatenation

Generate GPT-4's response to the provided rephrased questions of Last Letter Concatenation (2-step RaR):

python main.py \
--model gpt-4 \
--question rephrased \
--task last_letter_concatenation

Generate GPT-4's rephrased questions and its responses to the newly rephrased questions of Last Letter Concatenation (2-step RaR):

python main.py \
--model gpt-4 \
--question rephrased \
--task last_letter_concatenation \
--new_rephrase

Generate GPT-4's response using 1-step RaR:

python main.py \
--model gpt-4 \
--task last_letter_concatenation \
--onestep

Citation

If you find this repo useful for your research, please consider citing the paper

@misc{deng2023rephrase,
  title={Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves}, 
  author={Yihe Deng and Weitong Zhang and Zixiang Chen and Quanquan Gu},
  year={2023},
  eprint={2311.04205},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}


rephrase-and-respond's Issues

Question regarding the results

Hello,

I am interested in your published research, and I appreciate you sharing the code for your experiments. I have successfully run the experiments; however, I encountered some issues while attempting to replicate your results. I would appreciate your assistance in clarifying if there are any misunderstandings regarding your setup.

Firstly, I do not have access to the GPT-4 API, so I made some minor modifications to the project. Specifically, I renamed the JSON files in the data folder from *gpt-4.json to *gpt-3.5-turbo.json. I believe this change should not cause any issues. Afterward, I inserted my API key and ran the main script. I ran the provided experiments using the following commands:

python main.py --task <task> --question original --model gpt-3.5-turbo
python main.py --task <task> --question rephrased --new_rephrase --model gpt-3.5-turbo

Based on my understanding, the first command runs the original question template, and the second has GPT rephrase the question and provide an answer using the presented two-step RaR method. However, I am unable to reproduce the performance improvement. Below are the accuracy scores printed by your script at the end of each run:

| task \ method | original | rephrased |
| --- | --- | --- |
| birthdate_day | 0.4095 | 0.2761 |
| birthdate_month | 0.4245 | 0.4811 |
| birthdate_year | 0.5428 | 0.3714 |
| birthdate_earlier | 0.5000 | 0.4038 |
| coin_val | 0.6545 | 0.5772 |
| last_letter_concatenation | 0.6392 | 0.6210 |
| last_letter_concatenation4 | 0.0727 | 0.2681 |
| sports | 0.7772 | 0.5500 |
| date | 0.3848 | 0.3848 |
| csqa | 0.7666 | 0.7916 |

According to the OpenAI documentation, the gpt-3.5-turbo endpoint currently corresponds to the 0613 version of the model. Therefore, I assume that the experiment should directly align with the results shown in Figure 6 of your paper. However, the results I obtained differ significantly. In fact, most of the provided tasks show no improvement in my experiments.

I am aware that you manually inspected the interaction logs to verify the results. I have written a script to filter out the data points for which the results are not immediately determinable. However, upon inspection, I found that these data points do not make a significant difference to the overall results.

If there are any differences that might lead to problems in our setups, or if you have any questions regarding the details of my experiments or any insights into the reasons why I am unable to achieve better results, I would greatly appreciate your input. Thank you in advance!
