
Evaluating Open-QA Evaluation

Data Version · License: Apache-2.0 · Contributions welcome


🚀 Introduction

Welcome to the GitHub repository for our paper "Evaluating Open-QA Evaluation", a comprehensive study of how evaluation methods for Open Question Answering (Open-QA) systems should themselves be evaluated.

Open-QA systems, which generate answers to questions spanning a vast range of topics, have become an increasingly significant research area in recent years. However, accurately evaluating these systems remains challenging, and the field currently lacks robust, reliable evaluation methods.

In response, we introduce the QA Evaluation task (QA-Eval), a new task that rigorously tests evaluation methods on their ability to assess whether machine-generated answers match a set of gold-standard answers in an Open-QA context. The task requires an evaluation method to judge whether a machine-generated answer aligns with the gold-standard answers, and its performance is measured against human-annotated results.

We sourced our data from the test sets of two well-established QA datasets, Natural Questions (NQ) and TriviaQA. We asked several representative models, including FiD, GPT-3.5, ChatGPT (3.5 and 4), and Bing Chat, to answer the questions, and then manually annotated the correctness of each question-answer pair.

Through this work, we hope to foster a deeper understanding of Open-QA systems and their evaluation, and to aid the research community in developing more reliable automatic evaluation tools.

๐Ÿ“ Dataset

Data Description

Each data point in our dataset is represented as a dictionary with the following keys:

  "question": The question asked in the Open-QA task.
  "golden_answer": The gold-standard answers to the question, separated by '/'.
  "answer_fid", "answer_gpt35", "answer_chatgpt", "answer_gpt4", "answer_newbing": The answers generated by the different models (FiD, GPT-3.5, ChatGPT-3.5, GPT-4, and New Bing, respectively).
  "judge_fid", "judge_gpt35", "judge_chatgpt", "judge_gpt4", "judge_newbing": Boolean values indicating whether the corresponding model's answer was judged correct (True) or incorrect (False) by human annotators.
  "improper": Boolean flag indicating whether the question itself was improper (True) or proper (False).

Here is an example of a data point:

{
  "question": "who got the first nobel prize in physics",
  "golden_answer": "Wilhelm Conrad R\u00f6ntgen",
  "answer_fid": "Wilhelm R\u00f6ntgen",
  "judge_fid": true,
  "answer_gpt35": "The first Nobel Prize in Physics was awarded to Wilhelm R\u00f6ntgen in 1901.",
  "judge_gpt35": true,
  "answer_chatgpt": "The first Nobel Prize in Physics was awarded in 1901 to Wilhelm R\u00f6ntgen for his discovery of X-rays.",
  "judge_chatgpt": true,
  "answer_gpt4": "The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad R\u00f6ntgen, a German physicist. He received the prize for his discovery of X-rays, a groundbreaking achievement that revolutionized the fields of medicine, physics, and chemistry.",
  "judge_gpt4": true,
  "answer_newbing": "According to Wikipedia,Wilhelm Conrad R\u00f6ntgen of Germany got the first Nobel Prize in Physics in 1901 for his discovery of X-rays.  He received 150,782 SEK (Swedish krona) as the prize money.",
  "judge_newbing": true,
  "improper": false
}
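Since "golden_answer" packs multiple acceptable answers separated by '/', a small helper can split it before comparison. The sketch below uses the data point shown above; `lexical_match` is only an illustrative substring heuristic, not the paper's evaluation method — note that it disagrees with the human label on this very example, which is the kind of mismatch QA-Eval studies:

```python
# Example data point copied from the README (abbreviated to one model).
point = {
    "question": "who got the first nobel prize in physics",
    "golden_answer": "Wilhelm Conrad R\u00f6ntgen",
    "answer_fid": "Wilhelm R\u00f6ntgen",
    "judge_fid": True,
}

def gold_answers(point):
    """Split the '/'-separated golden_answer field into a list."""
    return [a.strip() for a in point["golden_answer"].split("/")]

def lexical_match(answer, golds):
    """Naive check: does any gold answer appear verbatim in the model answer?"""
    answer = answer.lower()
    return any(g.lower() in answer for g in golds)
```

Here the human judge marks "Wilhelm R\u00f6ntgen" correct (`judge_fid` is True), while the substring heuristic returns False because the full gold string "Wilhelm Conrad R\u00f6ntgen" does not occur in the answer.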

Data Scale

The scale of our dataset is detailed in the table below:

| Model       | Natural Questions | TriviaQA |
| ----------- | ----------------- | -------- |
| DPR+FiD     | 3610              | 2000     |
| GPT-3.5     | 3610              | 2000     |
| ChatGPT-3.5 | 3610              | 2000     |
| ChatGPT-4   | 3610              | 2000     |
| Bing Chat   | 3610              | 2000     |

📜 License

This dataset is released under the Apache-2.0 License.

📚 Citation

If you use this dataset in your research, please cite it as follows:

@inproceedings{
  wang2023evaluating,
  title={Evaluating Open-{QA} Evaluation},
  author={Cunxiang Wang and Sirui Cheng and Qipeng Guo and Yuanhao Yue and Bowen Ding and Zhikun Xu and Yidong Wang and Xiangkun Hu and Zheng Zhang and Yue Zhang},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023},
  url={https://openreview.net/forum?id=UErNpveP6R}
}

📮 Contact

We welcome contributions to improve this dataset! If you have any questions or feedback, please feel free to reach out at [email protected].


Issues

Why are there so many "null" values in the dataset?

As the title says, when I load the dataset and try to reproduce the results from your paper, I find that many labels are "null".

{
 "question": "the south west wind blows across nigeria between",
 "golden_answer": "till September",
 "answer_fid": "February",
 "judge_fid": null,
 "answer_gpt35": "the months of\n\nNovember and March.",
 "judge_gpt35": null,
 "answer_chatgpt": "the months of May and September. This wind is known as the \"Harmattan\" and is characterized by dry and dusty conditions. It originates from the Sahara desert and brings with it a haze that reduces visibility and can cause respiratory problems. Despite these challenges, the Harmattan also brings cooler temperatures, making it a welcome relief from the hot and humid conditions of the rest of the year.",
 "judge_chatgpt": true,
 "answer_gpt4": "The southwest wind, also known as the Southwest Monsoon or the Southwesterlies, typically blows across Nigeria between June and September, bringing rain and humidity.",
 "judge_gpt4": null,
 "answer_newbing": "nan.",
 "judge_newbing": null,
 "improper": true
},

Is it a mistake?
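When scoring models against the human labels, entries flagged as improper and entries whose judge_* field is null (loaded as None by Python's json module) need to be filtered out first, as in the example above. A minimal sketch, assuming this filtering convention (`model_accuracy` is an illustrative helper, not part of any released code):

```python
def model_accuracy(data, model):
    """Accuracy of one model against the human judgments, skipping
    improper questions and unannotated (null/None) judgments."""
    key = f"judge_{model}"
    judged = [d[key] for d in data
              if not d.get("improper") and d.get(key) is not None]
    return sum(judged) / len(judged) if judged else None

# Two toy data points in the dataset's format:
data = [
    {"judge_fid": True, "improper": False},
    {"judge_fid": None, "improper": True},   # skipped, like the entry above
]
assert model_accuracy(data, "fid") == 1.0
```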
