
gooaq's Introduction

GooAQ 🥑: Google Answers to Google Questions!

This repository contains the code/data accompanying our recent work on long-form question answering.

NOTE: This dataset should not be used for any commercial purposes. See the license for the detailed terms.

Data

To get the data, see the data/ directory. The data is stored via git-lfs, so if you clone the project (git clone git@github.com:allenai/gooaq.git), make sure to also run git lfs pull.

Each row of the data file should look like this:

{
 "id": 3339543,
 "question": "what is the difference between collagen and whey protein?",
 "short_answer": null,
 "answer": "The main differences between the amino acid profiles of whey and collagen are that whey contains all 9 essential amino acids, while collagen only has 8. ... Collagen is a fibrous protein found in the skin, cartilage, and bones of animals whereas whey comes from milk.",
 "answer_type": "feat_snip",
 "answer_url": "https://frogfuel.com/blogs/news/collagen-protein-vs-whey-protein#:~:text=The%20main%20differences%20between%20the,while%20collagen%20only%20has%208.&text=Collagen%20is%20a%20fibrous%20protein,whereas%20whey%20comes%20from%20milk."
}

where the questions (question) were collected via Google auto-complete.
The answer responses (short_answer and answer) were collected from Google's answer boxes. The answer types (answer_type) are inferred from the HTML content of Google's response. answer_url indicates the URL from which the answer was extracted, whenever we found one (typically for feat_snip questions). Here are the dominant types in the current dataset:

  • feat_snip: explanatory responses; the majority of the questions/responses are of this type.
  • collection: list responses (e.g., steps to accomplish something).
  • knowledge: typically short responses for knowledge seeking questions.
  • unit_conv: questions about converting units.
  • time_conv: questions about converting times.
  • curr_conv: questions about converting currencies.
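
As a rough illustration of how the answer types can be tallied, here is a minimal Python sketch. It assumes the data file is in JSON Lines form (one JSON object per line), which this README suggests but does not spell out. The helper names read_jsonl and count_answer_types are hypothetical and are not part of this repository.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSON Lines input (assumed format)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_answer_types(rows: Iterable[dict]) -> Counter:
    """Count rows per answer_type (feat_snip, collection, knowledge, ...)."""
    return Counter(row["answer_type"] for row in rows)


# Two rows that appear as examples elsewhere in this README, one per line.
sample = [
    '{"id": 3547745, "question": "what is the symbol for the element aluminum?", '
    '"short_answer": "Al", "answer": null, "answer_type": "knowledge", "answer_url": null}',
    '{"id": 1032187, "question": "exajoule is how many joules?", '
    '"short_answer": "1e+18 Joule", "answer": null, "answer_type": "unit_conv", "answer_url": null}',
]
counts = count_answer_types(read_jsonl(sample))
print(counts)
```

Because read_jsonl accepts any iterable of lines, an open file handle for the data file can be passed in place of the sample list.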

Here are several more examples from the data:

{
 "id": 5009708,
 "question": "carbon dioxide comprises approximately what percentage of tropospheric gases?",
 "short_answer": "04%",
 "answer": "Carbon dioxide comprise approximately . 04% of tropospheric gases.",
 "answer_type": "feat_snip",
 "answer_url": "https://www.coursehero.com/file/19578051/EnvironmentalScience/"
}
{
 "id": 8317711,
 "question": "what is the distance between uranus and earth?",
 "short_answer": "1.7858 billion mi",
 "answer": null,
 "answer_type": "knowledge",
 "answer_url": null
}
{
 "id": 3547745,
 "question": "what is the symbol for the element aluminum?",
 "short_answer": "Al",
 "answer": null,
 "answer_type": "knowledge",
 "answer_url": null
}
{
 "id": 3552841,
 "question": "what is the volume of a 12 oz can?",
 "short_answer": "340.957",
 "answer": null,
 "answer_type": "unit_conv",
 "answer_url": null
}
{
 "id": 1032187,
 "question": "exajoule is how many joules?",
 "short_answer": "1e+18 Joule",
 "answer": null,
 "answer_type": "unit_conv",
 "answer_url": null
}
{
 "id": 610247,
 "question": "are words that start with e?",
 "short_answer": null,
 "answer": "['eager.', 'eagle.', 'eagre.', 'eared.', 'earls.', 'early.', 'earns.', 'earth.']",
 "answer_type": "collection",
 "answer_url": null
}
{
 "id": 1309258,
 "question": "how long does it take to boil a hard egg?",
 "short_answer": null,
 "answer": "['Place your eggs in a single layer on the bottom of your pot and cover with cold water. ... ', 'Over high heat, bring your eggs to a rolling boil.', 'Remove from heat and let stand in water for 10-12 minutes for large eggs. ... ', 'Drain water and immediately run cold water over eggs until cooled.']",
 "answer_type": "collection",
 "answer_url": null
}
{
 "id": 2518757,
 "question": "is ways to lose weight?",
 "short_answer": null,
 "answer": "['Trying intermittent fasting. ... ', 'Tracking your diet and exercise. ... ', 'Eating mindfully. ... ', 'Eating protein for breakfast. ... ', 'Cutting back on sugar and refined carbohydrates. ... ', 'Eating plenty of fiber. ... ', 'Balancing gut bacteria. ... ', \"Getting a good night's sleep.\"]",
 "answer_type": "collection",
 "answer_url": null
}
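
In the examples above, the answer field of collection rows looks like a Python list literal stored inside a JSON string. The sketch below shows one way to turn it back into a list, assuming that observation holds for all collection rows (the README does not state this). The helpers collection_items and strip_truncation are hypothetical names, not part of this repository.

```python
import ast
import json
from typing import List, Optional


def strip_truncation(item: str) -> str:
    """Remove surrounding whitespace and a trailing '...' truncation marker."""
    item = item.strip()
    if item.endswith("..."):
        item = item[:-3].rstrip()
    return item


def collection_items(row: dict) -> Optional[List[str]]:
    """Return the list items of a collection row, or None for any other row."""
    if row.get("answer_type") != "collection" or row.get("answer") is None:
        return None
    # literal_eval only parses Python literals; it does not execute code.
    items = ast.literal_eval(row["answer"])
    return [strip_truncation(item) for item in items]


# One of the collection examples shown above, parsed from its JSON form.
row = json.loads(
    r'''{"id": 610247, "question": "are words that start with e?",
         "short_answer": null,
         "answer": "['eager.', 'eagle.', 'eagre.', 'eared.', 'earls.', 'early.', 'earns.', 'earth.']",
         "answer_type": "collection", "answer_url": null}'''
)
items = collection_items(row)
print(items)
```

Stripping the truncation marker is optional; drop the strip_truncation call to keep the items exactly as stored.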

Reproducing the splits

The split.json file contains the IDs in each split (test, train and dev). For each instance in the training split, there is also a similarity score (similarity to the instances in test + dev), which was used when sampling smaller training sets. To reproduce our splits, run the create_splits.py script.
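
To show how split IDs can be applied to the data rows, here is a minimal sketch. It is not the logic of create_splits.py, and it does not assume any particular layout for split.json: it takes a mapping from split name to a set of IDs, which you would build from split.json yourself. The function name partition_by_split and the ID-to-split assignments in the example are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def partition_by_split(
    rows: Iterable[dict], split_ids: Dict[str, Set[int]]
) -> Dict[str, List[dict]]:
    """Group rows by the first split whose ID set contains row["id"]."""
    parts: Dict[str, List[dict]] = {name: [] for name in split_ids}
    for row in rows:
        for name, ids in split_ids.items():
            if row["id"] in ids:
                parts[name].append(row)
                break  # rows whose ID is in no split are skipped
    return parts


# Hypothetical assignments for illustration; real ID sets come from split.json.
split_ids = {"train": {3339543, 5009708}, "dev": {8317711}, "test": {3547745}}
rows = [{"id": 3339543}, {"id": 8317711}, {"id": 3547745}, {"id": 42}]
parts = partition_by_split(rows, split_ids)
print({name: [r["id"] for r in group] for name, group in parts.items()})
```

Using sets for the ID lookup keeps each membership check constant time, which matters when the data file has millions of rows.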

Baselines

For the scripts for reproducing our T5 baselines, see the experiments/ directory.

Reproducing Human Evaluation

TBD

Question/Answer Extraction Scripts

See this directory, which contains two sub-folders:
(1) the question extraction script
(2) the answer extraction scripts

More reading

See the following paper:

@article{gooaq2021,
  title={GooAQ: Open Question Answering with Diverse Answer Types},
  author={Khashabi, Daniel and Ng, Amos and Khot, Tushar and Sabharwal, Ashish and Hajishirzi, Hannaneh and Callison-Burch, Chris},
  journal={arXiv preprint},
  year={2021}
}


gooaq's Issues

answer_url is null

answer_url is null for all rows with the "collection" answer type in qoogle.jsonl. Is it possible to fix the data?

How to adapt run_pretraining.sh for a different set of training data?

Hi guys, first of all great job on the model! I would like to adapt the model to a different type of domain data. I am having a hard time understanding how to adapt run_pretraining.sh to my local machine. Could you include a description of where the script is getting its data (where is gs://t5-data/pretrained_models/ ?) and how to configure the parameters (e.g. what is in dataset.gin ?) for a different project? Is the model pretrained on all of those datasets (see screenshot) before it's finetuned on the custom google datasets? Or what is their purpose?

Is answer_url released in the data?

At a glance, it appears that the current data/*jsonl file lacks the answer_url key.

Update: Looks like I downloaded the data before the release. Closing.
