kelvin-jiang / freebaseqa

The release of the FreebaseQA data set (NAACL 2019).

License: Creative Commons Attribution 4.0 International

freebaseqa freebase kb-qa nlp-datasets question-answering naacl

freebaseqa's Introduction

FreebaseQA (v1.0): A Trivia-type QA Data Set over the Freebase Knowledge Graph

This repository contains FreebaseQA, a new data set for open-domain QA over the Freebase knowledge graph. The question-answer pairs in this data set are collected from various sources, including the TriviaQA data set (Joshi et al., 2017) and other trivia websites (QuizBalls, QuizZone, KnowQuiz), and are matched against Freebase to generate relevant subject-predicate-object triples that were further verified by human annotators. As all questions in FreebaseQA are composed independently for human contestants in various trivia-like competitions, this data set shows richer linguistic variation and complexity than existing QA data sets, making it a good test-bed for emerging KB-QA systems.

If you find this data set useful, please cite the paper:

[1] K. Jiang, D. Wu and H. Jiang, "FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase," Proc. of North American Chapter of the Association for Computational Linguistics (NAACL), June 2019.

All data is distributed under the CC-BY-4.0 license.

Data Set Files

This data set contains 28,348 unique questions that are divided into three subsets: train (20,358), dev (3,994) and eval (3,996), formatted as JSON files: FreebaseQA-[train|dev|eval].json.

We have also included FreebaseQA-partial.json, which is not officially part of FreebaseQA but may be useful for training models for certain NLP tasks such as named entity recognition and entity linking.

Each file is formatted as follows:

  • Dataset: The name of this data set
  • Version: The version of the FreebaseQA data set
  • Questions: The set of unique questions in this data set
    • Question-ID: The unique ID of each question
    • RawQuestion: The original question collected from data sources
    • ProcessedQuestion: The question after light processing, such as removing the trailing question mark and lowercasing
    • Parses: The semantic parse(s) for the question
      • Parse-Id: The ID of each semantic parse
      • PotentialTopicEntityMention: The potential topic entity mention in the question
      • TopicEntityName: The name or alias of the topic entity in the question from Freebase
      • TopicEntityMid: The Freebase MID of the topic entity in the question
      • InferentialChain: The path from the topic entity node to the answer node in Freebase, labeled as a predicate
      • Answers: The answer found from this parse
        • AnswersMid: The Freebase MID of the answer
        • AnswersName: The answer string from the original question-answer pair
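
As a minimal sketch of how the fields above fit together, the following Python snippet walks every parse of every question. The nesting is inferred from the field list; the sample record, its IDs and its answer values are hypothetical placeholders, and the helper names (`load_dataset`, `iter_parses`, `answer_names`) are not part of the release. Whether `AnswersName` holds a single string or a list of strings is not stated above, so the sketch accepts both.

```python
import json

# Hypothetical record shaped like the field list above; all values are placeholders.
sample = {
    "Dataset": "FreebaseQA-example",
    "Version": "1.0",
    "Questions": [
        {
            "Question-ID": "FreebaseQA-example-0",
            "RawQuestion": "Who produced the film 12 Angry Men?",
            "ProcessedQuestion": "who produced the film 12 angry men",
            "Parses": [
                {
                    "Parse-Id": "FreebaseQA-example-0.P0",
                    "PotentialTopicEntityMention": "12 angry men",
                    "TopicEntityName": "12 angry men",
                    "TopicEntityMid": "m.0m_tj",
                    "InferentialChain": "film.film.produced_by",
                    "Answers": [
                        {"AnswersMid": "m.placeholder",
                         "AnswersName": ["placeholder answer"]}
                    ],
                }
            ],
        }
    ],
}


def load_dataset(path):
    """Read one of the FreebaseQA-[train|dev|eval].json files."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_parses(dataset):
    """Yield (question ID, parse) pairs for every parse in the data set."""
    for question in dataset["Questions"]:
        for parse in question["Parses"]:
            yield question["Question-ID"], parse


def answer_names(parse):
    """Collect answer strings, accepting either a string or a list per answer."""
    names = []
    for answer in parse["Answers"]:
        value = answer["AnswersName"]
        names.extend([value] if isinstance(value, str) else value)
    return names


for question_id, parse in iter_parses(sample):
    print(question_id, parse["TopicEntityMid"], parse["InferentialChain"],
          answer_names(parse))
```

To run the same loop on the real data, replace `sample` with `load_dataset("FreebaseQA-train.json")`.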

Evaluation Metrics

Accuracy is used as the evaluation metric for this data set, i.e. a question is considered correct only if the predicted answer is exactly the same as one of the given answers.
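
The metric can be sketched as below. This is an illustration under stated assumptions, not an official scorer: the function name `accuracy`, the dictionary layout and the toy identifiers are hypothetical, and answers are compared as opaque strings (for example Freebase MIDs) with no normalization. Questions with no prediction count as incorrect.

```python
def accuracy(predictions, gold_answers):
    """Fraction of questions whose prediction exactly matches a gold answer.

    predictions:  dict mapping question ID -> one predicted answer
    gold_answers: dict mapping question ID -> set of acceptable answers
    """
    if not gold_answers:
        return 0.0
    correct = sum(
        1
        for question_id, gold in gold_answers.items()
        if predictions.get(question_id) in gold
    )
    return correct / len(gold_answers)


# Toy example with made-up identifiers.
gold = {"q1": {"m.a", "m.b"}, "q2": {"m.c"}}
predicted = {"q1": "m.b", "q2": "m.x"}
print(accuracy(predicted, gold))  # 0.5
```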

Freebase Extract

We have extracted a subset of Freebase (2.2GB zip) that includes all entities (16M) and triples (182M) relevant to the FreebaseQA questions. The subset can accompany the FreebaseQA data set in order to evaluate the accuracy of trained models in answering questions. The subset may be downloaded from the following link: https://www.dropbox.com/sh/a25p7j2ir8gqnvx/AABJvjoI9mbHYj3hyfuxSdGaa?dl=0
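
A rough sketch of how such an extract could be used to follow an InferentialChain from a topic entity is given below. It rests on two assumptions that this page does not confirm: that each line of the extract is a whitespace-separated subject, predicate and object, and that `..` in a chain joins two predicates through an intermediate node. The function names and the toy triples are hypothetical, and an in-memory index like this is only practical for small samples, not for all 182M triples.

```python
from collections import defaultdict


def build_index(lines):
    """Map (subject, predicate) -> set of objects from 'subject predicate object' lines."""
    index = defaultdict(set)
    for line in lines:
        parts = line.strip().split(None, 2)  # the object may contain spaces
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        subject, predicate, obj = parts
        index[(subject, predicate)].add(obj)
    return index


def follow_chain(index, topic_mid, chain):
    """Follow each predicate of the chain in turn, starting from the topic entity."""
    frontier = {topic_mid}
    for predicate in chain.split(".."):
        frontier = {
            obj
            for node in frontier
            for obj in index.get((node, predicate), ())
        }
    return frontier


# Toy triples with made-up MIDs and predicates.
toy_lines = [
    "m.film example.film.produced_by m.person",
    "m.games example.games.athletes m.node",
    "m.node example.affiliation.country m.country",
]
toy_index = build_index(toy_lines)
print(follow_chain(toy_index, "m.film", "example.film.produced_by"))
print(follow_chain(toy_index, "m.games",
                   "example.games.athletes..example.affiliation.country"))
```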

freebaseqa's Issues

Preprint of the paper

(My apologies for this not being a real github issue, but rather a paper-request)

Thanks a lot for sharing this awesome resource! I am really looking forward to reading the associated paper. Is there any plan to make a preprint available online? Or do we have to wait till NAACL releases the conference proceedings (in June)?

Thanks,
Keshav

Are some triples in the Freebase extract wrong?

In the file "FreebaseQA_fb_extract.txt", there are some wrong triples, for example "m.03_bjc type.object.type government.governmental_jurisdiction". Because of this, the Freebase extract cannot be imported into Virtuoso. Could you please upload the Virtuoso database, or share it in some other way? Thanks.

Hi, I found that there may be some redundancy in the questions. Is that right?

For example: Who produced the film 12 Angry Men, which was scripted by Reginald Rose, starred Henry Fonda and was directed by Sidney Lumet?

Annotation information:
"TopicEntityName": "12 angry men",
"TopicEntityMid": "m.0m_tj",
"InferentialChain": "film.film.produced_by",

The text "which was scripted by Reginald Rose, starred Henry Fonda and was directed by Sidney Lumet?" is redundant. Is that intended?

Thank you for your help.

About the associated Freebase

Hi,

Thanks for your kind release of this dataset! I'm very interested in this dataset and am currently working on it. But I found that for some questions, the SPARQL queries can't retrieve the correct answers and the inference chain is not semantically consistent with the question's meaning. For example, the question ``FreebaseQA-dev-6'' is ``In which country were the 1948 Winter Olympics held?'' and the inference chain is ``m.0blfl olympics.olympic_games.athletes..olympics.olympic_athlete_affiliation.country'' or ``m.0blfl olympics.olympic_games.participating_countries''. However, neither of these two relation paths expresses the accurate meaning of the question. And these two relation paths in Freebase return a bunch of entities besides ``m.06mzp'', which leads to F1 < 1. (I ran the queries against the entire Freebase instead of the associated extract.)

Does this mean that the dataset must be used together with the associated subset of Freebase, so that only ``m.06mzp'' is retrieved?

Best,
Yunshi

Question about License

Hi,

Could you upload the license of this dataset? We are planning to include it in our paper but can't find the license.

Thanks.
