X-WikiRE

A Python tool for creating the WikiReading dataset in any language.

This tool provides semi-automated creation of the WikiReading dataset, as described in the work of Hewlett et al. Some pre-built datasets are available in their repository.

Requirements

  1. MongoDB
  2. Python

Procedure

Required files

  1. Download a Wikidata JSON dump from here
  2. Download a Wikipedia XML dump from here
  3. Download the language-specific page_props.sql dump from the Wikipedia dumps here

Data Processing

  1. Build the mapping dict between Wikipedia IDs and Wikidata IDs using wiki_prop.py
  2. Transform the XML dump to JSON using segment_wiki.py (a custom version of Gensim's script described here)
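The Wikipedia-to-Wikidata mapping in step 1 comes from the `wikibase_item` rows of the page_props.sql dump. A minimal sketch of that extraction (the function name is an assumption, not the actual wiki_prop.py API; the row pattern follows the standard page_props table layout):

```python
import re

# Matches rows of the page_props table whose property is 'wikibase_item',
# e.g. (12,'wikibase_item','Q6199',NULL) -> page ID 12 maps to item Q6199.
ROW_RE = re.compile(r"\((\d+),'wikibase_item','(Q\d+)'")

def build_mapping(sql_lines):
    """Build a dict from Wikipedia page ID to Wikidata ID out of the
    INSERT statements in a page_props.sql dump."""
    mapping = {}
    for line in sql_lines:
        for page_id, wikidata_id in ROW_RE.findall(line):
            mapping[int(page_id)] = wikidata_id
    return mapping
```

In practice the dump is read line by line (it can be several GB), and rows whose `pp_propname` is anything other than `wikibase_item` are ignored.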

Data import

  1. Import the Wikidata dump into MongoDB into its own collection using:
    mongoimport --db WikiReading --collection wikidata --file wikidata_dump.json --jsonArray
  2. Create an index on the "id" field
    db.wikidata.createIndex({"id": 1})
    
  3. Import the Wikipedia JSON dump into MongoDB into its own collection, e.g.:
    mongoimport --db WikiReading --collection wikipedia --file wikipedia_dump.json
  4. Create an index on the "wikidata_id" field:
    db.wikipedia.createIndex({"wikidata_id": 1})
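With both collections indexed, each article can be paired with its Wikidata item through the shared Wikidata ID. A minimal sketch of that lookup (the collection names and field layout are assumptions inferred from the commands above; with pymongo, `wikipedia` and `wikidata` would be `db.wikipedia` and `db.wikidata`):

```python
def fetch_linked_pair(wikipedia, wikidata, wikidata_id):
    """Fetch an article and its Wikidata item joined on the Wikidata ID.

    `wikipedia` and `wikidata` are collection-like objects exposing
    find_one(query_dict), e.g. pymongo collections. The article stores
    the ID in "wikidata_id"; the item stores it in "id" (both indexed).
    """
    article = wikipedia.find_one({"wikidata_id": wikidata_id})
    item = wikidata.find_one({"id": wikidata_id})
    return article, item
```

The two indexes created above exist precisely to make these `find_one` calls fast, since the dataset construction performs this lookup once per article.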
    

POS Tagger training

  1. Train a POS tagger for the desired language using this and the data from Universal Dependencies
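The Universal Dependencies treebanks ship in CoNLL-U format, so the tagger's training data is read from tab-separated token lines. A minimal reader that extracts (token, UPOS) pairs per sentence might look like this (a sketch based on the CoNLL-U column layout; it is not the actual training script):

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, upos) pairs from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                 # blank line terminates a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):     # comment/metadata line
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
            continue
        sentence.append((cols[1], cols[3]))   # FORM and UPOS columns
    if sentence:                     # file may lack a trailing blank line
        yield sentence
```

Sentences yielded this way can be fed directly to most sequence-tagger training loops.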

Cite

@inproceedings{abdou-etal-2019-x,
    title = "X-{W}iki{RE}: A Large, Multilingual Resource for Relation Extraction as Machine Comprehension",
    author = "Abdou, Mostafa  and
      Sas, Cezar  and
      Aralikatte, Rahul  and
      Augenstein, Isabelle  and
      S{\o}gaard, Anders",
    booktitle = "Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-6130",
    doi = "10.18653/v1/D19-6130",
    pages = "265--274",
    abstract = "Although the vast majority of knowledge bases (KBs) are heavily biased towards English, Wikipedias do cover very different topics in different languages. Exploiting this, we introduce a new multilingual dataset (X-WikiRE), framing relation extraction as a multilingual machine reading problem. We show that by leveraging this resource it is possible to robustly transfer models cross-lingually and that multilingual support significantly improves (zero-shot) relation extraction, enabling the population of low-resourced KBs from their well-populated counterparts.",
}

xwikire's Issues

Inquiry on wiki_prop.py

Hi Cezar,

This is Tan Qingyu, a PhD student from the National University of Singapore. Thank you for your great work. We are very interested in the X-Wiki-RE tool and are trying to build some distantly supervised data of our own. We are also interested in cross-lingual relation extraction. However, we have some questions about using X-Wiki-RE:
1) The dataset link at https://github.com/google-research-datasets/wiki-reading is empty. We tried to contact the authors of WikiReading but haven't heard back from them. May I know where we can access the processed data for English, German, French, Spanish and Italian?
2) We tried to process some Chinese data using X-Wiki-RE and wiki_prop.py at https://github.com/SasCezar/XWikiRE/commit/bbdcc8d2caba6bfb90fe951a2d154a9d494e3b2e, but the default encoding 'ANSI' is not recognized. We tried several encodings but still found 0 mappings. May I know what the possible cause for this is?

Thanks again for your time and hope to hear from you.

Best regards,
Qingyu

Wikipedia JSON dump

Hi, I want to generate the WikiReading dataset for English. Which specific JSON dump should I download, and what will its size be after unzipping?
