X-WikiRE

A Python tool for creating the WikiReading dataset in any language.

This tool provides semi-automated creation of the WikiReading dataset, as described in the work of Hewlett et al. Some pre-built datasets are available in their repository.

Requirements

  1. MongoDB
  2. Python

Procedure

Required files

  1. Download a Wikidata JSON dump from here
  2. Download a Wikipedia XML dump from here
  3. Download the language-specific page_props.sql dump from the Wikipedia dumps here

Data Processing

  1. Build the mapping dict between Wikipedia IDs and Wikidata IDs using wiki_prop.py
  2. Transform the XML dump to JSON using segment_wiki.py (a custom version of Gensim's script described here)
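The Wikipedia-to-Wikidata mapping in step 1 comes from the `wikibase_item` rows of the page_props.sql dump. A minimal sketch of that extraction (the function name is an assumption, not the actual wiki_prop.py API; the row pattern follows the standard page_props table layout):

```python
import re

# Matches rows of the page_props table whose property is 'wikibase_item',
# e.g. (12,'wikibase_item','Q6199',NULL) -> page ID 12 maps to item Q6199.
ROW_RE = re.compile(r"\((\d+),'wikibase_item','(Q\d+)'")

def build_mapping(sql_lines):
    """Build a dict from Wikipedia page ID to Wikidata ID out of the
    INSERT statements in a page_props.sql dump."""
    mapping = {}
    for line in sql_lines:
        for page_id, wikidata_id in ROW_RE.findall(line):
            mapping[int(page_id)] = wikidata_id
    return mapping
```

In practice the dump is read line by line (it can be several GB), and rows whose `pp_propname` is anything other than `wikibase_item` are ignored.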

Data import

  1. Import the Wikidata dump into MongoDB into its own collection using:
    mongoimport --db WikiReading --collection wikidata --file wikidata_dump.json --jsonArray
  2. Create an index on the "id" field
    db.wikidata.createIndex({"id": 1})
    
  3. Import the Wikipedia JSON dump into MongoDB into its own collection, e.g.:
    mongoimport --db WikiReading --collection wikipedia --file wikipedia_dump.json
  4. Create an index on the "wikidata_id" field:
    db.wikipedia.createIndex({"wikidata_id": 1})
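With both collections indexed, each article can be paired with its Wikidata item through the shared Wikidata ID. A minimal sketch of that lookup (the collection names and field layout are assumptions inferred from the commands above; with pymongo, `wikipedia` and `wikidata` would be `db.wikipedia` and `db.wikidata`):

```python
def fetch_linked_pair(wikipedia, wikidata, wikidata_id):
    """Fetch an article and its Wikidata item joined on the Wikidata ID.

    `wikipedia` and `wikidata` are collection-like objects exposing
    find_one(query_dict), e.g. pymongo collections. The article stores
    the ID in "wikidata_id"; the item stores it in "id" (both indexed).
    """
    article = wikipedia.find_one({"wikidata_id": wikidata_id})
    item = wikidata.find_one({"id": wikidata_id})
    return article, item
```

The two indexes created above exist precisely to make these `find_one` calls fast, since the dataset construction performs this lookup once per article.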
    

POS Tagger training

  1. Train a POS tagger for the desired language using this and the data from Universal Dependencies
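The Universal Dependencies treebanks ship in CoNLL-U format, so the tagger's training data is read from tab-separated token lines. A minimal reader that extracts (token, UPOS) pairs per sentence might look like this (a sketch based on the CoNLL-U column layout; it is not the actual training script):

```python
def read_conllu(lines):
    """Yield sentences as lists of (form, upos) pairs from CoNLL-U lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                 # blank line terminates a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):     # comment/metadata line
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
            continue
        sentence.append((cols[1], cols[3]))   # FORM and UPOS columns
    if sentence:                     # file may lack a trailing blank line
        yield sentence
```

Sentences yielded this way can be fed directly to most sequence-tagger training loops.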

Cite

@inproceedings{abdou-etal-2019-x,
    title = "X-{W}iki{RE}: A Large, Multilingual Resource for Relation Extraction as Machine Comprehension",
    author = "Abdou, Mostafa  and
      Sas, Cezar  and
      Aralikatte, Rahul  and
      Augenstein, Isabelle  and
      S{\o}gaard, Anders",
    booktitle = "Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-6130",
    doi = "10.18653/v1/D19-6130",
    pages = "265--274",
    abstract = "Although the vast majority of knowledge bases (KBs) are heavily biased towards English, Wikipedias do cover very different topics in different languages. Exploiting this, we introduce a new multilingual dataset (X-WikiRE), framing relation extraction as a multilingual machine reading problem. We show that by leveraging this resource it is possible to robustly transfer models cross-lingually and that multilingual support significantly improves (zero-shot) relation extraction, enabling the population of low-resourced KBs from their well-populated counterparts.",
}

xwikire's Issues

Inquiry on wiki_prop.py

Hi Cezar,

This is Tan Qingyu, a PhD student from the National University of Singapore. Thank you for your great work. We are very interested in the X-Wiki-RE tool and are trying to build some distantly supervised data of our own. We are also interested in cross-lingual relation extraction. However, we have some questions about using X-Wiki-RE:
1) The dataset link at https://github.com/google-research-datasets/wiki-reading is empty. We tried to contact the authors of WikiReading but haven't heard back from them. May I know where we can access the processed data for English, German, French, Spanish and Italian?
2) We tried to process some Chinese data using X-Wiki-RE and wiki_prop.py at https://github.com/SasCezar/XWikiRE/commit/bbdcc8d2caba6bfb90fe951a2d154a9d494e3b2e, but the default encoding 'ANSI' is not recognized. We tried several encodings but still found 0 mappings. May I know what the possible cause for this is?

Thanks again for your time and hope to hear from you.

Best regards,
Qingyu

Wikipedia JSON dump

Hi, I want to generate the WikiReading dataset for English. Which specific JSON dump should I download, and what will its size be after unzipping?
