
wikipedia-parallel-titles's Introduction

wikipedia-parallel-titles

This document describes how to use these tools to build a parallel corpus (for a specific language pair) based on article titles across languages in Wikipedia.

Download necessary data

Wikipedia publishes database dumps of its content periodically. To run these scripts you need two files from one of the languages in the pair: the base per-page data, which maps article IDs to titles in that language (the file ending in -page.sql.gz), and the interlanguage link records (the file ending in -langlinks.sql.gz). To find these files, go to the Wikimedia Downloads page and locate the database dump for the Wikipedia in one of the languages in the pair (the smaller one is recommended, since it makes processing faster). The database dumps are named by pairing the ISO 639 code with the word wiki. For example, if you want to build an Arabic-English corpus, you should download the relevant files from the arwiki dump, since there are fewer Arabic articles than English articles.

Example:

wget http://dumps.wikimedia.org/arwiki/20140831/arwiki-20140831-page.sql.gz
wget http://dumps.wikimedia.org/arwiki/20140831/arwiki-20140831-langlinks.sql.gz

Extract parallel titles

To extract the parallel corpus, run the following command, where the first command-line argument is the ISO 639 code of the target language and the second is the (path) prefix of the database dump files.

Example:

./build-corpus.sh en arwiki-20140831 > titles.txt

Language-specific filtering

If one of the languages in the pair uses a specific Unicode range, you can easily filter out lines that do not contain such characters. Example filters for a few scripts are included in the filters/ directory.

For example, the following will filter out pairs that do not contain at least one Perso-Arabic character:

./build-corpus.sh en arwiki-20140831 | ./filters/filter-perso-arabic.pl > titles.txt
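A minimal sketch of this kind of filter (not the repository's actual filter-perso-arabic.pl, whose contents may differ) is a Perl one-liner that keeps only lines containing at least one character from the Arabic Unicode script:

./build-corpus.sh en arwiki-20140831 | perl -CS -ne 'print if /\p{Arabic}/' > titles.txt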

Software dependencies

It is recommended that you have the uconv tool (part of International Components for Unicode, ICU) installed, since it is used to normalize Unicode characters.
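For example, to check whether uconv is available and to install it if it is missing (the package names below are common choices but are platform assumptions):

command -v uconv || echo "uconv not found"
sudo apt-get install icu-devtools      # Debian/Ubuntu package that provides uconv
brew install icu4c                     # macOS via Homebrew (keg-only; uconv may need to be added to PATH)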

wikipedia-parallel-titles's People

Contributors

redpony


wikipedia-parallel-titles's Issues

Scripts not working with Unicode input on macOS

Hi!

I tried to use the tool according to the README, on macOS, with Hebrew as the source language and Arabic as the target. When I executed the following command (after installing the dependencies):

./build-corpus.sh ar hewiki-20141102 > titles_he_ar.txt

I got the following output:

Target language code: ar
Using hewiki-20141102-langlinks.sql.gz
Using hewiki-20141102-page.sql.gz
Reading page data from hewiki-20141102-page.sql.gz...
iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings
read 0 documents
Reading langlinks data from hewiki-20141102-langlinks.sql.gz...
iconv: conversion from utf8 unsupported
iconv: try 'iconv -l' to get the list of supported encodings
read 0 documents

I tried to fix this by changing the Perl scripts that call iconv with the parameter 'utf8' so that they call it with 'utf-8', and it seems to work fine now.

Best regards,
Roee
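For anyone hitting the same error: the GNU iconv found on most Linux systems accepts the spelling utf8, but the iconv shipped with macOS may not, as the output above shows. A quick way to check what the local iconv accepts (generic diagnostic commands, not part of the repository):

iconv -l | grep -i utf                 # list the UTF encoding names this iconv knows
echo test | iconv -f utf-8 -t utf-8    # should print "test" if the utf-8 spelling is accepted
echo test | iconv -f utf8 -t utf8      # reproduces the error above if utf8 is not recognized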

How to get the entire parallel text corpus after titles.txt

This is not an issue, just a request for help: do you have any reference script or other resources for building an entire parallel corpus of article text for machine translation?
Specifically, is there anything you can share that takes the parallel titles as arguments and feeds them into a text extractor, so that both languages' article texts can be pulled from Wikipedia?

I am building a machine translation system, so any help would be much appreciated.

Thanks

Use the display title - critical for low-resource languages

Some Wikipedias use DISPLAYTITLE to override the titles of almost all articles. Typically this happens with the low variety in a high-low diglossia; a good example is Alemannic (roughly "Swiss German").

For example, see https://als.wikipedia.org/wiki/Zürich:

[Screenshot of the als.wikipedia.org article: the displayed title is Züri, while the URL title is Zürich]

(The URL using the Standard German (de) Zürich instead of the actual Alemannic (als) Züri is a workaround for the fact that Alemannic has no single standardised orthography, so it's more practical to allow searches and lookups in the standard language.)

Currently, the actual output extracted is Zürich, but the expected output is Züri.

So in order to build a viable parallel titles corpus for such a language, we need to prefer DISPLAYTITLE and only take the underlying title if DISPLAYTITLE is unset.

(Not sure what the default should be, but it's probably good to make it an option rather than a hard rule, because, for example, when building a corpus for translation from als to en it's often useful to additionally include the de-to-en data, given how often de segments occur in real als data.)
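One possible route for implementing the DISPLAYTITLE preference (a sketch only; build-corpus.sh does not currently do this): MediaWiki stores DISPLAYTITLE overrides in the page_props table under pp_propname = 'displaytitle', and that table is published as its own dump file alongside the page and langlinks files, so an extractor could join it against the per-page data and use the override when present. Note that the stored values often contain HTML markup that would need stripping. For example (the date is illustrative):

wget http://dumps.wikimedia.org/alswiki/20200101/alswiki-20200101-page_props.sql.gz
# List rows carrying a displaytitle override (page ID, property name, value):
gunzip -c alswiki-20200101-page_props.sql.gz | grep -o "([0-9]*,'displaytitle',[^)]*)" | head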

Truecasing

By default, Wikipedia titles are title cased, but that creates subtle skew in training data.

Very often there is a named entity like Apple, Meteor, or Snap based on a common noun; in other cases, like cognac, there is a common noun based on a named entity.

How to get parallel articles

There is no doubt that this work is very powerful and great, and I successfully ran the Chinese-to-English extraction. My question is that the text content in the titles is too small. Is there any way to extract the content of the articles as well? How should I go about it?

Existing parallel wiki corpora

People thinking of doing this with current data dumps from Wikipedia may be interested in looking at existing historical parallel corpora of article titles in multiple languages collected as part of the open-dict-data project. For example, the wikidict-en repo contains parallel article titles in English and 115 other languages. (For pairs with other languages, look in the main open-dict-data repo list for "wikidict-" plus the desired ISO code.)

These dumps were originally used to create language-learning dictionaries (for use in programs like GoldenDict), but they could also be useful for comparing older datasets against new ones generated with wikipedia-parallel-titles, and so on.
