
Wikipedia text corpus for self-supervised NLP model training

License: MIT License


Wikipedia 2 Corpus

Tools to extract and clean Wikipedia texts and turn them into a text corpus for self-supervised NLP model training. Prepared corpora for English and German are also included (see below).

We use WikiExtractor to extract the Wikipedia database dumps. The texts are split into sentences with SoMaJo. Each line of the text corpus contains a single sentence; articles are separated by a blank line.

Remove blank lines

If you want to remove the blank lines in the text corpus you can use this command: sed -i '/^$/d' <filename>
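If sed is not available (for example on Windows), the same cleanup can be sketched in Python; the file path is a placeholder:

```python
from pathlib import Path

def remove_blank_lines(path: str) -> None:
    """Delete empty lines in place, like `sed -i '/^$/d' <filename>`."""
    p = Path(path)
    lines = p.read_text(encoding="utf-8").splitlines()
    p.write_text(
        "\n".join(line for line in lines if line != "") + "\n",
        encoding="utf-8",
    )
```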

Download the German text corpus

  • size of the corpus (unzipped): 6.1G
  • number of lines: 59,475,915
  • download the individual part files (named dewiki-20220201-clean-part-*):
  • combine the parts: cat dewiki-20220201-clean-part-* > dewiki-20220201-clean.zip
  • optional check: sha256sum dewiki-20220201-clean.zip should return 09c47abf6200ecc342e04902e360773f9ba2d92abb64bfa20f22c63fd660edcf
  • unzip the text file: unzip dewiki-20220201-clean.zip (note that unzip -t only tests the archive without extracting it)
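The checksum step can also be done in Python by streaming the archive, which avoids loading the multi-gigabyte file into memory at once; the commented-out call assumes the zip file is in the current directory:

```python
import hashlib

EXPECTED = "09c47abf6200ecc342e04902e360773f9ba2d92abb64bfa20f22c63fd660edcf"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so the whole archive never sits in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# assert sha256_of("dewiki-20220201-clean.zip") == EXPECTED
```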

Download the English text corpus

How you can replicate our work

  • download the raw Wikipedia dump and store it in the data directory:
    • German language: Select the most recent directory from https://dumps.wikimedia.org/dewiki/ and download a file called dewiki-<yyyymmdd>-pages-articles.xml.bz2. It is about 5.8 GB in size. We use dewiki-20220201-pages-articles.xml.bz2.
    • English language: Select the most recent directory from https://dumps.wikimedia.org/enwiki/ and download a file called enwiki-<yyyymmdd>-pages-articles.xml.bz2. It is about 18.1 GB in size. We use enwiki-20220201-pages-articles.xml.bz2.
  • create and activate a new Python environment (for example with conda)
  • install the dependencies: pip install -r requirements.txt
  • for de data run: python -m wikiextractor.WikiExtractor data/dewiki-20220201-pages-articles.xml.bz2 -o data/dewiki-20220201
  • for en data run: python -m wikiextractor.WikiExtractor data/enwiki-20220201-pages-articles.xml.bz2 -o data/enwiki-20220201
  • use the process_wiki_files.py script:
    • edit INPUT_DIR, OUTPUT_DIR and if needed LANGUAGE
    • run the script
  • concatenate the output files in OUTPUT_DIR by running cat <OUTPUT_DIR>/* > my_clean_wiki_corpus.txt
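The final concatenation step can be sketched in Python as well, which is handy on systems without cat; this sketch assumes OUTPUT_DIR contains only the cleaned output files:

```python
from pathlib import Path

def concatenate_corpus(output_dir: str, target: str) -> None:
    """Write every file in output_dir to target, in sorted (deterministic) order."""
    with open(target, "w", encoding="utf-8") as out:
        for part in sorted(Path(output_dir).iterdir()):
            if part.is_file():
                out.write(part.read_text(encoding="utf-8"))
```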

License

The Text Corpus

Like Wikipedia itself, the text corpus is published under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

The Script

Copyright (c) 2022 Philip May

Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.


wikipedia2corpus's Issues

MultiProcessing Issue

stack trace.txt

Hey, when trying to run process.py on the extracted Wikipedia dump, I am running into a "No such file or directory" error. I am attaching the stack trace -- can you help me understand whether my computational capabilities prevent me from running tasks in parallel?
