
Wikipedia text corpus for self-supervised NLP model training

License: MIT License


Wikipedia 2 Corpus

Tools to extract and clean Wikipedia texts and turn them into a text corpus for self-supervised NLP model training. Prepared corpora for English and German are also included (see below).

We use WikiExtractor to extract the Wikipedia database dumps. The texts are split into sentences with SoMaJo. Each line of the text corpus contains a single sentence; articles are separated by a blank line.

Remove blank lines

If you want to remove the blank lines in the text corpus you can use this command: sed -i '/^$/d' <filename>
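If sed is not available (for example on Windows), the same cleanup can be sketched in Python; the file path is a placeholder:

```python
from pathlib import Path

def remove_blank_lines(path: str) -> None:
    """Delete empty lines in place, like `sed -i '/^$/d' <filename>`."""
    p = Path(path)
    lines = p.read_text(encoding="utf-8").splitlines()
    p.write_text(
        "\n".join(line for line in lines if line != "") + "\n",
        encoding="utf-8",
    )
```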

Download the German text corpus

  • size of the corpus (unzipped): 6.1G
  • number of lines: 59,475,915
  • download the individual part files (named dewiki-20220201-clean-part-*):
  • combine the parts: cat dewiki-20220201-clean-part-* > dewiki-20220201-clean.zip
  • optional check: sha256sum dewiki-20220201-clean.zip should return 09c47abf6200ecc342e04902e360773f9ba2d92abb64bfa20f22c63fd660edcf
  • unzip the text file: unzip dewiki-20220201-clean.zip (note that unzip -t only tests the archive without extracting it)
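The checksum step can also be done in Python by streaming the archive, which avoids loading the multi-gigabyte file into memory at once; the commented-out call assumes the zip file is in the current directory:

```python
import hashlib

EXPECTED = "09c47abf6200ecc342e04902e360773f9ba2d92abb64bfa20f22c63fd660edcf"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so the whole archive never sits in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# assert sha256_of("dewiki-20220201-clean.zip") == EXPECTED
```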

Download the English text corpus

How you can replicate our work

  • download the raw Wikipedia dump and store it in the data directory:
    • German language: Select the most recent directory from https://dumps.wikimedia.org/dewiki/ and download a file called dewiki-<yyyymmdd>-pages-articles.xml.bz2. It is about 5.8 GB in size. We use dewiki-20220201-pages-articles.xml.bz2.
    • English language: Select the most recent directory from https://dumps.wikimedia.org/enwiki/ and download a file called enwiki-<yyyymmdd>-pages-articles.xml.bz2. It is about 18.1 GB in size. We use enwiki-20220201-pages-articles.xml.bz2.
  • create and activate a new Python environment (for example with conda)
  • install the dependencies: pip install -r requirements.txt
  • for de data run: python -m wikiextractor.WikiExtractor data/dewiki-20220201-pages-articles.xml.bz2 -o data/dewiki-20220201
  • for en data run: python -m wikiextractor.WikiExtractor data/enwiki-20220201-pages-articles.xml.bz2 -o data/enwiki-20220201
  • use the process_wiki_files.py script:
    • edit INPUT_DIR, OUTPUT_DIR and if needed LANGUAGE
    • run the script
  • concatenate the output files in OUTPUT_DIR by running cat <OUTPUT_DIR>/* > my_clean_wiki_corpus.txt
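The final concatenation step can be sketched in Python as well, which is handy on systems without cat; this sketch assumes OUTPUT_DIR contains only the cleaned output files:

```python
from pathlib import Path

def concatenate_corpus(output_dir: str, target: str) -> None:
    """Write every file in output_dir to target, in sorted (deterministic) order."""
    with open(target, "w", encoding="utf-8") as out:
        for part in sorted(Path(output_dir).iterdir()):
            if part.is_file():
                out.write(part.read_text(encoding="utf-8"))
```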

License

The Text Corpus

Like Wikipedia itself, the text corpus is published under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

The Script

Copyright (c) 2022 Philip May

Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.


wikipedia2corpus's Issues

MultiProcessing Issue

stack trace.txt

Hey, when trying to run process.py on the extracted Wikipedia dump, I am running into a "No such file or directory" error. I am attaching the stack trace -- can you help me understand whether my computational capabilities prevent me from running tasks in parallel?
