GithubHelp home page GithubHelp logo

akb89 / witokit Goto Github PK

View Code? Open in Web Editor NEW
9.0 3.0 1.0 49 KB

A Python toolkit to generate a tokenized dump of Wikipedia for NLP

License: MIT License

Python 100.00%
wikipedia wikipedia-dump dump nlp tokenize multilingual

witokit's Introduction

WiToKit

GitHub release Build MIT License

Welcome to WiToKit, a Python toolkit to download and generate preprocessed Wikipedia dumps for all languages.

WiToKit can be used to converts a Wikipedia archive into a single .txt file, one (tokenized) sentence per line.

Note: WiToKit currently only supports xx-pages-articles.xml.xx.bz2 Wikipedia archives corresponding to articles, templates, media/file descriptions, and primary meta-pages.

Install

After a git clone, run:

python3 setup.py install

Use

Download

To download a .bz2-compressed Wikipedia XML dump, do:

witokit download ⁠\
  --lang lang_wp_code \
  --date wiki_date \
  --output /abs/path/to/output/dir/where/to/store/bz2/archives \
  --num-threads num_cpu_threads

For example, to download the latest English Wikipedia, do:

witokit download ⁠--lang en --date latest --output /abs/path/to/output/dir --num-threads 2

The --lang parameter expects the WP (language) code corresponding to the desired Wikipedia archive. Check out the full list of Wikipedias with their corresponding WP codes here.

The --date parameter expects a string corresponding to one of the dates found under the Wikimedia dump site corresponding to a given Wikipedia dump (e.g. https://dumps.wikimedia.org/enwiki/ for the English Wikipedia).

Important Keep num-threads <= 3 to avoid rejection from Wikimedia servers

Extract

To extract the content of the downloaded .bz2 archives, do:

witokit extract \
  --input /abs/path/to/downloaded/wikipedia/bz2/archives \
  --num-threads num_cpu_threads

Process

To preprocess the content of the extracted XML archives and output a single .txt file, tokenize, one sentence per line:

witokit process \
  --input /abs/path/to/wikipedia/extracted/xml/archives \
  --output /abs/path/to/single/output/txt/file \
  --lower \  # if set, will lowercase text
  --num-threads num_cpu_threads

Preprocessing for all languages is performed with Polyglot.

Sample

You can also use WiToKit to sample the content of a preprocess .txt file, using:

witokit sample \
  --input /abs/path/to/witokit/preprocessed/txt/file \
  --percent \  # percentage of total lines to keep
  --balance  # if set, will balance sampling, otherwise, will take top n sentences only

witokit's People

Contributors

akb89 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

strategist922

witokit's Issues

AttributeError: module 'wikiextractor' has no attribute 'extract'

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.7/site-packages/witokit/main.py", line 134, in _preprocess
    for json_object in tqdm(wikiextractor.extract(input_xml_filepath)):
AttributeError: module 'wikiextractor' has no attribute 'extract'

I don't see any such function called extract() here: https://github.com/attardi/wikiextractor
Any clues on how it works?

Unable to install WikiExtractor v3.0.4

Hi,

Thanks for this awesome tool. However, I can't find a way to install it.

Here is my config:

  • Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0] on linux
  • Pip v20.0.2

Here is the error message:

ERROR: Could not find a version that satisfies the requirement wikiextractor==3.0.4 (from witokit) (from versions: 0.1)
ERROR: No matching distribution found for wikiextractor==3.0.4 (from witokit)

Apparently, only the v0.1 exists on PyPi.

Thanks for your help,

Best,

Charlotte.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.