attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps

License: GNU Affero General Public License v3.0

Python 99.23% Shell 0.77%

wikiextractor's Introduction

WikiExtractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database backup dump, e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.

The tool is written in Python and requires Python 3; no additional libraries are needed. Warning: problems have been reported on Windows due to poor support for StringIO in that platform's Python implementation.

For further information, see the Wiki.

Wikipedia Cirrus Extractor

cirrus-extractor.py is a version of the script that performs extraction from a Wikipedia Cirrus dump. Cirrus dumps contain text with already expanded templates.

Cirrus dumps are available at: cirrussearch.

Details

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.

In order to speed up processing:

  • multiprocessing is used for dealing with articles in parallel
  • a cache is kept of parsed templates (only useful for repeated extractions).
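
As a rough illustration of these two ideas (a toy sketch with made-up function names, not the extractor's actual code), articles can be mapped over a process pool while each worker keeps a per-process dictionary of already parsed templates:

    from multiprocessing import Pool

    template_cache = {}                        # per-process cache of parsed templates

    def parse_template(wikitext):
        parsed = template_cache.get(wikitext)
        if parsed is None:
            parsed = wikitext.split('|')       # stand-in for real template parsing
            template_cache[wikitext] = parsed
        return parsed

    def extract_article(text):
        return text.strip()                    # stand-in for real cleaning and expansion

    if __name__ == '__main__':
        with Pool() as pool:                   # articles are handled in parallel
            cleaned = pool.map(extract_article, ['First article ', ' Second article'])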

Installation

The script may be invoked directly:

python -m wikiextractor.WikiExtractor <Wikipedia dump file>

It can also be installed from PyPI with:

pip install wikiextractor

or locally with:

(sudo) python setup.py install

The installer also installs two scripts for direct invocation:

wikiextractor  	(equivalent to python -m wikiextractor.WikiExtractor)
extractPage		(to extract a single page from a dump)

Usage

Wikiextractor

The script is invoked with a Wikipedia dump file as an argument:

python -m wikiextractor.WikiExtractor <Wikipedia dump file> [--templates <extracted template file>]

The option --templates extracts the templates to a local file, which can be reloaded to reduce the time to perform extraction.
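
For example (the file names here are only illustrative), a first run can save the templates and later runs can reuse them:

python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 --templates enwiki-templates.txt -o extracted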

The output is stored in several files of similar size in a given directory. Each file will contain several documents in this document format.

usage: wikiextractor [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] [-l] [-ns ns1,ns2]
			 [--templates TEMPLATES] [--no-templates] [--html-safe HTML_SAFE] [--processes PROCESSES]
			 [-q] [--debug] [-a] [-v]
			 input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

	<doc id="" url="" title="">
	    ...
	    </doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as JSON objects, one per line, with
the following structure:

	{"id": "", "revid": "", "url": "", "title": "", "text": "..."}

The program performs template expansion by preprocessing the whole dump and
collecting template definitions.

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --processes PROCESSES
			    Number of processes to use (default 79)

Output:
  -o OUTPUT, --output OUTPUT
			    directory for extracted files (or '-' for dumping to stdout)
  -b n[KMG], --bytes n[KMG]
			    maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip
  --json                write output in json format instead of the default <doc> format

Processing:
  --html                produce HTML output, subsumes --links
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
			    accepted namespaces
  --templates TEMPLATES
			    use or create file containing templates
  --no-templates        Do not expand templates
  --html-safe HTML_SAFE
			    use to produce HTML safe output within <doc>...</doc>

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug option)
  -v, --version         print program version
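
Since --json writes one JSON object per line with the structure shown above, the extracted files can be read back with a few lines of Python (the path below is just an example of one output file):

    import json

    with open("extracted/AA/wiki_00", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            print(doc["id"], doc["title"], len(doc["text"]))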

Saving templates to a file will speed up performing extraction the next time, assuming template definitions have not changed.

Option --no-templates significantly speeds up the extractor, avoiding the cost of expanding MediaWiki templates.

For further information, visit the documentation.

Cirrus Extractor

usage: cirrus-extract.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [-ns ns1,ns2] [-q]
                         [-v]
                         input

Wikipedia Cirrus Extractor:
Extracts and cleans text from a Wikipedia Cirrus dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

	<doc id="" url="" title="" language="" revision="">
        ...
        </doc>

positional arguments:
  input                 Cirrus Json wiki dump file

optional arguments:
  -h, --help            show this help message and exit

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to
                        stdin)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip

Processing:
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces

Special:
  -q, --quiet           suppress reporting progress info
  -v, --version         print program version

extractPage

Extract a single page from a Wikipedia dump file.

usage: extractPage [-h] [--id ID] [--template] [-v] input

Wikipedia Page Extractor:
Extracts a single page from a Wikipedia dump file.

positional arguments:
  input          XML wiki dump file

optional arguments:
  -h, --help     show this help message and exit
  --id ID        article number
  --template     template number
  -v, --version  print program version

License

The code is made available under the GNU Affero General Public License v3.0.

Reference

If you find this code useful, please cite it in publications as:

@misc{Wikiextractor2015,
  author = {Giuseppe Attardi},
  title = {WikiExtractor},
  year = {2015},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/attardi/wikiextractor}}
}

wikiextractor's People

Contributors

albertvillanova, andythefactory, ariesll, attardi, cecca, danduma, dragoon, dvzubarev, gojomo, jonastriki, karlstratos, mrshu, munzey, nathj07, nkruglikov, orangain, rom1504, sente, seong889, spyysalo, tdesjardins, zwchan


wikiextractor's Issues

Invalid Syntax running ./WikiExtractor.py -h

OS: Centos 6
Python: 2.7.9
Wikiextractor: 2.40 (according to the code, cloned today)

Command: ./WikiExtractor.py -h
Response:

  File "./WikiExtractor.py", line 921
    afterPat = {o: re.compile(openPat + '|' + c, re.DOTALL) for o, c in izip(openDelim, closeDelim)}
                                                              ^
SyntaxError: invalid syntax

Any ideas on this one? Are there any pre-requisites I needed to pip install first or does this only work with Python 3? Any pointers greatly appreciated.

TypeError

Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "wikiextractor-master/WikiExtractor.py", line 2355, in run
job.extract(self._splitter)
File "wikiextractor-master/WikiExtractor.py", line 432, in extract
for line in compact(text):
File "wikiextractor-master/WikiExtractor.py", line 2045, in compact
listLevel = listLevel + n
TypeError: can only concatenate list (not "unicode") to list

changing listLevel = listLevel + n to listLevel += n fixed it for some reason
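
A minimal illustration of why that change sidesteps the error: for a list, + requires another list on the right-hand side, while += is implemented as extend() and accepts any iterable, so the characters of the string are appended as list items:

    listLevel = []
    n = '*'
    # listLevel = listLevel + n   # TypeError: can only concatenate list (not "str") to list
    listLevel += n                # equivalent to listLevel.extend(n)
    print(listLevel)              # ['*']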

RuntimeError: maximum recursion depth exceeded

enwiki-20150515-pages-articles.xml:
INFO: 21637542 Rizzani de Eccher
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in bootstrap_inner
self.run()
File "tools/WikiExtractor.py", line 2335, in run
job.extract(self._splitter)
File "tools/WikiExtractor.py", line 407, in extract
text = clean(self, text)
File "tools/WikiExtractor.py", line 1865, in clean
text = extractor.expandTemplates(text)
File "tools/WikiExtractor.py", line 458, in expandTemplates
res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
File "tools/WikiExtractor.py", line 677, in expandTemplate
instantiated = template.subst(params, self)
File "tools/WikiExtractor.py", line 301, in subst
return ''.join([tpl.subst(params, extractor, depth) for tpl in self])
File "tools/WikiExtractor.py", line 358, in subst
res = extractor.expandTemplates(defaultValue)
File "tools/WikiExtractor.py", line 458, in expandTemplates
res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
...
File "tools/WikiExtractor.py", line 294, in subst
logging.debug('subst tpl (%d, %d) %s', len(extractor.frame), depth, self)
File "/usr/lib/python2.7/logging/__init
.py", line 1608, in debug
root.debug(msg, _args, *_kwargs)
File "/usr/lib/python2.7/logging/init.py", line 1127, in debug
if self.isEnabledFor(DEBUG):
File "/usr/lib/python2.7/logging/init.py", line 1338, in isEnabledFor
return level >= self.getEffectiveLevel()
RuntimeError: maximum recursion depth exceeded

NameError: global name 'templatePrefix' is not defined

I encountered the problem after running WikiExtractor.py (with python 2.7 in Windows 8.1 x64) on an farsi wiki dump.
Can you explain why this error occurs?

python h:\wiki\WikiExtractor.py h:\wiki\fawiki-20150602-pages-articles.xml.bz2 -cb 5M -o h:\wiki\extracted --processes 1
INFO: Preprocessing 'h:\wiki\fawiki-20150602-pages-articles.xml.bz2' to collect template definitions: this may take some time.
INFO: Preprocessed 100000 pages
INFO: Preprocessed 200000 pages
INFO: Preprocessed 300000 pages
INFO: Preprocessed 400000 pages
INFO: Preprocessed 500000 pages
INFO: Preprocessed 600000 pages
INFO: Preprocessed 700000 pages
INFO: Preprocessed 800000 pages
INFO: Preprocessed 900000 pages
INFO: Preprocessed 1000000 pages
INFO: Preprocessed 1100000 pages
INFO: Preprocessed 1200000 pages
INFO: Preprocessed 1300000 pages
INFO: Preprocessed 1400000 pages
INFO: Preprocessed 1500000 pages
INFO: Preprocessed 1600000 pages
INFO: Preprocessed 1700000 pages
INFO: Preprocessed 1800000 pages
INFO: Preprocessed 1900000 pages
INFO: Preprocessed 2000000 pages
INFO: Preprocessed 2100000 pages
INFO: Preprocessed 2200000 pages
INFO: Loaded 109314 templates in 685.3s
INFO: Starting page extraction from h:\wiki\fawiki-20150602-pages-articles.xml.bz2.
INFO: Using 1 extract processes.
Process Process-2:
Traceback (most recent call last):
File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
self.run()
File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "h:\wiki\WikiExtractor.py", line 2427, in extract_process
Extractor(*job[:3]).extract(out) # (id, title, page)
File "h:\wiki\WikiExtractor.py", line 423, in extract
text = clean(self, text)
File "h:\wiki\WikiExtractor.py", line 1896, in clean
text = extractor.expandTemplates(text)
File "h:\wiki\WikiExtractor.py", line 479, in expandTemplates
res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
File "h:\wiki\WikiExtractor.py", line 636, in expandTemplate
title = fullyQualifiedTemplateTitle(title)
File "h:\wiki\WikiExtractor.py", line 1121, in fullyQualifiedTemplateTitle
return templatePrefix + ucfirst(templateTitle)
NameError: global name 'templatePrefix' is not defined

RuntimeError: maximum recursion depth exceeded in cmp

After article id 227107 (Intellectual), I got this error. I had reserved only 4 GB of memory for this process.

INFO:root:227107 Intellectual
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 813, in __bootstrap_inner
self.run()
File "./WikiExtractor.py", line 1810, in run
job.extract(self._splitter)
File "./WikiExtractor.py", line 1782, in extract
text = clean(text)
File "./WikiExtractor.py", line 1334, in clean
text = expandTemplates(text)
File "./WikiExtractor.py", line 307, in expandTemplates
res += wikitext[cur:s] + expandTemplate(wikitext[s+2:e-2], frame)
File "./WikiExtractor.py", line 810, in expandTemplate
instantiated = substParameters(template, params, frame)
File "./WikiExtractor.py", line 857, in substParameters
params, frame, subst_depth+1)
File "./WikiExtractor.py", line 886, in substParameter
parameter = expandTemplates(parameter, frame)
......................................................
......................................................
......................................................
.
.
.
.
.
File "./WikiExtractor.py", line 545, in findBalanced
afterPat = { o:re.compile(openPat+'|'+c, re.DOTALL) for o,c in izip(openDelim, closeDelim)}
File "./WikiExtractor.py", line 545, in
afterPat = { o:re.compile(openPat+'|'+c, re.DOTALL) for o,c in izip(openDelim, closeDelim)}
File "/usr/lib64/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib64/python2.7/re.py", line 232, in _compile
p = _cache.get(cachekey)
RuntimeError: maximum recursion depth exceeded in cmp

UnicodeDecodeError and Process Stopped

Hi,

I am trying to extract plaintext from English Wikipedia data dump.

At some point in the extraction process, I get UnicodeDecodeError and the program won't proceed any further.


I tried this:
line = line.decode('utf-8', 'ignore')

But the program just exited when it reached the point of error.

Any way to solve this error or just ignore this in a way that the extraction process is not hampered?

Thank you!

Stopped making progress

I am processing the dump of 20160305.

The script ran for about 20 hours and then just stopped making any further progress. I saw two Unix processes but they were both sleeping.

The last few outputs were:

WARNING: Template errors in article 'Shawn Matthias' (15299966): title(2) recursion(0, 0, 0)
WARNING: Template errors in article 'Rainey Street Historic District (Austin, Texas)' (15301930): title(0) recursion(116, 0, 0)
WARNING: Template errors in article 'Alfred Neuland' (15304281): title(2) recursion(0, 0, 0)
WARNING: Template errors in article 'Humberto Mariles' (15305453): title(2) recursion(0, 0, 0)
WARNING: Template errors in article 'Rubén Uriza' (15305737): title(2) recursion(0, 0, 0)
WARNING: Template errors in article 'Santiago Ramírez' (15306967): title(2) recursion(0, 0, 0)

No articles get extracted

I have downloaded the english wiki dump enwiki-20160305-pages-articles-multistream.xml.bz2 and installed Wikiextractor in a Debian VM.

When I ran the extractor I am getting 0 articles in return and no errors:

WikiExtractor.py -b 250K -o extracted enwiki-20160305-pages-articles-multistream.xml.bz2

INFO: Loaded 0 templates in 0.0s
INFO: Starting page extraction from enwiki-20160305-pages-articles-multistream.xml.bz2.
INFO: Using 1 extract processes.
INFO: Finished 1-process extraction of 0 articles in 0.1s (0.0 art/s)

Sections names absent

Hi!
Some sections names, e.g. 'Bibliografia' are removed.
For example, for this person
Duino Gorin
https://it.wikipedia.org/wiki/Duino_Gorin

In XML file I could see level 2 header:
==Bibliografia==
*''La Raccolta Completa degli Album Panini 1975-1976''
*''La Raccolta Completa degli Album Panini 1960-2004'' - Indici
*''Almanacco illustrato del calcio 1982''. edizione Pani

But in the processed file just ( no 'Bibliografia' section):

Trascorse in rossonero tre stagioni, fino al 1977, quando passò al Monza.

  • "La Raccolta Completa degli Album Panini 1975-1976"
  • "La Raccolta Completa degli Album Panini 1960-2004" - Indici
  • "Almanacco illustrato del calcio 1982". edizione Panini.

How could I keep sections' names, please?

Thanks!

Wikipedia pages with colon in title not extracted

We've run into the problem that Wikipedia pages with a colon in the title (e.g. Star Trek: The Original Series) are not extracted from a dump. The cause seems to be that extraction is based on a whitelist: a page is kept only if the part before the colon is an accepted namespace, or if the title contains no colon at all. Adding a list of known namespaces and checking against it (i.e. keep the page if the prefix before the colon is an accepted namespace, or if it is not a known namespace at all, or if the title contains no colon) seems to be a quick fix for the problem, though maybe that leads to other problems?
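
A rough sketch of the check described above (the function and the namespace set are made up for illustration; they are not the extractor's code):

    KNOWN_NAMESPACES = {'Template', 'Category', 'File', 'Help', 'Portal'}

    def keep_title(title, accepted_namespaces=()):
        prefix, colon, _ = title.partition(':')
        if not colon:
            return True                            # no colon at all
        if prefix in accepted_namespaces:
            return True                            # explicitly accepted namespace
        return prefix not in KNOWN_NAMESPACES      # a colon, but not a namespace prefix

    keep_title('Star Trek: The Original Series')   # True: "Star Trek" is not a namespace
    keep_title('Template:Infobox')                 # False unless 'Template' is accepted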

The script does not work with Python 2.6.6 and there's no required Python version in the docs

The script does not work with Python 2.6.6, and the docs do not state a required Python version.
This information should be given somewhere in the docs so that users know which Python version is required.
As a side note, the error I get when I try to run it with Python 2.6.6 is:

afterPat = {o: re.compile(openPat + '|' + c, re.DOTALL) for o, c in izip(openDelim, closeDelim)}

IOError: [Errno 21]

Hi again,

Not sure if this is the right place to ask this. I already extracted the files from the bz2 dump. I´m trying to read the individual files (Gensim`s tutorial):

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/home/ariel/extracted') # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)

And get this error:


IOError Traceback (most recent call last)
in ()
9
10 sentences = MySentences('/home/ariel/extracted/') # a memory-friendly iterator
---> 11 model = gensim.models.Word2Vec(sentences)

/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.pyc in __init__(self, sentences, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab)
429 if isinstance(sentences, GeneratorType):
430 raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
--> 431 self.build_vocab(sentences, trim_rule=trim_rule)
432 self.train(sentences)
433

/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.pyc in build_vocab(self, sentences, keep_raw_vocab, trim_rule)
493
494 """
--> 495 self.scan_vocab(sentences, trim_rule=trim_rule) # initial survey
496 self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule) # trim by min_count & precalculate downsampling
497 self.finalize_vocab() # build tables & arrays

/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.pyc in scan_vocab(self, sentences, progress_per, trim_rule)
504 min_reduce = 1
505 vocab = defaultdict(int)
--> 506 for sentence_no, sentence in enumerate(sentences):
507 if sentence_no % progress_per == 0:
508 logger.info("PROGRESS: at sentence #%i, processed %i words, keeping %i word types",

in __iter__(self)
5 def __iter__(self):
6 for fname in os.listdir(self.dirname):
----> 7 for line in open(os.path.join(self.dirname, fname)):
8 yield line.split()
9

IOError: [Errno 21] Is a directory: '/home/ariel/extracted/PB'

KeyError

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "wikiextractor-master/WikiExtractor.py", line 2355, in run
job.extract(self._splitter)
File "wikiextractor-master/WikiExtractor.py", line 432, in extract
for line in compact(text):
File "wikiextractor-master/WikiExtractor.py", line 2051, in compact
page.append(listItem[n] % line)
KeyError: u'"'

Note that line numbers are slightly higher here because of my own added code:
last line number is 2031

Is there a setup.py?

Hi,

A begginers question. Is there a setup.py file? I clone the projects but I can´t see it.

Thanks

Parser Error

New to Python and WikiMedia, so it's possible I'm doing something wrong, but I'm getting an error when attempting to process the Wiktionary dump. (Note: reusing the templates file in this example since it takes quite a while to generate.)

C:\Users\administrator\Downloads>WikiExtractor.py --threads 1 --templates TEMPLATES -o MYOUT --html enwiktionary-latest-pages-articles.xml.bz2
INFO: Preprocessing dump to collect template definitions: this may take some time.
INFO: Preprocessed 10000 pages
INFO: Preprocessed 20000 pages
INFO: Starting processing pages from enwiktionary-latest-pages-articles.xml.bz2.
INFO: Using 1 CPUs.
INFO: 16 dictionary
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Windows\Python27\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Users\administrator\Downloads\WikiExtractor.py", line 2339, in run
job.extract(self._splitter)
File "C:\Users\administrator\Downloads\WikiExtractor.py", line 412, in extract
for line in compact(text):
File "C:\Users\administrator\Downloads\WikiExtractor.py", line 2035, in compact
page.append(listItem[n] % line)
KeyError: u' '

RuntimeError: maximum recursion depth exceeded

I encountered the problem after running WikiExtractor.py on a wiki dump for extended period of time (few days).

$ uname -a
Darwin imac.home 13.4.0 Darwin Kernel Version 13.4.0: Wed Dec 17 19:05:52 PST 2014; root:xnu-2422.115.10~1/RELEASE_X86_64 x86_64
$
$ ls -l ~/Downloads/enwiki-latest-pages-articles.xml.bz2
-rw-r-----@ 1 ryszard 501 11935745192 May 31 19:13 /Users/ryszard/Downloads/enwiki-latest-pages-articles.xml.bz
$
$ ~/util/Python/WikiExtractor.py -c -o extracted ~/Downloads/enwiki-latest-pages-articles.xml.bz2
[...]
INFO: 21637400 1976 American Airlines Tennis Games – Doubles
INFO: 21637471 Missouri River Runner
INFO: 21637508 Angelika Knipping
INFO: 21637520 List of reporters for ITV News
INFO: 21637542 Rizzani de Eccher
Exception in thread Thread-1:
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 808, in bootstrap_inner
self.run()
File "/Users/ryszard/util/Python/WikiExtractor.py", line 2335, in run
job.extract(self._splitter)
[...]
[... many lines deleted ...]
[...]
instantiated = template.subst(params, self)
File "/Users/ryszard/util/Python/WikiExtractor.py", line 301, in subst
return ''.join([tpl.subst(params, extractor, depth) for tpl in self])
File "/Users/ryszard/util/Python/WikiExtractor.py", line 351, in subst
paramName = self.name.subst(params, extractor, depth+1)
File "/Users/ryszard/util/Python/WikiExtractor.py", line 294, in subst
logging.debug('subst tpl (%d, %d) %s', len(extractor.frame), depth, self)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/logging/__init
.py", line 1619, in debug
root.debug(msg, _args, *_kwargs)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/logging/init.py", line 1136, in debug
if self.isEnabledFor(DEBUG):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/logging/init.py", line 1348, in isEnabledFor
return level >= self.getEffectiveLevel()
RuntimeError: maximum recursion depth exceeded

KeyError: u' ' when using --html

When using WikiExtractor.py --no-templates --html en_wiki_500.xml
where en_wiki_500.xml is roughly the first 500 MB of the full english wikipedia dump, I get 6 such errors:

Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2494, in extract_process
    Extractor(*job[:3]).extract(out)  # (id, title, page)
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 442, in extract
    for line in compact(text):
  File "../Programming/wiki/wikiextractor-master/WikiExtractor.py", line 2122, in compact
    page.append(listItem[n] % line)
KeyError: u' '

UnicodeDecodeError when parsing Chinese Wikipedia

On parsing the Chinese Wikipedia (https://dumps.wikimedia.org/zhwiki/20150325/), I get the following error:

INFO: 596814    肖卡特·阿齐兹
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/jan/anaconda/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "WikiExtractor.py", line 2290, in run
    job.extract(self._splitter)
  File "WikiExtractor.py", line 399, in extract
    text = clean(self, text)
  File "WikiExtractor.py", line 1821, in clean
    text = extractor.expandTemplates(text)
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "WikiExtractor.py", line 671, in expandTemplate
    value = self.expandTemplates(instantiated)
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "WikiExtractor.py", line 671, in expandTemplate
    value = self.expandTemplates(instantiated)
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "WikiExtractor.py", line 606, in expandTemplate
    return self.expandTemplates(ret)
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "WikiExtractor.py", line 606, in expandTemplate
    return self.expandTemplates(ret)
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "WikiExtractor.py", line 606, in expandTemplate
    return self.expandTemplates(ret)
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "WikiExtractor.py", line 671, in expandTemplate
    value = self.expandTemplates(instantiated)
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
  File "WikiExtractor.py", line 582, in expandTemplate
    title = self.expandTemplates(parts[0].strip())
  File "WikiExtractor.py", line 450, in expandTemplates
    res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 2: ordinal not in range(128)

The line causing the error does not explicitly invoke any ASCII decoding mechanism, so I have no idea why it throws an error, but I guess there is a Chinese (or other non-ASCII) character in the template. This StackOverflow question (http://stackoverflow.com/questions/21393758/unicodedecodeerror-ascii-codec-cant-decode-byte-0xe5-in-position-0-ordinal) suggests adding

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

at the top of the document. I added the last two lines after the import lines at the top of the script, let the script run over it again (using the previously extracted templates), and it works. I also tried it without the reload command, but that didn't work, and I even found an explanation for that issue: http://stackoverflow.com/questions/3828723/why-we-need-sys-setdefaultencodingutf-8-in-a-py-script
So I would suggest adding

reload(sys)
sys.setdefaultencoding("utf-8")

to the script as it shouldn't hurt the extraction process (I can file a pull request if you like).
Thanks for your work! :)

Lists are not extracted

I have noticed that wikiextractor does not extract lists. I would expect list items to appear as new-line elements, possibly with tabs indicating the nesting level.

Here is an example of simple english wikipedia article about fish: https://simple.wikipedia.org/wiki/Fish

Extracted text does not contain the list, see example under:

Types of fish.
Fish, the oldest vertebrate group, includes a huge range of types, from the Middle Ordovician, about 490 million years ago, to the present day. These are the main groups:
Certain animals that have the word "fish" in their name are not really fish: Crayfish are crustaceans, and jellyfish are Cnidarians. Some animals look like fish, but are not. Whales and dolphins are mammals, for example.

Between those two paragraphs it should contain a list having multiple levels (see the link above).

My question is if such feature is planned to be implemented?

Multithreaded version

I have released a multithreaded version of the extractor that dispatches pages to several worker threads.
However, the speed does not seem to improve; it actually runs more slowly.
Can someone help figure out why?

Template expansion does not seem to work for french

First get the template file as TEMPLATES; this requires parsing the whole file.

python extractPage.py --id 275 ../frwiki-20150602-pages-articles.xml.bz2 >aikibudo
python WikiExtractor.py -o extracted --templates ../TEMPLATES -a aikibudo

I get

L' est un art martial traditionnel d'origine japonaise ("budō") essentiellement basé sur des techniques de défense.

Correct sentence

L'aïkibudo (合気武道, aikibudō?) est un art martial traditionnel d'origine japonaise (budō) essentiellement basé sur des techniques de défense.

Wiki text :

L'{{japonais|'''aïkibudo'''|合気武道|aikibudō}} est un [[art martial]] traditionnel d'origine [[japon]]aise (''[[budō]]'') essentiellement basé sur des techniques de défense.

Incorrect template expansion

The article Abraham Lincoln (307) contains the following:

* {{NYTtopic|people/l/abraham_lincoln}}

When using the --sections option, this is expanded to:

<li> [http://topics.nytimes.com/top/reference/timestopics/people/l/abraham_lincoln /index.html {{safesubst:#Invoke:String|replace|{{{{ </li>

The article Arabic Languages (803) contains a lot of templates at the beginning. These are expanded as:

{{#switch:
 {{#if: 
 | {{lc: }} 
 | main
 | other

I checked that the log contains no errors or warnings for these two articles.

ValueError: I/O operation on closed file

Yes, this is on Windows, but I don't think this is a problem with StringIO.

This error happens because the code opens the file in the main process of the program by creating the OutputSplitter object, then passes that object to another process created for reduce_process, and the write to the file happens in this second process. The problem is that sharing a file descriptor across processes doesn't work on Windows; I think it would be better if this part were refactored to open and close the file only inside the reduce process.
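
A minimal sketch of the suggested refactoring (names are illustrative, not the tool's actual ones): pass the output path to the reduce process and let that process open the file itself, so no file descriptor crosses the process boundary:

    from multiprocessing import Process, Queue

    def reduce_process(output_path, queue):
        # the file is opened inside the worker process, not in the parent
        with open(output_path, 'w', encoding='utf-8') as out:
            for chunk in iter(queue.get, None):    # None is the end-of-work sentinel
                out.write(chunk)

    if __name__ == '__main__':
        q = Queue()
        p = Process(target=reduce_process, args=('out.txt', q))
        p.start()
        q.put('<doc>...</doc>\n')
        q.put(None)
        p.join()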

Pause when running with "jawiki-20160111-pages-articles-multistream.xml"

Hi Sir,
I ran your tool "wikiextractor.py" on Ubuntu 14.04 with "jawiki-20160111-pages-articles-multistream.xml", but it paused at some point.
The last output lines are as below:
779 WARNING: Template errors in article 'ファイル:Hakutaka-485.jpeg' (303793): title(0) recursion(5, 0, 0)
780 WARNING: Template errors in article 'ファイル:Maizurukoyuuransen01.jpg' (303877): title(0) recursion(5, 0, 0)
781 WARNING: Template errors in article 'ファイル:Chigasakishiyakusyo041111.jpg' (303902): title(0) recursion(5, 0, 0)

And the process state is as below:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22415 yangming 20 0 517612 397328 2012 R 99.7 0.2 134:37.88 python

wiki extractor results directories end up in QN

I want to get the title and the content of every Wikipedia article, and I found the wiki extractor to be very useful for this purpose. I use it according to the instructions on GitHub. When running wiki extractor V2.8, I ran into a 'maximum template recursion' error after a few hours. I got wiki extractor from this GitHub page: https://github.com/bwbaugh/wikipedia-extractor/blob/master/WikiExtractor.py

So I tried previous versions: V2.6, V2.5 and V2.4.

In wiki extractor V2.4, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QH.

In wiki extractor V2.5, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.

In wiki extractor V2.6, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.

But I am really confused, because I have no idea which version produced the complete set of Wikipedia articles. It seems to me that none of them succeeded: the resulting directory should contain AA to AZ, BA to BZ, ..., QA to QZ, RA to RZ, and so on up to ZA to ZZ, but in V2.5 and V2.6 it stops at QN.

Could any one who run the wiki extractor successfully please shed some light on me? What should the successful result look like? And which version should I run to get the correct result?

Character entities in page titles should be converted

Character entities in metadata are currently preserved, although it is implied that they should not be.
For example,

<title>&quot;Weird Al&quot; Yankovic</title>

currently produces the output:

<doc id="36733" url="http://fr.wikipedia.org/wiki?curid=36733" title="&quot;Weird Al&quot; Yankovic">
&quot;Weird Al&quot; Yankovic

At least in the "plain text" below the <doc... line, the character entities &quot; should be converted to normal quotation marks.

NameError: inside makeInternalLink

Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "wikiextractor-master/WikiExtractor.py", line 2335, in run
File "wikiextractor-master/WikiExtractor.py", line 407, in extract
self.magicWords['currentday'] = time.strftime('%d')
File "wikiextractor-master/WikiExtractor.py", line 1877, in clean
"""
File "wikiextractor-master/WikiExtractor.py", line 1486, in replaceInternalLinks
end = m.end()
File "wikiextractor-master/WikiExtractor.py", line 1764, in makeInternalLink
# # batch file existence checks for NS_FILE and NS_MEDIA
NameError: global name 'anchor' is not defined

need to rewrite:
return '<a href="%s">%s</a>' % (urllib.quote(title.encode('utf-8')), anchor)
to:
return '<a href="%s">%s</a>' % (urllib.quote(title.encode('utf-8')), label)

How to use extractPage.py

I have a list of Wikipedia URLs in a text file, and I want the text of only these articles, not the whole wiki dump.
How can I use the extractor to do so?
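
One way to approach this with the tools described in the README, assuming each URL can first be mapped to its numeric page id (the id and dump name below are only examples), is to call extractPage once per page:

extractPage --id 12345 enwiki-latest-pages-articles.xml.bz2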

Malformed XML/HTML and invalid links

Extracted text is not being escaped, which in some cases results in malformed XML.

For example, the third sentence of Inequality (mathematics) is rendered as:

For the use of the "<" and ">" signs as punctuation, see <a href="Bracket">Bracket</a>.

The correct output would be:

For the use of the "&lt;" and "&gt;" signs as punctuation, see <a href="Bracket">Bracket</a>.

Similarly, the extracted text of Brian Kernighan contains:

The first documented <a href=""Hello, world!" program">"Hello, world!" program</a>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)

which should instead be:

The first documented <a href="&quot;Hello, world!&quot; program">"Hello, world!" program</a>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)

The same applies to page titles in the <doc> elements.

Another issue with links is that most of them are not really hypertext links, but wikilinks. My opinion is that wikilinks should be represented using a different element, e.g. <wikilink>, so the sentence above would become:

The first documented <wikilink page="&quot;Hello, world!&quot; program">"Hello, world!" program</wikilink>, in Kernighan's "A Tutorial Introduction to the Language B" (1972)
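
For reference, the escaping requested above can be produced with the standard library; a small sketch (not the extractor's code):

    from xml.sax.saxutils import escape

    escape('the "<" and ">" signs')                      # 'the "&lt;" and "&gt;" signs'
    escape('"Hello, world!" program', {'"': '&quot;'})   # '&quot;Hello, world!&quot; program'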

Infinite recursion

Trying to parse the enwiki-20150304-pages-articles.xml.bz2 dump causes infinite recursion:

Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 851, in emit
Traceback (most recent call last):
  File "./WikiExtractor.py", line 1708, in <module>
    main()
  File "./WikiExtractor.py", line 1704, in main
    process_data(input_file, args.templates, output_splitter)
  File "./WikiExtractor.py", line 1537, in process_data
    extract(id, title, page, output)
  File "./WikiExtractor.py", line 132, in extract
    text = clean(text)
  File "./WikiExtractor.py", line 1172, in clean
    text = expandTemplates(text)
  File "./WikiExtractor.py", line 317, in expandTemplates
    res += expandTemplate(text[s+2:e-2], frame)
  File "./WikiExtractor.py", line 730, in expandTemplate
    ret =  expandTemplates(template, frame)
  File "./WikiExtractor.py", line 317, in expandTemplates
    res += expandTemplate(text[s+2:e-2], frame)
  File "./WikiExtractor.py", line 699, in expandTemplate
    params = templateParams(parts[1:])
  File "./WikiExtractor.py", line 406, in templateParams
    parameters[i] = expandTemplates(p)
  File "./WikiExtractor.py", line 317, in expandTemplates
    res += expandTemplate(text[s+2:e-2], frame)
  File "./WikiExtractor.py", line 730, in expandTemplate
    ret =  expandTemplates(template, frame)
  File "./WikiExtractor.py", line 317, in expandTemplates
    res += expandTemplate(text[s+2:e-2], frame)
  File "./WikiExtractor.py", line 699, in expandTemplate
    params = templateParams(parts[1:])
  File "./WikiExtractor.py", line 406, in templateParams
    parameters[i] = expandTemplates(p)
  ...
  File "./WikiExtractor.py", line 317, in expandTemplates
    res += expandTemplate(text[s+2:e-2], frame)
  File "./WikiExtractor.py", line 699, in expandTemplate
    params = templateParams(parts[1:])
  File "./WikiExtractor.py", line 406, in templateParams
    parameters[i] = expandTemplates(p)
  File "./WikiExtractor.py", line 317, in expandTemplates
    res += expandTemplate(text[s+2:e-2], frame)
  File "./WikiExtractor.py", line 735, in expandTemplate
    + str(maxTemplateRecursionLevels))
  File "/usr/lib/python2.7/logging/__init__.py", line 1604, in warning
    root.warning(msg, *args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1164, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
    self.handleError(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 802, in handleError
    None, sys.stderr)
  File "/usr/lib/python2.7/traceback.py", line 125, in print_exception
    print_tb(tb, limit, file)
  File "/usr/lib/python2.7/traceback.py", line 69, in print_tb
    line = linecache.getline(filename, lineno, f.f_globals)
  File "/usr/lib/python2.7/linecache.py", line 14, in getline
    lines = getlines(filename, module_globals)
RuntimeError: maximum recursion depth exceeded

The last log entry is: INFO:root:967 Acute disseminated encephalomyelitis.

AttributeError when using the namespaces option

File "wikiextractor-master/WikiExtractor.py", line 2453, in
main()
File "wikiextractor-master/WikiExtractor.py", line 2409, in main
acceptedNamespaces = set(args.ns.split(','))
AttributeError: 'Namespace' object has no attribute 'ns'

happens regardless of using -ns or --namespaces

fixed by acceptedNamespaces = set(args.namespaces.split(','))

<Doc> Tags are not removed.

Hi ,

I have tried cleaning my document using this command: python WikiExtractor.py testfile.xml -o extracted. And I got the following result:

<doc id="700292" url="https://en.wikipedia.org/wiki?curid=700292" title="Category:Scientific method">
Category:Scientific method

A scientific method is a sequence or collection of processes that are considered characteristic of scientific investigation and the acquisition of new scientific knowledge based upon physical evidence.

</doc>
<doc id="3674304" url="https://en.wikipedia.org/wiki?curid=3674304" title="Category:Scientific terminology">
Category:Scientific terminology


</doc>
<doc id="11886695" url="https://en.wikipedia.org/wiki?curid=11886695" title="Category:Science and engineering awards">
Category:Science and engineering awards

This category includes all prizes and awards relating to science, engineering, technology, and mathematics.

Scientific criticisms are a body of analysis of scientific methodologies, philosophies, and possible negative roles media and politics play in scientific research. Criticism of science is distinct from the academic positions of antiscience or anti-intellectualism which seek to reject entirely the scientific method. Rather, criticism is made to address and refine problems within the sciences in order to improve science as a whole, and its role in society.


That is, the <doc> tags are still visible. How do I remove all XML tags using wikiextractor?

I have tried :

` python WikiExtractor.py testfile.xml -o  --no-templates --escapedoc`

but unfortunately it does not work either. Please help me.



Thank you
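
For what it's worth, a small post-processing sketch that strips the <doc> wrapper lines from an extracted file (the path and regex are illustrative, not part of wikiextractor):

    import re

    with open('extracted/AA/wiki_00', encoding='utf-8') as f:
        text = f.read()
    plain = re.sub(r'</?doc[^>]*>\n?', '', text)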

Add option for extracting only template definitions

It would be nice to have an option (e.g. --no-articles) to only extract templates and not extract any articles.

It should be possible to achieve this by running WikiExtractor.py --templates templates.xml dump.xml.bz2 and killing it after it has finished preprocessing the templates. Will it work the way I want?

Thanks!


TypeError: 'in <string>' requires string as left operand, not NoneType

File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "wikiextractor-master/WikiExtractor.py", line 2355, in run
job.extract(self._splitter)
File "wikiextractor-master/WikiExtractor.py", line 432, in extract
for line in compact(text):
File "wikiextractor-master/WikiExtractor.py", line 2033, in compact
if n not in '*#;:':
TypeError: 'in <string>' requires string as left operand, not NoneType

Only happens when using --html

relevant args: "python", "wikiextractor-master/WikiExtractor.py","-b", "100M", "-q", "-ns", "ns0",
"--html" , "--no-templates"

replaceExternalLinks removes wrong closing square bracket with nested internal link

To replicate:

replaceExternalLinks('[http://example.org [[internal|link]] text]')

result:

'[[internal|link] text]'

expected:

'[[internal|link]] text'

This causes further breakage in downstream processing.

Internal links within external links appear with some frequency in the raw Wikipedia data and are reasonably resolved by MediaWiki. For example,

[http://www.palaeos.org/Proteobacteria Proteobacteria information from [[Palaeos]].]

occurs in the source for https://en.wikipedia.org/wiki/Proteobacteria#External_links .

(Found when trying to fix #54, which previously masked this issue.)

Syntax Error

python WikiExtractor.py
File "WikiExtractor.py", line 877
afterPat = { o:re.compile(openPat+'|'+c, re.DOTALL) for o,c in izip(openDelim, closeDelim)}
^
SyntaxError: invalid syntax

Any idea what I am doing wrong here? I am using python 2.6.1

Compile Error

root@f173:/home/onet/tests/chainer/wiki# wikiextractor Mozard.en
Traceback (most recent call last):
File "/usr/local/bin/wikiextractor", line 9, in
load_entry_point('wikiextractor==2.42', 'console_scripts', 'wikiextractor')()
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 351, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2363, in load_entry_point
return ep.load()
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 2088, in load
entry = __import__(self.module_name, globals(), globals(), ['__name__'])
ImportError: No module named WikiExtractor

Wrong dates parse for French Wiki

HI!
For Gabriel Biancheri https://fr.wikipedia.org/wiki?curid=9171

XML dump:
== Détail des mandats ==

  • {{date|21|mars|1977}} - {{date|13|mars|1983}} : adjoint au maire d'[[Hauterives]]([[Drôme %28département%29|Drôme]])
  • {{date|22|mars|1982}} - {{date|28|décembre|2010}} : conseiller général de la Drôme, élu dans le [[canton du Grand-Serre]]
  • {{date|14|mars|1983}} - {{date|28|décembre|2010}} : maire d'Hauterives
  • {{date|23|mars|1992}} - {{date|1er|avril|2001}} : conseiller régional de [[Rhône-Alpes]]
  • {{date|30|mars|1992}} - {{date|19|septembre|2002}} : vice-président du [[conseil général de la Drôme]]
  • {{date|19|juin|2002}} - {{date|28|décembre|2010}} : député de la [[quatrième circonscription de la Drôme]]

After processing all dates were gone:
Détail des mandats.

  • - : adjoint au maire d'Hauterives (Drôme)
  • - : conseiller général de la Drôme, élu dans le canton du Grand-Serre
  • - : maire d'Hauterives
  • - : conseiller régional de Rhône-Alpes
  • - : vice-président du conseil général de la Drôme
  • - : député de la quatrième circonscription de la Drôme

Extraction of articles cut off

Articles get cut off if they are in the last or one of the later positions in the .xml dump. This problem is not present if the output goes to the command line, and it shows up consistently when outputting to a file. The last version we used in our project was 2.32, which did not produce this error.

'maximum template recursion' error after a few hours

Can you explain why this error occurs?
I used the updated version of the script uploaded yesterday.
Now it's giving this error.

Traceback (most recent call last):
File "./WikiExtractor.py", line 1797, in
main()
File "./WikiExtractor.py", line 1793, in main
process_data(input_file, args.templates, output_splitter)
File "./WikiExtractor.py", line 1621, in process_data
extract(id, title, page, output)
File "./WikiExtractor.py", line 132, in extract
text = clean(text)
File "./WikiExtractor.py", line 1256, in clean
text = expandTemplates(text)
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 808, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 769, in expandTemplate
params = templateParams(parts[1:], depth)
File "./WikiExtractor.py", line 396, in templateParams
parameters = [expandTemplates(p, frame) for p in parameters]
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 769, in expandTemplate
params = templateParams(parts[1:], depth)
File "./WikiExtractor.py", line 396, in templateParams
parameters = [expandTemplates(p, frame) for p in parameters]
File "./WikiExtractor.py", line 307, in expandTemplates
res += expandTemplate(text[s+2:e-2], depth+l)
File "./WikiExtractor.py", line 808, in expandTemplate
ret = expandTemplates(template, depth + 1)
File "./WikiExtractor.py", line 313, in expandTemplates
res += text[cur:]
MemoryError

Cross-language link support

It does not parse something like [[:en:Link|Link]] and removes the link completely.

For example, the markup "Hi, This is a test ([[:en:Hi|Hi]])" will produce the output "Hi, This is a test ()". The correct output would have been "Hi, This is a test (Hi)".

Typo in readme and help output

You have this in the readme and as output of -h:

 -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to stdin)

Should be stdout, I guess?

WARNING: template errors / recursion

I encountered the problem after running WikiExtractor.py on an italian wiki dump

WikiExtractor.py Version: 2.36 (August 31, 2015)

Error Output:

INFO: Preprocessed 2300000 pages
INFO: Preprocessed 2400000 pages
INFO: Loaded 126134 templates in 750.0s
INFO: Starting page extraction from res/02/itwiki-latest-pages-articles.xml.xml.
INFO: Using 4 extract processes.
WARNING: template errors 'Archeologia subacquea' (12): untitled(163) recursion(0,0,0)
WARNING: template errors 'Arte' (19): untitled(163) recursion(0,0,0)
WARNING: template errors 'Sport individuale' (83): untitled(163) recursion(0,0,0)
WARNING: template errors 'Astronomo' (268): untitled(337) recursion(0,0,0)
WARNING: template errors 'America Centrale' (279): untitled(163) recursion(0,0,0)
WARNING: template errors 'Antoine de Saint-Exupéry' (285): untitled(167) recursion(0,0,0)
WARNING: template errors 'Abramo Abulafia' (293): untitled(163) recursion(0,0,0)
WARNING: template errors 'Aeroporto' (286): untitled(330) recursion(0,0,0)
WARNING: template errors 'Alan Kay' (330): untitled(163) recursion(0,0,0)
WARNING: template errors 'Agropoli' (315): untitled(163) recursion(0,0,0)
WARNING: template errors 'Antonio Fogazzaro' (336): untitled(163) recursion(0,0,0)
WARNING: template errors 'Alpi' (337): untitled(163) recursion(0,0,0)

WikiExtractor.py Version: 2.34 (June 2, 2015)

Error Output:

WARNING: Max template recursion exceeded!
INFO: 15 Analisi delle frequenze
INFO: 18 Aerofoni
INFO: 19 Arte
WARNING: Max template recursion exceeded!
WARNING: Max template recursion exceeded!
WARNING: Max template recursion exceeded!
WARNING: Max template recursion exceeded!
WARNING: Max template recursion exceeded!

Link to italian Wiki Dump:
http://download.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2

KeyErrors introduced by fillvalue

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "wikiextractor-master/WikiExtractor.py", line 2355, in run
job.extract(self._splitter)
File "wikiextractor-master/WikiExtractor.py", line 432, in extract
for line in compact(text):
File "wikiextractor-master/WikiExtractor.py", line 2035, in compact
page.append(listClose[c])
KeyError: ' '

File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "wikiextractor-master/WikiExtractor.py", line 2355, in run
job.extract(self._splitter)
File "wikiextractor-master/WikiExtractor.py", line 432, in extract
for line in compact(text):
File "wikiextractor-master/WikiExtractor.py", line 2043, in compact
page.append(listClose[c])
KeyError: ' '

fix:
change if c: to if c != ' ':

replaceInternalLinks() deletes first and last characters of text in single square brackets

To replicate:

from WikiExtractor import replaceInternalLinks
replaceInternalLinks('ratio of [NADH]')

result:

'ratio of AD'

expected:

'ratio of [NADH]'

Single-bracketed strings such as [NADH] are not markup and should be preserved in the text (see e.g. https://en.wikipedia.org/wiki/Adenosine_triphosphate#Regulation_of_biosynthesis, the "ratio of [NADH]" text below the formula)

This issue appears to be caused by one of the following invocations of findBalanced:

https://github.com/attardi/wikiextractor/blob/master/WikiExtractor.py#L1710
https://github.com/attardi/wikiextractor/blob/master/WikiExtractor.py#L1728

in both cases the problem is that the simple parens around the strings '[[' and ']]' do not create tuples with a single value but are rather ignored, i.e.

findBalanced(text, ('[['), (']]'))

is equivalent to

findBalanced(text, '[[', ']]')

which is then expanded in findBalanced (https://github.com/attardi/wikiextractor/blob/master/WikiExtractor.py#L1040)

openPat = '|'.join([re.escape(x) for x in openDelim])

into a pattern \[|\[ that also matches a single square bracket. The caller then swallows two characters from the start and end of the match (https://github.com/attardi/wikiextractor/blob/master/WikiExtractor.py#L1718)

inner = text[s + 2:e - 2]

It should be possible to fix this by adding commas to make the findBalanced arguments into single-valued tuples (as likely intended), i.e. replacing

findBalanced(text, ('[['), (']]'))

with

findBalanced(text, ('[[',), (']]',))
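
The parenthesization point is easy to verify interactively (plain Python, independent of the extractor):

    import re

    ('[[') == '[['                             # True: parentheses alone do not create a tuple
    ('[[',) == ('[[',)                         # the trailing comma creates the one-element tuple
    '|'.join(re.escape(x) for x in '[[')       # pattern \[|\[ : iterating the string yields single brackets
    '|'.join(re.escape(x) for x in ('[[',))    # pattern \[\[ : iterating the tuple yields '[['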

BTW thanks for a great tool!
