New to Python and WikiMedia so it's possible I'm doing something wrong, but I'm gettin


Parser Error about wikiextractor HOT 5 CLOSED

attardi commented on June 26, 2024

Parser Error

from wikiextractor.

Comments (5)

NoMoreFood commented on June 26, 2024

It appears that something like this is enough to set it off:

<text>
#*: SomeText
# {{given name|female|from=English}}; SomeText
</text>


attardi commented on June 26, 2024

The program is meant to work with the dump of the Wikipedia articles, not for Wiktionary.


NoMoreFood commented on June 26, 2024

Fair enough. I thought all of Wikimedia uses the same markup -- reinforced by the fact that the program is mentioned on the generic parser list that Wiktionary points to. If what I posted above would still be valid markup within a Wikipedia article (or you just want to improve error handling), it may be worth investigating. Thanks.


attardi commented on June 26, 2024

Wiktionary uses the same markup, but different naming conventions. For example, the titles of templates are capitalized in Wikipedia and not in Wiktionary.
Hence the template mechanism currently does not work.

BTW, I tried with the dump you mentioned, but I did not get the error you report.
I got many more pages from the dump also:

INFO: Preprocessed 4260000 pages
INFO: Starting processing pages from enwiktionary-latest-pages-articles.xml.bz2.
INFO: Using 2 CPUs.
INFO: 16 dictionary

Please download the latest version of WikiExtractor and of Wiktionary and retry.
Delete your previous saved TEMPLATES, or they will be used instead of the new ones.

Don't expect much out of it, though. For example, the entry for 'dictionary' will just be this:

English.
Etymology.
, from , from , from , perfect past participle of + .

since no template gets expanded.


NoMoreFood commented on June 26, 2024

Ok, I'll try tonight -- I might have been using a filtered page set when I was trying to narrow down what was going wrong. I should mention that I was trying to expand templates, and the parser seemed to at least run without error after stripping Unicode (might be another issue) and after replacing "^([#*]+){" with "\1 delete_me_later{" to address the aforementioned issue. More to come. Thanks for the help.
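
For anyone wanting to try the same workaround, here is a minimal sketch of that substitution in Python. It is not part of WikiExtractor: the function names and the sample text are made up for illustration, the brace is escaped for clarity, and the cleanup step is only an assumption about how the marker would be removed afterwards.

```python
import re

# Pattern from the comment above: list markers (# or *) at the start of a
# line, immediately followed by an opening brace.
LIST_MARKER_BEFORE_BRACE = re.compile(r"^([#*]+)\{", flags=re.MULTILINE)

# Placeholder text from the comment above; any unlikely token would do.
MARKER = "delete_me_later"


def pad_list_templates(wikitext: str) -> str:
    """Insert a space and a placeholder between list markers and a brace."""
    return LIST_MARKER_BEFORE_BRACE.sub(r"\1 " + MARKER + "{", wikitext)


def strip_marker(text: str) -> str:
    """Hypothetical cleanup: remove the placeholder from the extracted text."""
    return text.replace(MARKER, "")


# Hypothetical input: the first line has a template directly after the list
# markers, the second already has a space and is left untouched.
sample = "#*{{quote|text}}\n# {{given name|female|from=English}}; SomeText\n"
padded = pad_list_templates(sample)
print(padded)
```

Only lines where the brace directly follows the list markers are changed, so the preprocessing can be run over the whole dump text before it is passed to the parser.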

