
Comments (18)

psibre commented on August 20, 2024

FWIW, the <doc id="..." url="..." title="...">...</doc> output format implies a certain XML affinity. However, the lack of a common single root element makes many XML parsers barf. IMHO, it would make sense to wrap the entire output text file in some top level element.
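The wrapping psibre suggests could be sketched as follows; this is only an illustration (parse_extract and the <docs> root element are made-up names, not part of the tool), and it assumes the body text itself contains no unescaped markup characters:

```python
from xml.etree import ElementTree as ET

def parse_extract(text):
    # Wrap the sequence of <doc> elements in a synthetic root element
    # so that a standard XML parser accepts the file. This only works
    # when the body text contains no unescaped & or < characters.
    return ET.fromstring("<docs>" + text + "</docs>")

sample = (
    '<doc id="1" url="http://example.org/1" title="First">\n'
    'Body of the first article.\n'
    '</doc>\n'
    '<doc id="2" url="http://example.org/2" title="Second">\n'
    'Body of the second article.\n'
    '</doc>\n'
)
docs = parse_extract(sample)
```

Without such a root, feeding the raw file to an XML parser fails immediately, which is the problem reported here.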

from wikiextractor.

attardi commented on August 20, 2024

The output should be text, not HTML, hence it is correct that HTML entities are converted to characters, exactly as they appear when reading the page.
In the case of entities within URLs, they should be converted to percent-encoding, I suppose.
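The conversion described here could look like the following sketch, using Python's standard urllib (urlencode_link is an illustrative name, not the tool's actual function):

```python
from urllib.parse import quote

def urlencode_link(target):
    # Percent-encode characters that are unsafe inside an href value,
    # leaving path separators and colons intact.
    return quote(target, safe="/:")

encoded = urlencode_link("Node (computer science)")
```

Spaces and parentheses become %20, %28 and %29, so a link target no longer injects raw characters into the attribute value.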

Blemicek commented on August 20, 2024

Actually, the output is XML, not plain text. It should contain the predefined XML entities (http://www.w3.org/TR/2004/REC-xml-20040204/#sec-predefined-ent) instead of ", &, ', < and >, for the sake of future parsing.

attardi commented on August 20, 2024

The input is XML, the output is plain text.
That is the intended use.
I use it for extracting a text corpus for performing linguistic analysis: parsing, QA, creating word embeddings, etc.
If the content were not converted, you would get a lot of crap in the output, including comments, etc.

I guess I could add an option to avoid conversion, if that helps.


attardi commented on August 20, 2024

Links are now urlencoded.

cifkao commented on August 20, 2024

If the output is supposed to be plain text, then it does not make sense to represent links, lists and headings using HTML tags (<a>, <h1>, <li>) and it's impossible to parse such output (what if the actual text of the article contains some of these tags, or worse, the <doc> tag, which is unlikely, but possible?).

attardi commented on August 20, 2024

No tags will be present in the output: they all get stripped out, even if one appears literally in the source.
The anchors are only present if you ask for them using the option to preserve links.
Use at your own discretion.

cifkao commented on August 20, 2024

The <a> tags are not the only issue. If the --sections option is used, <li> and <h1>, <h2> etc. are inserted. If the option is not used, section headings and list items are completely removed (which breaks disambiguation pages, for example, where all the interesting information is present as list items).

attardi commented on August 20, 2024

Same reason: they are inserted if you ask for them.
All tables and lists are removed, because they do not form linguistic sentences.
If you want to preserve the structure, you need a different tool.

Blemicek commented on August 20, 2024

It seems that some HTML/XML tags are not removed from template output. E.g. in the article HTML element:

<doc id="274393" url="http://en.wikipedia.org/wiki?curid=274393" title="HTML element">
HTML element

An <abbr title="Hyper Text Markup Language">HTML</abbr> element is an individual component of an <a href="HTML">HTML</a> document or <a href="web page">web page</a>, once this has been parsed into the <a href="HTML Document Object Model">Document Object Model</a>. HTML is composed of a <a href="Tree structure">tree</a> of HTML elements and other <a href="Node (computer science)">nodes</a>, such as text nodes. Each element can have <a href="HTML attribute">HTML attributes</a> specified. Elements can also have content, including other elements and text. Many HTML elements represent <a href="semantics">semantics</a>, or meaning. For example, the codice_1 element represents the title of the document.

...

</doc>

(Anyway, it is a bit confusing to use XML/HTML tags in a plain text.)

attardi commented on August 20, 2024

I added <abbr> to the list of ignoredTags.
The case of the article HTML element is a little peculiar, since it is about HTML, hence the text extracted from the page should contain tags.
That page however is written using the extension SyntaxHighlight.
So now the content of <syntaxhighlight> is not converted.

attardi commented on August 20, 2024

I agree that it might be confusing. But the format is not meant to be an XML format.
If it were XML, then all sorts of escaping would have to be done, for instance to handle character entities, etc.
But this would defeat the purpose of a text extractor.
The output is just text, with tags used to separate the documents.
It is meant for easy processing: you can just drop the tags with a one-liner sed script.
You are not supposed to use an XML parser, since there is no need for it.
Actually, the use of an XML parser is definitely discouraged, for the reasons mentioned above.
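The tag-dropping step really can be a one-liner; a Python equivalent of such a sed script might look like this (strip_doc_tags is an illustrative name, not part of the tool):

```python
import re

def strip_doc_tags(text):
    # Remove the <doc ...> and </doc> separator lines, keeping only
    # the body text; this mirrors what the one-liner sed approach does.
    return re.sub(r"(?m)^</?doc.*\n?", "", text)
```

This treats the markers purely as line-oriented separators, which is the intended consumption model: no XML parser involved.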

psibre commented on August 20, 2024

I agree that everything between the <doc...> and </doc> is, and should be, plain text.
But the fact that the (sparse) metadata is still encoded in an XML-like way with attributes that do use character entities undermines the effort to avoid XML...

For example, the page for "Weird Al" Yankovic produces something like <doc ... title="&quot;Weird Al&quot; Yankovic">. It seems a bit odd to output XML-like elements with attributes, but to discourage XML parsing to extract the attribute and convert the entities. Why not produce something like JSON instead?
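A JSON Lines variant of the format, as suggested here, would sidestep entity escaping entirely, since the JSON encoder handles quoting. This is only a sketch of the idea (doc_to_json is an illustrative name, and the id/url values are placeholders):

```python
import json

def doc_to_json(doc_id, url, title, text):
    # Emit one JSON object per document; the encoder escapes quotes in
    # titles such as "Weird Al" Yankovic automatically.
    return json.dumps(
        {"id": doc_id, "url": url, "title": title, "text": text},
        ensure_ascii=False,
    )

line = doc_to_json("1", "http://example.org/1",
                   '"Weird Al" Yankovic', "Article body ...")
```

Any JSON parser then recovers the metadata without ambiguity, and the body remains plain text inside the "text" field.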

psibre commented on August 20, 2024

I realize that my comments are going a bit off-topic, so I have opened #30.

nathj07 commented on August 20, 2024

Hi,
First up this is a good tool and I'm generally finding it very useful.

I may be late to the party here, but this is a big issue. The presence of < as plain text within the <doc>...</doc> tags causes decoding to break. So when decoding the XML using tokenization, it breaks on the presence of < inside the tags, typically with something like XML syntax error on line xx: expected element name after <

It was mentioned above that there could be a flag introduced to handle this so that those characters in that position get escaped. Has any progress been made on this? If need be I'd be happy to help out with that - given a pointer in the right direction.

Thanks
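The escaping such a flag would perform could be as simple as the following sketch, using Python's standard xml.sax.saxutils (escape_body is an illustrative name, not an existing option or function of the tool):

```python
from xml.sax.saxutils import escape

def escape_body(text):
    # Replace the five characters that break XML tokenization with
    # their predefined entities: & < > " '
    return escape(text, {'"': "&quot;", "'": "&apos;"})

escaped = escape_body('x < y & "z"')
```

With body text escaped this way, a raw < can no longer be mistaken for the start of an element.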

attardi commented on August 20, 2024

Would it help just enclosing the text within <![CDATA[ ... ]]>?

-- Beppe


nathj07 commented on August 20, 2024

An interesting idea; I did think that would work in my use case. However, when I ran some simple tests I ended up with an unexpected EOF error.

I think perhaps a command-line flag to enable escaping of characters within the <doc>...</doc> tags would work. How does that sound?

nathj07 commented on August 20, 2024

How does that PR look for this?
