
Comments (18)

psibre commented on August 20, 2024

FWIW, the <doc id="..." url="..." title="...">...</doc> output format implies a certain XML affinity. However, the lack of a common single root element makes many XML parsers barf. IMHO, it would make sense to wrap the entire output text file in some top level element.
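The wrapping psibre suggests could be sketched as follows; this is only an illustration (parse_extract and the <docs> root element are made-up names, not part of the tool), and it assumes the body text itself contains no unescaped markup characters:

```python
from xml.etree import ElementTree as ET

def parse_extract(text):
    # Wrap the sequence of <doc> elements in a synthetic root element
    # so that a standard XML parser accepts the file. This only works
    # when the body text contains no unescaped & or < characters.
    return ET.fromstring("<docs>" + text + "</docs>")

sample = (
    '<doc id="1" url="http://example.org/1" title="First">\n'
    'Body of the first article.\n'
    '</doc>\n'
    '<doc id="2" url="http://example.org/2" title="Second">\n'
    'Body of the second article.\n'
    '</doc>\n'
)
docs = parse_extract(sample)
```

Without such a root, feeding the raw file to an XML parser fails immediately, which is the problem reported here.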

from wikiextractor.

attardi commented on August 20, 2024

The output should be text, not HTML, hence it is correct that HTML entities are converted to characters, exactly as they appear when reading the page.
In the case of entities within URLs, they should be converted to percent-encoding, I suppose.
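The conversion described here could look like the following sketch, using Python's standard urllib (urlencode_link is an illustrative name, not the tool's actual function):

```python
from urllib.parse import quote

def urlencode_link(target):
    # Percent-encode characters that are unsafe inside an href value,
    # leaving path separators and colons intact.
    return quote(target, safe="/:")

encoded = urlencode_link("Node (computer science)")
```

Spaces and parentheses become %20, %28 and %29, so a link target no longer injects raw characters into the attribute value.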

Blemicek commented on August 20, 2024

Actually, the output is XML, not plain text. It should contain the predefined XML entities (http://www.w3.org/TR/2004/REC-xml-20040204/#sec-predefined-ent) instead of ", &, ', < and >, for the sake of future parsing.

attardi commented on August 20, 2024

The input is XML, the output is plain text.
That is the intended use.
I use it for extracting a text corpus for performing linguistic analysis: parsing, QA, creating word embeddings, etc.
If the content were not converted, you would get a lot of crap in the output, including comments, etc.

I guess I could add an option to avoid conversion, if that helps.


attardi commented on August 20, 2024

Links are now urlencoded.

cifkao commented on August 20, 2024

If the output is supposed to be plain text, then it does not make sense to represent links, lists and headings using HTML tags (<a>, <h1>, <li>) and it's impossible to parse such output (what if the actual text of the article contains some of these tags, or worse, the <doc> tag, which is unlikely, but possible?).

attardi commented on August 20, 2024

No tags will be present in the output: they all get stripped out, even if one appears literally in the source.
The anchors are only present if you ask for them using the option to preserve links.
Use at your own discretion.

cifkao commented on August 20, 2024

The <a> tags are not the only issue. If the --sections option is used, <li> and <h1>, <h2> etc. are inserted. If the option is not used, section headings and list items are completely removed (which breaks disambiguation pages, for example, where all the interesting information is present as list items).

attardi commented on August 20, 2024

Same reason: they are inserted if you ask for them.
All tables and lists are removed, because they do not form linguistic sentences.
If you want to preserve the structure, you need a different tool.

Blemicek commented on August 20, 2024

It seems that some HTML/XML tags are not removed from template output. E.g. in the article HTML element:

<doc id="274393" url="http://en.wikipedia.org/wiki?curid=274393" title="HTML element">
HTML element

An <abbr title="Hyper Text Markup Language">HTML</abbr> element is an individual component of an <a href="HTML">HTML</a> document or <a href="web page">web page</a>, once this has been parsed into the <a href="HTML Document Object Model">Document Object Model</a>. HTML is composed of a <a href="Tree structure">tree</a> of HTML elements and other <a href="Node (computer science)">nodes</a>, such as text nodes. Each element can have <a href="HTML attribute">HTML attributes</a> specified. Elements can also have content, including other elements and text. Many HTML elements represent <a href="semantics">semantics</a>, or meaning. For example, the codice_1 element represents the title of the document.

...

</doc>

(Anyway, it is a bit confusing to use XML/HTML tags in a plain text.)

attardi commented on August 20, 2024

I added <abbr> to the list of ignoredTags.
The case of the article HTML element is a little peculiar, since it is about HTML, hence the text extracted from the page should contain tags.
That page however is written using the extension SyntaxHighlight.
So now the content of <syntaxhighlight> is not converted.

attardi commented on August 20, 2024

I agree that it might be confusing. But the format is not meant to be an XML format.
If it were XML, then all sorts of escaping would have to be done, for instance to handle character entities, etc.
But this would defeat the purpose of a text extractor.
The output is just text, with tags used to separate the documents.
It is meant for easy processing: you can just drop the tags with a one-liner sed script.
You are not supposed to use an XML parser, since there is no need for it.
Actually, the use of an XML parser is definitely discouraged, for the reasons mentioned above.
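The tag-dropping step really can be a one-liner; a Python equivalent of such a sed script might look like this (strip_doc_tags is an illustrative name, not part of the tool):

```python
import re

def strip_doc_tags(text):
    # Remove the <doc ...> and </doc> separator lines, keeping only
    # the body text; this mirrors what the one-liner sed approach does.
    return re.sub(r"(?m)^</?doc.*\n?", "", text)
```

This treats the markers purely as line-oriented separators, which is the intended consumption model: no XML parser involved.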

psibre commented on August 20, 2024

I agree that everything between the <doc...> and </doc> is, and should be, plain text.
But the fact that the (sparse) metadata is still encoded in an XML-like way with attributes that do use character entities undermines the effort to avoid XML...

For example, the page for "Weird Al" Yankovic produces something like <doc ... title="&quot;Weird Al&quot; Yankovic">. It seems a bit odd to output XML-like elements with attributes, but to discourage XML parsing to extract the attribute and convert the entities. Why not produce something like JSON instead?
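A JSON Lines variant of the format, as suggested here, would sidestep entity escaping entirely, since the JSON encoder handles quoting. This is only a sketch of the idea (doc_to_json is an illustrative name, and the id/url values are placeholders):

```python
import json

def doc_to_json(doc_id, url, title, text):
    # Emit one JSON object per document; the encoder escapes quotes in
    # titles such as "Weird Al" Yankovic automatically.
    return json.dumps(
        {"id": doc_id, "url": url, "title": title, "text": text},
        ensure_ascii=False,
    )

line = doc_to_json("1", "http://example.org/1",
                   '"Weird Al" Yankovic', "Article body ...")
```

Any JSON parser then recovers the metadata without ambiguity, and the body remains plain text inside the "text" field.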

psibre commented on August 20, 2024

I realize that my comments are going a bit off-topic, so I have opened #30.

nathj07 commented on August 20, 2024

Hi,
First up this is a good tool and I'm generally finding it very useful.

I may be late to the party here, but this is a big issue. The presence of < as plain text within the <doc>...</doc> tags causes decoding to break. So when decoding the XML using tokenization, it breaks on the presence of < inside the tags, typically with something like XML syntax error on line xx: expected element name after <

It was mentioned above that there could be a flag introduced to handle this so that those characters in that position get escaped. Has any progress been made on this? If need be I'd be happy to help out with that - given a pointer in the right direction.

Thanks
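The escaping such a flag would perform could be as simple as the following sketch, using Python's standard xml.sax.saxutils (escape_body is an illustrative name, not an existing option or function of the tool):

```python
from xml.sax.saxutils import escape

def escape_body(text):
    # Replace the five characters that break XML tokenization with
    # their predefined entities: & < > " '
    return escape(text, {'"': "&quot;", "'": "&apos;"})

escaped = escape_body('x < y & "z"')
```

With body text escaped this way, a raw < can no longer be mistaken for the start of an element.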

attardi commented on August 20, 2024

Would it help just enclosing the text within <![CDATA[ ... ]]>?

-- Beppe


nathj07 commented on August 20, 2024

An interesting idea; I did think that would work in my use case. However, when I ran some simple tests I ended up with an unexpected EOF error.

I think perhaps a command-line flag to enable escaping of characters within the <doc>...</doc> tags would work. How does that sound?

nathj07 commented on August 20, 2024

How does that PR look for this?
