
Comments (6)

commented on August 17, 2024

I use this package to parse content in 20 languages, and had to write my own shim to ensure that I only feed unicode to the Document class.

I tried many ways to develop this shim, and finally found something that works across all tested languages.

Attempt 1: Using chardet: Chardet worked very well for European languages, but fell short when it came to CJK encodings that have supersets. Also, the amount of ASCII in the first part of HTML content throws it off.

Attempt 2: Reading headers and charset encoding declarations: I would look for these flags in the response headers and text, then decode. Unfortunately, people lie, especially on CJK sites. Many Chinese/Korean sites state that they use big5/utf-8 but don't actually respond with content in that encoding. That, or the headers say 'utf-8' while the in-document declaration says charset='big5'.

Attempt 3: Using UnicodeDammit: UnicodeDammit is pretty cool. It's aware that HTML tags/cruft need to be stripped before guessing the encoding, and it tries many encodings before giving up. Unfortunately, it is useless for Korean pages, which often feature broken encodings.

Final Solution So Far: I used the code in encoding.py to strip tags with a regex, then ran cchardet (it's a bit faster) against the result, together with a list of superset encodings I have encountered (for instance, 'gb18030' is a superset of 'gb2312') that are commonly "lied about" in the CJK space.

Because of this, I think the existing encoding support is a sound algorithm. The common problem seems to be these "lookalike" encodings: a call to cchardet instead of chardet could save time, and a list of alternate (superset) encodings could be used to correct the detected one.

For instance, just replace detected encodings:

  • 'big5' should be decoded with 'big5hkscs'
  • 'gb2312' should be decoded with 'gb18030'
  • 'ascii' should be decoded with 'utf-8'
  • 'iso-8859-1' should be decoded with 'cp1252'
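
For illustration, a minimal sketch of that approach (the tag-stripping regex, the mapping table, and the decode_html helper are illustrative, not code from python-readability's encoding.py):

    import re
    import cchardet  # drop-in, faster alternative to chardet

    # Superset substitutions for encodings that are commonly "lied about" (illustrative).
    SUPERSET_ENCODINGS = {
        'big5': 'big5hkscs',      # Big5 pages often use HKSCS extension characters
        'gb2312': 'gb18030',      # gb18030 is a superset of gb2312
        'ascii': 'utf-8',         # ASCII is a strict subset of UTF-8
        'iso-8859-1': 'cp1252',   # Windows-1252 extends Latin-1
    }

    def decode_html(raw_bytes):
        """Strip markup, detect the encoding, widen it to a known superset, decode."""
        stripped = re.sub(br'<[^>]*>', b' ', raw_bytes)  # crude tag stripping before detection
        guess = (cchardet.detect(stripped).get('encoding') or 'utf-8').lower()
        return raw_bytes.decode(SUPERSET_ENCODINGS.get(guess, guess), errors='replace')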


Telofy commented on August 17, 2024

Can UnicodeDammit guess the incorrect encoding specified in the headers?

You mean guess the correct encoding if the wrong encoding or no encoding was specified in the HTTP headers? The problem with the encoding specification in the HTTP header is that people often forget to set it. The default according to the HTTP specification is ISO-8859-1, which is rarely the actual encoding, especially when you’re dealing with non-English pages.

I’m using the requests library in just about all of my programs, and it “correctly” assumes ISO-8859-1 in such cases. To actually decode the page, my programs check the content-type header to confirm whether the charset was set explicitly or merely inferred; if it was inferred, they ignore it. Then they call UnicodeDammit with the raw (still encoded) response as markup and the HTTP-level encoding in override_encodings. That library then tries countless sources of explicit encoding specifications in the (HTML or XML) content.
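
In rough outline it looks like the sketch below (a simplification, not the linked fetchers.py; the fetch_unicode name is made up here, the only APIs assumed are requests and bs4's UnicodeDammit):

    import requests
    from bs4 import UnicodeDammit

    def fetch_unicode(url):
        """Decode a response, trusting the HTTP charset only if it was set explicitly."""
        response = requests.get(url)
        content_type = response.headers.get('content-type', '')
        # requests falls back to ISO-8859-1 when the header carries no charset,
        # so only pass the encoding along if it was actually declared there.
        override = [response.encoding] if 'charset' in content_type.lower() else []
        dammit = UnicodeDammit(response.content, override_encodings=override, is_html=True)
        return dammit.unicode_markup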

Two problems remain, however. If none of the tried encodings could decode the response, I’m still in luck because the program actually knows that the decoding was unsuccessful; if, however, one of the incorrect encodings was coincidentally able to decode the page, no decoding errors are raised but the text is garbled. Barring comparisons of language-specific character n-gram models, I can’t think of any automatic method to detect such errors.

After all the sources of explicit encoding specifications have been tried, UnicodeDammit will try to use cchardet if it is available, or chardet if it is not. If neither is available, it’ll skip that step, which is what I’m currently doing for mostly German and English content. So far no one has alerted me of any decoding errors, and we’re fetching some 100k pages per day, so it must be working pretty okay.

Here’s some similar code from a private project of mine: https://bitbucket.org/Telofy/resyndicator/src/b0fdce864919bbbf68561142442428e09fb26112/resyndicator/fetchers.py?at=master#cl-52.
The wrapper (now only needed to raise the exception): https://bitbucket.org/Telofy/utilofies/src/4a1852218b7fe59fbb28fb1bad5a7b26be2a5a46/utilofies/bslib.py?at=master#cl-14.


Telofy commented on August 17, 2024

Thanks, that’s very interesting! I should run a few more tests with cchardet and preprocessed HTML (with tags stripped); that’ll probably restore my confidence in that method. I’ve only been working with English and German sites so far (apart from some that may have found their way in by accident), and the only kind of “lie” I often encountered in that area was when no HTTP encoding was specified at all.


Telofy commented on August 17, 2024

There was another problem with my approach, which forced me to reimplement some small parts of UnicodeDammit: https://bitbucket.org/Telofy/utilofies/src/0d8cdc3ae5a0a08e7fb5906d96f0d8e2284751d1/utilofies/bslib.py?at=master#cl-15.

The encoding problem was reported to me, which usually means that it must’ve occurred on a number of pages, and I vaguely remember that I already ran into it and solved it for the old UnicodeDammit sometime in 2011.

When a page is declared, say, UTF-8 consistently everywhere (or anywhere) but contains a single illegal byte sequence—for example in an HTML comment like this one—then the “correct” encoding, UTF-8, is discarded. Moreover, Windows-1252 is somehow able to decode it, so that all umlauts and ligatures are mucked up.
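
A tiny illustration of that failure mode (made-up data, but the mechanics are the point): one stray Latin-1 byte makes strict UTF-8 decoding fail, while cp1252 happily decodes every byte and garbles the genuine UTF-8 text.

    page = 'Größe'.encode('utf-8') + b' <!-- \xfc -->'   # valid UTF-8 plus one illegal byte
    try:
        page.decode('utf-8')                              # raises UnicodeDecodeError
    except UnicodeDecodeError:
        print(page.decode('cp1252'))                      # 'GrÃ¶ÃŸe <!-- ü -->' (umlauts mucked up)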

When all declared encodings fail, I now immediately fall back on forcing the first one of them. Only if no encodings at all were declared anywhere do I fall back on UTF-8. I hope this will alleviate the problem.
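
In sketch form (not the actual bslib.py code; the function name is invented here), the fallback order is:

    def decode_with_fallback(raw_bytes, declared_encodings):
        """Try declared encodings strictly; force the first one on failure; UTF-8 only as a last resort."""
        for encoding in declared_encodings:
            try:
                return raw_bytes.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                continue
        forced = declared_encodings[0] if declared_encodings else 'utf-8'
        return raw_bytes.decode(forced, errors='replace')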


commented on August 17, 2024

One thing that I have done recently in my implementation is to use the HTTP headers to try to determine the encoding before sending anything to Readability. I check the 'Content-Type' header and use a regex to look for a charset. If one is present, I decode to unicode, re-encode the text as UTF-8, and send it to Readability. The cool trick for detecting UTF-8 then catches it really quickly and all is well.
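
Something along these lines (a rough sketch; the regex, the UTF-8 fallback, and the readable helper are just for illustration, while Document is python-readability's class):

    import re
    import requests
    from readability import Document

    def readable(url):
        """Normalise a response to UTF-8 via the HTTP charset before handing it to Readability."""
        response = requests.get(url)
        match = re.search(r'charset=([\w-]+)', response.headers.get('Content-Type', ''), re.I)
        encoding = match.group(1) if match else 'utf-8'  # optimistic fallback when no charset is sent
        text = response.content.decode(encoding, errors='replace')
        return Document(text.encode('utf-8')).summary()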

FWIW, here are some URLs that I use to test Readability:

Of note are the following:

  • Chardet fails on aktualne.centrum.cz
  • www.aerzteblatt.de uses an XML encoding declaration that breaks LXML if you feed it unicode
  • udn.com declares Big5 encoding, but Acer's Chinese name (宏碁) uses a character that is not found in Big5 but is found in Big5hkscs (a superset of Big5)

The others have various other issues with Readability, and I keep building up this list to test any changes I make to readability against it.


buriy commented on August 17, 2024

I believe this now looks more like a separate package that deals only with document encoding, based on the document text, meta encoding declarations, and HTTP responses, which the readability module could then import and reuse. What do you think?
Could you make such a package?

