
buriy / python-readability


This project forked from timbertson/python-readability


Fast Python port of arc90's readability tool, updated to match the latest readability.js!

Home Page: https://github.com/buriy/python-readability

License: Apache License 2.0

Makefile 2.07% Python 97.93%

python-readability's Introduction


python-readability

Given an HTML document, extract and clean up the main body text and title.

This is a Python port of a Ruby port of arc90's Readability project.

Installation

It's easy using pip, just run:

$ pip install readability-lxml

Alternatively, you can install it with conda:

$ conda install -c conda-forge readability-lxml 

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""

Change Log

  • 0.8.2 Added article author(s) (thanks @mattblaha)
  • 0.8.1 Fixed regexp-based processing of non-ASCII HTML.
  • 0.8 Replaced XHTML output with HTML5 output in the summary() call.
  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tag handling. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7 and 3.3-3.6.
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4.
  • 0.4 Added video loading and allowed more images per paragraph.
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords.

Licensing

This code is under the Apache License 2.0.


python-readability's Issues

Debug.info to debug.warning

You should provide an option to make readability's debug output silent when using it from a program, as you already provide on the command line.

I tried overriding the logger level, with no success:
logging.getLogger("readability").setLevel(logging.WARNING)

Issue with Medium pages

Hi.

I'm getting some errors when trying to process any page from Medium:

$ python -m readability.readability -u https://medium.com/thoughts-on-media/i-read-because-we-re-all-storytellers-it-s-human-nature-after-all-d957a9d2e621#.rn719ccdw
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 624, in <module>
    main()
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 600, in main
    file = urllib2.urlopen(options.url)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

At first I thought it could have something to do with the special characters in the URL, but the same thing happens with Medium's homepage:

$ python -m readability.readability -u https://medium.com/
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 624, in <module>
    main()
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 600, in main
    file = urllib2.urlopen(options.url)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

Thanks.
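A possible workaround sketch: the 403 likely comes from Medium rejecting urllib2's default User-Agent, so fetching the page yourself and passing the body to Document sidesteps the built-in downloader (the requests dependency and the header value are assumptions):

import requests
from readability import Document

# fetch with a browser-like User-Agent, then hand the body to readability
resp = requests.get('https://medium.com/', headers={'User-Agent': 'Mozilla/5.0'})
doc = Document(resp.content)
print(doc.title())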

-v/--verbose does not work

I added a print to see why it doesn't log:

-v 3

E:\Project>python -m readability.readability -v 3 -l readability_test.log -u http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html
{'xpath': None, 'verbose': 1, 'url': 'http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html', 'negative_keywords': None, 'browser': None, 'positive_keywords': None, 'log': 'readability_test.log'}

--verbose 3

E:\Project>python -m readability.readability -b --verbose 3 -l readability_test.log -u http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html
{'xpath': None, 'verbose': 1, 'url': 'http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html', 'negative_keywords': None, 'browser': True, 'positive_keywords': None, 'log': 'readability_test.log'}

Since parser.add_option('-v', '--verbose', action='count', default=0) has a default of 0, I don't understand why it becomes 1...

Env: Windows 8, Python 2.7
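This looks like standard optparse behaviour rather than a readability bug: with action='count' the option takes no argument, so -v 3 counts a single -v (hence 'verbose': 1) and leaves the 3 as a stray positional argument. Verbosity is raised by repeating the flag, e.g.:

E:\Project>python -m readability.readability -vvv -l readability_test.log -u http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html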

Save charset

Hello!
How can I preserve the charset information (<meta http-equiv="Content-Type" content="text/html; charset=xxx">)?
When I process documents that are not encoded in UTF-8, the charset information is lost and I cannot display the processed document to the user correctly.
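The changelog above notes that Document.encoding was added in 0.3, which exposes the detected charset; a minimal sketch of carrying it into the saved output (only the encoding attribute comes from the changelog, the rest is illustrative):

from readability import Document

doc = Document(raw_bytes)              # raw, undecoded page bytes (illustrative variable)
summary = doc.summary()
charset = doc.encoding or 'utf-8'      # detected source charset, with an assumed fallback
meta = '<meta http-equiv="Content-Type" content="text/html; charset=%s" />' % charset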

Annoying warning when using readability

Here it is:
/home/spt/projects/sandboxes/readability/lib/python2.7/site-packages/readability_lxml-0.2.6-py2.7.egg/readability/htmls.py:60: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

Sorry, I don't have much time to contribute a patch right now, but I may contribute one if nobody fixes the problem before me ;)

build_doc in htmls.py does not handle cases where get_encoding returns None

It is possible for the call on line 16 of htmls.py to return None. This is not handled by htmls.py, causing the next call (page.decode) to fail with a rather cryptic:

Unparseable: decode() argument 1 must be string, not None

Might it be worthwhile to raise a specific error when get_encoding returns None?
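A minimal guard sketch along the lines the report suggests, around the htmls.py call in question (the UTF-8 fallback is an assumption; raising a specific, descriptive error is the reporter's alternative):

enc = get_encoding(page)
if enc is None:
    enc = 'utf-8'  # assumed fallback; alternatively raise a specific error here
page_unicode = page.decode(enc, 'replace')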

Errors in processing slashdot pages

I noticed that there are some problems processing pages from slashdot.org, for example this page:
http://developers.slashdot.org/story/12/03/22/2149251/gcc-turns-25?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Slashdot%2FslashdotDevelopers+%28Slashdot%3A+Developers%29
One of the comments is shown instead of the body.

With this other page python-readability works well:
http://developers.slashdot.org/story/12/03/19/1458206/mystery-of-duqu-programming-language-solved?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Slashdot%2FslashdotDevelopers+%28Slashdot%3A+Developers%29

The same problem is present in the JavaScript version of readability.

Getting lxml stack traces with NY Times urls

I've installed readability-lxml with pip on Ubuntu 12.10 (64-bit, AMD), and all my packages are up to date.

When I try accessing URLs from the NY Times website using the command-line syntax, e.g. with these requests:

$ python -m readability.readability -u "http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html?_r=0"
$ python -m readability.readability -u "http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html"
$ python -m readability.readability -u "http://www.nytimes.com/2013/03/03/magazine/beer-mergers.html?ref=magazine"
$ python -m readability.readability -u "http://www.nytimes.com/2013/03/03/magazine/beer-mergers.html"

I'm getting the following stack trace:

ERROR:root:error getting summary: 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 136, in summary
    self._html(True)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 104, in _html
    self.html = self._parse(self.input)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 108, in _parse
    doc = build_doc(input)
  File "/usr/local/lib/python2.7/dist-packages/readability/htmls.py", line 18, in build_doc
    doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54726)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82843)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81641)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78311)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74567)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75458)
  File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74958)
XMLSyntaxError: None
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 589, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 584, in main
    url=options.url).summary().encode(enc, 'replace')
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 136, in summary
    self._html(True)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 104, in _html
    self.html = self._parse(self.input)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 108, in _parse
    doc = build_doc(input)
  File "/usr/local/lib/python2.7/dist-packages/readability/htmls.py", line 18, in build_doc
    doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54726)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82843)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81641)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78311)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74567)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75458)
  File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74958)
__main__.Unparseable: None

We need a way to cut out bad scoring elems inside good ones.

For example, in http://www.kwqc.com/story/19484184/shots-fired-in-ill-school-2-students-in-custody
readability will include the block of related links that sits inside the div of the main story, despite its link density being high and its overall content score being low.

First, I want to know whether it's supposed to do that. Is there any support for cutting out bad divs from inside good ones?

And if not, can you suggest a starting place for going about this?

HTML isn't parsed when get_clean_html is called

get_clean_html fails with TypeError when called right after document creation.
How to reproduce:

from readability import Document
document = Document("Doesn't matter what to write here")
document.get_clean_html()

It looks like the problem is that _html is not called here. The HTML isn't parsed unless I first call, for example, title(), and so None gets passed along. Calling _html by hand fixes the problem.
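Per the report, forcing a parse first works around the crash; a sketch:

from readability import Document

document = Document("Doesn't matter what to write here")
document._html()            # workaround from the report: force parsing before get_clean_html
document.get_clean_html()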

AttributeError: 'list' object has no attribute 'split'

Hello,

I faced the aforementioned error recently with the latest version of readability-lxml. This is part of the stacktrace:

    text = document_fromstring(Document(html,negative_keywords=['related']).summary()).text_content()
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 98, in __init__
    self.negative_keywords = compile_pattern(negative_keywords)
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 74, in compile_pattern
    elements = elements.split(',')
AttributeError: 'list' object has no attribute 'split'

And how it is used in our codebase:

from readability.readability import Document

html = ...

text = document_fromstring(Document(html,negative_keywords=['related']).summary()).text_content()

However, this is OK if I pin to the version 0.3.0.6 of the library.

$ python --version
Python 2.7.6

$ pip --version
pip 1.5.4 from /usr/lib/python2.7/dist-packages (python 2.7)

Any idea why this breaks?

Thanks

cc @olivierthereaux
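The traceback shows compile_pattern calling elements.split(','), i.e. this version expects a comma-separated string rather than a list; a workaround sketch (aside from pinning 0.3.0.6):

# pass the keywords as a comma-separated string instead of a list
text = document_fromstring(Document(html, negative_keywords='related').summary()).text_content()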

Unparseable: local variable 'enc' referenced before assignment

Hi there!

Extraction doesn't work anymore when you pre-decode the strings. This looks pretty trivial, though: enc could be initialized to None, unless that would cause problems in other parts of the code.

By the way, I would discourage use of the old chardet library. The range of encodings it can detect is very limited, and it's slow on top of that. I've found cchardet to be a lot better, but really there is the excellent UnicodeDammit module in BeautifulSoup, which first tries to extract various explicit encoding declarations and then falls back on such implicit methods. Thanks to their latest refactoring, I could even remove a number of ugly hacks I needed with the older version.

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in summary(self, html_partial)
    152             ruthless = True
    153             while True:
--> 154                 self._html(True)
    155                 for i in self.tags(self.html, 'script', 'style'):
    156                     i.drop_tree()

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _html(self, force)
    117     def _html(self, force=False):
    118         if force or self.html is None:
--> 119             self.html = self._parse(self.input)
    120         return self.html
    121 

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _parse(self, input)
    121 
    122     def _parse(self, input):
--> 123         doc, self.encoding = build_doc(input)
    124         doc = html_cleaner.clean_html(doc)
    125         base_href = self.options.get('url', None)

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/htmls.pyc in build_doc(page)
     15         page_unicode = page.decode(enc, 'replace')
     16     doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
---> 17     return doc, enc
     18 
     19 def js_re(src, pattern, flags, repl):

Unparseable: local variable 'enc' referenced before assignment
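A sketch of the trivial fix the reporter suggests, with the surrounding build_doc lines reconstructed from the traceback above (Python 2 era; everything except the initialization is illustrative):

def build_doc(page):
    enc = None                        # initialize so the pre-decoded path can't crash
    if isinstance(page, unicode):     # input was decoded by the caller; nothing to detect
        page_unicode = page
    else:
        enc = get_encoding(page)
        page_unicode = page.decode(enc, 'replace')
    doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
    return doc, enc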

Add charset info to the clean html

Thank you for keeping up the project!

I use readability to extract the article and then save it as HTML. Today I ran into a problem where Chrome didn't display some Unicode characters correctly (the .html file was saved as UTF-8). It turned out to be solved by adding the following line to the cleaned HTML:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Maybe adding this info could be made the default behavior of get_clean_html().

Thank you very much!
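Until something like that is the default, a sketch of prepending the tag when saving (the file name is illustrative; io.open keeps it Python 2 friendly):

import io

meta = u'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
with io.open('article.html', 'w', encoding='utf-8') as f:
    f.write(meta + doc.summary())   # doc is a readability Document as in the Usage section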

Failure if best_elem is root

I have some documents that are raising the following exception

Traceback (most recent call last):
...
  File ".../site-packages/readability/readability.py", line 168, in summary
    html_partial=html_partial)
  File ".../site-packages/readability/readability.py", line 214, in get_article
    for sibling in best_elem.getparent().getchildren():
Unparseable: 'NoneType' object has no attribute 'getchildren'

I assume there should be a fix along the lines of:

if best_elem.getparent() is None:
    siblings = [best_elem]
else:
    siblings = best_elem.getparent().getchildren()
for sibling in siblings:

Why not calculate node score from deep to shallow?

The default calculation order is as follows:

/div/p[1]
/div/div[1]/p[1]
/div/div[1]/div/p
/div/div[1]/div/div/p[1]
/div/div[1]/div/div/p[2]
/div/div[1]/div/div/p[3]
/div/div[1]/p[2]
/div/div[1]/p[3]
/div/div[1]/p[4]
/div/div[1]/p[5]
/div/div[1]/p[6]

Step 1: /div/div[1]/p[1] adds score to /div/div[1] and /div.

Step 2: /div/div[1]/div/p adds score to /div/div[1]/div and /div/div[1].

The score of /div doesn't change, but its child /div/div[1]'s score increases.

I don't think this is correct; sorting the tags first (scoring from deep to shallow) would make the scores more accurate.

Travis for CI

Any interest in getting CI working for this project?

I am working on an open-source project which will rely on this library, and it would make it easier to see whether your build is passing or failing :)

I'd be happy to submit a PR for the .travis.yml file!

https://travis-ci.org/

Eliminate display:none tags

Hi there, I have a request that shouldn't be too hard to implement, though a proper full-CSS solution might require some trickery.

If you look at the code of http://www.readwriteweb.com/start/2012/04/what-do-angels-want.php you'll see tags like <div style="display:none;" id="disqus_post_message">...</div> in there, which are clearly hidden and shouldn't come up in the summary. However, in the current version of the library they do (and the display:none is stripped out in summary()).

Interestingly, in the bookmarklet version of readability these things don't show up, not even as display:none, so maybe there is a fix there already?

Thanks so much for looking into it.

Cheers
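A pre-processing sketch, not an existing library option: drop inline display:none subtrees with lxml before handing the page to readability. The XPath is an assumption and only catches inline styles, not rules from stylesheets:

import lxml.html

doc = lxml.html.document_fromstring(raw_html)   # raw_html is the fetched page (illustrative)
# remove any element whose inline style contains display:none (spaces stripped first)
for el in doc.xpath('//*[contains(translate(@style, " ", ""), "display:none")]'):
    el.drop_tree()
cleaned = lxml.html.tostring(doc)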

Aggressively removes images

Between 0.3.0.6 and the current release, python-readability aggressively removes all images embedded in the HTML. There doesn't seem to be a way to control this behaviour.

Require older lxml version for OSX compatibility

Currently, the package does not work on OSX, if installed directly via pip, because "pip install lxml" does not currently work on OSX.

I don't know how many programmer-hours have been lost to the frustrations of installing lxml on OSX, due to OSX shipping outdated libxml libraries. StackOverflow is littered with questions about it, and even the solutions that worked for other people didn't work for me.

What did work for me was simply installing an older lxml. If you add a requirements.txt at the root of this package and pin the lxml dependency, e.g.

lxml==2.3

the package will install successfully on OSX. It works up to lxml 2.3.5 and doesn't work with lxml 3.0. I'm not sure what the oldest version your package will work with is; I haven't tested that, sorry.

P.S. Thanks for your work on this package. If this wasn't here, I'd probably have written it myself.

P.P.S. After spending hours on this stupid, stupid problem, let me just say: **** developing on OSX sucks.

Warning in htmls.py

Warning (from warnings module):
File "c:\Python27\egg\Lib\site-packages\readability\htmls.py", line 60
if not title or not title.text:
FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

Update: replace

if not title or not title.text:

with:

if title is None or not title.text:

(Testing the element's truthiness is what triggers the FutureWarning; the explicit is None test preserves the original behaviour.)

How about using path weight instead of class weight?

Hi @buriy, are you awake? It seems I always post just when you've gone to sleep; damn timezones...

OK, I've come up with an idea:

List every node's absolute path, including class and id, and calculate a path score using positiveRe and negativeRe.
For example:

body/div/p[1]
body/div/div[1].article/p[1]
body/div/div[1].article/div.comment/p
body/div/div[1].article/div.content/p
body/div/div[1].article/div/p
body/div/div[2]/div.comment/p.content

How to calculate the score:

  1. search each path from right to left,
  2. sum all (distance * distance_punishment * negative or positive weight)

where distance is the current node's depth minus the depth of the ancestor matching negativeRe or positiveRe.
PS: A bigger distance needs a bigger punishment, I think. The correct parameter is hard to determine now, but I think this approach can work.

Let's set
positiveRe weight to 25,
negativeRe weight to -25,
distance_punishment to 1


Example:

body/div/div[1].article/p[1]

score: 1 (p[1]'s depth minus div[1].article's depth) * 25 = 25

body/div/div[1].article/div.comment/p
(div[1].article matches positiveRe, div.comment matches negativeRe)

score: 1 (p's depth minus div.comment's depth) * -25 + 1/2 * 25 = -12.5

body/div/div[2]/div.comment/p.content

score: 1 * 25 + 1 * -25 = 0 (the distance from p to itself is 1; I forgot to define that above...)


Maybe this approach would give an element with a comment class a positive score. But you can imagine that a comment can't be the only element under an article;

for example, it would have other siblings like:

body/div/div[1].article/div.content/p
body/div/div[1].article/div/p

whose scores must be higher than the comment's.

However, what if the article only has a single child node, with class comment and a positive score? Congratulations! That just means the site's naming has big problems (abusing English words), but we still found a good candidate! (At least a more confident candidate, if they haven't misunderstood what "article" means.)

Also, as far as I can see, the current approach does not give a score to HTML5 tags such as article; this approach covers that. Calculating node scores from deep to shallow would work well with this, too.

But due to a lack of data, I can't test it now...

No Title for most articles

Is there a known problem that there are no titles for most articles on the internet? When I try "python -m readability.readability -u <url>" on popular news sites, I don't get any headings.

Using AdBlock rules to remove elements

AdBlock Plus element-hiding rules specify elements to exclude and are written as CSS selectors. This is easy to implement in lxml, if somewhat slow.

I'm using this in my own code to automatically remove social-media share links from pages. You may want to consider including something similar in python-readability.

EasyList is dual licensed Creative Commons Attribution-ShareAlike 3.0 Unported and GNU General Public License version 3. CC-BY-SA looks compatible with Apache licensed projects.

Example

First download the rules:

$ wget https://easylist-downloads.adblockplus.org/fanboy-annoyance.txt

Then you can simply extract the CSS selectors to match against a document tree.

from lxml import html
from lxml.cssselect import CSSSelector

RULES_PATH = 'fanboy-annoyance.txt'
with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

# get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

def remove_ads(tree):
    for rule in rules:
        for matched in rule(tree):
            matched.getparent().remove(matched)

doc = html.document_fromstring("<html>...</html>")
remove_ads(doc)

Can't get correct article content if content is in a table. Bug?

I have some URLs on the same site.

Can't get the correct content (with a table):
http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html
http://ecp.sgcc.com.cn/html/project/014001001/9900000000003560149.html

Works fine if there's no table:
http://ecp.sgcc.com.cn/html/news/014001008/7900000000000045486.html

Maybe a bug?

Update:
I used print doc.score_paragraphs() to see what happens:

{ < Element table at 0x50fb180 > : {
        'content_score': 1.95,
        'elem': < Element table at 0x50fb180 >
    },
    < Element tr at 0x5024b10 > : {
        'content_score': 2.0,
        'elem': < Element tr at 0x5024b10 >
    },
    < Element tr at 0x50fb300 > : {
        'content_score': 2.0,
        'elem': < Element tr at 0x50fb300 >
    }
}

I think tr and td tags should not be treated as candidates; besides, since they are under a table, they should instead add some score to their parent table.

Erratic <p> insertion in Macrumors article

Macrumors uses <br> tags to separate their paragraphs. Readability attempts to insert <p> tags into the article, but the results are not what you'd expect.

For example, this article: http://www.macrumors.com/2012/10/12/apples-ipad-mini-media-event-reportedly-scheduled-for-october-23/

results in this output for Document(content).summary():

<html><body><div><div class="content">
                        <a href="http://allthingsd.com/20121012/apple-likely-to-unveil-ipad-mini-at-october-23-event/">
<i>AllThingsD</i> reports</a><p> that Apple appears to be planning to hold a media event on Tuesday, October 23 to introduce 
the "iPad mini", Apple's smaller tablet device said to be carrying a display measuring 7.85 inches diagonally.</p>
<p class="quote">As AllThingsD reported in August, Apple will hold a special event this month at which it will showcase a new, 
smaller iPad. People familiar with Apple’s plans tell us that the company will unveil the so-called “iPad mini” on October 23 at an 
invitation-only event.<br/><br/>
That’s a Tuesday, not a Wednesday, so this is a bit of a break with recent tradition. It also happens to be just three days prior to 
the street date for Microsoft’s new Surface tablet.</p><center>
<img src="http://cdn.macrumors.com/article-new/2012/09/ipadmini_small.jpg"/><br/><i>Physical mockup of rumored iPad mini design</i></center><p>
The location of the event is unconfirmed, but the report suggests that it is likely to be held at the company's Town Hall 
auditorium at its corporate headquarters in Cupertino, California.</p><i>AllThingsD</i><p> has an excellent track record 
regarding Apple media event rumors, giving this claim a high probability of proving true.  Given past history, Apple would be 
expected to send out invitations early next week if the event is to be held on October 23.</p><b>Update</b><p>: </p><em>The 
Loop</em><p>'s Jim Dalrymple weighs in, </p>
<a href="http://www.loopinsight.com/2012/10/12/apples-rumored-oct-23-ipad-mini-event/">confirming the date</a><p> with a "Yep."
                                        </p><p/>
                    </div>
                    </div></body></html>

0.2.4 uninstallable .egg uploaded to PyPI

The latest package isn't installable from PyPI as it's a .egg. Previous versions appear to have been .zip files.

I've always just uploaded with setup.py sdist upload; I'm not sure how this one was set up.

Any research work for extracting the main body of pages?

Hi,

I'm new to this. I tested python-readability and it works most of the time. I have not yet studied the code in detail; from a quick scan, it looks like heuristic scoring of elements. Are there any documents explaining the scoring strategies, e.g. a quantitative study of how well those heuristics work, or formal datasets for comparing the extraction performance of different algorithms?

It would be highly appreciated if someone could point me to good recent research.

P.S. A funny case: it cannot extract the README part (which I personally consider to be the main body) of this project's own homepage.

Detect whether page is suitable for applying readability

Hi.

I'd like to know whether it's possible for python-readability to indicate whether the page in question is suitable for applying readability.

For example, consider the New York Times homepage. It's not supposed to be parsed by python-readability.

Is it possible to check this?

Thanks.

Differences with Goose

Hi, can I ask what are the differences with python-goose?
https://github.com/grangier/python-goose

Or, to put it another way: why did you decide to resurrect python-readability instead of investing in Goose?

It's a genuine question; I'm evaluating content-extraction frameworks and trying to decide which one to use. So far I prefer Goose, but I'm trying to understand whether I missed something. Thank you in advance!

Pypi not up-to-date

Could there be a fresh version of readability on PyPI which includes fdba8d9?

I've just hit this bug, and whilst fixing it isn't too much of a pain, it'd be good for people pulling fresh from PyPI not to hit it too.

Cheers!

Crash when parsing articles with invalid link "http://["

It seems readability crashes on links with an extra "[" in the URL. Here's an example page: http://www.theguardian.com/film/2014/apr/24/the-hurricane-rubin-carter-denzel-washington

<a href="http://[http://www.theguardian.com/film/filmblog/2013/may/09/raging-bull-reel-history-martin-scorsese" title="">Raging Bull</a>

Here's the stacktrace:

Traceback (most recent call last):
  File "manage.py", line 14, in <module>
    execute_from_command_line(sys.argv)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "C:\Emils\Projects\presskoll\presskoll\webhook\management\commands\extract_articles.py", line 69, in handle
    parse_article(rawarticle, overwrite=options["overwrite"], DEBUG=DEBUG)
  File "C:\Emils\Projects\presskoll\presskoll\webhook\article_parser.py", line 59, in parse_article
    title, body = title_and_body_from_article(rawarticle)
  File "C:\Emils\Projects\presskoll\presskoll\webhook\content_util.py", line 393, in title_and_body_from_article
    doc = document._html(True)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\readability\readability.py", line 119, in _html
    self.html = self._parse(self.input)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\readability\readability.py", line 127, in _parse
    doc.make_links_absolute(base_href, resolve_base_href=True)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\__init__.py", line 340, in make_links_absolute
    self.rewrite_links(link_repl)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\__init__.py", line 469, in rewrite_links
    new_link = link_repl_func(link.strip())
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\__init__.py", line 335, in link_repl
    return urljoin(base_url, href)
  File "C:\Program Files\Python27\Lib\urlparse.py", line 260, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "C:\Program Files\Python27\Lib\urlparse.py", line 142, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Program Files\Python27\Lib\urlparse.py", line 190, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL

I suggest this error be caught and discarded.
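Newer lxml versions let the make_links_absolute call shown in the traceback do exactly that; a sketch (availability of the handle_failures parameter depends on the lxml version):

# drop links that cannot be resolved instead of raising ValueError
doc.make_links_absolute(base_href, resolve_base_href=True, handle_failures='discard')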

Error in install and import

Hi,
I installed readability, but when I import readability it doesn't work:

from readability.readability import Document
Traceback (most recent call last):
File "", line 1, in
File "readability.py", line 1, in
from readability.readability import Document
ImportError: No module named readability

But I can use the command line:
python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml
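The frame File "readability.py", line 1 in the traceback suggests the script doing the import is itself named readability.py, which shadows the installed package; renaming the local file (and removing its compiled .pyc) should fix the import:

$ mv readability.py my_script.py   # any name other than "readability" works
$ rm -f readability.pyc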
