
buriy / python-readability


This project forked from timbertson/python-readability


Fast Python port of arc90's readability tool, updated to match the latest readability.js!

Home Page: https://github.com/buriy/python-readability

License: Apache License 2.0

Makefile 2.07% Python 97.93%

python-readability's Introduction


python-readability

Given an HTML document, extract and clean up the main body text and title.

This is a Python port of a Ruby port of arc90's Readability project.

Installation

It's easy using pip, just run:

$ pip install readability-lxml

Alternatively, you can install it with conda:

$ conda install -c conda-forge readability-lxml 

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.content)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""

Change Log

  • 0.8.2 Added article author(s) (thanks @mattblaha)
  • 0.8.1 Fixed regexp-based processing of non-ASCII HTML.
  • 0.8 Replaced XHTML output with HTML5 output in the summary() call.
  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tag handling. Fixed stripping of unwanted HTML nodes (previously only the first matching node was removed).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7 and 3.3-3.6.
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4.
  • 0.4 Added video loading and allowed more images per paragraph.
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords.

Licensing

This code is under the Apache License 2.0.


python-readability's Issues

Debug.info to debug.warning

You should provide an option to make readability's debug output silent when using it from a program, as you already provide on the command line.

I tried overriding the logger level, with no success:
logging.getLogger("readability").setLevel(logging.WARNING)

Issue with Medium pages

Hi.

I'm getting some errors when trying to process any page from Medium:

$ python -m readability.readability -u https://medium.com/thoughts-on-media/i-read-because-we-re-all-storytellers-it-s-human-nature-after-all-d957a9d2e621#.rn719ccdw
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 624, in <module>
    main()
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 600, in main
    file = urllib2.urlopen(options.url)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

At first I thought it could have something to do with the special characters in the URL, but the same thing happens with Medium's homepage:

$ python -m readability.readability -u https://medium.com/
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 624, in <module>
    main()
  File "/usr/local/lib/python2.7/site-packages/readability/readability.py", line 600, in main
    file = urllib2.urlopen(options.url)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

Thanks.
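A possible workaround sketch: the 403 likely comes from Medium rejecting urllib2's default User-Agent, so fetching the page yourself and passing the body to Document sidesteps the built-in downloader (the requests dependency and the header value are assumptions):

import requests
from readability import Document

# fetch with a browser-like User-Agent, then hand the body to readability
resp = requests.get('https://medium.com/', headers={'User-Agent': 'Mozilla/5.0'})
doc = Document(resp.content)
print(doc.title())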

-v/--verbose does not work

I added a print to see why it doesn't log:

-v 3

E:\Project>python -m readability.readability -v 3 -l readability_test.log -u http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html
{'xpath': None, 'verbose': 1, 'url': 'http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html', 'negative_keywords': None, 'browser': None, 'positive_keywords': None, 'log': 'readability_test.log'}

--verbose 3

E:\Project>python -m readability.readability -b --verbose 3 -l readability_test.log -u http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html
{'xpath': None, 'verbose': 1, 'url': 'http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html', 'negative_keywords': None, 'browser': True, 'positive_keywords': None, 'log': 'readability_test.log'}

Since parser.add_option('-v', '--verbose', action='count', default=0) has a default of 0, I don't understand why it becomes 1...

Env: Windows 8, Python 2.7
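This looks like standard optparse behaviour rather than a readability bug: with action='count' the option takes no argument, so -v 3 counts a single -v (hence 'verbose': 1) and leaves the 3 as a stray positional argument. Verbosity is raised by repeating the flag, e.g.:

E:\Project>python -m readability.readability -vvv -l readability_test.log -u http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html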

Save charset

Hello!
How can I preserve the charset information (<meta http-equiv="Content-Type" content="text/html; charset=xxx">)?
When I process documents that are not encoded in UTF-8, the charset information is lost and I cannot display the processed document to the user correctly.
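The changelog above notes that Document.encoding was added in 0.3, which exposes the detected charset; a minimal sketch of carrying it into the saved output (only the encoding attribute comes from the changelog, the rest is illustrative):

from readability import Document

doc = Document(raw_bytes)              # raw, undecoded page bytes (illustrative variable)
summary = doc.summary()
charset = doc.encoding or 'utf-8'      # detected source charset, with an assumed fallback
meta = '<meta http-equiv="Content-Type" content="text/html; charset=%s" />' % charset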

Annoying warning when using readability

Here it is:
/home/spt/projects/sandboxes/readability/lib/python2.7/site-packages/readability_lxml-0.2.6-py2.7.egg/readability/htmls.py:60: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

Sorry, I don't have much time to contribute a patch right now, but I may contribute one if nobody fixes the problem before me ;)

build_doc in htmls.py does not handle cases where get_encoding returns None

It is possible for the call on line 16 of htmls.py to return None. This is not handled by htmls.py, causing the next call (page.decode) to fail with a rather cryptic:

Unparseable: decode() argument 1 must be string, not None

Might it be worthwhile to raise a specific error when get_encoding returns None?
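A minimal guard sketch along the lines the report suggests, around the htmls.py call in question (the UTF-8 fallback is an assumption; raising a specific, descriptive error is the reporter's alternative):

enc = get_encoding(page)
if enc is None:
    enc = 'utf-8'  # assumed fallback; alternatively raise a specific error here
page_unicode = page.decode(enc, 'replace')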

Errors in processing slashdot pages

I noticed that there are some problems processing pages from slashdot.org, for example this page:
http://developers.slashdot.org/story/12/03/22/2149251/gcc-turns-25?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Slashdot%2FslashdotDevelopers+%28Slashdot%3A+Developers%29
One of the comments is shown instead of the body.

With this other page python-readability works well:
http://developers.slashdot.org/story/12/03/19/1458206/mystery-of-duqu-programming-language-solved?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+Slashdot%2FslashdotDevelopers+%28Slashdot%3A+Developers%29

The same problem is present in the JavaScript version of readability.

Getting lxml stack traces with NY Times urls

I've installed readability-lxml with pip on Ubuntu 12.10 (64-bit, AMD), and all my packages are up to date.

When I try accessing URLs from the NY Times website using the command-line syntax, e.g. with these requests:

$ python -m readability.readability -u "http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html?_r=0"
$ python -m readability.readability -u "http://www.nytimes.com/2013/01/25/sports/25iht-sumo25.html"
$ python -m readability.readability -u "http://www.nytimes.com/2013/03/03/magazine/beer-mergers.html?ref=magazine"
$ python -m readability.readability -u "http://www.nytimes.com/2013/03/03/magazine/beer-mergers.html"

I'm getting the following stack trace:

ERROR:root:error getting summary: 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 136, in summary
    self._html(True)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 104, in _html
    self.html = self._parse(self.input)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 108, in _parse
    doc = build_doc(input)
  File "/usr/local/lib/python2.7/dist-packages/readability/htmls.py", line 18, in build_doc
    doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54726)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82843)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81641)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78311)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74567)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75458)
  File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74958)
XMLSyntaxError: None
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 589, in <module>
    main()
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 584, in main
    url=options.url).summary().encode(enc, 'replace')
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 136, in summary
    self._html(True)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 104, in _html
    self.html = self._parse(self.input)
  File "/usr/local/lib/python2.7/dist-packages/readability/readability.py", line 108, in _parse
    doc = build_doc(input)
  File "/usr/local/lib/python2.7/dist-packages/readability/htmls.py", line 18, in build_doc
    doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2756, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54726)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82843)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81641)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78311)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74567)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75458)
  File "parser.pxi", line 601, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74958)
__main__.Unparseable: None

We need a way to cut out bad scoring elems inside good ones.

For example, in http://www.kwqc.com/story/19484184/shots-fired-in-ill-school-2-students-in-custody
readability will include the block of related links that sits inside the div of the main story, despite its link density being high and its overall content score being low.

First, I want to know whether it's supposed to do that. Is there any support for cutting out bad divs from inside good ones?

And if not, can you suggest a starting place for going about this?

HTML isn't parsed when get_clean_html is called

get_clean_html fails with TypeError when called right after document creation.
How to reproduce:

from readability import Document
document = Document("Doesn't matter what to write here")
document.get_clean_html()

It looks like the problem is that _html is not called here. The HTML isn't parsed unless I first call, for example, title(), and so None gets passed along. Calling _html by hand fixes the problem.
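Per the report, forcing a parse first works around the crash; a sketch:

from readability import Document

document = Document("Doesn't matter what to write here")
document._html()            # workaround from the report: force parsing before get_clean_html
document.get_clean_html()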

AttributeError: 'list' object has no attribute 'split'

Hello,

I faced the aforementioned error recently with the latest version of readability-lxml. This is part of the stacktrace:

    text = document_fromstring(Document(html,negative_keywords=['related']).summary()).text_content()
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 98, in __init__
    self.negative_keywords = compile_pattern(negative_keywords)
  File "build/bdist.linux-x86_64/egg/readability/readability.py", line 74, in compile_pattern
    elements = elements.split(',')
AttributeError: 'list' object has no attribute 'split'

And how it is used in our codebase:

from readability.readability import Document

html = ...

text = document_fromstring(Document(html,negative_keywords=['related']).summary()).text_content()

However, this is OK if I pin to the version 0.3.0.6 of the library.

$ python --version
Python 2.7.6

$ pip --version
pip 1.5.4 from /usr/lib/python2.7/dist-packages (python 2.7)

Any idea why this breaks?

Thanks

cc @olivierthereaux
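The traceback shows compile_pattern calling elements.split(','), i.e. this version expects a comma-separated string rather than a list; a workaround sketch (aside from pinning 0.3.0.6):

# pass the keywords as a comma-separated string instead of a list
text = document_fromstring(Document(html, negative_keywords='related').summary()).text_content()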

Unparseable: local variable 'enc' referenced before assignment

Hi there!

Extraction doesn't work anymore when you pre-decode the strings. This looks pretty trivial, though: enc could be initialized to None, unless that would cause problems in other parts of the code.

By the way, I would discourage use of the old chardet library. The range of encodings it can detect is very limited, and it's slow on top of that. I've found cchardet to be a lot better, but really there is the excellent UnicodeDammit module in BeautifulSoup, which first tries to extract various explicit encoding declarations and then falls back on such implicit methods. Thanks to their latest refactoring, I could even remove a number of ugly hacks I needed with the older version.

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in summary(self, html_partial)
    152             ruthless = True
    153             while True:
--> 154                 self._html(True)
    155                 for i in self.tags(self.html, 'script', 'style'):
    156                     i.drop_tree()

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _html(self, force)
    117     def _html(self, force=False):
    118         if force or self.html is None:
--> 119             self.html = self._parse(self.input)
    120         return self.html
    121 

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/readability.pyc in _parse(self, input)
    121 
    122     def _parse(self, input):
--> 123         doc, self.encoding = build_doc(input)
    124         doc = html_cleaner.clean_html(doc)
    125         base_href = self.options.get('url', None)

/home/telofy/.buildout/eggs/readability_lxml-0.3.0.1-py2.7.egg/readability/htmls.pyc in build_doc(page)
     15         page_unicode = page.decode(enc, 'replace')
     16     doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
---> 17     return doc, enc
     18 
     19 def js_re(src, pattern, flags, repl):

Unparseable: local variable 'enc' referenced before assignment
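A sketch of the trivial fix the reporter suggests, with the surrounding build_doc lines reconstructed from the traceback above (Python 2 era; everything except the initialization is illustrative):

def build_doc(page):
    enc = None                        # initialize so the pre-decoded path can't crash
    if isinstance(page, unicode):     # input was decoded by the caller; nothing to detect
        page_unicode = page
    else:
        enc = get_encoding(page)
        page_unicode = page.decode(enc, 'replace')
    doc = lxml.html.document_fromstring(page_unicode.encode('utf-8', 'replace'), parser=utf8_parser)
    return doc, enc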

Add charset info to the clean html

Thank you for keeping up the project!

I use readability to extract the article and then save it as HTML. Today I ran into a problem where Chrome didn't display some Unicode characters correctly (the .html file was saved as UTF-8). It turned out to be solved by adding the following line to the cleaned HTML:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Maybe adding this info could be made the default behavior of get_clean_html().

Thank you very much!
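Until something like that is the default, a sketch of prepending the tag when saving (the file name is illustrative; io.open keeps it Python 2 friendly):

import io

meta = u'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
with io.open('article.html', 'w', encoding='utf-8') as f:
    f.write(meta + doc.summary())   # doc is a readability Document as in the Usage section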

Failure if best_elem is root

I have some documents that are raising the following exception

Traceback (most recent call last):
...
  File ".../site-packages/readability/readability.py", line 168, in summary
    html_partial=html_partial)
  File ".../site-packages/readability/readability.py", line 214, in get_article
    for sibling in best_elem.getparent().getchildren():
Unparseable: 'NoneType' object has no attribute 'getchildren'

I assume there should be a fix along the lines of:

if best_elem.getparent() is None:
    siblings = [best_elem]
else:
    siblings = best_elem.getparent().getchildren()
for sibling in siblings:

Why not calculate node score from deep to shallow?

The default calculation order is as follows:

/div/p[1]
/div/div[1]/p[1]
/div/div[1]/div/p
/div/div[1]/div/div/p[1]
/div/div[1]/div/div/p[2]
/div/div[1]/div/div/p[3]
/div/div[1]/p[2]
/div/div[1]/p[3]
/div/div[1]/p[4]
/div/div[1]/p[5]
/div/div[1]/p[6]

Step 1: /div/div[1]/p[1] adds score to /div/div[1] and /div.

Step 2: /div/div[1]/div/p adds score to /div/div[1]/div and /div/div[1].

The score of /div doesn't change, but its child /div/div[1]'s score increases.

I don't think this is correct; sorting the tags first (scoring from deep to shallow) would make the scores more accurate.

Travis for CI

Any interest in getting CI working for this project?

I am working on an open-source project which will rely on this library, and it would make it easier to see whether your build is passing or failing :)

I'd be happy to submit a PR for the .travis.yml file!

https://travis-ci.org/

Eliminate display:none tags

Hi there, I have a request that shouldn't be too hard to implement, though a proper full-CSS solution might require some trickery.

If you look at the code of http://www.readwriteweb.com/start/2012/04/what-do-angels-want.php you'll see tags like <div style="display:none;" id="disqus_post_message">...</div> in there, which are clearly hidden and shouldn't come up in the summary. However, in the current version of the library they do (and the display:none is stripped out in summary()).

Interestingly, in the bookmarklet version of readability these things don't show up, not even as display:none, so maybe there is a fix there already?

Thanks so much for looking into it.

Cheers
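A pre-processing sketch, not an existing library option: drop inline display:none subtrees with lxml before handing the page to readability. The XPath is an assumption and only catches inline styles, not rules from stylesheets:

import lxml.html

doc = lxml.html.document_fromstring(raw_html)   # raw_html is the fetched page (illustrative)
# remove any element whose inline style contains display:none (spaces stripped first)
for el in doc.xpath('//*[contains(translate(@style, " ", ""), "display:none")]'):
    el.drop_tree()
cleaned = lxml.html.tostring(doc)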

Aggressively removes images

Between 0.3.0.6 and the current release, python-readability aggressively removes all images embedded in the HTML. There doesn't seem to be a way to control this behaviour.

Require older lxml version for OSX compatibility

Currently, the package does not work on OSX, if installed directly via pip, because "pip install lxml" does not currently work on OSX.

I don't know how many programmer-hours have been lost to the frustrations of installing lxml on OSX, due to OSX shipping outdated libxml libraries. StackOverflow is littered with questions about it, and even the solutions that worked for other people didn't work for me.

What did work for me was simply installing an older lxml. If you add a requirements.txt at the root of this package and pin the lxml dependency, e.g.

lxml==2.3

the package will install successfully on OSX. It works up to lxml 2.3.5 and doesn't work with lxml 3.0. I'm not sure what the oldest version your package will work with is; I haven't tested that, sorry.

P.S. Thanks for your work on this package. If this wasn't here, I'd probably have written it myself.

P.P.S. After spending hours on this stupid, stupid problem, let me just say: **** developing on OSX sucks.

Warning in htmls.py

Warning (from warnings module):
File "c:\Python27\egg\Lib\site-packages\readability\htmls.py", line 60
if not title or not title.text:
FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.

Update: replace

if not title or not title.text:

with:

if title is None or not title.text:

(Testing the element's truthiness is what triggers the FutureWarning; the explicit is None test preserves the original behaviour.)

How about using path weight instead of class weight?

Hi @buriy, are you awake? It seems I always post just when you've gone to sleep; damn timezones...

OK, I've come up with an idea:

List every node's absolute path, including class and id, and calculate a path score using positiveRe and negativeRe.
For example:

body/div/p[1]
body/div/div[1].article/p[1]
body/div/div[1].article/div.comment/p
body/div/div[1].article/div.content/p
body/div/div[1].article/div/p
body/div/div[2]/div.comment/p.content

How to calculate the score:

  1. search each path from right to left,
  2. sum all (distance * distance_punishment * negative or positive weight)

where distance is the current node's depth minus the depth of the ancestor matching negativeRe or positiveRe.
PS: A bigger distance needs a bigger punishment, I think. The correct parameter is hard to determine now, but I think this approach can work.

Let's set
positiveRe weight to 25,
negativeRe weight to -25,
distance_punishment to 1


Example:

body/div/div[1].article/p[1]

score: 1 (p[1]'s depth minus div[1].article's depth) * 25 = 25

body/div/div[1].article/div.comment/p
(div[1].article matches positiveRe, div.comment matches negativeRe)

score: 1 (p's depth minus div.comment's depth) * -25 + 1/2 * 25 = -12.5

body/div/div[2]/div.comment/p.content

score: 1 * 25 + 1 * -25 = 0 (the distance from p to itself is 1; I forgot to define that above...)


Maybe this approach would give an element with a comment class a positive score. But you can imagine that a comment can't be the only element under an article;

for example, it would have other siblings like:

body/div/div[1].article/div.content/p
body/div/div[1].article/div/p

whose scores must be higher than the comment's.

However, what if the article only has a single child node, with class comment and a positive score? Congratulations! That just means the site's naming has big problems (abusing English words), but we still found a good candidate! (At least a more confident candidate, if they haven't misunderstood what "article" means.)

Also, as far as I can see, the current approach does not give a score to HTML5 tags such as article; this approach covers that. Calculating node scores from deep to shallow would work well with this, too.

But due to a lack of data, I can't test it now...

No Title for most articles

Is there a known problem that there are no titles for most articles on the internet? When I try "python -m readability.readability -u <url>" on popular news sites, I don't get any headings.

Using AdBlock rules to remove elements

AdBlock Plus element-hiding rules specify elements to exclude and are written as CSS selectors. This is easy to implement in lxml, if somewhat slow.

I'm using this in my own code to automatically remove social-media share links from pages. You may want to consider including something similar in python-readability.

EasyList is dual licensed Creative Commons Attribution-ShareAlike 3.0 Unported and GNU General Public License version 3. CC-BY-SA looks compatible with Apache licensed projects.

Example

First download the rules:

$ wget https://easylist-downloads.adblockplus.org/fanboy-annoyance.txt

Then you can simply extract the CSS selectors to match against a document tree.

from lxml import html
from lxml.cssselect import CSSSelector

RULES_PATH = 'fanboy-annoyance.txt'
with open(RULES_PATH, 'r') as f:
    lines = f.read().splitlines()

# get elemhide rules (prefixed by ##) and create a CSSSelector for each of them
rules = [CSSSelector(line[2:]) for line in lines if line[:2] == '##']

def remove_ads(tree):
    for rule in rules:
        for matched in rule(tree):
            matched.getparent().remove(matched)

doc = html.document_fromstring("<html>...</html>")
remove_ads(doc)

Can't get correct article content if content is in a table. Bug?

I have some URLs on the same site.

Can't get the correct content (with a table):
http://ecp.sgcc.com.cn/html/project/014002007/9990000000010135023.html
http://ecp.sgcc.com.cn/html/project/014001001/9900000000003560149.html

Works fine if there's no table:
http://ecp.sgcc.com.cn/html/news/014001008/7900000000000045486.html

Maybe a bug?

Update:
I used print doc.score_paragraphs() to see what happens:

{ < Element table at 0x50fb180 > : {
        'content_score': 1.95,
        'elem': < Element table at 0x50fb180 >
    },
    < Element tr at 0x5024b10 > : {
        'content_score': 2.0,
        'elem': < Element tr at 0x5024b10 >
    },
    < Element tr at 0x50fb300 > : {
        'content_score': 2.0,
        'elem': < Element tr at 0x50fb300 >
    }
}

I think tr and td tags should not be treated as candidates; besides, since they are under a table, they should instead add some score to their parent table.

Erratic <p> insertion in Macrumors article

Macrumors uses <br> tags to separate their paragraphs. Readability attempts to insert <p> tags into the article, but the results are not what you'd expect.

For example, this article: http://www.macrumors.com/2012/10/12/apples-ipad-mini-media-event-reportedly-scheduled-for-october-23/

results in this output for Document(content).summary():

<html><body><div><div class="content">
                        <a href="http://allthingsd.com/20121012/apple-likely-to-unveil-ipad-mini-at-october-23-event/">
<i>AllThingsD</i> reports</a><p> that Apple appears to be planning to hold a media event on Tuesday, October 23 to introduce 
the "iPad mini", Apple's smaller tablet device said to be carrying a display measuring 7.85 inches diagonally.</p>
<p class="quote">As AllThingsD reported in August, Apple will hold a special event this month at which it will showcase a new, 
smaller iPad. People familiar with Apple’s plans tell us that the company will unveil the so-called “iPad mini” on October 23 at an 
invitation-only event.<br/><br/>
That’s a Tuesday, not a Wednesday, so this is a bit of a break with recent tradition. It also happens to be just three days prior to 
the street date for Microsoft’s new Surface tablet.</p><center>
<img src="http://cdn.macrumors.com/article-new/2012/09/ipadmini_small.jpg"/><br/><i>Physical mockup of rumored iPad mini design</i></center><p>
The location of the event is unconfirmed, but the report suggests that it is likely to be held at the company's Town Hall 
auditorium at its corporate headquarters in Cupertino, California.</p><i>AllThingsD</i><p> has an excellent track record 
regarding Apple media event rumors, giving this claim a high probability of proving true.  Given past history, Apple would be 
expected to send out invitations early next week if the event is to be held on October 23.</p><b>Update</b><p>: </p><em>The 
Loop</em><p>'s Jim Dalrymple weighs in, </p>
<a href="http://www.loopinsight.com/2012/10/12/apples-rumored-oct-23-ipad-mini-event/">confirming the date</a><p> with a "Yep."
                                        </p><p/>
                    </div>
                    </div></body></html>

0.2.4 uninstallable .egg uploaded to PyPI

The latest package isn't installable from PyPI as it's a .egg. Previous versions appear to have been .zip files.

I've always just uploaded with setup.py sdist upload; I'm not sure how this one was set up.

Any research work for extracting the main body of pages?

Hi,

I'm new to this. I tested python-readability and it works most of the time. I have not yet studied the code in detail; from a quick scan, it looks like heuristic scoring of elements. Are there any documents explaining the scoring strategies, e.g. a quantitative study of how well those heuristics work, or formal datasets for comparing the extraction performance of different algorithms?

It would be highly appreciated if someone could point me to good recent research.

P.S. A funny case: it cannot extract the README part (which I personally consider to be the main body) of this project's own homepage.

Detect whether page is suitable for applying readability

Hi.

I'd like to know whether it's possible for python-readability to indicate whether the page in question is suitable for applying readability.

For example, consider the New York Times homepage. It's not supposed to be parsed by python-readability.

Is it possible to check this?

Thanks.

Differences with Goose

Hi, can I ask what are the differences with python-goose?
https://github.com/grangier/python-goose

Or, to put it another way: why did you decide to resurrect python-readability instead of investing in Goose?

It's a genuine question; I'm evaluating content-extraction frameworks and trying to decide which one to use. So far I prefer Goose, but I'm trying to understand whether I missed something. Thank you in advance!

Pypi not up-to-date

Could there be a fresh version of readability on PyPI which includes fdba8d9?

I've just hit this bug, and whilst fixing it isn't too much of a pain, it'd be good for people pulling fresh from PyPI not to hit it too.

Cheers!

Crash when parsing articles with invalid link "http://["

It seems readability crashes on links with an extra "[" in the URL. Here's an example page: http://www.theguardian.com/film/2014/apr/24/the-hurricane-rubin-carter-denzel-washington

<a href="http://[http://www.theguardian.com/film/filmblog/2013/may/09/raging-bull-reel-history-martin-scorsese" title="">Raging Bull</a>

Here's the stacktrace:

Traceback (most recent call last):
  File "manage.py", line 14, in <module>
    execute_from_command_line(sys.argv)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\django\core\management\base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "C:\Emils\Projects\presskoll\presskoll\webhook\management\commands\extract_articles.py", line 69, in handle
    parse_article(rawarticle, overwrite=options["overwrite"], DEBUG=DEBUG)
  File "C:\Emils\Projects\presskoll\presskoll\webhook\article_parser.py", line 59, in parse_article
    title, body = title_and_body_from_article(rawarticle)
  File "C:\Emils\Projects\presskoll\presskoll\webhook\content_util.py", line 393, in title_and_body_from_article
    doc = document._html(True)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\readability\readability.py", line 119, in _html
    self.html = self._parse(self.input)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\readability\readability.py", line 127, in _parse
    doc.make_links_absolute(base_href, resolve_base_href=True)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\__init__.py", line 340, in make_links_absolute
    self.rewrite_links(link_repl)
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\__init__.py", line 469, in rewrite_links
    new_link = link_repl_func(link.strip())
  File "C:\Users\Admin\Envs\presskoll\lib\site-packages\lxml\html\__init__.py", line 335, in link_repl
    return urljoin(base_url, href)
  File "C:\Program Files\Python27\Lib\urlparse.py", line 260, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "C:\Program Files\Python27\Lib\urlparse.py", line 142, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "C:\Program Files\Python27\Lib\urlparse.py", line 190, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL

I suggest this error be caught and discarded.
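Newer lxml versions let the make_links_absolute call shown in the traceback do exactly that; a sketch (availability of the handle_failures parameter depends on the lxml version):

# drop links that cannot be resolved instead of raising ValueError
doc.make_links_absolute(base_href, resolve_base_href=True, handle_failures='discard')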

Error in install and import

Hi,
I installed readability, but when I import readability it doesn't work:

from readability.readability import Document
Traceback (most recent call last):
File "", line 1, in
File "readability.py", line 1, in
from readability.readability import Document
ImportError: No module named readability

But I can use the command line:
python -m readability.readability -u http://pypi.python.org/pypi/readability-lxml
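The frame File "readability.py", line 1 in the traceback suggests the script doing the import is itself named readability.py, which shadows the installed package; renaming the local file (and removing its compiled .pyc) should fix the import:

$ mv readability.py my_script.py   # any name other than "readability" works
$ rm -f readability.pyc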
