Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
You can install this library directly from its GitHub repository with the following command:
pip install git+ssh://git@github.com/derlin/boilerpipe3@master
Make sure JAVA_HOME is set correctly, since jpype depends on this setting.
The constructor takes a keyword argument extractor, which must be one of the available boilerpipe extractor types:
- DefaultExtractor
- ArticleExtractor
- ArticleSentencesExtractor
- KeepEverythingExtractor
- KeepEverythingWithMinKWordsExtractor
- LargestContentExtractor
- NumWordsRulesExtractor
- CanolaExtractor
If no extractor is passed, the DefaultExtractor is used.
from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor')
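Because the extractor type is passed as a plain string, a misspelled name is easy to miss. The following is a minimal sketch of a pre-check, not part of boilerpipe itself: the helper name check_extractor_name and the AVAILABLE_EXTRACTORS tuple are hypothetical, and the names are simply copied from the list above.

```python
# Extractor names copied from the list above. This helper is a
# hypothetical convenience and is NOT provided by boilerpipe.
AVAILABLE_EXTRACTORS = (
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
)


def check_extractor_name(name=None):
    """Return a known extractor name, falling back to DefaultExtractor."""
    if name is None:
        return "DefaultExtractor"
    if name not in AVAILABLE_EXTRACTORS:
        raise ValueError(
            f"unknown extractor {name!r}; expected one of: "
            + ", ".join(AVAILABLE_EXTRACTORS)
        )
    return name
```

The checked name can then be passed on, for example Extractor(extractor=check_extractor_name('ArticleExtractor')).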
Once you have an extractor instance, extract the relevant content using one of getText, getHTML, getTextBlocks, or getImages. Each one accepts one of the following arguments:
- url: the URL of the page
- html: an HTML string to parse
- processed: the (source, data) pair returned by the get method
Example:
extracted_text = extractor.getText(url=your_url)
extracted_html = extractor.getHTML(url=your_url)
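The wording above suggests that url, html, and processed are alternatives, with one of them given per call. As a rough sketch under that assumption, a small hypothetical helper (source_kwargs is not part of boilerpipe) can enforce this before the call is made:

```python
def source_kwargs(url=None, html=None, processed=None):
    """Return a dict holding exactly one of url / html / processed.

    Hypothetical helper: it assumes the three arguments are mutually
    exclusive, which the description suggests but does not state outright.
    """
    candidates = {"url": url, "html": html, "processed": processed}
    given = {key: value for key, value in candidates.items() if value is not None}
    if len(given) != 1:
        raise ValueError(
            "pass exactly one of url, html or processed; got: "
            + (", ".join(sorted(given)) or "nothing")
        )
    return given
```

The result can be unpacked into any of the extraction methods, for example extractor.getText(**source_kwargs(html=page_html)), where page_html is a string you already hold.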
If you need several kinds of output for the same page, you can save computation time by downloading and processing it once:
processed = extractor.get(url=url) # download and process once
text = extractor.getText(processed=processed)
text_blocks = extractor.getTextBlocks(processed=processed)
html = extractor.getHTML(processed=processed)
images = extractor.getImages(processed=processed)
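The download-once pattern can be wrapped in a small function. This is only a sketch: extract_all is a hypothetical name, and it relies solely on the get, getText, getTextBlocks, getHTML, and getImages calls shown in this README.

```python
def extract_all(extractor, url):
    """Download and process a page once, then collect every output.

    `extractor` is expected to be an Extractor instance (or any object
    exposing the same methods). Hypothetical helper, not part of boilerpipe.
    """
    processed = extractor.get(url=url)  # the only call that touches the network
    return {
        "text": extractor.getText(processed=processed),
        "text_blocks": extractor.getTextBlocks(processed=processed),
        "html": extractor.getHTML(processed=processed),
        "images": extractor.getImages(processed=processed),
    }
```

Calling extract_all(extractor, your_url) returns a dictionary with the keys text, text_blocks, html, and images, so the page is fetched a single time regardless of how many outputs you use.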