
rodricios / eatiht

Stars: 436 · Watchers: 17 · Forks: 43 · Size: 9.65 MB

An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.

Home Page: http://rodricios.github.io/eatiht

License: MIT License

Python 8.21% HTML 91.79%


eatiht's Issues

Extractor misses on certain congressional sites

First, thank you for this library - it's really useful and an impressive achievement. I'll try digging into the code to see if I can't pinpoint where this happens, but I wanted to bring it to your attention. For some congressional sites (example), eatiht extracts "Non-breaking space within span tags -   - is required for WYSIWYG." as the text from the page.

Sentence termination ("Mr.", etc.)

Great start. I've been maintaining https://github.com/rcarmo/soup-strainer for a bit, and am going to have a go at testing this, but I've spotted that the regular expression you use for capturing abbreviations will be tricky to use with Portuguese (where we have "Sra" and a few other abbreviations).

Rather than keep adding stuff to the regular expression, why not assume a character length limit and rewrite the expression accordingly?
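
A length-based rule could look something like the sketch below (a rough illustration only, not eatiht's actual expression): split on a terminator only when the token before it is longer than a few characters, so short tokens like "Mr." or "Sra." are treated as abbreviations regardless of language.

import re

# Rough sketch, not eatiht's actual regex: treat ".", "!" or "?" as a sentence
# boundary only when the token before it is longer than ABBREV_MAX characters,
# so short abbreviation-like tokens ("Mr.", "Sra.") never trigger a split.
ABBREV_MAX = 3

SENTENCE_END = re.compile(r'(?<=\w{%d}[.!?])\s+' % (ABBREV_MAX + 1))

def split_sentences(text):
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

print(split_sentences("Mr. Smith went to Washington. He arrived late."))
# -> ['Mr. Smith went to Washington.', 'He arrived late.']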

"list index out of range"

Hey there! This package is amazing.

I'm writing a basic function (which I can add you as a collaborator to, if you'd like) that pulls text from a webpage: https://api.blockspring.com/bs/get-text-from-url

The function ends up getting used by a lot of other projects. So I'm using v2.extract(url) now to pull out the main text from the page, but for some pages I get errors. For instance, today for yahoo.com I get "list index out of range". Do you know why that might be occurring?

here's the full error:

line 234, in extract
    target_tnodes = [tnode for par, tnode in pars_tnodes if hist[0][0] in par]
IndexError: list index out of range

UPDATE: I retried it just now and there was no error. Strange!
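
For what it's worth, the traceback suggests hist (presumably the xpath frequency distribution) came back empty for that particular fetch, so hist[0][0] blew up. A minimal defensive check around the failing line might look like this (a sketch against the names in the traceback, not a tested patch):

# Sketch only, using the names visible in the traceback (hist appears to be a
# list of (xpath, frequency) pairs). Fail with a clear message when no
# candidate text nodes were found, instead of raising a bare IndexError.
if not hist:
    raise ValueError("no candidate text nodes found in this page")

target_tnodes = [tnode for par, tnode in pars_tnodes if hist[0][0] in par]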

Environment Issues

import eatiht works only from inside the git folder. Otherwise it tells me it couldn't resolve 'urllib2'. Installed with pip and pip3. It can only run on Python 2.x...

What is the fix?
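
The 'urllib2' error is the usual Python 2 vs. 3 incompatibility: urllib2 was renamed in Python 3. The standard shim looks like the sketch below; whether eatiht adopts this pattern or another approach is a separate question.

# Minimal sketch of the usual compatibility shim; this is the generic pattern,
# not necessarily how eatiht will resolve it.
try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

html = urlopen("http://example.com").read()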

Only english language?

Hi, I found your library really interesting. I need to obtain the article content from web pages that may be written in different languages, mostly English and Italian. Unfortunately, when I try to analyze Italian pages, I get encoding problems:
"UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 4: character maps to <undefined>"
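
The 'charmap' error usually comes from printing or writing the extracted unicode text to a console or file whose default encoding cannot represent characters like u'\u2019', rather than from the extraction itself. A workaround sketch (assuming extract() returns a unicode string, as its output elsewhere in this thread suggests; the URL is a placeholder):

# Sketch (Python 2): write the extracted unicode text out as UTF-8 explicitly
# instead of relying on the platform's default encoding.
import io
import eatiht

text = eatiht.extract("http://example.it/articolo")   # placeholder URL

with io.open("articolo.txt", "w", encoding="utf-8") as fh:
    fh.write(text)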

Extractor seems to include a lot of js and in-line code?

Really enjoying eatiht, it's almost perfect for me.

I'm using it to process urls and send the text to a topic modeler. The output is admirably clean except for in-line code (which seems like it might be easy to detect?).

Here's a sample output that includes well-extracted text and then code:

In the best of worlds, we find an exact match d_{t'} and we use
d_{t' + delta} (suppose that t' + delta  
In practice, of course, prediction is much more complicated.  See
    [Sauer, 1994]  in 
  [Gershenfeld and
Weigend, 1993])  and the discussion in 

DelayCoordinateEmbedding.m  for more details.


[The output then contains several kilobytes of minified MediaWiki JavaScript - mw.loader.implement blocks for ext.vector.collapsibleNav, ext.vector.collapsibleTabs and ext.vector.simpleSearch - plus base64-encoded CSS and a resourceloader cache-key comment, all pulled from the scholarpedia.org page into the extracted text.]


The most important phase space reconstruction technique is the  method of
delays . Vectors in a new space, the embedding space, are formed from time
delayed values of the scalar measurements:

The number  m  of elements is called the  embedding dimension , the time
  is generally referred to as the  delay  or  lag .  Celebrated
embedding theorems by Takens[ 21 ] and by Sauer et al.[ 22 ]
state that if the sequence   does indeed consist of scalar measurements
of the state of a dynamical system, then under certain genericity assumptions,
the time delay embedding provides a one-to-one image of the original set
 , provided  m  is large enough.

...I think I could get rid of the scripting/coding parts with some hacking, but wanted to bring this issue up here in case it was helpful to know, or in case I'm missing an obvious solution ;-)

Thanks for any help you can provide, and thanks more for making this awesome repo!

AKA
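
One workaround while this is open: since the stray output is mostly <script>/<style> content that survived into text nodes, the document can be pre-cleaned with lxml.html.clean before it ever reaches eatiht. A sketch (it assumes extract() accepts a local file path, as the command-line usage elsewhere in this thread suggests; the URL is a placeholder):

# Sketch of a pre-cleaning step: strip <script>/<style>/comment subtrees with
# lxml.html.clean, re-serialize, then run eatiht on the cleaned copy.
import tempfile

import lxml.html
from lxml.html.clean import Cleaner
import eatiht

url = "http://example.com/article-with-inline-js"   # placeholder URL

doc = lxml.html.parse(url).getroot()
cleaner = Cleaner(scripts=True, javascript=True, style=True, comments=True)
cleaned = lxml.html.tostring(cleaner.clean_html(doc))

with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(cleaned)

print(eatiht.extract(tmp.name))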

Swapping out hard-coded xpath for something a bit more robust = "targeted extraction"

From this comment, the poster suggested I look into using the lxml.html.clean module for cleaning out unnecessary subtrees in the HTML.

Aside from that, I'm considering implementing an extra tiny module, for modularity's sake, that will "build" the xpath from optional inputs to some wrapper function/class - possibly by adding extra args to the extract function, for example.

The hope is that one can use eatiht not only to extract the main text, but also to extract, say, the ads - presumably for the more masochistically inclined (just kidding, you data scientists out there). Whether these sorts of "targeted" extractions will happen concurrently is undecided; probably not in the first version of the feature release.
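
Purely as a strawman for discussion, a "targeted extraction" wrapper might expose the xpath-building step roughly like this (none of these parameters exist in eatiht today; names and defaults are invented for illustration):

# Hypothetical sketch of a configurable xpath builder.
def build_xpath(tags=("p", "div"),
                min_text_length=20,
                exclude_classes=("ad", "sidebar")):
    tag_filter = " or ".join("self::%s" % t for t in tags)
    class_filter = " and ".join(
        'not(contains(@class, "%s"))' % c for c in exclude_classes
    )
    return ('//*[(%s) and %s]'
            '/text()[string-length(normalize-space()) > %d]'
            % (tag_filter, class_filter, min_text_length))

print(build_xpath())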

Endings with '?' and double quotes cause output errors

I'm opening up an issue on behalf of an email I received from a concerned user.

A fix is on its way (literally right after I publish this). Please stay tuned for the update, and I apologize for this ridiculously simple yet breaking mistake.

P.S. If you're up for a laugh, check out the mistake located here

Information about new algorithm and submodules

There's a bit to be said about the v2 of the algorithm.

First, read the docstrings of the new modules; that will have an explanation of the changes to the algorithm. A more hands-on writeup will be coming.

Second, I'm hesitant about how to "briefly" describe the algorithm. If I continue to say "improvement," which I have in previous updates to the README, I will be shooting myself in the foot; the changes to the algorithm may not be noticeable, and they have the potential to produce worse results:

  • more unnecessary text or headlines within articles
  • on Wikipedia pages, where eatiht.py would do fairly well at extracting just the main article, etv2 and eatiht_v2 will also extract the sources of said articles - although this effect can be explained by the exclusion of the "bandaid" code still present in eatiht.py.
  • more and more

But whatever it loses in accuracy, it can make up for in consistency. For one, there are no more regex splits, nor checks on sentence endings. That was a bandaid fix in the original algorithm for, honestly, very few cases.

As I said earlier, I will be writing a follow-up article where I clearly describe what this algorithm does at each critical step. This will, hopefully, be a learning tool for you - if you've never basked in the uncertainty of statistical, alright, machine learning algorithms - and for me, because explaining v2 of the text-extraction algorithm will be only my second time writing about these sorts of things.

That said, any insight, tips, a "hey, this step here is actually a hyper-maximization-optimization problem with two second-derivative look-ahead steps in the 4th-level of Gondor" is more help than you think.

For example, it wasn't until an email that I was made aware that this algorithm was a type of unsupervised classification algorithm.

As for feature requests, please do share especially now that I've sort of laid out a prototype of the class that represents the "state space".

Thanks for keeping up with this package!

Updating text extraction algorithm

I'm making progress on updating the text extraction algorithm. It's going to be in the same spirit as the original, in that I will not be using any external library (unless it is empirically proven to improve performance).

The explanation and justification will be largely intuitive, and I will provide a thorough walk-through of the algorithm as I did with the original. I have heard back from Tim Weninger (the author I mention in the readme) and he's given me the virtual equivalent of "two thumbs up" and some very inspiring remarks :)

If there are any questions, comments, or requests, please don't be afraid to email :)

UnicodeDecodeError: -> crash

I get the following killer error while trying to extract articles from this German website:

Traceback (most recent call last):
  File "extract.py", line 6, in <module>
    tree = etv2.extract(url)
  File "/usr/local/lib/python2.7/dist-packages/eatiht/etv2.py", line 214, in ext                                        ract
    subtrees = get_textnode_subtrees(html_tree)
  File "/usr/local/lib/python2.7/dist-packages/eatiht/etv2.py", line 198, in get                                        _textnode_subtrees
    for n in nodes_with_text]
  File "lxml.etree.pyx", line 1498, in lxml.etree._Element.xpath (src/lxml/lxml.                                        etree.c:52102)
  File "xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/                                        lxml/lxml.etree.c:151941)
  File "xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (                                        src/lxml/lxml.etree.c:150939)
  File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src/lxml/lx                                        ml.etree.c:145771)
  File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src/lxml/                                        lxml.etree.c:146126)
  File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src/lxml/l                                        xml.etree.c:146323)
  File "extensions.pxi", line 786, in lxml.etree._buildElementStringResult (src/                                        lxml/lxml.etree.c:147530)
  File "apihelpers.pxi", line 1371, in lxml.etree.funicode (src/lxml/lxml.etree.                                        c:26844)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 24: invalid                                         start byte

Help would be appreciated

Christian
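
For anyone hitting this in the meantime: byte 0xfc is "ü" in ISO-8859-1, so the page most likely contains Latin-1 bytes that are being decoded as UTF-8. A workaround sketch is to fetch the page yourself, detect the real encoding with chardet (already an eatiht dependency), re-encode to UTF-8 and extract from the re-encoded copy. This assumes extract() accepts a local file path, and if the page declares a conflicting charset in a <meta> tag, that may also need rewriting; the URL is a placeholder.

# Workaround sketch (Python 2): detect the page's real encoding with chardet,
# re-encode to UTF-8, and run eatiht on the re-encoded copy.
import urllib2
import tempfile

import chardet
import eatiht

raw = urllib2.urlopen("http://example.de/artikel").read()
encoding = chardet.detect(raw)["encoding"] or "utf-8"
utf8_html = raw.decode(encoding, "replace").encode("utf-8")

with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(utf8_html)

print(eatiht.extract(tmp.name))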

Access to HTML of Article

Would be very helpful to be able to access the html of the article; not only can one extract the text from the html but it would also avoid losing any context present in the formatting.
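
As a rough illustration of what this would buy you: once the XPath of the winning subtree is known, serializing that subtree back to HTML is a single lxml call. The sketch below uses plain lxml with a stand-in XPath and placeholder URL; it is not an existing eatiht API.

# Sketch with plain lxml: given the xpath of the selected subtree, return its
# HTML rather than (or in addition to) its text. "//article" is a stand-in for
# whatever xpath the frequency distribution would actually pick.
import lxml.html
from lxml.etree import tostring

tree = lxml.html.parse("http://example.com/article")   # placeholder URL
nodes = tree.xpath("//article")
if nodes:
    article_html = tostring(nodes[0], pretty_print=True, encoding="unicode")
    print(article_html)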

Weird Encoding Problems

I have an article that is parsed to this:

 DFKI ist Gründungsmitglied der EU Big Data Value PPP — DFKI

EU-Kommissarin Neelie Kroes und Jan Sundelin, Präsident von Big Data Value Association, unterzeichneten gestern eine Vereinbarung zur Einrichtung einer öffentlich.....

In the title the ü is properly encoded; in the rest of the content, however, it fails to encode the special characters properly (e.g. ä -> ä). The weird part is that it works fine in the first bit and fails in the rest of the article, even though the characters are exactly the same in the original article. (http://www.dfki.de/web/presse/pressemitteilungen_intern/2014/dfki-ist-grundungsmitglied-der-eu-big-data-value-ppp)

Cheers
Christian
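
The second form of the character is the classic symptom of UTF-8 bytes being decoded with a single-byte codec (Latin-1/cp1252) somewhere along the way. Where exactly that happens inside eatiht isn't clear from the report, but the round trip that repairs such a string looks like this (illustration only):

# Mojibake repair sketch: if UTF-8 bytes were decoded as Latin-1, re-encoding
# with Latin-1 and decoding as UTF-8 recovers the original characters.
mangled = u"Pr\xc3\xa4sident"                      # "Präsident" after the bad decode
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired)                                    # Präsident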

Doesn't seem to work on Mac OS

Error when trying to parse:

Buzut$ eatiht /Volumes/Storage/analytics/ENI\ Training\ -\ Livre\ Numérique.html 
Traceback (most recent call last):
  File "/usr/local/bin/eatiht", line 5, in <module>
    pkg_resources.run_script('eatiht==0.1.14', 'eatiht')
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/pkg_resources.py", line 442, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/pkg_resources.py", line 1160, in run_script
    execfile(script_filename, namespace, namespace)
  File "/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg/EGG-INFO/scripts/eatiht", line 3, in <module>
    import eatiht
  File "/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg/eatiht/__init__.py", line 101, in <module>
    from .eatiht import extract, get_sentence_xpath_tuples, get_xpath_frequencydistribution
  File "/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg/eatiht/eatiht.py", line 77, in <module>
    from collections import Counter
ImportError: cannot import name Counter

Install output:

Buzut$ sudo python setup.py install
Password:
running install
running bdist_egg
running egg_info
writing requirements to eatiht.egg-info/requires.txt
writing eatiht.egg-info/PKG-INFO
writing top-level names to eatiht.egg-info/top_level.txt
writing dependency_links to eatiht.egg-info/dependency_links.txt
reading manifest file 'eatiht.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'eatiht.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.6-universal/egg
running install_lib
running build_py
creating build/bdist.macosx-10.6-universal/egg
creating build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/__init__.py -> build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/eatiht.py -> build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/eatiht_trees.py -> build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/etv2.py -> build/bdist.macosx-10.6-universal/egg/eatiht
creating build/bdist.macosx-10.6-universal/egg/eatiht/tests
creating build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/__init__.py -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/foo1.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/full_of_foos.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/google_wiki.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/regex_dot_endings.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/regex_various_endings.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/v2.py -> build/bdist.macosx-10.6-universal/egg/eatiht
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/eatiht.py to eatiht.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/eatiht_trees.py to eatiht_trees.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/etv2.py to etv2.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/v2.py to v2.pyc
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO
installing scripts to build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
copying build/scripts-2.6/eatiht -> build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
changing mode of build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts/eatiht to 755
copying eatiht.egg-info/PKG-INFO -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/SOURCES.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/dependency_links.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/not-zip-safe -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/requires.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/top_level.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
creating 'dist/eatiht-0.1.14-py2.6.egg' and adding 'build/bdist.macosx-10.6-universal/egg' to it
removing 'build/bdist.macosx-10.6-universal/egg' (and everything under it)
Processing eatiht-0.1.14-py2.6.egg
removing '/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg' (and everything under it)
creating /Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg
Extracting eatiht-0.1.14-py2.6.egg to /Library/Python/2.6/site-packages
eatiht 0.1.14 is already the active version in easy-install.pth
Installing eatiht script to /usr/local/bin

Installed /Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg
Processing dependencies for eatiht==0.1.14
Searching for chardet
Reading http://pypi.python.org/simple/chardet/
Best match: chardet 2.3.0
Downloading https://pypi.python.org/packages/source/c/chardet/chardet-2.3.0.tar.gz#md5=25274d664ccb5130adae08047416e1a8
Processing chardet-2.3.0.tar.gz
Running chardet-2.3.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-xcHwiW/chardet-2.3.0/egg-dist-tmp-l3cu59
warning: no files found matching 'COPYING'
warning: no files found matching '*.html' under directory 'docs'
warning: no files found matching '*.css' under directory 'docs'
warning: no files found matching '*.png' under directory 'docs'
warning: no files found matching '*.gif' under directory 'docs'
zip_safe flag not set; analyzing archive contents...
Adding chardet 2.3.0 to easy-install.pth file
Installing chardetect script to /usr/local/bin

Installed /Library/Python/2.6/site-packages/chardet-2.3.0-py2.6.egg
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Best match: lxml 3.4.2
Downloading https://pypi.python.org/packages/source/l/lxml/lxml-3.4.2.tar.gz#md5=429e5e771c4be0798923c04cb9739b4e
Processing lxml-3.4.2.tar.gz
Running lxml-3.4.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-QJVQPH/lxml-3.4.2/egg-dist-tmp-k9RYwn
Building lxml version 3.4.2.
Building without Cython.
Using build configuration of libxslt 1.1.24
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/dist.py:266: UserWarning: Unknown distribution option: 'bugtrack_url'
  warnings.warn(msg)
Adding lxml 3.4.2 to easy-install.pth file

Installed /Library/Python/2.6/site-packages/lxml-3.4.2-py2.6-macosx-10.6-universal.egg
Finished processing dependencies for eatiht==0.1.14
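
The root cause is that collections.Counter was added in Python 2.7, and the system Python used here is 2.6. The simplest fix is to run eatiht under Python 2.7 or later; a guarded import like the sketch below would also let the module load on 2.6, but it is not something eatiht currently ships.

# Sketch of a 2.6-compatible fallback for collections.Counter (counting and
# most_common only); upgrading to Python 2.7+ is the simpler fix.
try:
    from collections import Counter
except ImportError:
    from collections import defaultdict

    class Counter(defaultdict):
        def __init__(self, iterable=()):
            super(Counter, self).__init__(int)
            for item in iterable:
                self[item] += 1

        def most_common(self, n=None):
            ranked = sorted(self.items(), key=lambda kv: kv[1], reverse=True)
            return ranked if n is None else ranked[:n]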

Chrome/Opera extension coming soon.

A Chrome/Opera extension is in progress, with a "working" pre-alpha version. Email me if you'd like to play with it. I'm not posting the code on GitHub yet, because it's frightening to look at. I'd be interested in hearing people's feature requests.

Some things I'm planning on including are:

  • Clean-up current page.
  • Extract and save to .txt
  • Email just the article(?)

Possible additions:

  • Named entity recognition using open-sourced NLP engines (no, probably not nltk)
  • Save to other doc. types?
  • Anything else that might catch fire.

Doesn't seem to work on aspx pages

Rodric,

Great work, and thanks for sharing. I'm mainly trying to extract the main text from news links, and it seems to work on most sites except for aspx pages. On aspx pages it's only giving the meta-information, such as copyright info.

For example:
import eatiht

url='http://www.fool.com/investing/general/2015/08/12/this-startup-is-bigger-than-microsoft-corporation.aspx'

eatiht.extract(url)
Out[37]: u'\n Copyright, Trademark and Patent Information Terms of Use Please read our Terms and Conditions\n \xa9 1995 - 2015 The Motley Fool. All rights reserved. \n\n\n BATS data provided in real-time. NYSE, NASDAQ and NYSEMKT data delayed 15 minutes. Real-Time prices provided by BATS BZX. Market data provided by Interactive Data. Company fundamental data provided by Morningstar. Earnings Estimates, Analyst Ratings and Key Statistics provided by Zacks. SEC Filings and Insider Transactions provided by Edgar Online. Powered and implemented by Interactive Data Managed Solutions.
