rodricios / eatiht
An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.
Home Page: http://rodricios.github.io/eatiht
License: MIT License
First, thank you for this library - it's really useful and an impressive achievement. I'll try digging into the code to see if I can't pinpoint where this happens, but wanted to bring it to your attention. For some congressional sites (example), eatiht extracts "Non-breaking space within span tags - - is required for WYSIWYG." as the text from the page.
Issue brought up by @klvbdmh. Will be fixed in the next release.
Great start. I've been maintaining https://github.com/rcarmo/soup-strainer for a bit, and am going to have a go at testing this, but I've spotted that the regular expression you use for capturing abbreviations will be tricky to use with Portuguese (where we have "Sra" and a few other abbreviations).
Rather than keep adding stuff to the regular expression, why not assume a character length limit and rewrite the expression accordingly?
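A minimal sketch of what that length-limit rewrite could look like (the names `BOUNDARY` and `split_sentences` are illustrative, not eatiht's actual code): instead of enumerating known abbreviations, only treat a period as a sentence boundary when the word before it has at least five letters, so short forms like "Dr.", "Mrs." or the Portuguese "Sra." never trigger a split.

```python
import re

# Split only on whitespace preceded by a word of at least five letters
# and a period; short abbreviated words ("Sra.", "Dr.") are left alone.
BOUNDARY = re.compile(r'(?<=[A-Za-z]{5}\.)\s+')

def split_sentences(text):
    return [s.strip() for s in BOUNDARY.split(text) if s.strip()]
```

The fixed-width lookbehind keeps the pattern language-agnostic: no abbreviation list to maintain per locale, at the cost of occasionally missing a sentence break after a genuinely short final word.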
Great tool!
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
I am not getting the right contents from the URL. The above URL is used as an example in the NLTK book:
http://www.nltk.org/book_1ed/ch03.html
Hey there! This package is amazing.
I'm writing a basic function (which I can add you as a collaborator to, if you'd like) that pulls text from a webpage: https://api.blockspring.com/bs/get-text-from-url
The function ends up getting used by a lot of other projects. So I'm using v2.extract(url) now to pull out the main text from the page, but for some pages I get errors. For instance, today for yahoo.com I get "list index out of range". Do you know why that might be occurring?
here's the full error:
line 234, in extract
target_tnodes = [tnode for par, tnode in pars_tnodes if hist[0][0] in par]
IndexError: list index out of range
UPDATE: I retried it just now and there was no error. Strange!
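Since the failure was transient, a caller-side guard is one way to cope until the root cause is fixed. This wrapper is a hypothetical sketch, not part of eatiht's API: the IndexError at `hist[0][0]` fires when no candidate text nodes survive filtering, which can happen on a momentarily empty or script-heavy response.

```python
def safe_extract(extract, url, retries=2, fallback=""):
    # Retry a couple of times (the yahoo.com failure above resolved on a
    # retry), then return a fallback instead of propagating IndexError.
    for _ in range(retries):
        try:
            return extract(url)
        except IndexError:
            pass
    return fallback
```

Usage would be `safe_extract(v2.extract, "http://yahoo.com")` in place of the bare call.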
import eatiht
works only from inside the git folder. Otherwise it fails telling me it couldn't resolve 'urllib2'. I installed with both pip and pip3, and it can only run on Python 2.x.
What is the fix?
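If the module does `import urllib2` at import time, as the error suggests, the usual compatibility fix is a guarded import; this is a sketch of that pattern, since `urllib2` only exists on Python 2 and became `urllib.request` on Python 3:

```python
# Works on both interpreter lines: fall back to the Python 3 location
# when the Python 2 module is absent.
try:
    from urllib2 import urlopen            # Python 2
except ImportError:
    from urllib.request import urlopen     # Python 3
```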
Hi I found your library really interesting. I need to obtain the article content from web pages that may be written in different languages, mostly English and Italian. Unfortunately when I tried to analyze Italian pages, I have encoding problems:
"UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position
4: character maps to "
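That error happens when printing or writing through a stream whose default codec (often cp1252 on Windows) cannot represent u'\u2019', the right single quote common in Italian text. One caller-side workaround is to write the extracted text through an explicitly UTF-8 stream; `save_text` is a hypothetical helper name, not an eatiht function:

```python
import io

def save_text(text, path):
    # io.open with an explicit encoding sidesteps the platform's
    # default 'charmap' codec on both Python 2 and 3.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)
```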
Really enjoying eatiht, it's almost perfect for me.
I'm using it to process urls and send the text to a topic modeler. The output is admirably clean except for in-line code (which seems like it might be easy to detect?).
Here's a sample output that includes well-extracted text and then code:
In the best of worlds, we find an exact match d_{t'} and we use
d_{t' + delta} (suppose that t' + delta
In practice, of course, prediction is much more complicated. See
[Sauer, 1994] in
[Gershenfeld and
Weigend, 1993]) and the discussion in
DelayCoordinateEmbedding.m for more details.
mw.loader.implement("ext.vector.collapsibleNav",function($){(function($){var map={'ltr':{'opera':[['>=',9.6]],'konqueror':[['>=',4.0]],'blackberry':false,'ipod':false,'iphone':false,'ps3':false},'rtl':{'opera':[['>=',9.6]],'konqueror':[['>=',4.0]],'blackberry':false,'ipod':false,'iphone':false,'ps3':false}};if(!$.client.test(map)){return true;}var version=1;if(mediaWiki.config.get('wgCollapsibleNavForceNewVersion')){version=2;}else{if(mediaWiki.config.get('wgCollapsibleNavBucketTest')){version=$.cookie('vector-nav-pref-version');if(version==null){version=Math.round(Math.random()+1);$.cookie('vector-nav-pref-version',version,{'expires':30,'path':'/'});}}}if(version==2){var limit=5;var threshold=3;$('#p-lang ul').addClass('secondary').before('
');$('#p-lang-more h5').text(mw.usability.getMsg('vector-collapsiblenav-more'));$secondary.appendTo($('#p-lang-more div.body'));}$('#p-lang').addClass('persistent');}$('#mw-panel > div.portal:first').addClass('first persistent');$('#mw-panel').addClass('collapsible-nav');$('#mw-panel > div.portal:not(.persistent)').each(function(i){var id=$(this).attr('id');var state=$.cookie('vector-nav-'+id);if(state=='true'||(state==null&&i div.portal:not(.persistent) > h5');var tabIndex=$(document).lastTabIndex()+1;$('#searchInput').attr('tabindex',tabIndex++);$headings.each(function(){$(this).attr('tabindex',tabIndex++);});$('#mw-panel').delegate('div.portal:not(.persistent) > h5','keydown',function(event){if(event.which==13||event.which==32){toggle($(this));}}).delegate('div.portal:not(.persistent) > h5','mousedown',function(event){if(event.which!=3){toggle($(this));$(this).blur();}return false;});})(jQuery);;},{"all":
"#mw-panel.collapsible-nav div.portal{background-image:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAIwAAAABCAMAAAA7MLYKAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAEtQTFRF29vb2tra4ODg6urq5OTk4uLi6+vr7e3t7Ozs8PDw5+fn4+Pj4eHh3d3d39/f6Ojo5eXl6enp8fHx8/Pz8vLy7+/v3Nzc2dnZ2NjYnErj7QAAAD1JREFUeNq0wQUBACAMALDj7hf6JyUFGxzEnYhC9GaNPG1xVffGDErk/iCigLl1XV2xM49lfAxEaSM+AQYA9HMKuv4liFQAAAAASUVORK5CYII=);background-image:url(http://www.scholarpedia.org/w/extensions/Vector/modules/images/portal-break.png?2014-12-19T03:13:20Z)!ie;background-position:left top;background-repeat:no-repeat;padding:0.25em 0 !important;margin:-11px 9px 10px 11px}#mw-panel.collapsible-nav div.portal h5{color:#4D4D4D;font-weight:normal;background:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQBAMAAADt3eJSAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAA9QTFRFeXl53d3dmpqasbGx////GU0iEgAAAAV0Uk5T/////wD7tg5TAAAAK0lEQVQI12NwgQIG0hhCDAwMTCJAhqMCA4MiWEoIJABiOCooQhULi5BqMgB2bh4svs8t+QAAAABJRU5ErkJggg==) left center no-repeat;background:url(http://www.scholarpedia.org/w/extensions/Vector/modules/images/open.png?2014-12-19T03:13:20Z) left center no-repeat!ie;padding:4px 0 3px 1.5em;margin-bottom:0px}#mw-panel.collapsible-nav div.collapsed h5{color:#0645AD;background:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAAxQTFRF3d3deXl5////nZ2dQA6SoAAAAAN0Uk5T//8A18oNQQAAADNJREFUeNpiYEIDDMQKMKALMDOgCTDCRWACcBG4AEwEIcDITEAFuhnotmC4g4EEzwEEGAADqgHmQSPJKgAAAABJRU5ErkJggg==) left center no-repeat;background:url(http://www.scholarpedia.org/w/extensions/Vector/modules/images/closed-ltr.png?2014-12-19T03:13:20Z) left center no-repeat!ie;margin-bottom:0px}#mw-panel.collapsible-nav div h5:hover{cursor:pointer;text-decoration:none}#mw-panel.collapsible-nav div.collapsed h5:hover{text-decoration:underline}#mw-panel.collapsible-nav div.portal div.body{background:none 
!important;padding-top:0px;display:none}#mw-panel.collapsible-nav div.persistent div.body{display:block}#mw-panel.collapsible-nav div.first h5{display:none}#mw-panel.collapsible-nav div.persistent h5{background:none !important;padding-left:0.7em;cursor:default}#mw-panel.collapsible-nav div.portal div.body ul li{padding:0.25em 0}#mw-panel.collapsible-nav div.first{background-image:none;margin-top:0px}#mw-panel.collapsible-nav div.persistent div.body{margin-left:0.5em}\n\n/* cache key: wikidb:resourceloader:filter:minify-css:7:da8e8a773eccab12cc615654d05b8845 */\n"
},{"vector-collapsiblenav-more":"More languages"});mw.loader.implement("ext.vector.collapsibleTabs",function($){jQuery(function($){var rtl=$('body').is('.rtl');$.collapsibleTabs.moveToCollapsed=function(ele){var $moving=$(ele);var data=$.collapsibleTabs.getSettings($moving);if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.shifting=true;var target=data.collapsedContainer;$moving.css("position","relative").css((rtl?'left':'right'),0).animate({width:'1px'},"normal",function(){$(this).hide();$(' ').insertAfter(this);$(this).detach().prependTo(target).data('collapsibleTabsSettings',data);$(this).attr('style','display:list-item;');var data=$.collapsibleTabs.getSettings($(ele));if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.
shifting=false;$.collapsibleTabs.handleResize();});};$.collapsibleTabs.moveToExpanded=function(ele){var $moving=$(ele);var data=$.collapsibleTabs.getSettings($moving);if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.shifting=true;var $target=$(data.expandedContainer).find('span.placeholder:first');var expandedWidth=data.expandedWidth;$moving.css("position","relative").css((rtl?'right':'left'),0).css('width','1px');$target.replaceWith($moving.detach().css('width','1px').data('collapsibleTabsSettings',data).animate({width:expandedWidth+"px"},"normal",function(){$(this).attr('style','display:block;');var data=$.collapsibleTabs.getSettings($(this));if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.shifting=false;$.collapsibleTabs.handleResize();}));};$('#p-views ul').bind(
'beforeTabCollapse',function(){if($('#p-cactions').css('display')=='none'){$('#p-cactions').addClass('filledPortlet').removeClass('emptyPortlet').find('h5').css('width','1px').animate({'width':'26px'},390);}}).bind('beforeTabExpand',function(){if($('#p-cactions li').length==1){$('#p-cactions h5').animate({'width':'1px'},370,function(){$(this).attr('style','').parent().addClass('emptyPortlet').removeClass('filledPortlet');});}}).collapsibleTabs({expandCondition:function(eleWidth){if(rtl){return($('#right-navigation').position().left+$('#right-navigation').width()+1)$('#left-navigation').position().left;}else{return($('#left-navigation').position().left+$('#left-navigation').width())>$(
'#right-navigation').position().left;}}});});;},{},{});mw.loader.implement("ext.vector.simpleSearch",function($){jQuery(document).ready(function($){if($('#simpleSearch').length==0){return;}var map={'browsers':{'ltr':{'opera':[['>=',9.6]],'docomo':false,'blackberry':false,'ipod':false,'iphone':false},'rtl':{'opera':[['>=',9.6]],'docomo':false,'blackberry':false,'ipod':false,'iphone':false}}};if(!$.client.test(map)){return true;}if(window.os_MWSuggestDisable){window.os_MWSuggestDisable();}$('#simpleSearch > input#searchInput').attr('placeholder',mw.msg('vector-simplesearch-search')).placeholder();$('#searchInput, #searchInput2, #powerSearchText, #searchText').suggestions({fetch:function(query){var $this=$(this);if(query.length!==0){var request=$.ajax({url:mw.util.wikiScript('api'),data:{action:'opensearch',search:query,namespace:0,suggest:''},dataType:'json',success:function(data){if($.isArray(data)&&1 in data){$this.suggestions('suggestions',data[1]);}}});$this.data('request',request);}
},cancel:function(){var request=$(this).data('request');if(request&&$.isFunction(request.abort)){request.abort();$(this).removeData('request');}},result:{select:function($input){$input.closest('form').submit();}},delay:120,positionFromLeft:$('body').hasClass('rtl'),highlightInput:true}).bind('paste cut drop',function(e){$(this).trigger('keypress');});$('#searchInput').suggestions({result:{select:function($input){$input.closest('form').submit();}},special:{render:function(query){if($(this).children().length===0){$(this).show();var $label=$(' ',{'class':'special-label',text:mw.msg('vector-simplesearch-containing')}).appendTo($(this));var $query=$(' ',{'class':'special-query',text:query}).appendTo($(this));$query.autoEllipsis();}else{$(this).find('.special-query').empty().text(query).autoEllipsis();}},select:function($input){$input.closest('form').append($(' ',{type:'hidden',name:'fulltext',val:'1'}));$input.closest('form').submit();}},$region:$('#simpleSearch')}
);});;},{},{"vector-simplesearch-search":"Search","vector-simplesearch-containing":"containing..."});
/* cache key: wikidb:resourceloader:filter:minify-js:7:dac1b333490d761e05aeebe7d78de4c6 */
The most important phase space reconstruction technique is the method of
delays . Vectors in a new space, the embedding space, are formed from time
delayed values of the scalar measurements:
The number m of elements is called the embedding dimension , the time
is generally referred to as the delay or lag . Celebrated
embedding theorems by Takens[ 21 ] and by Sauer et al.[ 22 ]
state that if the sequence does indeed consist of scalar measurements
of the state of a dynamical system, then under certain genericity assumptions,
the time delay embedding provides a one-to-one image of the original set
, provided m is large enough.
...I think I could get rid of the scripting/coding parts with some hacking, but wanted to bring this issue up here in case it was helpful to know, or in case I'm missing an obvious solution ;-)
Thanks for any help you can provide, and thanks more for making this awesome repo!
AKA
From this comment, the poster suggested I look into using the lxml.html.clean module for cleaning out unnecessary subtrees in the HTML.
Aside from that, I'm considering implementing an extra tiny module, for modularity's sake, that will "build" the xpath from optional inputs to some wrapper function/class - possibly adding extra args to the extract function, for example.
The hope is that one can use eatiht not only to extract the main text, but also to extract, say, the ads - presumably for the more masochistically-inclined (just kidding, you data scientists out there). As for whether these sorts of "targeted" extractions will happen concurrently - probably not in the first version of this feature release.
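A rough sketch of that cleaning step, assuming lxml (which eatiht already depends on). Dropping script and style subtrees before extraction would also address the minified-JavaScript sample quoted above; `lxml.html.clean.Cleaner(scripts=True, style=True)` does this and more, and this is the hand-rolled minimum:

```python
from lxml import html

def drop_scripts(doc_string):
    # Remove <script> and <style> subtrees so their contents never
    # reach the text-node analysis.
    tree = html.fromstring(doc_string)
    for node in tree.xpath('//script | //style'):
        node.getparent().remove(node)
    return tree
```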
I'm opening up an issue on behalf of an email I received from a concerned user.
Fix is on its way (literally right after I publish this). Please stay tuned for update, and I apologize for this ridiculously simple yet breaking mistake.
P.S. If you're up for a laugh, check out the mistake located here
There's a bit to be said about the v2 of the algorithm.
First, read the docstrings of the new modules; that will have an explanation of the changes to the algorithm. A more hands-on writeup will be coming.
Second, I'm hesitant about how to "briefly" describe the algorithm. If I continue to say "improvement," which I have in previous updates to the README, I will be shooting myself in the foot; the changes to the algorithm may not be noticeable, and it has the potential to produce worse results:
But what it risks in potential for error, it can make up for in consistency. For one, there are no more regex splits, nor checks on sentence endings. That was a band-aid fix in the original algorithm for, honestly, very few cases.
As I said earlier, I will be writing a follow-up article where I clearly describe what this algorithm does at each critical step. This will, hopefully, be a learning tool for you - if you've never basked in the uncertainty of statistical, alright, machine learning algorithms - and for me, because explaining v2 of the text-extraction algorithm will be only my second time writing about these sorts of things.
That said, any insight, tips, a "hey, this step here is actually a hyper-maximization-optimization problem with two second-derivative look-ahead steps in the 4th-level of Gondor" is more help than you think.
For example, it wasn't until an email that I was made aware that this algorithm was a type of unsupervised classification algorithm.
As for feature requests, please do share - especially now that I've laid out a rough prototype of the class that represents the "state space".
Thanks for keeping up with this package!
I'm making progress in updating the text extraction algorithm. It's going to be in the same spirit as the original, in that I will not be using any external library (unless it is empirically proven to improve performance).
The explanation and justification will largely be intuitive and I will provide a thorough walk through of the algorithm as I did with the original. I have heard back from Tim Weninger (author I mention in the readme) and he's given me the virtual equivalent of "two thumbs up" and some very inspiring remarks :)
If there are any questions, comments, or requests, please don't be afraid to email :)
I get the following fatal error while trying to extract articles from this German website:
Traceback (most recent call last):
File "extract.py", line 6, in <module>
tree = etv2.extract(url)
File "/usr/local/lib/python2.7/dist-packages/eatiht/etv2.py", line 214, in ext ract
subtrees = get_textnode_subtrees(html_tree)
File "/usr/local/lib/python2.7/dist-packages/eatiht/etv2.py", line 198, in get _textnode_subtrees
for n in nodes_with_text]
File "lxml.etree.pyx", line 1498, in lxml.etree._Element.xpath (src/lxml/lxml. etree.c:52102)
File "xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/ lxml/lxml.etree.c:151941)
File "xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result ( src/lxml/lxml.etree.c:150939)
File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src/lxml/lx ml.etree.c:145771)
File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src/lxml/ lxml.etree.c:146126)
File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src/lxml/l xml.etree.c:146323)
File "extensions.pxi", line 786, in lxml.etree._buildElementStringResult (src/ lxml/lxml.etree.c:147530)
File "apihelpers.pxi", line 1371, in lxml.etree.funicode (src/lxml/lxml.etree. c:26844)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 24: invalid start byte
Help would be appreciated
Christian
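Byte 0xfc is 'ü' in ISO-8859-1, so the German page is almost certainly Latin-1 being fed to a UTF-8 decode (0xfc is an invalid UTF-8 start byte, exactly the error above). A caller-side sketch of the fix; `decode_page` is a hypothetical helper, and chardet (already an eatiht dependency) could sniff the encoding more robustly:

```python
def decode_page(raw_bytes):
    # Try UTF-8 first; fall back to Latin-1, which many German sites
    # still serve regardless of what their headers claim.
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("iso-8859-1")
```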
For example, BeautifulSoup has the soup.title.string API, which you can use to extract the title.
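Assuming lxml (already an eatiht dependency), the requested title accessor could be sketched like this; `extract_title` is an illustrative name, not an existing eatiht function:

```python
from lxml import html

def extract_title(doc_string):
    # Mirror BeautifulSoup's soup.title.string using the ElementTree
    # API that lxml elements expose.
    tree = html.fromstring(doc_string)
    title = tree.findtext('.//title')
    return title.strip() if title else None
```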
Would be very helpful to be able to access the html of the article; not only can one extract the text from the html but it would also avoid losing any context present in the formatting.
Whatever article is passed into the algorithm comes out sans tabs and newlines. Would be nice to have them preserved :)
I have an article that is parsed to this:
DFKI ist Gründungsmitglied der EU Big Data Value PPP — DFKI
EU-Kommissarin Neelie Kroes und Jan Sundelin, Präsident von Big Data Value Association, unterzeichneten gestern eine Vereinbarung zur Einrichtung einer öffentlich.....
In the title the ü is properly encoded; in the rest of the content, however, it fails to encode the special chars properly (e.g. ä -> Ã¤). The weird part is that it works fine in the first bit and fails in the rest of the article, even though the characters are exactly the same in the original article. (http://www.dfki.de/web/presse/pressemitteilungen_intern/2014/dfki-ist-grundungsmitglied-der-eu-big-data-value-ppp)
Cheers
Christian
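The 'ä -> Ã¤' pattern is classic double decoding: UTF-8 bytes re-read as Latin-1. When that is really what happened, round-tripping the text back through Latin-1 recovers the original characters; `fix_mojibake` is a hypothetical helper, not part of eatiht:

```python
def fix_mojibake(text):
    # 'ä' stored as UTF-8 (0xC3 0xA4) but decoded as Latin-1 renders as
    # 'Ã¤'; re-encoding as Latin-1 and decoding as UTF-8 undoes that.
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that wasn't doubly decoded is returned unchanged.
        return text
```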
Error when trying to parse:
Buzut$ eatiht /Volumes/Storage/analytics/ENI\ Training\ -\ Livre\ Numérique.html
Traceback (most recent call last):
File "/usr/local/bin/eatiht", line 5, in <module>
pkg_resources.run_script('eatiht==0.1.14', 'eatiht')
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/pkg_resources.py", line 442, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/pkg_resources.py", line 1160, in run_script
execfile(script_filename, namespace, namespace)
File "/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg/EGG-INFO/scripts/eatiht", line 3, in <module>
import eatiht
File "/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg/eatiht/__init__.py", line 101, in <module>
from .eatiht import extract, get_sentence_xpath_tuples, get_xpath_frequencydistribution
File "/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg/eatiht/eatiht.py", line 77, in <module>
from collections import Counter
ImportError: cannot import name Counter
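The ImportError above has a known cause: collections.Counter only exists from Python 2.7 on, and this traceback shows the OS X system Python 2.6. A guarded import with a minimal stand-in is sketched below; the cleaner fix is simply running eatiht under Python 2.7+ (the stand-in covers only counting and most_common, not the full Counter API):

```python
try:
    from collections import Counter      # added in Python 2.7
except ImportError:
    # Python 2.6 fallback sketch, not a full backport.
    class Counter(dict):
        def __init__(self, iterable=()):
            dict.__init__(self)
            for item in iterable:
                self[item] = self.get(item, 0) + 1
        def most_common(self, n=None):
            ranked = sorted(self.items(), key=lambda kv: -kv[1])
            return ranked if n is None else ranked[:n]
```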
Install output:
Buzut$ sudo python setup.py install
Password:
running install
running bdist_egg
running egg_info
writing requirements to eatiht.egg-info/requires.txt
writing eatiht.egg-info/PKG-INFO
writing top-level names to eatiht.egg-info/top_level.txt
writing dependency_links to eatiht.egg-info/dependency_links.txt
reading manifest file 'eatiht.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'eatiht.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.6-universal/egg
running install_lib
running build_py
creating build/bdist.macosx-10.6-universal/egg
creating build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/__init__.py -> build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/eatiht.py -> build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/eatiht_trees.py -> build/bdist.macosx-10.6-universal/egg/eatiht
copying build/lib/eatiht/etv2.py -> build/bdist.macosx-10.6-universal/egg/eatiht
creating build/bdist.macosx-10.6-universal/egg/eatiht/tests
creating build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/__init__.py -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/foo1.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/full_of_foos.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/google_wiki.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/regex_dot_endings.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/tests/assets/regex_various_endings.html -> build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets
copying build/lib/eatiht/v2.py -> build/bdist.macosx-10.6-universal/egg/eatiht
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/eatiht.py to eatiht.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/eatiht_trees.py to eatiht_trees.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/etv2.py to etv2.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/tests/assets/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.6-universal/egg/eatiht/v2.py to v2.pyc
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO
installing scripts to build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
copying build/scripts-2.6/eatiht -> build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts
changing mode of build/bdist.macosx-10.6-universal/egg/EGG-INFO/scripts/eatiht to 755
copying eatiht.egg-info/PKG-INFO -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/SOURCES.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/dependency_links.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/not-zip-safe -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/requires.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
copying eatiht.egg-info/top_level.txt -> build/bdist.macosx-10.6-universal/egg/EGG-INFO
creating 'dist/eatiht-0.1.14-py2.6.egg' and adding 'build/bdist.macosx-10.6-universal/egg' to it
removing 'build/bdist.macosx-10.6-universal/egg' (and everything under it)
Processing eatiht-0.1.14-py2.6.egg
removing '/Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg' (and everything under it)
creating /Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg
Extracting eatiht-0.1.14-py2.6.egg to /Library/Python/2.6/site-packages
eatiht 0.1.14 is already the active version in easy-install.pth
Installing eatiht script to /usr/local/bin
Installed /Library/Python/2.6/site-packages/eatiht-0.1.14-py2.6.egg
Processing dependencies for eatiht==0.1.14
Searching for chardet
Reading http://pypi.python.org/simple/chardet/
Best match: chardet 2.3.0
Downloading https://pypi.python.org/packages/source/c/chardet/chardet-2.3.0.tar.gz#md5=25274d664ccb5130adae08047416e1a8
Processing chardet-2.3.0.tar.gz
Running chardet-2.3.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-xcHwiW/chardet-2.3.0/egg-dist-tmp-l3cu59
warning: no files found matching 'COPYING'
warning: no files found matching '*.html' under directory 'docs'
warning: no files found matching '*.css' under directory 'docs'
warning: no files found matching '*.png' under directory 'docs'
warning: no files found matching '*.gif' under directory 'docs'
zip_safe flag not set; analyzing archive contents...
Adding chardet 2.3.0 to easy-install.pth file
Installing chardetect script to /usr/local/bin
Installed /Library/Python/2.6/site-packages/chardet-2.3.0-py2.6.egg
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Best match: lxml 3.4.2
Downloading https://pypi.python.org/packages/source/l/lxml/lxml-3.4.2.tar.gz#md5=429e5e771c4be0798923c04cb9739b4e
Processing lxml-3.4.2.tar.gz
Running lxml-3.4.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-QJVQPH/lxml-3.4.2/egg-dist-tmp-k9RYwn
Building lxml version 3.4.2.
Building without Cython.
Using build configuration of libxslt 1.1.24
/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/dist.py:266: UserWarning: Unknown distribution option: 'bugtrack_url'
warnings.warn(msg)
Adding lxml 3.4.2 to easy-install.pth file
Installed /Library/Python/2.6/site-packages/lxml-3.4.2-py2.6-macosx-10.6-universal.egg
Finished processing dependencies for eatiht==0.1.14
imiell@osboxes:~$ eatiht http://news.yahoo.com/curiosity-rover-drills-mars-rock-finds-water-122321635.html
: No such file or directory
Chrome/Opera extension in progress, with a "working" pre-alpha version. Email me if you'd like to play with it. I'm not posting the code onto GitHub yet, because it's frightening to look at. I'd be interested in hearing people's feature requests.
Some things I'm planning on including are:
Possible additions:
Rodric,
Great work and thanks for sharing. I'm mainly trying to extract the main text from a news link, and it seems to work on most sites except for aspx pages. On aspx pages it's only giving the meta-information, such as copyright info.
For example
import eatiht
eatiht.extract(url)
Out[37]: u'\n Copyright, Trademark and Patent Information Terms of Use Please read our Terms and Conditions\n \xa9 1995 - 2015 The Motley Fool. All rights reserved. \n\n\n BATS data provided in real-time. NYSE, NASDAQ and NYSEMKT data delayed 15 minutes. Real-Time prices provided by BATS BZX. Market data provided by Interactive Data. Company fundamental data provided by Morningstar. Earnings Estimates, Analyst Ratings and Key Statistics provided by Zacks. SEC Filings and Insider Transactions provided by Edgar Online. Powered and implemented by Interactive Data Managed Solutions.