kaqqao / nutch-element-selector
Nutch 2.3.1 plugin for whitelisting/blacklisting specific HTML elements
License: MIT License
This is probably a problem with my setup rather than your plugin.
I have nutch-2.3.1 and have installed your plugin to get rid of a bunch of navigation elements, breadcrumbs, and footer components from my standard web pages.
I've set the plugin.includes property in my nutch-site.xml as follows (I'm also using Tika for PDF and Word documents):
plugin.includes protocol-httpclient|urlfilter-regex|element-selector|index-(basic|more)|query-(basic|site|url)|indexer-solr|nutch-extensionpoints|parse-(text|html|tika|js)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
And here are the elements I am trying to target with the blacklist, also in nutch-site.xml:
parser.html.selector.blacklist header,footer,div.breadcrumbs,div.searchForm,table.jobTable
A comma-delimited....
I've tried with parse-html both included and excluded. I read that if you enable Tika you shouldn't use parse-html, but I thought parse-html must be enabled for your plugin to work. Any guidance would be helpful.
Thanks very much in advance.
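For reference, the two settings quoted in this issue would look roughly like the fragment below in nutch-site.xml. This is only a sketch that assumes the usual Hadoop-style property layout Nutch uses for its configuration files; the values are copied from the post above, and the truncated description text is left out rather than guessed.

```xml
<?xml version="1.0"?>
<!-- Sketch of nutch-site.xml reconstructed from the values quoted in the issue. -->
<!-- Assumption: standard <configuration>/<property> layout; no other properties shown. -->
<configuration>
  <property>
    <!-- Plugins to load; element-selector is listed alongside parse-(text|html|tika|js). -->
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|element-selector|index-(basic|more)|query-(basic|site|url)|indexer-solr|nutch-extensionpoints|parse-(text|html|tika|js)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <!-- Comma-delimited selectors for the elements the poster wants excluded. -->
    <name>parser.html.selector.blacklist</name>
    <value>header,footer,div.breadcrumbs,div.searchForm,table.jobTable</value>
  </property>
</configuration>
```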
Hi there.
This is great work. I have it working on Nutch-2.4 (Feb 2021).
Question: why hasn't such an important plugin been integrated into the 1.x stream?
Is there any way to make this work with Nutch-1.18?
Thanks,
Colin