This project forked from aaronland/dogeared-extruder

A simple HTTP pony to wrap a variety of text extraction libraries (Boilerpipe, Tika, Java-Readability) using dropwizard

dogeared-extruder

This is meant to be a simple HTTP Pony that wraps the boilerpipe and Tika text extraction libraries, plus clones of the readability library, using the dropwizard framework.

To start the server:

$> cd dogeared-extruder
$> make build
... JAVA STUFF ...
$> java -jar target/extruder-1.0.jar server
... MOAR JAVA STUFF ...
INFO  [2013-08-30 12:49:12,184] org.eclipse.jetty.server.AbstractConnector: Started [email protected]:8080
INFO  [2013-08-30 12:49:12,189] org.eclipse.jetty.server.AbstractConnector: Started [email protected]:8081

And then you can pass it URLs as GET parameters:

$> curl 'http://localhost:8080/boilerpipe?url=SOME_URL'

$> curl 'http://localhost:8080/java-readability?url=SOME_URL'

$> curl 'http://localhost:8080/tika?url=SOME_URL_DOT_PDF'
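
One thing the curl examples gloss over: SOME_URL should be percent-encoded, otherwise a target URL with its own query string gets its `&` parameters swallowed by the extruder. Here is a minimal sketch in Python (not part of this project) that builds the GET URLs. The function name `build_extract_url` and the `BASE` constant are made up for illustration; the endpoint names and port come from the examples above.

```python
from urllib.parse import urlencode

# Assumption: the server runs locally on the port shown in the startup log.
BASE = "http://localhost:8080"

# The three endpoint names used in the curl examples.
ENDPOINTS = ("boilerpipe", "java-readability", "tika")


def build_extract_url(endpoint, target_url, base=BASE):
    """Return the GET URL asking `endpoint` to extract text from `target_url`.

    Hypothetical helper: it only builds the string, it does not contact the server.
    """
    if endpoint not in ENDPOINTS:
        raise ValueError("unknown endpoint: %s" % endpoint)
    # urlencode percent-encodes ':', '/', '?', '=' and '&' in the target URL,
    # so its own query string cannot be mistaken for extruder parameters.
    return "%s/%s?%s" % (base, endpoint, urlencode({"url": target_url}))
```

Calling `build_extract_url("boilerpipe", "http://example.com/a?b=1&c=2")` yields `http://localhost:8080/boilerpipe?url=http%3A%2F%2Fexample.com%2Fa%3Fb%3D1%26c%3D2`, which can be passed to curl as-is.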

It also supports local files via POST uploads:

$> curl -X POST -F "file=@SOME_FILE.html" http://localhost:8080/boilerpipe

$> curl -X POST -F "file=@SOME_FILE.html" http://localhost:8080/java-readability

$> curl -X POST -F "file=@SOME_FILE.pdf" http://localhost:8080/tika 
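
For anyone not using curl, the `-F "file=@..."` flag amounts to a multipart/form-data POST with a single form field named `file`. The following is a rough Python sketch of the same request using only the standard library. `build_upload_request` is a hypothetical helper, and guessing the part's content type from the file name is an assumption meant to mimic what curl does; whether the server cares about that header has not been checked.

```python
import mimetypes
import uuid
from urllib.request import Request


def build_upload_request(endpoint, filename, data, base="http://localhost:8080"):
    """Build (but do not send) a multipart/form-data POST with one "file" field.

    `data` is the raw file content as bytes.
    """
    boundary = uuid.uuid4().hex
    # Assumption: guess the part's content type from the file extension.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        "--%s\r\n"
        'Content-Disposition: form-data; name="file"; filename="%s"\r\n'
        "Content-Type: %s\r\n"
        "\r\n" % (boundary, filename, content_type)
    ).encode("utf-8")
    tail = ("\r\n--%s--\r\n" % boundary).encode("utf-8")
    headers = {"Content-Type": "multipart/form-data; boundary=%s" % boundary}
    return Request(
        "%s/%s" % (base, endpoint),
        data=head + data + tail,
        headers=headers,
        method="POST",
    )
```

With the server running, `urllib.request.urlopen(build_upload_request("tika", "SOME_FILE.pdf", pdf_bytes))` should behave like the last curl command above.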

By default the server returns HTML, but if you pass an Accept: application/json header you'll get a big old blob of JSON instead.

$> curl -H 'Accept:application/json' 'http://localhost:8080/boilerpipe?url=SOME_URL'
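
The same content negotiation from a script might look like the sketch below (Python, standard library only). The helper names `build_json_request` and `fetch_json` are hypothetical, and the shape of the JSON blob is not documented here, so the code just decodes whatever comes back without assuming any keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_json_request(endpoint, target_url, base="http://localhost:8080"):
    """Build a GET request that asks for JSON instead of the default HTML."""
    query = urlencode({"url": target_url})
    return Request(
        "%s/%s?%s" % (base, endpoint, query),
        headers={"Accept": "application/json"},
    )


def fetch_json(endpoint, target_url):
    """Send the request and decode the response body as JSON.

    Requires a running server; the structure of the result is not assumed.
    """
    with urlopen(build_json_request(endpoint, target_url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Only `fetch_json` touches the network; `build_json_request` can be inspected offline to check the URL and headers.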

Notes

  • It works but I am not a Java person so I am still fumbling my way around this foreign land.

  • The text/content extraction is pretty heavy-handed and relies on the underlying libraries to do the right thing. Currently everything returns blocks of plain text so things like lists and code samples will probably be mangled. This is not ideal but that stuff is meant to be handled going forward.

  • If you look carefully at the URLs above and the actual classes that define the functionality they all look basically the same save for the names of the extraction tools. For the time being I think the classes (and URLs) should remain separate if only to keep things simple(r) while everything else is sorted out.

  • There is also a separate branch that uses the snacktory readability clone but it has not been merged into master yet. I can't remember why except that I was having trouble getting it to work and decided to try the java-readability library instead.

  • You can also type make exec to recompile the code and launch the server in foreground mode, which is useful for debugging things.

Dependencies

  • You will need to have maven installed to manage the build process.

To do

Aside from stuff listed in the TODO.txt file:

  • Try to be smarter about extracting or generating a page title for HTML output. Currently the code does not try to parse HTML input for a title; it simply parrots the basename of the input URL and/or relies on Tika's internal metadata parser.
  • A resource endpoint that calls the readability.com API
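
To make the "basename of the input URL" behaviour concrete, here is a small Python illustration of that kind of fallback. It is not the project's Java code: `title_from_url` is a hypothetical name, and falling back to the host name when the path is empty is an assumption added so the function always returns something.

```python
import posixpath
from urllib.parse import unquote, urlparse


def title_from_url(url):
    """Return a fallback title: the last path segment of `url`.

    Assumption: if the path has no segments, use the host name instead.
    """
    parsed = urlparse(url)
    # Strip a trailing slash so ".../docs/" yields "docs" rather than "".
    name = posixpath.basename(parsed.path.rstrip("/"))
    return unquote(name) or parsed.netloc
```

For example, `title_from_url("http://example.com/docs/report.pdf")` gives `report.pdf`, which shows why this makes for a poor page title and why parsing the HTML `<title>` would be the smarter option.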
