
snacktory's Introduction

Future

Snacktory is no longer actively maintained by @karussell.

Crux is a fork under active development and is the recommended alternative.

  • Available under the same permissive Apache 2.0 License.
  • Adds several new features: rich-text output (HTML), link preservation, more metadata extraction, etc.
  • Optimized for Android. Decoupled from optional dependencies such as HttpUrlConnection, log4j, etc.
  • Actively developed by Chimbori, the developers of Hermit, a Lite Apps Browser for Android.
  • Already being used in multiple apps.
  • Crux has a different architecture from Snacktory: it is designed as a collection of several separate APIs instead of a single one. Clients can pick and choose which ones they wish to use.
  • As a result, Crux is not a drop-in replacement for Snacktory, but fairly easy to migrate to.

Snacktory

This is a small helper utility for people who don't want to write yet another java clone of Readability. In most cases, this is applied to articles, although it should work for any website to find its major area, extract its text, keywords, its main picture and more.

The resulting quality is high; paper.li, for example, uses the core of snacktory. Also have a look at this article, which describes Jetslide, a news aggregator service built on snacktory (Jetslide is no longer online).

Snacktory borrows some ideas and a lot of test cases from goose and jreadability.

License

The software is licensed under the Apache 2 License and comes with NO WARRANTY.

Features

  • article text detection
  • get top image url(s)
  • get top video url
  • extraction of description, keywords, ...
  • good detection for non-English sites (German, Japanese, ...); snacktory does not rely on word counts in its text detection, which makes CJK languages work well
  • good charset detection
  • URL resolving is supported, and caching remains possible after resolving
  • skipping some known filetypes
  • no http GET required to run the core tests

TODOs

  • only top text supported at the moment

Usage

Include the repo at: https://github.com/karussell/mvnrepo

Then add the dependency

<dependency>
   <groupId>de.jetwick</groupId>
   <artifactId>snacktory</artifactId>
   <version>1.1</version>
   <!-- or if you prefer the latest build <version>1.2-SNAPSHOT</version> -->
</dependency>

If you need this for Android be sure you read this issue.

Or, if you prefer, you can use a build generated by jitpack.io.

Now you can use it as follows:

HtmlFetcher fetcher = new HtmlFetcher();
// Optionally set a cache, e.g. the map implementation from Google Collections:
// fetcher.setCache(new MapMaker().concurrencyLevel(20).maximumSize(count).
//    expireAfterWrite(minutes, TimeUnit.MINUTES).makeMap());

JResult res = fetcher.fetchAndExtract(articleUrl, resolveTimeout, true);
String text = res.getText(); 
String title = res.getTitle(); 
String imageUrl = res.getImageUrl();
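If you prefer not to pull in Google Collections/Guava, a minimal size-bounded cache can be built from the JDK alone. This is only a sketch under the assumption that setCache accepts a plain java.util.Map (check the actual setter signature in your snacktory version):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCache {
    // Size-bounded LRU map based on LinkedHashMap's access order;
    // the eldest entry is evicted once maxSize is exceeded.
    public static <K, V> Map<K, V> create(final int maxSize) {
        Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
        // LinkedHashMap is not thread-safe; synchronize for shared use.
        return Collections.synchronizedMap(lru);
    }
}
```

Unlike the Guava MapMaker variant above, this has no time-based expiry; it simply caps the number of cached results.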

Build

Build via Maven; it will automatically resolve the dependencies on jsoup, log4j and slf4j-api.

snacktory's People

Contributors

bejean, chimbori, dajac, hnrc, ifesdjeen, jloomis, jonathansantilli, karussell, kinow, kireet, nzv8fan, pyr, soebbing, tjerkw, todvora


snacktory's Issues

hello

hello

sorry, bad action. please delete this issue !

sorry, sorry ...

Many websites only extract partial content

Hi Peter,

I notice that I can only extract part of the content on many websites, for example this site: http://sheldonbrown.com/brandt/patching.html I only get part of the article, starting from "Assuming that a patch was properly".

Do you know whether the reason is that the library still needs development to complete the TODO "only top text supported at the moment"?

If so, could you give me some guidelines on how I can work to improve that?

Thank you so much.

Please don't cause referrer spam

Your software already appears to properly advertise itself in the User-Agent; please don't cause Referer spam by sending a fake Referer header pointing to this repository.

Crux, an Android-optimized fork of Snacktory, with many issues fixed

Hi @karussell, thanks for building and sharing Snacktory!

You said you were looking for someone to take over maintenance and future development?

We’ve been working hard on our own fork, with several features over the original Snacktory. The reason we forked it is because we needed to change the basic API to make it fit our requirements, including optimizing the library for Android, decoupling it from optional dependencies such as HttpUrlConnection, log4j, etc., and adding several new features, such as rich-text output (HTML), preserving links, extracting more metadata content, etc.

Announcing Crux: https://github.com/chimbori/crux

If you are interested, let us know how we can work together for maintenance and future development!

Stack overflow ...

Very occasionally I'm getting a stack overflow in 1.3-SNAPSHOT, so it is clearly content-specific. Sadly, I haven't been able to capture an offending site yet:

java.lang.StackOverflowError
at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299)
at java.util.HashMap.putVal(HashMap.java:663)
at java.util.HashMap.put(HashMap.java:611)
at org.jsoup.nodes.Attributes.put(Attributes.java:74)
at org.jsoup.nodes.Attributes.put(Attributes.java:51)
at org.jsoup.nodes.TextNode.ensureAttributes(TextNode.java:138)
at org.jsoup.nodes.TextNode.attr(TextNode.java:144)
at de.jetwick.snacktory.OutputFormatter.unlikely(OutputFormatter.java:118)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:130)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
........
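The trace shows one recursive call to appendTextSkipHidden per nested element, so sufficiently deep markup can exhaust the JVM stack. One way to make such traversal robust is to use an explicit stack instead of recursion. The sketch below uses a hypothetical minimal Node class rather than jsoup's API, purely to illustrate the technique, not snacktory's actual code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeTraversal {
    // Minimal stand-in for a DOM node; jsoup's Node API differs.
    static class Node {
        String text;
        List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    // Depth-first text collection with an explicit stack instead of
    // recursion, so deeply nested markup cannot overflow the JVM stack.
    static String collectText(Node root) {
        StringBuilder sb = new StringBuilder();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.text != null) sb.append(n.text);
            // push children in reverse so they are visited in document order
            for (int i = n.children.size() - 1; i >= 0; i--)
                stack.push(n.children.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a chain 100,000 nodes deep would overflow a recursive version
        Node root = new Node("a");
        Node cur = root;
        for (int i = 0; i < 100_000; i++) {
            Node child = new Node(i == 99_999 ? "z" : null);
            cur.children.add(child);
            cur = child;
        }
        System.out.println(collectText(root));
    }
}
```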

Not able to extract content

Not able to extract content from some websites, such as quora.com and possibly others.
The server returns 403 for the HEAD request method at this line in the HtmlFetcher class.
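A common workaround for servers that reject HEAD is to retry the request with GET. The following is only a sketch of that idea with plain HttpURLConnection, not HtmlFetcher's actual resolving logic:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadFallback {
    // Some servers return 403 or 405 for HEAD even though GET works.
    static boolean shouldRetryWithGet(int headStatus) {
        return headStatus == 403 || headStatus == 405;
    }

    // Resolve the final status of a URL, falling back from HEAD to GET.
    static int resolveStatus(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        int code = conn.getResponseCode();
        if (shouldRetryWithGet(code)) {
            conn.disconnect();
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            code = conn.getResponseCode();
        }
        conn.disconnect();
        return code;
    }
}
```

The GET retry downloads the body, so in a real fetcher you would want to reuse that response rather than fetch the page twice.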

Fetch content from Twitter URLs?

Hi!
I'm trying to fetch content from the URLs inside of a Tweet.

When I try to do it from the official Twitter Android app, Twitter only shares a text like "read this tweet from @user at http://twitter.com/status/8341234812634".

So I fetch this URL hoping to get the real tweet text with the real URL that I want to fetch.

However, when I do that, Twitter instead returns a cookie-consent notice: "To bring you Twitter, we and our partners use cookies on our and other websites. Cookies help personalize Twitter content, tailor Twitter Ads, measure their performance and provide you with a better, faster, safer Twitter experience. By using our services, you agree to our Cookie Use. Close".

I tried to set some "User-Agent" and "Cookie" configuration on the HttpURLConnection before fetching Twitter, without success.

Do you know how I can achieve that?

This is my current code (somewhat dirty; I plan to send you a fix once it works).

public String fetchAsString(String urlAsString, int timeout, boolean includeSomeGooseOptions)
        throws MalformedURLException, IOException {
    HttpURLConnection hConn = createUrlConnection(urlAsString, timeout, includeSomeGooseOptions);
    hConn.setInstanceFollowRedirects(true);

   // Start "hack"
    hConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
    Log.d("EXTRACT", hConn.getRequestProperty("User-Agent"));
    CookieManager cookieManager = new CookieManager();
    CookieHandler.setDefault(cookieManager);

    HttpCookie cookie = new HttpCookie("lang", "en");
    cookie.setDomain("twitter.com");
    cookie.setPath("/");
    cookie.setVersion(0);
    try {
        cookieManager.getCookieStore().add(new URI("http://twitter.com/"), cookie);
    } catch (URISyntaxException e) {
        e.printStackTrace();
    }
   // End "hack"

    String encoding = hConn.getContentEncoding();        
    InputStream is;
    if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
        is = new GZIPInputStream(hConn.getInputStream());
    } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
        is = new InflaterInputStream(hConn.getInputStream(), new Inflater(true));
    } else {
        is = hConn.getInputStream();
    }

    String enc = Converter.extractEncoding(hConn.getContentType());
    String res = createConverter(urlAsString).streamToString(is, enc);
    if (logger.isDebugEnabled())
        logger.debug(res.length() + " FetchAsString:" + urlAsString);
    return res;
}

bump jsoup version

Is snacktory usable with the latest version of jsoup, 1.6.2? If so, it would be great to bump it.
Thanks,
David

ignore hidden items?

It seems that when I do a snacktory pass on Amazon pages (e.g. http://www.amazon.com/Vandaveer-Software-Brick-Buster-Pro/dp/B006T4IJTK),

it's extracting hidden data, which is obviously not that relevant from a readability perspective. I was a bit confused when looking at the output, as a simple find via a browser didn't show the same text. When I looked at the source, I realized it was using display: hidden.

I realize in some cases this might be beyond the scope of readability (if it's not really approaching it from a full DOM perspective), but it seems that in some cases (such as here) these nodes should obviously be excluded.

Snacktory on Android? java.beans.Introspector

Hi! I'm trying to make it work on Android project but when I initialize fetcher:

HtmlFetcher fetcher = new HtmlFetcher(); 

A java.lang.NoClassDefFoundError: Failed resolution of: Ljava/beans/Introspector; is thrown.
I read that java.beans is not fully implemented on Android, and I found the Open Beans project http://code.google.com/p/openbeans/ but I don't know how to make it work, or whether there is a simpler way to fix that exception.

Thank you.

determineImageSource for images without width and height attributes

Images without width and height attributes are ignored by the determineImageSource method.
I think images without these attributes can be considered as images with width > 50 and height > 50.

Furthermore, width=50 and height=50 are also ignored by the test

        if (height > 50)
            weight += 20;
        else if (height < 50)
            weight -= 20;

In my opinion, we should use :

        int weight = 0;
        int height = 0;
        int width = 0;
        try {
            height = Integer.parseInt(e.attr("height"));
        } catch (Exception ex) {}
        if (height == 0 || height >= 50)
            weight += 20;
        else if (height < 50)
            weight -= 20;

        try {
            width = Integer.parseInt(e.attr("width"));
        } catch (Exception ex) {}
        if (width == 0 || width >= 50)
            weight += 20;
        else if (width < 50)
            weight -= 20;

OutputFormatter issue

Hi,

If an HTML page contains something like

aaaa <strong>bbbb </strong>cccc

the result of the replaceTagsWithText method is

aaaa bbbbcccc

The space after bbbb is lost.

But if an HTML page contains something like

aaaa <strong>bbbb</strong> cccc

there is no problem.

The line

TextNode tn = new TextNode(item.text(), topNode.baseUri());

removes the space.

Regards
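The desired behavior, dropping the tag while keeping the element's inner text verbatim (including trailing whitespace), can be illustrated without jsoup. This is a toy regex-based sketch of the behavior being requested, not the actual OutputFormatter fix:

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Inline formatting tags whose markup should vanish but whose
    // text content must be kept character for character.
    private static final Pattern INLINE_TAGS =
            Pattern.compile("</?(?:strong|b|i|em)>", Pattern.CASE_INSENSITIVE);

    // "aaaa <strong>bbbb </strong>cccc" becomes "aaaa bbbb cccc":
    // the trailing space inside the element survives.
    static String stripInlineTags(String html) {
        return INLINE_TAGS.matcher(html).replaceAll("");
    }
}
```

In jsoup terms, the bug arises because Element.text() returns normalized text, so a fix would need to read the raw text nodes instead of the normalized form.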

ensure asian characters are not broken

This is now fixed! But needs a unit test!

From email:

The issue is in Converter.streamToString(). There's a loop to read HTTP data chunks. Each chunk is converted separately to a String, but a chunk may contain only the first (or second) half of a multi-byte character, resulting in corrupted data. It happens sporadically, depending on timing.

Also, the counting of bytesRead was wrong, so on a slow connection there may be a "size exceeded" message with no justification.

What I did to test this problem is read a Japanese article (URL below) with the browser and save its content somewhere (e.g. to a file). Then run the streamToString() function in a loop (with some delay) and each time compare its output with the expected output on file. Sometimes I experienced dozens of successful tests and then several failures, so this is not very consistent, but the errors were frequent enough.

The article I tested on is http://astand.asahi.com/magazine/wrscience/2012022900015.html, and the corruption was almost always visible in the string "300" (see in the article), where instead of the "3" some junk was displayed.
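The failure mode described above can be reproduced with the JDK alone: decoding each raw byte chunk separately splits multi-byte UTF-8 characters at chunk boundaries, while a Reader keeps decoder state across reads and is therefore safe. A minimal demonstration, not snacktory's actual fix:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ChunkDecodeDemo {
    // Decoding each raw byte chunk separately can split a multi-byte
    // UTF-8 character at a chunk boundary, yielding replacement chars.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // An InputStreamReader keeps decoder state between reads, so
    // characters straddling a read boundary are decoded correctly.
    static String decodeWithReader(byte[] data, int chunkSize) throws Exception {
        InputStreamReader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        int n;
        while ((n = r.read(buf)) != -1) sb.append(buf, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String original = "300メートル"; // multi-byte Japanese text
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // a chunk size of 4 splits the first 3-byte character
        System.out.println("per-chunk: " + decodePerChunk(bytes, 4));
        System.out.println("reader:    " + decodeWithReader(bytes, 4));
    }
}
```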

Detect publish date

A great feature would be detecting the publish date of the web page.
This information is often located somewhere at the top or the bottom of the main text.
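A first cut could check common metadata tags and then fall back to an ISO-style date anywhere in the page. The tag names below (Open Graph's article:published_time, a generic "date" meta name) are assumptions about common conventions, not anything snacktory implements:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PublishDateSniffer {
    // Common publish-date meta tags; assumes property/name appears
    // before content, which is the usual attribute order.
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*(?:property=\"article:published_time\"|name=\"date\")[^>]*content=\"([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);
    // Fallback: first ISO-like yyyy-MM-dd date in the markup.
    private static final Pattern ISO_DATE =
            Pattern.compile("\\b(\\d{4}-\\d{2}-\\d{2})\\b");

    // Returns the first date-like string found, or null.
    static String findPublishDate(String html) {
        Matcher m = META.matcher(html);
        if (m.find()) return m.group(1);
        m = ISO_DATE.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

A real implementation would parse the DOM instead of using regexes and normalize the result to a date type, but this shows the layered strategy.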

Misspelling in README file

I wanted to change "pepole" to "people" at the beginning of this file, but I could not create a pull request.

Cheers!

Unsupported Popular Internet Landmarks

Articles from the following properties don't currently work:

  • m.slashdot.org

    • Returns JResult with empty contents, probably due to redirect
  • arstechnica.com

    • Produces java.net.ProtocolException: Unexpected status line: �����������������������������������HTTP/1.1 200 OK

    (full trace)

Great work btw. I'll keep hunting for more.

Preserve paragraphs?

Hi, is it possible to preserve/restore paragraphs with Snacktory engine? Extracted articles are not really readable when joined in one big chunk of text.

NoClassDefFoundError: Could not initialize class de.jetwick.snacktory.HtmlFetcher

I'm using snacktory with IntelliJ 15 and Gradle. The following was working yesterday, but stopped working today:

repositories {
  maven {
    url "https://github.com/karussell/mvnrepo/raw/master/releases/"
  }
}
dependencies {
  compile('de.jetwick:snacktory:1.2')
}

Getting errors from HtmlFetcher fetcher = new HtmlFetcher();:

java.lang.NoClassDefFoundError: Could not initialize class de.jetwick.snacktory.HtmlFetcher

Things I tried in IntelliJ:

  • refresh gradle deps
  • Build -> Rebuild Project
  • File -> Invalidate Caches/Restart

Interestingly, if I build a jar with dependencies and run java -jar ..., then it does seem to work.

Any ideas what might have gone wrong?

Build fail due to test failed

Running de.jetwick.snacktory.ArticleTextExtractorTest
2012-09-23 08:22:25,963 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
2012-09-23 08:22:26,006 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
Tests run: 72, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.646 sec <<< FAILURE!

Failed tests:
testYomiuri(de.jetwick.snacktory.ArticleTextExtractorTest): yomiuri:????????????????????????????????????????????????????????????????????????????????? ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

Tests run: 95, Failures: 1, Errors: 0, Skipped: 0

The issue is line 111

    assertTrue("yomiuri:" + res.getText(), res.getText().startsWith(" 海津市海津町の国営木曽三川公園で、チューリップが見頃を迎えている。20日までは「チューリップ祭」が開かれており、大勢の人たちが多彩な色や形を鑑賞している=写真="));

The test passes if you remove the leading space here >> startsWith(" 海津

Relevant content in XML island is not returned

When the relevant article content is in an XML island, it isn't returned. See for example this WSJ Japan article http://jp.wsj.com/Finance-Markets/Foreign-Currency-Markets/node_400108 with the following fragment (shortened for clarity):

<p>
<?xml version="1.0" encoding="utf-8"?>
<section xmlns:image="http://ez.no/namespaces/ezpublish3/image/" ...>
<paragraph>(this is the relevant content) イスラエル銀行(**銀行)は景気下支えを目的に過去5カ月間に ...</paragraph>
</section>
</p>

On slow networks Converter logs error

We need to check if we really already have read something ...

Error message for HtmlFetcherIntegrationTest.testHashbang:
"Converter - Couldn't reset stream to re-read with new encoding UTF-8 java.io.IOException: Resetting to invalid mark"

dependency via sbt

Did you manage to add the dependency with sbt? I get different exceptions when referring to different versions.

Can't split getText() into paragraphs

Hello, I'm trying to get the main text from articles, but I get a string without newline characters. Is there a way to extract the text while retaining all newline characters? Otherwise there is only one single paragraph per article…

Or is there a switch to retain certain HTML tags while doing the extraction, like retaining all <a> and <br>?

By the way, thanks for your great work!

Provide optional extraction directives

What about providing optional extraction directives?

In the majority of cases the extraction algorithm works great. But for some websites it can fail to extract the relevant content. For these websites, it could be possible to "help" snacktory focus on a specific part of the page content by providing it a Jsoup selector. For instance, we could have something like:

ArticleTextExtractor extractor = new ArticleTextExtractor();
extractor.setTextSelector("div.article_content");
extractor.setTitleSelector("h2", "first");
String dateRegEx = "xxxx";
extractor.setDateSelector("#published", dateRegEx);

JResult res = extractor.extractContent(rawData);
text = res.getText();
title = res.getTitle();
date = res.getDate();

Not working

I am trying to implement it using the code from the Readme, but it just doesn't work. There are no errors, but it doesn't work either.

If I try to Log.d the value returned from JResult, that debug log is also not in the output. I just don't know what the issue is here.
