
snacktory's Introduction

Future

Snacktory is no longer actively maintained by @karussell.

Crux is a fork under active development and is the recommended alternative.

  • Available under the same permissive Apache 2.0 License.
  • Adds several new features: rich-text output (HTML), link preservation, more metadata extraction, etc.
  • Optimized for Android. Decoupled from optional dependencies such as HttpUrlConnection, log4j, etc.
  • Actively developed by Chimbori, the developers of Hermit, a Lite Apps Browser for Android.
  • Already being used in multiple apps.
  • Crux has a different architecture from Snacktory: it is designed as a collection of several separate APIs instead of a single one. Clients can pick and choose which ones they wish to use.
  • As a result, Crux is not a drop-in replacement for Snacktory, but fairly easy to migrate to.

Snacktory

This is a small helper utility for people who don't want to write yet another java clone of Readability. In most cases, this is applied to articles, although it should work for any website to find its major area, extract its text, keywords, its main picture and more.

The resulting quality is high; paper.li, for example, uses the core of snacktory. Also have a look at this article, which describes Jetslide, a news aggregator service built on snacktory (Jetslide is no longer online).

Snacktory borrows some ideas and a lot of test cases from goose and jreadability.

License

The software is licensed under the Apache 2 License and comes with NO WARRANTY.

Features

  • article text detection
  • get top image url(s)
  • get top video url
  • extraction of description, keywords, ...
  • good detection for non-English sites (German, Japanese, ...); snacktory does not rely on word counts in its text detection, which makes CJK languages work well
  • good charset detection
  • URL resolving is supported, and caching remains possible after resolving
  • skipping some known filetypes
  • no http GET required to run the core tests

TODOs

  • only top text supported at the moment

Usage

Include the repo at: https://github.com/karussell/mvnrepo

Then add the dependency

<dependency>
   <groupId>de.jetwick</groupId>
   <artifactId>snacktory</artifactId>
   <version>1.1</version>
   <!-- or if you prefer the latest build <version>1.2-SNAPSHOT</version> -->
</dependency>

If you need this for Android be sure you read this issue.

Or, if you prefer, you can use a build generated by jitpack.io.

Now you can use it as follows:

HtmlFetcher fetcher = new HtmlFetcher();
// Optionally set a cache, e.g. the map implementation from Google Collections:
// fetcher.setCache(new MapMaker().concurrencyLevel(20).maximumSize(count).
//    expireAfterWrite(minutes, TimeUnit.MINUTES).makeMap());

JResult res = fetcher.fetchAndExtract(articleUrl, resolveTimeout, true);
String text = res.getText(); 
String title = res.getTitle(); 
String imageUrl = res.getImageUrl();
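If you prefer not to pull in Google Collections/Guava, a minimal size-bounded cache can be built from the JDK alone. This is only a sketch under the assumption that setCache accepts a plain java.util.Map (check the actual setter signature in your snacktory version):

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCache {
    // Size-bounded LRU map based on LinkedHashMap's access order;
    // the eldest entry is evicted once maxSize is exceeded.
    public static <K, V> Map<K, V> create(final int maxSize) {
        Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
        // LinkedHashMap is not thread-safe; synchronize for shared use.
        return Collections.synchronizedMap(lru);
    }
}
```

Unlike the Guava MapMaker variant above, this has no time-based expiry; it simply caps the number of cached results.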

Build

Build via Maven; it will automatically resolve the dependencies on jsoup, log4j and slf4j-api.

snacktory's People

Contributors

bejean, chimbori, dajac, hnrc, ifesdjeen, jloomis, jonathansantilli, karussell, kinow, kireet, nzv8fan, pyr, soebbing, tjerkw, todvora


snacktory's Issues

hello

hello

sorry, bad action. please delete this issue !

sorry, sorry ...

Many websites only extract partial content

Hi Peter,

I notice that I can only extract part of the content on many websites, for example this site: http://sheldonbrown.com/brandt/patching.html I only get part of the article, starting from "Assuming that a patch was properly".

Do you know whether the reason is that the library still needs development to complete the TODO "only top text supported at the moment"?

If so, could you give me some guidelines on how I can work to improve that?

Thank you so much.

Please don't cause referrer spam

Your software already appears to properly advertise itself in the User-Agent; please don't cause Referer spam by sending a fake Referer header pointing to this repository.

Crux, an Android-optimized fork of Snacktory, with many issues fixed

Hi @karussell, thanks for building and sharing Snacktory!

You said you were looking for someone to take over maintenance and future development?

We’ve been working hard on our own fork, with several features over the original Snacktory. The reason we forked it is because we needed to change the basic API to make it fit our requirements, including optimizing the library for Android, decoupling it from optional dependencies such as HttpUrlConnection, log4j, etc., and adding several new features, such as rich-text output (HTML), preserving links, extracting more metadata content, etc.

Announcing Crux: https://github.com/chimbori/crux

If you are interested, let us know how we can work together for maintenance and future development!

Stack overflow ...

Very occasionally I'm getting a stack overflow in 1.3-SNAPSHOT, so it is clearly content-specific. Sadly, I haven't been able to capture an offending site yet:

java.lang.StackOverflowError
at java.util.LinkedHashMap.afterNodeInsertion(LinkedHashMap.java:299)
at java.util.HashMap.putVal(HashMap.java:663)
at java.util.HashMap.put(HashMap.java:611)
at org.jsoup.nodes.Attributes.put(Attributes.java:74)
at org.jsoup.nodes.Attributes.put(Attributes.java:51)
at org.jsoup.nodes.TextNode.ensureAttributes(TextNode.java:138)
at org.jsoup.nodes.TextNode.attr(TextNode.java:144)
at de.jetwick.snacktory.OutputFormatter.unlikely(OutputFormatter.java:118)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:130)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
at de.jetwick.snacktory.OutputFormatter.appendTextSkipHidden(OutputFormatter.java:142)
........
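The trace shows one recursive call to appendTextSkipHidden per nested element, so sufficiently deep markup can exhaust the JVM stack. One way to make such traversal robust is to use an explicit stack instead of recursion. The sketch below uses a hypothetical minimal Node class rather than jsoup's API, purely to illustrate the technique, not snacktory's actual code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeTraversal {
    // Minimal stand-in for a DOM node; jsoup's Node API differs.
    static class Node {
        String text;
        List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    // Depth-first text collection with an explicit stack instead of
    // recursion, so deeply nested markup cannot overflow the JVM stack.
    static String collectText(Node root) {
        StringBuilder sb = new StringBuilder();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (n.text != null) sb.append(n.text);
            // push children in reverse so they are visited in document order
            for (int i = n.children.size() - 1; i >= 0; i--)
                stack.push(n.children.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a chain 100,000 nodes deep would overflow a recursive version
        Node root = new Node("a");
        Node cur = root;
        for (int i = 0; i < 100_000; i++) {
            Node child = new Node(i == 99_999 ? "z" : null);
            cur.children.add(child);
            cur = child;
        }
        System.out.println(collectText(root));
    }
}
```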

Not able to extract content

Not able to extract content from some websites, such as quora.com and possibly others.
The server returns 403 for the HEAD request method at this line in the HtmlFetcher class.
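A common workaround for servers that reject HEAD is to retry the request with GET. The following is only a sketch of that idea with plain HttpURLConnection, not HtmlFetcher's actual resolving logic:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeadFallback {
    // Some servers return 403 or 405 for HEAD even though GET works.
    static boolean shouldRetryWithGet(int headStatus) {
        return headStatus == 403 || headStatus == 405;
    }

    // Resolve the final status of a URL, falling back from HEAD to GET.
    static int resolveStatus(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        int code = conn.getResponseCode();
        if (shouldRetryWithGet(code)) {
            conn.disconnect();
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            code = conn.getResponseCode();
        }
        conn.disconnect();
        return code;
    }
}
```

The GET retry downloads the body, so in a real fetcher you would want to reuse that response rather than fetch the page twice.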

Fetch content from Twitter URLs?

Hi!
I'm trying to fetch content from the URLs inside of a Tweet.

When I try to do it from the official Twitter Android app, Twitter only shares a text like "read this tweet from @user at http://twitter.com/status/8341234812634".

So I fetch this URL hoping to get the real tweet text with the real URL that I want to fetch.

However, when I do that, Twitter instead returns a cookie-consent notice: "To bring you Twitter, we and our partners use cookies on our and other websites. Cookies help personalize Twitter content, tailor Twitter Ads, measure their performance and provide you with a better, faster, safer Twitter experience. By using our services, you agree to our Cookie Use. Close".

I tried to set some "User-Agent" and "Cookie" configuration on the HttpURLConnection before fetching Twitter, without success.

Do you know how I can achieve that?

This is my current code (somewhat dirty; I plan to send you a fix once it works).

public String fetchAsString(String urlAsString, int timeout, boolean includeSomeGooseOptions)
        throws MalformedURLException, IOException {
    HttpURLConnection hConn = createUrlConnection(urlAsString, timeout, includeSomeGooseOptions);
    hConn.setInstanceFollowRedirects(true);

   // Start "hack"
    hConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
    Log.d("EXTRACT", hConn.getRequestProperty("User-Agent"));
    CookieManager cookieManager = new CookieManager();
    CookieHandler.setDefault(cookieManager);

    HttpCookie cookie = new HttpCookie("lang", "en");
    cookie.setDomain("twitter.com");
    cookie.setPath("/");
    cookie.setVersion(0);
    try {
        cookieManager.getCookieStore().add(new URI("http://twitter.com/"), cookie);
    } catch (URISyntaxException e) {
        e.printStackTrace();
    }
   // End "hack"

    String encoding = hConn.getContentEncoding();        
    InputStream is;
    if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
        is = new GZIPInputStream(hConn.getInputStream());
    } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
        is = new InflaterInputStream(hConn.getInputStream(), new Inflater(true));
    } else {
        is = hConn.getInputStream();
    }

    String enc = Converter.extractEncoding(hConn.getContentType());
    String res = createConverter(urlAsString).streamToString(is, enc);
    if (logger.isDebugEnabled())
        logger.debug(res.length() + " FetchAsString:" + urlAsString);
    return res;
}

bump jsoup version

Is snacktory usable with the latest version of jsoup, 1.6.2? If so, it would be great to bump it.
Thanks,
David

ignore hidden items?

It seems that when I do a snacktory pass on Amazon pages (e.g. http://www.amazon.com/Vandaveer-Software-Brick-Buster-Pro/dp/B006T4IJTK),

it's extracting hidden data, which is obviously not that relevant from a readability perspective. I was a bit confused when looking at the output, as a simple find via a browser didn't show the same text. When I looked at the source, I realized it was using display: hidden.

I realize in some cases this might be beyond the scope of readability (if it's not really approaching it from a full DOM perspective), but it seems that in some cases (such as here) these nodes should obviously be excluded.

Snacktory on Android? java.beans.Introspector

Hi! I'm trying to make it work on Android project but when I initialize fetcher:

HtmlFetcher fetcher = new HtmlFetcher(); 

A java.lang.NoClassDefFoundError: Failed resolution of: Ljava/beans/Introspector; is thrown.
I read that java.beans is not fully implemented on Android, and I found the Open Beans project http://code.google.com/p/openbeans/ but I don't know how to make it work, or whether there is a simpler way to fix that exception.

Thank you.

determineImageSource for images without width and height attributes

Images without width and height attributes are ignored by the determineImageSource method.
I think images without these attributes can be considered as images with width > 50 and height > 50.

Furthermore, width=50 and height=50 are also ignored by the test

        if (height > 50)
            weight += 20;
        else if (height < 50)
            weight -= 20;

In my opinion, we should use :

        int weight = 0;
        int height = 0;
        int width = 0;
        try {
            height = Integer.parseInt(e.attr("height"));
        } catch (Exception ex) {}
        if (height == 0 || height >= 50)
            weight += 20;
        else if (height < 50)
            weight -= 20;

        try {
            width = Integer.parseInt(e.attr("width"));
        } catch (Exception ex) {}
        if (width == 0 || width >= 50)
            weight += 20;
        else if (width < 50)
            weight -= 20;

OutputFormatter issue

Hi,

If an HTML page contains something like

aaaa <strong>bbbb </strong>cccc

the result of the replaceTagsWithText method is

aaaa bbbbcccc

The space after bbbb is lost.

But if an HTML page contains something like

aaaa <strong>bbbb</strong> cccc

there is no problem.

The line

TextNode tn = new TextNode(item.text(), topNode.baseUri());

removes the space.

Regards
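The desired behavior, dropping the tag while keeping the element's inner text verbatim (including trailing whitespace), can be illustrated without jsoup. This is a toy regex-based sketch of the behavior being requested, not the actual OutputFormatter fix:

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Inline formatting tags whose markup should vanish but whose
    // text content must be kept character for character.
    private static final Pattern INLINE_TAGS =
            Pattern.compile("</?(?:strong|b|i|em)>", Pattern.CASE_INSENSITIVE);

    // "aaaa <strong>bbbb </strong>cccc" becomes "aaaa bbbb cccc":
    // the trailing space inside the element survives.
    static String stripInlineTags(String html) {
        return INLINE_TAGS.matcher(html).replaceAll("");
    }
}
```

In jsoup terms, the bug arises because Element.text() returns normalized text, so a fix would need to read the raw text nodes instead of the normalized form.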

ensure asian characters are not broken

This is now fixed! But needs a unit test!

From email:

The issue is in Converter.streamToString(). There's a loop to read HTTP data chunks. Each chunk is converted separately to a String, but a chunk may contain only the first (or second) half of a multi-byte character, resulting in corrupted data. It happens sporadically, depending on timing.

Also, the counting of bytesRead was wrong, so on a slow connection there may be a "size exceeded" message with no justification.

What I did to test this problem is read a Japanese article (URL below) with the browser and save its content somewhere (e.g. to a file). Then run the streamToString() function in a loop (with some delay) and each time compare its output with the expected output on file. Sometimes I experienced dozens of successful tests and then several failures, so this is not very consistent, but the errors were frequent enough.

The article I tested on is http://astand.asahi.com/magazine/wrscience/2012022900015.html, and the corruption was almost always visible in the string "300" (see in the article), where instead of the "3" some junk was displayed.
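The failure mode described above can be reproduced with the JDK alone: decoding each raw byte chunk separately splits multi-byte UTF-8 characters at chunk boundaries, while a Reader keeps decoder state across reads and is therefore safe. A minimal demonstration, not snacktory's actual fix:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ChunkDecodeDemo {
    // Decoding each raw byte chunk separately can split a multi-byte
    // UTF-8 character at a chunk boundary, yielding replacement chars.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // An InputStreamReader keeps decoder state between reads, so
    // characters straddling a read boundary are decoded correctly.
    static String decodeWithReader(byte[] data, int chunkSize) throws Exception {
        InputStreamReader r = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        int n;
        while ((n = r.read(buf)) != -1) sb.append(buf, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String original = "300メートル"; // multi-byte Japanese text
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // a chunk size of 4 splits the first 3-byte character
        System.out.println("per-chunk: " + decodePerChunk(bytes, 4));
        System.out.println("reader:    " + decodeWithReader(bytes, 4));
    }
}
```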

Detect publish date

A great feature would be detecting the publish date of the web page.
This information is often located somewhere at the top or the bottom of the main text.
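A first cut could check common metadata tags and then fall back to an ISO-style date anywhere in the page. The tag names below (Open Graph's article:published_time, a generic "date" meta name) are assumptions about common conventions, not anything snacktory implements:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PublishDateSniffer {
    // Common publish-date meta tags; assumes property/name appears
    // before content, which is the usual attribute order.
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*(?:property=\"article:published_time\"|name=\"date\")[^>]*content=\"([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);
    // Fallback: first ISO-like yyyy-MM-dd date in the markup.
    private static final Pattern ISO_DATE =
            Pattern.compile("\\b(\\d{4}-\\d{2}-\\d{2})\\b");

    // Returns the first date-like string found, or null.
    static String findPublishDate(String html) {
        Matcher m = META.matcher(html);
        if (m.find()) return m.group(1);
        m = ISO_DATE.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

A real implementation would parse the DOM instead of using regexes and normalize the result to a date type, but this shows the layered strategy.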

Misspelling in README file

I wanted to change "pepole" to "people" at the beginning of this file, but I could not create a pull request.

Cheers!

Unsupported Popular Internet Landmarks

Articles from the following properties don't currently work:

  • m.slashdot.org

    • Returns JResult with empty contents, probably due to redirect
  • arstechnica.com

    • Produces java.net.ProtocolException: Unexpected status line: �����������������������������������HTTP/1.1 200 OK

    (full trace)

Great work btw. I'll keep hunting for more.

Preserve paragraphs?

Hi, is it possible to preserve/restore paragraphs with Snacktory engine? Extracted articles are not really readable when joined in one big chunk of text.

NoClassDefFoundError: Could not initialize class de.jetwick.snacktory.HtmlFetcher

I'm using snacktory with IntelliJ 15 and Gradle. The following was working yesterday, but stopped working today:

repositories {
  maven {
    url "https://github.com/karussell/mvnrepo/raw/master/releases/"
  }
}
dependencies {
  compile('de.jetwick:snacktory:1.2')
}

Getting errors from HtmlFetcher fetcher = new HtmlFetcher();:

java.lang.NoClassDefFoundError: Could not initialize class de.jetwick.snacktory.HtmlFetcher

Things I tried in IntelliJ:

  • refresh gradle deps
  • Build -> Rebuild Project
  • File -> Invalidate Caches/Restart

Interestingly, if I build a jar with dependencies and run java -jar ..., then it does seem to work.

Any ideas what might have gone wrong?

Build fail due to test failed

Running de.jetwick.snacktory.ArticleTextExtractorTest
2012-09-23 08:22:25,963 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
2012-09-23 08:22:26,006 [main] WARN de.jetwick.snacktory.Converter - Maxbyte of 500000 exceeded! Maybe html is now broken but try it nevertheless. Url: null
Tests run: 72, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.646 sec <<< FAILURE!

Failed tests:
testYomiuri(de.jetwick.snacktory.ArticleTextExtractorTest): yomiuri:????????????????????????????????????????????????????????????????????????????????? ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

Tests run: 95, Failures: 1, Errors: 0, Skipped: 0

The issue is line 111

    assertTrue("yomiuri:" + res.getText(), res.getText().startsWith(" 海津市海津町の国営木曽三川公園で、チューリップが見頃を迎えている。20日までは「チューリップ祭」が開かれており、大勢の人たちが多彩な色や形を鑑賞している=写真="));

The test passes if you remove the leading space here >> startsWith(" 海津

Relevant content in XML island is not returned

When the relevant article content is in an XML island, it isn't returned. See for example this WSJ Japan article http://jp.wsj.com/Finance-Markets/Foreign-Currency-Markets/node_400108 with the following fragment (shortened for clarity):

<p>
<?xml version="1.0" encoding="utf-8"?>
<section xmlns:image="http://ez.no/namespaces/ezpublish3/image/" ...>
<paragraph>(this is the relevant content) イスラエル銀行(**銀行)は景気下支えを目的に過去5カ月間に ...</paragraph>
</section>
</p>

On slow networks Converter logs error

We need to check if we really already have read something ...

Error message for HtmlFetcherIntegrationTest.testHashbang:
"Converter - Couldn't reset stream to re-read with new encoding UTF-8 java.io.IOException: Resetting to invalid mark"

dependency via sbt

Did you manage to add the dependency with sbt? I get different exceptions when referring to different versions.

Can't split getText() into paragraphs

Hello, I'm trying to get the main text from articles, but I get a string without newline characters. Is there a way to extract the text while retaining all newline characters? Otherwise there is only one single paragraph per article…

Or is there a switch to retain certain HTML tags while doing the extraction, like retaining all <a> and <br>?

By the way, thanks for your great work!

Provide optional extraction directives

What about providing optional extraction directives?

In the majority of cases the extraction algorithm works great. But for some websites it can fail to extract the relevant content. For these websites, it could be possible to "help" snacktory focus on a specific part of the page content by providing it a Jsoup selector. For instance, we could have something like:

ArticleTextExtractor extractor = new ArticleTextExtractor();
extractor.setTextSelector("div.article_content");
extractor.setTitleSelector("h2", "first");
String dateRegEx = "xxxx";
extractor.setDateSelector("#published", dateRegEx);

JResult res = extractor.extractContent(rawData);
text = res.getText();
title = res.getTitle();
date = res.getDate();

Not working

I am trying to implement it using the code from the Readme, but it just doesn't work. There are no errors, but it doesn't work either.

If I try to Log.d the value returned from JResult, that debug log is also not in the output. I just don't know what the issue is here.
