webcrawler's People

Contributors

arocketman, daytron, novoselrok

webcrawler's Issues

Re-validating a subreddit causes a JSON exception

Here is the full log:

What subbredit do you want to download from?
slimchance
Enter how many pictures do you want to download: 
3
Images from which period: hot, new, rising, controversial, top or gilded?
hot
Do you want to save in the default folder? (y)es/(n)o
y
Do you want to put the images in the subreddit's folder? (y)es/(n)o
y
==================
Download #1 complete:
Name of the file: /home/ryan/GitRepos/WebCrawler/slimchance/XExfsl4.jpg
==================
Download was skipped, URL unreachable.
HTTP error code: 404 Not Found
==================
Download was skipped, Link is private.
HTTP error code: 403 Permission Denied
==================
There weren't enough pictures for your request.
==================
==================
Download finished!
==================
Do you want to open /home/ryan/GitRepos/WebCrawler/slimchance/ in your File Explorer? (y)es/(n)o
y
Do you want to start again? (y)es/(n)o
y

What subbredit do you want to download from?
slimchance
org.json.JSONException: A JSONObject text must begin with '{' at character 1
    at org.json.JSONTokener.syntaxError(JSONTokener.java:410)
    at org.json.JSONObject.<init>(JSONObject.java:179)
    at org.json.JSONObject.<init>(JSONObject.java:402)
    at com.redditprog.webcrawler.SubRedditChecker.verifySubReddit(SubRedditChecker.java:36)
    at com.redditprog.webcrawler.Launcher.getSub(Launcher.java:64)
    at com.redditprog.webcrawler.Launcher.start(Launcher.java:25)
    at com.redditprog.webcrawler.App.main(App.java:29)
No such subreddit exist! try again.


What subbredit do you want to download from?
slimchance
org.json.JSONException: A JSONObject text must begin with '{' at character 1
No such subreddit exist! try again.


What subbredit do you want to download from?
    at org.json.JSONTokener.syntaxError(JSONTokener.java:410)
    at org.json.JSONObject.<init>(JSONObject.java:179)
    at org.json.JSONObject.<init>(JSONObject.java:402)
    at com.redditprog.webcrawler.SubRedditChecker.verifySubReddit(SubRedditChecker.java:36)
    at com.redditprog.webcrawler.Launcher.getSub(Launcher.java:64)
    at com.redditprog.webcrawler.Launcher.start(Launcher.java:25)
    at com.redditprog.webcrawler.App.main(App.java:29)

Album duplicate detection

We can filter out duplicate albums early instead of going through each file again to check for individual photo duplicates. This saves time during extraction.
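One way to sketch the early album filter is to remember the IDs of albums that were already processed and skip repeats before any file is opened. The class and method names here (`AlbumTracker`, `markIfNew`) are hypothetical and not part of the project; the album ID is assumed to be the last segment of the album URL.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers which Imgur album IDs were already processed. */
final class AlbumTracker {
    private final Set<String> seenAlbumIds = new HashSet<>();

    /** Returns true the first time an album ID is seen, false for repeats. */
    boolean markIfNew(String albumId) {
        // Set.add returns false when the element was already present.
        return seenAlbumIds.add(albumId);
    }

    public static void main(String[] args) {
        AlbumTracker tracker = new AlbumTracker();
        System.out.println(tracker.markIfNew("cyatx")); // true: first occurrence
        System.out.println(tracker.markIfNew("cyatx")); // false: duplicate album
    }
}
```

The extractor could call `markIfNew` before downloading an album and skip the whole album when it returns false.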

Deleted Imgur links cause the program to throw an exception

If a user deletes an image on Imgur but the link is still on Reddit, the program throws:

java.io.FileNotFoundException: https://api.imgur.com/3/album/cyatx
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at com.redditprog.webcrawler.Extractor.extractImgurAPIJson(Extractor.java:282)
at com.redditprog.webcrawler.Extractor.extractImgurAlbum(Extractor.java:238)
at com.redditprog.webcrawler.Extractor.beginExtract(Extractor.java:98)
at com.redditprog.webcrawler.Launcher.start(Launcher.java:34)
at com.redditprog.webcrawler.App.main(App.java:23)
org.json.JSONException: A JSONObject text must begin with '{' at character 0
at org.json.JSONTokener.syntaxError(JSONTokener.java:410)
at org.json.JSONObject.<init>(JSONObject.java:179)
at org.json.JSONObject.<init>(JSONObject.java:402)
at com.redditprog.webcrawler.Extractor.extractImgurAlbum(Extractor.java:240)
at com.redditprog.webcrawler.Extractor.beginExtract(Extractor.java:98)
at com.redditprog.webcrawler.Launcher.start(Launcher.java:34)
at com.redditprog.webcrawler.App.main(App.java:23)
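A possible way to handle this, sketched under assumptions: treat `FileNotFoundException` (which is how the trace above reports the missing album) as "resource gone" and return an empty result, so the caller can skip the link instead of passing an empty string to the JSON parser. `SafeFetcher` and `fetchBody` are hypothetical names; the real `extractImgurAPIJson` probably also sets request headers, which this sketch omits.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/** Hypothetical helper: fetches a response body, tolerating deleted resources. */
final class SafeFetcher {
    /** Returns the body, or Optional.empty() if the resource does not exist. */
    static Optional<String> fetchBody(String address) {
        try {
            URL url = URI.create(address).toURL();
            try (InputStream in = url.openStream()) {
                return Optional.of(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        } catch (FileNotFoundException e) {
            // Missing resource (e.g. a deleted album): nothing to parse, caller should skip.
            return Optional.empty();
        } catch (IOException e) {
            // Other I/O problems are not silently swallowed.
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller would then only construct a `JSONObject` when the returned `Optional` is present.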

Reddit not responding due to high traffic

This causes the site to redirect to an error page, so the extracted JSON string is actually HTML like this:

<html>
<head><title>302 Found</title></head>
<body bgcolor="white">
<center><h1>302 Found</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>

Solution: filter the extracted string and terminate the program if this happens.
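The filtering step could be as small as the check below, a minimal sketch that only tests whether the body starts like a JSON object, which is exactly what the `JSONException` message complains about. `ResponseGuard` and `looksLikeJsonObject` are hypothetical names, not existing project code.

```java
/** Hypothetical helper: rejects response bodies that cannot be a JSON object. */
final class ResponseGuard {
    /** Returns true if the body plausibly holds a JSON object (starts with '{'). */
    static boolean looksLikeJsonObject(String body) {
        return body != null && body.trim().startsWith("{");
    }

    public static void main(String[] args) {
        String errorPage = "<html>\n<head><title>302 Found</title></head>\n</html>";
        String listing = "{\"kind\": \"Listing\"}";
        System.out.println(looksLikeJsonObject(errorPage)); // false
        System.out.println(looksLikeJsonObject(listing));   // true
    }
}
```

When the check fails, the caller can print a message and stop (as proposed here) or ask the user to retry; the same guard would also cover the re-validation failure reported in the first issue.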

Unit test for Launcher Class

JUnit tests for the Launcher class.

19 test cases

Merge Caution:
Affected classes:

  • Launcher
  • pom.xml
  • App
  • Extractor
  • InputValidator
  • LauncherTest (new)

New framework library support proposal for unit testing

I just found out about a really nice library called Mockito. It would make testing our project much more predictable.

For example, when our app displays certain questions, Mockito can match each upcoming question and simulate a corresponding user input:

when(asker.ask("What subbredit do you want to download from?")).thenReturn("aww");
when(asker.ask("Enter how many pictures do you want to download:")).thenReturn(3);

I bring this up because some methods under test behave unpredictably. For example, in start() the sequence of prompts depends on the input itself: a subreddit with pictures triggers the open-folder question at the end; otherwise that question is skipped.

With Mockito, we don't have to check the subreddit manually to work out which inputs a test case needs; we can stub the expected prompts with when(...) conditions. There is a lot more functionality we can use.

Program crashes if the user inputs a random directory that doesn't exist.

ott 03, 2014 12:43:38 PM com.redditprog.webcrawler.Extractor extractSingle
SEVERE: null
java.io.FileNotFoundException: sdasd\fcuiTxB.gif (The system cannot find the path specified)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(Unknown Source)
at java.io.FileOutputStream.<init>(Unknown Source)
at com.redditprog.webcrawler.Extractor.extractSingle(Extractor.java:187)
at com.redditprog.webcrawler.Extractor.beginExtract(Extractor.java:108)
at com.redditprog.webcrawler.Launcher.start(Launcher.java:35)
at com.redditprog.webcrawler.App.main(App.java:22)
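One possible fix, sketched here as an assumption rather than the project's chosen behaviour, is to create the directory (and any missing parents) before the first download and to report failure so the launcher can ask again. `DownloadDirs` and `ensureDirectory` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

/** Hypothetical helper: makes sure the download directory exists before writing. */
final class DownloadDirs {
    /** Creates the directory and its parents if missing; returns false if that fails. */
    static boolean ensureDirectory(String dir) {
        try {
            // No-op when the directory already exists.
            Files.createDirectories(Paths.get(dir));
            return true;
        } catch (IOException | InvalidPathException e) {
            // Not creatable (e.g. a file is in the way) or not a valid path at all.
            return false;
        }
    }
}
```

If silently creating folders is not wanted, the same method could instead be used only to validate the input, re-asking the question when it returns false.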

Refactoring of user input (Question answering)

We have a lot of duplicated code for asking the user a question and reading the answer, like this:

    while (something) {
        System.out.println(GlobalConfiguration.QUESTION);
        String answer = scanner.next().toLowerCase();

        if (answer.equals("y") || answer.equals("yes")) {

etc.

We have to refactor this according to DRY (Don't Repeat Yourself). A possible solution would be a function:

boolean getAnswer(String question);

which lets us handle the answers properly and more cleanly in one place.
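A minimal sketch of such a helper, assuming the yes/no convention used in the existing prompts. The wrapping class `Prompter` is hypothetical, and taking the `Scanner` through the constructor is a design assumption that makes the method easy to test with scripted input.

```java
import java.util.Locale;
import java.util.Scanner;

/** Hypothetical helper that centralizes yes/no questions. */
final class Prompter {
    private final Scanner scanner;

    Prompter(Scanner scanner) {
        this.scanner = scanner;
    }

    /** Repeats the question until the user answers y/yes or n/no. */
    boolean getAnswer(String question) {
        while (true) {
            System.out.println(question);
            String answer = scanner.next().toLowerCase(Locale.ROOT);
            if (answer.equals("y") || answer.equals("yes")) {
                return true;
            }
            if (answer.equals("n") || answer.equals("no")) {
                return false;
            }
            // Any other input: fall through and ask again.
        }
    }
}
```

A call site would then shrink to something like `if (prompter.getAnswer(GlobalConfiguration.QUESTION)) { ... }`.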

A better way to detect less popular subreddits with fewer than 1000 submissions

For subreddits with fewer than 1000 submissions, the current code keeps incrementing the count variable until it reaches 1000, even when there is no new page to open.

I suggest a better solution that avoids the unnecessary counting: by parsing the JSON, we can check whether there are more child items while comparing against num_pics.
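The idea can be sketched as a loop that stops as soon as the listing reports no further page. This relies on Reddit listings carrying an `after` token that is null on the last page; the `Page` class and the `fetchPage` function are hypothetical stand-ins for the real JSON handling, and `numPics` corresponds to num_pics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch: walk listing pages until enough images or no next page. */
final class ListingWalker {
    /** Simplified view of one listing page (assumed shape, not the project's). */
    static final class Page {
        final List<String> imageUrls;
        final String after; // null when there is no further page

        Page(List<String> imageUrls, String after) {
            this.imageUrls = imageUrls;
            this.after = after;
        }
    }

    /** Collects up to numPics URLs; fetchPage receives the previous "after" token. */
    static List<String> collect(Function<String, Page> fetchPage, int numPics) {
        List<String> found = new ArrayList<>();
        String after = null; // null requests the first page
        do {
            Page page = fetchPage.apply(after);
            for (String url : page.imageUrls) {
                if (found.size() < numPics) {
                    found.add(url);
                }
            }
            after = page.after;
            // Stop early: either the request is satisfied or the listing has ended.
        } while (found.size() < numPics && after != null);
        return found;
    }
}
```

With this shape, a small subreddit costs only as many requests as it has pages, instead of counting up to the search limit.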

beginExtractor loops TOTAL_SEARCH_LIMIT times even though the subreddit does not have that many submissions

Ex:

What subbredit do you want to download from?
fantasypoliticsleague
Enter how many pictures do you want to download:
10
Images from which period: hot, new, rising, controversial, top or gilded?
top
Top links from which period: hour, day, week, month, year, all
all
Do you want to save in the default folder? (y)es/(n)o
y
There weren't enough pictures for your request. <-- It just looped 1000 times (useless calculations)

Express mode - single input option

This feature gives the user the option to drive the application with a single input, with each required value separated by a comma.

e.g.

  1. aww,20,top,all,y
  2. pics,100,hot,n,/home/user/Pictures

All of this data is validated the same way as in the regular question/answer cycle; the app only needs to parse the single input line from the user.

A new question is asked to enable this option:
e.g. Do you want to enter express mode? (y)es/(n)o:
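A rough sketch of parsing the two example inputs above. The field order is inferred from those examples only: a period follows the sort only for "top", and a folder path follows only when the default-folder answer is "n". `ExpressInput` and its fields are hypothetical names, and a folder path containing a comma is not supported by this simple split.

```java
import java.util.Locale;

/** Hypothetical value object for one express-mode input line. */
final class ExpressInput {
    final String subreddit;
    final int numPics;
    final String sort;
    final String period; // null unless sort is "top" (assumption from the examples)
    final String folder; // null means "use the default folder"

    private ExpressInput(String subreddit, int numPics, String sort, String period, String folder) {
        this.subreddit = subreddit;
        this.numPics = numPics;
        this.sort = sort;
        this.period = period;
        this.folder = folder;
    }

    /** Parses e.g. "aww,20,top,all,y" or "pics,100,hot,n,/home/user/Pictures". */
    static ExpressInput parse(String line) {
        String[] f = line.trim().split("\\s*,\\s*");
        try {
            int i = 0;
            String subreddit = f[i++];
            int numPics = Integer.parseInt(f[i++]);
            String sort = f[i++].toLowerCase(Locale.ROOT);
            String period = sort.equals("top") ? f[i++].toLowerCase(Locale.ROOT) : null;
            boolean useDefault = f[i++].toLowerCase(Locale.ROOT).startsWith("y");
            String folder = useDefault ? null : f[i++];
            return new ExpressInput(subreddit, numPics, sort, period, folder);
        } catch (ArrayIndexOutOfBoundsException | NumberFormatException e) {
            // Too few fields or a non-numeric picture count.
            throw new IllegalArgumentException("Malformed express input: " + line, e);
        }
    }
}
```

Each parsed field would still go through the existing validation (subreddit check, InputValidator) before extraction starts.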

Bug with mobile album imgur links

When we stumble upon this kind of link:

http://www.reddit.com/r/gameofthrones/comments/2i1hv4/no_spoilersgot_my_bottle_of_the_new_ommegang_game/

We extract:

http://m.imgur.com/ogqM50H,ikw80zW

This is not handled correctly by the software and causes an exception:

java.io.FileNotFoundException: http://m.i.imgur.com/ogqM50H,ikw80zW.gif
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.URL.openStream(Unknown Source)
at com.redditprog.webcrawler.Extractor.extractSingle(Extractor.java:186)
at com.redditprog.webcrawler.Extractor.beginExtract(Extractor.java:108)
at com.redditprog.webcrawler.Launcher.start(Launcher.java:34)
at com.redditprog.webcrawler.App.main(App.java:22)

As you can see the problem is here:

                        } else {
                            urlString = urlString.replace("imgur", "i.imgur");
                            url = new URL(urlString + ".gif");
                        }

The software doesn't recognize the link as matching IMGUR_ALBUM_URL_PATTERN (imgur.com/a) or IMGUR_SINGLE_URL_PATTERN (i.imgur), so it assumes the link is a GIF when it is not.
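A possible first step toward a fix, sketched under assumptions: detect the comma-separated form and split it into individual image IDs, so each one can be handled as a separate single image instead of being turned into one bogus `.gif` URL. `ImgurLinks` and `extractImageIds` are hypothetical names; how each ID is then mapped to a download URL is left to the existing single-image logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for links like http://m.imgur.com/ogqM50H,ikw80zW */
final class ImgurLinks {
    /** Splits the last path segment of an Imgur link into its image IDs. */
    static List<String> extractImageIds(String link) {
        // Drop a query string, if any, before looking at the path.
        int query = link.indexOf('?');
        String path = query >= 0 ? link.substring(0, query) : link;
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);

        List<String> ids = new ArrayList<>();
        for (String id : lastSegment.split(",")) {
            if (!id.isEmpty()) {
                ids.add(id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Prints [ogqM50H, ikw80zW]
        System.out.println(extractImageIds("http://m.imgur.com/ogqM50H,ikw80zW"));
    }
}
```

Note that the blanket `replace("imgur", "i.imgur")` also turns `m.imgur.com` into `m.i.imgur.com`, as the exception message shows, so the mobile host would need separate handling as well.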
