reddit-crawler

Description:

reddit-crawler obtains subreddit data, reports completion status, and displays its data graphically.

Crawling

crawler.py - Builds a database of related subreddits

show_progress.py - Periodically checks the database and reports crawler progress

Analysis

grapher.py - Creates a graph of all subreddits that are connected through the recommendations section

An interactive visualization of the data can be found here: https://github.com/cdated/subredditor

Usage:

To use grapher.py, either run crawler.py to populate the MongoDB database or use mongorestore on the BSON dump in data/dump/reddit.

Loading Database

The repo already includes a database (approx. 8 MB) for those who want to see the connections without running the crawler. To load it, run the restore_db.sh script.

Crawling

crawler.py starts at a user-defined subreddit and collects all of its recommendations. It then follows the parsed recommendations to find more subreddits, recursing until it has visited every subreddit reachable through these links. Previously explored subreddits are not revisited. The backlog of subreddits to be visited and the subreddit relationships are stored in MongoDB and, if available, loaded when the application starts.

usage: crawler.py [-h] -s SUBREDDIT

optional arguments:
  -h, --help            show this help message and exit
  -s SUBREDDIT, --subreddit SUBREDDIT
                        Subreddit seed
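
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the code in crawler.py: the function names, the FIFO ordering of the backlog, and the in-memory toy data are all assumptions, and get_recommendations is a hypothetical stand-in for the part that parses Reddit and persists state to MongoDB.

```python
from collections import deque


def crawl(seed, get_recommendations):
    """Walk subreddit recommendations starting from a seed.

    get_recommendations is a hypothetical callable that returns the
    names recommended by one subreddit; the real crawler parses them
    from Reddit and keeps its state in MongoDB.
    """
    backlog = deque([seed])   # subreddits waiting to be visited
    visited = set()           # subreddits already explored
    edges = []                # (subreddit, recommended subreddit) pairs

    while backlog:
        current = backlog.popleft()
        if current in visited:
            continue          # never revisit an explored subreddit
        visited.add(current)
        for name in get_recommendations(current):
            edges.append((current, name))
            if name not in visited:
                backlog.append(name)
    return visited, edges


# Toy data standing in for real recommendation sections (assumed).
toy = {
    "python": ["learnpython", "programming"],
    "learnpython": ["python"],
    "programming": ["python", "coding"],
}
visited, edges = crawl("python", lambda name: toy.get(name, []))
print(sorted(visited))
```

Keeping the visited set separate from the backlog is what guarantees termination even when subreddits recommend each other in a cycle.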

show_progress.py checks the backlog every 2 seconds and prints the current subreddit being crawled, the number visited, and the number currently in the backlog.

./show_progress.py
Checking truepoetry
Checked: 14
Remaining: 159

Checking badarthistory
Checked: 15
Remaining: 158
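
A polling loop of this kind might look like the sketch below. It is only an approximation of what show_progress.py does: read_state is a hypothetical callable standing in for the MongoDB query, and stopping when it returns None is an assumption made so the example terminates.

```python
import time


def format_progress(current, checked, remaining):
    """Build one report in the same shape as the sample output above."""
    return f"Checking {current}\nChecked: {checked}\nRemaining: {remaining}\n"


def watch(read_state, interval=2.0, sleep=time.sleep):
    """Poll read_state() and print a report after each poll.

    read_state is a hypothetical callable returning a tuple
    (current, checked, remaining), or None when there is nothing
    left to report (an assumption of this sketch).
    """
    polls = 0
    while True:
        state = read_state()
        if state is None:
            break
        print(format_progress(*state))
        polls += 1
        sleep(interval)   # wait before checking the backlog again
    return polls


# Demo with canned states and a no-op sleep so it finishes instantly.
states = iter([("truepoetry", 14, 159), ("badarthistory", 15, 158), None])
polls = watch(lambda: next(states), sleep=lambda seconds: None)
```

Passing the sleep function as a parameter keeps the 2-second default for real use while letting the demo run without delays.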

Analysis

grapher.py generates a graph of recommended subreddits and writes it as a Graphviz file. It can produce a censored graph that hides subreddits featuring explicit content (the default), the full graph, or the difference of the two. The minimum flag filters out subreddits whose subscriber count is below the specified number.

usage: grapher.py [-h] [-c] -m MINIMUM [-n] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -c, --censored        Hide over 18 subreddits
  -m MINIMUM, --minimum MINIMUM
                        Min subcribers to be added
  -n, --nsfw            Only over 18 subreddits
  -v, --verbose         Show debugging
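
The filtering and Graphviz output can be illustrated with the sketch below. It is not grapher.py itself: the to_dot function, the record layout (a "subscribers" count and an "over18" flag per subreddit), the mode names, and the use of directed edges are all assumptions made for the example.

```python
def to_dot(subreddits, edges, minimum=0, mode="censored"):
    """Render a filtered recommendation graph as Graphviz DOT text.

    subreddits maps name -> {"subscribers": int, "over18": bool}
    (a hypothetical record layout). mode is "censored" (hide over-18
    subreddits), "nsfw" (only over-18 subreddits), or "full".
    """
    def keep(info):
        if info["subscribers"] < minimum:
            return False              # below the subscriber threshold
        if mode == "censored":
            return not info["over18"]
        if mode == "nsfw":
            return info["over18"]
        return True                   # "full": keep everything else

    nodes = {name for name, info in subreddits.items() if keep(info)}
    lines = ["digraph subreddits {"]
    for name in sorted(nodes):
        lines.append(f'    "{name}";')
    for src, dst in sorted(edges):
        # Only draw an edge when both endpoints survived the filter.
        if src in nodes and dst in nodes:
            lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


# Made-up sample records, for illustration only.
subs = {
    "python": {"subscribers": 900000, "over18": False},
    "tinysub": {"subscribers": 12, "over18": False},
    "adultsub": {"subscribers": 50000, "over18": True},
}
links = [("python", "tinysub"), ("python", "adultsub")]
print(to_dot(subs, links, minimum=1000))
```

The resulting text can be saved to a file and rendered with the Graphviz dot tool.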
