
This project was forked from aravindr18/redditr--insight-data-engineering-project.


RedditR - Live Trends and Engagement on Reddit

This is the project I carried out during the seven-week Insight Data Engineering Fellows Program, which helps recent graduates and experienced software engineers learn the latest open-source technologies by building a data platform that handles large, real-time datasets.

RedditR is a real-time content engagement platform that helps maximize your content engagement on Reddit. The platform gives you real-time trend tracking so you never miss out on anything. You can find the app at www.redditr.space.

Motivation

The functional motivation for this project was to create a real-time trending feature very similar to the one on Twitter. With so many interactions happening on Reddit, it can be useful for a user to gauge the traction a subreddit is receiving. With so much information out there, it can also be overwhelming for users to navigate Reddit to find the content they like, so I built two features into the app, "Engagement" and "Snapshot", that let users track historical stats about a particular subreddit along with dynamic recommendations of subreddits they may like.

The engineering motivation behind building such a product is to demonstrate the ability to create a scalable, real-time big data pipeline using open-source technologies.

Data Pipeline

Thanks to /r/Stuck_In_the_Matrix for providing the historical dataset (October 2007 - December 2015) for this project. The entire repo is here. The data is a monthly dump in the bz2 file format. The dumps were downloaded to an S3 bucket on Amazon AWS and uncompressed using a wget request in Python (automated with a bash script). To ensure fast processing on Spark, the files from S3 were processed and compressed into Parquet. Parquet is a columnar store that lets us store the JSON data together with its schema on HDFS, thereby leveraging the fault tolerance and distributed nature of the Hadoop Distributed File System. The original dataset, roughly 1084.5 GB, came to a mere 187.8 GB when compressed into Parquet, a large space saving, with queries running about 3x faster on Spark compared to text data. Check out the Ingest folder for implementation details.
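As a rough sketch of the ingest step, the snippet below stream-decompresses a bz2 dump and parses one JSON comment per line before any Parquet conversion. The sample records and field names here are illustrative, not taken from the real dump; in the actual pipeline the parsed data is written out as Parquet via Spark.

```python
import bz2
import io
import json

# Hypothetical two-record sample of a monthly dump: one JSON object per line.
sample = "\n".join(json.dumps(o) for o in [
    {"author": "alice", "subreddit": "python", "body": "nice writeup"},
    {"author": "bob", "subreddit": "python", "body": "thanks!"},
])
compressed = bz2.compress(sample.encode("utf-8"))

def iter_comments(raw_bz2):
    """Stream-decompress a bz2 dump and yield one comment dict per line."""
    with bz2.open(io.BytesIO(raw_bz2), mode="rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

comments = list(iter_comments(compressed))
```

Streaming line by line keeps memory flat even on multi-gigabyte monthly dumps, since no full decompressed file ever needs to sit in memory at once.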

Spark is used for batch processing; check out the Batch folder for the Python implementation. The first feature, called "Flashback", lets users input their username, and the system shows the first-ever post of that particular user. There is also support for downloading all of a user's posts as a JSON file, which can then be used for user profiling.

The next feature, implemented in PySpark, is recommendation and user interaction, which can be found in the SimpleGraph folder. The idea is to connect users who have interacted with each other in a directed graph, grouped by subreddit. Once this is done, we compute the in-degree and out-degree of every node in each subreddit's cluster. This tells us who the most influential and active users in a particular subreddit are, which helps maximize content engagement: it makes sense to interact often with these influential users to elicit maximum visibility and interaction on your posts. Recommendations are generated by looking up the subreddits that the influential users of a subreddit you like are active on, and suggesting those subreddits.

The Recommendation folder contains a Python implementation of a collaborative filtering algorithm (ALS). This was done to validate the user-graph approach I came up with, and the results were intuitive and promising. The user-graph model also scored much better on compute time: it took 4.6 minutes to compute recommendations, compared to approximately 13 hours for the ALS approach with 20 iterations.
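To make the degree computation concrete, here is a minimal stdlib-only sketch of per-subreddit in-degree and out-degree counting. The edge list and usernames are made up for illustration; the real job runs over the full reply graph in PySpark.

```python
from collections import Counter, defaultdict

# Hypothetical reply edges: (replier, parent_author, subreddit).
edges = [
    ("alice", "bob", "python"),
    ("carol", "bob", "python"),
    ("bob", "alice", "python"),
    ("dave", "erin", "datascience"),
]

indegree = defaultdict(Counter)   # replies received, keyed by subreddit
outdegree = defaultdict(Counter)  # replies made, keyed by subreddit
for src, dst, sub in edges:
    outdegree[sub][src] += 1
    indegree[sub][dst] += 1

# Most influential user per subreddit = highest in-degree in that cluster.
influential = {sub: c.most_common(1)[0][0] for sub, c in indegree.items()}
```

In this toy graph, "bob" receives the most replies in /r/python, so he is the influential node whose other active subreddits would seed the recommendations.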


Real-time processing was done using Storm, with Kafka as the publish-subscribe broker; the implementation can be found in the Storm folder. The real-time Reddit API client can be found in the Stream folder. Cassandra was chosen as the key-value store for this project.
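The publish-subscribe shape of the streaming layer can be sketched as below, with an in-memory queue standing in for the Kafka topic. The topic name and message fields are assumptions for illustration; the real pipeline publishes to a Kafka broker that Storm consumes from.

```python
import json
import queue

# Stand-in for the Kafka topic (e.g. a hypothetical "reddit-comments" topic).
topic = queue.Queue()

def produce(comment):
    """Producer side: serialize a comment and publish it to the topic."""
    topic.put(json.dumps(comment).encode("utf-8"))

def consume():
    """Consumer side: pull one message off the topic and deserialize it."""
    return json.loads(topic.get(timeout=1).decode("utf-8"))

produce({"subreddit": "python", "body": "hello from the stream"})
msg = consume()
```

Decoupling producer and consumer through a broker like this is what lets the Reddit API poller and the Storm topology scale and fail independently of each other.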

