reddit discovery

Home Page: https://anvaka.github.io/redsim/

License: MIT License

Shell 1.16% JavaScript 62.27% HTML 31.07% Less 5.50%

NOTE: Check out a couple more similar projects: https://anvaka.github.io/sayit/ and https://anvaka.github.io/map-of-reddit/. The description below is slightly outdated and should be updated.

Your comments on reddit are not only what makes reddit fun. They can also be used to x-ray the friendly alien and reveal its hidden structure.

Redditors who commented to this subreddit also commented to...

This simple idea is the core of the current recommendation website. Despite its simplicity, it yields amazing results.

Before you read any further, please go ahead and check for yourself: https://anvaka.github.io/redsim/

It works really well for subreddits with under 1 million subscribers. But how exactly does it work?

How exactly does it work?

Recently /u/Stuck_In_the_Matrix publicly released a dataset of ~1.7 billion reddit comments. Each record contains the author's name and the target subreddit.

If you post to subreddits A and C very often, it doesn't necessarily mean that A and C are related. But if there are thousands of people posting to both A and C, we could suspect that the subreddits are related.
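A minimal sketch of this counting step, assuming each comment has been reduced to an `author` and a `subreddit` field. The sample records and the helper names `groupAuthorsBySubreddit` and `countShared` are hypothetical and only illustrate the idea:

```javascript
// Hypothetical sample records; real data would come from the comments dump.
const comments = [
  { author: 'alice', subreddit: 'A' },
  { author: 'alice', subreddit: 'C' },
  { author: 'bob', subreddit: 'A' },
  { author: 'bob', subreddit: 'C' },
  { author: 'carol', subreddit: 'A' },
  { author: 'alice', subreddit: 'A' }, // repeat comments count only once
];

// Group comment authors by subreddit: Map<subreddit, Set<author>>.
function groupAuthorsBySubreddit(records) {
  const authorsBySubreddit = new Map();
  for (const { author, subreddit } of records) {
    if (!authorsBySubreddit.has(subreddit)) {
      authorsBySubreddit.set(subreddit, new Set());
    }
    authorsBySubreddit.get(subreddit).add(author);
  }
  return authorsBySubreddit;
}

// Count the authors who appear in both sets.
function countShared(a, b) {
  let shared = 0;
  for (const author of a) {
    if (b.has(author)) shared += 1;
  }
  return shared;
}

const groups = groupAuthorsBySubreddit(comments);
console.log(countShared(groups.get('A'), groups.get('C'))); // 2
```

Using a Set per subreddit means a prolific commenter is counted once, so the overlap reflects how many people two subreddits share rather than how many comments.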

Of course sometimes A is way more popular than C, and we need to take that into account. Let's consider three subreddits:

  • A - has 1,000 subscribers
  • B - also has 1,000 subscribers; and
  • C - has only 100 subscribers

Imagine A and B share 100 redditors who posted to both A and B. Also imagine A and C share another 100 redditors who posted to both A and C.

Which subreddit is more related to A? Is it B or C?

Only 10% of B's members have posted to A, while 100% of C's members have posted to A. This gives C a very high "relationship index" with A.
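The percentages above can be checked with a few lines of arithmetic. This is only an illustration using the numbers from the example; `overlapShare` is a hypothetical helper name:

```javascript
// Fraction of a subreddit's members who also posted to another subreddit.
function overlapShare(sharedCount, subredditSize) {
  return sharedCount / subredditSize;
}

const shareOfB = overlapShare(100, 1000); // 100 of B's 1,000 members posted to A
const shareOfC = overlapShare(100, 100); // 100 of C's 100 members posted to A
const shareOfAInC = overlapShare(100, 1000); // the same overlap, seen from A's side

console.log(shareOfB, shareOfC, shareOfAInC); // 0.1 1 0.1
```

Note that this share depends on which side you measure from: all of C posted to A, but only 10% of A posted to C. A symmetric measure avoids having to pick a side.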

It turns out this "relationship index" has many names and forms. One of the simplest forms is called the Jaccard index (Jaccard similarity).

Jaccard similarity

To find how similar subreddits A and B are to each other, all we need to do is:

  1. Find how many users who posted to A have also posted to B (intersection of A and B).
  2. Find how many users have posted to A or B (union of A and B).
  3. Divide the result of step 1 by the result of step 2 to get the Jaccard similarity.
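The three steps can be sketched over two sets of user names. This is a minimal illustration, not the project's actual code; the function name `jaccard` and the sample user names are assumptions:

```javascript
// Jaccard similarity of two Sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  // Step 1: count the members present in both sets.
  let intersection = 0;
  for (const item of a) {
    if (b.has(item)) intersection += 1;
  }
  // Step 2: size of the union, without building the union itself.
  const union = a.size + b.size - intersection;
  // Step 3: divide; two empty sets are treated as having similarity 0.
  return union === 0 ? 0 : intersection / union;
}

const postedToA = new Set(['u1', 'u2', 'u3']);
const postedToB = new Set(['u2', 'u3', 'u4']);
console.log(jaccard(postedToA, postedToB)); // 0.5 (2 shared users out of 4)
```

The result always lies between 0 (no shared users) and 1 (identical sets), and it is the same whichever set is passed first.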

In the example above, the Jaccard similarity of A and C is J(A, C) = 100/1000 = 0.1, while the Jaccard similarity of A and B is J(A, B) = 100/(900 + 100 + 900) ≈ 0.053.

This makes C about twice as similar to A as B is.
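The same numbers can be reproduced from sizes alone, since the union is the sum of both sizes minus the overlap. That is where 900 + 100 + 900 comes from: 900 users only in A, 100 in both, 900 only in B. A small sketch, with `jaccardFromCounts` as a hypothetical helper name:

```javascript
// Jaccard similarity from set sizes and the size of their intersection.
function jaccardFromCounts(sizeA, sizeB, shared) {
  const union = sizeA + sizeB - shared; // inclusion-exclusion
  return shared / union;
}

const jAC = jaccardFromCounts(1000, 100, 100); // 100 / 1000
const jAB = jaccardFromCounts(1000, 1000, 100); // 100 / 1900

console.log(jAC.toFixed(3)); // "0.100"
console.log(jAB.toFixed(3)); // "0.053"
console.log((jAC / jAB).toFixed(1)); // "1.9"
```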

Drawbacks

This approach works extremely well for subreddits with fewer than 1,000,000 subscribers. For more popular subreddits, the results get saturated by the popularity of those subreddits. If you have an idea how to fix this, please let me know :).

Technical details

Note: The details below outline my old procedure. I didn't use it to build the latest snapshot, which is based on 150 million unique comments. Still keeping it here for reference.

To compute similarity between subreddits I downloaded only one month's worth of public comments. This gives more than 50,000,000 user ⇄ subreddit records, which translates to almost 50,000 unique subreddits.

Each record is stored in a Redis database in these 50 lines of code. Then I use SINTERSTORE and SUNIONSTORE to compute the intersection and union of subreddits (code).
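To show how those two commands lead to a similarity score, here is an in-memory stand-in that mimics the set commands. It does not talk to Redis and is not the project's code: the class `TinySetStore` and the key names are hypothetical. It relies on the documented behavior that SINTERSTORE and SUNIONSTORE store their result under a destination key and return the number of members in it.

```javascript
// In-memory stand-in for three Redis set commands (illustration only).
class TinySetStore {
  constructor() {
    this.sets = new Map();
  }

  // Like SADD: add members to the set at `key`, return how many were new.
  sadd(key, ...members) {
    if (!this.sets.has(key)) this.sets.set(key, new Set());
    const set = this.sets.get(key);
    let added = 0;
    for (const member of members) {
      if (!set.has(member)) {
        set.add(member);
        added += 1;
      }
    }
    return added;
  }

  // Like SINTERSTORE: store the intersection, return its size.
  sinterstore(destination, ...keys) {
    const [first, ...rest] = keys.map((k) => this.sets.get(k) || new Set());
    const result = new Set();
    for (const member of first) {
      if (rest.every((set) => set.has(member))) result.add(member);
    }
    this.sets.set(destination, result);
    return result.size;
  }

  // Like SUNIONSTORE: store the union, return its size.
  sunionstore(destination, ...keys) {
    const result = new Set();
    for (const key of keys) {
      for (const member of this.sets.get(key) || []) result.add(member);
    }
    this.sets.set(destination, result);
    return result.size;
  }
}

const store = new TinySetStore();
store.sadd('sub:A', 'u1', 'u2', 'u3');
store.sadd('sub:B', 'u2', 'u3', 'u4');

const interSize = store.sinterstore('tmp:inter', 'sub:A', 'sub:B');
const unionSize = store.sunionstore('tmp:union', 'sub:A', 'sub:B');
console.log(interSize / unionSize); // 0.5
```

Because both commands return the size of the stored set, the two return values are all that is needed for the Jaccard ratio.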

This is the most straightforward brute-force approach to computing similarities. It took my old friend, a MacBook Pro, almost 70 CPU hours to compare all subreddits with each other.
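A rough sense of why brute force is slow: the number of pairs grows quadratically with the number of subreddits. The sketch below assumes each unordered pair is compared once (the similarity is symmetric) and rounds the subreddit count up to 50,000; `pairCount` and `forEachPair` are hypothetical helper names:

```javascript
// Number of unordered pairs among n items: n * (n - 1) / 2.
function pairCount(n) {
  return (n * (n - 1)) / 2;
}

// Visit every unordered pair exactly once.
function forEachPair(items, visit) {
  for (let i = 0; i < items.length; i += 1) {
    for (let j = i + 1; j < items.length; j += 1) {
      visit(items[i], items[j]);
    }
  }
}

console.log(pairCount(50000)); // 1249975000, about 1.25 billion comparisons

const visited = [];
forEachPair(['A', 'B', 'C'], (x, y) => visited.push(`${x}-${y}`));
console.log(visited); // [ 'A-B', 'A-C', 'B-C' ]
```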

What's next?

I truly hope you enjoyed the simplicity of the formula and the power of results. If you have any feedback please let me know!

license

MIT

redsim's People

Contributors

anvaka, hussein-esmail7

redsim's Issues

idea: a Chrome extension

What if there was an extension that shows similar subreddits on the subreddit page you're at? I can help with the CSS part if you'd write automatically searching & injecting results on the page.

I guess that was a native feature Reddit had before the redesign, but it has since disappeared or is not visible on most subreddits, so it would be super useful.

Share The Results on API

Just adding all the data to a JSON file here would work as well. This way you don't have to bear any cost.

Would be a good idea to update the comments

A lot has changed since 3 years ago, and a lot of subreddits in the redsim are now closed (I'm not talking about the recent Reddit protest), so it would be nice to update the comments database maybe
