GithubHelp home page GithubHelp logo

pombredanne / ghindex Goto Github PK

View Code? Open in Web Editor NEW

This project forked from anvaka/ghindex

0.0 1.0 0.0 350 KB

Creates github index for similar repositories discovery

Home Page: http://www.yasiv.com/github/#/

License: MIT License

Shell 27.73% JavaScript 72.27%

ghindex's Introduction

Finding related projects on GitHub

This repository creates a recommendation database of "related" projects on GitHub. Interactive version is available here: http://www.yasiv.com/github/#/

How does it work?

How can we tell whether Project A is more related to Project B, than it is related to Project C?

Turns out, that we, project followers, tend to give stars to similar projects. If I gave stars to A, B, and C, and you gave stars to A, B, C, and D, then I should probably go check out D as well. Giving stars on GitHub, most of the time, is a good sign of project appreciation. So if we have starred three projects together, then we value similar things.

To turn this fact into a number, I'm using Sorensen-Dice similarity coefficient:

                        number_of_shared_stars(A, B)
similarity(A, B) = ---------------------------------------
                   number_of_stars(A) + number_of_stars(B)

"Developers who gave star to this repository, also gave star to ..." metric works decently well for projects with 150 .. 2000 stars. For projects with smaller amount of stars there is not enough intersection between watchers. For extremely popular projects coefficient becomes higher when other project is also extremely popular. Thus projects like Bootstrap get Angular, jQuery, and Node as the most relevant.

Data Gathering

GitHub Archive provides gigabytes of data from GitHub. We can query it using Google's BigQuery API.

For example, this query:

SELECT repository_url, actor_attributes_login
FROM [githubarchive:github.timeline]
WHERE type='WatchEvent'

Give us list of repositories, along with users who gave them stars:

| Row | repository_url                                     | actor_attributes_login |
| --- | -------------------------------------------------- | ---------------------- |
| 1   | https://github.com/alump/Masonry                   | markiewb               |
| 2   | https://github.com/andrewjstone/rafter             | kirsn                  |
| 3   | https://github.com/jgraph/draw.io                  | nguyennamtien          |
| 4   | https://github.com/samvermette/SVWebViewController | dlo                    |
| 5   | https://github.com/mafintosh/peerflix              | 0xPr0xy                |
| ..  | ...                                                | ...                    |

By iteratively processing each record we can calculate number of stars for each project. We can also find how many shared stars each project has with every other project. But... The dataset is huge. Today (Nov 30, 2014) there are 25M watch events produced, by more than 1.8M unique users. They are given to more than 1.2M unique repositories. We need to reduce the dataset:

SELECT repository_url, actor_attributes_login
FROM [githubarchive:github.timeline]
WHERE type='WatchEvent' AND actor_attributes_login IN (
  SELECT actor_attributes_login FROM [githubarchive:github.timeline]
  WHERE type='WatchEvent'
  GROUP BY actor_attributes_login HAVING (count(*) > 1) AND (count (*) < 500)
)
GROUP EACH BY repository_url, actor_attributes_login;

Why do we limit lower bound to at least 2 stars?

Since we are using Sorensen-Dice similarity coefficient, users who gave only 1 star total, can be excluded from "shared stars" metric. In theory this will slightly skew similarity coefficient and make two projects more similar than they should be, but in practice results seem to be helpful enough. This also serves as a good filter against bot attacks.

Why do we limit upper bound to at most 500?

To save the CPU power. Is this bad? There are only 0.7% of users who gave more than 500 stars.

This query reduces dataset from 25M to 16M records.

Data storing

We got the dataset, downloaded and stored into CSV file, for further processing.

To calculate similarity we need to be able to quickly answer two questions:

  1. Who gave stars to project A?
  2. Which projects were starred by User B?

If only we could save this into hash-like data structure - that would give us O(1) time to answer both of these questions.

Naive solution to store all inside one process into hash (using either C++ or node) turned out to be extremely inefficient. My processes exceeded 8GB RAM limit, and started killing my laptop with constant swapping.

Maybe I should save it into a local database?

I tried to use neo4j but it failed with out of memory exception during CSV import.

Next and last stop was redis. Absolutely beautiful piece of software. It swallowed 16M rows without blinking an eye. RAM was within sane 3GB range, and disk utilization is only 700MB.

Building recommendations

Recommendation database is created by these ~200 lines of code. There is a lot of asynchronous code in there, hidden behind promises.

In nutshell, this is what it's doing:

1. Find all repositories with more than 150 stars.
2. For each repository find users who gave it a star.
     For each user who gave a star, find which other projects were starred.
     For each other project increase number of shared stars
3. Produce similarity coefficient.

Final results are saved to disk, and then uploaded to S3, so that the frontend can immediately get them.

Final Notes

It takes ~2 hours 20 minutes to construct recommendations for 15K most popular GitHub projects. It also takes another 40 minutes to prepare/download data from BigQuery.

My previous approach, where I had to manually index GitHub via GitHub's API, was taking approximately 5 days to build the index, and one more day to calculate recommendations.

GitHub Archive is awesome; Redis is awesome too! Maybe next step will be improving results with content-based recommendation. E.g. we could index source code and find similarity based on AST. Anyway, please let me know what you think.

license

MIT

ghindex's People

Contributors

anvaka avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.