GithubHelp home page GithubHelp logo

ed-science / analyzing_github Goto Github PK

View Code? Open in Web Editor NEW

This project forked from fhoffa/analyzing_github

0.0 1.0 0.0 16 KB

Analyzing GitHub with BigQuery and other tools

Home Page: https://medium.com/@hoffa

License: Apache License 2.0

analyzing_github's Introduction

Analyzing GitHub

A growing collection of what I know about analyzing GitHub with BigQuery and other tools

Main datasets in BigQuery

GH Archive

A log of all public events happening on GitHub.

Updated: Hourly

GHTorrent

A relational dataset that adds more data from GitHub's knowledge graph:

Updated: ~Monthly (depending on the project owner availability).


Note: Hasn't been updated since 2019-06. The latest upload is here:

GitHub contents

A snapshot of the open source contents of GitHub, ready to be analyzed.

Updated: ~Weekly.


FAQ

GitHub contents

Could you please share more information regarding which projects are available via BQ?

The bigquery-public-data.github_repos.contents table only contains the copy of ASCII files that are less than 10MB. To be included, projects need to be open source (as determined by GitHub's License API).

Do you provide forked repos? or only non-forked?

Mostly non-forked.

Which percentage of the repositories of GitHub exists on the data set?

There's many ways to count this. One of them:

SELECT COUNT(repo_with_stars) repos_with_stars
  , 100*ROUND(COUNT(repo_in_mirror)/COUNT(repo_with_stars), 4) percentage_in_mirror
FROM (
  SELECT DISTINCT repo_name AS repo_in_mirror
  FROM `bigquery-public-data.github_repos.files` 
) a RIGHT JOIN (
  SELECT repo.name AS repo_with_stars, APPROX_COUNT_DISTINCT(actor.id) stars 
  FROM `githubarchive.month.201706` 
  WHERE type='WatchEvent'
  GROUP BY 1 
  HAVING stars > 15
) b
ON a.repo_in_mirror = b.repo_with_stars
LIMIT 10

The results say that 6 months ago (June 2017) ~21,890 repositories got more than 15 stars. Of those ~21,890 repositories, ~53.86% are mirrored on bigquery-public-data.github_repos.contents.

How do you refresh the data? Do you use a selected list of projects or do you refresh the list each time?

The pipeline looks periodically for new projects. Old projects that change their license to a valid open source one (according to the API) might be missed and there's an option to add them manually.


Articles

Misc (to be organized)

Bonus: pypi top Python breakthrough install 2018/02 vs 2017/01

To be blogged...

SELECT file.project, COUNT(*) pypi_201802_c, ANY_VALUE(b.rn_201701) rn_201701
FROM `the-psf.pypi.downloads201802*` a
LEFT JOIN (SELECT project, ROW_NUMBER() OVER(ORDER BY pypi_c DESC) rn_201701 FROM (
  SELECT file.project, COUNT(*) pypi_c
  FROM `the-psf.pypi.downloads201701*` 
  GROUP BY 1 ORDER BY 2 DESC
  LIMIT 10000
)) b
ON a.file.project=b.project 
WHERE b.rn_201701 IS null OR b.rn_201701>500
GROUP BY 1 ORDER BY 2 DESC
LIMIT 300

Disclaimer

This is not an official Google product (experimental or otherwise), it is just code that happens to be owned by Google.

analyzing_github's People

Contributors

fhoffa avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.