alcampopiano / nbestimate

This project forked from parente/nbestimate

Estimate of Public Jupyter Notebooks on GitHub

License: MIT License

Jupyter Notebook 94.87% Python 5.13%

nbestimate's Introduction

Estimate of Public Jupyter Notebooks on GitHub

Data Collection History

  • Late-2014 to mid-2016: I wrote a script that scrapes the GitHub web search UI for the hit count, appends it to a CSV, executes a notebook, and stores the results in a gist at https://gist.github.com/parente/facb555dfbae28e817e0. I scheduled the script to run daily.
  • Mid-2016 to late-2016: The GitHub web search UI started requiring authentication to see global search results. I stopped collecting data.
  • Late-2016 to early-2019: I rewrote the process to include a human-in-the-loop who entered the hit count after viewing the search results page. I moved the CSV, notebook, and scripts to this repo, and sporadically ran the script.
  • Early-2019: I found out that the GitHub search API now supports global search. I automated the entire collection process again and set it to run on TravisCI on a daily schedule.
  • December 2020: GitHub changed its code search index to exclude repositories with no activity in the past year. The ipynb search result count dropped from nearly 10 million to 4.5 million ipynb files, stayed there for a day or so, and then began climbing again from that new origin.
  • June 2021: I started collecting data again but disabled the notebook showing the historical and predicted counts.
  • July 2021: I revived the notebook showing the historical counts but kept prediction disabled.
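
The automated collection step described above (ask the search API for a hit count, then append it to a CSV) can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the query string, the CSV column names, and every function name here are assumptions, and the search endpoint's authentication and scoping rules may differ from what is shown.

```python
import csv
import json
import urllib.parse
import urllib.request
from datetime import date
from pathlib import Path

SEARCH_URL = "https://api.github.com/search/code"
# Assumption: the project's real query is not given in this text.
QUERY = "extension:ipynb nbformat_minor"


def build_request(token: str, query: str = QUERY) -> urllib.request.Request:
    """Build an authenticated search request; only the count is needed, so per_page=1."""
    params = urllib.parse.urlencode({"q": query, "per_page": 1})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


def parse_total_count(body: str) -> int:
    """Extract the hit count from a search response body."""
    return int(json.loads(body)["total_count"])


def fetch_hit_count(token: str) -> int:
    """Perform the network call (not exercised in the usage note below)."""
    with urllib.request.urlopen(build_request(token)) as resp:
        return parse_total_count(resp.read().decode("utf-8"))


def append_count(csv_path: Path, day: date, hits: int) -> None:
    """Append one (date, hits) row, writing a header first if the file is new."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "hits"])  # hypothetical column names
        writer.writerow([day.isoformat(), hits])
```

A daily scheduled job could then call `append_count(Path("counts.csv"), date.today(), fetch_hit_count(token))`, where `counts.csv` and `token` are placeholders. Keeping the parsing and the CSV write separate from the network call makes both easy to check without hitting the API.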

Assumptions

  • That the search query hits are less than or equal to the total number of *.ipynb files on GitHub.
  • That the result is not inflated due to GitHub forks.
    • Evidence: We do not see the tutorial notebooks from the ipython/ipython GitHub repository duplicated in the search results because of the 2,000+ forks of the ipython/ipython repo.
  • That the result is inflated a tiny bit by manually created duplicates of notebooks.
    • Evidence: Some people seem to download their favorite notebooks and then upload them into their own git repositories for safe keeping.

nbestimate's People

Contributors

dependabot[bot], ericdill, parente
