
github-data-2014's Introduction

GitHub Data Challenge 2014 Entry

Theme: Quest for True Names

Result: http://euske.github.io/github-data-2014/index.html

Challenge announcement: https://github.com/blog/1864-third-annual-github-data-challenge

Key Findings

  • What are typical words used for variable names and function names? Answer: we got a list.
  • Are they different in different languages? Answer: Yes, they are!
  • We can assume nouns are commonly used for variables, and verbs for functions. Is there an interesting connection between them? Answer: Yes, there is!
  • Result page: http://euske.github.io/github-data-2014/index.html

Method

  • Examine source code in three major languages: C, Java and Python.
  • List the names for variables, functions (methods) and types (classes).
  • Count the common words in each list, and see if any interesting statistics emerge.
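The counting step above can be sketched roughly as follows. This is only an illustration, not the code used for the challenge entry: the helpers `split_name` and `count_words` are hypothetical, and the rule for splitting a name into words (snake_case and camelCase boundaries, digits dropped) is an assumption.

```python
import re
from collections import Counter

# Assumed splitting rule: an all-caps run (e.g. "HTTP") or a capitalized /
# lowercase run (e.g. "Server", "file"). Digits and underscores are dropped.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

def split_name(name):
    """Split an identifier into lowercase words (hypothetical helper)."""
    return [w.lower() for w in _WORD.findall(name)]

def count_words(names):
    """Count how often each word occurs across a list of identifiers."""
    counts = Counter()
    for name in names:
        counts.update(split_name(name))
    return counts

if __name__ == "__main__":
    names = ["file_name", "getFileName", "HTTPServer"]
    for word, n in count_words(names).most_common():
        print(word, n)
```

Running `count_words` separately on the variable, function and type lists gives one frequency table per list, which can then be compared across languages.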

Data We Used

  • For each language, pick the top 100 repositories (by number of stars) that are labelled with that language. (via GitHub Search API, cf. https://developer.github.com/v3/search/) [Obtained on Aug. 8, 2014]
  • List all the files (caveat: files with non-ASCII names were excluded):
    • C: 110,705 files (including header files)
    • Java: 94,635 files
    • Python: 33,710 files
  • For each language, randomly pick files that have a reasonable file size (1KB-100KB):
    • C: 7,381 files (80MB in total)
    • Java: 7,764 files (53MB in total)
    • Python: 5,872 files (56MB in total)
  • To mitigate skew in the data, we limited the number of files taken from each repository to at most 100.
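A minimal sketch of the sampling rules above, under stated assumptions: the function `sample_files` and its `(repo, path, size)` input format are hypothetical, "1KB-100KB" is read as 1,000 to 100,000 bytes, and the order in which the filters were actually applied is not documented here.

```python
import random
from collections import defaultdict

MIN_SIZE = 1_000      # assumed meaning of "1KB"
MAX_SIZE = 100_000    # assumed meaning of "100KB"
MAX_PER_REPO = 100    # per-repository cap described above

def sample_files(files, seed=0):
    """Pick files at random, at most MAX_PER_REPO per repository.

    files: iterable of (repo, path, size_in_bytes) tuples (hypothetical format).
    """
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for repo, path, size in files:
        # Keep only ASCII filenames within the size window.
        if MIN_SIZE <= size <= MAX_SIZE and path.isascii():
            by_repo[repo].append(path)
    picked = []
    for repo in sorted(by_repo):
        paths = by_repo[repo]
        k = min(MAX_PER_REPO, len(paths))
        picked.extend((repo, p) for p in rng.sample(paths, k))
    return picked
```

With 100 repositories per language and a cap of 100 files each, a scheme like this can yield at most 10,000 files per language, which is in line with the totals listed above.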

Language Parsing

  • We used ANTLR4 for C and Java (cf. http://antlr.org), and ast module for Python (cf. https://docs.python.org/2/library/ast.html).
  • C Caveats: ANTLR4 cannot handle preprocessor directives, so we stripped out #defines and #includes from the code. After all, we're interested in source code as written for human readers, not for compilers. But this left a certain amount of syntactically invalid C code. (e.g. int func(void *p, EXTRA_ARGS); )
  • Python Caveats: The files are a mixture of Python 2 and 3 code. The compiler.ast module handles both pretty well.
  • The following names were excluded: one-letter names, "assert" as a function name, and "self" as a variable name (in Python).
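For the Python side, name extraction with the exclusions above might look like the sketch below. It is an illustration only: it uses the standard `ast` module of Python 3 (so Python 2-only syntax raises `SyntaxError`), the function `collect_names` is hypothetical, and it assumes that "function names" means definitions rather than call sites.

```python
import ast

def collect_names(source):
    """Collect variable, function and class names from Python 3 source."""
    tree = ast.parse(source)
    variables, functions, types = [], [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node.name)
        elif isinstance(node, ast.ClassDef):
            types.append(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            # A name being assigned to, e.g. "result = ..."
            variables.append(node.id)
        elif isinstance(node, ast.arg):
            # A function parameter
            variables.append(node.arg)

    def keep(name):
        # Exclusion from the list above: drop one-letter names.
        return len(name) > 1

    return {
        "variables": [n for n in variables if keep(n) and n != "self"],
        "functions": [n for n in functions if keep(n) and n != "assert"],
        "types": [n for n in types if keep(n)],
    }

if __name__ == "__main__":
    code = (
        "class Parser:\n"
        "    def parse(self, text):\n"
        "        result = text.strip()\n"
        "        x = 1\n"
        "        return result\n"
    )
    print(collect_names(code))
```

The "assert" filter is kept only for symmetry with the exclusion list; in Python, `assert` is a keyword and cannot be a function name, so it presumably matters for C and Java.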

github-data-2014's People

Contributors

euske
