madnight / githut

Github Language Statistics

Home Page: https://madnight.github.io/githut

License: GNU Affero General Public License v3.0

JavaScript 70.97% HTML 29.03%
github-language-statistics languages functional-reactive-programming serverless react programming-languages github-pages-website bigquery sql-query dataset

githut's Introduction

GitHub Language Statistics

Data Generation

Languages

Get the top list of languages on GitHub

-- BigQuery legacy SQL: count how often each language appears across repositories
SELECT language.name, COUNT(language.name) AS count
FROM [bigquery-public-data:github_repos.languages]
GROUP BY language.name
ORDER BY count DESC

First 10 of 322 results

{"language_name":"JavaScript","count":"1006022"}
{"language_name":"CSS","count":"745573"}
{"language_name":"HTML","count":"663315"}
{"language_name":"Shell","count":"593461"}
{"language_name":"Python","count":"492715"}
{"language_name":"Ruby","count":"365413"}
{"language_name":"Java","count":"340622"}
{"language_name":"PHP","count":"328907"}
{"language_name":"C","count":"286272"}
{"language_name":"C++","count":"267552"}
...

Licenses

Get the top list of licenses on GitHub

-- BigQuery legacy SQL: count how often each license appears across repositories
SELECT license, COUNT(license) AS count
FROM [bigquery-public-data:github_repos.licenses]
GROUP BY license
ORDER BY count DESC

Full result

{"license":"mit","count":"1551711"}
{"license":"apache-2.0","count":"455316"}
{"license":"gpl-2.0","count":"376453"}
{"license":"gpl-3.0","count":"284761"}
{"license":"bsd-3-clause","count":"161041"}
{"license":"bsd-2-clause","count":"57412"}
{"license":"unlicense","count":"43899"}
{"license":"lgpl-3.0","count":"38213"}
{"license":"agpl-3.0","count":"38034"}
{"license":"cc0-1.0","count":"28600"}
{"license":"epl-1.0","count":"24074"}
{"license":"lgpl-2.1","count":"23872"}
{"license":"isc","count":"17690"}
{"license":"mpl-2.0","count":"17421"}
{"license":"artistic-2.0","count":"9413"}

Pull Requests

Get the number of pull requests per language, year, and quarter

SELECT language as name, year, quarter, count FROM (
  SELECT * FROM (
    SELECT lang as language, y as year, q as quarter, type,
      COUNT(*) as count FROM (
      SELECT a.type type, b.lang lang, a.y y, a.q q FROM (
        SELECT type, actor.login, YEAR(created_at) as y, QUARTER(created_at) as q,
          STRING(REGEXP_REPLACE(repo.url, r'https:\/\/github\.com\/|https:\/\/api\.github\.com\/repos\/', '')) as name
        FROM [githubarchive:month.201901]
        WHERE NOT LOWER(actor.login) LIKE "%bot%") a
      JOIN (
        SELECT repo_name as name, lang FROM (
          SELECT * FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY repo_name ORDER BY lang) as num FROM (
              SELECT repo_name, FIRST_VALUE(language.name) OVER (
                partition by repo_name order by language.bytes DESC) AS lang
              FROM [bigquery-public-data:github_repos.languages]))
          WHERE num = 1 order by repo_name)
        WHERE lang != 'null') b ON a.name = b.name)
    GROUP by type, language, year, quarter
    order by year, quarter, count DESC)
  WHERE count >= 100)
WHERE type = 'PullRequestEvent'

Manual

Google's BigQuery is free for public datasets such as GitHub, Reddit, or Stack Overflow, up to a limit of 1000 GB of query volume per month. Each of the queries above uses about 50-200 MB of query volume. The public dataset for GitHub is available here: https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=samples&t=github_nested&page=table
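
As a rough back-of-the-envelope check of these figures, the sketch below estimates how many queries of the quoted size fit into the monthly free volume. It assumes decimal units (1 GB = 1000 MB), which is an assumption for illustration and not a statement about how BigQuery meters usage; queriesPerMonth is a hypothetical helper name.

```javascript
// Free monthly query volume quoted above, in GB.
const MONTHLY_QUOTA_GB = 1000;
// Assumption: decimal units (1 GB = 1000 MB).
const MB_PER_GB = 1000;

// Hypothetical helper: number of queries of a given size (in MB)
// that fit into the monthly quota.
function queriesPerMonth(queryMb) {
  return Math.floor((MONTHLY_QUOTA_GB * MB_PER_GB) / queryMb);
}

console.log(queriesPerMonth(200)); // 5000
console.log(queriesPerMonth(50)); // 20000
```

Under these assumptions the free tier leaves room for several thousand runs per month, so regenerating the statistics every quarter should stay well within the limit.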

URL Schema

madnight.github.io/githut/#/pull_requests/2021/1/Python,Lua,JavaScript
                                 ▲         ▲   ▲        ▲
                                 │         │   │        │
                pull_requests ───┘   year ─┘   │        └─ languages
                pushes                         └─ quarter
                stars
                issues
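
The schema above can be read programmatically. The following is a minimal sketch, not the site's actual routing code: parseGithutHash is a hypothetical helper, and it assumes the four path segments always appear in the order shown in the diagram.

```javascript
// Metrics listed in the URL schema above.
const METRICS = ["pull_requests", "pushes", "stars", "issues"];

// Hypothetical helper: split a GitHut hash route into its parts.
function parseGithutHash(hash) {
  // Drop the leading "#/" and split into metric/year/quarter/languages.
  const [metric, year, quarter, languages = ""] = hash
    .replace(/^#?\/?/, "")
    .split("/");
  if (!METRICS.includes(metric)) {
    throw new Error(`Unknown metric: ${metric}`);
  }
  return {
    metric,
    year: Number(year),
    quarter: Number(quarter),
    // Languages are comma-separated; decode in case they were URL-encoded.
    languages: languages.split(",").filter(Boolean).map(decodeURIComponent),
  };
}

console.log(parseGithutHash("#/pull_requests/2021/1/Python,Lua,JavaScript"));
```

For the example URL this yields the metric pull_requests, year 2021, quarter 1, and the languages Python, Lua, and JavaScript.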

BibTeX

If you wish to cite this project, you may use the following BibTeX entry.

@misc{githuttwo,
  author = {Fabian Beuke},
  title = {GitHut 2.0: GitHub Language Statistics},
  year = {2023},
  note = {GitHub repository},
  howpublished = {\url{https://madnight.github.io/githut/#/}}
}

githut's People

Contributors

blegat, captainwalters, dependabot[bot], jeremylardenois, madnight, mardecode, pheogrammer, razrfalcon

githut's Issues

2023Q3

Can you please update?

More languages

It is great that you revived the GitHub project!
Would it be possible to add more languages than the top 10 on the GitHut website? It is good to have one clean graph with only the top 10, but it would be nice to also have the other ones below.

missing abscissa

Assuming the chart is updated in regular intervals, reading the plot would be greatly facilitated if there were a time line included.

As an illustration, I merged the current plot with the abscissa used by the TIOBE index:

pattern_with_abscissa

Maybe you did not start recording the popularity of the programming languages as early as 2002, which would justify a different limit on the left-hand side. Then maybe you can use a finer scale, e.g. large ticks annotated with the year and shorter, unannotated ones for the quarters/months in between, just like the scale on a thermometer.

question about table "github_repos.languages"

Hi, may I ask a question about the BigQuery public datasets?
I checked the number of distinct "repo.name" values of PullRequest events in table "githubarchive.year.2020" and got 8,752,017, which is much larger than the number of rows in table "bigquery-public-data.github_repos", 3,331,762.
Did I get something wrong, or is the table "github_repos.languages" incomplete?

Languages in graph not synced with table data

The languages shown in the graph are not the same as the top 10 from the table. For example, on the Pull Requests graph, Go (6) and TypeScript (10) are not included in the graph whereas Scala (12) and Objective-C (16) are included in the 10 languages in the graph.

Counting extensions and file sizes of project folders on GitHub to recommend changes

Over the past several years I have been studying groups on GitHub, as part of the Internet Foundation's studies of global communities on the Internet over the past 23 years.

Much of the human cost of learning these various projects, for any group size, is dealing with the many and different formats. A mature project like LLVM-Project (see attached) has what I see as the normal human capabilities curve (where the file sizes are log normal with a peak around 1000 characters). And a long tail of exceptions for filetypes that are seldom used, but often critical.

I see you are using a database that I have not seen or tried yet. I have been downloading source code folders to my computer (I am also reading, parsing and analyzing the contents of the files). I would like to put these kinds of tools and analyses where others can try them. I mostly use JavaScript (with a localhost for access to hardware and file system services).

Would it be too much to get a directory of all files in all projects on GitHub so I can analyze the maturity and character of them all? I can tell a lot about the learning curve and costs involved for a project just by looking at the source code folder and repositories.

I want to recommend different practices for GitHub and these kinds of projects on the internet. The current (global) practices are wasting too much human time and delaying responses to things like "covid", "global climate change", "deforestation", "online education" and others. I have investigated about 20,000 global communities to see why they stall or die or simply take years to do something that can be done in a few days with the proper tools.

I talked about some of the related issues in a video I made yesterday and mentioned where this kind of analysis might fit into the larger picture.

New Video: Energy Office of Science, PNNL Article, Climate Model, Sharing
https://theinternetfoundation.net/?p=1347

Richard K Collins, Director, The Internet Foundation

Counts of Extensions and Log FileSizes for LLVM-Project
Counts Plots of Sourcecode folder extensions and slzes Exts Log10ths.xlsx

Change default selected languages

Go and TypeScript have been in the top 7 for the last 5 years, while C hasn't been in it for the last 5 years, not to mention Ruby, which is also selected by default.

Maybe it would be better to use the current top 7 (2022/Q4) as the default, namely:

  • Python
  • Java
  • TypeScript
  • C++
  • JavaScript
  • Go
  • PHP

Button should be aware of user language selection

Keep the user's language selection for the chart when switching between metrics (stars, pull requests, pushes...). Currently, every time you switch metric, the language selection reverts to the default (the most common languages per metric).

Feature request: hash value in URL to select mode and data-range

It would be great to be able to link directly to a specific time frame and mode. I think the best way to do this is in the hash value (i.e. "#year=2017,quarter=Q4,mode=pushrequests"). The advantage of the hash value over a query parameter is that the page's JavaScript can update the hash value (using the history API) without requiring the page to reload.
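
A minimal sketch of the key=value hash format proposed in this request might look as follows. encodeHash and decodeHash are hypothetical helper names, and this is not the scheme the site ended up using (see the URL Schema section of the README).

```javascript
// Hypothetical helper: serialize a flat state object into "#k=v,k=v".
function encodeHash(state) {
  const pairs = Object.entries(state).map(
    ([key, value]) => `${key}=${encodeURIComponent(value)}`
  );
  return "#" + pairs.join(",");
}

// Hypothetical helper: parse "#k=v,k=v" back into an object of strings.
function decodeHash(hash) {
  const state = {};
  for (const pair of hash.replace(/^#/, "").split(",")) {
    if (!pair) continue; // skip empty segments such as a bare "#"
    const [key, value = ""] = pair.split("=");
    state[key] = decodeURIComponent(value);
  }
  return state;
}

console.log(encodeHash({ year: 2017, quarter: "Q4", mode: "pushrequests" }));
console.log(decodeHash("#year=2017,quarter=Q4,mode=pushrequests"));
```

In a browser, the encoded string could then be passed to history.replaceState(null, "", encodeHash(state)) to update the address bar without a reload. Note that decoded values are strings, so a year would need converting with Number() before use.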

Option to presort/preselect by current rather than start time

Language preselection and sorting seems to be by the start time of the data time range. An option to sort/select by most recent time would also be nice.

Well, even nicer might be to sort/select by any chosen quarter. If it doesn't make the interface too messy.
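
One way to sketch the requested behaviour is to rank languages by their count in an arbitrary quarter and preselect the top entries. The row shape { name, year, quarter, count } mirrors the columns returned by the pull request query in the README; topLanguages is a hypothetical helper, and the sample rows are made-up values for illustration only.

```javascript
// Hypothetical helper: names of the top-n languages in a given quarter.
function topLanguages(rows, year, quarter, n = 10) {
  return rows
    .filter((r) => Number(r.year) === year && Number(r.quarter) === quarter)
    // Counts may arrive as strings (as in the JSON results), so convert first.
    .sort((a, b) => Number(b.count) - Number(a.count))
    .slice(0, n)
    .map((r) => r.name);
}

// Made-up sample data, for illustration only.
const sampleRows = [
  { name: "Ruby", year: 2014, quarter: 1, count: "500" },
  { name: "Go", year: 2022, quarter: 4, count: "100" },
  { name: "Python", year: 2022, quarter: 4, count: "300" },
  { name: "Java", year: 2022, quarter: 4, count: "200" },
];

console.log(topLanguages(sampleRows, 2022, 4, 2)); // [ 'Python', 'Java' ]
```

Passing the most recent quarter gives the "sort by current time" option, while passing the first quarter of the range reproduces the behaviour described above.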

Save selected languages

This doesn't need to be in localStorage, just in the current window, so that the selected languages are kept when switching to views like stars.

display more attributes?

I'd like to be able to display more attributes, including number of authors, projects, files, lines of code, etc. Please consider extending the site in this direction.

Add Ada

Could perhaps Ada be added to the list of languages? It is currently #23 on TIOBE and #15 on pypl.

Page responsiveness

When viewed on a mobile phone, the pages don't fit well.

They need some padding on the left and right so that a reader can see all the words and read with ease.

Run locally

Is it possible to run GitHut locally? When I try, it complains that it does not have the files toplang-2016-12.json and LangTable.styl

query.js is biased towards more verbose languages (winner-takes-all)

The sub-query selects each GitHub repository's name and identifies the primary programming language used in that repo based on the highest byte count in the repository.

githut/scripts/query.js

Lines 65 to 67 in 140f017

SELECT repo_name, FIRST_VALUE(language.name) OVER (
partition by repo_name order by language.bytes DESC) AS lang
FROM [bigquery-public-data:github_repos.languages]))

The current query adopts a "winner-takes-all" approach, which disadvantages less verbose programming languages by selecting only the language with the largest byte count in each repository. A less biased approach would be to list all programming languages present in each repository, rather than just the predominant one. The languages could still be weighted by their respective byte counts, so as not to overemphasize small snippets.
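
The difference between the two attribution schemes can be sketched outside of SQL. This is an illustration of the idea in this issue, not a drop-in replacement for scripts/query.js: the input shape (a list of repositories, each with a non-empty languages array and an events count) and all function names are assumptions.

```javascript
// Winner-takes-all: the language with the largest byte count gets everything.
// Assumes a non-empty languages array.
function primaryLanguage(languages) {
  return languages.reduce((best, l) => (l.bytes > best.bytes ? l : best)).name;
}

// Byte-weighted: each language gets a share proportional to its bytes.
function languageWeights(languages) {
  const total = languages.reduce((sum, l) => sum + l.bytes, 0);
  const weights = {};
  if (total === 0) return weights;
  for (const l of languages) weights[l.name] = l.bytes / total;
  return weights;
}

// Attribute each repository's events to languages under either scheme.
function attributeEvents(repos, weighted) {
  const totals = {};
  for (const repo of repos) {
    const weights = weighted
      ? languageWeights(repo.languages)
      : { [primaryLanguage(repo.languages)]: 1 };
    for (const [name, weight] of Object.entries(weights)) {
      totals[name] = (totals[name] || 0) + weight * repo.events;
    }
  }
  return totals;
}

// Made-up example: one repository with 8 pull requests.
const exampleRepos = [
  {
    languages: [
      { name: "Java", bytes: 750 },
      { name: "Kotlin", bytes: 250 },
    ],
    events: 8,
  },
];

console.log(attributeEvents(exampleRepos, false)); // { Java: 8 }
console.log(attributeEvents(exampleRepos, true)); // { Java: 6, Kotlin: 2 }
```

Byte weighting still favours verbose languages to some degree, since the shares are proportional to bytes, but the smaller languages in a repository no longer drop to zero.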

2021/Q1 data is probably incorrect

  1. In pushes, stars and issues, all the percentage changes are negative.
  2. Some of the trends are incorrect, for example Elixir in pull requests or Groovy in pushes.

Show absolute numbers?

Hi, thanks for this.

If I may, I think showing only relative weight of each language isn't as interesting as the absolute counts could be.

For example, right now you see JavaScript "declining" since 2017-ish, but that doesn't tell us whether there is less JavaScript-related activity on GitHub, or whether JavaScript activity is still growing, just more slowly than other languages.

I'm really bad at React, but from what I managed to figure out, all these percentages are computed from raw absolute counts, so the data is there already.
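
To make the point concrete, here is a small sketch with made-up numbers (not GitHut data) in which a language's absolute count grows while its relative share shrinks. The shares helper is hypothetical.

```javascript
// Hypothetical helper: convert absolute counts into relative shares.
function shares(counts) {
  const total = Object.values(counts).reduce((sum, c) => sum + c, 0);
  const result = {};
  for (const [name, count] of Object.entries(counts)) {
    result[name] = count / total;
  }
  return result;
}

// Made-up counts for two periods, for illustration only.
const earlier = { JavaScript: 400, Other: 600 };
const later = { JavaScript: 500, Other: 1500 };

console.log(shares(earlier).JavaScript); // 0.4
console.log(shares(later).JavaScript); // 0.25
```

Here JavaScript grows from 400 to 500 events, yet its share falls from 40% to 25% because everything else grew faster, which is exactly the ambiguity that showing absolute numbers would resolve.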

JavaScript data accurate?

In 2021, JavaScript pushes and PRs dropped dramatically. The scale of the change (pushes down from 40% to 10%) makes me think that this could be a data issue rather than a trend. Any explanation?

Add "number of unique committers" metric

This is probably a bit more resource-intensive to compute than the other metrics, but I think it would be a very interesting additional perspective. You could do the same for issue comments, PRs, and stars, but I don't know if it's worth the trouble.

I tried to write the corresponding query, but got confused pretty quickly :/
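
The core of such a metric is a distinct count of actors per language. The sketch below shows that logic on in-memory data; it is not a BigQuery query, and the event shape { language, actor } and the function name are assumptions for illustration.

```javascript
// Hypothetical helper: number of distinct actors per language.
function uniqueActorsPerLanguage(events) {
  const actorsByLanguage = new Map();
  for (const { language, actor } of events) {
    if (!actorsByLanguage.has(language)) {
      actorsByLanguage.set(language, new Set());
    }
    // A Set keeps each login once, however many events the actor produced.
    actorsByLanguage.get(language).add(actor);
  }
  const counts = {};
  for (const [language, actors] of actorsByLanguage) {
    counts[language] = actors.size;
  }
  return counts;
}

// Made-up push events, for illustration only.
const pushEvents = [
  { language: "Rust", actor: "alice" },
  { language: "Rust", actor: "alice" },
  { language: "Rust", actor: "bob" },
  { language: "Go", actor: "alice" },
];

console.log(uniqueActorsPerLanguage(pushEvents)); // { Rust: 2, Go: 1 }
```

The same grouping would apply to issue comments, PRs, or stars by feeding in events of the corresponding type.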

When will 22Q1 come?

When I look at the statistics on madnight.github.io, the latest quarter I see is 2021/4. 22Q1 is already behind us, but it's still not there. When will 22Q1 come? Thanks.
