
Finding the Most Influential CRAN Contributor using Neo4j and Libraries.io Open Data

For the first part of this project, concerning PyPI, see here

As a graduate student in statistics, I used R a lot. In fact, entire semester-long courses were dedicated to learning how to harness some of R's single-purpose (read: esoteric) packages for statistical modeling.

But when it came time for my capstone project, the data manipulation was daunting... until I discovered dplyr. Moreover, I was completely taken by the paradigm outlined by dplyr's author, Hadley Wickham, in his Split-Apply-Combine paper, which introduced what would become the guiding principle of the Tidyverse.

Since then, the Tidyverse has exploded in popularity, becoming the de facto standard for data manipulation in R, and Hadley Wickham's veneration among R users has only grown. For good reason, too: Python and R are now the two most-used languages in data science.

So, the motivation for this project is akin to that of the aforementioned PyPI contributors investigation: is Hadley Wickham the most influential R contributor? To answer this question, we will analyze the R packages uploaded to CRAN; specifically:

  • The R packages themselves
  • What packages depend on what other packages
  • Who contributes to what packages

Using these items, we will apply the degree centrality algorithm from graph theory to find the most influential node in the graph of R packages, dependencies, and contributors.

Summary of Results

After constructing the graph (which included imputing contributor data for more than 2/3 of the R packages in the Libraries.io Open Data dataset) and running the degree centrality algorithm, Hadley Wickham is indeed the most influential R contributor according to the data from Libraries.io and CRAN. Below are the 10 Contributors with the highest degree centrality scores for this graph:

| Contributor | GitHub login | Degree Centrality Score |
| --- | --- | --- |
| Hadley Wickham | hadley | 244 751 |
| Jim Hester | jimhester | 170 123 |
| Kirill Müller | krlmlr | 159 577 |
| Jennifer (Jenny) Bryan | jennybc | 121 543 |
| Mara Averick | batpigandme | 121 253 |
| Gábor Csárdi | gaborcsardi | 101 351 |
| Hiroaki Yutani | yutannihilation | 100 625 |
| Christophe Dervieux | cderv | 98 078 |
| Jeroen Ooms | jeroen | 82 055 |
| Craig Citro | craigcitro | 71 207 |

For insight into how this result was arrived at, read on.

The Approach

Libraries.io Open Data

CRAN is the repository for R packages that developers know and love. Analogously to CRAN, other programming languages have their respective package managers, such as PyPI for Python. As a natural exercise in abstraction, Libraries.io is a meta-repository for package managers. From their website:

Libraries.io gathers data from 36 package managers and 3 source code repositories. We track over 2.7m unique open source packages, 33m repositories and 235m interdependencies between [sic] them. This gives Libraries.io a unique understanding of open source software. An understanding that we want to share with you.

Using Open Data Snapshot to Save API Calls

Libraries.io has an easy-to-use API, but given that CRAN has 15,000+ packages in the Open Data dataset, the number of API calls to various endpoints needed to collate the necessary data is not appealing (also, Libraries.io rate-limits to 60 requests per minute). Fortunately, Jeremy Katz maintains snapshots of the Libraries.io Open Data source on Zenodo. The most recent version is a snapshot from 22 December 2018, and contains the following CSV files:

  1. Projects (3 333 927 rows)
  2. Versions (16 147 579 rows)
  3. Tags (52 506 651 rows)
  4. Dependencies (105 811 885 rows)
  5. Repositories (34 061 561 rows)
  6. Repository dependencies (279 861 607 rows)
  7. Projects with Related Repository Fields (3 343 749 rows)

More information about these CSVs is in the README file included in the Open Data tar.gz, copied here. There is a substantial reduction in the data when subsetting these CSVs just to the data pertaining to CRAN; find the code used to subset them and the size comparisons here.

WARNING: The tar.gz file that contains these data is 13 GB itself, and once downloaded takes quite a while to untar; once uncompressed, the data take up 64 GB on disk!

untar time

Graph Databases, Starring Neo4j

Because of the interconnected nature of software packages (dependencies, versions, contributors, etc.), graph databases and graph theory are the ideal tools for finding the most influential "item" in that web of data. Neo4j is the most popular graph database according to DB-Engines, and is the one that we will use for the analysis. Part of the reason for its popularity is that its query language, Cypher, is expressive and simple:

example graph
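The graph in the screenshot could be created with a single Cypher statement along these lines (a sketch using the names shown above):

CREATE (:Person {name: 'Jane Doe'})-[:KNOWS]->(:Person {name: 'John Smith'});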

Terminology that will be useful going forward:

  • Jane Doe and John Smith are nodes (equivalently: vertices)
  • The above two nodes have label Person, with property name
  • The line that connects the nodes is a relationship (equivalently: edge)
  • The above relationship is of type KNOWS
  • KNOWS, and all Neo4j relationships, are directed; i.e. Jane Doe knows John Smith, but not the converse

On macOS, the easiest way to use Neo4j is via the Neo4j Desktop app, available as the neo4j cask on Homebrew. Neo4j Desktop is a great IDE for Neo4j, allowing simple installation of different versions of Neo4j as well as of optional plugins (e.g. APOC) that are really the best way to interact with the graph database. Moreover, the screenshot above is taken from the Neo4j Browser, a nice interactive database interface and query-result visualization tool.

Neo4j Configuration

Before we dive into the data model and how the data are loaded: Neo4j's default configuration isn't going to cut it for the plugins and approach that we are going to use, so a customized configuration file, corresponding to Neo4j version 3.5.7, can be found here.
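The settings that matter most for this workload are the memory limits and procedure whitelisting. A sketch of the relevant neo4j.conf entries (illustrative values, not the exact contents of the linked file):

# Give the JVM heap and page cache enough memory for bulk CSV import
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=8g

# Allow the APOC and Graph Algorithms procedures to run
dbms.security.procedures.unrestricted=apoc.*,algo.*

# Let APOC procedures (e.g. apoc.load.csv) read local files
apoc.import.file.enabled=true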

Making a Graph of Libraries.io Open Data

Importing from CSV is the most common way to populate a Neo4j graph, and is how we will proceed, given that the Open Data snapshot untars into CSV files. However, a data model is needed first: a specification of which entities will be represented as labeled nodes with properties, and of the relationships among them. Moreover, some Neo4j settings have to be customized for proper and timely import from CSV.

Data Model

Basically, when translating a data paradigm into graph data form, the nouns become nodes and how the nouns interact (the verbs) become the relationships. In the case of the Libraries.io data, the following is the data model:

data model

So, a Platform HOSTS a Project, which IS_WRITTEN_IN a Language and HAS_VERSION a Version. Moreover, a Project DEPENDS_ON other Projects, and Contributors CONTRIBUTE_TO Projects. With respect to Versions, the diagram communicates a limitation of the Libraries.io Open Data: Project nodes are linked in the dependencies CSV to other Project nodes, despite the fact that different versions of a project depend on varying versions of other projects. Take, for example, this row from the dependencies CSV:

| ID | Project_Name | Project_ID | Version_Number | Version_ID | Dependency_Name | Dependency_Kind | Optional_Dependency | Dependency_Requirements | Dependency_Project_ID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 29033435 | archivist | 687281 | 1.0 | 7326353 | RCurl | imports | false | * | 688429 |

That is, version 1.0 of Project archivist depends on Project RCurl. There is no demarcation of which version of RCurl version 1.0 of archivist depends on, other than the requirement *, which forces the modeling decision that Projects depend on other Projects, not on Versions.

Contributors, the Missing Data

It is impossible to answer the question of which contributor to CRAN is most influential without, obviously, data on contributors. However, the Open Data dataset lacks this information, so connecting it with contributor data requires calls to the Libraries.io API. As mentioned above, there is a rate limit of 60 requests per minute. If there are

$ mlr --icsv --opprint filter '$Platform == "CRAN"' then uniq -n -g "ID" projects-1.4.0-2018-12-22.csv
 14455

R-language CRAN packages, each of which requires one request to the Contributors endpoint of the Libraries.io API, then at "maximum velocity" it will require

14 455 requests ÷ 60 requests per minute ≈ 241 minutes ≈ 4 hours

to get contributor data for each project.

Following the example of this blog, it is possible to use the aforementioned APOC utilities for Neo4j to load data from web APIs, but I found that approach unwieldy and difficult to monitor. So, I used Python's requests and sqlite3 packages to send requests to the endpoint and store the responses, in a long-running Bash process (code for this here).
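For illustration only, the APOC route might look roughly like the following per-project call. The contributors endpoint is the one mentioned above, but the response field names (uuid, name) and the $project_name and $api_key parameters are assumptions of this sketch:

CALL apoc.load.jsonParams(
    'https://libraries.io/api/CRAN/' + $project_name + '/contributors?api_key=' + $api_key,
    {}, null
) YIELD value
// apoc.load.jsonParams streams each element of the response array as one row
MERGE (c:Contributor {uuid: value.uuid})
SET c.name = value.name
WITH c
MATCH (p:Project {name: $project_name})
MERGE (c)-[:CONTRIBUTES_TO]->(p);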

Database Constraints

Analogously to the unique constraint in a relational database, Neo4j has a uniqueness constraint, which is very useful for constraining the number of nodes created. Basically, it isn't useful, and hurts performance, to have two different nodes representing the platform PyPI (or the language Python, or the project pipenv, ...) because each is a unique entity. Moreover, uniqueness constraints enable more performant queries. The following Cypher commands add uniqueness constraints on the properties of the nodes that should be unique in this data paradigm:

CREATE CONSTRAINT ON (platform:Platform) ASSERT platform.name IS UNIQUE;
CREATE CONSTRAINT ON (project:Project) ASSERT project.name IS UNIQUE;
CREATE CONSTRAINT ON (project:Project) ASSERT project.ID IS UNIQUE;
CREATE CONSTRAINT ON (version:Version) ASSERT version.ID IS UNIQUE;
CREATE CONSTRAINT ON (language:Language) ASSERT language.name IS UNIQUE;
CREATE CONSTRAINT ON (contributor:Contributor) ASSERT contributor.uuid IS UNIQUE;
CREATE INDEX ON :Contributor(name);

All of the ID properties come from the first column of the CSVs and are ostensibly primary-key values. The name property of Project nodes is also constrained to be unique so that queries matching nodes on the property name (the way that we think of them) are performant as well.

N.b. if the graph from the first part of this analysis, concerning PyPI packages, is already populated, it will be necessary to drop the uniqueness constraint on Project names to avoid collisions. This is acceptable: there will still be distinct ID values for the projects, and since the Project name is the natural property to use for querying, a Neo4j index will do the trick:

DROP CONSTRAINT ON (p:Project) ASSERT p.name IS UNIQUE;
CREATE INDEX ON :Project(name);

Populating the Graph

With the constraints, plugins, and configuration of Neo4j in place, the Libraries.io Open Data dataset can be loaded. Loading CSVs into Neo4j can be done with the default LOAD CSV command, but the APOC plugin provides an improved version, apoc.load.csv, which iterates over the CSV rows as map objects instead of arrays; when coupled with periodic execution (a.k.a. batching), loading CSVs can be done in parallel as well.
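As a concrete sketch of that batched pattern (the production queries live in the Cypher files linked below; the batch size, and the assumption that the subset CSV keeps the Open Data column names ID and Name, are mine):

CALL apoc.periodic.iterate(
    "CALL apoc.load.csv('file:///cran_projects.csv') YIELD map RETURN map",
    "MERGE (p:Project {ID: toInteger(map.ID)}) SET p.name = map.Name",
    // parallel: true is possible when batches cannot contend for the same nodes
    {batchSize: 10000, parallel: false}
);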

Creating R and CRAN Nodes

As all projects that are to be loaded are hosted on CRAN, the first node to be created in the graph is the CRAN Platform node itself:

CREATE (:Platform {name: 'CRAN'});

Not all projects hosted on CRAN are written in R, but those are the focus of this analysis, so we need an R Language node:

CREATE (:Language {name: 'R'});

With these two, we create the first relationship of the graph:

MATCH (p:Platform {name: 'CRAN'})
MATCH (l:Language {name: 'R'})
CREATE (p)-[:HAS_DEFAULT_LANGUAGE]->(l);

Now we can load the rest of the entities in our graph, connecting them to these as appropriate, starting with Projects.

Neo4j's MERGE Operation

The key operation when loading data into Neo4j is the MERGE clause. Using the property specified in the query, MERGE either MATCHes the existing node/relationship with that property or, if it doesn't exist, duly CREATEs it. If the property in the query has a uniqueness constraint, Neo4j can thus iterate over possible duplicates of the "same" node/relationship, creating it only once and "attaching" nodes to the uniquely specified node as it goes.

This is a double-edged sword, though, when creating relationships between unique nodes: if the participating nodes are not specified exactly, MERGEing a relationship between them will create duplicate node(s). This is undesirable from an ontological perspective as well as from a database-efficiency perspective. All this to say that creating unique node-relationship-node entities requires three passes over a CSV: the first to MERGE the first node type, the second to MERGE the second node type, and the third to MATCH node type 1, MATCH node type 2, and MERGE the relationship between them, as the sketch below shows.
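A minimal sketch of the three-pass pattern for the DEPENDS_ON relationship, using the column names from the dependencies CSV row shown earlier (the production queries are in the linked Cypher files):

// Pass 1: MERGE each Version node exactly once
CALL apoc.load.csv('file:///cran_dependencies.csv') YIELD map
MERGE (:Version {ID: toInteger(map.Version_ID)});

// Pass 2: MERGE each depended-upon Project exactly once
CALL apoc.load.csv('file:///cran_dependencies.csv') YIELD map
MERGE (:Project {ID: toInteger(map.Dependency_Project_ID)});

// Pass 3: MATCH both ends exactly, then MERGE only the relationship
CALL apoc.load.csv('file:///cran_dependencies.csv') YIELD map
MATCH (v:Version {ID: toInteger(map.Version_ID)})
MATCH (p:Project {ID: toInteger(map.Dependency_Project_ID)})
MERGE (v)-[:DEPENDS_ON]->(p);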

Lastly, for the same reason as the above, it is necessary to create "base" nodes before creating nodes that "stem" from them. For example, if we had not created the R Language node above (with unique property name), then for every R project MERGEd from the projects CSV, Neo4j would create a new Language node with name 'R' and a relationship between it and that Project node. This duplication can be useful in some data models, but in the interest of parsimony, we will load data in the following order:

  1. Projects
  2. Versions
  3. Dependencies among Projects and Versions
  4. Contributors

Loading Projects

First up are the Project nodes. The source CSV for this node type is cran_projects.csv and the queries are in this file. Neo4j loads the CSV data by following the instructions in the file via the apoc.cypher.runFile procedure; i.e.

CALL apoc.cypher.runFile('/path/to/libraries_io/cypher/projects_apoc.cypher') yield row, result return 0;

The result of this set of queries is that the following portion of our graph is populated:

post-projects_apoc

Loading Versions

Next are the Versions of the Projects. The source CSV for this type of node is cran_versions.csv and the queries are in this file. These queries are run with

CALL apoc.cypher.runFile('/path/to/libraries_io/cypher/versions_apoc.cypher') yield row, result return 0;

The result of this set of queries is that the graph has grown to include the following nodes and relationships:

post-versions_apoc

Loading Dependencies among Projects and Versions

Now that there are Project nodes and Version nodes, it's time to link their dependencies. The source CSV for these data is cran_dependencies.csv and the query is in this file. Because the Projects and Versions already exist, this operation is just one MATCH-MATCH-MERGE query, creating relationships. It is run with

CALL apoc.cypher.runFile('/path/to/libraries_io/cypher/dependencies_apoc.cypher') yield row, result return 0;

Caveat

Although the Libraries.io Open Data dataset contains dependencies among R projects, there are some projects that have no versions listed in it, yet still report imports relationships on their CRAN pages. So, in order to include the impact of these DEPENDS_ON relationships in the degree centrality algorithm, a pseudo Version node was created, with the number property "NONE" and an auto-generated UUID from the apoc.create.uuid function; i.e.

CREATE (v:Version {number: "NONE", ID: apoc.create.uuid()});

Then, the R script found here creates a JSON file using that Version node, attached to Project nodes that have no DEPENDS_ON relationships in the current graph. The JSON file is then loaded into Neo4j using the Cypher query here. That is, using the Neo4j cypher-shell and the Rscript executable from R:

export GRAPHDBPASS=graph_db_pass_here
Rscript --vanilla get_missing_dependencies_from_crandb_api.R > some_file.json && bin/cypher-shell -u neo4j -p "$GRAPHDBPASS"

which opens cypher-shell, into which the aforementioned Cypher query (reading the just-created JSON file) can be passed.
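Assuming, purely for illustration, that the JSON file is an array of objects with project and depends_on names (the actual shape is whatever the R script emits; see the linked query), the load boils down to a MATCH-MATCH-MERGE through the pseudo Version node:

CALL apoc.load.json('file:///some_file.json') YIELD value
MATCH (p:Project {name: value.project})
MATCH (v:Version {number: 'NONE'})
MATCH (d:Project {name: value.depends_on})
MERGE (p)-[:HAS_VERSION]->(v)
MERGE (v)-[:DEPENDS_ON]->(d);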

The result of these operations is that the graph has grown to include the DEPENDS_ON relationship:

post-dependencies_apoc

Loading Contributors

Because the data corresponding to R Project Contributors were retrieved from the Libraries.io API, they are not loaded by running Cypher from a file, but from a Python script, particularly this section.
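Whatever the transport, the per-response Cypher that such a script runs amounts to a parameterized MERGE of Contributors and their CONTRIBUTES_TO relationships; a sketch (the parameter names here are assumptions, not the script's actual ones):

UNWIND $contributors AS row
MERGE (c:Contributor {uuid: row.uuid})
SET c.name = row.name
WITH c
MATCH (p:Project {name: $project_name})
MERGE (c)-[:CONTRIBUTES_TO]->(p);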

Caveat

Unfortunately, that's not the end of the story for the Contributor data: over 70% of the R Projects have no Contributors reported by the Libraries.io API. So, even after the ~15k Projects' Contributors were scraped from the API, more than 10k of those needed Contributor data imputed. To do this, I used the crandb package by one of the Top-10 most influential Contributors, Gábor Csárdi. For each package on CRAN, crandb returns the information from the package's official CRAN page as an R object that is easily parsed. For example, using crandb on the venerable bootstrapping package, boot, gives Contributor information in the form of Author and Maintainer:

> library(crandb)
> crandb::package('boot')
CRAN package boot 1.3-23, 4 months ago
Title: Bootstrap Functions (Originally by Angelo Canty for S)
Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>
Author: Angelo Canty [aut], Brian Ripley [aut, trl, cre] (author of
    parallel support)
# ...

The Maintainer field is always of the form "Maintainer: name <email>", so the name text was extracted and used as the name property of the Contributor node for the Project. The Author field proved to be too unstructured for reliable scraping. This process is in this R file.
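Although the extraction was done in R (linked above), the transformation is equivalent to stripping the trailing <email> portion; for example, in Cypher with APOC:

WITH 'Brian Ripley <ripley@stats.ox.ac.uk>' AS maintainer
RETURN apoc.text.regreplace(maintainer, '\\s*<[^>]*>\\s*$', '') AS name;
// returns "Brian Ripley"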

After executing this process, the graph is now in its final form:

post-merge_contributors

Preliminary Results

On the way to finding the most influential Contributor, it is useful to find the most influential Project. Intuitively, the most influential Project node should be the node with the most (or very many) incoming DEPENDS_ON relationships; however, the degree centrality algorithm is not as simple as counting the number of incoming and outgoing relationships and ordering by descending cardinality (although that is a useful metric for understanding a [sub]graph; see the sketch below). This is because the subgraph that we are considering to understand the influence of Project nodes also contains relationships to Version nodes.
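For orientation, the naive count would be a query along these lines (a sketch for comparison, not what algo.degree computes over the Cypher-projected subgraph below):

MATCH (p:Project)<-[:DEPENDS_ON]-(:Version)
RETURN p.name, count(*) AS incoming_depends_on
ORDER BY incoming_depends_on DESC LIMIT 10;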

Degree Centrality

So, using the algo.degree procedure from the Neo4j Graph Algorithms plugin, all we need are a node label and a relationship type. The arguments to this procedure could be as simple as two strings, one for the node label and one for the relationship type. However, as mentioned above, there are two node labels at play here, so we will use the alternative syntax of algo.degree, in which we pass Cypher statements returning the set of nodes and the relationships among them.

To run the degree centrality algorithm on the Projects written in R that are hosted on CRAN, the syntax (found here) is:

call algo.degree(
    "MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name:'CRAN'}) return id(p) as id",
    "MATCH (p1:Project)-[:HAS_VERSION]->(:Version)-[:DEPENDS_ON]->(p2:Project) return id(p2) as source, id(p1) as target",
    {graph: 'cypher', write: true, writeProperty: 'cran_degree_centrality'}
)
;

It is crucially important to alias as source the Project node MATCHed in the second query as the end node of the DEPENDS_ON relationship, and to alias the start node as target. This is not officially documented, but the example in the documentation has it this way, and I ran into Java errors when it was not aliased exactly so.

Now that there is a property on each R Project node denoting its degree centrality score, the following query returns the top 10 Projects:

MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name:'CRAN'})
RETURN p.name, p.cran_degree_centrality ORDER BY p.cran_degree_centrality DESC LIMIT 10
;
| Project | Degree Centrality Score |
| --- | --- |
| Rcpp | 6048 |
| ggplot2 | 4269 |
| MASS | 4024 |
| dplyr | 3573 |
| plyr | 3017 |
| stringr | 2622 |
| Matrix | 2512 |
| magrittr | 2200 |
| httr | 2073 |
| jsonlite | 2070 |

The Project that is out in front by a good margin is Rcpp, the R package that allows developers to integrate C++ code into R, usually for significant speedup. Another interesting note is that 4 of these top 10 are part of the "Tidyverse", Hadley Wickham's collection of packages designed for data science. Moreover, as noted on the Tidyverse website, the last two Projects, httr and jsonlite, are "Tidyverse-adjacent", in that they have a similar design and philosophy. It seems that the hypothesis that @hadley is the most influential contributor deserves a hefty amount of a priori weight!

The Most Influential Contributor

To properly evaluate the hypothesis, the degree centrality algorithm will be run again, this time focusing on the Contributor nodes, and their contributions to Projects. The query (found here) is:

call algo.degree(
    "MATCH (:Platform {name:'CRAN'})-[:HOSTS]->(p:Project) with p MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p)<-[:CONTRIBUTES_TO]-(c:Contributor) return id(c) as id",
    "MATCH (c1:Contributor)-[:CONTRIBUTES_TO]->(:Project)-[:HAS_VERSION]->(:Version)-[:DEPENDS_ON]->(:Project)<-[:CONTRIBUTES_TO]-(c2:Contributor) return id(c2) as source, id(c1) as target",
    {graph: 'cypher', write: true, writeProperty: 'cran_degree_centrality'}
)
;

This puts a property on each Contributor node denoting its degree centrality score, and the following query returns the top 10 Contributors and their scores:

MATCH (:Platform {name:'CRAN'})-[:HOSTS]->(p:Project)-[:IS_WRITTEN_IN]->(:Language {name: 'R'})
MATCH (c:Contributor)-[:CONTRIBUTES_TO]->(p)
RETURN c.name, c.cran_degree_centrality ORDER BY c.cran_degree_centrality DESC LIMIT 10
;
| Contributor | GitHub login | Degree Centrality Score | # Top-10 Contributions | # Total Contributions | Total Contributions Rank |
| --- | --- | --- | --- | --- | --- |
| Hadley Wickham | hadley | 239 829 | 5 | 121 | 2nd |
| Jim Hester | jimhester | 167 662 | 3 | 120 | 3rd |
| Kirill Müller | krlmlr | 154 655 | 3 | 106 | 5th |
| Jennifer (Jenny) Bryan | jennybc | 119 082 | 3 | 57 | 13th |
| Mara Averick | batpigandme | 118 792 | 3 | 50 | 15th |
| Hiroaki Yutani | yutannihilation | 98 164 | 3 | 49 | 16th |
| Christophe Dervieux | cderv | 98 078 | 3 | 36 | 28th |
| Gábor Csárdi | gaborcsardi | 93 968 | 2 | 91 | 6th |
| Jeroen Ooms | jeroen | 72 211 | 2 | 117 | 4th |
| Craig Citro | craigcitro | 71 207 | 3 | 15 | 107th |

As was surmised from the result of the Projects degree centrality query, the most influential R contributor on CRAN is Hadley Wickham, and it's not even close. Not only does @hadley contribute to the second-most R projects of any Contributor (behind only Scott Chamberlain, who is curiously absent from the élite of most influential), he contributes to the most Top-10 projects of any Contributor, with fully half bearing his mark.

There are only 253 Contributors who contribute to a Top-10 project (in terms of degree centrality); however, even being one of those is not a sufficient condition for a high degree centrality score. That is, even though this table hints at a correlation between degree centrality score and the number of total projects contributed to (query here and rank query here), there is a stronger association between degree centrality and the number of Top-10 projects contributed to. Indeed, using the algo.similarity.pearson function:

MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name:'CRAN'})
WITH p order by p.cran_degree_centrality DESC
WITH collect(p) as r_projects
UNWIND r_projects as project
SET project.cran_degree_centrality_rank = apoc.coll.indexOf(r_projects, project)+1
WITH project WHERE project.cran_degree_centrality_rank <= 10
MATCH (project)<-[ct:CONTRIBUTES_TO]-(c:Contributor)
WITH c, count(ct) as num_top_10_contributions
WITH collect(c.cran_degree_centrality) as dc, collect(num_top_10_contributions) as tc
RETURN algo.similarity.pearson(dc, tc) AS degree_centrality_top_10_contributions_correlation_estimate;

yields an estimate of 0.8462, whereas

MATCH (:Language {name: 'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name: 'CRAN'})
MATCH (p)<-[ct:CONTRIBUTES_TO]-(c:Contributor)
WITH c, count(ct) as num_total_contributions
WITH collect(c.cran_degree_centrality) as dc, collect(num_total_contributions) as tc
RETURN algo.similarity.pearson(dc, tc) AS degree_centrality_total_contributions_correlation_estimate
;

is only 0.6830. All this goes to show that, in a network, the centrality of a node is determined by contributing to the right nodes, not necessarily the most nodes.

Conclusion

Using the Libraries.io Open Data dataset, the R projects on CRAN and their contributors were analyzed with Neo4j, in particular the degree centrality algorithm, to find out which contributor is the most influential in the graph of R packages, versions, dependencies, and contributors. That contributor is @hadley: the Tidyverse creator, Hadley Wickham.

This analysis did not take advantage of a commonly used feature of graph data: weights on the edges between nodes. A future improvement would be to use, say, the number of versions of a project as the weight in the degree centrality algorithm, down-weighting projects that have few versions relative to projects that carry real weight in the R community, e.g. dplyr. Likewise, it was not possible to delineate the type of contribution in this analysis; more accurate findings would no doubt result from distinguishing between, for example, a package's author and a contributor who merged a small pull request to fix a typo. Finally, the imputation of just a single contributor for more than 70% of the R packages potentially influenced the topology of this network in a non-trivial way.
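For example, the Graph Algorithms plugin accepts a weightProperty alongside the Cypher projection; assuming a numeric weight had first been written onto the DEPENDS_ON relationships (an assumption of this sketch), a weighted rerun might look like:

call algo.degree(
    "MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name:'CRAN'}) return id(p) as id",
    "MATCH (p1:Project)-[:HAS_VERSION]->(:Version)-[d:DEPENDS_ON]->(p2:Project) return id(p2) as source, id(p1) as target, d.weight as weight",
    {graph: 'cypher', write: true, writeProperty: 'cran_weighted_degree_centrality', weightProperty: 'weight'}
);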

Moreover, the data used in this analysis are just a snapshot of the state of CRAN as of December 22, 2018; needless to say, the numbers of versions, projects, and contributions are always in flux, so the analysis would bear re-running on fresh data. Still, the Libraries.io Open Data are a good window into the dynamics of statistical programming's premier community.
