
atrisovic / dataverse-r-study


Data and code for a large-scale study on research code quality and execution at Harvard Dataverse.

License: MIT License

Jupyter Notebook 99.31% Python 0.62% Dockerfile 0.02% R 0.04% Shell 0.02%
code-quality reproducibility r-programming r-language documentation

dataverse-r-study's Introduction

A large-scale study on research code quality and execution


This work is published here: https://www.nature.com/articles/s41597-022-01143-6

Step 1. get-dois

Code from get-dois enables communication with the Harvard Dataverse repository and collects DOIs of datasets that contain R code.
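As a rough illustration of this step, the sketch below builds a request URL for the Dataverse Search API and pulls dataset DOIs out of a decoded JSON response. It is not the code from get-dois: the query string `*.R`, the helper names, and the reliance on a `dataset_persistent_id` field in file-level results are assumptions made for the example, and the DOIs in the usage note are placeholders.

```python
from urllib.parse import urlencode

# Public Search API endpoint of Harvard Dataverse.
SEARCH_URL = "https://dataverse.harvard.edu/api/search"


def build_search_url(query="*.R", start=0, per_page=100):
    """Build one paginated Search API request URL.

    The default query is a hypothetical example, not the study's query.
    """
    params = {"q": query, "type": "file", "start": start, "per_page": per_page}
    return SEARCH_URL + "?" + urlencode(params)


def extract_dataset_dois(response):
    """Collect unique dataset DOIs from one decoded Search API response.

    Assumes each file-level item carries a 'dataset_persistent_id' field.
    """
    dois = []
    for item in response.get("data", {}).get("items", []):
        pid = item.get("dataset_persistent_id")
        if pid and pid.startswith("doi:") and pid not in dois:
            dois.append(pid)
    return dois
```

A caller would fetch `build_search_url(start=n)` page by page, decode the JSON, and pass each page to `extract_dataset_dois`, for example turning two items that share `doi:10.7910/DVN/AAAAAA` into a single entry.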

Step 2. aws-cli

The list of DOIs is used to define jobs for the AWS Batch. Code from aws-cli sends these jobs to the batch queue, where they will wait until resources become available for their execution.
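One way to picture the mapping from DOIs to batch jobs is sketched below. It only builds the keyword arguments that boto3's Batch client accepts in `submit_job`; the queue name, job definition name, and the `DOI` environment variable are hypothetical and not taken from aws-cli.

```python
import re


def job_name_for(doi):
    """Turn a DOI into a valid AWS Batch job name.

    Job names may contain letters, digits, hyphens and underscores,
    and are limited to 128 characters.
    """
    return re.sub(r"[^A-Za-z0-9_-]", "-", doi)[:128]


def build_job_request(doi, queue="r-study-queue", definition="r-study-job"):
    """Build submit_job keyword arguments for one DOI.

    The queue and definition names are placeholders for this sketch.
    """
    return {
        "jobName": job_name_for(doi),
        "jobQueue": queue,
        "jobDefinition": definition,
        # Hypothetical convention: the container reads the DOI from its environment.
        "containerOverrides": {"environment": [{"name": "DOI", "value": doi}]},
    }
```

With boto3 installed and credentials configured, each request could then be sent with `boto3.client("batch").submit_job(**build_job_request(doi))`.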

Step 3. docker

When a job leaves the queue, it instantiates a pre-installed Docker image containing code to retrieve a replication package, executes R code, and collects data. Code from docker prepares the image.

Step 4. analysis

All collected data is retrieved and analyzed in analysis.


Q&A

  1. Before you do any cleaning, 850 scripts produce a library error. Some of those involve referencing a package that has not been loaded. And those you can fix by installing packages, reducing the errors to 'just' 496. All of those are instances when a package failed to load despite including an install.packages() command. Is that right?

Yes, that's correct. More precisely, in the code cleaning step we add if (!require(lib)) install.packages(lib) for every library detected in the code. I also tested variants of the code cleaning step that added just install.packages(), or install.packages() and library(), but the require() variant performed best.
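A minimal sketch of what such a cleaning step could look like follows. It is an approximation, not the study's implementation: the regular expression, the crude comment stripping, and the function names are assumptions, and calls such as `require(lib, character.only = TRUE)` would be misread as loading a package named `lib`.

```python
import re

# Matches library(pkg), require(pkg), library("pkg") and require('pkg').
LIB_CALL = re.compile(
    r"""\b(?:library|require)\s*\(\s*["']?([A-Za-z][A-Za-z0-9.]*)["']?\s*[,)]"""
)


def detect_libraries(r_source):
    """Return package names loaded in an R script, in order of first use."""
    found = []
    for line in r_source.splitlines():
        code = line.split("#", 1)[0]  # crude: also cuts at '#' inside strings
        for name in LIB_CALL.findall(code):
            if name not in found:
                found.append(name)
    return found


def add_install_guards(r_source):
    """Prepend one install guard per detected package to the script."""
    guards = [
        f'if (!require({lib})) install.packages("{lib}")'
        for lib in detect_libraries(r_source)
    ]
    return "\n".join(guards + [r_source])
```

For a script containing `library(ggplot2)` and `require("dplyr")`, the cleaned script starts with two guard lines, one per package, followed by the original code unchanged.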

  2. I wanted to look into specific cases that are coded as library errors here, but could not find the file in the dataverse that would allow me to do that. Does that file exist?

Yes! You can see how all the errors were classified here under the heading "Error type".

  3. Do you have a sense to what extent reliance on non-CRAN repositories may account for some of the errors you obtain?

This is a good question and a limitation of our approach. I previewed a lot of the research code while creating the code cleaning step and did not come across Bioconductor or GitHub packages, so my intuition is that they are a small subset, but I cannot be sure.

  4. Is there a way for me to see instances of R scripts that were vs. were not fixed by looking at your posted .csv files?

Allocating a fixed time period for the re-execution on the cloud created the following problem in data collection. For example, out of 10 scripts in the initial re-execution, we'll initially have a result for 9. But after code cleaning, we'll have a result for only 6 (as "fixed" code may take more time to re-execute). So we needed to "match" the 6 re-executed scripts in the second run to their results in the first run to see how the result had changed (Fig. 8 in the paper). That was done in this notebook. In the section "Constructing Sankey", you can see how the error changed before and after code cleaning for each file (i.e., those are result_x and result_y after merge).
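The matching described above amounts to an inner join on the file identifier. The sketch below shows the idea with plain dictionaries instead of the notebook's data frames; the file names and result labels are invented for illustration, and the tuple positions play the role of result_x (before cleaning) and result_y (after cleaning).

```python
from collections import Counter


def match_runs(before, after):
    """Inner-join two {file: result} mappings on the file key.

    Only files with a result in both runs are kept.
    """
    return {f: (before[f], after[f]) for f in after if f in before}


def transition_counts(matched):
    """Count (result before, result after) pairs, e.g. to feed a Sankey diagram."""
    return Counter(matched.values())


# Invented example data: results before and after code cleaning.
before = {"a.R": "library error", "b.R": "success", "c.R": "library error"}
after = {"a.R": "success", "b.R": "success", "d.R": "success"}

matched = match_runs(before, after)
counts = transition_counts(matched)
```

Here `c.R` and `d.R` drop out because each has a result in only one run, which mirrors the 9-versus-6 situation in the answer above.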

dataverse-r-study's People

Contributors

atrisovic, dependabot[bot], mklau


dataverse-r-study's Issues

Congratulations on the article - questions

Excellent work 😄

Comments and questions, I hope you don't mind me misusing the repo for that. Happy to take the conversation elsewhere if you prefer (email?).

  1. Installing missing packages can also fail when you run an old R version today and CRAN no longer offers the package for that version, or when you run a new R version and the package only works with an older release. Did you consider using MRAN checkpoints matching the R version release window?
  2. Have you considered using an R package that can parse R files for loaded packages, such as automagic or similar tools?
  3. Why Python code to analyse R code - personal preference, or did you find that useful somehow?

(Might add more questions as I digest... again, cool work!)

Cannot install requirements on Debian 10, with Python 2.7

git clone https://github.com/atrisovic/dataverse-r-study
cd dataverse-r-study/
pip2.7 install -r requirements.txt
  Downloading ipykernel-4.10.1-py2-none-any.whl (109 kB)
     |################################| 109 kB 21.0 MB/s
ERROR: Could not find a version that satisfies the requirement ipython==7.16.3 (from -r requirements.txt (line 25)) (from versions: 0.10, 0.10.1, 0.10.2, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.13.2, 1.0.0, 1.1.0, 1.2.0, 1.2.1, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 3.0.0, 3.1.0, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 4.0.0b1, 4.0.0, 4.0.1, 4.0.2, 4.0.3, 4.1.0rc1, 4.1.0rc2, 4.1.0, 4.1.1, 4.1.2, 4.2.0, 4.2.1, 5.0.0b1, 5.0.0b2, 5.0.0b3, 5.0.0b4, 5.0.0rc1, 5.0.0, 5.1.0, 5.2.0, 5.2.1, 5.2.2, 5.3.0, 5.4.0, 5.4.1, 5.5.0, 5.6.0, 5.7.0, 5.8.0, 5.9.0, 5.10.0)
ERROR: No matching distribution found for ipython==7.16.3 (from -r requirements.txt (line 25))

Plots

Replication package = dataset = data & code

code-stats

  • Lines of code per comment
  • Lines of code per dependency
  • Lines of code per function (modularity)
  • Lines of code per 'test' (testing)
  • File encoding
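As an illustration of the first metric in this list, the sketch below counts code and comment lines in an R script and derives lines of code per comment. The counting rules (blank lines skipped, a line counts as a comment only when it starts with `#`) are assumptions for the example and may differ from the rules used in the analysis notebooks.

```python
def code_stats(r_source):
    """Count code and comment lines in an R script.

    Assumed rules: blank lines are ignored, and a line is a comment only
    if '#' is its first non-blank character (inline comments count as code).
    """
    code = comments = 0
    for line in r_source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            comments += 1
        else:
            code += 1
    ratio = code / comments if comments else None
    return {"code_lines": code, "comment_lines": comments, "code_per_comment": ratio}
```

A script with three code lines and two full-line comments yields a ratio of 1.5; a script without comments yields `None` rather than dividing by zero.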

overview

  • Histogram of dataset sizes
  • Histogram of number of files per dataset
  • Histogram of file name lengths
  • Pie chart - file name contains space?
  • Pie chart - dataset contains documentation?
  • Pie chart - dataset contains other code?
  • Pie chart - dataset contains testing script?
  • Pie chart - dataset contains R markdown?

exe-rates

  • exe rate before cleaning
  • exe rate after cleaning
  • exe rate per R version
  • aggregated results per package

exe-stats

  • exe rate per year of publishing
  • exe rate per field of study
  • exe rate per publisher
  • exe rate per dependency count

Using `require()` is a bad idea, and so is `install.packages()`

They are "best performing" only because they do not stop on errors. You are basically ignoring errors:

{ require("foobar") ; install.packages(tempfile()); message("\nNOT GOOD!!!\n") }

See how "NOT GOOD" is printed here:

Loading required package: foobar
Installing package into ‘/Users/gaborcsardi/Library/R/arm64/4.2/library’
(as ‘lib’ is unspecified)

NOT GOOD!!!

Warning messages:
1: In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘foobar’
2: package ‘/var/folders/ph/fpcmzfd16rgbbk8mxvy9m2_h0000gn/T//Rtmph0Pd7A/file12c3c78c562f5’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

OTOH you could argue that if the script finishes anyway, then those packages were not really needed in the first place....
