code_as_data's People

Contributors

dependabot[bot], russhyde

code_as_data's Issues

Use dev-version of dupree and explicit git-commit

We currently install cloc and gitsum from github and use dupree-v0.2 from conda.

It would be better to use commit-pinned versions of cloc and gitsum, for reproducibility.

Also, I've updated dupree based on some minor bugs identified while writing this project (for example, to check R-package structure within dupree_package); the dev version of dupree will not be made available on conda.

Therefore,

  • update config.yaml to include commit-stamps for each repo
  • add dupree to config.yaml::remotes
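The pinned entries might look something like this in config.yaml (a sketch only: the key names and the gitsum repository location are assumptions, and the project's actual schema may differ):

```yaml
# Illustrative sketch: pin each GitHub-installed tool to a specific commit
remotes:
  dupree:
    url: https://github.com/russhyde/dupree
    commit: <pinned-sha>   # fill in the commit used for the analysis
  gitsum:
    url: https://github.com/lorenzwalthert/gitsum
    commit: <pinned-sha>
```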

Add a shiny app

Produce a shiny app to show commit frequencies, duplication, and so on.

Use snakemake to run the whole project

Use snakemake at the top level

  • add separate snakemake workflows for 1) choosing the packages for analysis, and 2) downloading and analysing the packages

[Comparison of the workflow management tools should be a separate job]
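The two-workflow split might be laid out like this (a minimal sketch; the file layout, rule names and target paths are all assumptions, not the project's actual files):

```
# Top-level Snakefile (illustrative names only)
include: "rules/choose_packages.smk"   # 1) choose the packages for analysis
include: "rules/analyse_packages.smk"  # 2) download and analyse the packages

rule all:
    input:
        "results/package-list.tsv",     # produced by the package-choosing workflow
        "results/analysis-summary.tsv"  # produced by the analysis workflow
```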

Comparison of workflow-management tools

Have added a bash script to:

  • check environment is activated
  • check results dir is in place
  • run all analysis scripts

But I'd normally use snakemake to control running the scripts and my own tools to check the environment / workspace before running a project

Would like to compare the following approaches to the same project:

  • snakemake
  • nextflow
  • {drake}
  • {workflowr}
  • bash
  • make (dropped due to 'scope creep')

Modifications required before this is possible:

  • ./R/config.R should be rewritten as ./config.yaml
  • All scripts should be configured from above, by passing in file-names, URLs etc. as command-line arguments (at present they source config.R rather than being configured from conf/config.yaml)

Anticipated pain-points:

  • non-predefined output files for some steps (the packages that will be analysed depend on the current CRAN database and the current devtools CRAN view) will be problematic for snakemake / make
  • multiple output files for some steps will be problematic for make
  • will need a slightly different config for each workflow manager

Therefore need a different subjob for each workflow manager, with

  • links from that subjob to the relevant R scripts
  • a way to ensure that the link timestamps stay in sync with the timestamps of their target R scripts
  • a way to update the subjob-config if the main config changes

TODO:

  • Replace bash script with snakemake script(s) for the main project
  • Decide on a core set of scripts for running in the workflow-comparison subproject
  • Nextflow.io branch
  • Drake branch
  • Move a reduced version of the existing bash script to a bash branch

Bug: gitsum bumps the timestamp for an analysed repo & affects timestamp-dependent workflow

If I have a snakemake workflow that:

  • runs dupree on a repo (where the repo directory is used as 'input')
  • then runs gitsum on that repo

Then, the next time I run snakemake, the dupree step will run again (even though the files analysed by dupree have not been modified).

Reason:

  • gitsum updates the .gitsum directory in the git repo
  • this updates the timestamp for the git repo
  • the dupree rule compares the timestamp for the git repo against the timestamp for output files
  • but if gitsum was run after dupree, the git repo's timestamp will be ahead of the timestamp of the dupree output files

Suggest:

  • make the dupree rule depend upon the R subdirectory as input, not the repository itself (though it is not possible to depend on a subdirectory of a directory that is the 'output' of a rule, i.e., clone_repo)
  • Run gitsum without creating the .gitsum subdirectory?
  • Set directory-inputs to ancient()
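The ancient() suggestion might look like this (a sketch only; the rule, path and script names are assumptions, not the project's actual workflow):

```
rule run_dupree:
    input:
        # ancient() tells snakemake to ignore this input's modification time,
        # so a gitsum-induced timestamp bump no longer re-triggers dupree
        repo=ancient("repos/{package}")
    output:
        "results/dupree/{package}.tsv"
    script:
        "scripts/run_dupree.R"
```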

[app] Shiny app features

All-package analyses:

  • Table of the packages that were analysed, with a link to their github repo
    • use data in results/dev-pkg-repositories.tsv
  • cloc analysis across all packages (barplot and table)
  • gitsum analysis across all packages (barplot and table)
  • lines-of-code vs number of commits
  • Separate pages for single-statistic analysis and two-statistic comparisons across packages

Single-package analyses:

  • Number of commits, per author, for each file
  • Correlated changes between files

Niceties:

Plan for presentation

Plan for Newcastle satRdays abstract:

Code analysis tools:

  • static stuff: code-quality, code-style {goodpractice, lintr, cloc, }
  • dynamic stuff: benchmarking,
  • archaeological stuff: gitsum

How to combine all of these together, similar to what code-maat does

Probably need to work on the visual representation of projects

workflow_tools: `{targets}`

Try using the R package {targets} for workflow management.
{targets} is a spin-off from {drake}, which we didn't end up using.

Rewrite scripts to conform to workflow-able architecture

Assume the existence of a coordinator (be it a bash script, makefile or some other workflow-manager; currently this would be run_me.sh)

Each script should:

  • make a single output file (or directory)
    • the path for the output should be passed in by the coordinator (--output <output_file> or --output_dir <output_dir>)
  • analyse / summarise a single dataset (eg, repo); combine a small number of non-homogeneous datasets; or merge the results for multiple homogeneous datasets
    • single-input: file path is passed in using --input
    • small number of non-homogeneous datasets: each file path is passed via a named command-line argument
    • multiple dataset:
      • (initial) file-paths are passed in using a config
      • (ok) file-paths are passed as <interpreter> <script_name> [options] --input_files <some_file> where the latter file defines all file-paths that are to be combined together
      • (preferably) file-paths are passed as <interpreter> <script_name> [options] input1 input2 input3 ...
  • and should not need to parse a config file to decide what it needs to make (only the coordinator should interact with the config)
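The preferred calling convention above can be sketched as follows. The project's scripts are R (with optparse), so this uses Python's argparse purely to illustrate the interface shape; the parameter and script names are assumptions:

```python
import argparse

def parse_cli(argv=None):
    """Parse the coordinator-supplied arguments: one --output path,
    plus positional input paths (input1 input2 input3 ...)."""
    parser = argparse.ArgumentParser(description="Merge per-repo results.")
    parser.add_argument("--output", required=True,
                        help="Path for the single output file")
    parser.add_argument("inputs", nargs="+",
                        help="Result files to merge")
    return parser.parse_args(argv)
```

A coordinator would then call, e.g., `python merge_results.py --output merged.tsv a.tsv b.tsv`, and the script never needs to read the config itself.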

Refactor the R scripts into a package, config.yaml and minimal script logic

  • Add an R package to ./lib/code.as.data
    • eg, move contents of utils.R to {code.as.data}
    • ensure {code.as.data} is built and installed before running the analysis scripts
      • eg, add R CMD build etc to run_me.sh, or add a setup.sh script
  • Rename "./R" directory as "./scripts"
  • Put package-loading code at start of scripts (use for(pkg in pkgs){library(...)}) not load_packages()
  • Parse the config from .yaml
  • Add optparse calls where relevant and pass in the config from above
  • Then rewrite to pass in
    • the files / directory names
    • and other arguments (min-block-size etc), instead of the config

rscala reveals package-structure bug

Analysis of the rscala repo identified a problem similar to that seen for r-logging.

The rscala repo structure looks like this:

<root>
- R
    - rscala [the actual package]
        - R
        - inst
        - tests
        - ... <rest of the actual R package>
- benchmarks
- bin
- ...

So the R-package is not at the root of the repo structure.

In r-logging, the R package was similarly nested, but the repo did not have an R directory at the top level, so it was simple to tell that the whole repo did not have a typical R-package structure.

Suggest either:

  • adding a check during the github-download script that fails if markers of R-package structure are absent from the repo top-level (DESCRIPTION, NAMESPACE, R/); and dropping any packages that don't conform to that structure (by updating config.yaml::drop)
  • adding code to find a DESCRIPTION-containing subdir, and appending an R-package-root column to the repository filepath files (distinguishing repo-root from r-pkg-root) and then analysing the r-pkg for R-specific stuff (coverage, duplication etc) and the repo for git-specific stuff (change-frequencies etc)

The first alternative seems easiest to explain in a presentation, quickest to implement, and less open to subsequent failures.
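The check in the first alternative might be sketched as follows (a hypothetical helper, not project code; the marker names follow the bullet above):

```python
from pathlib import Path

# Markers of a conventional R-package root, per the first suggestion above
R_PACKAGE_MARKERS = ("DESCRIPTION", "NAMESPACE", "R")

def is_r_package_root(repo_dir):
    """Return True if `repo_dir` has R-package markers at its top level."""
    repo = Path(repo_dir)
    return all((repo / marker).exists() for marker in R_PACKAGE_MARKERS)
```

A download script could call this on each freshly cloned repo and add any non-conforming package to config.yaml::drop.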
