code_as_data's People

Contributors

dependabot[bot], russhyde

code_as_data's Issues

Use dev-version of dupree and explicit git-commit

We currently install cloc and gitsum from github and use dupree-v0.2 from conda.

It would be better to use commit-pinned versions of cloc and gitsum, for reproducibility.

Also, I've updated dupree based on some minor bugs identified while writing this project (for example, to check R-package structure within dupree_package); the dev version of dupree will not be made available on conda.

Therefore,

  • update config.yaml to include commit-stamps for each repo
  • add dupree to config.yaml::remotes
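The pinned entries might look something like this in config.yaml (a sketch only: the key names and the gitsum repository location are assumptions, and the project's actual schema may differ):

```yaml
# Illustrative sketch: pin each GitHub-installed tool to a specific commit
remotes:
  dupree:
    url: https://github.com/russhyde/dupree
    commit: <pinned-sha>   # fill in the commit used for the analysis
  gitsum:
    url: https://github.com/lorenzwalthert/gitsum
    commit: <pinned-sha>
```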

Add a shiny app

Produce a shiny app to show commit frequencies, duplication, and so on.

Use snakemake to run the whole project

Use snakemake at the top level

  • add separate snakemake workflows for 1) choosing the packages for analysis, and 2) downloading and analysing the packages

[Comparison of the workflow management tools should be a separate job]
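The two-workflow split might be laid out like this (a minimal sketch; the file layout, rule names and target paths are all assumptions, not the project's actual files):

```
# Top-level Snakefile (illustrative names only)
include: "rules/choose_packages.smk"   # 1) choose the packages for analysis
include: "rules/analyse_packages.smk"  # 2) download and analyse the packages

rule all:
    input:
        "results/package-list.tsv",     # produced by the package-choosing workflow
        "results/analysis-summary.tsv"  # produced by the analysis workflow
```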

Comparison of workflow-management tools

Have added a bash script to:

  • check environment is activated
  • check results dir is in place
  • run all analysis scripts

But I'd normally use snakemake to control running the scripts and my own tools to check the environment / workspace before running a project

Would like to compare the following approaches to the same project:

  • snakemake
  • nextflow
  • {drake}
  • {workflowr}
  • bash
  • make (dropped due to 'scope creep')

Modifications required before this is possible:

  • ./R/config.R should be rewritten as ./config.yaml
  • All scripts should be configured from above, by passing in file-names, URLs etc. as command-line arguments (at present they source config.R rather than being configured from conf/config.yaml)

Anticipated pain-points:

  • non-predefined output files for some steps (the packages that will be analysed depend on the current CRAN database and the current devtools CRAN view) will be problematic for snakemake / make
  • multiple output files for some steps will be problematic for make
  • will need a slightly different config for each workflow manager

Therefore need a different subjob for each workflow manager, with

  • links from that subjob to the relevant R scripts
  • a way to ensure that the link timestamps stay in sync with the timestamps of their target R scripts
  • a way to update the subjob-config if the main config changes

TODO:

  • Replace bash script with snakemake script(s) for the main project
  • Decide on a core set of scripts for running in the workflow-comparison subproject
  • Nextflow.io branch
  • Drake branch
  • Move a reduced version of the existing bash script to a bash branch

Bug: gitsum bumps the timestamp for an analysed repo & affects timestamp-dependent workflow

If I have a snakemake workflow that:

  • runs dupree on a repo (where the repo directory is used as 'input')
  • then runs gitsum on that repo

Then, the next time I run snakemake, the dupree step will run again (even though the files analysed by dupree have not been modified).

Reason:

  • gitsum updates the .gitsum directory in the git repo
  • this updates the timestamp for the git repo
  • the dupree rule compares the timestamp for the git repo against the timestamp for output files
  • but if gitsum was run after dupree, the git repo's timestamp will be ahead of the timestamp of the dupree output files

Suggest:

  • make the dupree rule depend upon the R subdirectory as input, not the repository itself (though it is not possible to depend on a subdirectory of a directory that is the 'output' of a rule, i.e., clone_repo)
  • Run gitsum without creating the .gitsum subdirectory?
  • Set directory-inputs to ancient()
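The ancient() suggestion might look like this (a sketch only; the rule, path and script names are assumptions, not the project's actual workflow):

```
rule run_dupree:
    input:
        # ancient() tells snakemake to ignore this input's modification time,
        # so a gitsum-induced timestamp bump no longer re-triggers dupree
        repo=ancient("repos/{package}")
    output:
        "results/dupree/{package}.tsv"
    script:
        "scripts/run_dupree.R"
```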

[app] Shiny app features

All-package analyses:

  • Table of the packages that were analysed, with a link to their github repo
    • use data in results/dev-pkg-repositories.tsv
  • cloc analysis across all packages (barplot and table)
  • gitsum analysis across all packages (barplot and table)
  • lines-of-code vs number of commits
  • Separate pages for single-statistic analysis and two-statistic comparisons across packages

Single-package analyses:

  • Number of commits, per author, for each file
  • Correlated changes between files

Niceties:

Plan for presentation

Plan for Newcastle satRdays abstract:

Code analysis tools:

  • static stuff: code-quality, code-style {goodpractice, lintr, cloc, }
  • dynamic stuff: benchmarking,
  • archaeological stuff: gitsum

How to combine all of these together, similar to what code-maat does

Probably need to work on the visual representation of projects

workflow_tools: `{targets}`

Try using the R package {targets} for workflow management.
{targets} is a spin-off from {drake}, which we didn't end up using.

Rewrite scripts to conform to workflow-able architecture

Assume the existence of a coordinator (be it a bash script, makefile or some other workflow-manager; currently this would be run_me.sh)

Each script should:

  • make a single output file (or directory)
    • the path for the output should be passed in by the coordinator (--output <output_file> or --output_dir <output_dir>)
  • analyse / summarise a single dataset (eg, repo); combine a small number of non-homogeneous datasets; or merge the results for multiple homogeneous datasets
    • single-input: file path is passed in using --input
    • small number of non-homogeneous datasets: each file path is passed via a named command-line argument
    • multiple dataset:
      • (initial) file-paths are passed in using a config
      • (ok) file-paths are passed as <interpreter> <script_name> [options] --input_files <some_file> where the latter file defines all file-paths that are to be combined together
      • (preferably) file-paths are passed as <interpreter> <script_name> [options] input1 input2 input3 ...
  • and should not need to parse a config file to decide what it needs to make (only the coordinator should interact with the config)
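The preferred calling convention above can be sketched as follows. The project's scripts are R (with optparse), so this uses Python's argparse purely to illustrate the interface shape; the parameter and script names are assumptions:

```python
import argparse

def parse_cli(argv=None):
    """Parse the coordinator-supplied arguments: one --output path,
    plus positional input paths (input1 input2 input3 ...)."""
    parser = argparse.ArgumentParser(description="Merge per-repo results.")
    parser.add_argument("--output", required=True,
                        help="Path for the single output file")
    parser.add_argument("inputs", nargs="+",
                        help="Result files to merge")
    return parser.parse_args(argv)
```

A coordinator would then call, e.g., `python merge_results.py --output merged.tsv a.tsv b.tsv`, and the script never needs to read the config itself.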

Refactor the R scripts into a package, config.yaml and minimal script logic

  • Add an R package to ./lib/code.as.data
    • eg, move contents of utils.R to {code.as.data}
    • ensure {code.as.data} is built and installed before running the analysis scripts
      • eg, add R CMD build etc to run_me.sh, or add a setup.sh script
  • Rename "./R" directory as "./scripts"
  • Put package-loading code at start of scripts (use for(pkg in pkgs){library(...)}) not load_packages()
  • Parse the config from .yaml
  • Add optparse calls where relevant and pass in the config from above
  • Then rewrite to pass in
    • the files / directory names
    • and other arguments (min-block-size etc), instead of the config

rscala reveals package-structure bug

Analysis of the rscala repo identified a problem similar to that seen for r-logging.

The rscala repo structure looks like this:

<root>
- R
    - rscala [the actual package]
        - R
        - inst
        - tests
        - ... <rest of the actual R package>
- benchmarks
- bin
- ...

So the R-package is not at the root of the repo structure.

In r-logging, the R package was similarly nested, but the repo did not have an R directory at the top level, so it was simple to tell that the whole repo did not have a typical R-package structure.

Suggest either:

  • adding a check during the github-download script that fails if markers of R-package structure are absent from the repo top-level (DESCRIPTION, NAMESPACE, R/); and dropping any packages that don't conform to that structure (by updating config.yaml::drop)
  • adding code to find a DESCRIPTION-containing subdir, and appending an R-package-root column to the repository filepath files (distinguishing repo-root from r-pkg-root) and then analysing the r-pkg for R-specific stuff (coverage, duplication etc) and the repo for git-specific stuff (change-frequencies etc)

The first alternative seems easiest to explain in a presentation, quickest to implement, and less open to subsequent failures.
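The check in the first alternative might be sketched as follows (a hypothetical helper, not project code; the marker names follow the bullet above):

```python
from pathlib import Path

# Markers of a conventional R-package root, per the first suggestion above
R_PACKAGE_MARKERS = ("DESCRIPTION", "NAMESPACE", "R")

def is_r_package_root(repo_dir):
    """Return True if `repo_dir` has R-package markers at its top level."""
    repo = Path(repo_dir)
    return all((repo / marker).exists() for marker in R_PACKAGE_MARKERS)
```

A download script could call this on each freshly cloned repo and add any non-conforming package to config.yaml::drop.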
