o2r-project / erc-spec


Executable Research Compendium specification and guides

Home Page: https://o2r.info/erc-spec/

License: Creative Commons Zero v1.0 Universal

erc-spec's People

nuest fmazin simonwaldherr diegosiqueir4 msch0

erc-spec's Issues

Add progress communication to specification

Well-defined log messages during execution could provide progress information, which can be parsed by tools and communicated to the user, during ERC execution.

This could initially be based on the default progress percentage of rmarkdown/knitr.

Authors could add progress information in their scripts, too (e.g. by calling R functions, or by adding well-defined comments).

This requires an extension of the specification.
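A minimal sketch of how a tool could read such progress information from the execution log. It assumes log lines shaped like a text progress bar that ends in a percentage (for example `  |.....     |  25%`); the exact line format, and the function names `parse_progress` and `latest_progress`, are assumptions for illustration and are not defined by the specification.

```python
import re

# Assumed line shape: a text progress bar ending in a percentage,
# e.g. "  |.....     |  25%". The real log format may differ.
PROGRESS_RE = re.compile(r"\|\s*(\d{1,3})%\s*$")


def parse_progress(line):
    """Return the progress percentage (0-100) found in a log line, or None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None


def latest_progress(log_lines):
    """Return the most recent percentage seen in an iterable of log lines."""
    latest = None
    for line in log_lines:
        value = parse_progress(line)
        if value is not None:
            latest = value
    return latest
```

Lines that do not match are ignored, so ordinary log output and author-defined messages can be mixed freely with progress lines.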

Add fields to manifest on host config and requirements

Should an ERC be able to announce what resources (cores, RAM) it needs, or within which limits it should work?

Add reference to https://peerj.com/articles/cs-112/ to the user guide or reasoning for Rmd as core format

https://peerj.com/articles/cs-112/

Krewinkel A, Winkler R. (2017) Formatting Open Science: agilely creating multiple document formats for academic manuscripts with Pandoc Scholar. PeerJ Computer Science 3:e112 https://doi.org/10.7717/peerj-cs.112

ERC metadata as part of Rmd header

RMarkdown headers are yaml, the erc.yml file is yaml - why not think about a variant of the ERC where the ERC metadata is actually in the header of the main document?

ERC as Data Package

From the Open Data community, there comes a specification Data Package: https://specs.frictionlessdata.io/data-package/

It's quite simple with some metadata and a list of resources in a JSON file.

Evaluate potential support for StatTag

StatTag lets users create reproducible docx documents: http://sites.northwestern.edu/stattag/

OOXML being an open format, we should evaluate if a StatTag-based workflow is supported by the current ERC specification.

Evaluate runtime environment definitions in Common Workflow Language (CWL)

See presentation 2017-03-17 CWL @ HTS-CSRS "BioCompute Object" Workshop at https://docs.google.com/presentation/d/1a-iQYhu52F5L0-UaD-5mGCpWIJCxdVEH9a1i4Rx8BOA/edit#slide=id.g15b9625092_0_337

CWL also defines runtimes, so there is probably something to learn from that: http://www.commonwl.org/draft-3/CommandLineTool.html#Runtime_environment

Fix overfull boxes in code blocks in PDF

Move information about secondary metadata files to archival extension

http://o2r.info/erc-spec/spec/#secondary-metadata-files

Allow minimal ERC with only one plot

I like the idea of allowing an ERC to contain only code for one plot. I.e. an R-script plot.R and an output document plot{.png,.pdf}. The former could be the "main document" and the latter the "view document".

I suggest to make sure that the wording still allows this, i.e. that the main document CAN be a script instead of a literate programming file.

If this should be possible, it would be nice to have an example minimal ERC like this, I think it would be easy for people to understand.

@7048730 Thoughts on this?

Add packages for tracking changes in R scripts to user guide

We could mention packages freezr and recordr for

capture data provenance for R scripts and console commands without the need to modify existing R code.

Via https://discuss.ropensci.org/t/track-fast-evolving-custom-r-scripts-via-freezr/903

This is merely related, but interesting if one of the user guides develops into recommendations of day-to-day habits.

Manual UI bindings creation

[ outsourced from #31 ]

Users want to write UI bindings directly into RMarkdown themselves, not only create them in a (browser-based) UI-based workflow.

Do we want to work on this now?

How can UI bindings be embedded into RMarkdown?

Validation instructions

There are two major aspects to validation:

Validation of proper reproduction of the contents of the erc.
Validation of the archival-related integrity of all data.

This issue deals with 2.

Tasks for the User guide part of the spec:

add detailed instruction on how to create a bagit bag with custom properties (such as erc-version tag in bag-info.txt) with a standard tool like loc java bagger
add instructions on how to manually validate the bag correctly with a standard tool like loc java bagger

edit: will add this in branch https://github.com/o2r-project/erc-spec/tree/update-eval_1 as it was also part of first feedback

Add to user guide: the ERC as journal supplement

minimal draft, that can be extended in the future.

make reference to article of #27

Drop extensions

Move all contents from extensions into the spec. Do not remove any content.

Add Labels to Docker images

Use LABEL in Dockerfiles and give example of using docker inspect to see core metadata.
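As a sketch of the "docker inspect" part: the command prints a JSON array with one object per inspected image, and the labels set via `LABEL` appear under `Config.Labels`. The helper name `labels_from_inspect` and the label keys in the sample (`erc.id`, `erc.spec_version`) are hypothetical; the spec does not define label names yet.

```python
import json


def labels_from_inspect(inspect_json):
    """Extract image labels from the JSON text printed by `docker inspect`."""
    entries = json.loads(inspect_json)
    if not entries:
        return {}
    # One object per inspected item; labels live under Config.Labels
    # and may be null when the image has no labels.
    config = entries[0].get("Config") or {}
    return config.get("Labels") or {}


# Hypothetical label keys, for illustration only.
sample = json.dumps(
    [{"Config": {"Labels": {"erc.id": "abcd", "erc.spec_version": "1"}}}]
)
print(labels_from_inspect(sample))
```

With Docker available, the JSON text could be obtained with `subprocess.run(["docker", "inspect", image], capture_output=True, text=True, check=True).stdout`.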

Compare structure introduced by ERC with scientific literature

Structuring supplemental materials in support of reproducibility > https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1205-3
- does not mention computational/runtime environments
- talks about linking plots with their code
- ...

Add a user guide on manual examination (without Docker)

We should add a user guide on how to examine an ERC without Docker and without the reproducibility service and platform/UI.
This could mitigate issues about "What if Docker is not available anymore?" and demonstrate that the information is still there and accessible using the structure required by the spec alone.

unpack the archive
on the structure of Docker exports (maybe a side note on image squashing)
where to find the ERC metadata files
how to extract the ERC payload from /erc
how to check bag validity on the extracted payload
how to find the main and view files (default names, erc.yml)
blog post (examining a Docker image without Docker)
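The "extract the payload" step could look roughly like the sketch below, which uses only the Python standard library. It assumes an archive written by `docker save` that contains a `manifest.json` with a `Layers` list, and that the payload sits under `erc/` inside the layers. It ignores whiteout files and does not merge layers, so it is a starting point for the guide rather than a complete extractor. The function name `list_erc_files` is hypothetical.

```python
import io
import json
import tarfile


def list_erc_files(image_tar, prefix="erc/"):
    """List file paths under `prefix` found in the layers of a `docker save` archive."""
    found = set()
    with tarfile.open(image_tar) as image:
        # manifest.json names the layer tarballs in application order.
        manifest = json.load(image.extractfile("manifest.json"))
        for layer_name in manifest[0]["Layers"]:
            layer_bytes = image.extractfile(layer_name).read()
            # Each layer is itself a tar archive of the files it adds or changes.
            with tarfile.open(fileobj=io.BytesIO(layer_bytes)) as layer:
                for member in layer.getmembers():
                    if member.isfile() and member.name.startswith(prefix):
                        found.add(member.name)
    return sorted(found)
```

Reading whole layers into memory keeps the sketch short; for large images, extracting the layer tarballs to disk first would be the safer choice.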

Resources

Plain OCI/Docker bundle

Explore how the label mechanisms of Docker and OCI (especially the latter) allow merging the inner and outer container.

How can files in the container be accessed easily? (extract tarball, then make sense of the layers? does squashing help?)
How can metadata be accessed? (docker inspect ... command to access the erc.yml)

We must clarify in the spec how to easily identify a BagIt bag as an ERC

Ideas (not necessarily conflicting):

open bag, look for data/erc.yml; if it exists, it's an ERC
use mime type?
use bag metadata, add field Executable-Research-Compendium: Yes to bag-info.txt
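A sketch of how the first and third idea could be checked together on an unpacked bag. The field name `Executable-Research-Compendium` is taken from the proposal above and is not yet part of the spec; the function names are hypothetical, and continuation lines in `bag-info.txt` are skipped for simplicity.

```python
from pathlib import Path


def bag_info_fields(bag_dir):
    """Read the `Key: value` pairs of bag-info.txt into a dict."""
    fields = {}
    info = Path(bag_dir) / "bag-info.txt"
    if not info.is_file():
        return fields
    for line in info.read_text(encoding="utf-8").splitlines():
        # Skip indented continuation lines and lines without a key.
        if line[:1].isspace() or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def erc_indicators(bag_dir):
    """Report which of the proposed ERC markers are present in a bag."""
    bag = Path(bag_dir)
    tag = bag_info_fields(bag).get("Executable-Research-Compendium", "")
    return {
        "has_erc_yml": (bag / "data" / "erc.yml").is_file(),
        "tagged_in_bag_info": tag.lower() == "yes",
    }
```

Returning both indicators separately leaves it to the spec to decide whether one of them is sufficient or both are required.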

New package workflowr gives a new way to structure a workspace

Add https://jdblischak.github.io/workflowr/index.html to the creation user guide at https://github.com/o2r-project/erc-spec/blob/master/docs/user-guide/creation.md#workspace-structure

Developer guide improvements

mention the usage of a copy-on-write storage for checking purposes (don't need to copy all the files before running the ERC but copy/overwrite only the changed ones)

Add OAIS-related content to Archival extension

See https://github.com/o2r-project/erc-spec/blob/master/docs/spec/archival.md#oais---under-development

Overhaul validation procedure (conceptually and technically)

ERC as RO

Let's package an ERC as a Research Object.

We can re-use a lot of their (meta)data model, especially the added semantics and see how the ERC "(old) simple tools and manual is possible" approach relates to the world of Linked Open Data.

Added concepts: "one click", nested containers, "offline"/self-consistency

See also the disambiguation in the paper

First Comments on spec

"These typically consist of data, code and libraries in executable form which are needed to re-do an analysis, and the outputs of the original analysis." - not an easy sentence
required fields for erc.yml do not match minimal example
"Default command statements of implementing tools" - "for" instead of "of"?
is time zone a MUST?
link to o2r-metadata schema?
Example configuration file at the end? Or a link?
Example docker file?
Is it possible that I can avoid the entire validation process by using the .ercignore file? Should that be possible?
Validation of research results is missing in Validation, right?

I am not sure if I understood each point in sufficient detail. I will re-read it on a later occasion.

Package slip for archival extension

add metadata to connect which metadata file uses which schema and which contained file is the actual schema

Add non-containered information to erc.yml

A core point of containers is not to include the kernel but use that from the host. This means for complete metadata we must include the kernel version into erc.yml.

What other things are not captured by the container?
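A small sketch of collecting the kernel information at creation time, using Python's standard `platform` module. The key names (`kernel_name`, `kernel_release`, `architecture`) are placeholders and not fields defined for `erc.yml`.

```python
import platform


def host_kernel_info():
    """Collect host properties that a container image does not capture."""
    return {
        "kernel_name": platform.system(),      # e.g. "Linux"
        "kernel_release": platform.release(),  # kernel release string
        "architecture": platform.machine(),    # e.g. "x86_64"
    }


print(host_kernel_info())
```

Because containers share the host kernel, running this inside the container still reports the kernel of the host it runs on.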

Add relatedIdentifiers to erc.yml to depict <ERC as supplement> connection to publication

Add a new element to the erc.yml MD, called relatedIdentifiers that refers to the main publications and possibly other ERC supplements with persistent identifiers.

Add FAQ/developer note on Singularity

ERCs could just as well use Singularity instead of Docker, and (cf. C4RR workshop) it might be very well suited for reproducible research. Add a statement to the developer guide.

Add user guide for direct usage with containerit, o2r-meta, and erc-checker

Something along the lines of "this is how you can check before uploading if we can create a runtime manifest (using containerit) and what kind of metadata we will be able to extract (with o2r-meta extract)".

Empty Affiliation Structure

Shouldn't the structure of an empty affiliation be an empty array instead of null as defined here?

This should be according to our definition and would provide a more consistent structure

Current list of minor formatting errors, typos, necessary changes

Fixes:

Displayfile frame does not render (404) at erc-spec/user-guide/minimal/
template download table is broken at erc-spec/user-guide/template/ although correct md
admonition box !!! tip “Example in bagit example is broken at erc-spec/spec/#bagit-outer-container
Indicate external links with icon (e.g. 🔗), e.g. at erc-spec/glossary/#discover

Add section on manipulation

The "manipulation extension" draft was removed, content preserved here:

## UI bindings

How is the user interface defined?

## Using other data

Define in `erc.yml` which files are potential input data which can be exchanged.

```yml
id: adcd
manipulate:
    input_data:
        - filename: are.json
          format: geojson
        - filename: rs.tiff
          format: geotiff
```

Then: How is external data mounted into the container and where to (what are the paths)?

## Validation

How are UI bindings validated/checked?

Update and clarify Docker

Re-check that the export and import can use the ERC identifier as it is in the spec.

Git LFS & GitBag experiments

Describe how to handle large data files in an ERC using git and how this could be integrated with https://github.com/mjordan/GitBags

Integrating expert feedback

[These comments are based on notes and transcripts from a discussion of the ERC specification with publishing domain experts.]

dropping extensions is an excellent idea #36
a plain R solution is much smaller, consequently less burden for collaboration, but higher burden for preservation
need and option to download just main file and data (cf. o2r-project/o2r-platform#26); minimize footprint for specific usages
- what would be the "dev version" of an ERC? everything but the runtime environment image? The Dockerfile, the RMarkdown document, and the data (explicitly)?
~~community support is more important than technical support for people to be using the specification~~
- noted the meta-idea of a discussion forum, e.g. discourse
- not relevant for the spec
- add section to user guide (add email, add discussion forum)
authors should not read the ERC spec, which is developer material; authors should be confronted with a very simple system within the submission experience; alternative: guide of required steps
system must be as simple as possible, understandable by users if need be - current state is really good
automated learning should be considered for user experience, i.e. after an upload give feedback "you have 80% complete, consider these things", comparable to "profile completeness score" > taken into consideration in UI only, not part of spec, rather an application of the spec
~~badges should be considered~~
- ~~badges about ERC contents are imho out of scope, and badges for full ERC are under development, see https://github.com/o2r-project/o2r-badger (@nuest)~~
two possibilities for UI bindings: RMarkdown + UI and write directly into RMarkdown; not being able to do the latter would be a drawback, "scientists don't like interactions" > deferred, see #32
~~evolutionary approach is favoured, i.e. not replacing the article completely but having one/multiple ERC as supplemental material~~

Spec updates to be done

clarify the target audience for spec (devs, not authors) and guide (authors)
add reasoning against plain R solution to dev guide
how to handle ERC as supplemental material instead of the main published item? what metadata is (not) needed?
write concept for completeness score

Provide erc-spec with a persistent identifier

In the long run, publish the repo, e.g. on Zenodo, and include the DOI for the spec on the pages.

Add a user guide on inspection with udocker

Try out udocker to run an ERC, and if it works, write a short guide and discuss pros & cons.

Move o2r meta schema documentation to spec

TBD

.erc file extension

Since BagIt is the desired outer container, an ERC can readily be zipped as a single file. Consider .erc as a file extension for the ZIP archive (instead of .zip).

Require unbroken hash function in bag

Prefer checksums from cryptographic hash functions that have not yet been broken by collisions.

As soon as supported by bagit standard and implementations, we should go for sha3. Bagit is likely to support multiple hash functions and not require this high-quality one itself, see also LibraryOfCongress/bagit-python#86
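To illustrate the idea, here is a sketch that computes payload checksums with Python's `hashlib` and refuses algorithms with known collisions. The line layout (checksum, whitespace, path relative to the bag root) follows the usual BagIt payload manifest; the `BROKEN` set and the function names are assumptions of this example, not part of the bagit tooling.

```python
import hashlib
from pathlib import Path

# Collisions have been demonstrated for both of these algorithms.
BROKEN = {"md5", "sha1"}


def file_digest(path, algorithm="sha512", chunk_size=1 << 20):
    """Hash a file in chunks with an algorithm that is not known to be broken."""
    if algorithm.lower() in BROKEN:
        raise ValueError(f"refusing broken hash function: {algorithm}")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag_dir, algorithm="sha512"):
    """Build payload manifest lines: checksum, two spaces, relative path."""
    bag = Path(bag_dir)
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            relative = path.relative_to(bag).as_posix()
            lines.append(f"{file_digest(path, algorithm)}  {relative}")
    return lines
```

Passing `"sha3_256"` or `"sha3_512"` as `algorithm` works with `hashlib` today; whether a given BagIt implementation accepts the resulting manifest is a separate question.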

Description of the "Extension" concept and the other concepts

It might be unclear to anyone new to ERC how the extensions work and what is left if no extension is used.
The spec documentation should be updated to include a paragraph on this ("the extension concept") and reflect in its structure what is "base" and what is "extension".

Mechanism for intermediate results

Is there a possibility for a transparent mechanism to handle intermediate (calculation intensive) results?

Check with other R packages for handling workflows...

Add single file HTML and MD output of spec and store it in each ERC

use https://github.com/jgrassler/mkdocs-pandoc to create single file, then https://github.com/jgrassler/mkdocs-pandoc#usage-example and http://pandoc.org/demos.html to create

single Markdown document output of spec is part of Travis build @nuest
the version of the spec is available at http://o2r.info/erc-spec/erc-spec-v1.md @nuest
new content for spec where in the ERC the actual spec is stored (integrate with spec_version!) > #10
Automatically download (see URL above) and (a) save the latest spec version into ERC during creation in extractor and (b) create a file .erc/erc_raw.yml @7048730
PDF is at http://o2r.info/erc-spec/erc-spec-v1.md
unversioned PDF and single file markdown at http://o2r.info/erc-spec/erc-spec.pdf respectively http://o2r.info/erc-spec/erc-spec.md

CLI tool

Create a tool that allows running an ERC from a command-line interface:

erc create /directory /erc-dir
erc package /erc-dir my_research.zip
erc reproduce my_research.zip

Do it with golang :-).
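The command layout above can be prototyped quickly before committing to an implementation language. The sketch below uses Python's `argparse` for brevity, although the issue proposes Go; it only parses the three subcommands and does not implement them. The argument names (`workspace`, `erc_dir`, `archive`) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the proposed `erc` subcommands (parsing only)."""
    parser = argparse.ArgumentParser(prog="erc")
    commands = parser.add_subparsers(dest="command", required=True)

    create = commands.add_parser("create", help="create an ERC from a workspace")
    create.add_argument("workspace")
    create.add_argument("erc_dir")

    package = commands.add_parser("package", help="package an ERC directory")
    package.add_argument("erc_dir")
    package.add_argument("archive")

    reproduce = commands.add_parser("reproduce", help="re-run a packaged ERC")
    reproduce.add_argument("archive")

    return parser


args = build_parser().parse_args(["package", "/erc-dir", "my_research.zip"])
print(args.command, args.erc_dir, args.archive)
```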

Add possibility to "sign" ERCs

An ERC must be subject to human inspection.

How can we model and trace the involved people in a way that is open to scrutiny?

Examples: reviewers "sign" an ERC which they examined and evaluated, librarians add their signature on receiving and checking a submission to an archive, an author does a self-check and confirms "to his best knowledge" the ERC is OK. Could possibly be done by storing files in the .erc directory.

And blockchains are supposed to be good for this stuff, too...

Make use of Discover, Examine, Create on top level

We just discussed:

Discover
Examine
-- Check
-- Inspect
-- Manipulate
-- Substitute
Create

as top level interactions and hence items for the spec.

Tasks for dev branch:

add terms to glossary
use check instead of validate in spec
replace "validation" with "bag validation" whenever this is meant, to avoid confusion

User guide improvements

add more specifics on the topic of checking image outputs

OCI support

The Open Container Initiative (OCI) develops an open specification for container runtime and image, see https://github.com/opencontainers/

The image spec (currently) has flexible annotations, which we can use.

Must analyse the actual differences to Docker first.