Workflow Run RO-Crate profile
researchobject / workflow-run-crate
Home Page: https://www.researchobject.org/workflow-run-crate/
License: Apache License 2.0
The citation for the profiles is very hard to find. We also do not have an easily reachable CFF file or BibTeX entry for the specific profiles.
@dataset{workflow_run_ro_crate_working_group_2024_12159311,
  author    = {Workflow Run RO-Crate working group},
  title     = {Workflow Run Crate specification},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  version   = {0.5},
  doi       = {10.5281/zenodo.12159311},
  url       = {https://doi.org/10.5281/zenodo.12159311}
}
If available, record the application's resource requirements. This is useful to those who want to reproduce the run. In CWL, this is provided via ResourceRequirement. We could use properties like memoryRequirements or storageRequirements for this.
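A minimal sketch of what this could look like on a tool's entity (the values and the mapping from CWL's ResourceRequirement are illustrative assumptions, not part of the profile):

{
    "@id": "packed.cwl#sorttool.cwl",
    "@type": "SoftwareApplication",
    "name": "sorttool (illustrative resource values)",
    "memoryRequirements": "4 GiB",
    "storageRequirements": "10 GiB of scratch space"
}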
What is the environment/container file used in a specific workflow execution step?
This is similar to the configuration file problem (#11). It needs environment dump support from the workflow engine.
We should track which requirement/competency question (CQ) maps to each profile, in order to determine whether everything is covered.
How should we model workflows where one or more steps are executed only if a condition is verified (e.g., the when clause in CWL)?
What is the script used to wrap up a software component?
We're mapping tool wrappers (e.g., foo.cwl) to SoftwareApplication. Wrappers at lower levels can also be SoftwareApplication, but we need to draw the line somewhere (related to container image).
Knowing how workflow parameters were passed to individual tools is important to find out how they affected the outputs.
We are currently linking workflow and tool parameters with connectedTo from the source tool/workflow to the target tool/workflow. For instance, in revsort we currently have:
{
    "@id": "packed.cwl#revtool.cwl",
    "@type": "SoftwareApplication",
    "input": [
        {"@id": "packed.cwl#revtool.cwl/input"}
    ],
    "output": [
        {"@id": "packed.cwl#revtool.cwl/output"}
    ]
},
{
    "@id": "packed.cwl#sorttool.cwl",
    "@type": "SoftwareApplication",
    "input": [
        {"@id": "packed.cwl#sorttool.cwl/reverse"},
        {"@id": "packed.cwl#sorttool.cwl/input"}
    ],
    "output": [
        {"@id": "packed.cwl#sorttool.cwl/output"}
    ]
},
{
    "@id": "packed.cwl#revtool.cwl/output",
    "@type": "FormalParameter",
    "connectedTo": {"@id": "packed.cwl#sorttool.cwl/input"}
}
but that's inaccurate, since such links only exist within the revsort workflow. packed.cwl#revtool.cwl and packed.cwl#sorttool.cwl represent standalone software tools that happen to be connected this way in revsort, but might be used differently in another workflow.
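One possible workflow-scoped alternative (a sketch only; the ParameterConnection type and the connection, sourceParameter and targetParameter properties are assumed here as candidate terms, and are not what the converter currently emits) would be to attach the links to the workflow rather than to the standalone tools:

{
    "@id": "packed.cwl#main",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow", "HowTo"],
    "connection": [
        {"@id": "#connection-1"}
    ]
},
{
    "@id": "#connection-1",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#revtool.cwl/output"},
    "targetParameter": {"@id": "packed.cwl#sorttool.cwl/input"}
}

This way packed.cwl#revtool.cwl and packed.cwl#sorttool.cwl stay reusable, and the connection only exists in the context of the workflow that declares it.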
Workflows can run other workflows as subworkflows. CWLProv outputs separate provenance documents in this case, but such runs are not yet supported in cwlprov_to_crate. Functionally, we need to add the capability to parse the provenance metadata in this scenario. Then there's the issue of adding subworkflow metadata to the RO-Crate. In the relationship graph, subworkflows need to appear in the same place as tool wrappers (what's run by a step). Their type should be the same as the main workflow, minus File, since they are stored as sections in packed.cwl:
["SoftwareSourceCode", "ComputationalWorkflow", "HowTo"]
Then we'd need to recursively convert all subworkflows as we did for the main one.
One possibly weird consequence is that some of the workflow components would be SoftwareApplications (the tool wrappers) while others would be of type SoftwareSourceCode (the subworkflows). I guess the reason for the presence of both entities in Schema.org is that the former should model an executable, while the latter should represent code that needs to be compiled. With interpreted languages such as CWL (or Python, etc.), however, the source code is also runnable, so the distinction is not so meaningful.
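A minimal sketch of how a subworkflow entity might then look (the packed.cwl#subwf.cwl identifier and the name are made up; the programmingLanguage reference follows the usual Workflow RO-Crate convention for CWL):

{
    "@id": "packed.cwl#subwf.cwl",
    "@type": ["SoftwareSourceCode", "ComputationalWorkflow", "HowTo"],
    "name": "Illustrative subworkflow",
    "programmingLanguage": {"@id": "https://w3id.org/workflowhub/workflow-ro-crate#cwl"}
}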
https://github.com/common-workflow-language/cwl-v1.2/blob/1.2.1_proposed/CONFORMANCE_TESTS.md
@mr-c will test creation of the CWLProv RO Bundles (fixing any issues found), and then he will help @simleo create his own for testing with cwlprov_to_crate
How long does this workflow take to run?
The actual duration of the represented workflow run can be obtained from endTime - startTime on the CreateAction. Providing an estimate of the typical running time, on the other hand, is a different thing. Can we use totalTime for that? Or a more specific custom property like estimatedRunningTime? And do workflow languages have annotation fields for this?
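For the actual duration, a sketch of the relevant part of the CreateAction (the identifier and timestamps are made-up example values):

{
    "@id": "#run-1",
    "@type": "CreateAction",
    "name": "Illustrative run with example timestamps",
    "instrument": {"@id": "packed.cwl"},
    "startTime": "2024-06-18T09:00:00Z",
    "endTime": "2024-06-18T09:42:10Z"
}

The duration would then be derived by consumers as endTime - startTime rather than stored explicitly.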
What are the configuration files used in a workflow execution step?
ChooseAction? Though maybe the crate generator should just merge the params with the other ones if it can parse the config file. To link to the config file as a black box instead we probably need a new property.
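As an interim sketch (an assumption, not a profile recommendation; the #step-run-1 and config.yaml identifiers are made up), the config file could simply be listed among the action's inputs until a dedicated property exists:

{
    "@id": "#step-run-1",
    "@type": "CreateAction",
    "object": [
        {"@id": "inputs/input.txt"},
        {"@id": "config.yaml"}
    ]
},
{
    "@id": "config.yaml",
    "@type": "File",
    "description": "Configuration file passed to the step (illustrative example)"
}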
How do we link to secondary files, e.g. CWL's secondaryFile?
How much memory/cpu/disk was used in run?
What container images (e.g., Docker) were used by the run?
The image could be described as a File if it is a tarball produced by docker save.
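A sketch of how a Docker image pulled from a registry could be described, assuming a ContainerImage type and containerImage, registry and tag properties (the property placement on the action and all values are assumptions):

{
    "@id": "#step-run-1",
    "@type": "CreateAction",
    "containerImage": {"@id": "#docker-image-1"}
},
{
    "@id": "#docker-image-1",
    "@type": "ContainerImage",
    "name": "Illustrative container image",
    "registry": "docker.io",
    "tag": "latest"
}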
Was the execution successful?
We can set actionStatus to FailedActionStatus or CompletedActionStatus, and can also provide an error.
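A minimal sketch for a failed run (the identifier and the error message are illustrative):

{
    "@id": "#run-1",
    "@type": "CreateAction",
    "actionStatus": {"@id": "http://schema.org/FailedActionStatus"},
    "error": "sorttool exited with a non-zero status (illustrative message)"
}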
From #73 (comment)
I think if the terms are defined now properly in the roterms namespace (incl. HTML!) we no longer need to define them individually in the profile crates, just refer to the DefinedTermSet alone. Will raise as new issue for 0.6, that's just housekeeping, no harm in the current way except duplication.
So the suggestion is to remove each DefinedTerm, to avoid the sublisting in https://github.com/ResearchObject/workflow-run-crate/blob/main/docs/profiles/0.6-DRAFT/process_run_crate/ro-crate-metadata.json#L329, and each DefinedTerm reference, as that would essentially be re-declaring what is now deployed on https://w3id.org/ro/terms/workflow-run; we no longer need ad-hoc definitions in our Profile Crates. Instead there will be a DefinedTermSet reference only, as specified in https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/profiles.html#extension-vocabularies
All profiles should resolve, at least to their context file:
curl -sH "Accept:application/ld+json" -L https://w3id.org/ro/wfrun/provenance
returns just HTML, which is not ideal
Could be for the entire workflow and/or each individual step (for the case of distributed execution)
In the ro-crate-metadata.json metadata file, how would we represent this for an entire workflow? How would we represent this for a specific step? (@mr-c has heard about instrument identifiers, but knows nothing about that beyond the existence of the concept)
What is the source code version of the component executed in a workflow step?
We can use softwareVersion, though getting the version of the actual tool (e.g., grep) that was called by the wrapper might not be easy (related to container image).
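A sketch, with a made-up version value, of how softwareVersion could be recorded on the wrapper's entity:

{
    "@id": "packed.cwl#revtool.cwl",
    "@type": "SoftwareApplication",
    "name": "revtool",
    "softwareVersion": "1.0"
}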
Add a comment to this issue to join the Workflow Run RO-Crate working group. Please indicate your ORCID, if you have one.
We coordinate using the channel #ro-crate on seek4science.slack.com (join) and the RO-Crate mailing list.
Workflow Run RO-Crate is an RO-Crate profile, i.e., a specialization of RO-Crate to a set of use cases. You can join the RO-Crate community at large here.
Both in the prospective provenance (what's the name of the variable that a workflow or tool needs?) and the retrospective one (what was the value?).
Also, how to hide the value if it's sensitive.
Brought up by @jmfernandez
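A sketch of how the two sides could be expressed, using revsort's reverse_sort input as an example (identifiers and the value are illustrative): the prospective side is a FormalParameter carrying the name, and the retrospective side is a PropertyValue carrying the actual value and pointing back to it via exampleOfWork.

{
    "@id": "packed.cwl#main/reverse_sort",
    "@type": "FormalParameter",
    "name": "reverse_sort",
    "additionalType": "Boolean"
},
{
    "@id": "#param-value-reverse_sort",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "packed.cwl#main/reverse_sort"},
    "name": "reverse_sort",
    "value": "true"
}

For sensitive values, the PropertyValue could be omitted or its value redacted, keeping only the prospective FormalParameter.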