Workflow Run RO-Crate profile
researchobject / workflow-run-crate
Home Page: https://www.researchobject.org/workflow-run-crate/
License: Apache License 2.0
The citation for the profiles is very hard to find. We also do not have an easily reachable CFF file or BibTeX entry for the specific profiles.
@dataset{workflow_run_ro_crate_working_group_2024_12159311,
  author    = {Workflow Run RO-Crate working group},
  title     = {Workflow Run Crate specification},
  month     = jun,
  year      = 2024,
  publisher = {Zenodo},
  version   = {0.5},
  doi       = {10.5281/zenodo.12159311},
  url       = {https://doi.org/10.5281/zenodo.12159311}
}
If available, record the application's resource requirements. This is useful to those who want to reproduce the run. In CWL, this is provided via ResourceRequirement. We could use properties like memoryRequirements or storageRequirements for this.
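A minimal sketch of what this could look like on a tool's entity (the values and the mapping from CWL's ResourceRequirement are illustrative assumptions, not part of the profile):

{
    "@id": "packed.cwl#sorttool.cwl",
    "@type": "SoftwareApplication",
    "name": "sorttool (illustrative resource values)",
    "memoryRequirements": "4 GiB",
    "storageRequirements": "10 GiB of scratch space"
}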
What is the environment/container file used in a specific workflow execution step?
This is similar to the configuration file problem (#11). It needs environment dump support from the workflow engine.
We should track which requirement/competency question (CQ) maps to each profile, in order to determine whether everything is covered.
How should we model workflows where one or more steps are executed only if a condition is verified (e.g., the when clause in CWL)?
What is the script used to wrap up a software component?
We're mapping tool wrappers (e.g., foo.cwl) to SoftwareApplication. Wrappers at lower levels can also be SoftwareApplication, but we need to draw the line somewhere (related to container image).
Knowing how workflow parameters were passed to individual tools is important to find out how they affected the outputs.
We are currently linking workflow and tool parameters with connectedTo from the source tool/workflow to the target tool/workflow. For instance, in revsort we currently have:
{
    "@id": "packed.cwl#revtool.cwl",
    "@type": "SoftwareApplication",
    "input": [
        {"@id": "packed.cwl#revtool.cwl/input"}
    ],
    "output": [
        {"@id": "packed.cwl#revtool.cwl/output"}
    ]
},
{
    "@id": "packed.cwl#sorttool.cwl",
    "@type": "SoftwareApplication",
    "input": [
        {"@id": "packed.cwl#sorttool.cwl/reverse"},
        {"@id": "packed.cwl#sorttool.cwl/input"}
    ],
    "output": [
        {"@id": "packed.cwl#sorttool.cwl/output"}
    ]
},
{
    "@id": "packed.cwl#revtool.cwl/output",
    "@type": "FormalParameter",
    "connectedTo": {"@id": "packed.cwl#sorttool.cwl/input"}
}
but that's inaccurate, since such links only exist within the revsort workflow. packed.cwl#revtool.cwl and packed.cwl#sorttool.cwl represent standalone software tools that happen to be connected this way in revsort, but might be used differently in another workflow.
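One possible workflow-scoped alternative (a sketch only; the ParameterConnection type and the connection, sourceParameter and targetParameter properties are assumed here as candidate terms, and are not what the converter currently emits) would be to attach the links to the workflow rather than to the standalone tools:

{
    "@id": "packed.cwl#main",
    "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow", "HowTo"],
    "connection": [
        {"@id": "#connection-1"}
    ]
},
{
    "@id": "#connection-1",
    "@type": "ParameterConnection",
    "sourceParameter": {"@id": "packed.cwl#revtool.cwl/output"},
    "targetParameter": {"@id": "packed.cwl#sorttool.cwl/input"}
}

This way packed.cwl#revtool.cwl and packed.cwl#sorttool.cwl stay reusable, and the connection only exists in the context of the workflow that declares it.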
Workflows can run other workflows as subworkflows. CWLProv outputs separate provenance documents in this case, but such runs are not yet supported in cwlprov_to_crate. Functionally, we need to add the capability to parse the provenance metadata in this scenario. Then there's the issue of adding subworkflow metadata to the RO-Crate. In the relationship graph, subworkflows need to appear in the same place as tool wrappers (what's run by a step). Their type should be the same as the main workflow, minus File, since they are stored as sections in packed.cwl:
["SoftwareSourceCode", "ComputationalWorkflow", "HowTo"]
Then we'd need to recursively convert all subworkflows as we did for the main one.
One possibly weird consequence is that some of the workflow components would be SoftwareApplications (the tool wrappers) while others would be of type SoftwareSourceCode (the subworkflows). I guess the reason for the presence of both entities in Schema.org is that the former should model an executable, while the latter should represent code that needs to be compiled. With interpreted languages such as CWL (or Python, etc.), however, the source code is also runnable, so the distinction is not so meaningful.
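A minimal sketch of how a subworkflow entity might then look (the packed.cwl#subwf.cwl identifier and the name are made up; the programmingLanguage reference follows the usual Workflow RO-Crate convention for CWL):

{
    "@id": "packed.cwl#subwf.cwl",
    "@type": ["SoftwareSourceCode", "ComputationalWorkflow", "HowTo"],
    "name": "Illustrative subworkflow",
    "programmingLanguage": {"@id": "https://w3id.org/workflowhub/workflow-ro-crate#cwl"}
}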
https://github.com/common-workflow-language/cwl-v1.2/blob/1.2.1_proposed/CONFORMANCE_TESTS.md
@mr-c will test creation of the CWLProv RO Bundles (fixing any issues found), and then he will help @simleo create his own for testing with cwlprov_to_crate
How long does this workflow take to run?
The actual duration of the represented workflow run can be obtained from endTime - startTime on the CreateAction. Providing an estimate of the typical running time, on the other hand, is a different thing. Can we use totalTime for that? Or a more specific custom property like estimatedRunningTime? And do workflow languages have annotation fields for this?
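For the actual duration, a sketch of the relevant part of the CreateAction (the identifier and timestamps are made-up example values):

{
    "@id": "#run-1",
    "@type": "CreateAction",
    "name": "Illustrative run with example timestamps",
    "instrument": {"@id": "packed.cwl"},
    "startTime": "2024-06-18T09:00:00Z",
    "endTime": "2024-06-18T09:42:10Z"
}

The duration would then be derived by consumers as endTime - startTime rather than stored explicitly.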
What are the configuration files used in a workflow execution step?
ChooseAction? Though maybe the crate generator should just merge the params with the other ones if it can parse the config file. To link to the config file as a black box instead we probably need a new property.
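As an interim sketch (an assumption, not a profile recommendation; the #step-run-1 and config.yaml identifiers are made up), the config file could simply be listed among the action's inputs until a dedicated property exists:

{
    "@id": "#step-run-1",
    "@type": "CreateAction",
    "object": [
        {"@id": "inputs/input.txt"},
        {"@id": "config.yaml"}
    ]
},
{
    "@id": "config.yaml",
    "@type": "File",
    "description": "Configuration file passed to the step (illustrative example)"
}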
How do we link to secondary files, e.g. CWL's secondaryFile?
How much memory/cpu/disk was used in run?
What container images (e.g., Docker) were used by the run?
The image could be described as a File if it is a tarball produced by docker save.
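A sketch of how a Docker image pulled from a registry could be described, assuming a ContainerImage type and containerImage, registry and tag properties (the property placement on the action and all values are assumptions):

{
    "@id": "#step-run-1",
    "@type": "CreateAction",
    "containerImage": {"@id": "#docker-image-1"}
},
{
    "@id": "#docker-image-1",
    "@type": "ContainerImage",
    "name": "Illustrative container image",
    "registry": "docker.io",
    "tag": "latest"
}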
Was the execution successful?
We can set actionStatus to FailedActionStatus or CompletedActionStatus, and can also provide an error.
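A minimal sketch for a failed run (the identifier and the error message are illustrative):

{
    "@id": "#run-1",
    "@type": "CreateAction",
    "actionStatus": {"@id": "http://schema.org/FailedActionStatus"},
    "error": "sorttool exited with a non-zero status (illustrative message)"
}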
From #73 (comment)
I think if the terms are defined now properly in the roterms namespace (incl. HTML!) we no longer need to define them individually in the profile crates, just refer to the DefinedTermSet alone. Will raise as new issue for 0.6, that's just housekeeping, no harm in the current way except duplication.
So the suggestion is to remove each DefinedTerm, to avoid the sublisting in https://github.com/ResearchObject/workflow-run-crate/blob/main/docs/profiles/0.6-DRAFT/process_run_crate/ro-crate-metadata.json#L329, and each DefinedTerm reference, as that would essentially be re-declaring what is now deployed on https://w3id.org/ro/terms/workflow-run; we no longer need ad-hoc definitions in our Profile Crates. Instead there will be a DefinedTermSet reference only, as specified in https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/profiles.html#extension-vocabularies
All profiles should resolve, at least to their context file:
curl -sH "Accept:application/ld+json" -L https://w3id.org/ro/wfrun/provenance
returns just HTML, which is not ideal
Could be for the entire workflow and/or each individual step (for the case of distributed execution)
In the ro-crate-metadata.json metadata file, how would we represent this for an entire workflow? How would we represent this for a specific step? (@mr-c has heard about instrument identifiers, but knows nothing about that beyond the existence of the concept)
What is the source code version of the component executed in a workflow step?
We can use softwareVersion, though getting the version of the actual tool (e.g., grep) that was called by the wrapper might not be easy (related to container image).
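A sketch, with a made-up version value, of how softwareVersion could be recorded on the wrapper's entity:

{
    "@id": "packed.cwl#revtool.cwl",
    "@type": "SoftwareApplication",
    "name": "revtool",
    "softwareVersion": "1.0"
}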
Add a comment to this issue to join the Workflow Run RO-Crate working group. Please indicate your ORCID, if you have one.
We coordinate using the channel #ro-crate on seek4science.slack.com (join) and the RO-Crate mailing list.
Workflow Run RO-Crate is an RO-Crate profile, i.e., a specialization of RO-Crate to a set of use cases. You can join the RO-Crate community at large here.
Both in the prospective provenance (what's the name of the variable that a workflow or tool needs?) and the retrospective one (what was the value?).
Also, how to hide the value if it's sensitive.
Brought up by @jmfernandez
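A sketch of how the two sides could be expressed, using revsort's reverse_sort input as an example (identifiers and the value are illustrative): the prospective side is a FormalParameter carrying the name, and the retrospective side is a PropertyValue carrying the actual value and pointing back to it via exampleOfWork.

{
    "@id": "packed.cwl#main/reverse_sort",
    "@type": "FormalParameter",
    "name": "reverse_sort",
    "additionalType": "Boolean"
},
{
    "@id": "#param-value-reverse_sort",
    "@type": "PropertyValue",
    "exampleOfWork": {"@id": "packed.cwl#main/reverse_sort"},
    "name": "reverse_sort",
    "value": "true"
}

For sensitive values, the PropertyValue could be omitted or its value redacted, keeping only the prospective FormalParameter.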