biocompute-objects / bco_documentation

Repository for documentation to support the IEEE 2791-2020 standard. Please see our home page for communications/publications:

Home Page: http://biocomputeobject.org/

License: BSD 3-Clause "New" or "Revised" License

Python 16.60% Shell 3.01% HTML 54.86% CSS 23.08% JavaScript 2.45%

standardization bioinformatics workflow science-communication hts-computations

bco_documentation's People

rahi13 jpat1546 fochtmanb stain joshuagay john-clarke gitter-badger gaybro8777 eethomp syntheticgio liambirt

bco_documentation's Issues

Are arbitrary extra keys allowed?

In the example for structured_name the arbitrary key taxonomy is introduced. (See #14)

Several other examples also use extra keys.

It might be powerful to allow BCO extensions to add more fields, but I thought we already had extension_domain for that.

It must be declared if arbitrary extra keys are allowed or not (and where). If they are allowed I would recommend they are namespaced so that they are not in conflict across vendors or future BCO versions.

Recommend UUIDs for BCO_id?

The field BCO_id is defined as

A unique identifier that should be applied to each BCO instance. These can be assigned by a BCO database engine. IDs should be URIs (expressed as a URN or URL). IDs should never be reused.

Hiroki Morizono suggested that we recommend using UUIDs (sometimes called GUIDs) as they are easy to generate and also to keep unique.

UUIDs can be URNs, e.g. urn:uuid:2bf8397b-9aa8-47f2-80a7-235653e8e824 (which are then not resolvable) or be used as part of an in-house identifier, http://repo.example.com/bco/2bf8397b-9aa8-47f2-80a7-235653e8e824

I don't think we should mandate which way, although I prefer the second form as it can be resolvable (e.g. click the hyperlink). We should probably only have a soft recommendation for UUIDs, something like:

It is RECOMMENDED that the BCO identifier is based on a UUID to ensure uniqueness, either as a location-independent URN (e.g. urn:uuid:2bf8397b-9aa8-47f2-80a7-235653e8e824) or as part of an identifier permalink, (e.g. http://repo.example.com/bco/2bf8397b-9aa8-47f2-80a7-235653e8e824)
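
A minimal sketch of how both identifier forms could be produced from one UUID, using Python's standard uuid module. The base URL is the hypothetical repository address from the example above, and new_bco_ids is an illustrative helper name, not part of any BCO tooling.

```python
import uuid

# Hypothetical repository base URL, copied from the example in this issue.
REPO_BASE = "http://repo.example.com/bco/"


def new_bco_ids():
    """Return (URN form, permalink form) built from one freshly generated UUID."""
    u = uuid.uuid4()  # random (version 4) UUID
    return u.urn, REPO_BASE + str(u)


urn_id, permalink_id = new_bco_ids()
print(urn_id)        # location-independent form, starts with urn:uuid:
print(permalink_id)  # resolvable form, if the repository serves that path
```

Either form satisfies "IDs should be URIs"; only the second can be made resolvable.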

A related question would be if a change of provenance_domain/version means the BCO_id should be changed or not.

provide 2 platform specific structures for the IO domain

Starting with the platforms of interest to FDA: HIVE & CWL

Other platforms can be defined by each vendor/language community.

Update environment_variables schema

@HadleyKing

environment_variables schema:
    "environment_variables": {
      "type": "object",
      "description": "Environmental parameters that are useful to configure the execution environment on the target platform.",
      "additionalProperties": false,
      "patternProperties": {
        "^[a-zA-Z_]+[a-zA-Z0-9_]*$": {
          "type": "string"
        }
      }
    }
The regex is based on the following:

http://pubs.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap08.html
Environment variable names used by the utilities in the Shell and Utilities volume of IEEE Std 1003.1-2001 consist solely of uppercase letters, digits, and the '_' (underscore) from the characters defined in Portable Character Set and do not begin with a digit. Other characters may be permitted by an implementation; applications shall tolerate the presence of such names. Uppercase and lowercase letters shall retain their unique identities and shall not be folded together. The name space of environment variable names containing lowercase letters is reserved for applications. Applications can define any environment variables with names from this name space without modifying the behavior of the standard utilities.
Note:
Other applications may have difficulty dealing with environment variable names that start with a digit. For this reason, use of such names is not recommended anywhere.
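
To show what the patternProperties regex accepts and rejects, here is a small sketch that applies the same pattern in plain Python. It is a stand-in for a real JSON Schema validator; invalid_env_names and the sample variables are made up for illustration.

```python
import re

# Same pattern as in the proposed schema above.
ENV_NAME = re.compile(r"^[a-zA-Z_]+[a-zA-Z0-9_]*$")


def invalid_env_names(environment_variables):
    """Return the variable names that the schema pattern would reject."""
    return [
        name
        for name in environment_variables
        if not ENV_NAME.fullmatch(name)
    ]


# Sample data (illustrative only).
example = {
    "HOSTTYPE": "x86_64-linux",
    "editor": "vim",    # lowercase is allowed by the pattern
    "2FAST": "1",       # starts with a digit
    "MY-VAR": "x",      # contains a dash
}
print(invalid_env_names(example))
```

Because the schema also sets "additionalProperties": false, any name that fails the pattern would be reported as an unexpected property by a validator.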

script and script_access_type

I would drop script_access_type and define script simply as a URI:

https://tools.ietf.org/html/rfc8089
file://path/to/rfc8089.txt
"script": {
    "$ref": "#/definitions/uri"
}

Allow relative URI in input_list/output_list "address"

input_list (and output_list) uses keys address and access_time which are not explained.

The text says:

...expressed as a URN or URL

However Joseph Nooraga comments:

Needs clarity. Is this indicating that all data being used needs to be addressable via HTTP, and should remain so for the life of the BCO?

@rajamazumder responded:

"or a unique location in a file system"? I forgot what the discussion around this was.

I think we do need to permit relative references here, see for example Dataset_BCO_example that uses relative URIs:

"input_list":[
  "human_protein_position_pmid_id_aminoacid_glytoucan_2018_09_04_07_51_27.txt"
],

To avoid BCO parsers having to second-guess if h:/file.txt is a URI or a file location we should say that this must be an absolute URI or a relative URI reference. If we say it is always like that it means for instance that spaces in filenames are always URI escaped and have / forward slashes:

```json
"input_list":[
   "nested%20folder/file_with_50%25_percent.txt"
],
```

It must be made clear that the relative URIs are relative to the location of the BCO JSON file and that file names must be assumed to be case-sensitive.

If this is found in D:\Submissions\bco15\bco.json then this would mean the file D:\Submissions\bco15\nested folder\file_with_50%_percent.txt - or file:///d:/Submissions/bco15/nested%20folder/file_with_50%25_percent.txt as an absolute (but local) file: URI
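
A short sketch of that resolution using Python's urllib.parse, assuming the BCO file location is already known as a file: URI. The variable names are illustrative.

```python
from urllib.parse import unquote, urljoin

# Location of the BCO JSON file, expressed as an absolute file: URI.
bco_location = "file:///d:/Submissions/bco15/bco.json"

# Relative URI reference as it would appear in input_list.
reference = "nested%20folder/file_with_50%25_percent.txt"

# Resolve the reference against the BCO location.
absolute = urljoin(bco_location, reference)
print(absolute)

# Percent-decoding gives the actual folder and file name on disk.
print(unquote(reference))
```

The escaping stays intact during resolution; decoding only happens when the URI is finally mapped to a file system path.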

This issue relates to packaging and distribution of BCOs which is currently undefined.

UVP-BCO@c33e365 feedback

@HadleyKing following up on biocompute-objects/UVP-BCO@c33e365#commitcomment-31591403

pipeline_steps validation

Inserting "additionalProperties": false, into the pipeline_steps definition would catch errors such as the following prerequisites typo:

jsonschema.exceptions.ValidationError: Additional properties are not allowed ('prerequisites' was unexpected)
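
The effect can be illustrated without a schema library. The sketch below is a hand-rolled stand-in for what "additionalProperties": false does, not the jsonschema implementation; the allowed key set is only the subset of pipeline_steps keys mentioned in this tracker, and the authoritative list lives in the schema.

```python
# Illustrative subset of pipeline_steps keys; the schema is authoritative.
ALLOWED_STEP_KEYS = {"step_number", "prerequisite", "input_list", "output_list"}


def unexpected_keys(step, allowed=ALLOWED_STEP_KEYS):
    """Return keys that a closed ('additionalProperties': false) object would reject."""
    return sorted(set(step) - allowed)


# 'prerequisites' is the typo from the error message above.
step = {"step_number": 1, "prerequisites": [], "input_list": []}
print(unexpected_keys(step))
```

Without the closed object, the misspelled key would be silently ignored and the intended prerequisite would simply appear to be missing.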

uri-in-uri stutter

Fields such as the following could be flattened using the JSON Schema allOf property:

"software_prerequisites": [{
    "name": "BEDtools",
    "version": "2.17.0",
    "uri": {
        "uri": "http://example.com/example"
    }
}]

"software_prerequisites": {
  "type": "array",
  "description": "Minimal necessary prerequisites, library, tool versions needed to successfully run the script to produce BCO.",
  "items": {
    "allOf": [
      { "$ref": "biocomputeobject.json#/definitions/uri" },
      {
        "type": "object",
        "description": "A necessary prerequisite, library, or tool version.",
        "required": [ "name", "version" ],
        "additionalProperties": false,
        "properties": {
          "name": {
            "type": "string",
            "description": "Names of software prerequisites",
            "examples": [ "HIVE-hexagon" ]
          },
          "version": {
            "type": "string",
            "description": "Versions of the software prerequisites",
            "examples": [ "babajanian.1" ]
          }
        }
      }
    ]
  }
},

"software_prerequisites": [{
    "name": "BEDtools",
    "version": "2.17.0",
    "uri": "http://example.com/example"
}]

What is a derived_from objectId

derived_from says:

If the object is derived from another, this field will specify the parent object, in the form of the ‘objectid’. If the object inherits only from the base BioCompute Object or a type definition than the value here is null.

It is unclear what an objectid is. Is this different from the BCO_id of the other BCO? (which we said was a URI)

The example is shown with null - rather this should be shown with a value.

top level _type, _id, _inherits, name, title and description needs defining

https://github.com/biocompute-objects/BCO_Specification/blob/master/top-level.md

Define keys of review object

The review object is only shown by example.

The sub-keys reviewer_comment and reviewer must be further defined. For instance, why is reviewer_comment a list?

The possible values for status should be defined as a bullet-point list rather than inline in the text, so that it is clear what the only possible values are.

Clarify script_access_type vs script

script_access_type feels a bit contrived, as it is defining the type of script key, making it hard to validate/use alone.

The script key however does not explain at all how an inline script can be used.

Other parts of BCO use sub-objects like

 "source": {
    "address": "http://example.com/file.txt"
}

It might make sense to do a similar approach here, where one can provide either address for URI (potentially relative #23 for files) or value for inline scripts (but preferably not both!) - thus one can remove script_access_type

BioCompute Object Consortium members (BCOC)

I think BioCompute Object Consortium members (BCOC) should go at the end of the MD doc

remove 'additionalProperties: false'

"TODO: remove 'additionalProperties: false' after handling inherited types is resolved. In the meantime it was set to avoid forgetting to write a schema for a new property because it is silently ignored during validation"

This should be done after resolving #11

script_driver values not clearly defined

script_driver defines a couple of example values:

hive, cwl-runner, shell.

The text needs to be refined to clearly define the possible values (and eventual extensions), rather than listing these as examples out of thin air.

This sentence feels out of place:

It is noteworthy to mention that scripts and script drivers by themselves can be objects. These objects can exist in internal (BCO) or external databases and be publicly or privately accessible.

What does this mean? We can instead use a { json } object here with undefined keys? Suggest to remove.

Domain Prerequisites → external_data_endpoints

https://github.com/biocompute-objects/BCO_Specification/blob/master/execution-domain.md#257-domain-prerequisites-domain_prerequisites

Remove multi_value in "reviewer_comment”

From the following:

Recommend Semantic Versioning for version field?

The version field defines briefly what constitutes a change of a BCO.

We should recommend using https://semver.org/ (e.g. 1.2.0) so that there are also clear semantics on how to compare version numbers to determine which BCO is "newer" and in what way.
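
As a sketch of why the comparison semantics matter: comparing MAJOR.MINOR.PATCH versions as numbers gives a different answer than comparing them as strings. The helper names are illustrative, and pre-release or build suffixes (such as -draft1) are deliberately not handled here.

```python
def parse_version(version):
    """Split a plain MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def is_newer(candidate, current):
    """True if candidate has higher precedence than current."""
    return parse_version(candidate) > parse_version(current)


print(is_newer("1.10.0", "1.2.0"))  # numeric comparison
print("1.10.0" > "1.2.0")           # naive string comparison disagrees
```

A difference in the first component signals an incompatible change, which is relevant to the question of whether BCO_id should change as well.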

As pointed out in #8 we also need to be clear if such a change should constitute a change of BCO_id or not.

Explain template language

The usability_domain seems to use some kind of template language with examples like [SO:0000694].

I think this is what is alluded to in external-references - but it must be made clear for each field where such [expansions] are allowed/expected or not, as well as defining their syntax.

Execution Domain

Issues from execution-domain

Maybe script_access_type and script properties can be tied together (#26)
script.uri
URI values require regex
pipeline_version should not be required (is listed as required in schema)
(maybe it is not necessary here)
platform maybe not needed in this domain

In general, this domain needs more refinement

domain_prerequisites should be renamed (maybe "data_endpoint")
domain_prerequisites.name should also be renamed
regex for domain_prerequisites.url
env_parameters should be a simple dictionary of a list of dictionaries
Should avoid saying "Environmental parameters"; it should be "environment variables"

How are templates expanded in structured_name?

structured_name seems to define a simple template system:

This field can refer to other fields within the same or other objects. For example, a string like "HCV1a [taxonomy:$taxonomy] mutation detection"

It is not defined what the syntax for the $magic variable is, nor whether this is restricted to looking up direct neighbouring keys of provenance_domain.

The example also seems to imply that adding arbitrary non-namespaced keys like taxonomy is allowed, but nowhere else is this clarified.

This should be clarified or removed.

License of the BCO Specification repository?

What should be the license of this BCO_Specification repository? Presumably we want this to be Open Access as it's on GitHub and our IEEE Open Source Pilot part will reference this repo?

Normally for documents Creative Commons BY 4.0 is a good choice, however it is not recommended for software.

A technical specification with various JSON examples is somewhat in-between a Document and Software. The line moves more towards Software when we introduce formal schemas that implementers might want to copy into their code.

So our license should presumably be something that is easy to integrate into other commercial and open source projects, like Apache License v2.0, which conveniently also covers contributions as well as protection against patent traps.

Note that for relicensing we should ideally ask for permission from every BCO copyright holder as the previous Google Docs document never had a license or Intellectual Property section. In reality only those who contributed "substantial work" (e.g. a paragraph) would own copyright.

Conflict of interest

Note that I am probably biased above. I am both an Apache Software Foundation member and on the Common Workflow Language leadership team, which uses the Apache License for the CWL specifications.

Consistent name for the specification

There are multiple variants for the specification name. Should there be an attempt to be consistent?

capitalization
plural/singular
with/without a dash
with/without "object"

Rephrase/remove confusing Seven Bridges script reference

script says:

This may be a reference to Galaxy Project or Seven Bridges Genomics pipeline, a Common Workflow Language (CWL) object in GitHub, a High-performance Integrated Virtual Environment (HIVE) computational service or any other type of script.

In a comment @mr-c says:

SBG exclusively used CWL, so that is redundant. This sentence confuses platform providers with workflow technologies/standards.

Suggestion is to remove or rephrase reference to Seven Bridges Genomics pipeline as a platform provider is different from a workflow technology.

(It might still make sense to link to the pipeline in SBG platform, but this link will most likely not be directly to the script and might need a different key)

ECO - Evidence and Conclusion Ontology

The error_domain should use ECO to describe the results.
This would mean updating the text as well as creating a good example. @rajamazumder and @openbox-bio are currently working on an example that may be a good test case.

Other suggestions/comments welcome.
Currently this is only described as follows:

The empirical error subdomain contains the limits of detectability, false positives, false negatives, statistical confidence of outcomes, etc. This can be measured by running the algorithm on multiple data samples of the usability domain or in carefully designed in-silico spiked data. For example, a set of spiked, well-characterized samples can be run through the algorithm to determine the false positives, negatives and limits of detection.
The algorithmic subdomain is descriptive of errors that originated by fuzziness of the algorithms, driven by stochastic processes, in dynamically parallelized multi-threaded executions, or in machine learning methodologies where the state of the machine can affect the outcome. This can be measured in repeatability experiments of multiple runs or using some rigorous mathematical modeling of the accumulated errors. For example: bootstrapping is frequently used with stochastic simulation based algorithms to accumulate sets of outcomes and estimate statistically significant variability for the results.

Maybe we could incorporate this elsewhere too?

biocomputeobjects.org/schema

@kee007ney
We should have something at biocomputeobjects.org/schema

digital signature: needs work

How to generate using MD5? (can't include a signature inside itself)

I recommend another field to specify the algorithm
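
One possible way around the self-reference problem, sketched under assumptions: compute the digest over the BCO with the signature field removed, and store the algorithm name next to the value. The shape of digital_signature (an object with algorithm and value keys) and the helper name compute_digest are hypothetical, and a plain hash like this is a checksum rather than a cryptographic signature.

```python
import hashlib
import json


def compute_digest(bco, algorithm="sha256"):
    """Hash the BCO content, excluding the (hypothetical) digital_signature key."""
    payload = {k: v for k, v in bco.items() if k != "digital_signature"}
    # Sorted keys and fixed separators give a repeatable serialization.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.new(algorithm, canonical.encode("utf-8")).hexdigest()


bco = {"BCO_id": "urn:uuid:2bf8397b-9aa8-47f2-80a7-235653e8e824", "name": "example"}
digest = compute_digest(bco)

# Store the algorithm alongside the value so verifiers know how to recompute it.
signed = dict(bco, digital_signature={"algorithm": "sha256", "value": digest})

# Verification recomputes the digest while ignoring the signature field itself.
assert compute_digest(signed) == digest
```

The same function works with "md5" as the algorithm argument, although any real recommendation would also need to pin down the canonical serialization.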

regex for datetime

From extra-validation-items-description_domain.md

xref.access_time: regex for date-time, but this may already be validated at the first level

[ ] investigate if/how the date-time is validated
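
As a starting point for that investigation, here is a sketch that parses the access_time form used in the examples with Python's standard datetime module. The helper name is illustrative. Note that RFC 3339, which the JSON Schema date-time format refers to, writes the offset with a colon (-05:00), so a strict validator may reject the -0500 form even though this parser accepts it.

```python
from datetime import datetime, timedelta


def parse_access_time(value):
    """Parse a timestamp such as 2017-01-24T09:40:17-0500; raises ValueError if malformed."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")


parsed = parse_access_time("2017-01-24T09:40:17-0500")
print(parsed.isoformat())

# The UTC offset is preserved on the parsed value.
assert parsed.utcoffset() == timedelta(hours=-5)
```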

Explain keys of prerequisite

prerequisite has inconsistent definition:

A list of text values to indicate any packages or prerequisites for running the tool used.

Yet it is shown as example with sub-keys that are not defined anywhere:

                    "prerequisite": [
                        {
                            "name": "Hepatitis C virus genotype 1", 
                            "source": {
                                "address": "http://www.ncbi.nlm.nih.gov/nuccore/22129792",
                                "access_time": "2017-01-24T09:40:17-0500"
                            }
                        }, 
                        {
                            "name": "Hepatitis C virus type 1b complete genome", 
                            "source": {
                                "address": "http://www.ncbi.nlm.nih.gov/nuccore/5420376",
                                "access_time": "2017-01-24T09:40:17-0500"
                            }
                        },

The text needs to explain what is the meaning of name, source, address and access_time. Some examples use other keys like uri version and sha1_chksum which are not defined anywhere.

If prerequisite is a free-for-all in terms of keys this should be defined, although it can be argued over how useful that will be.

Description domain

Issues from description_domain.md

xref.access_time
regex for datetime, but this may already be validated at the first level
pipeline_steps.tools
schema says it is required, spec text should say the same thing
pipeline_steps.tool
is a redundant property (pipeline_steps should be a flat list of step objects)
pipeline_steps.step_number
is an integer in the schema but the spec text has it as a string
pipeline_steps.prerequisite.uri.access, pipeline_steps.prerequisite.uri.address
regex required
pipeline_steps.input_list,pipeline_steps.output_list
regex for url values of these arrays

Develop a spec release protocol

We need to develop a spec release protocol so we can update in an orderly fashion. A few suggestions for what it should contain to start:

method explained for versioning based on Semantic Versioning and utilizing Permanent Identifiers for the Web
Method for issue triage, i.e. how are issues/suggestions handled? What about ideas for future development or implementation?

Also, where should these policies be stated? In the README, in a new user guide, or in a new document?

Avoid null fields, but declare if optional or required

Several fields are documented with null in their examples.

derived_from
prerequisites
sha1_chksum (where is this documented?)

I think in general these fields should rather be optional, as otherwise we have some kind of distinction of a field missing vs. it being present and having null as value. Sometimes this is appropriate (e.g. unknown vs. nothing), but I don't think that is the intention here.

If some fields can be null and/or missing, then we need to be explicit about that, as parsers would need to handle that the value might not be there.

Fix inconsistencies in HCV1a.json example

The example HCV1a.json includes some keys not defined elsewhere (uri, sha1_chksum):

            {
                "name": "HIVE-heptagon", 
                "version": "albinoni.2",
                "uri": {
                    "address": "https://hive.biochemistry.gwu.edu/dna.cgi?cmd=dna-heptagon&cmdMode=-",
                    "access_time": "2017-01-24T09:40:17-0500",
                    "sha1_chksum": null
}

While the BCO spec does permit arbitrary keys for software_prerequisites, the example should only use values defined in the spec.

One error_domain is listed twice, with and without spaces (and different values!):

     "false positive mutation calls discovery": "<0.0005", 
     "false_positive_mutation_calls_discovery": "<0.00005",
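
Duplicates of this kind are easy to miss by eye. A rough sketch of a check that groups keys which become identical once spaces are replaced by underscores; colliding_keys is an illustrative helper, and the normalization rule is an assumption based on the lower_case_with_underscore style used elsewhere.

```python
from collections import defaultdict


def colliding_keys(mapping):
    """Group keys that are identical after normalizing case and spaces."""
    groups = defaultdict(list)
    for key in mapping:
        normalized = key.strip().lower().replace(" ", "_")
        groups[normalized].append(key)
    return {norm: keys for norm, keys in groups.items() if len(keys) > 1}


error_domain = {
    "false positive mutation calls discovery": "<0.0005",
    "false_positive_mutation_calls_discovery": "<0.00005",
}
print(colliding_keys(error_domain))
```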

Access to FTP is used without hostname, but this behavior is not defined in domain_prerequisites

            {
                "name": "access to ftp", 
                "url": "ftp://:22/"
},

Similarly, this abstract example should be removed, as this "concrete" example does not want to access a protocol called "protocol":

				"name": "generic name",
			    "url": "protocol://domain:port/application/path"
}

Access to HIVE should presumably extend beyond the login page:

            {
                "name": "HIVE", 
                "url": "https://hive.biochemistry.gwu.edu/dna.cgi?cmd=login"
},

so here the URL should be chopped at first /

The script_access_type is text, yet a URI is provided for script:

        "script_access_type": "text",
        "script": ["https://example.com/workflows/antiviral_resistance_detection_hive.py"],

The script driver manual is undefined:

"script_driver": "manual",

The input/output URI examples have invalid hostname hive.biochemistry.gwu.edudata. These should either be neutral on http://example.com/ or actually work.

 "input_list": [
                        {
                            "address": "https://hive.biochemistry.gwu.edudata/514769/dnaAccessionBased.csv",
                            "access_time": "2017-01-24T09:40:17-0500"
                        }
],

Some of the Sequence Ontology examples are missing SO: and thus don't work with http://identifiers.org/so/ according to external references expansion.

 "structured_name": "HCV1a [taxonomy:31646] ledipasvir 
       [pubchem.compound:67505836] resistance SNP 
       [so:0000694] detection",

"name": "Sequence Ontology",
"ids": ["0000048"], 

  "usability_domain": [
        "Identify baseline single nucleotide polymorphisms SNPs [SO:0000694], insertions [so:SO:0000667], and deletions [so:SO:0000045] that correlate with reduced ledipasvir [pubchem.compound:67505836] antiviral drug efficacy in Hepatitis C virus subtype 1 [taxonomy:31646]", 
],

pipeline_version: Clarify what is a pipeline implementation

pipeline_version says simply:

This field records the version of the pipeline implementation.

Anonymous (I think it was me) questioned:

what is a pipeline implementation? Does this mean the version of the script above, the pipeline in abstract, or the implementation of the workflow system the pipeline is executed on?

Extension domain issues

Issues from extension_domain-fhir.md

fhir_endpoint regex for URL (since format is specified, maybe it is already being validated in the first level)
fhir_version regex [0-9]

Issues from extension_domain-scm.md

scm_extension.scm_repository, scm_extension.scm_preview
regex for URL (maybe this is already being validated at the first level)
scm_extension.scm_preview is listed as required in schema but it
should not be required
scm_extension.scm_path should not be uri-reference (as described in schema)

Remove reference to "BCO Server"

digital_signature introduces a term "BCO Server" which is not explained elsewhere:

The BCO server can provide an API validating the signature versus BCO content, allowing users to validate the signature "offline" on their own. The server will also must provide a reference to the signature creation algorithm, facilitating for greater interoperability.

This is very confusing as "BCO Server" is a new term here, and the specification does not elsewhere talk about how BCO APIs are meant to work.

Suggestion to remove that paragraph or to make a new top-level section about how BCOs are resolved/transferred.

Where are xref namespace from?

The xref uses CURIEs in namespaces.

I think these are identifiers.org namespaces as explained in external-references but we need to update xref to use hyperlinks to that.

(also then the so example should be a valid http://identifiers.org/so/ identifier starting with SO: )
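
A sketch of how such bracketed references might be expanded, assuming the syntax is [namespace:identifier] and that the target is http://identifiers.org/namespace/identifier. Since the spec does not define the syntax yet, both the regular expression and the expand_references helper are assumptions.

```python
import re

# Assumed syntax: [namespace:identifier], where the namespace has no colon.
CURIE = re.compile(r"\[([A-Za-z][A-Za-z0-9_.]*):([^\]\s]+)\]")


def expand_references(text):
    """Return an identifiers.org URL for every bracketed reference in text."""
    return [
        "http://identifiers.org/%s/%s" % (namespace, identifier)
        for namespace, identifier in CURIE.findall(text)
    ]


text = "insertions [so:SO:0000667] in Hepatitis C virus subtype 1 [taxonomy:31646]"
for url in expand_references(text):
    print(url)
```

Under these assumptions a reference written as [SO:0000694] would be read as namespace SO with identifier 0000694, which is why the examples need the explicit so: namespace in front.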

platform: What are possible values?

platform is defined as:

The multi-value reference to a particular deployment of an existing platform where this BCO can be reproduced. A platform can be a bioinformatic platform such as Galaxy or HIVE or it can be a software package such as CASAVA or apps that includes multiple algorithms and software.

"platform": "HIVE"

The example is not actually multi-value
Where do the values for platform come from? This seems to be free-text, so are we OK with variable values like "Galaxy", "UseGalaxy" and "Galaxy Platform"?

provenance_domain obsolete or obsolete_after

The specification uses obsolete, but the schema uses obsolete_after:

./base_type_BioCompute.json:89: "obsolete_after" : {
./provenance-domain.md:17: "obsolete" : "2118-09-26T14:43:43-0400",
./provenance-domain.md:105:### 2.1.6 Obsolescence "obsolete"
./provenance-domain.md:110:"obsolete" : "2118-09-26T14:43:43-0400"
./HCV1a.json:33: "obsolete" : "2118-09-26T14:43:43-0400",
./user_guide.md:242: "obsolete_after" : {
./user_guide.md:461: "obsolete" : "2118-09-26T14:43:43-0400",

Refine github extension keys

The github extension is only explained by example.

The two keys github_repository and github_URI are not explained and seem to be partially overlapping. The camel_Case is also inconsistent.

Given that these are URLs I think the extension should support any source control repository, not just github.com, perhaps something like:

"extension_domain":{
  "scm_extension": {
    "scm_repository": "https://github.com/example/repo1",
    "scm_type": "git",
    "scm_branch": "c9ffea0b60fa3bcf8e138af7c99ca141a6b8fb21",
    "scm_path": "workflow/hive-viral-mutation-detection.cwl",
    "scm_preview": "https://github.com/example/repo1/blob/c9ffea0b60fa3bcf8e138af7c99ca141a6b8fb21/workflow/hive-viral-mutation-detection.cwl"
  }
}

Here's how Maven defines its scm metadata.

bco-specification.md delete section 3.4 or specify JSON Schema

bco-specification.md section 3.4 references data-typing.md which was discontinued in e9686c1

3.4 Appendix V Data typing

The conceptual schema for BCO creation can be defined in ??? schema language.

Specifications:

Data Typing

Fix inconsistent JSON key styles

There are some inconsistencies in JSON key names. Generally BCO use lower_case_with_underscore which is common in JSON.

Some inconsistencies that unnecessarily deviate from this style should be fixed, for example:

Name in domain_prerequisites
false discovery vs false_discovery in error_domain examples
BCO_id in top level (why not bco_id ? )
FHIR_extension including FHIRendpoint_Resource, FHIRendpoint_URL and FHIRendpoint_Ids
github_URI in github_extension

Parametric Domain

Issues from parametric_domain.md

1) parametric_domain is not used for running/reproducing a bco e.g. not used by execution_domain
2) the parameters exposed are NOT default values.
3) automatically generated
4) human readable
5) Value HAS to be resolved before being populated (elaborate on this @Mazumder)
6) Defined as:

"parametric_domain": [
    {"param": "name_of_parameter", "value": "value_of_parameter", "step": "step#"},
    {"param": "name_of_parameter", "value": "value_of_parameter", "step": "step#"}
]

Explain FHIR extension

The fhir extension is only shown by example; the keys are not explained. The camel_Casing is a bit inconsistent.

James Jones suggested:

Can address the camel_Case issues and describe the keys as follows:

FHIR_endpoint is a string containing the URL of endpoint of the FHIR server containing the resource.

FHIR_resource is a string containing the type of resource used. A full list of permitted FHIR resources is available at http://hl7.org/fhir/resourcelist.html.

FHIR_ID is a string containing the server-specific identifier for the resource instance.

The full URL of each referenced FHIR object is the combined address of the form: FHIR_endpoint/FHIR_resource/FHIR_ID
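
The combination rule can be sketched as follows. The endpoint and identifier values are made up for illustration, and fhir_url is a hypothetical helper, not part of any FHIR client library.

```python
def fhir_url(fhir_endpoint, fhir_resource, fhir_id):
    """Combine the three fields as FHIR_endpoint/FHIR_resource/FHIR_ID."""
    # Strip a trailing slash so the result never contains a double slash.
    return "/".join([fhir_endpoint.rstrip("/"), fhir_resource, fhir_id])


# Hypothetical values for illustration only.
print(fhir_url("http://fhir.example.com/base/", "Patient", "21376"))
```

Stating whether FHIR_endpoint may or may not end with a slash would remove the need for the defensive rstrip.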

Avoid type mentions in parametric_domain

parametric_domain says:

All BCOs should inherit from the fundamental BioCompute data type and as such inherit all of the core fields described in document. Specific BioCompute types introduce specific fields designed to customize the use of pipelines for a particular use pattern. Please refer to documentation of individual scripts and specific BCO descriptions for details.

It is very unclear what these types are talking about and how such specific fields can be defined. Does this relate to #11?

Suggestion is to remove this paragraph and rather explain how parametric_domain reflects configurations of other parts of the BCO; presumably the keys here relate to the names of individual pipeline_steps?

(Perhaps other keys are allowed? Some workflow systems like KNIME also have workflow-wide parameters)

Define BCO in JSON Schema?

Link to @corburn's JSON Schema for BCO in the README

Instead of the custom BCO data type schema with _type etc (which itself does not have any documentation), we should use something like JSON Schema, which has multiple tools and validators.

Note that JSON Schema itself is working towards RFC so there might be finer details here changing, but I would argue it is still a more mature schema language for defining expected JSON types than the one we blindly use in primitives.json

Refine SCM extension keys

When opening issue #21 @stain said:

Given that these are URLs I think the extension should support any source control repository, not just github.com, perhaps something like:

"extension_domain":{
  "scm_extension": {
    "scm_repository": "https://github.com/example/repo1",
    "scm_type": "git",
    "scm_branch": "c9ffea0b60fa3bcf8e138af7c99ca141a6b8fb21",
    "scm_path": "workflow/hive-viral-mutation-detection.cwl",
    "scm_preview": "https://github.com/example/repo1/blob/c9ffea0b60fa3bcf8e138af7c99ca141a6b8fb21/workflow/hive-viral-mutation-detection.cwl"
  }
}

Here's how Maven defines its scm metadata.

As we have now changed the extension to scm_extension I felt the discussion should be continued on another thread.
The wording in extension-scm.md has been updated, and antiviral_resistance_detectionTypeDef.json has been as well, but only on the most superficial level. Each of the fields is simply described as string.

Should we have a more comprehensive definition here?

How are BCOs packaged or transferred?

Relates a bit to #23 - how are BCOs serialized and transferred?

Is there a conventional file name for the BCO JSON? (bco.json springs to mind)

Is there a conventional path structure to contain a BCO and its sub-resources? (e.g. data/ ? )

Are BCO sub-resources (scripts, inputs, outputs) webby or contained in some kind of package? (alternative: snapshots of the webby resources)

It has been briefly mentioned that BCOs are meant to be submitted to FDA in the form of physical hard-drives. What form does this take? (We can rule it out of scope for this spec)

Should BCO packaging reuse existing standards like bagit or Research Object?

`BCO_id` -> why not `bco_id`

Taken from a closed issue, #30

* BCO_id in top level (why not bco_id ? )

I think that distinguishing the defining field in the BCO is important, and as such the CAPS are appropriate. If anyone disagrees, please convince me.

"bco_spec_version" as a URL?

Should the "bco_spec_version" field be expressed as a URL to the RELEASE of the version used to draft it? For example:

"bco_spec_version": "v1.1-draft1"

Becomes:

"bco_spec_version": "https://github.com/biocompute-objects/BCO_Specification/releases/tag/v1.2"

What is a type?

type is explained as if one is already meant to know how types are declared.

As any object of type 'type,' it has its own fields: _type, _id, _inherits, name, title and description. Type of this JSON object is "antiviral_resistance_detection"

The meaning of these fields is nowhere explained, and this typing system is not explained.

Suggestion is to remove the type field or to add a section that explains the typing system. data-typing.md might be an early attempt at this, but it has no technical information.

Keywords should not be nested

The keywords field is for some reason nested as a map to lists.

      "keywords": [
            {
                "key": "search terms",
                "value": [
                    "HCV1a", 
                    "Ledipasvir", 
                    "antiviral resistance", 
                    "SNP", 
                    "amino acid substitutions"
                ]
            }
        ]

It is unclear what is the meaning of nestings like search terms and where such keys should be defined.

Keywords are normally not structured, so I would change this to a flat listing:

      "keywords": [
                    "HCV1a", 
                    "Ledipasvir", 
                    "antiviral resistance", 
                    "SNP", 
                    "amino acid substitutions"
        ]
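
For existing BCOs that use the nested form, the conversion to the flat listing is mechanical. A sketch, assuming every nested entry carries its terms under a "value" key as in the example above; flatten_keywords is an illustrative helper.

```python
def flatten_keywords(keywords):
    """Convert nested {"key": ..., "value": [...]} entries into a flat list of strings."""
    flat = []
    for entry in keywords:
        if isinstance(entry, dict):
            # Nested form: keep the terms, drop the grouping label.
            flat.extend(entry.get("value", []))
        else:
            # Already a plain keyword string.
            flat.append(entry)
    return flat


nested = [
    {
        "key": "search terms",
        "value": ["HCV1a", "Ledipasvir", "antiviral resistance", "SNP"],
    }
]
print(flatten_keywords(nested))
```

The grouping label ("search terms") is lost in the conversion, which is acceptable only if its meaning stays undefined.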
