ga4gh-beacon / beacon-v2 Goto Github PK

View Code? Open in Web Editor NEW

22.0 9.0 19.0 5.31 MB

Unified repository for the GA4GH Beacon v2 API standard

License: Creative Commons Zero v1.0 Universal

json-schema schema

beacon-v2's Introduction

Unified repository for Beacon v2 Code & Documentation

Description

This repository is a unified repository representing the different parts of the Beacon API:

framework
models
Beacon v2 Documentation
- authoritive source already in this repository /docs
- rendered version through here (alternative address is docs.genomebeacons.org)

As with other schema projects, here we separate between the schema source files (in src; JSON-Schema written in YAML) and the generated versions for referencing. The current setup allows already the direct referencing of the generated JSON schemas. Examples:

ontologyTerm:
- YAML (source):
  - edit: https://github.com/ga4gh-beacon/beacon-v2/blob/main/framework/src/common/ontologyTerm.yaml
  - raw: https://raw.githubusercontent.com/ga4gh-beacon/beacon-v2/main/framework/src/common/ontologyTerm.yaml
- JSON: https://raw.githubusercontent.com/ga4gh-beacon/beacon-v2/main/framework/json/common/ontologyTerm.json
biosamples/defaultSchema
- YAML (source):
  - edit: https://github.com/ga4gh-beacon/beacon-v2/blob/main/models/src/beacon-v2-default-model/biosamples/defaultSchema.yaml
  - raw: https://raw.githubusercontent.com/ga4gh-beacon/beacon-v2/main/models/src/beacon-v2-default-model/biosamples/defaultSchema.yaml
- JSON: https://raw.githubusercontent.com/ga4gh-beacon/beacon-v2/main/models/json/beacon-v2-default-model/biosamples/defaultSchema.yaml
beaconRequestBody:
- YAML (source): https://raw.githubusercontent.com/ga4gh-beacon/beacon-v2/main/framework/src/requests/beaconRequestBody.yaml
- JSON: https://raw.githubusercontent.com/ga4gh-beacon/beacon-v2/main/framework/json/requests/beaconRequestBody.json

There is a set of tools in /bin to facilitate the conversion. ATM, after editing ...yaml schema files somewhere in the /src tree, a (local) run of bin/yamlerRunner.sh - which re-generates the ....json files in the /json tree) has to be performed before pushing changes.

Changelog

2.1.0

Released, July, 19, 2024

Relocated TypedQuantity required to proper level of the schema for complexValue
Added end and start entities for ageRange and iso8601duration for age
Filtering terms scopes changed from string to array of strings

2.0.1

Released July, 16, 2024

Replaced ENSGLOSSARY for SO ontology family in documentation examples
Moved CURIE to beaconCommonComponents
Created filtering terms entity
Removed validation directories
Several fixes to entity types, typos and other non-breaking changes

2.0.0

Released June, 21, 2022

change notes with respect to the repository & documentation are now in docs.genomebeacons.org
NOTE: on 2022-06-20 the previous development repositories have been archived:
- ARCHIVE - beacon-framework-v2
- ARCHIVE - beacon-v2-Models

Directory structure

|-docs          Contain the source (Markdown) for the mkdocs generated documentation
|
|- framework
|   |
|   |- src      schema source in YAML format; for editing
|   |
|   |- json     JSON versions of the schema files generated from src, authorative/referenceable version
|
|- models
|   |
|   |- src      schema source in YAML format; for editing
|   |
|   |- json     JSON versions of the schema files generated from src, authorative/referenceable version
|
|- bin          scripts and configurations for creating the unified structure
    |
    |- yamlerRunner.sh    runs the conversions for the different repos and format options
    |
    |- beaconYamler.py    conversion app
    |
    |- config.yaml        text replacements and options

beacon-v2's People

Contributors

Stargazers

Watchers

Forkers

tb143 hangjiaz mshadbolt gsfk mrrobb rahelp genostack iper-jkane victorskl daisieh d-salgado aleixcanalda ega-archive redmitry guigolab echodataworkinggroup joneubank mrueda datsirul

beacon-v2's Issues

Filtering terms in model: Confusing `/entryType/filteringTerms.json` files

FilteringTerms and filters are still confusing. One of the areas is the existence of filteringTerms.json files (e.g. in biosamples) which are supposedly placeholders for information files about the available filtering terms for the entry type but do not constitute an endpoint. So it seems that they are for internal use only (and anyway most probably would be kept in a database or generated on the fly).

Proposals:

delete filteringTerms.yaml / .json
optionally add a /filtering_terms endpoint to endpoints, for each entity where filters apply, e.g. /biosamples/filtering_terms/, additionally to /filtering_terms, to list all applying to the given scope

Real contact address in schemas

The email address [email protected] didn't exist.

Suggestion:

one at ELIXIR?
or [email protected] w/ forwarding?

Ontology Resource object weird properties names

Hello,

Although not an issue, but rather comment.

The beaconFilteringTermsResults.json Resource object has two bad-named properties:

"iriPrefix" - is essentially the "namespace" - the default ontology namespace.
"nameSpacePrefix" - which is the namespace prefix which is used to shorthand the identifiers (IRIs).

from the ontology point of view should be just "namespace" and "prefix".

Best,

Dmitry

Add variant query documentation

There s now a placeholder page for genomic query documentation. Please add content (e.g. transfer from the query scouts document) here. (also @babisingh).

Overly complex response `meta` parameters should be optional

Beacon v2 responses are very chatty & require many parameters in their meta elements which only serve verification purposes (e.g. that the beacon could interpret the query ...). This, however, makes it rather hard to demonstrate the simplicity of especially v2 Boolean and count responses. Below is the minimal response example for a variant query which I've used in the documentation - all parameters are required:

{
  "meta": {
    "apiVersion": "v2.0.0",
    "beaconId": "org.progenetix.beacon",
    "receivedRequestSummary": {
      "apiVersion": "v2.0.0",
      "pagination": {
        "limit": 2000,
        "skip": 0
      },
      "requestedGranularity": "boolean",
      "requestedSchemas": [
        {
          "entityType": "genomicVariant",
          "schema": "https://progenetix.org/services/schemas/genomicVariant/"
        }
      ],
      "requestParameters": {
        "alternateBases": "A",
        "referenceBases": "G",
        "referenceName": "refseq:NC_000017.11",
        "start": [
          7577120
        ]
      }
    },
    "returnedGranularity": "boolean",
    "returnedSchemas": [
      {
        "entityType": "genomicVariant",
        "schema": "https://progenetix.org/services/schemas/genomicVariant/"
      }
    ]
  },
  "responseSummary": {
    "exists": true
  }
}

Not necessary:

receivedRequestSummary.apiVersion, receivedRequestSummary.requestedSchemas, requestedGranularity are known by the client and the beacon's returned values are stated separately
receivedRequestSummary.pagination doesn't make sense for Boolean and count responses since it is for record pagination

Maybe, but:

receivedRequestSummary.requestedSchemas and returnedSchemas basically point to the requirement of having a schema for the entity. However, when e.g. doing the basic "beacon from a VCF" - which at least for SNVs is well understood - there is no inherent variant schema to point to, based on the source data.

Helpful for debugging:

receivedRequestSummary.requestParameters

So while it is obviously beneficial to have a full framework + default/alternative model implementation, the current use of required parameters makes it overly complex & scary for resource owners wanting to add a "v1-style" beacon (and thereby enriching the landscape for aggregators etc.). Minimal response IMO (and receivedRequestSummary is an edge case here since it won't be interpreted, mostly):

{
  "meta": {
    "apiVersion": "v2.0.0",
    "beaconId": "org.progenetix.beacon",
    "entityType": "genomicVariant",
    "receivedRequestSummary": {
      "requestParameters": {
        "alternateBases": "A",
        "referenceBases": "G",
        "referenceName": "refseq:NC_000017.11",
        "start": [
          7577120
        ]
      }
    },
    "returnedGranularity": "boolean"
  },
  "responseSummary": {
    "exists": true
  }
}

Discuss!

Improve datatype of runs:platform to OntologyTerm

The datatype of runs:platform could be changed from String to OntologyTerm, using as value range the terms listed under https://www.ebi.ac.uk/ols/ontologies/genepio/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FGENEPIO_0000071. That would make the data more interoperable. In case the data doesn't fit, perhaps null flavors could be used (https://raw.githubusercontent.com/fairgenomes/fairgenomes-semantic-model/main/lookups/NullFlavors.txt), or a new term introduced locally.

requests/filteringTerms.yaml too forgiving for request validation

I've been using BeaconRequestBody.yaml (and its json cousin) for request validation, but the spec for filtering terms is too forgiving, and lets even the most obviously mangled filters through. The issue is with the use of anyOf, the overlaps between the different filter types, and a forgiving attitude toward additional properties.

For example, this filter passes, even though similarity is required to be a string that's one of "exact", "high", "medium", or "low":

  {
    "id": "some_id",
    "similarity": {"I snuck": "an object in here"}
  }

Some other examples are here: https://www.jsonschemavalidator.net/s/pFQY4Xno

This happens because:

FilteringTerm is defined to be any of OntologyFilter, AlphanumericFilter, or CustomFilter,
a filter only has to match one of the three types to pass validation
malformed AlphanumericFilter can often pass as OntologyFilter
almost anything will pass as a CustomFilter, any malformed values other than "id" or "scope" are considered extra properties with no restrictions on them.

The obvious sledgehammer solution is to forbid extra properties in filters, I don't know enough of the history here to know if that's too harsh of a solution. Is CustomFilter meant to be more open?

Here is a solution that forbids all extra properties in filters.

Here, if needed, is a more conservative solution that allows extra properties in CustomFilter but requires them to be objects, which should cut down on the amount of confusion between a CustomFilter extra property and a malformed field in another filter type, all of which are string or boolean.

Documentation: Point framework & models links to unity repo

As the title says ...

Github branching model

During today's GA4GH call, David Salgado has shared the versioning protocol I was referring to and with which many should be familiar.

We discussed that a model loke that could be overkill, but I was checking that this branching model is having plugins, at least for VSCode and Visual Studio, that could make the process simpler.
Are you having plugins for your developement environment?
Would it be worth to abide to one clear protocol for everyone to follow?

permissive complexValue schema

Hi all,

There is a bug in the the complexValue.json schema.

"properties": {
    "typedQuantities": {
        "description": "List of quantities required to fully describe the complex value",
        "items": {
            "$ref": "#/definitions/TypedQuantity"
        },
        "type": "array"
    }
},

The typedQuantities property is not required so ANY value would be accepted.
This gives the problem in the measurement.json

"measurementValue": {
    "description": "The result of the measurement",
    "oneOf": [
        {
            "$ref": "./value.json"
        },
        {
            "$ref": "./complexValue.json"
        }
    ]
},

The problem is that IF some value validates over value.json, as it also validates over complexValue.json validations fails, as it must validate only with one schema in "oneOf".

The solution is to put "required": [ "typedQuantities" ] in the complexValue.

Cheers,

Dmitry

example files in src directory have been translated into YAML

These should be in JSON format.

It might be worth moving the examples out of the json and src directories and into the the framework and src directories or possibly even into root to avoid duplication and to encourage the idea that using either yaml or json the final structure is the same.

YAML output needed or confusing?

The current method converts the source data into YAML and JSON versions, with the JSON version in json being thought to be the one for referencing etc..

Since we have now the source in src in YAML format¹ this might be a bit confusing to have a second (converted) YAML in yaml. Can we drop this and stay with src and json? Other structure ideas?

As of early March 2022 the YAML in source is still being generated/overwritten by imports from the separate models and framework projects but this will change soon. ↩

Suggestion: add a 'measuredVariable' to individuals:measures

Suggestion for potential improvement. The individuals:measures object (https://github.com/ga4gh-beacon/beacon-v2/blob/main/models/src/beacon-v2-default-model/common/measurement.yaml) including assayCode with measuredValue(value, unit) and this works well for generic medical examinations like blood pressure, heart rate, C-reactive protein, etc.

However, for *omics measurements (e.g. metabolomics, genomics) this becomes problematic because there is no ontology term to represent every possibility such as "measured the read count for COL7A1 expression" or "number of peptide fragments for NM_001267550.1".

An elegant way to resolve this issue (as we have done in current MOLGENIS EMX2 implementation) is to add a String variable called 'measuredVariable'. The assayCode can then point to a type of *omics test which may comprise thousands of variables, and measuredVariable can then point to the specific gene, transcript, metabolite etc that was quantified. This does not break the existing model but adds better *omics capabilities to it.

Some errors in the documentation

Hi, I really appreciate the documentation website, I have found a few bugs here and there that I thought I would make you aware of, not sure if there are some issues with how it is reading information from the schemas?

Here are some examples:
http://docs.genomebeacons.org/schemas-md/cohorts_defaultSchema/

inclusionCriteria listed as NOT matching

http://docs.genomebeacons.org/schemas-md/obj/eventGenders/

eventGenders listed as 'geographical information'

Response schema but not an entry type?

For some responses one could envision the use of alternative schemas which do not represent an entry type and/or have an id. A practical example here would be the export of biosamples after a biosample query in the phenopackets format.

What would be the options for such an alternative schema use, e.g. optionally just referencing the (local or external) phenopackets schema? Obv. one can define a complete /phenopackets entry type & model - but this is a bit complex for mostly just changing the response format when not having a very different response type.

Comments, please!

Explanation on ERD

Greetings,

In my understanding, we should be able to have cohorts defined under different criteria such as study, described in beacon or user-defined (I guess this is done using ontology terms).

In these scenarios, having a particular individual referenced across several cohorts is inevitable. However, from the ETD diagram, it seems the cohort-individual relationship has cardinality 1 -<> n or one to many. Could you kindly elaborate on this design aspect? I have attached the ERD for reference.

Thanks

pagination should not be required parameter in beaconReceivedRequestSummary

Pagination should not be required parameter in beaconReceivedRequestSummary since it doesn't make sense in Boolean and count responses (and also not if all data is returned but here YMMV).

beacon-v2/framework/src/responses/sections/beaconReceivedRequestSummary.yaml

Line 50 in eb8958f

- pagination

Improve datatype of runs:libraryStrategy to OntologyTerm

The datatype of runs:libraryStrategy could be changed from String to OntologyTerm, using as value range the terms listed under https://www.ebi.ac.uk/ols/ontologies/genepio/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FGENEPIO_0001973. That would make the data more interoperable. In case the data doesn't fit, the term 'other library strategy could be used, or a new term introduced locally.

Add Filters documentation

There is now a placeholder page for the Filters documentation. Please add information here.

`datasetIds` and/or other additional request parameters?

The only ones that we agreed on adding were the collection ones if we prefer:
biosamples?datasetIds=1,2,3
to
dataset/1/biosamples
dataset/2/biosamples
dataset/3/biosamples

Originally posted by @jrambla in ga4gh-beacon/beacon-v2-Models#112 (comment)

datasetIds needs to be defined as request parameter for most entry types.

Additional parameters for GET queries and documentation of use vs. POST/schemas

As per various discussions, using GET request seems a rather widely used concept with implementers (rapid UI integration w/o need of using specific frameworks; easier documentation, examples, testing etc.). In fact, while we also implement POST in parallel, for Progenetix so far we rely on GET.

I would propose that we provide an extended matrix for parameters in GET queries. Examples here are

includeDescendantTerms - this can in POST be provided per filter but works for "medium complex use cases" as a global parameter
requestedGranularity, includeResultsetResponses - simple globals, but need to be documented that they can be used in GET
requestedSchemas - could be supported as resolving to entityType?
...

IMO it would be much easier for many implementers to see such parameters as simple query parameters, and then the comparable documentation of the POST schemas.

Also see ga4gh-beacon/beacon-framework-v2#43

Suggestion: redefine individuals:sex

Suggestion for potential improvement. Within the FAIR Genomes project (https://www.nature.com/articles/s41597-022-01265-x) there have been many discussions on a Dutch national level on how to best represent this type of information. The NCIT terms are, quite frankly, vague and thus not very useful (i.e. female = "[..] indicate biological sex distinctions, or cultural gender role distinctions, or both"). In the end, we chose to represent what Beacon v2 calls ‘sex’ as 'GenderAtBirth' using GSSO terms (https://github.com/fairgenomes/fairgenomes-semantic-model/blob/main/lookups/GenderAtBirth.txt) with separate terms for 'GenderIdentity' (https://github.com/fairgenomes/fairgenomes-semantic-model/blob/main/lookups/GenderIdentity.txt) and 'GenotypicSex' (in Beacon v2 as ‘KaryotypicSex’, https://github.com/fairgenomes/fairgenomes-semantic-model/blob/main/lookups/GenotypicSex.txt) to complete the full picture.

Improve datatype of runs:librarySelection to OntologyTerm

The datatype of runs:librarySelection could be changed from String to OntologyTerm, using as value range the terms listed under https://www.ebi.ac.uk/ols/ontologies/genepio/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FGENEPIO_0001940. That would make the data more interoperable. In case the data doesn't fit, the term 'other library method' could be used, or a new term introduced locally.

Structure of models directory

The current v2 models repository contains a root BEACON-V2-Model directory, in which one finds directories for the individual schemas and their endpoints. The schemas themselves are named e.g. biosamples/defaultSchema.json. Also, the BEACON-V2-Model directory contains the overall configuration, map and endpoints files.

To provide a clearer blueprint for extensions and separation between default and alternative schemas it would IMO make sense to re-structure models which could easily be done as part of the move to a unified Beacon repository ga4gh-beacon/beacon-v2-Models#76. A possible layout could look like this:

beacon
  |
  |-- framework
  |-- models
  |      |-- src
  |      |      |-- beaconConfiguration.yaml
  |      |      |-- beaconMap.yaml
  |      |      |-- endpoints.yaml
  |      |      |-- alternativeModel
  |      |      |      |-- phenopackets
  |      |      |      |      |-- schema.yaml
  |      |      |      |      |-- endpoints.yaml
  |      |      |      |      |-- filteringTerms.yaml
  |      |      |      |      |-- examples
  |      |      |      |-- ...otherAlternativeSchema...
  |      |      |
  |      |      |-- beaconDefaultModel
  |      |             |-- analyses
  |      |             |      |-- schema.yaml
  |      |             |      |-- endpoints.yaml
  |      |             |      |-- filteringTerms.yaml
  |      |             |      |-- examples
  |      |             |-- biosamples
  |      |             |      |-- schema.yaml
  |      |             |      |-- endpoints.yaml
  |      |             |      |-- filteringTerms.yaml
  |      |             |      |-- examples
  |      |             |-- ...
  |      |
  |      |-- json
  |      |      |-- beaconConfiguration.json
  |      |      |-- beaconMap.json
...
  |-- docs
  |-- (tools / bin ...)

Also note the change from defaultSchema.... to schema.....

Such a restructuring would also allow to have a template which includes the Beacon default model - or rather include the example as part of the standard repo.

Other options would be:
B) as above, but split at the root:

  |-- models
  |      |-- alternativeModel
  |      |      |-- src
  |      |      ...
  |      |      
  |      |-- beaconDefaultModel
  |      |      |-- src
  |      |      ...

C) assume that only the entities in the default model will be allowed, with e.g. biosampleMySchema.yaml inside the biosamples - not flexible & potentially confusing; also may clash w/ concepts where the general data model is different (e.g. Phenopackets can be representedon the individuals level but this isn't correct senso stricto).
D) add alternative models to the current structure (e.g. biosample and myBiosample) - possible but again no separation from Beacon default & possible proliferation of naming issues...

Feedback, please!

Update 2022-03-02: de-pluralize defaultModels etc.

invalid examples in disease.json

Hello,

The examples in the disease.json are invalid:

"examples": [
    {
        "ageGroup": {
            "id": "NCIT:C49685",
            "label": "Adult 18-65 Years Old"
        }
    },
    {
        "age": {
            "iso8601duration": "P32Y6M1D"
        }
    },
    {
        "ageRange": {
            "end": {
                "iso8601duration": "P59Y"
            },
            "start": {
                "iso8601duration": "P18Y"
            }
        }
    },
    {
        "age": {
            "iso8601duration": "P2M4D"
        }
    }
]

There no such properties as "ageRange" or "age", should be:

"examples": [
    {
        "id": "NCIT:C49685",
        "label": "Adult 18-65 Years Old"
    },
    {
        "iso8601duration": "P32Y6M1D"
    },
    {
        "end": {
            "iso8601duration": "P59Y"
        },
        "start": {
            "iso8601duration": "P18Y"
        }
    },
    {
        "iso8601duration": "P2M4D"
    }
]

disease.json:

    "properties": {
        "ageOfOnset": {
            "$ref": "./timeElement.json",

timeElement.json:

    "oneOf": [
        {
            "$ref": "./age.json"
        },
        {
            "$ref": "./ageRange.json"

Cheers,

Replace `...entity` to `...entry`

"...entity" and "entryType" occasions should be normalized. This was agreed upon previously & just didn't make it into 2.0.0 as planned: ga4gh-beacon/beacon-framework-v2#76 as issue in the unified repo, based on comments in ga4gh-beacon/beacon-framework-v2#63.

Filtering across endpoint boundaries

Hi,

I have installed the Beacon 2 RI and imported the Cineca test dataset from the GUI RI. I can query the endpoints "individuals", "biosamples" and "g_variants" and I can do filtering.

I would like to be able to find a count of all genetic variants for which there is a biosample of type "blood".

The only way that I have been able to think of is to run a POST request on the endpoint:

http://beacon:5050/api/biosamples/

with the filter:

"filters" : [ {
"id" : "UBERON:0000178"
} ],

Then I would have to extract the biosample IDs from the results and run the following for every returned ID:

http://beacon:5050/api/biosamples/HG00657/g_variants/

(HG00657 is an example ID)

It sounds like it would be slow and anyway, I don't want to have to pull the sample IDs to my server, I would rather they stay on site, for data protection reasons.

Is there some kind of shortcut notation that I could use to get what I want? E.g. something like:

http://beacon:5050/api/biosamples/*/g_variants/

...with the above filter?

Regards,

David Croft.

Having multiscope in the filterTerm definition

Solution proposed by @redmitry to attach information about the scope it applies for filtering terms.

{
    "id": "LOINC:3141-9",
    "label": "Weight",
    "type": "alphanumeric",
    "scope": ["individual", "biosamples", ...]
}

In this case, as filtering terms can be used for different scopes, adding multiscope information would solve the problem (if there is one) about where can a filtering term apply.
In my opinion, this is good for some models but for other models this can be a headache, so I would say this a good solution (and I would adopt it) to make life easier in some models but should be "optional" and not "required".

Should we stick to the current combination of schema definition languages - a.k.a. A case for OpenAPI?

The current framework is using two different schema definition languages:

OpenAPI 3.0.2 for actual API endpoint definitions. To keep consistency with the Beacon v2 draft1+
Json Schema draft07, for any schema that is not an API endpoint. Version 'draft07' was chosen as it is the one supported by OpenAPI 3.0.2

The topic for discussion is, in the next version (minor or major):

Makes sense to keep OpenAPI? or could/should we go just for Json Schema?
Should we keep the compatibility between OA and JS? or could we update JS to the latest version as they are used quite independently in the Framework?

`filteringTerms`: Add `excluded` flag

There is currently no way to exclude specific ontologyTerms in filter queries although this would be desirable to expand query options (especially regarding the limitation of not having Booleans) and to stay in line with Phenopackets (which has an excluded flag e.g. in PhenotypicFeature.

(There is a current workaround to use some custom filter design, e.g. pre-pending an ontologyTerm in a request with ! - but this is non-standard & only would work for individual Beacons.)

Proposal:

  OntologyFilter:
    type: object
    description: Filter results to include records that contain a specific ontology
      term.
    required:
      - id
    properties:
      id:
        type: string
        description: >-
          Term ID to be queried, using CURIE syntax where possible.
        example: HP:0002664
      includeDescendantTerms:
        type: boolean
        default: true
        description: >-
          Define if the Beacon should implement the ontology hierarchy,
          thus query the descendant terms of `id`.
      similarity:
        type: string
        enum:
          - exact
          - high
          - medium
          - low
        default: exact
        description: >-
          Allow the Beacon to return results which do not match the filter
          exactly, but do match to a certain degree of similarity. The Beacon defines
          the semantic similarity model implemented and how to apply the thresholds
          of 'high', 'medium' and 'low' similarity.
      scope:
        type: string
        description: The entry type to which the filter applies
        example: biosamples
      excluded:
        description: >-
          Flag to indicate whether the filtering term was observed or not. The default is `false`,
          _i.e._ will be selected for **existence** of the term.
        type: boolean
        default: false

phenotypicFeatures.evidence --- change from object to array of objects

To align with Phenopackets v2, I recommend modifying the schema of phenotypicFeatures.evidence from an object to an array of objects

Beacon v2
https://docs.genomebeacons.org/schemas-md/obj/evidence/

Phenopackets v2
https://phenopacket-schema.readthedocs.io/en/latest/phenotype.html#evidence

Thanks,

Manu

Inconsistent references in YAML $ref

In principle the $ref parameters in the YAML docs are broken since (partially) referencing .json schemas (which then do not exist in this location). Blunt-force replacements here need a bit of work since http ... refs will use the JSON versions.

Shelved for now, until moving to actual use of src as the edited version (but happy if someone else engineers the replacements in beaconYamler.py ...).

LegacyVariation or VRS?

Quick question re: genomicVariants, does the Beacon team recommend following the VRS variant specification (i.e. either MolecularVariation or SystemicVariation) for defining the variation field in genomicVariant objects or are you agnostic to which reference is used?

I am not sure whether the LegacyVariation definition is something to ensure backward compatibility only and it is preferred to use the VRS schemas for any new beacons or it is still a valid option moving forward?

Use of Ensembl Glossary Variant Consequence in beacon2

Hi there. I've been reviewing beacon2 ahead of the GA4GH SC meeting and I was surprised to see beacon refers to ENSGLOSSARY as a source of terms to describe molecularEffects. I feel I have a duty to say ENSGLOSSARY was never intended to be used extensively outside of Ensembl. It is very much an Ensembl view of the world. The major consequence prediction tools including Ensembl VEP spent quite a bit of time aligning consequence terms in SO. For instance ENSGLOSSARY:0000140 or Transcript ablation whilst has an is_a relationship to Variant consequence it has an xref to SO:0001893/transcript_ablation.

There might be a reason you chose the Ensembl Glossary but I would urge looking towards sequence ontology to ensure tool/provider portability.

genomicVariations to return pathogenicity predictions

Hello,
This is more of a question.
I need Beacon to return pathogenicity predictions such as CADD and EVE.
The current schema does not specify any field for these types of predictions.
Could we have a new section under VariantLevelData just for in silico predictions?

Questions about linking between entities in the beacon v2 default model

Hi all

I have a few questions around how entities are linked in the beacon model. @mbaudis kindly shared the documentation website with handy model diagram here: http://docs.genomebeacons.org/models/#introduction. Based on this it looks like you could explicitly link individuals to a particular cohort or dataset, but i don't believe there are any fields in these entities where you could store identifiers that would make these links explicit. Is there something I am missing here? I was perhaps thinking of a use case where you could search for a cohort of interest, then from that cohort get more information about individuals/analyses/variants, but I don't follow how that would be possible without linking.

The other question I have is around the cardinality of the relationships between objects. The foreign key type fields, (e.g. runId, biosampleId) seem to generally be string type variables, meaning they can only ever link to one object per field. I could imagine situations where you might want to have multiple objects linked in this field, e.g. an analysis performed on multiple runs or containing many individuals. So if I understand correctly, it currently represents a one way one-to-many relationship, e.g. an individual can have multiple analyses performed on it, but one analysis can't have multiple individuals involved. Are these kinds of relationships the only ones allowed within the beacon v2 model?

I also noticed that in the diagram there isn't a link displayed between runs and analyses, while you can have a runId stored in an analysis object. ( I understand these are still draft but wanted to point out to potentially help improve )

Thanks in advance for helping us understand the model!

CC: @victorskl

Incorrect version for analyses schema

The analyses defaultSchema has 'biosample' instead of 'analysis' in its version string under comment.

beacon-v2/models/json/beacon-v2-default-model/analyses/defaultSchema.json

Line 2 in 8de7dd4

"$comment": "version: ga4gh-beacon-biosample-v2.0.0",

"aggregated" granularity

There are four Beacon granularities: boolean, count, aggregated and record, where aggregated is supposed to return "summary, aggregated or distribution-like responses".

https://github.com/ga4gh-beacon/beacon-framework-v2/blob/efed363fd3624aa58aeaa895abbff149cdf47bcc/common/beaconCommonComponents.json#L113

But it's not clear how to implement the aggregated granularity, since it's mentioned nowhere else in the spec, and a typical Beacon endpoint has three response types rather than four, which map to the boolean, count and record granularites. See for example, the individuals endpoints:

https://github.com/ga4gh-beacon/beacon-v2-Models/blob/main/BEACON-V2-Model/individuals/endpoints.json#L273-L276

Change beaconFilterinTermsResults type to enum

In the beaconFilteringTermsResults.json the "type" property of the FilteringTerm is ambiguous as it only requires an string.

I suggest to enumerate the three types of filtering terms to constrain the property.

Currently, the property is the following:

"type":{
    "description": "Either \"custom\", \"alphanumeric\" or ontology/terminology full name. TODO: An ontology ... with a registered prefix does not need a full name so one may better use CURIE to indicate that the resource can be retrieved from the id. This also will allow to provide an enum here.",
    "examples": [
        "Human Phenotype Ontology",
        "alphanumeric"
    ],
    "type": "string"
}

My suggestion is:

"type":{
    "description": "Either \"custom\", \"alphanumeric\" or \"ontology\" .",
    "enum": [
        "custom",
        "alphanumeric",
        "ontology"
    ]
    "example": "alphanumeric",
    "type": "string"
}

The "custom" type already includes any other type of filters. This enumeration will help to standardize the filters endpoint in the different instances of Beacon, specially when gathering them in the Beacon Network.

Schemas missing $id

Schemas should have $id parameters.

Example: "$id": https://progenetix.org/services/schemas/BeaconServiceError/v2021-03-07

models.md diagram has some errors

The current Model only describes Individuals inside cohorts and genomicVariantions inside datasets. The relationships from dataset to any other entry type are not defined. Neither between individuals and datasets. Except for cases where a dataset is for biosamples (no individuals in that Beacon instance), that approach was simpler. Datasets > Biosamples use case is something we would probably need to add to the Model itself.

Improve datatype of runs:libraryLayout to OntologyTerm

The datatype of runs:libraryLayout could be changed from enum[PAIRED, SINGLE] to OntologyTerm, using as value range the terms listed under https://www.ebi.ac.uk/ols/ontologies/fbcv/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FFBcv_0003208&lang=en&viewMode=PreferredRoots&siblings=false. These include ‘paired-end layout’ and ‘single-end layout’. That would make the data more interoperable. In case the data doesn't fit, perhaps null flavors could be used (https://raw.githubusercontent.com/fairgenomes/fairgenomes-semantic-model/main/lookups/NullFlavors.txt), or a new term introduced locally.

Docs: broken link for implementation example

See 16db543

Representing caseLevelData/zygosity with VRS alleles

If I'm creating a genomicVariant specification from a VCF variant record, I can't see how I'd specify multiple alleles in a single genomicVariant: the variation property seems to be singular? For example, a variant record might have a ref A and an alt C,T. Samples in that record might have genotypes that correspond to A/C, A/A, A/T, C/T.

LegacyVariation seems to be able to capture basic VCF-format ref/alt, at least in the case where there is only one alternate allele. It does not seem like there's an option for multiple alt alleles. So I could capture zygosity/genotype for A/A and A/T as caseLevelData corresponding to one Variation, and A/A and A/C as a different one (even there, how would I know which variation to put the A/A cases in?). But how would I represent C/T samples?

VRS's MolecularVariation seems to be the preferred schema moving forward, I assume. It seems like in this schema, there is no idea of a reference allele at all: each allele is represented by a single variation. But without an ability to specify multiple variations for a genomicVariant, how would I represent zygosity for caseLevelData?

Verifier not accepting CUSTOM for HandoverType

When I execute Beacon Verifier and the schema being verified has the generic CUSTOM in HandoverType, the Verifier complains about the schema because CUSTOM doesn't match the OntologyTerm.json nomenclature when looking at beaconCommonComponents.json reference.
I would suggest to add CUSTOM like an Ontology, with a fake Ontology Id part. For example:

"HandoverType": {
"$ref": "./ontologyTerm.json",
"description": "Handover type, as an Ontology_term object with CURIE syntax for the `id` value. Use `CUSTOM:000001` for the `id` when no ontology is available.",
"examples": [
{
"id": "EFO:0004157",
"label": "BAM format"
},
{
"id": "CUSTOM:000001",
"label": "download genomic variants in .pgxseg file format"
}
]
},

Keeping or not the filteringTerms.yaml inside every entry type folder

First message of the issue #93, opening by @mbaudis.
I think the first point of the solution proposed here is the one I would adopt, deleting the filteringTerms.yaml files for each endpoint, as it is not strictly needed.

Data use conditions and querying

Greetings,

I am implementing a querying capability that honours data use conditions attached to datasets. Following the documentation at https://github.com/EBISPOT/DUO, I was not able to identify a correct way to call API around this.

Ideally, the querying party should be able to explicitly specify their intended data use scenario using DUO terms (AKA data use category) which shall then be matched against the DUO terms attached to datasets. Using these terms right within filters does not seem appropriate as they behave in a significantly different manner.

Would you be able to comment on this?

What would be nice is to have an attribute to specify them as below;

{
    "query": {
        "filters": [],
        "requestedGranularity": "record"
    },
    "dataAccess" {
        "duoDataUse": [
             {
                  "id": "DUO:0000018",
                  "modifiers": [ { "id": "MONDO:0005105" } ]
             },
        ]
    },
    "meta": {
        "apiVersion": "v2.0"
    }
}

Appreciate your feedback.
Thanks

filtering terms for alphanumeric filter

I use "alphanumeric" filters for key/operator/value queries... motivated partly by using phenopackets, where many fields (eg sex) expect values from an enum, not ontology terms.

From what I understand, the goal of the /filtering_terms endpoint is to make the data in a particular beacon discoverable, but it's not clear to me what to return in /filtering_terms for these, if in fact I can return anything at all, since the only fields in a filtering terms result are type / id / label.. it seems odd to pack values into the "label" field.

For a simplified example, phenopacket sex is an enum of UNKNOWN_SEX, FEMALE, MALE, OTHER_SEX. So do I want four filtering terms? eg:

    {
        "type": "alphanumeric",
        "id": "subject.sex",
        "label": "UNKNOWN_SEX"
    }
    {
        "type": "alphanumeric",
        "id": "subject.sex",
        "label": "FEMALE"
    }

... and so on for MALE and OTHER_SEX. "label" doesn't really make sense in this context. I would prefer something along the lines of:

        "type": "alphanumeric",
        "id": "subject.sex",
        "options": ["UNKNOWN_SEX", "FEMALE", "MALE", "OTHER_SEX"]

... but this seems far from the spec. Possibly the issue is that these kinds of metadata queries are not considered "terms" so weren't really expected here. But I'm puzzled why the spec for filters and the spec for filtering term results are so far apart.

Phenopackify disease

The disease class is still missing a number of parameters from Phenopackets v2 or represents them slightly different. We should:

rename ageOfOnset to onset, since it could be another time element
add resolution
add excluded flag option (CAVE: should we add "default": false?)
add clinicalTnmFinding
add primarySite
add laterality

This is a call for re-implementing the changes suggested here, directly in the unified beacon-v2 project:

ga4gh-beacon/beacon-v2-Models#116

Validation errors with `oneOf` elements

This is part of an ongoing discussion that I've been having with @mbaudis.

When we validate this example data for individuals we get these errors:

Row 1:
/diseases/0/ageOfOnset: oneOf rules 0, 1 match.
/measures/0/observationMoment: oneOf rules 0, 1 match.

Meaning that we have a match in two of the options, when we should have a match in exactly one.

The error has been reproduced with other JSON schema validators (e.g., Python's jsonchema).

Here I am showing a simplified version of what is happening:
Example schema:

 {
  type                 : 'object',
    properties           : {},
    additionalProperties : {
        oneOf : [
            { oneLevel  : { type : 'string' } },
            {
                properties :
                  { twoLevel : { type : 'string' } } 
            }
        ],
    },
}

Input:
{ bar : 'beacon' }

The example can be validated at https://www.jsonschemavalidator.net

It's likely that this issue happens in other places of the beacon v2 models as well, as I recall having validation problems with quantity object when I was transforming CINECA's synthetic data for the RI.

If this is actually an error/problem (is it???) one possible (ad hoc) solution could be changing oneOf for anyOf, but it sounds very drastic to me.