
πŸ—ΊοΈ Community curated and predicted equivalences and related mappings between named biological entities that are not available from primary sources.

Home Page: https://biopragmatics.github.io/biomappings/

License: Creative Commons Zero v1.0 Universal



Biomappings


Biomappings is a repository of community curated and predicted equivalences and related mappings between named biological entities that are not available from primary sources. It's also a place where anyone can contribute curations of predicted mappings or their own novel mappings. Ultimately, we hope that primary resources will integrate these mappings and distribute them themselves.

Mappings are stored in simple TSV files, described in the following section.

πŸ’Ύ Data

The data are available through the following four files on the biopragmatics/biomappings GitHub repository.

Curated Description Link
Yes Human-curated true mappings src/biomappings/resources/mappings.tsv
Yes Human-curated non-trivial false (i.e., incorrect) mappings src/biomappings/resources/incorrect.tsv
Yes Mappings that have been checked but not yet decided src/biomappings/resources/unsure.tsv
No Automatically predicted mappings src/biomappings/resources/predictions.tsv

The primary and derived data in this repository are both available under the CC0 1.0 Universal License.

Predictions are generated by scripts in the scripts/ folder. Each uses the utilities from the biomappings.resources module to programmatically interact with the mappings files, e.g., to add predictions.

πŸ₯’ Derived

The mappings are distributed in the Simple Standard for Sharing Ontology Mappings (SSSOM) format and can be referenced by a PURL such as https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv. The positive mappings are also available as a network through NDEx.

Equivalences and related mappings that are available from the OBO Foundry and other primary sources can be accessed through Inspector Javert's Xref Database on Zenodo which was described in this blog post.

πŸ“Š Summary

Summary statistics of the manually curated mappings and predicted mappings are automatically generated nightly and deployed as a website with GitHub Actions to https://biopragmatics.github.io/biomappings.


πŸ™ Contributing

We welcome contributions in the form of curations to any of the four primary TSV files in this repository via a pull request to the main Biomappings repository at https://github.com/biopragmatics/biomappings.

Predicted mappings can be curated by moving a row in the predictions.tsv file into either the positive mappings file (mappings.tsv), negative mappings file (incorrect.tsv), or the unsure mappings file (unsure.tsv). Additionally, the confidence column should be removed, a type column should be added with the value manually_reviewed, and the source column should be changed from the prediction script's URI to your ORCiD identifier written as a CURIE (e.g., orcid:0000-0003-1307-2508).
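The bookkeeping described above can be sketched in Python. The column names (`confidence`, `type`, `source`) follow this README's description, but the exact TSV schema should be checked against the resource files before relying on this:

```python
def promote_prediction(row: dict, orcid: str) -> dict:
    """Turn a predicted mapping row into a manually reviewed one (sketch).

    Column names are assumptions based on the README's description.
    """
    curated = dict(row)
    curated.pop("confidence", None)        # drop the prediction confidence
    curated["type"] = "manually_reviewed"  # mark as human-reviewed
    curated["source"] = f"orcid:{orcid}"   # credit the curator by ORCiD CURIE
    return curated
```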

Novel mappings can be curated by adding a full row to the positive mappings file (mappings.tsv) following the format of the previous lines.

While Biomappings is generally able to use any predicate written as a compact URI (CURIE), predicates from the Simple Knowledge Organization System (SKOS) are preferred. The three most common predicates that are useful for curating mappings are:

Predicate Description
skos:exactMatch The two terms can be used interchangeably
skos:broadMatch The object term is a super-class of the subject
skos:narrowMatch The object term is a sub-class of the subject
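Note that skos:broadMatch and skos:narrowMatch are inverses of one another, while skos:exactMatch is symmetric. A small illustrative helper (not part of the biomappings API) for flipping a mapping's subject and object could look like:

```python
def invert_predicate(predicate: str) -> str:
    """Return the predicate for the reversed subject/object pair.

    skos:exactMatch is symmetric, so it maps to itself; broad and
    narrow matches are inverses of each other.
    """
    inverses = {
        "skos:broadMatch": "skos:narrowMatch",
        "skos:narrowMatch": "skos:broadMatch",
    }
    return inverses.get(predicate, predicate)
```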

Online via GitHub Web Interface

GitHub has an interface for editing files directly in the browser. It will take care of creating a branch for you and creating a pull request. After logging into GitHub, click one of the following links to be brought to the editing interface:

This approach has the caveat that you can only edit one file at a time. Afterwards, you can navigate to your fork of the repository, switch to the branch GitHub created (it will not be the default one), and edit other files in the web interface as well. However, if you plan to do this, it's probably better to follow the instructions below on contributing locally.

✍️ Local via a Text Editor

  1. Fork the repository at https://github.com/biopragmatics/biomappings, clone it locally, and make a new branch (see below).
  2. Edit one or more of the resource files (mappings.tsv, incorrect.tsv, unsure.tsv, predictions.tsv).
  3. Commit to your branch, push, and create a pull request back to the upstream repository.

🌐 Local via the Web Curation Interface

Rather than editing files locally, this repository also comes with a web-based curation interface. Install the code in development mode with the web option (which installs flask and flask-bootstrap) using:

$ git clone https://github.com/biopragmatics/biomappings.git
$ cd biomappings
$ git checkout -b your-branch-name
$ pip install -e .[web]

The web application can be run with:

$ biomappings web

It can be accessed by navigating to http://localhost:5000/ in your browser. After you do some curations, the web application takes care of interacting with the git repository from which you installed biomappings via the "commit and push" button.

Note that if you've installed biomappings from PyPI, running the web curation interface doesn't make much sense: the resource files live inside your Python installation's site-packages folder, where they are hard to find, and you won't be able to contribute your curations back.

Curation Attribution

There are three places where curators of Biomappings are credited:

  1. ORCiD identifiers of curators are stored in each mapping
  2. The summary website groups and counts contributions by curator
  3. A curation leaderboard is automatically uploaded to APICURON

⬇️ Installation

The most recent release can be installed from PyPI with:

$ pip install biomappings

The most recent code and data can be installed directly from GitHub with:

$ pip install git+https://github.com/biopragmatics/biomappings.git

To install in development mode and create a new branch, use the following:

$ git clone https://github.com/biopragmatics/biomappings.git
$ cd biomappings
$ pip install -e .

πŸ’ͺ Usage

There are three main functions exposed from biomappings. Each returns a list of dictionaries, one per mapping.

import biomappings

true_mappings = biomappings.load_mappings()
false_mappings = biomappings.load_false_mappings()
predictions = biomappings.load_predictions()

Alternatively, you can use the links to the TSV files on GitHub above with the library or programming language of your choice.
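For example, since each file is a plain header-prefixed TSV, a minimal stdlib-only parser is enough (fetching the file, e.g., from the PURL above, is left out here; the column names in the example are illustrative):

```python
import csv
import io

def parse_mappings_tsv(text: str) -> list:
    """Parse a Biomappings-style TSV (header row + records) into dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```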

The data can also be loaded as networkx graphs with the following functions:

import biomappings

true_graph = biomappings.get_true_graph()
false_graph = biomappings.get_false_graph()
predictions_graph = biomappings.get_predictions_graph()

Full documentation can be found on ReadTheDocs.

πŸ‘‹ Attribution

βš–οΈ License

Code is licensed under the MIT License. Data are licensed under the CC0 License.

πŸ“– Citation

Prediction and Curation of Missing Biomedical Identifier Mappings with Biomappings
Hoyt, C. T., Hoyt, A. L., and Gyori, B. M. (2022)
Bioinformatics, btad130.

@article{Hoyt2022,
   title = {{Prediction and Curation of Missing Biomedical Identifier Mappings with Biomappings}},
   author = {Hoyt, Charles Tapley and Hoyt, Amelia L and Gyori, Benjamin M},
   journal = {Bioinformatics},
   year = {2023},
   month = {03},
   issn = {1367-4811},
   doi = {10.1093/bioinformatics/btad130},
   url = {https://doi.org/10.1093/bioinformatics/btad130},
   note = {btad130},
   eprint = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad130/49521613/btad130.pdf},
}

🎁 Support

Biomappings was developed by the INDRA Lab, a part of the Laboratory of Systems Pharmacology and the Harvard Program in Therapeutic Science (HiTS) at Harvard Medical School.

πŸ’° Funding

The development of the Bioregistry is funded by the DARPA Young Faculty Award W911NF2010255 (PI: Benjamin M. Gyori).

πŸͺ Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

πŸ› οΈ For Developers

See developer instructions

The final section of the README is for those who want to get involved by making a code contribution.

Development Installation

To install in development mode, use the following:

$ git clone https://github.com/biopragmatics/biomappings.git
$ cd biomappings
$ pip install -e .

πŸ₯Ό Testing

After cloning the repository and installing tox with pip install tox, the unit tests in the tests/ folder can be run reproducibly with:

$ tox

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

πŸ“– Building the Documentation

The documentation can be built locally using the following:

$ git clone https://github.com/biopragmatics/biomappings.git
$ cd biomappings
$ tox -e docs
$ open docs/build/html/index.html

Building the documentation automatically installs the package as well as the docs extra specified in setup.cfg. Sphinx plugins like texext can be added there; additionally, they need to be added to the extensions list in docs/source/conf.py.

πŸ“¦ Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

$ tox -e finish

This script does the following:

  1. Uses Bump2Version to switch the version number in setup.cfg, src/biomappings/version.py, and docs/source/conf.py to not have the -dev suffix
  2. Packages the code in both a tar archive and a wheel using build
  3. Uploads to PyPI using twine (be sure to have a .pypirc file configured to avoid the need for manual input at this step)
  4. Pushes to GitHub. You'll need to make a release from the commit where the version was bumped.
  5. Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion minor afterwards.

biomappings's People

Contributors

actions-user, alhoyt, bgyori, cthoyt, kkaris, sumirp


biomappings's Issues

Published SSSOM mapping set is not compliant with SSSOM TSV specification

The SSSOM TSV file (docs/_data/sssom/biomappings.sssom.tsv) and its associated metadata YML file (docs/_data/sssom/biomappings.sssom.yml) are not valid according to the SSSOM specification:

  1. Invalid use of mapping_set_group.

The mapping_set_group slot is only allowed on a MappingSetReference class, not in the MappingSet class.

  2. Use of non-declared prefixes.

Some prefixes used are undeclared: cvx, orcid, RO, kegg.pathway

The spec says β€œThe YAML metadata block MUST contain a curie map that allows the unambiguous interpretation of CURIES”, which SSSOM-Java interprets as a requirement for all prefix names to be explicitly declared in said map (otherwise there is always room for β€œambiguous interpretation”), the only exception being the β€œbuilt-in” prefixes: sssom, owl, rdf, rdfs, skos, semapv.
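The prefix check that SSSOM-Java applies can be approximated as follows (a sketch; the function and variable names are mine, not from the spec or the tooling):

```python
# Prefixes that SSSOM treats as built-in and need not be declared.
BUILTIN_PREFIXES = {"sssom", "owl", "rdf", "rdfs", "skos", "semapv"}

def undeclared_prefixes(curies, curie_map) -> set:
    """Return prefixes used in CURIEs that are neither built-in nor declared."""
    used = {curie.split(":", 1)[0] for curie in curies}
    return used - set(curie_map) - BUILTIN_PREFIXES
```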

Installation for linux - Note

Noting some instructions I used for installing on a Linux server, in case you want to add them to the README.

Works with Biomappings version = 0.1.3-dev

On Ubuntu 18.04+ [on fresh server]:

py_36 () {
  cur_path=$(pwd)
  cd /usr/bin
  sudo unlink python
  sudo ln -s /usr/bin/python3.6 python
  python --version
  cd "$cur_path"
}

On Ubuntu 20.04+:

sudo apt update
sudo apt install python-is-python3

Install biomappings:

git clone https://github.com/biopragmatics/biomappings.git
# py_36
cd biomappings
sudo apt update
sudo apt install python3-pip
sudo pip install --upgrade setuptools pystow bioregistry pyobo gilda indra rdflib
sudo pip install -e .[web]
biomappings web

Add contribution guidelines

  • What's the difference between mappings.tsv and predictions.tsv
  • Step-by-step contribution guidelines
  • Add licensing information

Redundant exact match curations

This was briefly mentioned in #73 but currently there are a number of redundant curations where skos:exactMatch is curated between A-B and B-A in both orders. Given that skos:exactMatch should be symmetric, I don't think this makes sense. Of course for other relationships it may make sense. But I do think we should normalize these out for skos:exactMatch and remove one of each redundant pair of rows.
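The proposed normalization could be sketched as follows: keep only the first of each symmetric pair of skos:exactMatch rows (the column names here are assumptions, not the actual TSV headers):

```python
def drop_symmetric_duplicates(rows: list) -> list:
    """Keep only the first of each A-B / B-A skos:exactMatch pair (sketch)."""
    seen, kept = set(), []
    for row in rows:
        if row["relation"] == "skos:exactMatch":
            # An unordered pair key makes A-B and B-A collide.
            key = frozenset([
                (row["source prefix"], row["source identifier"]),
                (row["target prefix"], row["target identifier"]),
            ])
            if key in seen:
                continue
            seen.add(key)
        kept.append(row)
    return kept
```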

Consider switching canonical data sheets to SSSOM TSV format

There are a few benefits:

  1. The SSSOM format uses CURIEs exclusively, so there won't be any confusion about identifier style ambiguity
  2. We can use SSSOM tooling instead of implementing it ourselves or alternatively contribute our tooling back to the SSSOM community
  3. There would be fewer artifacts that need to get built, and the messaging in Biomappings could be a bit more straightforward

A few drawbacks:

  1. Requires more programming churn
  2. Have to assess how downstream dependencies consume Biomappings

Note: we're still allowed to choose our own prefix map with our SSSOM files, so we don't necessarily have to consider conflicts in nomenclature with the OBO community, but it would probably be a good idea to consider.

INDRA-based prioritization

  1. Klas generated a processed statement dump; check the frequency of CHEBI terms and MeSH terms
  2. Go in and curate those that actually show up

First find what occurs in INDRA-assembled knowledge, then curate those based on priority.

Add APICURON upload script

Following the instructions at https://apicuron.org/help, the following can be POSTed to https://apicuron.org/api/submit_description:

{
  "resource_id": "biomappings",  
  "resource_name": "Biomappings",
  "resource_uri": "https://biomappings.github.io/biomappings/",
  "resource_url": "https://biomappings.github.io/biomappings/",
  "resource_long_name": "Biomappings",
  "resource_description": "Community curated and predicted equivalences and related mappings between named biological entities that are not available from primary sources.",
  "terms_def": [
    {
      "activity_term": "novel_curation",
      "activity_name": "Curated novel mapping",
      "activity_category": "generation",
      "score": 50,
      "description": "Curated a novel mapping between two entities"
    },
    {
      "activity_term": "validate_prediction",
      "activity_name": "Validate predicted mapping",
      "activity_category": "generation",
      "score": 50,
      "description": "Affirmed the correctness of a predicted mapping"
    },
    {
      "activity_term": "invalidate_prediction",
      "activity_name": "Invalidate predicted mapping",
      "activity_category": "generation",
      "score": 50,
      "description": "Affirmed the incorrectness of a predicted mapping"
    }
  ],
  "achievements_def": [
    {
      "category": "1",
      "name": "Newbie curator",
      "count_threshold": 10,
      "type": "badge",
      "list_terms": [
        "novel_curation",
        "validate_prediction",
        "invalidate_prediction"
      ],
      "color_code": "#055701"
    }
  ]
}
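A stdlib-only sketch for submitting this payload; the bearer-token header is an assumption about APICURON's authentication, so check their help page before use:

```python
import json
import urllib.request

APICURON_SUBMIT_URL = "https://apicuron.org/api/submit_description"

def build_submit_request(payload: dict, token: str) -> urllib.request.Request:
    """Build (but don't send) the POST request for the description payload."""
    return urllib.request.Request(
        APICURON_SUBMIT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed header; APICURON's docs describe the actual scheme.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# Sending is then: urllib.request.urlopen(build_submit_request(payload, token))
```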

Add mapping commons information

In perma-id/w3id.org#3508, I will get a permalink for the Biomappings SSSOM TSV file: https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv.

I want to do one or both of the following:

  1. Have the appropriate configuration somewhere in this repository such that it can be loaded as a "mapping commons"
  2. Write the appropriate boilerplate information so it can be imported into other mapping commons that would fit in a configuration like https://github.com/mapping-commons/mapping-commons-cookiecutter/blob/main/%7B%7Bcookiecutter.project_name%7D%7D/registry.yml

@matentzn @ehartley any suggestions would be helpful. Overall, the goal is to make it as easy as possible to get the content in Biomappings to other places.

Species specific mappings

I was starting to do WikiPathways and Reactome mappings to GO and MeSH, but both WikiPathways and Reactome have species-specific identifiers while GO and MeSH do not. Should we make mappings between these?

Perhaps we should start including optional species information in the predictions file for both the source and target when available

Upload to NDEx workflow failing

It looks like this isn't new, past workflows also fail, see https://github.com/biopragmatics/biomappings/runs/6136744189?check_suite_focus=true

File "/opt/hostedtoolcache/Python/3.9.12/x64/bin/biomappings", line 33, in <module>
    sys.exit(load_entry_point('biomappings', 'console_scripts', 'biomappings')())
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/upload_ndex.py", line 75, in ndex
    nice_cx.update_to(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/ndex2/nice_cx_network.py", line 1630, in update_to
    return ndex.update_cx_network(stream, uuid)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/ndex2/client.py", line 449, in update_cx_network
    self._require_auth()
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/ndex2/client.py", line 171, in _require_auth
    raise NDExUnauthorizedError("This method requires user authentication")
ndex2.exceptions.NDExUnauthorizedError: This method requires user authentication

Build error related to SSSOM export

See https://github.com/biopragmatics/biomappings/runs/6136744213?check_suite_focus=true

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.12/x64/bin/biomappings", line 33, in <module>
    sys.exit(load_entry_point('biomappings', 'console_scripts', 'biomappings')())
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/cli.py", line 46, in update
    ctx.invoke(sssom)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/export_sssom.py", line 99, in sssom
    msdf = from_sssom_dataframe(df, prefix_map=prefix_map, meta=META)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/sssom/parsers.py", line 253, in from_sssom_dataframe
    mlist.append(_prepare_mapping(Mapping(**mdict)))
  File "<string>", line 41, in __init__
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/sssom/sssom_datamodel.py", line 373, in __post_init__
    self.match_type = [
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/sssom/sssom_datamodel.py", line 374, in <listcomp>
    v if isinstance(v, MatchTypeEnum) else MatchTypeEnum(v)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/linkml_runtime/utils/enumerations.py", line 52, in __init__
    raise ValueError(f"Unknown {self.__class__.__name__} enumeration code: {key}")
ValueError: Unknown MatchTypeEnum enumeration code: LexicalEquivalenceMatch

Mappings to pathway ontology

From the Pathway Ontology OBO, it seems that there are xrefs to SMP (the small molecule pathway database) but neither using the appropriate OBO context. First, they appear as CURIEs in synonyms:

[Term]
id: PW:0002615
name: glyoxalase metabolic pathway
def: "The glyoxalase pathway is the main detoxifying  route to protect against the harmful effects of methylglyoxal (MG). The highly reactive dicarbonyl compound is primarily a by-product of glycolysis, but  it can also result as a by-product of fatty acid and protein metabolism. MG is also known as pyruvaldehyde or 2-oxopropanol." [PMID:23763312, PMID:25709564]
synonym: "2-oxo propanol degradation pathway" RELATED []
synonym: "methylglyoxal degradation pathway" RELATED []
synonym: "pyruvaldehyde degradation pathway" RELATED []
synonym: "SMP:00459" RELATED []
is_a: PW:0000063 ! glyoxylate and dicarboxylate metabolic pathway
created_by: vpetri
creation_date: 2016-08-08T09:55:16Z

Second, they're appearing in alt ids:

[Term]
id: PW:0000069
name: ketone bodies metabolic pathway
alt_id: KEGG:00072
alt_id: Reactome:R-HSA-74182
alt_id: SMP:00071
def: "Those metabolic reactions involved in the synthesis, utilization or degradation of ketone bodies. The chemicals acetoacetate, acetone and beta-hydroxybutyrate are collectively known as ketone bodies, although only the first two are ketones. They provide fuel for heart and skeletal muscle and for the brain during starvation. Excessive accumulation of these acidic chemicals leads to dangerous diabetic conditions known as ketoacidosis." [GO:0046950, KEGG:map00072, MCW Libraries:QU4 V876f 2008, OneLook:www.onelook.com]
comment: The definition was compiled using the information from a number of biological/medical dictionaries available at OneLook
synonym: "Ketone body metabolism" RELATED []
is_a: PW:0000058 ! fatty acid metabolic pathway

It looks like there's quite a bit of nice information here, even though its not organized well

Confidence score for lexical mappings from Gilda

@bgyori Can gilda produce a confidence score for a given lexical mapping? It might make sense to prioritize curation in the web interface based on this score, either for easy curation of high-confidence mappings or difficult curation of relatively low-confidence ones. I assume there's already a cutoff somewhere being applied to remove obviously incorrect lexical mappings.

Funny, I feel like we've gone down this road before both in INDRA's belief scores and also in the rational enrichment paper :)

MeSH now appears to provide mappings to Taxonomy

In the latest, 2021 release of MeSH, there are some new entries that provide mappings from MeSH terms to Taxonomy. I confirmed that this info was not included in the 2020 release. Example: https://meshb.nlm.nih.gov/record/ui?ui=D051379 lists

Registry Number: txid10088
Related Numbers: txid10090
                 txid10092

The same information appears in the XML dump of MeSH. This is relevant for #6 which adds some mappings between MeSH and Taxonomy but since these are now available from a "primary" source, they technically shouldn't be added here.

Neoplasms

umls C0006142 Malignant neoplasm of breast skos:exactMatch mesh D001943 Breast Neoplasms manual orcid:0000-0003-4423-4370

Can't be exact if the latter encompasses non-malignant cases, e.g., carcinoma in situ. I recommend getting these from Mondo.

Improved provenance model

  • The issue is that we throw away the provenance from a prediction once it gets human curated; we should try to address this before sending the review. Predictions do have this metadata. Copy the columns into the curations table.
  • Since this information gets thrown away during manual curation, we could change the data model to not delete the prediction, but rather add a reviewer onto each.

lexical score?

What is the lexical score in this file?
biomappings/src/biomappings/resources/predictions.tsv

I couldn't readily find the explanation in either the README or the source code.

Thanks a million!

Sorting issue with automated table extensions

The build here
https://github.com/biomappings/biomappings/runs/3157071733?check_suite_focus=true
complains about the sort order:

======================================================================
FAIL: Test the true curated mappings are in a canonical order.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/biomappings/biomappings/.tox/py/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/tests/test_validity.py", line 51, in test_curations_sorted
    ), "True curations are not sorted"
AssertionError: True curations are not sorted

======================================================================
FAIL: Test the false curated mappings are in a canonical order.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/biomappings/biomappings/.tox/py/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/tests/test_validity.py", line 58, in test_false_mappings_sorted
    ), "False curations are not sorted"
AssertionError: False curations are not sorted

======================================================================
FAIL: Test the unsure mappings are in a canonical order.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/biomappings/biomappings/.tox/py/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/tests/test_validity.py", line 65, in test_unsure_sorted
    ), "Unsure curations are not sorted"
AssertionError: Unsure curations are not sorted

----------------------------------------------------------------------
Ran 6 tests in 1.103s
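The canonical-order check these tests enforce amounts to comparing each file's rows against their sorted order, roughly like the following (the exact sort key lives in test_validity.py and may differ):

```python
def is_canonically_sorted(rows: list, key_columns: tuple) -> bool:
    """Check that rows appear in sorted order by the given columns (sketch)."""
    keys = [tuple(row[column] for column in key_columns) for row in rows]
    return keys == sorted(keys)
```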

MeSH proteins/genes and species-specificity

This is to start a discussion about gene/protein entries in MeSH.

  1. MeSH has supplementary concepts representing human, mouse, and rat-specific proteins. Examples (MAPK1):
    https://meshb.nlm.nih.gov/record/ui?ui=C535150
    https://meshb.nlm.nih.gov/record/ui?ui=C535148
    https://meshb.nlm.nih.gov/record/ui?ui=C535149

  2. Each supplementary concept can have a "mapped to" property that links it to one or more primary concepts. These mappings are usually to the closest match in the list of primary concepts and there are two typical types of non-exactness: a) sometimes specific proteins are mapped to primary concepts representing families of proteins e.g., NASPP1 protein, human (https://meshb.nlm.nih.gov/record/ui?ui=C489391) is mapped to Autoantigens and Nuclear Proteins.
    b) the species-specific supplementary concepts are linked to a non-species-specific primary concept. For instance, the above 3 terms for species-specific MAPK1 are all mapped to https://meshb.nlm.nih.gov/record/ui?ui=D019950.

  3. Some complicated observations made by @steppi a few months ago: The supplementary concepts are explicitly called proteins, e.g., MAPK1 protein, human. The primary concepts aren't explicit about this but there are often clues to them being proteins rather than genes, e.g., A serine/threonine-specific protein kinase which is encoded by the CHEK1 gene in humans. (https://meshb.nlm.nih.gov/record/ui?ui=D000071877). Then there are some complicated cases related one-to-many gene/protein relationships for instance, due to splice variants. For instance for https://meshb.nlm.nih.gov/record/ui?ui=D064546 we have

PKC beta encodes two proteins (PKCB1 and PKCBII) generated by alternative splicing of C-terminal exons.

meaning that this primary concept represents two proteins from the same gene. In another example, we have estrogen receptor alpha 36, human (https://meshb.nlm.nih.gov/record/ui?ui=C000601334) and "estrogen receptor alpha, human" (https://meshb.nlm.nih.gov/record/ui?ui=C506487) as two separate entries that would correspond to separate entries in the uniprot.isoform namespace, though whether the second one can be mapped at all is questionable.

Overall, I'm fairly convinced that both the primary and supplementary concepts should be interpreted as proteins, and that the primary concepts are non-species-specific whereas the supplementary concepts are (explicitly) species specific. Consequently, mappings such as

mesh | D016906 | Interleukin-9 | skos:exactMatch | hgnc | 6029 | IL9

(https://github.com/biomappings/biomappings/blob/master/src/biomappings/resources/mappings.tsv#L142) ought to be changed.

Sure about CC0? Are labels CC0? - possible license compatibility issues

e.g. on https://raw.githubusercontent.com/biopragmatics/biomappings/master/src/biomappings/resources/mappings.tsv we have this:

orcid:0000-0003-4423-4370
wikipathways WP999 TCA Cycle speciesSpecific go GO:0006099 tricarboxylic acid cycle semapv:ManualMappingCuration orcid:0000-0003-4423-4370

It is quite unclear which parts of GO (and other CC-BY ontologies) can be turned into CC0 and which ones cannot (a kind of license stacking issue).

This is a major issue w.r.t. importing ontologies into Wikidata. Some would say that CC-BY is not compatible with CC0.

Maybe we need some kind of new BioOntology Open License.
CC-BY on ontologies creates nightmares, but most resources are afraid of going CC0 (also because of legacy licensing issues).

Possible indexing error in webapp

While curating, after these requests

127.0.0.1 - - [17/Dec/2020 20:54:51] "GET /?limit=10&offset=28513 HTTP/1.1" 200 -
[2020-12-17 20:57:28,660] ERROR in app: Exception on /mark/28549/nope [GET]

the server errored with

    prediction = self._predictions.pop(line)
IndexError: pop index out of range

so we may have an indexing error somewhere.
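A defensive guard along these lines would at least turn the failure into a clearer error (hypothetical names; the real fix likely involves how line numbers shift after earlier rows are popped):

```python
def pop_prediction(predictions: list, line: int):
    """Pop the prediction at an absolute index, with an explicit range check."""
    if not 0 <= line < len(predictions):
        raise IndexError(
            f"line {line} is out of range for {len(predictions)} predictions"
        )
    return predictions.pop(line)
```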

Mapping tagging system

uniprot-mesh proteins (suppleemnt in mehs) created by determinisic way based on static rules likely to produce very confident, 1-1 mappings

Associate some kind of tag/annotation, with documentation, to allow downstream consumers to grab both manually curated and predicted mappings.
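As a sketch of what downstream consumption could look like (the `source_tag` column name and the tag values are made up for illustration), a consumer could filter on the tag to combine curated and deterministic predicted mappings:

```python
import csv
import io

# Hypothetical sketch: a "source_tag" column (name and values are made up)
# lets downstream consumers select curated and predicted mappings together.
TSV = """\
source_id\ttarget_id\tsource_tag
mesh:C000001\tuniprot:P00001\tcurated
mesh:C000002\tuniprot:P00002\tpredicted-deterministic
mesh:C000003\tuniprot:P00003\tpredicted-lexical
"""

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))
wanted = {"curated", "predicted-deterministic"}
selected = [row for row in rows if row["source_tag"] in wanted]
print(len(selected))  # prints 2
```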

Error in GitHub Action for updating

The website update action says there was nothing to commit even though locally (running at the same time as the action) I had changes in my yml after running an export.

  git config --local user.email "[email protected]"
  git config --local user.name "GitHub Action"
  git commit -m "πŸ—ΊοΈ Update biomappings summary" -a
  shell: /bin/bash -e {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.9.1/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.1/x64/lib
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
Error: Process completed with exit code 1.

Confidence score associated with predicted mappings

Depending on the specific subset, confidence in automated mappings can be quite different. For instance, I'd expect the ones in #1 to be very high confidence given the way they are constructed. To facilitate downstream usage of predicted mappings, it would make sense to associate either a guess or an empirically determined (by curating a random sample) confidence score with each mapping.
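For the empirically determined variant, the lower bound of the Wilson score interval over a curated random sample gives a conservative precision estimate for a prediction subset. A sketch with assumed numbers (`45` of `50` is a hypothetical curation result, not real data):

```python
import math

def wilson_lower_bound(correct, total, z=1.96):
    """Lower bound of the Wilson score interval: a conservative estimate
    of precision from a small hand-curated sample (z=1.96 for ~95%)."""
    if total == 0:
        return 0.0
    p = correct / total
    denom = 1 + z**2 / total
    centre = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (centre - margin) / denom

# e.g., 45 of 50 sampled predictions judged correct during manual review
print(round(wilson_lower_bound(45, 50), 3))
```

The lower bound, rather than the raw fraction, avoids overstating confidence when the curated sample is small.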

Github push not working from web interface

I started a branch called curation, did some curation and then pushed the usual "Commit and Push" button. The commit part worked but the push did not happen. I then manually pushed the branch and tried again after some further curations, and the push still did not happen automatically. I suppose I could print some debug messages from the code running the git commands to see why this might be the case.

Obsolete terms in mappings tables

As ontologies and other identifier resources evolve, terms are sometimes deprecated or replaced by other terms. Through some downstream processes, I found a number of terms appearing in the curated mappings table that are no longer valid. I don't have a complete list, but these are some examples:

Some terms are obsolete while others have been replaced. For replaced terms, we could update mappings to use the new, currently valid ID. For terms that are obsolete without a replacement, perhaps keeping the mappings as-is and handling obsolete status downstream of Biomappings makes sense.
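That split could be handled with a small lookup step; a sketch with fake IDs (the `replaced_by` and `obsolete` tables here are invented, not real ontology data):

```python
# Hypothetical sketch (fake IDs): resolve replaced terms to their current
# ID and flag terms obsoleted without a replacement for downstream handling.
replaced_by = {
    ("doid", "DOID:0000001"): ("doid", "DOID:0000002"),
}
obsolete = {("mesh", "C000000")}

def update_reference(prefix, identifier):
    key = (prefix, identifier)
    if key in replaced_by:
        return replaced_by[key], "replaced"
    if key in obsolete:
        return key, "obsolete"
    return key, "ok"

print(update_reference("doid", "DOID:0000001"))
```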

Prefix validation is inconsistent for OBOfoundry ontologies not in identifiers.org

In the case of MONDO, the validity checks are self-contradictory: some checks look for the MONDO: prefix embedded in IDs whereas others expect it not to be there, so some tests fail in either case.

In this particular case, MONDO doesn't appear in identifiers.org, so the expectation of having the namespace embedded in the ID wouldn't have come from there; the prefix should probably be expected not to be embedded.

Assess coverage of Bio-ID

Gilda Bio-ID benchmark: we needed to do curation like the UBERON-MeSH mappings, which showed up really frequently; prioritizing the highest-frequency ones made a big difference in the benchmark.

Add dark mode to webapp

I'm not sure how straightforward this would be but it might make curation easier on the eyes if we had a dark mode for the webapp.

Add a table for curated incorrect mappings

We could add a third table which contains manually reviewed incorrect mappings. The rationale for this is to make sure that we don't repeatedly predict and add a mapping (through the automated scripts) that was reviewed and deemed incorrect. Couldn't find a SKOS entry to represent "does not match" that could be used in this table...

Prediction creates duplicates, possibly depending on source

Mappings between the DO and other resources are duplicated by the script in PR #68 when previously curated.

I would guess this is either because:

  1. DOID appears as the target in files on the master branch but as the source in files in the PR
  2. The DOID prefix is dropped when DO is the source but retained when DO is the target.

It's not clear whether the duplicate mappings can appear in predictions.tsv if they haven't been curated (no DOID mappings remained in predictions.tsv on master when the script in the PR was run).
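Whichever cause it is, a direction-insensitive key would let the prediction script skip pairs that were already curated. A minimal sketch (the `canonical_key` helper is hypothetical): strip a redundant embedded prefix like `DOID:` and sort the two references so both directions collide:

```python
def canonical_key(source_prefix, source_id, target_prefix, target_id):
    """Direction-insensitive key for a mapping between two references."""
    def norm(prefix, identifier):
        banner = prefix.upper() + ":"
        if identifier.upper().startswith(banner):
            identifier = identifier[len(banner):]  # drop embedded "DOID:"
        return (prefix.lower(), identifier)
    return tuple(sorted([norm(source_prefix, source_id),
                         norm(target_prefix, target_id)]))

curated = canonical_key("mesh", "C567782", "doid", "DOID:0050638")
predicted = canonical_key("doid", "0050638", "mesh", "C567782")
print(curated == predicted)  # prints True
```

Filtering new predictions against the set of canonical keys from mappings.tsv, incorrect.tsv, and unsure.tsv would then catch both Example 1 and Example 2 above.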

Example 1:

mappings.tsv (master)

mesh	C567782	Amyloidosis, Hereditary, Transthyretin-Related	skos:exactMatch	doid	DOID:0050638	transthyretin amyloidosis	manually_reviewed	orcid:0000-0003-1307-2508

predictions.tsv (PR #68)

doid	0050638	transthyretin amyloidosis	skos:exactMatch	mesh	C567782	Amyloidosis, Hereditary, Transthyretin-Related	lexical	0.5400948258091115	https://github.com/biomappings/biomappings/blob/a66f05/scripts/generate_doid_mappings.py

Example 2:

unsure.tsv (master)

mesh	D000071700	Cone-Rod Dystrophies	skos:exactMatch	doid	DOID:0050572	cone-rod dystrophy	manually_reviewed	orcid:0000-0003-1307-2508

predictions.tsv (PR #68)

doid	0050572	cone-rod dystrophy	skos:exactMatch	mesh	D000071700	Cone-Rod Dystrophies	lexical	0.5400948258091115	https://github.com/biomappings/biomappings/blob/a66f05/scripts/generate_doid_mappings.py

Curation interface

Not sure if @cthoyt already gave this some thought, but I'm wondering if we could set up a UI to support quickly vetting predicted mappings, e.g., a small webapp where you put in your ORCID and can check correct/incorrect next to a list of predicted mappings (with link-out URLs if they need to be followed up on) and then click a button to get a TSV export you can commit to the manual mappings table.

Algorithms for assessing integrity and quality of mappings

There are some great graph algorithms for this kind of stuff!

  1. Look for n-cycles with all equivalent but one not-equivalent relation
  2. Triangles and other cliques mean the same relation was curated many times. For each component, calculate the "density" and assign it to all relations
  3. Could namespace pairs have a prior confidence? Are there pairs of namespaces we expect don't go together, and therefore should have lower confidence in mappings?
  4. Do more curation for a curator consensus (2 curators agree -> great!, 2 curators disagree -> not so much)
  5. Look in connected components for the same namespace appearing twice
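Check 5 is straightforward to sketch with the standard library alone (the nodes and edges below are toy data, not real mappings): build the undirected mapping graph, walk each connected component, and flag any namespace that labels two or more of its nodes, since exact matches should normally be 1-1 per namespace.

```python
from collections import defaultdict, deque

# Toy data: nodes are (prefix, identifier) pairs joined by exact matches.
edges = [
    (("mesh", "D016906"), ("hgnc", "6029")),
    (("hgnc", "6029"), ("mesh", "D999999")),  # mesh appears twice -> suspicious
    (("doid", "0050572"), ("mesh", "D000071700")),
]

graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

seen, suspicious = set(), []
for start in graph:
    if start in seen:
        continue
    component, queue = [], deque([start])
    seen.add(start)
    while queue:  # BFS over one connected component
        node = queue.popleft()
        component.append(node)
        for nbr in graph[node] - seen:
            seen.add(nbr)
            queue.append(nbr)
    prefixes = [prefix for prefix, _ in component]
    dupes = {p for p in prefixes if prefixes.count(p) > 1}
    if dupes:
        suspicious.append((sorted(component), sorted(dupes)))

print(suspicious)
```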

Some redundant predictions

I found that some MeSH-FamPlex predictions are redundant since FamPlex already provides curated mappings for those entries as a "primary" source. However, it will require some work to remove these without breaking other dependencies. We could review other predictions, e.g., for EFO, HP, and DOID, to see if there are any redundancies, though I tried to import them in a way that should avoid this.

Erroneous rows in biomappings.sssom.tsv?

Looking at the biomappings.sssom.tsv file in docs/_data/sssom, it appears that there are some rows which incorrectly map chemicals to/from diseases, for example:

chebi:60411	bacteriopheophytin	skos:exactMatch	mesh:D011470	Prostatic Hyperplasia	semapv:LexicalMatching		0.95	generate_chebi_mesh_mappings.py

Website update tests failing with missing sssom module

See e.g., https://github.com/biopragmatics/biomappings/runs/3954820246?check_suite_focus=true

Building SSSOM export
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.7/x64/bin/biomappings", line 33, in <module>
    sys.exit(load_entry_point('biomappings', 'console_scripts', 'biomappings')())
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/cli.py", line 46, in update
    ctx.invoke(sssom)
  File "/opt/hostedtoolcache/Python/3.9.7/x64/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/biomappings/biomappings/src/biomappings/export_sssom.py", line 96, in sssom
    from sssom.parsers import from_sssom_dataframe
ModuleNotFoundError: No module named 'sssom'

Documentation suggestions (README)

The documentation is pretty thorough. Below are some suggestions for changes (these might be audience-dependent).

Introduction Section

  • Make clear that this repo is both a data repository of mappings AND an opportunity for individuals to curate/edit predicted mappings. A sentence each introducing the Data and Contributing sections added after the first sentence should be sufficient.
  • Add a picture to show an example mapping including the skos value on the edge. A picture really is worth 1000 words

Data Section

  • Add information describing, or linking to a description of, how predictions were generated and how to interpret them (could be very brief).

Summary Section

  • Clarify what the summary is a summary of.
  • Change this link
    https://biomappings.github.io/biomappings/.

    to https://biopragmatics.github.io/biomappings/
  • Move the APICURON info to the Contribution section.

Contributing Section

  • Add a short introduction to this section.
  • Clarify how to edit the files when curating predicted mappings without using the web server - e.g. "move correct predicted mappings from predictions.tsv to mappings.tsv and incorrect mappings to incorrect.tsv, then add 'manually reviewed' and your ORCID iD."
  • The NDEx link only appears to work in "classic mode". The default mode either loads for long periods (I never waited long enough for it to finish) or states: "No layout available for this network. Do you want to visualize the network with random layout? Or click cancel to explore it without view." Clicking either OK or CANCEL results in an error.
  • Add a note after

    biomappings/README.md

    Lines 105 to 109 in 7e83ff0

    The web application can be run with:
    ```bash
    $ biomappings web
    ```
    to describe the need to access the server in a browser by entering the IP (maybe with a picture to show where the IP is listed after starting biomappings).

Other suggestions

  • For novice contributors, including a definition of different skos terms might be useful (maybe also on the summary page).
