knowledge-graph-hub / kg-idg
A Knowledge Graph to Illuminate the Druggable Genome
Home Page: https://knowledge-graph-hub.github.io/kg-idg/
License: BSD 3-Clause "New" or "Revised" License
The current Docker image used by the Jenkins build for KG-IDG is Ubuntu with Python 3.8.5.
After the updates in #106, this isn't sufficient - it will need at least Python 3.9.
Any new image will also need to have mysql and postgresql available and set up.
This image should work: https://hub.docker.com/repository/docker/caufieldjh/ubuntu20-python-3-9-14-dev
If not, will need to check on the database service configs.
Indexes have various issues, though individual files (e.g., https://kg-hub.berkeleybop.io/kg-idg/20211029/README) are accessible:
Per convo with @caufieldjh, let's add a Jenkinsfile that does a download, transform, and merge (but not a push to s3 just yet), analogous to what we do for kg-microbe and kg-covid-19. This will act as an integration test, and also serve as the basis for the versioned build that we can push to s3.
It would help with sanity checks to have some basic graph visualization. We have a few options, any or all of which may be appropriate:
DrugCentral has adverse events for drugs, with each mapped to a MedDRA term. These may be mappable to HPO as well.
Would be nice to include these for each drug.
Building ingests in Koza will let us handle our sources without too much friction from their varying formats. We can get KGX format output consistently.
Each ingest has the option to accept a specific file rather than the default.
The full transform operation should fail if an individual file is not found.
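A sketch of what that fail-fast behavior could look like (a hypothetical helper; the names are illustrative, not the actual kg-idg API):

```python
from pathlib import Path
from typing import Dict

def check_transform_inputs(inputs: Dict[str, Path]) -> None:
    """Abort the whole transform run if any single input file is missing."""
    missing = [f"{name}: {path}" for name, path in inputs.items()
               if not path.exists()]
    if missing:
        raise FileNotFoundError("Missing transform inputs - " + "; ".join(missing))
```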
Koza 0.1.2 includes some updates to header options in the ingest yamls. Updating will require changing 'has_header' to 'header'.
Incorporate SSSOM (https://github.com/mapping-commons/sssom) as part of KG-IDG metadata.
@justaddcoffee - you had some ideas regarding what this entails
Koza has several examples for doing STRING ingests here: https://github.com/monarch-initiative/koza/tree/main/examples
The current ingest pulls the KG-COVID-19 STRING human PPI nodes/edges, but we could have more specific control over the desired STRING attributes by defining a Koza ingest.
In a meeting with @LucaCappelletti94 on Mar 28, we found that dendritic stars were present in the graph:
Dendritic stars
A dendritic star is a dendritic tree with a maximal depth of one, where nodes with maximal unique degree one are connected to a central root node with high degree and inside a strongly connected component. We have detected 76.57K dendritic stars in the graph, involving a total of 370.09K nodes (38.54%) and 370.09K edges (5.58%), with the largest one involving 1.62K nodes and 1.62K edges. The detected dendritic stars, sorted by decreasing size, are:
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr1](https://monarchinitiative.org/MONARCH_GRCh38chr1) (degree 3.27K), and containing 1.62K nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b001b2a08902a888b688](https://monarchinitiative.org/MONARCH_.well-known/genid/b001b2a08902a888b688), [MONARCH:.well-known/genid/b00240a6cd5c9ce72cfc](https://monarchinitiative.org/MONARCH_.well-known/genid/b00240a6cd5c9ce72cfc), [MONARCH:.well-known/genid/b00264d5500fc17abdb3](https://monarchinitiative.org/MONARCH_.well-known/genid/b00264d5500fc17abdb3), [MONARCH:.well-known/genid/b00314f5c8595c3c6090](https://monarchinitiative.org/MONARCH_.well-known/genid/b00314f5c8595c3c6090) and [MONARCH:.well-known/genid/b0031ce26f12079d8eda](https://monarchinitiative.org/MONARCH_.well-known/genid/b0031ce26f12079d8eda). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr19](https://monarchinitiative.org/MONARCH_GRCh38chr19) (degree 2.08K), and containing 1.04K nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b0020b13d149a3ace335](https://monarchinitiative.org/MONARCH_.well-known/genid/b0020b13d149a3ace335), [MONARCH:.well-known/genid/b0034da11610011d83b6](https://monarchinitiative.org/MONARCH_.well-known/genid/b0034da11610011d83b6), [MONARCH:.well-known/genid/b00481884ceba59a184b](https://monarchinitiative.org/MONARCH_.well-known/genid/b00481884ceba59a184b), [MONARCH:.well-known/genid/b01061b745f09c7b17a0](https://monarchinitiative.org/MONARCH_.well-known/genid/b01061b745f09c7b17a0) and [MONARCH:.well-known/genid/b0112155b43b0392ab9e](https://monarchinitiative.org/MONARCH_.well-known/genid/b0112155b43b0392ab9e). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr2](https://monarchinitiative.org/MONARCH_GRCh38chr2) (degree 2.05K), and containing 1.01K nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b00615c4b695fbaec504](https://monarchinitiative.org/MONARCH_.well-known/genid/b00615c4b695fbaec504), [MONARCH:.well-known/genid/b00640167c662a3d85cc](https://monarchinitiative.org/MONARCH_.well-known/genid/b00640167c662a3d85cc), [MONARCH:.well-known/genid/b0084aa562f0b8e874f7](https://monarchinitiative.org/MONARCH_.well-known/genid/b0084aa562f0b8e874f7), [MONARCH:.well-known/genid/b00f01967f08455b0230](https://monarchinitiative.org/MONARCH_.well-known/genid/b00f01967f08455b0230) and [MONARCH:.well-known/genid/b010b3222397837c92c1](https://monarchinitiative.org/MONARCH_.well-known/genid/b010b3222397837c92c1). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr11](https://monarchinitiative.org/MONARCH_GRCh38chr11) (degree 1.96K), and containing 971 nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b005f4b408aacadc4052](https://monarchinitiative.org/MONARCH_.well-known/genid/b005f4b408aacadc4052), [MONARCH:.well-known/genid/b00d7e00405a3d304484](https://monarchinitiative.org/MONARCH_.well-known/genid/b00d7e00405a3d304484), [MONARCH:.well-known/genid/b01428224cf0b68106f2](https://monarchinitiative.org/MONARCH_.well-known/genid/b01428224cf0b68106f2), [MONARCH:.well-known/genid/b015611ec1932aee59c1](https://monarchinitiative.org/MONARCH_.well-known/genid/b015611ec1932aee59c1) and [MONARCH:.well-known/genid/b01c6b0056aa4a4ad4e8](https://monarchinitiative.org/MONARCH_.well-known/genid/b01c6b0056aa4a4ad4e8). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr17](https://monarchinitiative.org/MONARCH_GRCh38chr17) (degree 1.92K), and containing 950 nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b003a20d29c1de3b50d7](https://monarchinitiative.org/MONARCH_.well-known/genid/b003a20d29c1de3b50d7), [MONARCH:.well-known/genid/b014f794d06917bf8acd](https://monarchinitiative.org/MONARCH_.well-known/genid/b014f794d06917bf8acd), [MONARCH:.well-known/genid/b01a0c2037bd0770892d](https://monarchinitiative.org/MONARCH_.well-known/genid/b01a0c2037bd0770892d), [MONARCH:.well-known/genid/b01b1f0127d3a52c2477](https://monarchinitiative.org/MONARCH_.well-known/genid/b01b1f0127d3a52c2477) and [MONARCH:.well-known/genid/b02097c939e7b03de65d](https://monarchinitiative.org/MONARCH_.well-known/genid/b02097c939e7b03de65d). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr6](https://monarchinitiative.org/MONARCH_GRCh38chr6) (degree 1.80K), and containing 890 nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b000a2698b275276fc61](https://monarchinitiative.org/MONARCH_.well-known/genid/b000a2698b275276fc61), [MONARCH:.well-known/genid/b003500cbe0691866ca7](https://monarchinitiative.org/MONARCH_.well-known/genid/b003500cbe0691866ca7), [MONARCH:.well-known/genid/b003fa4fd2c425a02b14](https://monarchinitiative.org/MONARCH_.well-known/genid/b003fa4fd2c425a02b14), [MONARCH:.well-known/genid/b00a4bd069e6ea7ccd34](https://monarchinitiative.org/MONARCH_.well-known/genid/b00a4bd069e6ea7ccd34) and [MONARCH:.well-known/genid/b00fa6dadec685f40c9b](https://monarchinitiative.org/MONARCH_.well-known/genid/b00fa6dadec685f40c9b). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
And other 76.56K dendritic stars.
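For reference, a minimal sketch of how stars like these can be detected (assuming networkx and an undirected view of the graph; the strongly-connected-component condition from the definition above is omitted for brevity):

```python
import networkx as nx

def dendritic_stars(g: nx.Graph, min_leaves: int = 2):
    """Yield (root, leaves) pairs where degree-1 leaf nodes hang off one root."""
    for root in g.nodes:
        leaves = [n for n in g.neighbors(root) if g.degree(n) == 1]
        if len(leaves) >= min_leaves:
            yield root, leaves
```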
Across the entire graph, there are 26 nodes with the MONARCH_NODE prefix - what is their origin?
We'd like to ingest ATC level 1 information for drugs, which Jeremy Yang says is in their SQL dump, in the tables atc and struct2atc.
Per discussion with Tudor and Jeremy and others on the IDG call just now
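A sketch of the query involved (only the table names atc and struct2atc come from the discussion; the column names below are assumptions about the DrugCentral schema, and sqlite3 stands in for whichever server the dump is loaded into):

```python
import sqlite3

conn = sqlite3.connect("drugcentral.db")  # hypothetical local load of the dump
rows = conn.execute(
    """
    SELECT s2a.struct_id, atc.l1, atc.l1_name  -- ATC level 1 code and label
    FROM struct2atc AS s2a
    JOIN atc ON atc.code = s2a.atc_code
    """
).fetchall()
```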
As of build 25 on master, the merge step fails:
...
11:26:23 [KGX][cli_utils.py][ parse_source] INFO: Processing source 'tcrd-protein'
11:26:23 [KGX][cli_utils.py][ parse_source] INFO: Processing source 'string'
11:32:29 [KGX][cli_utils.py][ parse_source] INFO: Processing source 'upheno2'
11:32:56 multiprocessing.pool.RemoteTraceback:
11:32:56 """
11:32:56 Traceback (most recent call last):
11:32:56 File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
11:32:56 result = (True, func(*args, **kwds))
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 806, in parse_source
11:32:56 transformer.transform(input_args)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 275, in transform
11:32:56 self.process(source_generator, sink)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 315, in process
11:32:56 for rec in source:
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 165, in parse
11:32:56 file_iter = pd.read_csv(
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
11:32:56 return func(*args, **kwargs)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
11:32:56 return _read(filepath_or_buffer, kwds)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 482, in _read
11:32:56 parser = TextFileReader(filepath_or_buffer, **kwds)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
11:32:56 self._engine = self._make_engine(self.engine)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
11:32:56 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
11:32:56 self._reader = parsers.TextReader(self.handles.handle, **kwds)
11:32:56 File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
11:32:56 pandas.errors.EmptyDataError: No columns to parse from file
11:32:56 """
11:32:56
11:32:56 The above exception was the direct cause of the following exception:
11:32:56
11:32:56 Traceback (most recent call last):
11:32:56 File "run.py", line 167, in <module>
11:32:56 cli()
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
11:32:56 return self.main(*args, **kwargs)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
11:32:56 rv = self.invoke(ctx)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
11:32:56 return _process_result(sub_ctx.command.invoke(sub_ctx))
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
11:32:56 return ctx.invoke(self.callback, **ctx.params)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
11:32:56 return __callback(*args, **kwargs)
11:32:56 File "run.py", line 86, in merge
11:32:56 load_and_merge(yaml, processes)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/kg_idg/merge_utils/merge_kg.py", line 36, in load_and_merge
11:32:56 merged_graph = merge(yaml_file, processes=processes)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 681, in merge
11:32:56 stores = [r.get() for r in results]
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 681, in <listcomp>
11:32:56 stores = [r.get() for r in results]
11:32:56 File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
11:32:56 raise self._value
11:32:56 pandas.errors.EmptyDataError: No columns to parse from file
As the error says, the uPheno TSV (nodes, edges, or both) is empty, so it doesn't load.
python run.py download
python run.py transform
# or for upheno alone:
# python3 run.py transform -s UPhenoTransform
python run.py merge
or see build 25
The input for the merge, as defined in merge.yaml, should not be empty.
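One way to enforce that (a sketch, assuming the usual KGX merge config layout under merged_graph/source; not the current kg-idg code):

```python
import os
import yaml

with open("merge.yaml") as f:
    config = yaml.safe_load(f)

# Fail before the merge starts if any input node/edge file is missing or empty.
for name, source in config["merged_graph"]["source"].items():
    for path in source["input"]["filename"]:
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            raise ValueError(f"Merge input for '{name}' is missing or empty: {path}")
```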
Looks like the download step is (gracefully) failing on KG-IDG:
https://build.berkeleybop.io/job/knowledge-graph-hub/job/kg-idg/job/master/19/
Downloading https://data.bioontology.org/ontologies/ATC/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv to atc.csv.gz
18:52:42 : 95%|█████████▍| 18/19 [04:06<00:13, 13.68s/it]
18:52:42 Traceback (most recent call last):
18:52:42 File "run.py", line 164, in <module>
18:52:42 cli()
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
18:52:42 return self.main(*args, **kwargs)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
18:52:42 rv = self.invoke(ctx)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
18:52:42 return _process_result(sub_ctx.command.invoke(sub_ctx))
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
18:52:42 return ctx.invoke(self.callback, **ctx.params)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
18:52:42 return __callback(*args, **kwargs)
18:52:42 File "run.py", line 38, in download
18:52:42 kg_download(*args, **kwargs)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/kg_idg/download.py", line 20, in download
18:52:42 download_from_yaml(yaml_file=yaml_file, output_dir=output_dir,
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/kg_idg/utils/download_utils.py", line 54, in download_from_yaml
18:52:42 with urlopen(req) as response, open(outfile, 'wb') as out_file: # type: ignore
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
18:52:42 return opener.open(url, data, timeout)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 531, in open
18:52:42 response = meth(req, response)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
18:52:42 response = self.parent.error(
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 569, in error
18:52:42 return self._call_chain(*args)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
18:52:42 result = func(*args)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
18:52:42 raise HTTPError(req.full_url, code, msg, hdrs, fp)
18:52:42 urllib.error.HTTPError: HTTP Error 500: Internal Server Error
https://build.berkeleybop.io/job/knowledge-graph-hub/job/kg-idg/job/master/19/
Failing gracefully - Jenkins is using cached data from s3 instead of a fresh download
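The graceful-failure behavior presumably amounts to something like the following (a sketch, not the actual download_utils code; the cached-copy URL is hypothetical):

```python
import shutil
import urllib.error
import urllib.request

def download_with_fallback(url: str, cached_url: str, outfile: str) -> None:
    """Try the live source first; fall back to the cached copy on S3."""
    for candidate in (url, cached_url):
        try:
            with urllib.request.urlopen(candidate, timeout=60) as response, \
                 open(outfile, "wb") as out_file:
                shutil.copyfileobj(response, out_file)
            return
        except urllib.error.HTTPError as err:
            print(f"Download failed for {candidate}: {err}")
    raise RuntimeError(f"No working source for {outfile}")
```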
Some of the data sources (e.g., DrugCentral and TCRD) are on the larger side, so we should only upload compressed versions.
At minimum, this should be the case for sources originally downloaded as compressed.
For consistency, we could also:
At present, the ingest for TCRD just gets gene IDs. The critical element here is the assignment of each gene to a TDL (Target Development Level), as described here: http://juniper.health.unm.edu/tcrd/
I'm not entirely sure about how to model this with Biolink. Is it enough to just treat the TDL category as a NamedThing and then assign an Association between the Gene and the NamedThing, or is there more to it?
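The simplest version of that option, in KGX TSV terms (a sketch; the TCRD CURIE prefix and the gene identifier below are made up for illustration):

```python
# Model the TDL category as a plain node and link each gene to it.
tdl_node = {
    "id": "TCRD:Tclin",                 # hypothetical CURIE for the TDL level
    "category": "biolink:NamedThing",
    "name": "Tclin",
}
gene_to_tdl_edge = {
    "subject": "HGNC:1097",             # example gene identifier
    "predicate": "biolink:related_to",
    "object": "TCRD:Tclin",
    "category": "biolink:Association",
}
```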
Edges of type biolink:expressed_in appear to be correctly produced by the ProteinAtlasTransform, but are not present in the merged graph.
python run.py download
python run.py transform
# or
# python run.py transform -s ProteinAtlasTransform
# for single transform
python run.py merge
$ grep biolink:expressed_in data/merged/merged-kg_edges.tsv | wc -l
0
Transformed edges look like this:
uuid:92b106b2-5e89-11ec-bbc2-00155d00d735 UniProtKB:Q3KRB8 biolink:expressed_in GO:0031982 biolink:GeneToExpressionSiteAssociation|biolink:Association RO:0002206 Human Protein Atlas
and transformed nodes:
GO:0031982 biolink:AnatomicalEntity|biolink:NamedThing vesicle Human Protein Atlas
GO terms are present in the merged graph but GO is its own ingest.
Transformed edges such as that above should be present in the merged graph.
Something about the way mysql is loaded causes it to fail with some regularity, passing its failure on to the build.
Output looks like:
22:58:27 + sudo /etc/init.d/mysql start
22:58:27 * Starting MySQL database server mysqld
22:58:27 su: warning: cannot change directory to /nonexistent: No such file or directory
22:58:59 ...fail!
The warning isn't really the issue - the mysql start can complete even if that warning is present.
Can't seem to reproduce this reliably - not certain of the underlying cause, but see build 24 on the master branch.
Could be memory limitation.
If mysql start fails, wait 30 sec and try again, for up to 5 times.
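As a sketch, that retry policy could look like this (written in Python here for illustration; in practice it would likely be a shell loop in the Jenkinsfile):

```python
import subprocess
import time

for attempt in range(5):
    result = subprocess.run(["sudo", "/etc/init.d/mysql", "start"])
    if result.returncode == 0:
        break
    time.sleep(30)  # wait 30 sec between attempts
else:
    raise RuntimeError("mysql failed to start after 5 attempts")
```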
The multi_indexer script used to set up index pages on the remote (https://github.com/Knowledge-Graph-Hub/kg-idg/blob/master/multi_indexer.py) doesn't really belong here:
It could go somewhere else:
NEAT uploads its output to the graph_ml/ directory for each build (e.g., kg-idg/20211215/graph_ml/).
This directory should also be indexed - the publish stage of the build calls multi_indexer.py to do this, and that will capture the graph_ml directory too, but only if it's called after those new files are placed in the remote location.
We may also wrap up all the indexing and directory management code into its own script for the sake of neatness, then call that script once all new files have been produced.
At each build, do the following:
This could also be modularized through NEAT as we'd want to do this for other graphs.
As per KG-COVID-19 and other KG-hub projects, we need basic documentation in the form of:
At least one of the transforms, DrugCentral, is a passthrough transform, but like the others it should be sent through a KGX transform step prior to the merge. Otherwise it may not be subject to the same validation as the other data sources.
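A sketch of what that extra step could look like, using the KGX Transformer directly (the file paths are assumptions based on the layout seen elsewhere in this tracker):

```python
from kgx.transformer import Transformer

# Round-trip the passthrough output through KGX so it gets the same
# parsing/validation as the other sources before the merge.
transformer = Transformer()
transformer.transform(
    input_args={
        "filename": [
            "data/transformed/drug_central/nodes.tsv",  # hypothetical paths
            "data/transformed/drug_central/edges.tsv",
        ],
        "format": "tsv",
    },
    output_args={"filename": "data/transformed/drug_central/validated", "format": "tsv"},
)
```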
An error occurs during the transform phase of the build:
13:32:36 + python3.8 run.py transform
13:36:57 WARNING:koza.model.config.source_config:Could not load dataset description from metadata file
13:37:05 ERROR:root:Encountered error: could not convert string to float: ''
13:37:05 Parsing data/raw/mondo_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/chebi_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/hp_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/go_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/ogms_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/drug.target.interaction.tsv.gz
13:37:05 Transforming to data/transformed/drug_central using source in kg_idg/transform_utils/drug_central/drugcentral-dti.yaml
13:37:05 koza_apps entry created for: drugcentral-dti
13:37:05 koza_app: <koza.app.KozaApp object at 0x7f5d91d55a90>
This seems to be an error in transforming the drugcentral-dti source. Not certain why it's happening now, but it may be KGX-related, as the most recent change unpinned the KGX version.
See Jenkins build 66. Happened in build 65 too.
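If the root cause is an empty numeric field in the DTI file, one mitigation sketch (the failing column isn't identified in the log, so this is a guess at the fix, not a diagnosis):

```python
def safe_float(value: str, default: float = 0.0) -> float:
    """Coerce blank numeric fields instead of raising ValueError."""
    return float(value) if value.strip() else default
```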
Would like to have more (or any) edge features to do link prediction with:
Or anything else we may map to an association in great numbers.
@LucaCappelletti94 let me know that the most recent KG-IDG builds include a subfolder in the KG-IDG.tar.gz, interfering with automated loading.
Should be fixed in the Jenkinsfile, here: lines 130 to 139 in a83caac.
See https://kg-hub.berkeleybop.io/kg-idg/20220722/KG-IDG.tar.gz
The decompressed file should not include any directories.
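The fix amounts to packaging the TSVs at the archive root. A sketch (paths are illustrative):

```python
import tarfile

# Add each file by basename so the archive contains no leading directory.
with tarfile.open("KG-IDG.tar.gz", "w:gz") as tar:
    for filename in ("merged-kg_nodes.tsv", "merged-kg_edges.tsv"):
        tar.add(f"data/merged/{filename}", arcname=filename)
```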
Update edge types - current/example edge types appear to be from STRING
It looks as though the ATC ingest emits empty node and edge files in the most recent build.
See here for example:
https://kg-hub.berkeleybop.io/kg-idg/20220722/transformed/atc/index.html
It should emit node and edge files that aren't empty.
This problem is present in 20220722, but not in 20220701 for example - possibly a new bug?
https://kg-hub.berkeleybop.io/kg-idg/20220701/transformed/atc/index.html
I just happened upon this because we are thinking about using KG-IDG for an investigation into topological biases and graph ML.
Right now for the KG-IDG passthrough transform of GOCAMs, we are reusing the KG-COVID-19 transform of GOCAMs, which is derived from a big long SPARQL query described in this ticket.
This is fairly COVID-19 specific, so we probably want a more general SPARQL query to retrieve GOCAMs for KG-IDG.
The database startup portion of the Jenkins pipeline encounters this error:
21:44:43 + sudo /etc/init.d/postgresql start
21:44:43 * Starting PostgreSQL 12 database server
21:44:47 ...done.
21:44:47 + break
[Pipeline] sh
21:44:48 + sudo /etc/init.d/mysql start
21:44:48 * Starting MySQL database server mysqld
21:44:48 su: warning: cannot change directory to /nonexistent: No such file or directory
21:45:20 ...fail!
21:45:20 + sleep 60
[Pipeline] sh
21:46:28 + sudo /etc/init.d/postgresql status
21:46:28 12/main (port 5432): online
[Pipeline] echo
21:46:30 PostgreSQL server status:
[Pipeline] sh
21:46:30 + pg_isready -h localhost -p 5432
21:46:30 localhost:5432 - accepting connections
[Pipeline] sh
21:46:31 + sudo /etc/init.d/mysql status
21:46:31 * MySQL is stopped.
Started after merging commits 557cf81 and 0390d42. These introduce ensmallen as a new requirement, but that's Python-specific, and this happens outside the Python context.
Could the Docker image have changed? (No, it is unchanged)
In the previous run, starting the DB servers looks like this:
[2022-05-12T15:15:12.946Z] + sudo /etc/init.d/postgresql start
[2022-05-12T15:15:12.946Z] * Starting PostgreSQL 12 database server
[2022-05-12T15:15:16.175Z] ...done.
[2022-05-12T15:15:16.175Z] + break
[Pipeline] sh
[2022-05-12T15:15:16.574Z] + sudo /etc/init.d/mysql start
[2022-05-12T15:15:16.574Z] * Starting MySQL database server mysqld
[2022-05-12T15:15:16.574Z] su: warning: cannot change directory to /nonexistent: No such file or directory
[2022-05-12T15:15:21.768Z] ...done.
[2022-05-12T15:15:21.768Z] + break
[Pipeline] sh
[2022-05-12T15:15:22.167Z] + sudo /etc/init.d/postgresql status
[2022-05-12T15:15:22.167Z] 12/main (port 5432): online
[Pipeline] echo
[2022-05-12T15:15:22.350Z] PostgreSQL server status:
[Pipeline] sh
[2022-05-12T15:15:22.704Z] + pg_isready -h localhost -p 5432
[2022-05-12T15:15:22.704Z] localhost:5432 - accepting connections
[Pipeline] sh
[2022-05-12T15:15:23.108Z] + sudo /etc/init.d/mysql status
[2022-05-12T15:15:23.108Z] * /usr/bin/mysqladmin Ver 8.0.27-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))
[2022-05-12T15:15:23.108Z] Copyright (c) 2000, 2021, Oracle and/or its affiliates.
[2022-05-12T15:15:23.108Z]
[2022-05-12T15:15:23.108Z] Oracle is a registered trademark of Oracle Corporation and/or its
[2022-05-12T15:15:23.108Z] affiliates. Other names may be trademarks of their respective
[2022-05-12T15:15:23.108Z] owners.
[2022-05-12T15:15:23.108Z]
[2022-05-12T15:15:23.108Z] Server version 8.0.27-0ubuntu0.20.04.1
[2022-05-12T15:15:23.108Z] Protocol version 10
[2022-05-12T15:15:23.108Z] Connection Localhost via UNIX socket
[2022-05-12T15:15:23.108Z] UNIX socket /var/run/mysqld/mysqld.sock
[2022-05-12T15:15:23.108Z] Uptime: 7 sec
[2022-05-12T15:15:23.108Z]
[2022-05-12T15:15:23.108Z] Threads: 2 Questions: 8 Slow queries: 0 Opens: 117 Flush tables: 3 Open tables: 36 Queries per second avg: 1.142
Use ROBOT: https://robot.obolibrary.org/convert
Include PPI from STRING, as per KG-COVID-19:
https://kg-hub.berkeleybop.io/kg-covid-19/current/transformed/STRING/index.html
Filter by combined score of >700 or so
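A sketch of the filter (assuming the standard space-separated STRING protein.links file; the filename is illustrative):

```python
import pandas as pd

# Keep only high-confidence interactions, per the threshold above.
links = pd.read_csv("9606.protein.links.txt.gz", sep=" ")
high_confidence = links[links["combined_score"] > 700]
```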
Need to upload the merged KG, to be stored in a 'kg-idg' directory at https://kg-hub.berkeleybop.io/
In KG-COVID-19 this is accomplished during the Publish stage of the Jenkins build.
See https://pharos.nih.gov/faq - definite overlap with existing contents (e.g., DrugCentral) but an existing large and relevant resource
DrugCentral has a new version as of Oct 5 2021.
We currently use a pre-processed version from KG-COVID-19, but could ingest the new release and prepare a new KGX transform instead.
The following link to TCRD has broken 😑:
http://juniper.health.unm.edu/tcrd/download/TCRDv6.11.0.tsv
It may now need to be:
http://juniper.health.unm.edu/tcrd/download/PharosTCRD_UniProt_Mapping.tsv
It'll need to be updated in the download.yaml.
The source for the TCRD data dump, http://juniper.health.unm.edu/tcrd/download/ , has been down (or at least inconsistently available) for more than a week. In the meantime, I've explicitly set the download URL to a cached version, but it's unclear where to retrieve the full data dump otherwise. The TCRD site itself (http://juniper.health.unm.edu/tcrd/) is also unreachable.
Links on Pharos (https://pharos.nih.gov/ , see bottom of page) point to the same location.
This happened during the most recent build:
18:56:53 Exporting data_type from temporary_database to data/transformed/tcrd/tcrd-data_type.tsv...
18:56:53 Complete - wrote 5.
18:56:53 Exporting info_type from temporary_database to data/transformed/tcrd/tcrd-info_type.tsv...
18:56:53 Complete - wrote 28.
18:56:53 Exporting xref_type from temporary_database to data/transformed/tcrd/tcrd-xref_type.tsv...
18:56:53 Complete - wrote 21.
18:56:53 Exporting protein from temporary_database to data/transformed/tcrd/tcrd-protein.tsv...
18:56:53 Complete - wrote 20412.
18:56:53 Transforming to data/transformed/tcrd using source in kg_idg/transform_utils/tcrd/tcrd-protein.yaml
18:56:53 Parsing data/raw/proteinatlas.tsv.zip
18:56:53 Transforming using source in kg_idg/transform_utils/hpa/hpa-data.yaml
18:56:53 Parsing data/raw/string_nodes.tsv
18:56:53 Updating Biolink categories in string_nodes.tsv
18:56:53 [KGX][cli_utils.py][ transform_source] INFO: Processing source 'string_nodes.tsv'
18:58:14 Parsing data/raw/string_edges.tsv
18:58:14 Parsing edges in string_edges.tsv
18:58:14 Applying confidence threshold of 700 to STRING
18:58:14 [KGX][cli_utils.py][ transform_source] INFO: Processing source 'string_edges.tsv'
19:12:05 WARNING:koza.model.config.source_config:Could not load dataset description from metadata file
19:12:05 ERROR:root:Encountered error: _pydantic_post_init() got an unexpected keyword argument 'provided_by'
19:12:05 Parsing data/raw/atc.csv.gz
19:12:05 Transforming using source in kg_idg/transform_utils/atc/atc-classes.yaml
I suspect this is biolink_model_pydantic becoming more...pydantic w.r.t. the provided_by keyword, so the current fix may be to pin to a previous working version.
Ingested edges with the "is opposite of" annotation (http://purl.obolibrary.org/obo/RO_0002604) should be included while retaining this property, though at the moment this is not handled distinctly from any other predicate. It would be great to have opposites, though!
See the current Ubergraph approach here: https://github.com/INCATools/ubergraph/blob/f078fd8969bff875127efd118783324ae631c6e4/Makefile#L60-L62
As per discussion with Tudor, Jeremy, Chris et al. on Nov 11 IDG call:
Consider using DrugCentral unique IDs as primary drug IDs instead of CHEBI IDs.
This may mean more nodes get merged; we'll consider that a feature as we may not require the level of specificity that CHEBI applies to its entries.
During the merge, KG-IDG produces a large number of "node id [id] has no CURIE prefix" and "Invalid predicate CURIE" errors, though it isn't immediately clear which source these are from.
$ python3 run.py download
$ python3 run.py transform
$ python3 run.py merge 2> merge_out.log
$ sort merge_out.log | uniq -c | sort -n
...
3 Warning: node id http://omim.org/entry/606689 has no CURIE prefix
3 Warning: node id http://omim.org/entry/613364 has no CURIE prefix
3 Warning: node id http://omim.org/entry/615383 has no CURIE prefix
3 Warning: node id http://omim.org/entry/617704 has no CURIE prefix
5 Invalid predicate CURIE 'owl:versionIRI'? Ignoring...
10 Invalid predicate CURIE 'biolink:RegulateprocessToProcess'? Ignoring...
14 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#narrowMatch'? Ignoring...
29 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#broadMatch'? Ignoring...
48 Invalid predicate CURIE 'rdfs:isDefinedBy'? Ignoring...
64 Invalid predicate CURIE 'rdfs:seeAlso'? Ignoring...
75 Invalid predicate CURIE 'biolink:NegativelyRegulateprocessToProcess'? Ignoring...
160 Invalid predicate CURIE 'owl:disjointWith'? Ignoring...
16468 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#closeMatch'? Ignoring...
71842 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#exactMatch'? Ignoring...
Most of the node id prefix errors (not shown) appear to be from OMIM, e.g.:
2 Warning: node id http://www.omim.org/phenotypicSeries/PS619142 has no CURIE prefix
This will require some forensics to identify:
AND/OR
Is this an expected part of how a KGX merge operates?
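If the URL-style ids do need fixing on our side, a normalization sketch for the OMIM cases might look like this (the CURIE prefixes chosen here, especially for phenotypic series, are assumptions):

```python
def to_curie(node_id: str) -> str:
    """Map URL-style OMIM node ids to CURIEs; pass others through."""
    prefix_map = {
        "http://omim.org/entry/": "OMIM:",
        "http://www.omim.org/phenotypicSeries/": "OMIMPS:",  # assumed prefix
    }
    for url_prefix, curie_prefix in prefix_map.items():
        if node_id.startswith(url_prefix):
            return curie_prefix + node_id[len(url_prefix):]
    return node_id
```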
As of the most recent commit:
INFO: pip is looking at multiple versions of neo4j to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of docutils to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of kgx to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of biolink-model-pydantic to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of kg-idg to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement linkml-validator>=0.1.0 (from koza) (from versions: none)
ERROR: No matching distribution found for linkml-validator>=0.1.0
So this may just be a matter of pinning to a specific koza version.
uPheno is available as OWL: http://www.obofoundry.org/ontology/upheno.html
The current code in the Jenkinsfile relies upon a combination of https://github.com/Knowledge-Graph-Hub/go-site/blob/master/scripts/directory_indexer.py and setting up local directories to be copied to S3, but at minimum it should also (ideally as a modified version of the directory_indexer) index directories/files that are on the bucket but not present locally.
Given a bucket and a directory name, this should just make an index in that directory. In the current Jenkins setup this may mean running another s3cmd put to upload the new index.
In the slightly longer term, the entire process of indexing and uploading can be its own modular script.
See https://github.com/Knowledge-Graph-Hub/kg-obo/blob/552192a22e47b2a0a0860a509fed32983e079a96/kg_obo/upload.py#L253
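In the meantime, a sketch of the bucket-side indexing idea (assuming boto3 and a listable bucket; not the existing directory_indexer code):

```python
import boto3

def make_index(bucket: str, prefix: str) -> str:
    """Build an index.html body from the remote listing alone, so files
    present only on the bucket still get indexed."""
    s3 = boto3.client("s3")
    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix
        )
        for obj in page.get("Contents", [])
    ]
    items = "\n".join(f'<li><a href="/{key}">{key}</a></li>' for key in keys)
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```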
Orphanet and OMIM data currently come from the Monarch data archives, but those use N-Triples.
These aren't being actively updated.
Instead, do one or more of the following:
See new behavior as per https://koza.monarchinitiative.org/Usage/configuring_ingests/
In order for Ensmallen to find the graph data correctly, the graphs need to have a base filename of "merged-kg" instead of "KG-IDG".
Example: in this directory
https://kg-hub.berkeleybop.io/kg-idg/current/
The KGX TSV tar.gz has these files inside of it:
KG-IDG_edges.tsv
KG-IDG_nodes.tsv
These should be:
merged-kg_edges.tsv
merged-kg_nodes.tsv
I think this should be a matter of just changing this line in the merge.yaml.
Also, we need to change the KGX TSV tar.gz's in the existing builds (https://kg-hub.berkeleybop.io/kg-idg/) to conform to this.
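For the existing builds, the members of each archive could be renamed in place with something like this sketch (filenames assumed from the listing above):

```python
import tarfile

# Repackage an old build so its members use the merged-kg basename.
with tarfile.open("KG-IDG.tar.gz") as old, \
     tarfile.open("KG-IDG-fixed.tar.gz", "w:gz") as new:
    for member in old.getmembers():
        fileobj = old.extractfile(member) if member.isfile() else None
        member.name = member.name.replace("KG-IDG_", "merged-kg_")
        new.addfile(member, fileobj)
```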
TODO:
It would be useful for Drug Central to have inferred drug -> drug target edges, which we could formulate as a graph ML link prediction task.
We maybe could/should do this in an automated way, on each build of KG-IDG.
A possible roadmap:
Per convo with Tudor et al. on the IDG call just now:
As per discussion with Tudor, Jeremy et al. on Nov 11 IDG call:
Should verify that the following are being ingested:
The atc transform fails Koza validation, leading to a broken build.
See Jenkins build 63 from Aug 1 2022.
Stack trace:
15:47:50 WARNING:biolink_model_pydantic.model:ATC: does not have a local identifier
15:47:50 ERROR:koza.app:Validation error while processing: {'Class ID': 'http://purl.bioontology.org/ontology/ATC/J05AJ03', 'Preferred Label': 'dolutegravir', 'Synonyms': '', 'Definitions': '', 'Obsolete': 'false', 'CUI': 'C3253985', 'Semantic Types': 'http://purl.bioontology.org/ontology/STY/T109|http://purl.bioontology.org/ontology/STY/T121', 'Parents': '', 'ATC LEVEL': '5', 'Is Drug Class': '', 'Semantic type UMLS property': 'http://purl.bioontology.org/ontology/STY/T109|http://purl.bioontology.org/ontology/STY/T121'}
15:47:50 ERROR:root:Encountered error: 1 validation error for NamedThing
15:47:50 iri
15:47:50 string does not match regex "^(http|ftp)" (type=value_error.str.regex; pattern=^(http|ftp))
15:47:50 Parsing data/raw/atc.csv.gz
15:47:50 Transforming using source in kg_idg/transform_utils/atc/atc-classes.yaml
15:47:50 koza_apps entry created for: atc-classes
15:47:50 koza_app: <koza.app.KozaApp object at 0x7f9ab7026bb0>
Not certain what the issue is, as I don't immediately see anything wrong with the entry.