knowledge-graph-hub / kg-idg
A Knowledge Graph to Illuminate the Druggable Genome
Home Page: https://knowledge-graph-hub.github.io/kg-idg/
License: BSD 3-Clause "New" or "Revised" License
The current Docker image used by the Jenkins build for KG-IDG is Ubuntu with Python 3.8.5.
After the updates in #106, this isn't sufficient - it will need at least Python 3.9.
Any new image will also need to have mysql and postgresql available and set up.
This image should work: https://hub.docker.com/repository/docker/caufieldjh/ubuntu20-python-3-9-14-dev
If not, will need to check on the database service configs.
Indexes have various issues, though individual files (e.g., https://kg-hub.berkeleybop.io/kg-idg/20211029/README) are accessible:
Per convo with @caufieldjh, let's add a Jenkinsfile that does a download, transform, and merge (but not a push to s3 just yet), analogous to what we do for kg-microbe and kg-covid-19. This will act as an integration test, and also serve as the basis for the versioned build that we can push to s3.
It would help with sanity checks to have some basic graph visualization. We have a few options, any or all of which may be appropriate:
DrugCentral has adverse events for drugs, with each mapped to a MedDRA term. These may be mappable to HPO as well.
Would be nice to include these for each drug.
Building ingests in Koza will let us handle our sources without too much friction from their varying formats. We can get KGX format output consistently.
Each ingest has the option to accept a specific file rather than the default.
The full transform operation should fail if an individual file is not found.
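A sketch of what that fail-fast behavior could look like (a hypothetical helper; the names are illustrative, not the actual kg-idg API):

```python
from pathlib import Path
from typing import Dict

def check_transform_inputs(inputs: Dict[str, Path]) -> None:
    """Abort the whole transform run if any single input file is missing."""
    missing = [f"{name}: {path}" for name, path in inputs.items()
               if not path.exists()]
    if missing:
        raise FileNotFoundError("Missing transform inputs - " + "; ".join(missing))
```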
Koza 0.1.2 includes some updates to header options in the ingest yamls. Updating will require changing 'has_header' to 'header'.
Incorporate SSSOM (https://github.com/mapping-commons/sssom) as part of KG-IDG metadata.
@justaddcoffee - you had some ideas regarding what this entails
Koza has several examples for doing STRING ingests here: https://github.com/monarch-initiative/koza/tree/main/examples
The current ingest pulls the KG-COVID-19 STRING human PPI nodes/edges, but we could have more specific control over the desired STRING attributes by defining a Koza ingest.
In a meeting with @LucaCappelletti94 on Mar 28, we found that dendritic stars were present in the graph:
Dendritic stars
A dendritic star is a dendritic tree with a maximal depth of one, where nodes with maximal unique degree one are connected to a central root node with high degree and inside a strongly connected component. We have detected 76.57K dendritic stars in the graph, involving a total of 370.09K nodes (38.54%) and 370.09K edges (5.58%), with the largest one involving 1.62K nodes and 1.62K edges. The detected dendritic stars, sorted by decreasing size, are:
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr1](https://monarchinitiative.org/MONARCH_GRCh38chr1) (degree 3.27K), and containing 1.62K nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b001b2a08902a888b688](https://monarchinitiative.org/MONARCH_.well-known/genid/b001b2a08902a888b688), [MONARCH:.well-known/genid/b00240a6cd5c9ce72cfc](https://monarchinitiative.org/MONARCH_.well-known/genid/b00240a6cd5c9ce72cfc), [MONARCH:.well-known/genid/b00264d5500fc17abdb3](https://monarchinitiative.org/MONARCH_.well-known/genid/b00264d5500fc17abdb3), [MONARCH:.well-known/genid/b00314f5c8595c3c6090](https://monarchinitiative.org/MONARCH_.well-known/genid/b00314f5c8595c3c6090) and [MONARCH:.well-known/genid/b0031ce26f12079d8eda](https://monarchinitiative.org/MONARCH_.well-known/genid/b0031ce26f12079d8eda). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr19](https://monarchinitiative.org/MONARCH_GRCh38chr19) (degree 2.08K), and containing 1.04K nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b0020b13d149a3ace335](https://monarchinitiative.org/MONARCH_.well-known/genid/b0020b13d149a3ace335), [MONARCH:.well-known/genid/b0034da11610011d83b6](https://monarchinitiative.org/MONARCH_.well-known/genid/b0034da11610011d83b6), [MONARCH:.well-known/genid/b00481884ceba59a184b](https://monarchinitiative.org/MONARCH_.well-known/genid/b00481884ceba59a184b), [MONARCH:.well-known/genid/b01061b745f09c7b17a0](https://monarchinitiative.org/MONARCH_.well-known/genid/b01061b745f09c7b17a0) and [MONARCH:.well-known/genid/b0112155b43b0392ab9e](https://monarchinitiative.org/MONARCH_.well-known/genid/b0112155b43b0392ab9e). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr2](https://monarchinitiative.org/MONARCH_GRCh38chr2) (degree 2.05K), and containing 1.01K nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b00615c4b695fbaec504](https://monarchinitiative.org/MONARCH_.well-known/genid/b00615c4b695fbaec504), [MONARCH:.well-known/genid/b00640167c662a3d85cc](https://monarchinitiative.org/MONARCH_.well-known/genid/b00640167c662a3d85cc), [MONARCH:.well-known/genid/b0084aa562f0b8e874f7](https://monarchinitiative.org/MONARCH_.well-known/genid/b0084aa562f0b8e874f7), [MONARCH:.well-known/genid/b00f01967f08455b0230](https://monarchinitiative.org/MONARCH_.well-known/genid/b00f01967f08455b0230) and [MONARCH:.well-known/genid/b010b3222397837c92c1](https://monarchinitiative.org/MONARCH_.well-known/genid/b010b3222397837c92c1). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr11](https://monarchinitiative.org/MONARCH_GRCh38chr11) (degree 1.96K), and containing 971 nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b005f4b408aacadc4052](https://monarchinitiative.org/MONARCH_.well-known/genid/b005f4b408aacadc4052), [MONARCH:.well-known/genid/b00d7e00405a3d304484](https://monarchinitiative.org/MONARCH_.well-known/genid/b00d7e00405a3d304484), [MONARCH:.well-known/genid/b01428224cf0b68106f2](https://monarchinitiative.org/MONARCH_.well-known/genid/b01428224cf0b68106f2), [MONARCH:.well-known/genid/b015611ec1932aee59c1](https://monarchinitiative.org/MONARCH_.well-known/genid/b015611ec1932aee59c1) and [MONARCH:.well-known/genid/b01c6b0056aa4a4ad4e8](https://monarchinitiative.org/MONARCH_.well-known/genid/b01c6b0056aa4a4ad4e8). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr17](https://monarchinitiative.org/MONARCH_GRCh38chr17) (degree 1.92K), and containing 950 nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b003a20d29c1de3b50d7](https://monarchinitiative.org/MONARCH_.well-known/genid/b003a20d29c1de3b50d7), [MONARCH:.well-known/genid/b014f794d06917bf8acd](https://monarchinitiative.org/MONARCH_.well-known/genid/b014f794d06917bf8acd), [MONARCH:.well-known/genid/b01a0c2037bd0770892d](https://monarchinitiative.org/MONARCH_.well-known/genid/b01a0c2037bd0770892d), [MONARCH:.well-known/genid/b01b1f0127d3a52c2477](https://monarchinitiative.org/MONARCH_.well-known/genid/b01b1f0127d3a52c2477) and [MONARCH:.well-known/genid/b02097c939e7b03de65d](https://monarchinitiative.org/MONARCH_.well-known/genid/b02097c939e7b03de65d). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
Dendritic star starting from the root node [MONARCH_NODE:GRCh38chr6](https://monarchinitiative.org/MONARCH_GRCh38chr6) (degree 1.80K), and containing 890 nodes, with a maximal depth of 1, which are [MONARCH:.well-known/genid/b000a2698b275276fc61](https://monarchinitiative.org/MONARCH_.well-known/genid/b000a2698b275276fc61), [MONARCH:.well-known/genid/b003500cbe0691866ca7](https://monarchinitiative.org/MONARCH_.well-known/genid/b003500cbe0691866ca7), [MONARCH:.well-known/genid/b003fa4fd2c425a02b14](https://monarchinitiative.org/MONARCH_.well-known/genid/b003fa4fd2c425a02b14), [MONARCH:.well-known/genid/b00a4bd069e6ea7ccd34](https://monarchinitiative.org/MONARCH_.well-known/genid/b00a4bd069e6ea7ccd34) and [MONARCH:.well-known/genid/b00fa6dadec685f40c9b](https://monarchinitiative.org/MONARCH_.well-known/genid/b00fa6dadec685f40c9b). Its nodes have a single node type, which is [biolink:NamedThing](https://biolink.github.io/biolink-model/docs/NamedThing.html). Its edges have a single edge type, which is [biolink:related_to](https://biolink.github.io/biolink-model/docs/related_to.html).
And other 76.56K dendritic stars.
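For reference, a minimal sketch of how stars like these can be detected (assuming networkx and an undirected view of the graph; the strongly-connected-component condition from the definition above is omitted for brevity):

```python
import networkx as nx

def dendritic_stars(g: nx.Graph, min_leaves: int = 2):
    """Yield (root, leaves) pairs where degree-1 leaf nodes hang off one root."""
    for root in g.nodes:
        leaves = [n for n in g.neighbors(root) if g.degree(n) == 1]
        if len(leaves) >= min_leaves:
            yield root, leaves
```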
Across the entire graph, there are 26 nodes with the MONARCH_NODE prefix - what is their origin?
We'd like to ingest ATC level 1 information for drugs, which Jeremy Yang says is in their SQL dump, in the tables atc and struct2atc.
Per discussion with Tudor and Jeremy and others on the IDG call just now
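A sketch of the query involved (only the table names atc and struct2atc come from the discussion; the column names below are assumptions about the DrugCentral schema, and sqlite3 stands in for whichever server the dump is loaded into):

```python
import sqlite3

conn = sqlite3.connect("drugcentral.db")  # hypothetical local load of the dump
rows = conn.execute(
    """
    SELECT s2a.struct_id, atc.l1, atc.l1_name  -- ATC level 1 code and label
    FROM struct2atc AS s2a
    JOIN atc ON atc.code = s2a.atc_code
    """
).fetchall()
```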
As of build 25 on master, the merge step fails:
...
11:26:23 [KGX][cli_utils.py][ parse_source] INFO: Processing source 'tcrd-protein'
11:26:23 [KGX][cli_utils.py][ parse_source] INFO: Processing source 'string'
11:32:29 [KGX][cli_utils.py][ parse_source] INFO: Processing source 'upheno2'
11:32:56 multiprocessing.pool.RemoteTraceback:
11:32:56 """
11:32:56 Traceback (most recent call last):
11:32:56 File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
11:32:56 result = (True, func(*args, **kwds))
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 806, in parse_source
11:32:56 transformer.transform(input_args)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 275, in transform
11:32:56 self.process(source_generator, sink)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 315, in process
11:32:56 for rec in source:
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 165, in parse
11:32:56 file_iter = pd.read_csv(
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
11:32:56 return func(*args, **kwargs)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
11:32:56 return _read(filepath_or_buffer, kwds)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 482, in _read
11:32:56 parser = TextFileReader(filepath_or_buffer, **kwds)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
11:32:56 self._engine = self._make_engine(self.engine)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
11:32:56 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
11:32:56 self._reader = parsers.TextReader(self.handles.handle, **kwds)
11:32:56 File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
11:32:56 pandas.errors.EmptyDataError: No columns to parse from file
11:32:56 """
11:32:56
11:32:56 The above exception was the direct cause of the following exception:
11:32:56
11:32:56 Traceback (most recent call last):
11:32:56 File "run.py", line 167, in <module>
11:32:56 cli()
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
11:32:56 return self.main(*args, **kwargs)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
11:32:56 rv = self.invoke(ctx)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
11:32:56 return _process_result(sub_ctx.command.invoke(sub_ctx))
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
11:32:56 return ctx.invoke(self.callback, **ctx.params)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
11:32:56 return __callback(*args, **kwargs)
11:32:56 File "run.py", line 86, in merge
11:32:56 load_and_merge(yaml, processes)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/kg_idg/merge_utils/merge_kg.py", line 36, in load_and_merge
11:32:56 merged_graph = merge(yaml_file, processes=processes)
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 681, in merge
11:32:56 stores = [r.get() for r in results]
11:32:56 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 681, in <listcomp>
11:32:56 stores = [r.get() for r in results]
11:32:56 File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
11:32:56 raise self._value
11:32:56 pandas.errors.EmptyDataError: No columns to parse from file
As the error says, the uPheno TSV (nodes, edges, or both) is empty, so it doesn't load.
python run.py download
python run.py transform
# or for upheno alone:
# python3 run.py transform -s UPhenoTransform
python run.py merge
or see build 25
The input for the merge, as defined in merge.yaml, should not be empty.
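One way to enforce that (a sketch, assuming the usual KGX merge config layout under merged_graph/source; not the current kg-idg code):

```python
import os
import yaml

with open("merge.yaml") as f:
    config = yaml.safe_load(f)

# Fail before the merge starts if any input node/edge file is missing or empty.
for name, source in config["merged_graph"]["source"].items():
    for path in source["input"]["filename"]:
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            raise ValueError(f"Merge input for '{name}' is missing or empty: {path}")
```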
Looks like the download step is (gracefully) failing on KG-IDG:
https://build.berkeleybop.io/job/knowledge-graph-hub/job/kg-idg/job/master/19/
Downloading https://data.bioontology.org/ontologies/ATC/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb&download_format=csv to atc.csv.gz
18:52:42 : 95%|█████████▍| 18/19 [04:06<00:13, 13.68s/it]
18:52:42 Traceback (most recent call last):
18:52:42 File "run.py", line 164, in <module>
18:52:42 cli()
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
18:52:42 return self.main(*args, **kwargs)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
18:52:42 rv = self.invoke(ctx)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
18:52:42 return _process_result(sub_ctx.command.invoke(sub_ctx))
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
18:52:42 return ctx.invoke(self.callback, **ctx.params)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
18:52:42 return __callback(*args, **kwargs)
18:52:42 File "run.py", line 38, in download
18:52:42 kg_download(*args, **kwargs)
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/kg_idg/download.py", line 20, in download
18:52:42 download_from_yaml(yaml_file=yaml_file, output_dir=output_dir,
18:52:42 File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/kg_idg/utils/download_utils.py", line 54, in download_from_yaml
18:52:42 with urlopen(req) as response, open(outfile, 'wb') as out_file: # type: ignore
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
18:52:42 return opener.open(url, data, timeout)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 531, in open
18:52:42 response = meth(req, response)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
18:52:42 response = self.parent.error(
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 569, in error
18:52:42 return self._call_chain(*args)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
18:52:42 result = func(*args)
18:52:42 File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
18:52:42 raise HTTPError(req.full_url, code, msg, hdrs, fp)
18:52:42 urllib.error.HTTPError: HTTP Error 500: Internal Server Error
https://build.berkeleybop.io/job/knowledge-graph-hub/job/kg-idg/job/master/19/
Failing gracefully - Jenkins is using cached data from s3 instead of a fresh download
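The graceful-failure behavior presumably amounts to something like the following (a sketch, not the actual download_utils code; the cached-copy URL is hypothetical):

```python
import shutil
import urllib.error
import urllib.request

def download_with_fallback(url: str, cached_url: str, outfile: str) -> None:
    """Try the live source first; fall back to the cached copy on S3."""
    for candidate in (url, cached_url):
        try:
            with urllib.request.urlopen(candidate, timeout=60) as response, \
                 open(outfile, "wb") as out_file:
                shutil.copyfileobj(response, out_file)
            return
        except urllib.error.HTTPError as err:
            print(f"Download failed for {candidate}: {err}")
    raise RuntimeError(f"No working source for {outfile}")
```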
Some of the data sources (e.g., DrugCentral and TCRD) are on the larger side, so we should only upload compressed versions.
At minimum, this should be the case for sources originally downloaded as compressed.
For consistency, we could also:
At present, the ingest for TCRD just gets gene IDs. The critical element here is the assignment of each gene to a TDL (Target Development Level), as described here: http://juniper.health.unm.edu/tcrd/
I'm not entirely sure about how to model this with Biolink. Is it enough to just treat the TDL category as a NamedThing and then assign an Association between the Gene and the NamedThing, or is there more to it?
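The simplest version of that option, in KGX TSV terms (a sketch; the TCRD CURIE prefix and the gene identifier below are made up for illustration):

```python
# Model the TDL category as a plain node and link each gene to it.
tdl_node = {
    "id": "TCRD:Tclin",                 # hypothetical CURIE for the TDL level
    "category": "biolink:NamedThing",
    "name": "Tclin",
}
gene_to_tdl_edge = {
    "subject": "HGNC:1097",             # example gene identifier
    "predicate": "biolink:related_to",
    "object": "TCRD:Tclin",
    "category": "biolink:Association",
}
```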
Edges of type biolink:expressed_in appear to be correctly produced by the ProteinAtlasTransform, but are not present in the merged graph.
python run.py download
python run.py transform
# or
# python run.py transform -s ProteinAtlasTransform
# for single transform
python run.py merge
$ grep biolink:expressed_in data/merged/merged-kg_edges.tsv | wc -l
0
Transformed edges look like this:
uuid:92b106b2-5e89-11ec-bbc2-00155d00d735 UniProtKB:Q3KRB8 biolink:expressed_in GO:0031982 biolink:GeneToExpressionSiteAssociation|biolink:Association RO:0002206 Human Protein Atlas
and transformed nodes:
GO:0031982 biolink:AnatomicalEntity|biolink:NamedThing vesicle Human Protein Atlas
GO terms are present in the merged graph but GO is its own ingest.
Transformed edges such as that above should be present in the merged graph.
Something about the way mysql is loaded causes it to fail with some regularity, passing its failure on to the build.
Output looks like:
22:58:27 + sudo /etc/init.d/mysql start
22:58:27 * Starting MySQL database server mysqld
22:58:27 su: warning: cannot change directory to /nonexistent: No such file or directory
22:58:59 ...fail!
The warning isn't really the issue - the mysql start can complete even if that warning is present.
Can't seem to reproduce this reliably - not certain of the underlying cause, but see build 24 on the master branch.
Could be memory limitation.
If mysql start fails, wait 30 sec and try again, for up to 5 times.
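As a sketch, that retry policy could look like this (written in Python here for illustration; in practice it would likely be a shell loop in the Jenkinsfile):

```python
import subprocess
import time

for attempt in range(5):
    result = subprocess.run(["sudo", "/etc/init.d/mysql", "start"])
    if result.returncode == 0:
        break
    time.sleep(30)  # wait 30 sec between attempts
else:
    raise RuntimeError("mysql failed to start after 5 attempts")
```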
The multi_indexer script used to set up index pages on the remote (https://github.com/Knowledge-Graph-Hub/kg-idg/blob/master/multi_indexer.py) doesn't really belong here:
It could go somewhere else:
NEAT uploads its output to the graph_ml/ directory for each build (e.g., kg-idg/20211215/graph_ml/).
This directory should also be indexed - the publish stage of the build calls multi_indexer.py to do this, and that will capture the graph_ml directory too, but only if it's called after those new files are placed in the remote location.
We may also wrap up all the indexing and directory management code into its own script for the sake of neatness, then call that script once all new files have been produced.
At each build, do the following:
This could also be modularized through NEAT as we'd want to do this for other graphs.
As per KG-COVID-19 and other KG-hub projects, we need basic documentation in the form of:
At least one of the transforms, DrugCentral, is a passthrough transform, but like the others it should be sent through a KGX transform step prior to the merge. Otherwise it may not be subject to the same validation as the other data sources.
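A sketch of what that extra step could look like, using the KGX Transformer directly (the file paths are assumptions based on the layout seen elsewhere in this tracker):

```python
from kgx.transformer import Transformer

# Round-trip the passthrough output through KGX so it gets the same
# parsing/validation as the other sources before the merge.
transformer = Transformer()
transformer.transform(
    input_args={
        "filename": [
            "data/transformed/drug_central/nodes.tsv",  # hypothetical paths
            "data/transformed/drug_central/edges.tsv",
        ],
        "format": "tsv",
    },
    output_args={"filename": "data/transformed/drug_central/validated", "format": "tsv"},
)
```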
An error occurs during the transform phase of the build:
13:32:36 + python3.8 run.py transform
13:36:57 WARNING:koza.model.config.source_config:Could not load dataset description from metadata file
13:37:05 ERROR:root:Encountered error: could not convert string to float: ''
13:37:05 Parsing data/raw/mondo_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/chebi_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/hp_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/go_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/ogms_kgx_tsv.tar.gz
13:37:05 Parsing data/raw/drug.target.interaction.tsv.gz
13:37:05 Transforming to data/transformed/drug_central using source in kg_idg/transform_utils/drug_central/drugcentral-dti.yaml
13:37:05 koza_apps entry created for: drugcentral-dti
13:37:05 koza_app: <koza.app.KozaApp object at 0x7f5d91d55a90>
This seems to be an error in transforming the drugcentral-dti source. Not certain why it's happening now, but it may be KGX-related, as the most recent change unpinned the KGX version.
See Jenkins build 66. Happened in build 65 too.
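If the root cause is an empty numeric field in the DTI file, one mitigation sketch (the failing column isn't identified in the log, so this is a guess at the fix, not a diagnosis):

```python
def safe_float(value: str, default: float = 0.0) -> float:
    """Coerce blank numeric fields instead of raising ValueError."""
    return float(value) if value.strip() else default
```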
Would like to have more (or any) edge features to do link prediction with:
Or anything else we may map to an association in great numbers.
@LucaCappelletti94 let me know that the most recent KG-IDG builds include a subfolder in the KG-IDG.tar.gz, interfering with automated loading.
Should be fixed in the Jenkinsfile, here: lines 130 to 139 in a83caac.
See https://kg-hub.berkeleybop.io/kg-idg/20220722/KG-IDG.tar.gz
The decompressed file should not include any directories.
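The fix amounts to packaging the TSVs at the archive root. A sketch (paths are illustrative):

```python
import tarfile

# Add each file by basename so the archive contains no leading directory.
with tarfile.open("KG-IDG.tar.gz", "w:gz") as tar:
    for filename in ("merged-kg_nodes.tsv", "merged-kg_edges.tsv"):
        tar.add(f"data/merged/{filename}", arcname=filename)
```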
Update edge types - current/example edge types appear to be from STRING
It looks as though the ATC ingest emits empty node and edge files in the most recent build.
See here for example:
https://kg-hub.berkeleybop.io/kg-idg/20220722/transformed/atc/index.html
It should emit node and edge files that aren't empty.
This problem is present in 20220722, but not in 20220701 for example - possibly a new bug?
https://kg-hub.berkeleybop.io/kg-idg/20220701/transformed/atc/index.html
I just happened upon this because we are thinking about using KG-IDG for an investigation into topological biases and graph ML.
Right now for the KG-IDG passthrough transform of GOCAMs, we are reusing the KG-COVID-19 transform of GOCAMs, which is derived from a big long SPARQL query described in this ticket.
This is fairly COVID-19 specific, so we probably want a more general SPARQL query to retrieve GOCAMs for KG-IDG.
The database startup portion of the Jenkins pipeline encounters this error:
21:44:43 + sudo /etc/init.d/postgresql start
21:44:43 * Starting PostgreSQL 12 database server
21:44:47 ...done.
21:44:47 + break
[Pipeline] sh
21:44:48 + sudo /etc/init.d/mysql start
21:44:48 * Starting MySQL database server mysqld
21:44:48 su: warning: cannot change directory to /nonexistent: No such file or directory
21:45:20 ...fail!
21:45:20 + sleep 60
[Pipeline] sh
21:46:28 + sudo /etc/init.d/postgresql status
21:46:28 12/main (port 5432): online
[Pipeline] echo
21:46:30 PostgreSQL server status:
[Pipeline] sh
21:46:30 + pg_isready -h localhost -p 5432
21:46:30 localhost:5432 - accepting connections
[Pipeline] sh
21:46:31 + sudo /etc/init.d/mysql status
21:46:31 * MySQL is stopped.
Started after merging commits 557cf81 and 0390d42. These introduce ensmallen as a new requirement, but that's Python-specific, and this happens outside the Python context.
Could the Docker image have changed? (No, it is unchanged)
In the previous run, starting the DB servers looks like this:
[2022-05-12T15:15:12.946Z] + sudo /etc/init.d/postgresql start
[2022-05-12T15:15:12.946Z] * Starting PostgreSQL 12 database server
[2022-05-12T15:15:16.175Z] ...done.
[2022-05-12T15:15:16.175Z] + break
[Pipeline] sh
[2022-05-12T15:15:16.574Z] + sudo /etc/init.d/mysql start
[2022-05-12T15:15:16.574Z] * Starting MySQL database server mysqld
[2022-05-12T15:15:16.574Z] su: warning: cannot change directory to /nonexistent: No such file or directory
[2022-05-12T15:15:21.768Z] ...done.
[2022-05-12T15:15:21.768Z] + break
[Pipeline] sh
[2022-05-12T15:15:22.167Z] + sudo /etc/init.d/postgresql status
[2022-05-12T15:15:22.167Z] 12/main (port 5432): online
[Pipeline] echo
[2022-05-12T15:15:22.350Z] PostgreSQL server status:
[Pipeline] sh
[2022-05-12T15:15:22.704Z] + pg_isready -h localhost -p 5432
[2022-05-12T15:15:22.704Z] localhost:5432 - accepting connections
[Pipeline] sh
[2022-05-12T15:15:23.108Z] + sudo /etc/init.d/mysql status
[2022-05-12T15:15:23.108Z] * /usr/bin/mysqladmin Ver 8.0.27-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))
[2022-05-12T15:15:23.108Z] Copyright (c) 2000, 2021, Oracle and/or its affiliates.
[2022-05-12T15:15:23.108Z]
[2022-05-12T15:15:23.108Z] Oracle is a registered trademark of Oracle Corporation and/or its
[2022-05-12T15:15:23.108Z] affiliates. Other names may be trademarks of their respective
[2022-05-12T15:15:23.108Z] owners.
[2022-05-12T15:15:23.108Z]
[2022-05-12T15:15:23.108Z] Server version 8.0.27-0ubuntu0.20.04.1
[2022-05-12T15:15:23.108Z] Protocol version 10
[2022-05-12T15:15:23.108Z] Connection Localhost via UNIX socket
[2022-05-12T15:15:23.108Z] UNIX socket /var/run/mysqld/mysqld.sock
[2022-05-12T15:15:23.108Z] Uptime: 7 sec
[2022-05-12T15:15:23.108Z]
[2022-05-12T15:15:23.108Z] Threads: 2 Questions: 8 Slow queries: 0 Opens: 117 Flush tables: 3 Open tables: 36 Queries per second avg: 1.142
Use ROBOT: https://robot.obolibrary.org/convert
Include PPI from STRING, as per KG-COVID-19:
https://kg-hub.berkeleybop.io/kg-covid-19/current/transformed/STRING/index.html
Filter by combined score of >700 or so
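A sketch of the filter (assuming the standard space-separated STRING protein.links file; the filename is illustrative):

```python
import pandas as pd

# Keep only high-confidence interactions, per the threshold above.
links = pd.read_csv("9606.protein.links.txt.gz", sep=" ")
high_confidence = links[links["combined_score"] > 700]
```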
Need to upload the merged KG, to be stored in a 'kg-idg' directory at https://kg-hub.berkeleybop.io/
In KG-COVID-19 this is accomplished during the Publish stage of the Jenkins build.
See https://pharos.nih.gov/faq - definite overlap with existing contents (e.g., DrugCentral) but an existing large and relevant resource
DrugCentral has a new version as of Oct 5 2021.
We currently use a pre-processed version from KG-COVID-19, but could ingest the new release and prepare a new KGX transform instead.
The following link to TCRD has broken 😑:
http://juniper.health.unm.edu/tcrd/download/TCRDv6.11.0.tsv
It may now need to be:
http://juniper.health.unm.edu/tcrd/download/PharosTCRD_UniProt_Mapping.tsv
It'll need to be updated in the download.yaml.
The source for the TCRD data dump, http://juniper.health.unm.edu/tcrd/download/ , has been down (or at least inconsistently available) for more than a week. In the meantime, I've explicitly set the download URL to a cached version, but it's unclear where to retrieve the full data dump otherwise. The TCRD site itself (http://juniper.health.unm.edu/tcrd/) is also unreachable.
Links on Pharos (https://pharos.nih.gov/ , see bottom of page) point to the same location.
This happened during the most recent build:
18:56:53 Exporting data_type from temporary_database to data/transformed/tcrd/tcrd-data_type.tsv...
18:56:53 Complete - wrote 5.
18:56:53 Exporting info_type from temporary_database to data/transformed/tcrd/tcrd-info_type.tsv...
18:56:53 Complete - wrote 28.
18:56:53 Exporting xref_type from temporary_database to data/transformed/tcrd/tcrd-xref_type.tsv...
18:56:53 Complete - wrote 21.
18:56:53 Exporting protein from temporary_database to data/transformed/tcrd/tcrd-protein.tsv...
18:56:53 Complete - wrote 20412.
18:56:53 Transforming to data/transformed/tcrd using source in kg_idg/transform_utils/tcrd/tcrd-protein.yaml
18:56:53 Parsing data/raw/proteinatlas.tsv.zip
18:56:53 Transforming using source in kg_idg/transform_utils/hpa/hpa-data.yaml
18:56:53 Parsing data/raw/string_nodes.tsv
18:56:53 Updating Biolink categories in string_nodes.tsv
18:56:53 [KGX][cli_utils.py][ transform_source] INFO: Processing source 'string_nodes.tsv'
18:58:14 Parsing data/raw/string_edges.tsv
18:58:14 Parsing edges in string_edges.tsv
18:58:14 Applying confidence threshold of 700 to STRING
18:58:14 [KGX][cli_utils.py][ transform_source] INFO: Processing source 'string_edges.tsv'
19:12:05 WARNING:koza.model.config.source_config:Could not load dataset description from metadata file
19:12:05 ERROR:root:Encountered error: _pydantic_post_init() got an unexpected keyword argument 'provided_by'
19:12:05 Parsing data/raw/atc.csv.gz
19:12:05 Transforming using source in kg_idg/transform_utils/atc/atc-classes.yaml
I suspect this is biolink_model_pydantic becoming more...pydantic w.r.t. the provided_by keyword, so the current fix may be to pin to a previous working version.
Ingested edges with the "is opposite of" annotation (http://purl.obolibrary.org/obo/RO_0002604) should be included while retaining this property, though at the moment this is not handled distinctly from any other predicate. It would be great to have opposites, though!
See the current Ubergraph approach here: https://github.com/INCATools/ubergraph/blob/f078fd8969bff875127efd118783324ae631c6e4/Makefile#L60-L62
As per discussion with Tudor, Jeremy, Chris et al. on Nov 11 IDG call:
Consider using DrugCentral unique IDs as primary drug IDs instead of CHEBI IDs.
This may mean more nodes get merged; we'll consider that a feature as we may not require the level of specificity that CHEBI applies to its entries.
During the merge, KG-IDG produces a large number of "node id [id] has no CURIE prefix" and "Invalid predicate CURIE" errors, though it isn't immediately clear which source these are from.
$ python3 run.py download
$ python3 run.py transform
$ python3 run.py merge 2> merge_out.log
$ sort merge_out.log | uniq -c | sort -n
...
3 Warning: node id http://omim.org/entry/606689 has no CURIE prefix
3 Warning: node id http://omim.org/entry/613364 has no CURIE prefix
3 Warning: node id http://omim.org/entry/615383 has no CURIE prefix
3 Warning: node id http://omim.org/entry/617704 has no CURIE prefix
5 Invalid predicate CURIE 'owl:versionIRI'? Ignoring...
10 Invalid predicate CURIE 'biolink:RegulateprocessToProcess'? Ignoring...
14 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#narrowMatch'? Ignoring...
29 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#broadMatch'? Ignoring...
48 Invalid predicate CURIE 'rdfs:isDefinedBy'? Ignoring...
64 Invalid predicate CURIE 'rdfs:seeAlso'? Ignoring...
75 Invalid predicate CURIE 'biolink:NegativelyRegulateprocessToProcess'? Ignoring...
160 Invalid predicate CURIE 'owl:disjointWith'? Ignoring...
16468 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#closeMatch'? Ignoring...
71842 Invalid predicate CURIE ':http://www.w3.org/2004/02/skos/core#exactMatch'? Ignoring...
Most of the node id prefix errors (not shown) appear to be from OMIM, e.g.:
2 Warning: node id http://www.omim.org/phenotypicSeries/PS619142 has no CURIE prefix
This will require some forensics to identify:
AND/OR
Is this an expected part of how a KGX merge operates?
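If the URL-style ids do need fixing on our side, a normalization sketch for the OMIM cases might look like this (the CURIE prefixes chosen here, especially for phenotypic series, are assumptions):

```python
def to_curie(node_id: str) -> str:
    """Map URL-style OMIM node ids to CURIEs; pass others through."""
    prefix_map = {
        "http://omim.org/entry/": "OMIM:",
        "http://www.omim.org/phenotypicSeries/": "OMIMPS:",  # assumed prefix
    }
    for url_prefix, curie_prefix in prefix_map.items():
        if node_id.startswith(url_prefix):
            return curie_prefix + node_id[len(url_prefix):]
    return node_id
```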
As of the most recent commit:
INFO: pip is looking at multiple versions of neo4j to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of docutils to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of kgx to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of biolink-model-pydantic to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of <Python from Requires-Python> to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of kg-idg to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement linkml-validator>=0.1.0 (from koza) (from versions: none)
ERROR: No matching distribution found for linkml-validator>=0.1.0
So this may just be a matter of pinning to a specific koza version.
uPheno is available as OWL: http://www.obofoundry.org/ontology/upheno.html
The current code in the Jenkinsfile relies upon a combination of https://github.com/Knowledge-Graph-Hub/go-site/blob/master/scripts/directory_indexer.py and setting up local directories to be copied to S3, but at minimum it should also (ideally as a modified version of the directory_indexer) index directories/files that are on the bucket but not present locally.
Given a bucket and a directory name, this should just make an index in that directory. In the current Jenkins setup this may mean running another s3cmd put to upload the new index.
In the slightly longer term, the entire process of indexing and uploading can be its own modular script.
See https://github.com/Knowledge-Graph-Hub/kg-obo/blob/552192a22e47b2a0a0860a509fed32983e079a96/kg_obo/upload.py#L253
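In the meantime, a sketch of the bucket-side indexing idea (assuming boto3 and a listable bucket; not the existing directory_indexer code):

```python
import boto3

def make_index(bucket: str, prefix: str) -> str:
    """Build an index.html body from the remote listing alone, so files
    present only on the bucket still get indexed."""
    s3 = boto3.client("s3")
    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix
        )
        for obj in page.get("Contents", [])
    ]
    items = "\n".join(f'<li><a href="/{key}">{key}</a></li>' for key in keys)
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```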
Orphanet and OMIM data currently come from the Monarch data archives, but those use N-Triples.
These aren't being actively updated.
Instead, do one or more of the following:
See new behavior as per https://koza.monarchinitiative.org/Usage/configuring_ingests/
In order for Ensmallen to find the graph data correctly, the graphs need to have a base filename of "merged-kg" instead of "KG-IDG".
Example: in this directory
https://kg-hub.berkeleybop.io/kg-idg/current/
The KGX TSV tar.gz has these files inside of it:
KG-IDG_edges.tsv
KG-IDG_nodes.tsv
These should be:
merged-kg_edges.tsv
merged-kg_nodes.tsv
I think this should be a matter of just changing this line in the merge.yaml.
Also, we need to change the KGX TSV tar.gz's in the existing builds (https://kg-hub.berkeleybop.io/kg-idg/) to conform to this.
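For the existing builds, the members of each archive could be renamed in place with something like this sketch (filenames assumed from the listing above):

```python
import tarfile

# Repackage an old build so its members use the merged-kg basename.
with tarfile.open("KG-IDG.tar.gz") as old, \
     tarfile.open("KG-IDG-fixed.tar.gz", "w:gz") as new:
    for member in old.getmembers():
        fileobj = old.extractfile(member) if member.isfile() else None
        member.name = member.name.replace("KG-IDG_", "merged-kg_")
        new.addfile(member, fileobj)
```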
TODO:
It would be useful for Drug Central to have inferred drug -> drug target edges, which we could formulate as a graph ML link prediction task.
We maybe could/should do this in an automated way, on each build of KG-IDG.
A possible roadmap:
Per convo with Tudor et al. on the IDG call just now:
As per discussion with Tudor, Jeremy et al. on Nov 11 IDG call:
Should verify that the following are being ingested:
The atc transform fails Koza validation, leading to a broken build.
See Jenkins build 63 from Aug 1 2022.
Stack trace:
15:47:50 WARNING:biolink_model_pydantic.model:ATC: does not have a local identifier
15:47:50 ERROR:koza.app:Validation error while processing: {'Class ID': 'http://purl.bioontology.org/ontology/ATC/J05AJ03', 'Preferred Label': 'dolutegravir', 'Synonyms': '', 'Definitions': '', 'Obsolete': 'false', 'CUI': 'C3253985', 'Semantic Types': 'http://purl.bioontology.org/ontology/STY/T109|http://purl.bioontology.org/ontology/STY/T121', 'Parents': '', 'ATC LEVEL': '5', 'Is Drug Class': '', 'Semantic type UMLS property': 'http://purl.bioontology.org/ontology/STY/T109|http://purl.bioontology.org/ontology/STY/T121'}
15:47:50 ERROR:root:Encountered error: 1 validation error for NamedThing
15:47:50 iri
15:47:50 string does not match regex "^(http|ftp)" (type=value_error.str.regex; pattern=^(http|ftp))
15:47:50 Parsing data/raw/atc.csv.gz
15:47:50 Transforming using source in kg_idg/transform_utils/atc/atc-classes.yaml
15:47:50 koza_apps entry created for: atc-classes
15:47:50 koza_app: <koza.app.KozaApp object at 0x7f9ab7026bb0>
Not certain what the issue is, as I don't immediately see anything wrong with the entry.