biolink / kgx

KGX is a Python library for exchanging Knowledge Graphs
Home Page: https://kgx.readthedocs.io
License: BSD 3-Clause "New" or "Revised" License
That would be useful for installing the CLI, at least.
The current setup.py requires Python 3. We may need to add a "pip3" option for compatibility.
Some KGs will provide edge IDs, others won't. Should we always provide one, e.g. by hashing? Hashing the SPO alone will not work, as we allow separate edges with the same SPO but different semantics.
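One option (a sketch, not a committed design): hash the SPO together with whatever edge properties distinguish the statements, e.g. provenance, so two edges sharing an SPO but differing in semantics still get distinct IDs. The property names used below (`provided_by`) are illustrative assumptions.

```python
import hashlib

def edge_id(subject, predicate, obj, **props):
    """Derive a stable edge ID from SPO plus any distinguishing properties.

    Hashing SPO alone would collide for edges that share SPO but differ in
    semantics, so extra properties (e.g. provenance) are folded into the key.
    """
    key = '|'.join([subject, predicate, obj] +
                   ['{}={}'.format(k, props[k]) for k in sorted(props)])
    return hashlib.sha256(key.encode('utf-8')).hexdigest()[:16]

# Two edges with the same SPO but different provenance get different IDs:
a = edge_id('HGNC:1', 'biolink:related_to', 'MONDO:2', provided_by='clinvar')
b = edge_id('HGNC:1', 'biolink:related_to', 'MONDO:2', provided_by='omim')
```

Because the properties are sorted before hashing, the ID is deterministic regardless of keyword order.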
currently failing due to lack of pandas
It may be simpler to do this in the .travis.yml via pip install -r requirements.txt
The current config is intended to do a fast binary install of scipy, except that isn't actually needed.
In the alpha medikanren instance, which pulls in Monarch Lite, phenotypes are lacking for many types of diabetes, including for example MODY11.
But we definitely have these:
https://monarchinitiative.org/disease/MONDO:0013242#phenotypes
So either
Currently our CSV export would require parsing if we want to load nodes and edges into a fresh Neo4j instance via the neo4j-admin import utility.
I just started playing around with Dask; it appears to be much faster than pandas and will work with much larger files. It has a similar API to the pandas DataFrame, and while it can't do everything that pandas can do, it probably fits our needs.
Similar logic to #1
Possibly two: one for bulk loads, the other for appends
Especially when the endpoint is a database (e.g. via bolt, sparql) rather than a file or set of files.
use cases:
The BdBag would contain individual files representing slices of the graph in an agreed-upon format (NCATS-Tangerine/translator-knowledge-graph#6). The KG could be sliced in any number of ways, but a standard way would be by source.
cc @stevencox
Related to pitch 1 from @saramsey
Can we easily summarize the contents of a KG and how they may differ from others? If we are all using the same types then we can multiplex a Cypher aggregate query to each source. We could also download all sources locally and compare that way.
KGX requires click==6.7, whereas BiolinkMG requires click>=7.0.
Upgrade networkx to 2.x
I'm the maintainer of PyBEL which compiles knowledge stored in Biological Expression Language into a NetworkX MultiDiGraph. I don't think it would be so hard to transform the graphs it produces to match your schema so you could incorporate them. There's quite a bit of publicly available content that only exists in BEL, and we're also working at converting several biological knowledge repositories to BEL automatically (see: Bio2BEL).
I would be happy to try and make a PR if you can point me in the right direction where to get started! If you're unaware of BEL I can share some more resources and examples too :)
This could happen in a number of different ways
Will use click framework
dump command:
see tests for details
So far we can filter on the subject and object categories, and the edge label. This means we're only constraining the edge set and not the node set. You might get only a handful of relations, but then also get every node in the database. That seems incorrect.
My expectation is that every edge in the edge set should have its nodes represented in the node set, and that there should not be nodes in the node set that are not connected by edges in the edge set.
That being said, if we later add filters for the node set then maybe users will want to be able to download node sets and edge sets independently? Maybe they'll want to download only the nodes, or only the edges.
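A minimal sketch of the expected behavior: filter the edge set, then induce the node set from the surviving edges (the in-memory dict shapes and the `edge_label` filter value here are assumptions, not KGX's actual internals).

```python
# Hypothetical in-memory representation: nodes keyed by ID, edges as dicts.
nodes = {
    'A': {'category': 'gene'},
    'B': {'category': 'disease'},
    'C': {'category': 'gene'},   # should be dropped: no surviving edge touches it
}
edges = [
    {'subject': 'A', 'object': 'B', 'edge_label': 'causes_condition'},
    {'subject': 'C', 'object': 'A', 'edge_label': 'interacts_with'},
]

# Filter the edge set, then keep only nodes incident to a kept edge.
kept_edges = [e for e in edges if e['edge_label'] == 'causes_condition']
kept_ids = {e['subject'] for e in kept_edges} | {e['object'] for e in kept_edges}
kept_nodes = {i: nodes[i] for i in kept_ids}
```

With this rule, node 'C' disappears along with its filtered-out edge, which matches the expectation that no node in the node set is unconnected in the edge set.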
I believe the code for this is basically done, just need to add something to the README that shows how it's done - use mondo + uberon + go as example (complete versions).
This could be done as two calls
The /concepts/data/{queryId} and /statements/data/{queryId} data access endpoints in the Translator Knowledge Beacon Aggregator generate JSON output documenting knowledge nodes and edges, which could be harvested into a Translator Knowledge Graph standard graph.
cc @stuppie
Use case: instantiate an instance of the translator knowledge graph (behind firewall, etc) and merge in observational data, so we have patient encounters linking up to HPO, MONDO, OBA nodes etc.
Relies on having a loinc2hpo mapping
Currently RDF sources and the monarch-full graph use URIs for node and edge properties and types. We hardcode a mapping to shortforms in the standard:
# TODO: use JSON-LD context
mapping = {
    'subject': OBAN.association_has_subject,
    'object': OBAN.association_has_object,
    'predicate': OBAN.association_has_predicate,
    'comment': RDFS.comment,
    'name': RDFS.label
}
This should instead come from a JSON-LD context generated from the Biolink Model.
cc @balhoff
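A sketch of what replacing the hardcoded dict could look like: load a JSON-LD-style context and derive the shortform-to-IRI mapping from it. The context content below is illustrative, not the real generated Biolink context.

```python
import json

# Illustrative context fragment; the real one would be generated from biolink.
context = json.loads("""
{
  "@context": {
    "subject": "http://purl.org/oban/association_has_subject",
    "object": "http://purl.org/oban/association_has_object",
    "predicate": "http://purl.org/oban/association_has_predicate",
    "comment": "http://www.w3.org/2000/01/rdf-schema#comment",
    "name": "http://www.w3.org/2000/01/rdf-schema#label"
  }
}
""")

# shortform -> full IRI, replacing the hardcoded mapping above;
# keyword entries (anything starting with "@") are skipped.
mapping = {k: v for k, v in context["@context"].items()
           if isinstance(v, str) and not k.startswith("@")}
```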
sramsey-laptop:kgx sramsey$ kgx neo4j-download bolt://localhost neo4j [REDACTED] test.out
/usr/local/lib/python3.6/site-packages/cachier/mongo_core.py:24: UserWarning: Cachier warning: pymongo was not found. MongoDB cores will not work.
"Cachier warning: pymongo was not found. MongoDB cores will not work.")
Error: must be str, not dict_keys
I got this error running the latest code for kgx on a MBP (OS 10.10) under python 3.5
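For what it's worth, that TypeError is what Python 3 raises when a dict_keys view is used where a str is expected, e.g. in string concatenation. I haven't traced the exact call site in kgx, so the snippet below is just a reproduction of the error pattern and the usual fix:

```python
d = {'id': 1, 'name': 2}

# Reproduces the error class seen above (exact wording varies by Python
# version; 3.5/3.6 say "must be str, not dict_keys"):
try:
    "columns: " + d.keys()
except TypeError as e:
    msg = str(e)

# Fix: convert the view explicitly before using it in string context.
fixed = "columns: " + ", ".join(d.keys())
```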
Use case: reasoner Z needs to bring a slice of central KG into their local KG. They want to set up a CI job to do this. They run the kgx export dump command (#10) via docker run
We'd like to experiment with slurping slices of the X-ray graph (already conformant with tkg standard), and if they are amenable, allowing them to slurp slices in.
This should turn out to be a generic cypher call, no need to transform as in the monarch case
def parse(self, filename: str = None, format: str = None):
    """
    Parse a file into a graph, using rdflib
    """
    rdfgraph = rdflib.Graph()
    if format is None:
        if filename.endswith(".ttl"):
            format = 'turtle'
        elif filename.endswith(".rdf"):
            format = 'xml'
    rdfgraph.parse(filename, format=format)
    # TODO: use source from RDF
    self.graph_metadata['provided_by'] = filename
    # self.load_edges(rdfgraph)
Is this last line supposed to be commented out? This makes the class malfunction.
We'd like to be able to do semantic diffs to compare subsets of two KGs
There are a number of different mappings to be explored. The data model is a basic property graph. Nodes and edges can be annotated with a simple key-val dict, where the val is either a literal or list of literals.
One possibility is to represent nodes and edges as fixed-length tuples where each position is hardwired to a property, e.g. (node ID NAME DESCRIPTION ...)* (edge PREDICATE SUBJECT OBJECT PUBLICATIONS ...)*, but this is too rigid.
Option 1:
(node ID)
(node-property-value ID P V) ## V is atom or list of atoms
(edge PRED SUBJ OBJ)
(edge-property-value PRED SUBJ OBJ P V)
this uses reification for the edges; a modification is to use an edge ID
Option 2:
(node ID (list (P1 V1) (P2 V2) ..))
(edge PRED SUBJ OBJ (list (P1 V1) (P2 V2) ..))
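A rough Python sketch of serializing a property graph into Option 1 tuples; the in-memory dict/tuple shapes it consumes are assumptions for illustration.

```python
def to_option1(nodes, edges):
    """Flatten a property graph into Option-1 style tuples.

    nodes: {id: {prop: value}}
    edges: [(pred, subj, obj, {prop: value})]
    """
    rows = []
    for nid, props in nodes.items():
        rows.append(('node', nid))
        for p, v in props.items():
            # V may be an atom or a list of atoms, per Option 1
            rows.append(('node-property-value', nid, p, v))
    for pred, subj, obj, props in edges:
        rows.append(('edge', pred, subj, obj))
        for p, v in props.items():
            rows.append(('edge-property-value', pred, subj, obj, p, v))
    return rows
```

Note this inherits Option 1's reification: an edge's properties repeat the full (PRED SUBJ OBJ) key, which is where the edge-ID modification would help.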
CSV files are at the link below; there will be a new version with statement id instead of predicate type.
The label array ['named_thing'] gets parsed as the string "['named_thing']" when parsing in PandasTransformer load_node and load_edge. Use ast.literal_eval.
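For concreteness, a minimal sketch of the suggested fix (the helper name is mine, not an existing KGX function):

```python
import ast

raw = "['named_thing']"            # what the CSV round-trip currently yields
labels = ast.literal_eval(raw)     # back to a real list

def parse_maybe_list(value):
    """Parse stringified lists back into lists; leave plain strings alone."""
    if isinstance(value, str) and value.startswith('['):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value           # malformed literal: keep the raw string
    return value
```

ast.literal_eval only evaluates Python literals, so unlike eval it is safe on untrusted CSV content.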
Host OS: Ubuntu
Installing inside a Docker container
Installation command:
pip3 install -r requirements.txt
/mnt/data/kgx/kgx$ uname -a
Linux ip-172-31-43-220 4.4.0-1060-aws #69-Ubuntu SMP Sun May 20 13:42:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Collecting pytest_logging>=0.0 (from -r requirements.txt (line 9))
Using cached https://files.pythonhosted.org/packages/dc/1e/fb11174c9eaebcec27d36e9e994b90ffa168bc3226925900b9dbbf16c9da/pytest-logging-2015.11.4.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-niimt31p/pytest-logging/setup.py", line 34, in <module>
exec(rfh.read(), None, _LOCALS) # pylint: disable=exec-used
File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 105: ordinal not in range(128)
@cmungall @deepakunni3 @kshefchek
The workflow is in rdf_to_json.py, which loads owl files and the turtle files, parses the turtle files (using the owl files to find categories for nodes), and saves the resulting networkx graph as json files. I'm using a branch where I refactored/rewrote the RDF transformers into new files: rdf_transformer2.py and utils/rdf_utils.py.
I wrote a count.py to get the edge and node frequencies and the examples of uncategorized nodes from the resulting json files.
It seems the majority of clinvar edges have an uncategorized node. Is there some other owl file I should be using?
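Roughly what my count.py does, as a reconstructed sketch over assumed dict shapes (not the actual script):

```python
from collections import Counter

def summarize(nodes, edges):
    """Tally node categories and (subject cat, predicate, object cat) patterns.

    nodes: {id: {'category': ...}}
    edges: [{'subject': ..., 'predicate': ..., 'object': ...}]
    """
    node_freq = Counter(d.get('category') for d in nodes.values())
    edge_freq = Counter(
        (nodes.get(e['subject'], {}).get('category'),   # None if uncategorized
         e['predicate'],
         nodes.get(e['object'], {}).get('category'))
        for e in edges
    )
    return node_freq, edge_freq
```

Uncategorized endpoints show up as None in the patterns, which is how the None rows in the tables below arise.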
@cmungall I just noticed that the two most common edge types for clinvar are has_uncertain_significance_for_condition and has_uncertain_significance_for_condition, suggesting that the uncategorized objects are actually diseases. Am I supposed to be inferring node categories from the domain and range of the predicates they're adjacent to?
I am using go.owl, so.owl, ordo.owl, mondo.owl, and hp.owl, and the latest results were derived from using all owl files on all turtle files.
owl:Restriction and owl:Axiom aren't the result of a bug in my category-finding method.

rdf_to_json.py with all owl files loaded and used on each turtle file. Comment: significant improvement to clinvar.ttl, though I also added ordo.owl, so these improvements may have solely been from that.

+-------------------------------------+---------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+-------------------------------------+---------------------------------------------+-----------+
| http://www.ncbi.nlm.nih.gov/medgen | http://www.ncbi.nlm.nih.gov/medgen/CN517202 | 1919 |
| http://purl.obolibrary.org/obo/OMIM | http://purl.obolibrary.org/obo/OMIM_612042 | 232 |
| | ClinVarVariant:242567,242568 | 197 |
| http://www.orpha.net/ORDO/Orphanet | http://www.orpha.net/ORDO/Orphanet_98301 | 3 |
+-------------------------------------+---------------------------------------------+-----------+
+------------------+-----------+
| Category | Frequency |
+------------------+-----------+
| sequence feature | 411841 |
| disease | 5181 |
| None | 2351 |
| GENO:0000871 | 288 |
| GENO:0000847 | 220 |
+------------------+-----------+
+------------------+--------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+--------------+-----------------+-----------+
| sequence feature | GENO:0000845 | None | 174302 |
| sequence feature | GENO:0000844 | None | 106606 |
| sequence feature | GENO:0000845 | disease | 94964 |
| sequence feature | GENO:0000840 | disease | 70360 |
| sequence feature | GENO:0000843 | None | 63270 |
| sequence feature | GENO:0000844 | disease | 45998 |
| sequence feature | GENO:0000840 | None | 37772 |
| sequence feature | GENO:0000843 | disease | 25795 |
| sequence feature | GENO:0000841 | disease | 21672 |
| sequence feature | GENO:0000841 | None | 19869 |
| GENO:0000871 | GENO:0000840 | disease | 223 |
| GENO:0000847 | GENO:0000840 | disease | 206 |
+------------------+--------------+-----------------+-----------+
+-----------------------------------------+---------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+-----------------------------------------+---------------------------------------------+-----------+
| http://purl.obolibrary.org/obo/MESH | http://purl.obolibrary.org/obo/MESH_D000855 | 590 |
| http://purl.obolibrary.org/obo/DOID | http://purl.obolibrary.org/obo/DOID_411 | 315 |
| http://purl.obolibrary.org/obo/OMIM | http://purl.obolibrary.org/obo/OMIM_610380 | 219 |
| http://purl.obolibrary.org/obo/HP | http://purl.obolibrary.org/obo/HP_0004929 | 87 |
| http://purl.obolibrary.org/obo/DECIPHER | http://purl.obolibrary.org/obo/DECIPHER_57 | 47 |
| http://www.orpha.net/ORDO/Orphanet | http://www.orpha.net/ORDO/Orphanet_2058 | 18 |
+-----------------------------------------+---------------------------------------------+-----------+
+-----------+-----------+
| Category | Frequency |
+-----------+-----------+
| disease | 12773 |
| phenotype | 8388 |
| None | 1276 |
+-----------+-----------+
+------------------+-----------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+-----------------+-----------------+-----------+
| disease | has phenotype | phenotype | 236210 |
| None | has phenotype | phenotype | 31721 |
| disease | has disposition | HP:0000005 | 7405 |
| disease | has disposition | HP:0031797 | 1544 |
| disease | has phenotype | None | 909 |
| phenotype | has phenotype | phenotype | 792 |
| disease | has disposition | HP:0012823 | 619 |
| None | has phenotype | None | 277 |
| disease | has phenotype | HP:0031797 | 269 |
| None | has disposition | HP:0000005 | 176 |
+------------------+-----------------+-----------------+-----------+
+-------------------------------------------------+---------------------------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+-------------------------------------------------+---------------------------------------------------------------+-----------+
| https://www.ncbi.nlm.nih.gov/gene | https://www.ncbi.nlm.nih.gov/gene/117188 | 987 |
| https://monarchinitiative.org/.well-known/genid | https://monarchinitiative.org/.well-known/genid/feature616964 | 6 |
+-------------------------------------------------+---------------------------------------------------------------+-----------+
+------------------+-----------+
| Category | Frequency |
+------------------+-----------+
| phenotype | 6296 |
| gene | 4146 |
| None | 993 |
| sequence feature | 245 |
+------------------+-----------+
+------------------+------------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+------------------+-----------------+-----------+
| gene | causes condition | phenotype | 4652 |
| gene | RO:0002326 | phenotype | 1089 |
| None | causes condition | phenotype | 537 |
| None | RO:0002326 | phenotype | 353 |
| sequence feature | causes condition | phenotype | 157 |
| gene | RO:0002607 | phenotype | 108 |
+------------------+------------------+-----------------+-----------+
+------------------------------------+-------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+------------------------------------+-------------------------------------------+-----------+
| http://www.orpha.net/ORDO/Orphanet | http://www.orpha.net/ORDO/Orphanet_449306 | 33 |
+------------------------------------+-------------------------------------------+-----------+
+------------------+-----------+
| Category | Frequency |
+------------------+-----------+
| sequence feature | 6033 |
| disease | 3658 |
| gene | 815 |
+------------------+-----------+
+------------------+------------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+------------------+-----------------+-----------+
| sequence feature | causes condition | disease | 5792 |
| gene | RO:0003304 | disease | 909 |
| gene | RO:0002607 | disease | 332 |
| sequence feature | causes condition | phenotype | 198 |
+------------------+------------------+-----------------+-----------+
At the last hackathon I tried to load https://data.monarchinitiative.org/ttl/clinvar.ttl (which is 1.2 GB) into a networkx graph with ObanRdfTransformer, and got an exception that I believe means I ran out of memory.
@cmungall Do you know whether anyone else has tried this? Might the problem be with my computer, or is networkx using too much memory, or is it something else?
How can we ensure that KGX will work for large files?
In theory this should just be a generic bolt/cypher call, if the group already conforms to http://bit.ly/tr-kg-standard
Ran into this issue while working with Red's knowledge graph.
If I run a SPARQL query that involves property path expansion, it results in connection timeouts.
Not sure if there is a better approach to this or if the query I am using can be improved.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bioentity: <http://bioentity.io/vocab/>
SELECT * WHERE {
?subject ?predicate ?object .
?subject rdfs:subClassOf* bioentity:Association
}
LIMIT 5
@cmungall @amalic @vemonet Would be good to get your inputs on this.
Error: Node(1335) already exists with label `named_thing` and property `id` = 'NCBIGene:25236'
Should use MERGE rather than CREATE. Also, should SET properties ON CREATE and only use id to identify a node.
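A sketch of the suggested Cypher (label and property names follow the error message above; the commented-out driver call is hypothetical):

```python
# CREATE fails on a second load of the same node:
create_query = "CREATE (n:named_thing {id: $id}) SET n += $props"

# MERGE matches on id only, and sets the remaining properties only when the
# node is first created, so repeated loads of the same node are idempotent:
merge_query = (
    "MERGE (n:named_thing {id: $id}) "
    "ON CREATE SET n += $props"
)
# e.g. session.run(merge_query, id='NCBIGene:25236', props={...})
```

Keeping only `id` inside the MERGE pattern is what avoids the duplicate-node error: a MERGE on the full property map would create a second node whenever any property differs.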
This should implement a mapping that is a subset of the scigraph one here
https://github.com/SciGraph/SciGraph/wiki/Neo4jMapping
basically, class axioms
This would be similar to the existing RDF export but instead of using OBAN reification, edge properties would map to the named graph. The results could be published using the nanopub framework.
advice sought from @micheldumontier and @amalic
@putmantime made a start here #29
The validator may work by connecting to an instance (e.g. via Bolt) OR by downloading using the kgx script and running checks locally / in-memory, e.g. using networkx.
Looks like a TODO: load it from a config file?