biolink / kgx

KGX is a Python library for exchanging Knowledge Graphs
Home Page: https://kgx.readthedocs.io
License: BSD 3-Clause "New" or "Revised" License
That would be useful for installing the CLI, at least.
The current setup.py requires Python 3. We may need to add a "pip3" option for compatibility.
Some KGs will provide edge IDs, others won't. Should we always provide one, e.g. by hashing? Hashing the SPO alone will not work, as we allow separate edges with the same SPO but different semantics.
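One option (a sketch, not a committed design): hash the SPO together with whatever edge properties distinguish the statements, e.g. provenance, so two edges sharing an SPO but differing in semantics still get distinct IDs. The property names used below (`provided_by`) are illustrative assumptions.

```python
import hashlib

def edge_id(subject, predicate, obj, **props):
    """Derive a stable edge ID from SPO plus any distinguishing properties.

    Hashing SPO alone would collide for edges that share SPO but differ in
    semantics, so extra properties (e.g. provenance) are folded into the key.
    """
    key = '|'.join([subject, predicate, obj] +
                   ['{}={}'.format(k, props[k]) for k in sorted(props)])
    return hashlib.sha256(key.encode('utf-8')).hexdigest()[:16]

# Two edges with the same SPO but different provenance get different IDs:
a = edge_id('HGNC:1', 'biolink:related_to', 'MONDO:2', provided_by='clinvar')
b = edge_id('HGNC:1', 'biolink:related_to', 'MONDO:2', provided_by='omim')
```

Because the properties are sorted before hashing, the ID is deterministic regardless of keyword order.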
currently failing due to lack of pandas
It may be simpler to do this in the .travis.yml via pip install -r requirements.txt
The current config is intended to do a fast binary install of scipy, except that isn't actually needed.
In the alpha medikanren instance, which pulls in Monarch Lite, phenotypes are lacking for many types of diabetes, including for example MODY11.
But we definitely have these:
https://monarchinitiative.org/disease/MONDO:0013242#phenotypes
So either
Currently our CSV export would require parsing if we want to load nodes and edges into a fresh Neo4j instance via the neo4j-admin import utility.
I just started playing around with Dask; it appears to be much faster than pandas and will work with much larger files. It has a similar API to the pandas DataFrame, and while it can't do everything that pandas can do, it probably fits our needs.
Similar logic to #1
Possibly two: one for bulk loads, the other for appends
Especially when the endpoint is a database (e.g. via bolt, sparql) rather than a file or set of files.
use cases:
The BdBag would contain individual files representing slices of the graph in an agreed-upon format (NCATS-Tangerine/translator-knowledge-graph#6). The KG could be sliced in any number of ways, but a standard way would be by source.
cc @stevencox
Related to pitch 1 from @saramsey
Can we easily summarize the contents of a KG and how they may differ from others? If we are all using the same types then we can multiplex a Cypher aggregate query to each source. We could also download all sources locally and compare that way.
KGX requires click==6.7, whereas BiolinkMG requires click>=7.0.
Upgrade networkx to 2.x
I'm the maintainer of PyBEL which compiles knowledge stored in Biological Expression Language into a NetworkX MultiDiGraph. I don't think it would be so hard to transform the graphs it produces to match your schema so you could incorporate them. There's quite a bit of publicly available content that only exists in BEL, and we're also working at converting several biological knowledge repositories to BEL automatically (see: Bio2BEL).
I would be happy to try and make a PR if you can point me in the right direction where to get started! If you're unaware of BEL I can share some more resources and examples too :)
This could happen in a number of different ways
Will use click framework
dump command:
see tests for details
So far we can filter on the subject and object categories, and the edge label. This means we're only constraining the edge set and not the node set. You might get only a handful of relations, but then also get every node in the database. That seems incorrect.
My expectation is that every edge in the edge set should have its nodes represented in the node set, and that there should not be nodes in the node set that are not connected by edges in the edge set.
That being said, if we later add filters for the node set then maybe users will want to be able to download node sets and edge sets independently? Maybe they'll want to download only the nodes, or only the edges.
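A minimal sketch of the expected behavior: filter the edge set, then induce the node set from the surviving edges (the in-memory dict shapes and the `edge_label` filter value here are assumptions, not KGX's actual internals).

```python
# Hypothetical in-memory representation: nodes keyed by ID, edges as dicts.
nodes = {
    'A': {'category': 'gene'},
    'B': {'category': 'disease'},
    'C': {'category': 'gene'},   # should be dropped: no surviving edge touches it
}
edges = [
    {'subject': 'A', 'object': 'B', 'edge_label': 'causes_condition'},
    {'subject': 'C', 'object': 'A', 'edge_label': 'interacts_with'},
]

# Filter the edge set, then keep only nodes incident to a kept edge.
kept_edges = [e for e in edges if e['edge_label'] == 'causes_condition']
kept_ids = {e['subject'] for e in kept_edges} | {e['object'] for e in kept_edges}
kept_nodes = {i: nodes[i] for i in kept_ids}
```

With this rule, node 'C' disappears along with its filtered-out edge, which matches the expectation that no node in the node set is unconnected in the edge set.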
I believe the code for this is basically done, just need to add something to the README that shows how it's done - use mondo + uberon + go as example (complete versions).
This could be done as two calls
The /concepts/data/{queryId} and /statements/data/{queryId} data access endpoints in the Translator Knowledge Beacon Aggregator generate JSON output documenting knowledge nodes and edges, which could be harvested into a Translator Knowledge Graph standard graph.
cc @stuppie
Use case: instantiate an instance of the translator knowledge graph (behind firewall, etc) and merge in observational data, so we have patient encounters linking up to HPO, MONDO, OBA nodes etc.
Relies on having a loinc2hpo mapping
Currently RDF sources and the monarch-full graph use URIs for node and edge properties and types. We hardcode a mapping to shortforms in the standard:
# TODO: use JSON-LD context
mapping = {
    'subject': OBAN.association_has_subject,
    'object': OBAN.association_has_object,
    'predicate': OBAN.association_has_predicate,
    'comment': RDFS.comment,
    'name': RDFS.label
}
This should instead come from a JSON-LD context generated from the Biolink Model.
cc @balhoff
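A sketch of what replacing the hardcoded dict could look like: load a JSON-LD-style context and derive the shortform-to-IRI mapping from it. The context content below is illustrative, not the real generated Biolink context.

```python
import json

# Illustrative context fragment; the real one would be generated from biolink.
context = json.loads("""
{
  "@context": {
    "subject": "http://purl.org/oban/association_has_subject",
    "object": "http://purl.org/oban/association_has_object",
    "predicate": "http://purl.org/oban/association_has_predicate",
    "comment": "http://www.w3.org/2000/01/rdf-schema#comment",
    "name": "http://www.w3.org/2000/01/rdf-schema#label"
  }
}
""")

# shortform -> full IRI, replacing the hardcoded mapping above;
# keyword entries (anything starting with "@") are skipped.
mapping = {k: v for k, v in context["@context"].items()
           if isinstance(v, str) and not k.startswith("@")}
```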
sramsey-laptop:kgx sramsey$ kgx neo4j-download bolt://localhost neo4j [REDACTED] test.out
/usr/local/lib/python3.6/site-packages/cachier/mongo_core.py:24: UserWarning: Cachier warning: pymongo was not found. MongoDB cores will not work.
"Cachier warning: pymongo was not found. MongoDB cores will not work.")
Error: must be str, not dict_keys
I got this error running the latest code for kgx on a MBP (OS 10.10) under python 3.5
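For what it's worth, that TypeError is what Python 3 raises when a dict_keys view is used where a str is expected, e.g. in string concatenation. I haven't traced the exact call site in kgx, so the snippet below is just a reproduction of the error pattern and the usual fix:

```python
d = {'id': 1, 'name': 2}

# Reproduces the error class seen above (exact wording varies by Python
# version; 3.5/3.6 say "must be str, not dict_keys"):
try:
    "columns: " + d.keys()
except TypeError as e:
    msg = str(e)

# Fix: convert the view explicitly before using it in string context.
fixed = "columns: " + ", ".join(d.keys())
```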
Use case: reasoner Z needs to bring a slice of central KG into their local KG. They want to set up a CI job to do this. They run the kgx export dump command (#10) via docker run
We'd like to experiment with slurping slices of the X-ray graph (already conformant with tkg standard), and if they are amenable, allowing them to slurp slices in.
This should turn out to be a generic cypher call, no need to transform as in the monarch case
def parse(self, filename: str = None, format: str = None):
    """
    Parse a file into a graph, using rdflib
    """
    rdfgraph = rdflib.Graph()
    if format is None:
        if filename.endswith(".ttl"):
            format = 'turtle'
        elif filename.endswith(".rdf"):
            format = 'xml'
    rdfgraph.parse(filename, format=format)
    # TODO: use source from RDF
    self.graph_metadata['provided_by'] = filename
    # self.load_edges(rdfgraph)
Is this last line supposed to be commented out? This makes the class malfunction.
We'd like to be able to do semantic diffs to compare subsets of two KGs
There are a number of different mappings to be explored. The data model is a basic property graph. Nodes and edges can be annotated with a simple key-val dict, where the val is either a literal or list of literals.
One possibility is to represent nodes and edges as fixed-length tuples where each position is hardwired to a property, e.g. (node ID NAME DESCRIPTION ...)* (edge PREDICATE SUBJECT OBJECT PUBLICATIONS ...)*, but this is too rigid.
Option 1:
(node ID)
(node-property-value ID P V) ## V is atom or list of atoms
(edge PRED SUBJ OBJ)
(edge-property-value PRED SUBJ OBJ P V)
this uses reification for the edges; a modification is to use an edge ID
Option 2:
(node ID (list (P1 V1) (P2 V2) ..))
(edge PRED SUBJ OBJ (list (P1 V1) (P2 V2) ..))
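A rough Python sketch of serializing a property graph into Option 1 tuples; the in-memory dict/tuple shapes it consumes are assumptions for illustration.

```python
def to_option1(nodes, edges):
    """Flatten a property graph into Option-1 style tuples.

    nodes: {id: {prop: value}}
    edges: [(pred, subj, obj, {prop: value})]
    """
    rows = []
    for nid, props in nodes.items():
        rows.append(('node', nid))
        for p, v in props.items():
            # V may be an atom or a list of atoms, per Option 1
            rows.append(('node-property-value', nid, p, v))
    for pred, subj, obj, props in edges:
        rows.append(('edge', pred, subj, obj))
        for p, v in props.items():
            rows.append(('edge-property-value', pred, subj, obj, p, v))
    return rows
```

Note this inherits Option 1's reification: an edge's properties repeat the full (PRED SUBJ OBJ) key, which is where the edge-ID modification would help.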
CSV files are at the link below; there will be a new version with statement id instead of predicate type.
The label array ['named_thing'] gets parsed as the string "['named_thing']" when parsing in PandasTransformer load_node and load_edge. Use ast.literal_eval.
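For concreteness, a minimal sketch of the suggested fix (the helper name is mine, not an existing KGX function):

```python
import ast

raw = "['named_thing']"            # what the CSV round-trip currently yields
labels = ast.literal_eval(raw)     # back to a real list

def parse_maybe_list(value):
    """Parse stringified lists back into lists; leave plain strings alone."""
    if isinstance(value, str) and value.startswith('['):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value           # malformed literal: keep the raw string
    return value
```

ast.literal_eval only evaluates Python literals, so unlike eval it is safe on untrusted CSV content.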
Host OS: Ubuntu
Installing inside a Docker container
Installation command:
pip3 install -r requirements.txt
/mnt/data/kgx/kgx$ uname -a
Linux ip-172-31-43-220 4.4.0-1060-aws #69-Ubuntu SMP Sun May 20 13:42:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Collecting pytest_logging>=0.0 (from -r requirements.txt (line 9))
Using cached https://files.pythonhosted.org/packages/dc/1e/fb11174c9eaebcec27d36e9e994b90ffa168bc3226925900b9dbbf16c9da/pytest-logging-2015.11.4.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-niimt31p/pytest-logging/setup.py", line 34, in <module>
exec(rfh.read(), None, _LOCALS) # pylint: disable=exec-used
File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 105: ordinal not in range(128)
@cmungall @deepakunni3 @kshefchek
The workflow is in rdf_to_json.py, which loads owl files and the turtle files, parses the turtle files (using the owl files to find categories for nodes), and saves the resulting networkx graph as json files. I'm using a branch where I refactored/rewrote the RDF transformers into new files: rdf_transformer2.py and utils/rdf_utils.py.
I wrote a count.py to get the edge and node frequencies and the examples of uncategorized nodes from the resulting json files.
It seems the majority of clinvar edges have an uncategorized node. Is there some other owl file I should be using?
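Roughly what my count.py does, as a reconstructed sketch over assumed dict shapes (not the actual script):

```python
from collections import Counter

def summarize(nodes, edges):
    """Tally node categories and (subject cat, predicate, object cat) patterns.

    nodes: {id: {'category': ...}}
    edges: [{'subject': ..., 'predicate': ..., 'object': ...}]
    """
    node_freq = Counter(d.get('category') for d in nodes.values())
    edge_freq = Counter(
        (nodes.get(e['subject'], {}).get('category'),   # None if uncategorized
         e['predicate'],
         nodes.get(e['object'], {}).get('category'))
        for e in edges
    )
    return node_freq, edge_freq
```

Uncategorized endpoints show up as None in the patterns, which is how the None rows in the tables below arise.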
@cmungall I just noticed that the two most common edge types for clinvar are has_uncertain_significance_for_condition and has_uncertain_significance_for_condition, suggesting that the uncategorized objects are actually diseases. Am I supposed to be inferring node categories from the domain and range of the predicates they're adjacent to?
I am using go.owl, so.owl, ordo.owl, mondo.owl, and hp.owl, and the latest results were derived from using all owl files on all turtle files.
owl:Restriction and owl:Axiom aren't the result of a bug in my category-finding method.

rdf_to_json.py with all owl files loaded and used on each turtle file. Comment: significant improvement to clinvar.ttl, though I also added ordo.owl, so these improvements may have solely been from that.

+-------------------------------------+---------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+-------------------------------------+---------------------------------------------+-----------+
| http://www.ncbi.nlm.nih.gov/medgen | http://www.ncbi.nlm.nih.gov/medgen/CN517202 | 1919 |
| http://purl.obolibrary.org/obo/OMIM | http://purl.obolibrary.org/obo/OMIM_612042 | 232 |
| | ClinVarVariant:242567,242568 | 197 |
| http://www.orpha.net/ORDO/Orphanet | http://www.orpha.net/ORDO/Orphanet_98301 | 3 |
+-------------------------------------+---------------------------------------------+-----------+
+------------------+-----------+
| Category | Frequency |
+------------------+-----------+
| sequence feature | 411841 |
| disease | 5181 |
| None | 2351 |
| GENO:0000871 | 288 |
| GENO:0000847 | 220 |
+------------------+-----------+
+------------------+--------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+--------------+-----------------+-----------+
| sequence feature | GENO:0000845 | None | 174302 |
| sequence feature | GENO:0000844 | None | 106606 |
| sequence feature | GENO:0000845 | disease | 94964 |
| sequence feature | GENO:0000840 | disease | 70360 |
| sequence feature | GENO:0000843 | None | 63270 |
| sequence feature | GENO:0000844 | disease | 45998 |
| sequence feature | GENO:0000840 | None | 37772 |
| sequence feature | GENO:0000843 | disease | 25795 |
| sequence feature | GENO:0000841 | disease | 21672 |
| sequence feature | GENO:0000841 | None | 19869 |
| GENO:0000871 | GENO:0000840 | disease | 223 |
| GENO:0000847 | GENO:0000840 | disease | 206 |
+------------------+--------------+-----------------+-----------+
+-----------------------------------------+---------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+-----------------------------------------+---------------------------------------------+-----------+
| http://purl.obolibrary.org/obo/MESH | http://purl.obolibrary.org/obo/MESH_D000855 | 590 |
| http://purl.obolibrary.org/obo/DOID | http://purl.obolibrary.org/obo/DOID_411 | 315 |
| http://purl.obolibrary.org/obo/OMIM | http://purl.obolibrary.org/obo/OMIM_610380 | 219 |
| http://purl.obolibrary.org/obo/HP | http://purl.obolibrary.org/obo/HP_0004929 | 87 |
| http://purl.obolibrary.org/obo/DECIPHER | http://purl.obolibrary.org/obo/DECIPHER_57 | 47 |
| http://www.orpha.net/ORDO/Orphanet | http://www.orpha.net/ORDO/Orphanet_2058 | 18 |
+-----------------------------------------+---------------------------------------------+-----------+
+-----------+-----------+
| Category | Frequency |
+-----------+-----------+
| disease | 12773 |
| phenotype | 8388 |
| None | 1276 |
+-----------+-----------+
+------------------+-----------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+-----------------+-----------------+-----------+
| disease | has phenotype | phenotype | 236210 |
| None | has phenotype | phenotype | 31721 |
| disease | has disposition | HP:0000005 | 7405 |
| disease | has disposition | HP:0031797 | 1544 |
| disease | has phenotype | None | 909 |
| phenotype | has phenotype | phenotype | 792 |
| disease | has disposition | HP:0012823 | 619 |
| None | has phenotype | None | 277 |
| disease | has phenotype | HP:0031797 | 269 |
| None | has disposition | HP:0000005 | 176 |
+------------------+-----------------+-----------------+-----------+
+-------------------------------------------------+---------------------------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+-------------------------------------------------+---------------------------------------------------------------+-----------+
| https://www.ncbi.nlm.nih.gov/gene | https://www.ncbi.nlm.nih.gov/gene/117188 | 987 |
| https://monarchinitiative.org/.well-known/genid | https://monarchinitiative.org/.well-known/genid/feature616964 | 6 |
+-------------------------------------------------+---------------------------------------------------------------+-----------+
+------------------+-----------+
| Category | Frequency |
+------------------+-----------+
| phenotype | 6296 |
| gene | 4146 |
| None | 993 |
| sequence feature | 245 |
+------------------+-----------+
+------------------+------------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+------------------+-----------------+-----------+
| gene | causes condition | phenotype | 4652 |
| gene | RO:0002326 | phenotype | 1089 |
| None | causes condition | phenotype | 537 |
| None | RO:0002326 | phenotype | 353 |
| sequence feature | causes condition | phenotype | 157 |
| gene | RO:0002607 | phenotype | 108 |
+------------------+------------------+-----------------+-----------+
+------------------------------------+-------------------------------------------+-----------+
| Uncategorized Example Base IRI | Uncategorized Example Full IRI | Frequency |
+------------------------------------+-------------------------------------------+-----------+
| http://www.orpha.net/ORDO/Orphanet | http://www.orpha.net/ORDO/Orphanet_449306 | 33 |
+------------------------------------+-------------------------------------------+-----------+
+------------------+-----------+
| Category | Frequency |
+------------------+-----------+
| sequence feature | 6033 |
| disease | 3658 |
| gene | 815 |
+------------------+-----------+
+------------------+------------------+-----------------+-----------+
| Subject Category | Predicate | Object Category | Frequency |
+------------------+------------------+-----------------+-----------+
| sequence feature | causes condition | disease | 5792 |
| gene | RO:0003304 | disease | 909 |
| gene | RO:0002607 | disease | 332 |
| sequence feature | causes condition | phenotype | 198 |
+------------------+------------------+-----------------+-----------+
At the last hackathon I tried to load https://data.monarchinitiative.org/ttl/clinvar.ttl (which is 1.2 GB) into a networkx graph with ObanRdfTransformer, and got an exception that I believe means I ran out of memory.
@cmungall Do you know whether anyone else has tried this? Might the problem be with my computer, or is networkx using too much memory, or is it something else?
How can we ensure that KGX will work for large files?
In theory this should just be a generic bolt/cypher call, if the group already conforms to http://bit.ly/tr-kg-standard
Ran into this issue while working with Red's knowledge graph.
If I run a SPARQL query that involves property path expansion, it results in connection timeouts.
Not sure if there is a better approach to this or if the query I am using can be improved.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bioentity: <http://bioentity.io/vocab/>
SELECT * WHERE {
?subject ?predicate ?object .
?subject rdfs:subClassOf* bioentity:Association
}
LIMIT 5
@cmungall @amalic @vemonet Would be good to get your inputs on this.
Error: Node(1335) already exists with label `named_thing` and property `id` = 'NCBIGene:25236'
Should use MERGE rather than CREATE. Also, should SET properties ON CREATE and only use id to identify a node.
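A sketch of the suggested Cypher (label and property names follow the error message above; the commented-out driver call is hypothetical):

```python
# CREATE fails on a second load of the same node:
create_query = "CREATE (n:named_thing {id: $id}) SET n += $props"

# MERGE matches on id only, and sets the remaining properties only when the
# node is first created, so repeated loads of the same node are idempotent:
merge_query = (
    "MERGE (n:named_thing {id: $id}) "
    "ON CREATE SET n += $props"
)
# e.g. session.run(merge_query, id='NCBIGene:25236', props={...})
```

Keeping only `id` inside the MERGE pattern is what avoids the duplicate-node error: a MERGE on the full property map would create a second node whenever any property differs.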
This should implement a mapping that is a subset of the scigraph one here
https://github.com/SciGraph/SciGraph/wiki/Neo4jMapping
basically, class axioms
This would be similar to the existing RDF export but instead of using OBAN reification, edge properties would map to the named graph. The results could be published using the nanopub framework.
advice sought from @micheldumontier and @amalic
@putmantime made a start here #29
The validator may work by connecting to an instance (e.g. via Bolt) OR by downloading using the kgx script and running checks locally / in-memory, e.g. using networkx.
Looks like a TODO: load it from a config file?