biolink / kgx

KGX is a Python library for exchanging Knowledge Graphs

Home Page: https://kgx.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.03% Shell 0.01% Python 99.84% Dockerfile 0.12%
knowledge-graph biolink-model ncats-translator

kgx's People

Contributors

caufieldjh, cbizon, cmungall, deepakunni3, dependabot[bot], dhimmel, dougli1sqrd, evandietzmorris, glass-ships, gouttegd, gregr, hrshdhgd, hsolbrig, kennethbruskiewicz, kevinschaper, kshefchek, lhannest, phillipsowen, richardbruskiewich, sierra-moxon, yaphetkg, yy20716

kgx's Issues

generate edge ids?

Some KGs will provide edge IDs, others won't. Should we always provide one, e.g. by hashing? Hashing the SPO (subject, predicate, object) alone will not work, as we allow separate edges with the same SPO but different semantics.
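One possible way around the SPO collision, offered only as a sketch: hash the SPO together with the edge's remaining properties, so that two edges sharing an SPO but differing in any property get different IDs. The helper name `make_edge_id` and the property keys are hypothetical and not part of KGX.

```python
import hashlib
import json


def make_edge_id(subject, predicate, obj, properties=None):
    """Return a deterministic ID for an edge (hypothetical helper, not a KGX API)."""
    payload = {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        # Including the properties distinguishes edges that share an SPO.
        "properties": properties or {},
    }
    # sort_keys gives a canonical serialization, so key order does not matter.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two edges that are identical in every property would still collide under this scheme, which may or may not be the desired behaviour.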

fix travis config

Currently failing due to lack of pandas.

It may be simpler to do this in the .travis.yml: pip install -r requirements.txt

The current config is intended to do a fast binary install of scipy, except that isn't actually needed.

Trade out pandas for dask?

I just started playing around with Dask. It appears to be much faster than pandas and will work with much larger files. Its API is similar to the pandas DataFrame, and while it can't do everything pandas can, it probably fits our needs.

Allow filters for I/O

Especially when the endpoint is a database (e.g. via bolt, sparql) rather than a file or set of files.

use cases:

  • export all nodes plus their properties of a certain type (e.g. disease)
    • as above, include direct outgoing edges
  • export all edges for a particular pair of types, e.g. d2p
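The use cases above could be sketched in memory as follows. This is an illustration only: it assumes nodes are held as a dict of ID to property dict with a `category` list, and edges as dicts with `subject`, `predicate` and `object` keys. The function names are hypothetical; a real implementation against bolt or SPARQL would push the filter into the query instead.

```python
def export_nodes_of_category(nodes, edges, category, include_outgoing=False):
    """Select nodes of one category, optionally with their direct outgoing edges."""
    selected = {
        node_id: props
        for node_id, props in nodes.items()
        if category in props.get("category", [])
    }
    if not include_outgoing:
        return selected, []
    # Outgoing edges are those whose subject is one of the selected nodes.
    outgoing = [edge for edge in edges if edge["subject"] in selected]
    return selected, outgoing


def export_edges_between(nodes, edges, subject_category, object_category):
    """Select edges whose subject and object have the given categories."""

    def has_category(node_id, category):
        return category in nodes.get(node_id, {}).get("category", [])

    return [
        edge
        for edge in edges
        if has_category(edge["subject"], subject_category)
        and has_category(edge["object"], object_category)
    ]
```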

Provide way to summarize contents of a KG

Related to pitch 1 from @saramsey

Can we easily summarize the contents of a KG and how they may differ from others? If we are all using the same types then we can multiplex a Cypher aggregate query to each source. We could also download everything locally and compare that way.
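For the local option, a summary could be as simple as counting node categories and (subject category, predicate, object category) combinations. The sketch below assumes each node carries a single `category` string, with uncategorized nodes counted under None. The function name and data layout are hypothetical.

```python
from collections import Counter


def summarize(nodes, edges):
    """Count node categories and edge type combinations for an in-memory KG."""

    def category_of(node_id):
        # None stands for "uncategorized" or "node not present".
        return nodes.get(node_id, {}).get("category")

    node_counts = Counter(props.get("category") for props in nodes.values())
    triple_counts = Counter(
        (category_of(edge["subject"]), edge["predicate"], category_of(edge["object"]))
        for edge in edges
    )
    return node_counts, triple_counts
```

Running this on each downloaded KG and comparing the two Counters would give a first aggregate view of how the graphs differ.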

Add transformer for BEL

I'm the maintainer of PyBEL which compiles knowledge stored in Biological Expression Language into a NetworkX MultiDiGraph. I don't think it would be so hard to transform the graphs it produces to match your schema so you could incorporate them. There's quite a bit of publicly available content that only exists in BEL, and we're also working at converting several biological knowledge repositories to BEL automatically (see: Bio2BEL).

I would be happy to try and make a PR if you can point me in the right direction where to get started! If you're unaware of BEL I can share some more resources and examples too :)

Implement ID clique merging

This could happen a number of different ways

  1. Use KGX to export from a source, run a rewriter over the generated files
  2. Use KGX to export from a source, rewrite during export
  3. Import into a database, run clique merge (could use SG framework for this)

cc @balhoff @kshefchek @yy20716
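Whichever of the three routes is chosen, the core of the merge could look something like the union-find sketch below. It assumes equivalence is expressed as edges with a predicate such as `same_as` (a placeholder name) and picks the lexicographically smallest ID in each clique as the leader, which is an arbitrary choice made for illustration.

```python
def merge_cliques(edges, equivalence_predicates=("same_as",)):
    """Rewrite edges so every ID in an equivalence clique maps to one leader ID."""
    parent = {}

    def find(node_id):
        parent.setdefault(node_id, node_id)
        while parent[node_id] != node_id:
            # Path halving keeps the trees shallow.
            parent[node_id] = parent[parent[node_id]]
            node_id = parent[node_id]
        return node_id

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return
        # The smaller root becomes the leader, so leaders are deterministic.
        if root_b < root_a:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a

    for edge in edges:
        if edge["predicate"] in equivalence_predicates:
            union(edge["subject"], edge["object"])

    merged = []
    for edge in edges:
        if edge["predicate"] in equivalence_predicates:
            continue  # equivalence edges are consumed by the merge
        merged.append(
            {**edge, "subject": find(edge["subject"]), "object": find(edge["object"])}
        )
    return merged
```

A production version would likely choose the leader by prefix priority rather than string order, and would also merge node properties.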

command line wrapper

Will use click framework

dump command:

  • input: path or endpoint
  • input format (if not inferred from above)
  • output path
  • output format

see tests for details

Neo4j download filters

So far we can filter on the subject and object categories, and the edge label. This means we're only constraining the edge set and not the node set. You might get only a handful of relations, but then also get every node in the database. That seems incorrect.

My expectation is that every edge in the edge set should have its nodes represented in the node set, and that there should not be nodes in the node set that are not connected by edges in the edge set.

That being said, if we later add filters for the node set then maybe users will want to be able to download node sets and edge sets independently? Maybe they'll want to download only the nodes, or only the edges.
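The expectation described above (no nodes outside the filtered edge set) could be enforced after the edge filter has run, roughly as follows. This is a sketch over plain dicts, not the Neo4j code path, and `restrict_nodes_to_edges` is a hypothetical name.

```python
def restrict_nodes_to_edges(nodes, edges):
    """Keep only nodes referenced by the given edges.

    Returns the kept nodes and a sorted list of IDs that edges reference
    but that are absent from the node set.
    """
    referenced = set()
    for edge in edges:
        referenced.add(edge["subject"])
        referenced.add(edge["object"])
    kept = {node_id: props for node_id, props in nodes.items() if node_id in referenced}
    missing = sorted(referenced - set(nodes))
    return kept, missing
```

Reporting the missing IDs makes the other half of the expectation checkable: every edge in the edge set should have both of its nodes in the node set.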

load ontology from rdf, demonstrating inter-ontology linkages

I believe the code for this is basically done, just need to add something to the README that shows how it's done - use mondo + uberon + go as example (complete versions).

This could be done as two calls

  • kgx convert ontology rdf to property graph serialization (e.g. csv)
  • kgx load into neo4j (using the code that @deepakunni3 is finishing off)

Use JSON-LD context rather than hardcoding for mapping of RDF preds to short forms

Currently, RDF sources and the monarch-full graph use URIs for node and edge properties and types, and we hardcode a mapping to the short forms in the standard:

# TODO: use JSON-LD context
mapping = {
    'subject': OBAN.association_has_subject,
    'object': OBAN.association_has_object,
    'predicate': OBAN.association_has_predicate,
    'comment': RDFS.comment,
    'name': RDFS.label
}

This should instead come from a JSON-LD context that is generated from biolink.
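As a rough sketch of what reading such a context might involve: the code below inverts a small, hand-written context into an IRI to short-form lookup. The context shown is illustrative only and is not the generated biolink context; the expansion logic is a simplification of full JSON-LD processing (it only handles string terms, `@id` objects and simple prefixes).

```python
import json

# Hand-written example context; the real one would be generated from biolink.
CONTEXT_JSON = """
{
  "@context": {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "OBAN": "http://purl.org/oban/",
    "name": "rdfs:label",
    "comment": {"@id": "rdfs:comment"},
    "subject": "OBAN:association_has_subject"
  }
}
"""


def iri_to_short_form(context_doc):
    """Build a mapping from expanded IRI to short term name."""
    context = context_doc["@context"]
    # Treat string values ending in '/' or '#' as namespace prefixes.
    prefixes = {
        key: value
        for key, value in context.items()
        if isinstance(value, str) and value.endswith(("/", "#"))
    }
    mapping = {}
    for term, value in context.items():
        if term in prefixes:
            continue
        target = value.get("@id") if isinstance(value, dict) else value
        if not isinstance(target, str):
            continue  # term definition without a usable @id
        prefix, sep, local = target.partition(":")
        if sep and prefix in prefixes:
            target = prefixes[prefix] + local  # expand the CURIE
        mapping[target] = term
    return mapping


SHORT_FORMS = iri_to_short_form(json.loads(CONTEXT_JSON))
```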

Error: must be str, not dict_keys (no stack backtrace)

sramsey-laptop:kgx sramsey$ kgx neo4j-download bolt://localhost neo4j [REDACTED] test.out
/usr/local/lib/python3.6/site-packages/cachier/mongo_core.py:24: UserWarning: Cachier warning: pymongo was not found. MongoDB cores will not work.
"Cachier warning: pymongo was not found. MongoDB cores will not work.")
Error: must be str, not dict_keys

I got this error running the latest code for kgx on a MBP (OS 10.10) under python 3.5

dockerize

Use case: reasoner Z needs to bring a slice of central KG into their local KG. They want to set up a CI job to do this. They run the kgx export dump command (#10) via docker run

Add adapter for X-ray reasoner

We'd like to experiment with slurping slices of the X-ray graph (already conformant with tkg standard), and if they are amenable, allowing them to slurp slices in.

This should turn out to be a generic cypher call, no need to transform as in the monarch case

rdf_transformer malfunction

    def parse(self,filename:str=None, format:str=None):
        """
        Parse a file into an graph, using rdflib
        """
        rdfgraph = rdflib.Graph()
        if format is None:
            if filename.endswith(".ttl"):
                format='turtle'
            elif filename.endswith(".rdf"):
                format='xml'
        rdfgraph.parse(filename, format=format)

        # TODO: use source from RDF
        self.graph_metadata['provided_by'] = filename
        # self.load_edges(rdfgraph)

Is this last line supposed to be commented out? This makes the class malfunction.

Add diff tool

We'd like to be able to do semantic diffs to compare subsets of two KGs:

  • aggregate level: counts of type X, counts of uses of predicate Y
  • node level: compare properties for all types of X
  • edge level: compare all associations between X and Y

Add export to sxprs

There are a number of different mappings to be explored. The data model is a basic property graph. Nodes and edges can be annotated with a simple key-val dict, where the val is either a literal or list of literals.

One possibility is to represent nodes and edges as fixed-length tuples where each position is hardwired to a property, e.g. (node ID NAME DESCRIPTION ...)* (edge PREDICATE SUBJECT OBJECT PUBLICATIONS...)*, but this is too rigid.

Option 1:

(node ID)
(node-property-value ID P V) ## V is atom or list of atoms
(edge PRED SUBJ OBJ)
(edge-property-value PRED SUBJ OBJ P V)

this uses reification for the edges; a modification is to use an edge ID

Option 2:

(node ID (list (P1 V1) (P2 V2) ..))
(edge PRED SUBJ OBJ (list (P1 V1) (P2 V2) ..))
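A minimal serializer for Option 2 might look like the following. Several details are assumptions rather than anything decided in this issue: scalar values are written with JSON quoting, a list value is written as a parenthesized sequence of atoms, and properties are emitted in sorted key order. All function names are hypothetical.

```python
import json


def to_sexpr(value):
    """Render an atom, or a list of atoms as a parenthesized sequence."""
    if isinstance(value, (list, tuple)):
        return "(" + " ".join(to_sexpr(item) for item in value) + ")"
    # json.dumps quotes and escapes strings and leaves numbers bare.
    return json.dumps(value)


def props_sexpr(props):
    """Render a property dict as (list (P1 V1) (P2 V2) ..)."""
    pairs = " ".join(f"({key} {to_sexpr(props[key])})" for key in sorted(props))
    return f"(list {pairs})" if pairs else "(list)"


def node_sexpr(node_id, props):
    return f"(node {to_sexpr(node_id)} {props_sexpr(props)})"


def edge_sexpr(predicate, subject, obj, props):
    return (
        f"(edge {to_sexpr(predicate)} {to_sexpr(subject)} "
        f"{to_sexpr(obj)} {props_sexpr(props)})"
    )
```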

Pandas loads arrays as strings

The label array ['named_thing'] gets parsed as a string "['named_thing']"

Use ast.literal_eval when parsing in PandasTransformer load_node and load_edge.
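A guarded version of that parsing step could look like this. It is a sketch: `parse_list_field` is a hypothetical helper, and where exactly it would be called inside load_node and load_edge is not shown.

```python
import ast


def parse_list_field(value):
    """Turn a stringified list such as "['named_thing']" back into a list.

    Anything that is not a string, does not look like a list, or fails to
    parse is returned unchanged.
    """
    if isinstance(value, str) and value.startswith("[") and value.endswith("]"):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value
        if isinstance(parsed, list):
            return parsed
    return value
```

ast.literal_eval only evaluates Python literals, so it is safe to run on untrusted file contents in a way that eval is not.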

UnicodeDecodeError on installing kgx from requirements.txt under python 3.5.x in docker

Host OS: Ubuntu
Installing inside a Docker container

Installation command:
pip3 install -r requirements.txt

/mnt/data/kgx/kgx$ uname -a
Linux ip-172-31-43-220 4.4.0-1060-aws #69-Ubuntu SMP Sun May 20 13:42:07 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Collecting pytest_logging>=0.0 (from -r requirements.txt (line 9))
Using cached https://files.pythonhosted.org/packages/dc/1e/fb11174c9eaebcec27d36e9e994b90ffa168bc3226925900b9dbbf16c9da/pytest-logging-2015.11.4.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-build-niimt31p/pytest-logging/setup.py", line 34, in
exec(rfh.read(), None, _LOCALS) # pylint: disable=exec-used
File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 105: ordinal not in range(128)

Building a subgraph of clinvar.ttl, omim.ttl, hpoa.ttl, orphanet.ttl

@cmungall @deepakunni3 @kshefchek

The workflow is in rdf_to_json.py, which loads owl files and the turtle files, parses the turtle files (using the owl files to find categories for nodes), and saves the resulting networkx graph as json files. I'm using a branch where I refactored/rewrote the RDF transformers into new files: rdf_transformer2.py and utils/rdf_utils.py.

I wrote a count.py to get the edge and node frequencies and the examples of uncategorized nodes from the resulting json files.

It seems the majority of clinvar edges have an uncategorized node. Is there some other owl file I should be using?

@cmungall I just noticed that the two most common edge types for clinvar are has_uncertain_significance_for_condition and has_uncertain_significance_for_condition, suggesting that the uncategorized objects are actually diseases. Am I supposed to be inferring node categories from the domain and range of predicates they're adjacent to?

I am using go.owl, so.owl, ordo.owl, mondo.owl, and hp.owl, and the latest results were derived from using all owl files on all turtle files.

To do

  • Check that categories such as owl:Restriction and owl:Axiom aren't the result of a bug in my category-finding method.
  • Try re-running rdf_to_json.py with all owl files loaded and used on each turtle file. Comment: significant improvement to clinvar.ttl, though I also added ordo.owl so these improvements may have solely been from that.

Results

clinvar:

+-------------------------------------+---------------------------------------------+-----------+
| Uncategorized Example Base IRI      | Uncategorized Example Full IRI              | Frequency |
+-------------------------------------+---------------------------------------------+-----------+
| http://www.ncbi.nlm.nih.gov/medgen  | http://www.ncbi.nlm.nih.gov/medgen/CN517202 | 1919      |
| http://purl.obolibrary.org/obo/OMIM | http://purl.obolibrary.org/obo/OMIM_612042  | 232       |
|                                     | ClinVarVariant:242567,242568                | 197       |
| http://www.orpha.net/ORDO/Orphanet  | http://www.orpha.net/ORDO/Orphanet_98301    | 3         |
+-------------------------------------+---------------------------------------------+-----------+
+------------------+-----------+
| Category         | Frequency |
+------------------+-----------+
| sequence feature | 411841    |
| disease          | 5181      |
| None             | 2351      |
| GENO:0000871     | 288       |
| GENO:0000847     | 220       |
+------------------+-----------+
+------------------+--------------+-----------------+-----------+
| Subject Category | Predicate    | Object Category | Frequency |
+------------------+--------------+-----------------+-----------+
| sequence feature | GENO:0000845 | None            | 174302    |
| sequence feature | GENO:0000844 | None            | 106606    |
| sequence feature | GENO:0000845 | disease         | 94964     |
| sequence feature | GENO:0000840 | disease         | 70360     |
| sequence feature | GENO:0000843 | None            | 63270     |
| sequence feature | GENO:0000844 | disease         | 45998     |
| sequence feature | GENO:0000840 | None            | 37772     |
| sequence feature | GENO:0000843 | disease         | 25795     |
| sequence feature | GENO:0000841 | disease         | 21672     |
| sequence feature | GENO:0000841 | None            | 19869     |
| GENO:0000871     | GENO:0000840 | disease         | 223       |
| GENO:0000847     | GENO:0000840 | disease         | 206       |
+------------------+--------------+-----------------+-----------+

hpoa:

+-----------------------------------------+---------------------------------------------+-----------+
| Uncategorized Example Base IRI          | Uncategorized Example Full IRI              | Frequency |
+-----------------------------------------+---------------------------------------------+-----------+
| http://purl.obolibrary.org/obo/MESH     | http://purl.obolibrary.org/obo/MESH_D000855 | 590       |
| http://purl.obolibrary.org/obo/DOID     | http://purl.obolibrary.org/obo/DOID_411     | 315       |
| http://purl.obolibrary.org/obo/OMIM     | http://purl.obolibrary.org/obo/OMIM_610380  | 219       |
| http://purl.obolibrary.org/obo/HP       | http://purl.obolibrary.org/obo/HP_0004929   | 87        |
| http://purl.obolibrary.org/obo/DECIPHER | http://purl.obolibrary.org/obo/DECIPHER_57  | 47        |
| http://www.orpha.net/ORDO/Orphanet      | http://www.orpha.net/ORDO/Orphanet_2058     | 18        |
+-----------------------------------------+---------------------------------------------+-----------+
+-----------+-----------+
| Category  | Frequency |
+-----------+-----------+
| disease   | 12773     |
| phenotype | 8388      |
| None      | 1276      |
+-----------+-----------+
+------------------+-----------------+-----------------+-----------+
| Subject Category | Predicate       | Object Category | Frequency |
+------------------+-----------------+-----------------+-----------+
| disease          | has phenotype   | phenotype       | 236210    |
| None             | has phenotype   | phenotype       | 31721     |
| disease          | has disposition | HP:0000005      | 7405      |
| disease          | has disposition | HP:0031797      | 1544      |
| disease          | has phenotype   | None            | 909       |
| phenotype        | has phenotype   | phenotype       | 792       |
| disease          | has disposition | HP:0012823      | 619       |
| None             | has phenotype   | None            | 277       |
| disease          | has phenotype   | HP:0031797      | 269       |
| None             | has disposition | HP:0000005      | 176       |
+------------------+-----------------+-----------------+-----------+

omim:

+-------------------------------------------------+---------------------------------------------------------------+-----------+
| Uncategorized Example Base IRI                  | Uncategorized Example Full IRI                                | Frequency |
+-------------------------------------------------+---------------------------------------------------------------+-----------+
| https://www.ncbi.nlm.nih.gov/gene               | https://www.ncbi.nlm.nih.gov/gene/117188                      | 987       |
| https://monarchinitiative.org/.well-known/genid | https://monarchinitiative.org/.well-known/genid/feature616964 | 6         |
+-------------------------------------------------+---------------------------------------------------------------+-----------+
+------------------+-----------+
| Category         | Frequency |
+------------------+-----------+
| phenotype        | 6296      |
| gene             | 4146      |
| None             | 993       |
| sequence feature | 245       |
+------------------+-----------+
+------------------+------------------+-----------------+-----------+
| Subject Category | Predicate        | Object Category | Frequency |
+------------------+------------------+-----------------+-----------+
| gene             | causes condition | phenotype       | 4652      |
| gene             | RO:0002326       | phenotype       | 1089      |
| None             | causes condition | phenotype       | 537       |
| None             | RO:0002326       | phenotype       | 353       |
| sequence feature | causes condition | phenotype       | 157       |
| gene             | RO:0002607       | phenotype       | 108       |
+------------------+------------------+-----------------+-----------+

orphanet:

+------------------------------------+-------------------------------------------+-----------+
| Uncategorized Example Base IRI     | Uncategorized Example Full IRI            | Frequency |
+------------------------------------+-------------------------------------------+-----------+
| http://www.orpha.net/ORDO/Orphanet | http://www.orpha.net/ORDO/Orphanet_449306 | 33        |
+------------------------------------+-------------------------------------------+-----------+
+------------------+-----------+
| Category         | Frequency |
+------------------+-----------+
| sequence feature | 6033      |
| disease          | 3658      |
| gene             | 815       |
+------------------+-----------+
+------------------+------------------+-----------------+-----------+
| Subject Category | Predicate        | Object Category | Frequency |
+------------------+------------------+-----------------+-----------+
| sequence feature | causes condition | disease         | 5792      |
| gene             | RO:0003304       | disease         | 909       |
| gene             | RO:0002607       | disease         | 332       |
| sequence feature | causes condition | phenotype       | 198       |
+------------------+------------------+-----------------+-----------+

Issues with connection timeouts on property path expansion in SPARQL queries

Ran into this issue while working with Red's knowledge graph.

If I run a SPARQL query that involves property path expansion, it results in connection timeouts.
Not sure if there is a better approach to this or if the query I am using can be improved.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bioentity: <http://bioentity.io/vocab/>
SELECT * WHERE {
    ?subject ?predicate ?object .
    ?subject rdfs:subClassOf* bioentity:Association
}
LIMIT 5

@cmungall @amalic @vemonet Would be good to get your inputs on this.

Add nanopubs export

This would be similar to the existing RDF export but instead of using OBAN reification, edge properties would map to the named graph. The results could be published using the nanopub framework.

advice sought from @micheldumontier and @amalic

Write validator

@putmantime made a start here #29

The validator may work by connecting to an instance (e.g. via Bolt) or by downloading with the kgx script and running checks locally, in memory, e.g. using networkx.
