cu-dbmi / rtx-kg2-gateway

Enabling RTX-KG2 data access through various means.

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 98.44% Python 1.56%

rtx-kg2-gateway's Introduction

RTX-KG2 Gateway

Enabling RTX-KG2 data access through various means.

Overview

RTX-KG2 provides a knowledge graph composed of many different data sources. The output data from the RTX-KG2 project can benefit from the use of additional specialized graph database tools for analysis purposes. Please find a brief overview of these technologies below for a better understanding of how they're used in context with the RTX-KG2 data.

Graph Database Technologies

Installation

Python

Using the contents of this repository depends on Python being available on the system. One suggested way to install and manage Python is through pyenv (there are many other ways too!). Please reference the pyproject.toml file for the Python versions which are compatible with this project.

Poetry environment

Please use Poetry to install and run the Python environment for this project. The Poetry environment includes dependencies which help run IDE environments, manage the data, and run workflows. See the Poetry documentation for more information about installing Poetry on your system.

# context: within the root of the repository
# after installing poetry, create the environment
poetry install

Development

Running and updating Jupyter notebooks

Please follow the installation steps above and then use the related Jupyter environment to open and explore the notebooks under the notebooks directory. These notebooks rely on Jupyter Lab extensions (such as jupytext) provided by the Poetry environment for this repository. Results may vary when the notebooks are used outside of Jupyter Lab.

# context: within the root of the repository
# after creating poetry environment, run jupyter
poetry run jupyter lab

Executing sequences of Python modules as tasks

We use Poe the Poet to run tasks defined within pyproject.toml under the section [tool.poe.tasks*]. This makes it possible to define a task workflow that runs multiple procedures in sequence.

For example, use the following to run the notebook_sample_data_generation task:

# context: within the root of the repository
# run the notebook_sample_data_generation task using poethepoet, as defined within `pyproject.toml`
poetry run poe notebook_sample_data_generation

Existing tasks:

  • notebook_sample_data_generation: generates a sample Parquet dataset and adds it to a Kuzu database.
  • notebook_full_data_generation: generates the full dataset and adds it to a Kuzu database.
  • notebook_full_data_generation_with_metanames: generates the full dataset with metanames specificity and adds it to a Kuzu database in a similar fashion.

Citation and Acknowledgements

Data used by this repo includes RTX-KG2, which was published at the NCATS Biomedical Data Translator repository. Special thanks go to those mentioned in the RTX-KG2 credits. Further data acknowledgments may be found within the data sources documentation.

rtx-kg2-gateway's Issues

Use `ORDER BY` with `LIMIT` and `OFFSET` SQL queries

When using SQL LIMIT and OFFSET, one must also use ORDER BY to ensure deterministic results. This issue pertains to the DuckDB queries that extract row chunks of node and edge data for ingest into a Kuzu database: they need an ORDER BY clause so that all results are extracted properly.
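
A minimal sketch of the chunked-extraction pattern described in this issue follows. It uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without extra dependencies; the LIMIT / OFFSET / ORDER BY behavior shown is standard SQL. The nodes table, its columns, and the iter_chunks helper are hypothetical and are not taken from this repository.

```python
import sqlite3


def iter_chunks(conn, table, order_col, chunk_size):
    """Yield lists of rows from `table`, `chunk_size` rows at a time.

    ORDER BY on a unique column gives every row a stable position, so
    successive LIMIT/OFFSET windows neither skip nor repeat rows.
    """
    offset = 0
    while True:
        rows = conn.execute(
            f'SELECT * FROM "{table}" ORDER BY "{order_col}" LIMIT ? OFFSET ?',
            (chunk_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        offset += chunk_size


# Hypothetical example data: ten nodes inserted in reverse id order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(f"node:{i:03d}", "example") for i in reversed(range(10))],
)

chunks = list(iter_chunks(conn, "nodes", "id", chunk_size=4))
extracted_ids = [row[0] for chunk in chunks for row in chunk]
print([len(chunk) for chunk in chunks])
```

The ordering column should be unique (or the ORDER BY should list enough columns to break ties); otherwise rows with equal sort keys may still move between chunks.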

Enhance project with reusable Python package(s) for related data integrations

  • My hope is to generalize some of the functionality for potential reuse with property graphs and Kuzu (at least in this context).
  • There's what seems like an opportunity to propose multi-dimensional property graph structures within Parquet as a strongly typed data storage alternative to JSON or TSV that may come with performance benefits. I felt the metadata storage components of Parquet were especially well-suited to shared schema and provenance understandings (along with default data citation within the files themselves).
  • It's likely we could also share a Neo4J-compatible version of the data for those who may prefer it over embedded approaches.

Originally posted by @d33bs in #1 (comment)

Elaborate on example cypher and related content from RTX-KG2 Kuzu database

Good that you show how to start Jupyter Lab! You might consider adding a short tutorial where you query the data, e.g. drawing their attention to a particular notebook where they can start trying queries and any setup they might need to do before running their first query. In the tutorial, you might even show a sample query, its result, and how you can do things with the result (e.g., drawing a graph of the resulting nodes).

Not necessary IMO, but you might consider giving a simple high-level overview of what Kuzu is doing, e.g. that it's creating an in-memory database on which to perform Cypher queries on the RTX-KG2 graph.

... I'd suggest having one or a few notebooks showing how to do that in detail, including getting into the schema of the dataset in the notebook. I see that you have a notebook called "example_cypher_kuzu" that shows an example query; perhaps that one could be extended to describe the data, etc. and show useful things you can do with Kuzu on the dataset?

Originally posted by @falquaddoomi in #1 (comment)

Check Kuzu ingest for LIST type entity attributes

In making further queries of the Kuzu database I noticed there might be a discrepancy with multi-value LIST attributes of certain entities (mostly noticed with NODE entities). This issue highlights a need to double check these values and make any necessary adjustments to ensure these are queryable as needed.
