mims-harvard / primekg
Precision Medicine Knowledge Graph (PrimeKG)
Home Page: https://zitniklab.hms.harvard.edu/projects/PrimeKG
License: MIT License
Hi Dear Author,
Thanks for releasing this brilliant KG! I wonder whether there is any API deployment that uses your KG as the backend data and supports users querying it. For example, given some evidence (entity names) and relations, the API could find the corresponding answers in the KG, possibly using machine learning techniques. Have you built such a "query service"?
When following the instructions described in the README.md to load PrimeKG, the resulting dataframe has 8,100,498 edges/triples instead of the reported 4,050,249.
So running

import pandas as pd
primekg = pd.read_csv('kg.csv', low_memory=False)
disease_edges = primekg.query('y_type == "disease" | x_type == "disease"')
print(len(primekg))

gives

8100498

i.e., a dataframe of 8,100,498 rows × 10 columns.
It seems that every triple appears twice in the data, in source/target-swapped form (which makes sense, given that the resulting dataframe contains exactly twice as many triples as reported in the README.md). However, the swapped triples do not carry inverse relationships (which might cause problems in ML applications): the relation labels are identical, and only the source and target columns are swapped.
Moreover, there appear to be 370 duplicates for the drug_protein label, as can be seen by running

primekg[primekg.duplicated()]

which gives

370 rows × 10 columns
Also, when loading the graph from pykeen.datasets, the number of triples exceeds the reported 4,050,249.
from pykeen.datasets import PrimeKG
dataset = PrimeKG()
print(dataset.training.num_triples + dataset.validation.num_triples + dataset.testing.num_triples)
shows that the data has 8,099,991 triples.
Given that the source/target-swapped edges do not come with a new, i.e., reverse, relationship, using the unfiltered data with its ~8.1 million edges might cause problems in ML applications.
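Since each edge appears twice with source and target swapped and an identical relation label, one way to recover the unique undirected edges is to canonicalize each endpoint pair before dropping duplicates. A minimal sketch, assuming the kg.csv column names used above and toy data standing in for the real file:

```python
import pandas as pd

# toy frame mimicking kg.csv's doubled edges (the real data has 10 columns)
primekg = pd.DataFrame({
    "relation": ["protein_protein", "protein_protein", "drug_protein", "drug_protein"],
    "x_id": ["1", "2", "D1", "3"],
    "y_id": ["2", "1", "3", "D1"],
})

# canonical key: sort the endpoint ids so both directions collapse onto one row
primekg["key"] = [tuple(sorted(p)) for p in zip(primekg["x_id"], primekg["y_id"])]
dedup = primekg.drop_duplicates(subset=["relation", "key"]).drop(columns="key")
print(len(dedup))  # 2 unique undirected edges
```

On the full kg.csv this should roughly halve the edge count, in line with the 4,050,249 triples reported in the README.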
Hi,
First, thanks for a great resource, PrimeKG is super useful!
I was wondering if you'd be willing to share your motivation for using DrugBank rather than ChEMBL?
I'm trying to decide whether it's worth it for my company to buy a DrugBank license, or whether we can find most of what we need in ChEMBL.
Thanks,
Chris
Hi,
Some of the parent-child relations in the disease ontology seem to be incorrect.
For instance, Short bowel disorder appears both as a parent (which is correct) and as a child of Primary short bowel syndrome.
Perhaps all child-parent relations also appear as parent-child relations, in which case it is impossible to build the hierarchy.
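Such contradictory pairs can be surfaced by checking, for every hierarchy edge (x, y), whether the reverse edge (y, x) is also present. A minimal sketch with toy data; the relation name "parent_child" and the column names are assumptions based on the rest of this thread:

```python
import pandas as pd

# toy hierarchy edges; the third row is a normal, one-directional relation
edges = pd.DataFrame({
    "relation": ["parent_child"] * 3,
    "x_name": ["Short bowel disorder", "Primary short bowel syndrome", "Intestinal disease"],
    "y_name": ["Primary short bowel syndrome", "Short bowel disorder", "Short bowel disorder"],
})

pc = edges[edges["relation"] == "parent_child"]
pairs = set(zip(pc["x_name"], pc["y_name"]))
# an edge is contradictory if its reverse is also present
cycles = sorted({tuple(sorted(p)) for p in pairs if (p[1], p[0]) in pairs})
print(cycles)  # [('Primary short bowel syndrome', 'Short bowel disorder')]
```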
When I execute the drugbank_drug_protein.py script, it fails because '../data/vocab/gene_map.csv' and '../data/vocab/drugbank_vocabulary.csv' are missing.
We have observed that 88 nodes in the July 2023 version of the KG (without taking the LCC) contain non-harmonized node names. For example, the same gene MT-ND5 is represented as either "ND5" or "MT-ND5," and the gallbladder anatomy node is represented as either "gallbladder" or "gall bladder." Below is a quick R script to merge these nodes:
# load required libraries
library(data.table)
library(here)
library(magrittr)
# load edges of updated PrimeKG
primeKG_edges = fread(here(primeKG_dir, "kg", "auxiliary", "kg_raw.csv"))
# replace "off-label use" with off_label_use
primeKG_edges[relation == "off-label use", relation := "off_label_use"]
# construct node matrix
primeKG_nodes = primeKG_edges %>%
.[, .(x_id, x_type, x_name, x_source)] %>%
unique()
colnames(primeKG_nodes) = gsub("x", "node", colnames(primeKG_nodes))
# find and consolidate duplicate nodes
primeKG_nodes[, joint_id := paste(node_id, node_type, sep = "_")]
dup_list = primeKG_nodes[duplicated(joint_id), joint_id]
dup_nodes = primeKG_nodes %>%
.[joint_id %in% dup_list] %>%
.[order(joint_id)]
# separate out duplicate genes and anatomy, manually investigate
dup_anatomy = dup_nodes[node_type == "anatomy"] %>%
.[, final_name := "gall bladder"]
dup_gene = dup_nodes[node_type == "gene/protein"]
# read HGNC official IDs
hgnc_set = fread(here("Data", "ID_mappings", "hgnc_complete_set.txt"), sep = "\t") %>%
.[, entrez_id := as.character(entrez_id)]
dup_gene = merge(dup_gene, hgnc_set[, .(symbol, entrez_id)], by.x = "node_id", by.y = "entrez_id", all.x = T, all.y = F) %>%
setnames("symbol", "final_name")
# combine back
dup_nodes = rbind(dup_anatomy, dup_gene) %>%
.[, node_name := final_name] %>%
.[, final_name := NULL] %>%
unique()
# replace names as necessary
for (i in 1:nrow(dup_nodes)) {
primeKG_nodes[node_id == dup_nodes[i, node_id] & node_type == dup_nodes[i, node_type], node_name := dup_nodes[i, node_name]]
primeKG_edges[x_id == dup_nodes[i, node_id] & x_type == dup_nodes[i, node_type], x_name := dup_nodes[i, node_name]]
primeKG_edges[y_id == dup_nodes[i, node_id] & y_type == dup_nodes[i, node_type], y_name := dup_nodes[i, node_name]]
}
# drop duplicates from nodes
non_dup_rows = nrow(primeKG_nodes)
primeKG_nodes = unique(primeKG_nodes)
message("Removed ", non_dup_rows - nrow(primeKG_nodes), " duplicates")
# make indices
primeKG_nodes[, node_index := 1:nrow(primeKG_nodes) - 1]
setcolorder(primeKG_nodes, "node_index")
# add indices to edges
primeKG_nodes %>% .[, node_string := paste(node_id, node_name, node_source, sep = "_")] %>%
.[, x_index := node_index] %>%
.[, y_index := node_index]
primeKG_edges %>%
.[, x_string := paste(x_id, x_name, x_source, sep = "_")] %>%
.[, y_string := paste(y_id, y_name, y_source, sep = "_")]
# merge back to edges
primeKG_edges = merge(primeKG_edges, primeKG_nodes[, .(node_string, x_index)], by.x = "x_string", by.y = "node_string", sort = F)
primeKG_edges = merge(primeKG_edges, primeKG_nodes[, .(node_string, y_index)], by.x = "y_string", by.y = "node_string", sort = F)
# drop merge columns
primeKG_nodes %>%
.[, node_string := NULL] %>%
.[, x_index := NULL] %>%
.[, y_index := NULL]
primeKG_edges %>%
.[, x_string := NULL] %>%
.[, y_string := NULL]
setcolorder(primeKG_edges, c("relation", "display_relation", "x_index", "x_id", "x_type", "x_name", "x_source", "y_index", "y_id", "y_type", "y_name", "y_source"))
# print node counts
message("Updated PrimeKG Nodes:\t", nrow(primeKG_nodes))
message("Updated PrimeKG Edges:\t", nrow(primeKG_edges) / 2)
This script uses the file hgnc_complete_set.txt (see source), downloaded from the HUGO Gene Nomenclature Committee (HGNC), to resolve conflicting gene IDs.
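The same duplicate-name check can be sketched in pandas: group the node table by (id, type) and flag groups that carry more than one name. Column names follow the convention used above; the ids here are toy values, not real identifiers:

```python
import pandas as pd

# toy node table extracted from edge endpoints, as in the R script above
nodes = pd.DataFrame({
    "node_id": ["G1", "G1", "A1", "A2"],
    "node_type": ["gene/protein", "gene/protein", "anatomy", "anatomy"],
    "node_name": ["ND5", "MT-ND5", "gallbladder", "liver"],
}).drop_duplicates()

# count distinct names per (id, type) pair; more than one means a non-harmonized node
conflicts = (nodes.groupby(["node_id", "node_type"])["node_name"]
                  .nunique()
                  .loc[lambda s: s > 1])
print(conflicts)
```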
I'm trying to use your knowledge graph and wanted to hear your thoughts about OMIM, especially its links with HPO terms. Is that already covered? How am I supposed to extract that information? I couldn't find it in the csv file linked in this repo.
Since the MONDO Disease Ontology harmonizes diseases from a wide range of ontologies, including the Online Mendelian Inheritance in Man (OMIM), SNOMED Clinical Terms (CT), International Classification of Diseases (ICD), and Medical Dictionary for Regulatory Activities (MedDRA), it was our preferred ontology for defining diseases.
The "drugbank_atc_codes.csv" file is supposed to be located in the vocab/ directory and is necessary for build_graph.ipynb. I am unable to find a downloadable file of the DrugBank ATC codes. Where can we acquire this data? Maybe I have overlooked some step. Thank you in advance. I look forward to rebuilding the updated PrimeKG to help advance research in this domain.
pip3 install -r requirements.txt
fails because requirements.txt contains hard-coded file paths. The error is:
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/conda-bld/attrs_1642510447205/work'
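That path is a conda build directory that only exists on the machine where the environment was exported: conda writes `package @ file:///...` references instead of version pins. One hedged workaround (it drops the exact pins for those lines, so resolved versions may differ) is to strip the local-path suffixes so pip falls back to PyPI:

```shell
# example of a broken line as it appears in requirements.txt
printf 'attrs @ file:///opt/conda/conda-bld/attrs_1642510447205/work\n' > requirements.txt
# strip the "@ file://..." suffix, leaving just the package name
sed 's| @ file://.*||' requirements.txt > requirements.clean.txt
cat requirements.clean.txt   # attrs
```

Then `pip3 install -r requirements.clean.txt` should proceed without the OSError.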
How does PrimeKG relate to HetioNet, which appears to be a smaller knowledge graph? For example, there might be variations in the definitions of diseases or in aggregation rules. HetioNet is also considered a state-of-the-art (SOTA) knowledge graph, yet it wasn’t discussed in the original paper.
Similarly, the Multiscale Interactome (MSI) was developed some time ago and also wasn’t included in the main paper, despite being co-developed by the author of PrimeKG.
Is it reasonable to assume that both HetioNet and MSI are substantially incorporated within PrimeKG, barring differences in ontology definitions?
I followed this part and got the error below:
primekg.query('node_type=="disease"')
UndefinedVariableError: name 'node_type' is not defined
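If primekg here is the kg.csv edge table loaded as in the snippet earlier in this thread, it has x_type/y_type columns rather than node_type, which would explain the UndefinedVariableError. A sketch of the equivalent edge filter, with toy data standing in for kg.csv:

```python
import pandas as pd

# toy edges with kg.csv-style type columns
primekg = pd.DataFrame({
    "x_type": ["disease", "drug", "gene/protein"],
    "y_type": ["gene/protein", "disease", "gene/protein"],
})

disease_edges = primekg.query('x_type == "disease" | y_type == "disease"')
print(len(disease_edges))  # 2
```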
Hello, and thank you for creating this resource.
I wanted to add this dataset to the graph-retrieval component of my library, but I cannot find permanent URLs.
The ones made available here seem to be short-lived.
Thanks!
Luca
I'm seeking clarification regarding the licensing terms for PrimeKG. There appears to be a conflict: the repository depends on DrugBank, a database restricted to non-commercial use, while PrimeKG itself is released under an MIT License. I would appreciate some insight into this inconsistency.
I'm trying to set up the Python environment but encountered the following errors. Any suggestions?
conda env create --name PrimeKG --file=environment.yml
Solving environment: failed
LibMambaUnsatisfiableError: Encountered problems while solving:
- package tokenizers-0.10.3-py310h7bafbf5_1 is excluded by strict repo priority
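The "excluded by strict repo priority" message indicates that the solver's channel priority is set to strict, which prevents it from considering that tokenizers build from a lower-priority channel. As an unverified workaround (an assumption, not a confirmed fix for this environment file), channel priority can be relaxed in ~/.condarc before re-running the create command:

```yaml
# ~/.condarc (sketch): allow the solver to consider builds from lower-priority channels
channel_priority: flexible
```

After changing this setting, re-run `conda env create --name PrimeKG --file=environment.yml`.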
Hello,
Thanks for creating this great knowledge graph. I was able to use it with PyKEEN, but I would now like to learn how to build it from scratch. Are the input files already available somewhere, or should I download them from each source's website? I only see the processing script.
Thanks !
Mi