mims-harvard / primekg
Precision Medicine Knowledge Graph (PrimeKG)
Home Page: https://zitniklab.hms.harvard.edu/projects/PrimeKG
License: MIT License
Hi Dear Author,
Thanks for releasing this brilliant KG! I wonder whether there is any API deployment that uses your KG as the backend data and supports users querying it. For example, given some evidence (entity names) and relations, the API could find the corresponding answers in the KG, possibly using machine learning techniques. Have you built such a "query service"?
When following the instructions described in the README.md to load PrimeKG, the resulting dataframe has 8,100,498 edges/triples instead of the reported 4,050,249.
So running

import pandas as pd
primekg = pd.read_csv('kg.csv', low_memory=False)
disease_edges = primekg.query('y_type == "disease" | x_type == "disease"')
print(len(primekg))

gives

8100498

i.e., a dataframe of 8,100,498 rows × 10 columns.
It seems that every triple appears twice in the data, in source/target-swapped form (which makes sense, given that the resulting dataframe contains exactly twice as many triples as reported in the README.md). However, the swapped triples do not carry inverse relationships (which might cause problems in ML applications): the relation labels are identical, and only the source and target columns are swapped.
Moreover, there appear to be 370 duplicates for the drug_protein label, as can be seen by running

primekg[primekg.duplicated()]

which gives

370 rows × 10 columns
Also, when loading the graph from pykeen.datasets, the number of triples exceeds the reported 4,050,249.
from pykeen.datasets import PrimeKG
dataset = PrimeKG()
print(dataset.training.num_triples + dataset.validation.num_triples + dataset.testing.num_triples)
shows that the data has 8,099,991 triples.
Given that the source/target-swapped edges do not come with a new, i.e., reverse, relationship, using the unfiltered data with its ~8.1 million edges might cause problems in ML applications.
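Since each edge appears twice with source and target swapped and an identical relation label, one way to recover the unique undirected edges is to canonicalize each endpoint pair before dropping duplicates. A minimal sketch, assuming the kg.csv column names used above and toy data standing in for the real file:

```python
import pandas as pd

# toy frame mimicking kg.csv's doubled edges (the real data has 10 columns)
primekg = pd.DataFrame({
    "relation": ["protein_protein", "protein_protein", "drug_protein", "drug_protein"],
    "x_id": ["1", "2", "D1", "3"],
    "y_id": ["2", "1", "3", "D1"],
})

# canonical key: sort the endpoint ids so both directions collapse onto one row
primekg["key"] = [tuple(sorted(p)) for p in zip(primekg["x_id"], primekg["y_id"])]
dedup = primekg.drop_duplicates(subset=["relation", "key"]).drop(columns="key")
print(len(dedup))  # 2 unique undirected edges
```

On the full kg.csv this should roughly halve the edge count, in line with the 4,050,249 triples reported in the README.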
Hi,
First, thanks for a great resource, PrimeKG is super useful!
I was wondering if you'd be willing to share your motivation for using DrugBank rather than ChEMBL?
I'm trying to decide whether it's worth it for my company to buy a DrugBank license, or whether we can find most of what we need in ChEMBL.
Thanks,
Chris
Hi,
Some of the parent-child relations in the disease ontology seem to be incorrect.
For instance, Short bowel disorder appears both as a parent (which is correct) and as a child of Primary short bowel syndrome.
Perhaps all child-parent relations also appear as parent-child relations, in which case it is impossible to build the hierarchy.
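Such contradictory pairs can be surfaced by checking, for every hierarchy edge (x, y), whether the reverse edge (y, x) is also present. A minimal sketch with toy data; the relation name "parent_child" and the column names are assumptions based on the rest of this thread:

```python
import pandas as pd

# toy hierarchy edges; the third row is a normal, one-directional relation
edges = pd.DataFrame({
    "relation": ["parent_child"] * 3,
    "x_name": ["Short bowel disorder", "Primary short bowel syndrome", "Intestinal disease"],
    "y_name": ["Primary short bowel syndrome", "Short bowel disorder", "Short bowel disorder"],
})

pc = edges[edges["relation"] == "parent_child"]
pairs = set(zip(pc["x_name"], pc["y_name"]))
# an edge is contradictory if its reverse is also present
cycles = sorted({tuple(sorted(p)) for p in pairs if (p[1], p[0]) in pairs})
print(cycles)  # [('Primary short bowel syndrome', 'Short bowel disorder')]
```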
When I execute the drugbank_drug_protein.py script, it fails because '../data/vocab/gene_map.csv' and '../data/vocab/drugbank_vocabulary.csv' are missing.
We have observed that 88 nodes in the July 2023 version of the KG (without taking the LCC) contain non-harmonized node names. For example, the same gene MT-ND5 is represented as either "ND5" or "MT-ND5," and the gallbladder anatomy node is represented as either "gallbladder" or "gall bladder." Below is a quick R script to merge these nodes:
# load required libraries
library(data.table)
library(here)
library(magrittr)
# load edges of updated PrimeKG
primeKG_edges = fread(here(primeKG_dir, "kg", "auxiliary", "kg_raw.csv"))
# replace "off-label use" with off_label_use
primeKG_edges[relation == "off-label use", relation := "off_label_use"]
# construct node matrix
primeKG_nodes = primeKG_edges %>%
.[, .(x_id, x_type, x_name, x_source)] %>%
unique()
colnames(primeKG_nodes) = gsub("x", "node", colnames(primeKG_nodes))
# find and consolidate duplicate nodes
primeKG_nodes[, joint_id := paste(node_id, node_type, sep = "_")]
dup_list = primeKG_nodes[duplicated(joint_id), joint_id]
dup_nodes = primeKG_nodes %>%
.[joint_id %in% dup_list] %>%
.[order(joint_id)]
# separate out duplicate genes and anatomy, manually investigate
dup_anatomy = dup_nodes[node_type == "anatomy"] %>%
.[, final_name := "gall bladder"]
dup_gene = dup_nodes[node_type == "gene/protein"]
# read HGNC official IDs
hgnc_set = fread(here("Data", "ID_mappings", "hgnc_complete_set.txt"), sep = "\t") %>%
.[, entrez_id := as.character(entrez_id)]
dup_gene = merge(dup_gene, hgnc_set[, .(symbol, entrez_id)], by.x = "node_id", by.y = "entrez_id", all.x = T, all.y = F) %>%
setnames("symbol", "final_name")
# combine back
dup_nodes = rbind(dup_anatomy, dup_gene) %>%
.[, node_name := final_name] %>%
.[, final_name := NULL] %>%
unique()
# replace names as necessary
for (i in 1:nrow(dup_nodes)) {
primeKG_nodes[node_id == dup_nodes[i, node_id] & node_type == dup_nodes[i, node_type], node_name := dup_nodes[i, node_name]]
primeKG_edges[x_id == dup_nodes[i, node_id] & x_type == dup_nodes[i, node_type], x_name := dup_nodes[i, node_name]]
primeKG_edges[y_id == dup_nodes[i, node_id] & y_type == dup_nodes[i, node_type], y_name := dup_nodes[i, node_name]]
}
# drop duplicates from nodes
non_dup_rows = nrow(primeKG_nodes)
primeKG_nodes = unique(primeKG_nodes)
message("Removed ", non_dup_rows - nrow(primeKG_nodes), " duplicates")
# make indices
primeKG_nodes[, node_index := 1:nrow(primeKG_nodes) - 1]
setcolorder(primeKG_nodes, "node_index")
# add indices to edges
primeKG_nodes %>% .[, node_string := paste(node_id, node_name, node_source, sep = "_")] %>%
.[, x_index := node_index] %>%
.[, y_index := node_index]
primeKG_edges %>%
.[, x_string := paste(x_id, x_name, x_source, sep = "_")] %>%
.[, y_string := paste(y_id, y_name, y_source, sep = "_")]
# merge back to edges
primeKG_edges = merge(primeKG_edges, primeKG_nodes[, .(node_string, x_index)], by.x = "x_string", by.y = "node_string", sort = F)
primeKG_edges = merge(primeKG_edges, primeKG_nodes[, .(node_string, y_index)], by.x = "y_string", by.y = "node_string", sort = F)
# drop merge columns
primeKG_nodes %>%
.[, node_string := NULL] %>%
.[, x_index := NULL] %>%
.[, y_index := NULL]
primeKG_edges %>%
.[, x_string := NULL] %>%
.[, y_string := NULL]
setcolorder(primeKG_edges, c("relation", "display_relation", "x_index", "x_id", "x_type", "x_name", "x_source", "y_index", "y_id", "y_type", "y_name", "y_source"))
# print node counts
message("Updated PrimeKG Nodes:\t", nrow(primeKG_nodes))
message("Updated PrimeKG Edges:\t", nrow(primeKG_edges) / 2)
This script uses the file hgnc_complete_set.txt (see source), downloaded from the HUGO Gene Nomenclature Committee (HGNC), to resolve conflicting gene IDs.
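The same duplicate-name check can be sketched in pandas: group the node table by (id, type) and flag groups that carry more than one name. Column names follow the convention used above; the ids here are toy values, not real identifiers:

```python
import pandas as pd

# toy node table extracted from edge endpoints, as in the R script above
nodes = pd.DataFrame({
    "node_id": ["G1", "G1", "A1", "A2"],
    "node_type": ["gene/protein", "gene/protein", "anatomy", "anatomy"],
    "node_name": ["ND5", "MT-ND5", "gallbladder", "liver"],
}).drop_duplicates()

# count distinct names per (id, type) pair; more than one means a non-harmonized node
conflicts = (nodes.groupby(["node_id", "node_type"])["node_name"]
                  .nunique()
                  .loc[lambda s: s > 1])
print(conflicts)
```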
I'm trying to use your knowledge graph and wanted to hear your thoughts about OMIM, especially its links with HPO terms. Is that already covered? How am I supposed to extract that information? I couldn't find it in the csv file linked in this repo.
Since the MONDO Disease Ontology harmonizes diseases from a wide range of ontologies, including the Online Mendelian Inheritance in Man (OMIM), SNOMED Clinical Terms (CT), International Classification of Diseases (ICD), and Medical Dictionary for Regulatory Activities (MedDRA), it was our preferred ontology for defining diseases.
The "drugbank_atc_codes.csv" file is supposed to be located in the vocab/ directory and is necessary for build_graph.ipynb. I am unable to find a downloadable file of the DrugBank ATC codes. Where can we acquire this data? Maybe I have overlooked some step. Thank you in advance. I look forward to rebuilding the updated PrimeKG to help advance research in this domain.
pip3 install -r requirements.txt
fails because requirements.txt contains hard-coded file paths. The error is:
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/conda/conda-bld/attrs_1642510447205/work'
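That path is a conda build directory that only exists on the machine where the environment was exported: conda writes `package @ file:///...` references instead of version pins. One hedged workaround (it drops the exact pins for those lines, so resolved versions may differ) is to strip the local-path suffixes so pip falls back to PyPI:

```shell
# example of a broken line as it appears in requirements.txt
printf 'attrs @ file:///opt/conda/conda-bld/attrs_1642510447205/work\n' > requirements.txt
# strip the "@ file://..." suffix, leaving just the package name
sed 's| @ file://.*||' requirements.txt > requirements.clean.txt
cat requirements.clean.txt   # attrs
```

Then `pip3 install -r requirements.clean.txt` should proceed without the OSError.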
How does PrimeKG relate to HetioNet, which appears to be a smaller knowledge graph? For example, there might be variations in the definitions of diseases or in aggregation rules. HetioNet is also considered a state-of-the-art (SOTA) knowledge graph, yet it wasn’t discussed in the original paper.
Similarly, the Multiscale Interactome (MSI) was developed some time ago and also wasn’t included in the main paper, despite being co-developed by the author of PrimeKG.
Is it reasonable to assume that both HetioNet and MSI are substantially incorporated within PrimeKG, barring differences in ontology definitions?
I followed this part and got the error below:
primekg.query('node_type=="disease"')
UndefinedVariableError: name 'node_type' is not defined
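If primekg here is the kg.csv edge table loaded as in the snippet earlier in this thread, it has x_type/y_type columns rather than node_type, which would explain the UndefinedVariableError. A sketch of the equivalent edge filter, with toy data standing in for kg.csv:

```python
import pandas as pd

# toy edges with kg.csv-style type columns
primekg = pd.DataFrame({
    "x_type": ["disease", "drug", "gene/protein"],
    "y_type": ["gene/protein", "disease", "gene/protein"],
})

disease_edges = primekg.query('x_type == "disease" | y_type == "disease"')
print(len(disease_edges))  # 2
```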
Hello, and thank you for creating this resource.
I wanted to add this dataset to the graph-retrieval component of my library, but I cannot find permanent URLs.
The ones made available here seem to be short-lived.
Thanks!
Luca
I'm seeking clarification regarding the licensing terms for PrimeKG. There appears to be a conflict: the repository depends on DrugBank, a database restricted to non-commercial use, while PrimeKG itself is released under an MIT License. I would appreciate some insight into this inconsistency.
I'm trying to set up the Python environment but encountered the following errors. Any suggestions?
conda env create --name PrimeKG --file=environment.yml
Solving environment: failed
LibMambaUnsatisfiableError: Encountered problems while solving:
- package tokenizers-0.10.3-py310h7bafbf5_1 is excluded by strict repo priority
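The "excluded by strict repo priority" message indicates that the solver's channel priority is set to strict, which prevents it from considering that tokenizers build from a lower-priority channel. As an unverified workaround (an assumption, not a confirmed fix for this environment file), channel priority can be relaxed in ~/.condarc before re-running the create command:

```yaml
# ~/.condarc (sketch): allow the solver to consider builds from lower-priority channels
channel_priority: flexible
```

After changing this setting, re-run `conda env create --name PrimeKG --file=environment.yml`.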
Hello,
Thanks for creating this great knowledge graph. I was able to use it with PyKEEN, but I would now like to learn how to build it from scratch. Are the input files already available somewhere, or should I download them from each source's website? I only see the processing script.
Thanks !
Mi