dhimmel / learn
Machine learning and feature extraction for the Rephetio project
Home Page: https://doi.org/10.15363/thinklab.d210
Hi Daniel,
I am trying to run the extract notebook on a reduced Hetionet without success. The total number of queries is 41,704,686, but the kernel dies after about 3% of the queries have been submitted. I am working with two workers. Any hint on how to solve this issue would be very welcome.
Many thanks in advance!
Nuria
I was trying to reproduce some of the predictions from this project and noticed that blacklist.tsv seems to come from nowhere.
How was this file generated? Is it just eliminating all the CtD and DtC features?
I think this was done with a grepl on the all-features file in 5.6-model.ipynb?
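A minimal sketch of what such a filter could look like in Python rather than grepl; the feature names below are illustrative stand-ins, not the actual all-features columns:

```python
# Drop any DWPC feature whose metapath abbreviation contains the CtD or
# DtC metaedge (illustrative names, not the real column set).
features = ['dwpc_CbGaD', 'dwpc_CtDrD', 'dwpc_CpDtCtD', 'dwpc_CuGuD']
blacklist = [f for f in features if 'CtD' in f or 'DtC' in f]
print(blacklist)
```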
py2neo-2.0.8 has a hardcoded default timeout of 30 seconds per Cypher query. The timeout can be adjusted by overriding the default value of py2neo.packages.httpstream.http.socket_timeout.
In 7788972 we did not increase the default timeout in all-features/3-extract.ipynb, which led to some queries silently failing and being omitted from all-features/data/dwpc.tsv.bz2. Previously, the notebook imported hetio.neo4j, which overrides the default timeout, so this issue didn't arise.
In total, we performed 27,315,900 queries. Below is the upper tail of the runtime distribution as measured in Python. Note the subset of queries that appeared to take twice as long as the socket_timeout default of 30 seconds. I'm not ready to call this a py2neo issue, since we're running many queries in parallel using threading, which complicates the diagnosis.
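The override described above can be sketched as follows. The 600-second value is an arbitrary illustration, and the try/except only guards against py2neo 2.0.8 not being installed:

```python
import importlib

# Raise py2neo 2.0.8's hardcoded 30 s HTTP socket timeout before issuing
# long-running Cypher queries. The attribute path is the one named in
# this thread; 600 s is an illustrative value, not a recommendation.
try:
    http = importlib.import_module('py2neo.packages.httpstream.http')
    http.socket_timeout = 600
except ImportError:
    http = None  # py2neo 2.0.8 not available; nothing to patch
```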
Hi Daniel,
Do you mind explaining what cell 9 in 1-prior.ipynb is doing exactly? Code below:
%%time
# Initialize a dictionary of degree to empirical probability list
degree_to_probs = {x: list() for x in degree_to_edges}

# Perform n_perm permutations
for i in range(n_perm):
    # Permute
    pair_list, stats = permute_pair_list(pair_list, multiplier=multiplier, seed=i)
    # Update
    pair_set = set(pair_list)
    for degree, probs in degree_to_probs.items():
        edges = degree_to_edges[degree]
        probs.append(len(edges & pair_set) / len(edges))
As far as I can tell, the inner loop does not actually do anything, since probs is a variable defined in the for loop and gets overwritten each time. Also, the edges variable gets overwritten in later cells, so I could not determine what this cell is doing. Since it takes so long to compute, I was wondering if it was even necessary.
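For reference, a minimal stand-alone sketch of the pattern in that inner loop: `probs` is bound to the list object stored inside the dictionary, so `append` does mutate `degree_to_probs` in place (whether those lists are consumed later is a separate question). Toy keys and values below, not the real data:

```python
# Iterating .items() yields references to the list objects stored as
# dict values, so appending through `probs` updates the dict in place.
degree_to_probs = {(1, 1): [], (2, 3): []}
for degree, probs in degree_to_probs.items():
    probs.append(0.5)  # mutates the list inside degree_to_probs

print(degree_to_probs)  # both lists now contain 0.5
```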
Hello Sergio and Daniel,
My students and I came across your drug repurposing paper as we were putting together a manuscript on a similar topic as part of the NIH NCATS Translator project. In short, we've constructed a Neo4j database with 125K bioentities and 7.6M relationships and are using a node vectorization algorithm and random forests to predict possible drug repurposing targets. We wanted to compare our approach to the "metapath" approach that you've taken. We've found the het.io/repurpose website, but can't seem to find the code on your GitHub page that does the actual logistic regression classifications.
Do you think you could either point us to (or send us) the classifier code or the trained classifier so we can do this comparison?
Hi Daniel!
I'm trying to execute the notebook 7-transform.ipynb. In cell 2, there's a reference to the file data/auroc.tsv in the line summary_df = readr::read_tsv('data/auroc.tsv'), but I couldn't locate this file in the repository. However, I did find a file named auroc.tsv located at data/feature-performance/auroc.tsv, but its content seems different from what is expected in 7-transform.ipynb.
Could you please guide me on how to obtain the correct auroc.tsv file as referenced in the 7-transform.ipynb notebook? I appreciate any assistance you can provide!
Hi Daniel:
I want to evaluate my model exactly as you did with the four validation datasets. Three of them (Disease Modifying, Clinical Trial, DrugCentral) are easy to get from the validation-datasets directory in your repository, but I have not found the Symptomatic dataset yet. Can you help me with that?
Thank you!
Lingling
Hi Daniel!
Due to recent updates in R libraries, I've made several modifications to all-features/7-transform.ipynb, and the results seem a little different. Following are the changes I made:
- Added degree_transformer = to_fxn(params['degree_transformer']), since I assumed the project is trying to evaluate transformations for degree_df too.
- Changed the funs function to across(everything()), due to funs being deprecated.

The top rows of the resulting grid:

degree_transformer	dwpc_scaler	dwpc_transformer	alpha	auroc	auroc_lower	auroc_upper	auprc	auprc_lower	auprc_upper
asinh	mean	log1p	0	0.9941835007236502	0.9938819015259823	0.9944850999213182	0.9805812551344717	0.9797400253392432	0.9814224849297002
log1p	mean	log1p	0	0.9939196526468115	0.9936136156188465	0.9942256896747764	0.9799278139925953	0.9790933266429557	0.980762301342235
asinh	mean	asinh	0	0.9939028551379303	0.9936457501750738	0.9941599601007868	0.9798320302118215	0.9791166118524781	0.9805474485711649
asinh	mean	log1p	1	0.9936890048682038	0.9933479888988564	0.9940300208375511	0.977249912523295	0.9761768963653875	0.9783229286812024
asinh	mean	asinh	1	0.9935960703477849	0.9932037859475018	0.9939883547480679	0.9773938761438711	0.976172127586398	0.9786156247013441
Considering these outcomes, do you find the results acceptable? If so, I plan to initiate a pull request. I would appreciate any feedback or insights.
Hi Daniel:
I'm wondering: you have four validation datasets containing drug-disease pairs, so why not train the final model on all the data we know?
Do you think it is a good idea?
Lingling
Hi Daniel, I am trying to export Hetnets to neo4j by running neo4j-import.ipynb using neo4j-community-3.5.1 under Windows 10.
The notebook involves creating a new Neo4j instance, modifying the configuration files, starting the Neo4j server, reading a graph from a file, exporting it to neo4j, and then stopping the server. I have followed all the required steps, but unfortunately, I am facing issues that I'm unable to resolve on my own.
To give you a brief overview, here is the exporting code snippet (trying to export a single hetnet.json.bz2 instead of using ‘ProcessPool’).
neo4j_bin = os.path.join('D:/integrate1111/integrate/neo4j/neo4j-community-3.5.1_rephetio-v2.0.1/', 'bin', 'neo4j.bat')
neo4j_version = 'neo4j-community-3.5.1'
db_name = 'rephetio-v2.0.1'
listen_address_0 = 7474
connector_0 = 7687

# Create a Neo4j instance and start the server
neo4j_dir = create_instance(neo4j_version, db_name, listen_address_0, connector_0, overwrite=True)
hetnet_to_neo4j(path='D:/integrate1111/integrate/data/hetnet.json.bz2', neo4j_dir=neo4j_dir, listen_address=listen_address_0)

# Check command
result = subprocess.run([neo4j_bin, 'start'])
if result.returncode != 0:
    print(f'Error starting Neo4j: {result.returncode}')
else:
    print('Neo4j started successfully')
It turns out that the 47,031 nodes in hetnet.json.bz2 were imported successfully to neo4j, but the relationships were not. The output message is as follows:
neo4j\neo4j-community-3.5.1_rephetio-v2.0.1
Starting neo4j server with neo4j\neo4j-community-3.5.1_rephetio-v2.0.1\bin\neo4j.bat
Reading graph from [D:/integrate1111/integrate/data/hetnet.json.bz2](file:///D:/integrate1111/integrate/data/hetnet.json.bz2)
Exporting graph to neo4j at http://localhost:7474/db/data/
neo4j\neo4j-community-3.5.1_rephetio-v2.0.1 'Graph' object has no attribute 'find_one'
Stopping neo4j server
Traceback (most recent call last):
File "C:\Users\我的电脑\AppData\Local\Temp\ipykernel_105868\3048707244.py", line 16, in hetnet_to_neo4j
hetnetpy.neo4j.export_neo4j(graph, uri, 1000, 250)
File "c:\anaconda\envs\myenv\Lib\site-packages\hetnetpy\neo4j.py", line 79, in export_neo4j
source = db_graph.find_one(source_label, "identifier", edge.source.identifier)
^^^^^^^^^^^^^^^^^
AttributeError: 'Graph' object has no attribute 'find_one'
Neo4j started successfully
I am hoping you might be able to provide some guidance on how to resolve this issue. Any hints, tips or advice would be greatly appreciated. Thank you very much for your time and consideration!
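The AttributeError suggests the installed py2neo is newer than the 2.x-era API that hetnetpy's export_neo4j calls: Graph.find_one was removed in later py2neo releases in favor of NodeMatcher. A hedged compatibility sketch (the wrapper function and the fake graph below are my own illustrations, not hetnetpy code):

```python
def find_one(db_graph, label, key, value):
    """Look up a single node across py2neo API generations."""
    if hasattr(db_graph, 'find_one'):
        # py2neo 2.x/3.x API, which hetnetpy's export_neo4j expects
        return db_graph.find_one(label, key, value)
    # py2neo >= 4 removed Graph.find_one in favor of NodeMatcher
    from py2neo import NodeMatcher
    return NodeMatcher(db_graph).match(label, **{key: value}).first()


class _FakeGraph:
    """Stand-in exposing the old 2.x method, just to show the call shape."""
    def find_one(self, label, key, value):
        return (label, key, value)

node = find_one(_FakeGraph(), 'Compound', 'identifier', 'DB00050')
```

That said, the simpler fix is probably pinning py2neo to a version contemporary with hetnetpy's export_neo4j, since other 2.x-era calls in that function would hit the same wall.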
Hi Daniel,
My question concerns the benchmarking data sets in Fig. 3 of the paper "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". Are those available for download? I tried to compile the test data myself using DrugCentral and the other datasets you made available as part of the project. However, I can't get the number of non-indications to match those in Fig 3.
I believe I understand how you compile non-indications for "Disease Modifying" dataset. Basically, 208,413 = (1552 - 14) * (137 - 1) - 755 (where 1552 is #compounds, 14 is #disconnected compounds, 137 is #diseases, 1 is # disconnected diseases and 755 is #DM indications). But, how do you compute the set of non-indications for Drug Central? In particular, where does 207,572 (Fig. 3) come from? Same for Clinical Trials and Symptomatic data sets.
Thank you.
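For what it's worth, the Disease Modifying arithmetic stated above can be checked mechanically (all numbers copied from the message; nothing new assumed):

```python
# Reproducing the stated non-indication count for Disease Modifying
compounds, disconnected_compounds = 1552, 14
diseases, disconnected_diseases = 137, 1
dm_indications = 755
non_indications = (compounds - disconnected_compounds) * (diseases - disconnected_diseases) - dm_indications
print(non_indications)  # 208413
```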
# Calculate the number of perms
n_perm = treatment_df.compound_id.nunique() * treatment_df.disease_id.nunique()
n_perm = int(n_perm * 25)
n_perm
I wanted to know why the number of permutations for the Drug-Gene network in your prior/1-prior.ipynb notebook is 744,975 and not 29,799. Is it that the possible number of permutations is too low?
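If it helps, the two numbers differ exactly by the `* 25` multiplier in the snippet above, assuming 29,799 is the compound-disease pair product computed in the first line:

```python
# The factor of 25 from `n_perm = int(n_perm * 25)` accounts for the gap
pair_count = 29_799  # assumed nunique(compound_id) * nunique(disease_id)
n_perm = pair_count * 25
print(n_perm)  # 744975
```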
Code in predictr.ipynb:
fit = hetior::glmnet_train(X = X_train, y = y_train, alpha = 0.2, s = lambda, cores = 10, seed = 0, penalty.factor=penalty, lambda.min.ratio=1e-8, nlambda=150, standardize=TRUE)
How do these parameters map to sklearn logistic regression parameters?
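There is no exact equivalence, but a rough correspondence can be sketched. Assumptions: glmnet's alpha is the elastic-net mixing parameter (sklearn's l1_ratio), and glmnet's lambda maps only approximately to C = 1 / (lambda * n_samples), since the two libraries scale their objectives differently. The values below are illustrative:

```python
# Hypothetical translation of the glmnet_train call into
# sklearn.linear_model.LogisticRegression keyword arguments.
n_samples = 1000   # illustrative; use the training-set size
lam = 0.01         # illustrative; glmnet's selected lambda (the `s` argument)
sklearn_params = {
    'penalty': 'elasticnet',      # glmnet mixes L1/L2 penalties
    'l1_ratio': 0.2,              # glmnet alpha = 0.2
    'C': 1.0 / (lam * n_samples), # rough lambda -> C conversion
    'solver': 'saga',             # the sklearn solver supporting elastic net
    'max_iter': 5000,             # elastic net often needs more iterations
}
print(sklearn_params['C'])  # 0.1
```

Note that glmnet's penalty.factor (per-feature penalty weights) has no LogisticRegression counterpart, and standardize=TRUE is typically reproduced with a StandardScaler step in a Pipeline.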