dhimmel / learn

Machine learning and feature extraction for the Rephetio project

Home Page: https://doi.org/10.15363/thinklab.d210

Jupyter Notebook 99.99% Shell 0.01%
rephetio machine-learning logistic-regression edge-prediction drug-repurposing


learn's Issues

Kernel dies in all-features/3-extract.ipynb due to excessive RAM usage

Hi Daniel,

I am trying to run the extract notebook on a reduced Hetionet without success. The total number of queries is 41,704,686, but the kernel dies after about 3% of the queries have been submitted. I am working with two workers. Any hint on how to solve this issue would be very welcome.

Many thanks in advance!
Nuria
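A common cause of this is accumulating all query results in memory before writing. A minimal sketch of a workaround, assuming results arrive as an iterable of dicts (the writer below is illustrative, not the notebook's actual code):

import bz2
import csv

# Hypothetical incremental writer: flush each result row to a
# compressed TSV as it completes, instead of holding all ~41.7M
# rows in memory until the end of the run.
def write_results(results, path='dwpc.tsv.bz2'):
    with bz2.open(path, 'wt', newline='') as handle:
        writer = None
        for row in results:  # results: any iterable of dicts
            if writer is None:
                writer = csv.DictWriter(handle, fieldnames=list(row), delimiter='\t')
                writer.writeheader()
            writer.writerow(row)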

How do you generate blacklist.tsv in 4-predictr.ipynb?

I was trying to reproduce some of the predictions from this project and noticed that blacklist.tsv seems to come from nowhere.
How was this file generated? Is it just eliminating all the CtD and DtC features?

I think this was done with a grepl call in the all-features notebook 5.6-model.ipynb?
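If so, a grepl-style filter is easy to reproduce. A minimal sketch in Python/pandas, assuming the blacklist targets feature names containing the CtD or DtC edge abbreviations (the file path and pattern are guesses, not the project's actual code):

import pandas as pd

# Hypothetical reconstruction: drop features whose names contain the
# CtD (Compound-treats-Disease) or DtC edge abbreviations, mirroring
# a grepl-style pattern match on column names.
feature_df = pd.read_table('features.tsv')  # placeholder path
blacklisted = feature_df.columns[feature_df.columns.str.contains('CtD|DtC')]
feature_df = feature_df.drop(columns=blacklisted)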

Failing feature extraction queries due to py2neo's socket timeout

py2neo-2.0.8 has a hardcoded default timeout of 30 seconds per Cypher query. The timeout can be adjusted by overriding the default value for py2neo.packages.httpstream.http.socket_timeout.
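For reference, the override is a one-liner (the timeout value below is an arbitrary placeholder):

from py2neo.packages.httpstream import http

# Override py2neo 2.x's hardcoded 30-second socket timeout before
# issuing long-running Cypher queries; 1e8 effectively disables it.
http.socket_timeout = 1e8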

In 7788972 we did not increase the default timeout in all-features/3-extract.ipynb, which led to some queries silently failing and being omitted from all-features/data/dwpc.tsv.bz2. Previously, the notebook imported hetio.neo4j, which overrides the default timeout, so this issue didn't arise.

In total, we performed 27,315,900 queries. Below is the upper tail of the runtime distribution as measured in Python.

[Figure: upper tail of the per-query runtime distribution]

Note the subset of queries that appeared to take twice as long as the socket_timeout default of 30 seconds. I'm not ready to call this a py2neo issue, since we're running many queries in parallel using threading, which complicates the diagnosis.

Prior probability of treatment implementation question

Hi Daniel,

Do you mind explaining what cell 9 in 1-prior.ipynb is doing exactly? Code below:

%%time

# Initialize a dictionary of degree to empirical probability list
degree_to_probs = {x: list() for x in degree_to_edges}

# Perform n_perm permutations
for i in range(n_perm):
    # Permute
    pair_list, stats = permute_pair_list(pair_list, multiplier=multiplier, seed=i)

    # Update
    pair_set = set(pair_list)
    for degree, probs in degree_to_probs.items():
        edges = degree_to_edges[degree]
        probs.append(len(edges & pair_set) / len(edges))

As far as I can tell, the inner loop does not actually do anything, since probs is a variable defined in the for loop and gets overwritten each time. Also, the edges variable gets overwritten in later cells, so I could not determine what this cell is doing. Since it takes so long to compute, I was wondering whether it is even necessary.
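(A quick standalone aliasing check, not from the notebook: each probs is a reference to a list stored in degree_to_probs, so the append does mutate the dictionary in place.)

# Standalone demonstration of the aliasing in the loop above.
degree_to_probs = {'a': [], 'b': []}
for degree, probs in degree_to_probs.items():
    probs.append(0.5)  # mutates the list held by the dict
print(degree_to_probs)  # {'a': [0.5], 'b': [0.5]}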

Trained logistic regression classifier for the Hetionet drug repurposing paper

Hello Sergio and Daniel,

My students and I came across your drug repurposing paper as we were putting together a manuscript on a similar topic as part of the NIH NCATS Translator project. In short, we've constructed a Neo4j database with 125K bioentities and 7.6M relationships and are using a node vectorization algorithm and random forests to predict possible drug repurposing targets. We wanted to compare our approach to the "metapath" approach that you've taken. We've found the het.io/repurpose website, but can't seem to find the code on your GitHub page that does the actual logistic regression classifications.

Do you think you could either point us to (or send us) the classifier code or the trained classifier so we can do this comparison?

auroc.tsv in 7-transform.ipynb

Hi Daniel!

I'm trying to execute the notebook 7-transform.ipynb. In cell 2, there's a reference to the file data/auroc.tsv in the line summary_df = readr::read_tsv('data/auroc.tsv'), but I couldn't locate this file in the repository.

However, I did find a file named auroc.tsv located at data/feature-performance/auroc.tsv. Its content seems different from what is expected in 7-transform.ipynb.

Could you please guide me on how to obtain the correct auroc.tsv file as referenced in 7-transform.ipynb notebook? I appreciate any assistance you can provide!

Can't find the Symptomatic validation dataset

Hi Daniel:
I want to evaluate my model exactly as you did with the four validation datasets. Three of them (Disease Modifying, Clinical Trial, DrugCentral) are easy to get from the validation-datasets in your repository, but I have not found the Symptomatic dataset yet. Can you help me with that?
Thank you!
Lingling

Updates and Results from Changes in all-features/7-transform.ipynb

Hi Daniel!
Due to recent updates in R libraries, I've made several modifications to all-features/7-transform.ipynb, and the results seem a little different. The changes I made are as follows.

  1. added degree_transformer = to_fxn(params['degree_transformer']), since I assumed the project is trying to evaluate the transformation for degree_df too.
  2. changed the deprecated funs() call to across(everything(), ...).
  3. changed rbind_all() to bind_rows(), since rbind_all() was removed from dplyr (https://github.com/tidyverse/dplyr/issues/4430).

In the transformation-sweep.tsv I obtained, the first several rows are:
degree_transformer	dwpc_scaler	dwpc_transformer	alpha	auroc	auroc_lower	auroc_upper	auprc	auprc_lower	auprc_upper
asinh	mean	log1p	0	0.9941835007236502	0.9938819015259823	0.9944850999213182	0.9805812551344717	0.9797400253392432	0.9814224849297002
log1p	mean	log1p	0	0.9939196526468115	0.9936136156188465	0.9942256896747764	0.9799278139925953	0.9790933266429557	0.980762301342235
asinh	mean	asinh	0	0.9939028551379303	0.9936457501750738	0.9941599601007868	0.9798320302118215	0.9791166118524781	0.9805474485711649
asinh	mean	log1p	1	0.9936890048682038	0.9933479888988564	0.9940300208375511	0.977249912523295	0.9761768963653875	0.9783229286812024
asinh	mean	asinh	1	0.9935960703477849	0.9932037859475018	0.9939883547480679	0.9773938761438711	0.976172127586398	0.9786156247013441

Considering these outcomes, do you find the results acceptable? If so, I plan to initiate a pull request. I would appreciate any feedback or insights.

Why train the model with only one data set ?

Hi Daniel:

I'm wondering: you have four validation datasets containing drug-disease pairs, so why not train the final model with all the data we know? Do you think that would be a good idea?

Lingling

neo4j import failed

Hi Daniel, I am trying to export hetnets to neo4j by running neo4j-import.ipynb with neo4j-community-3.5.1 under Windows 10.
The neo4j-import.ipynb notebook involves creating a new Neo4j instance, modifying the configuration files, starting the Neo4j server, reading a graph from a file, exporting it to Neo4j, and then stopping the server. I have followed all the required steps, but unfortunately I am facing issues that I'm unable to resolve on my own.
To give you a brief overview, here is the exporting code snippet (trying to export a single hetnet.json.bz2 instead of using ProcessPool):

import os
import subprocess

neo4j_bin = os.path.join('D:/integrate1111/integrate/neo4j/neo4j-community-3.5.1_rephetio-v2.0.1/', 'bin', 'neo4j.bat')
neo4j_version = 'neo4j-community-3.5.1'
db_name = 'rephetio-v2.0.1'
listen_address_0 = 7474
connector_0 = 7687

# Create a Neo4j instance and export the hetnet (create_instance and
# hetnet_to_neo4j are helper functions defined in neo4j-import.ipynb)
neo4j_dir = create_instance(neo4j_version, db_name, listen_address_0, connector_0, overwrite=True)
hetnet_to_neo4j(path='D:/integrate1111/integrate/data/hetnet.json.bz2', neo4j_dir=neo4j_dir, listen_address=listen_address_0)

# Check that the server starts
result = subprocess.run([neo4j_bin, 'start'])
if result.returncode != 0:
    print(f'Error starting Neo4j: {result.returncode}')
else:
    print('Neo4j started successfully')

It turns out that the 47,031 nodes in hetnet.json.bz2 were imported into neo4j successfully, but the relationships were not. The output message is as follows:

neo4j\neo4j-community-3.5.1_rephetio-v2.0.1
Starting neo4j server with neo4j\neo4j-community-3.5.1_rephetio-v2.0.1\bin\neo4j.bat
Reading graph from [D:/integrate1111/integrate/data/hetnet.json.bz2](file:///D:/integrate1111/integrate/data/hetnet.json.bz2)
Exporting graph to neo4j at http://localhost:7474/db/data/
neo4j\neo4j-community-3.5.1_rephetio-v2.0.1 'Graph' object has no attribute 'find_one'
Stopping neo4j server
Traceback (most recent call last):
  File "C:\Users\我的电脑\AppData\Local\Temp\ipykernel_105868\3048707244.py", line 16, in hetnet_to_neo4j
    hetnetpy.neo4j.export_neo4j(graph, uri, 1000, 250)
  File "c:\anaconda\envs\myenv\Lib\site-packages\hetnetpy\neo4j.py", line 79, in export_neo4j
    source = db_graph.find_one(source_label, "identifier", edge.source.identifier)
             ^^^^^^^^^^^^^^^^^
AttributeError: 'Graph' object has no attribute 'find_one'
Neo4j started successfully

I am hoping you might be able to provide some guidance on how to resolve this issue. Any hints, tips, or advice would be greatly appreciated. Thank you very much for your time and consideration!
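For what it's worth, the traceback suggests a py2neo version mismatch: Graph.find_one existed in py2neo 2.x/3.x but was removed in v4+, while hetnetpy.neo4j.export_neo4j still calls it. A minimal sketch of a shim, assuming py2neo v4+ is installed (downgrading py2neo to the version the notebook targets is the other option):

import py2neo

# Hypothetical compatibility shim: restore the py2neo 2.x
# Graph.find_one signature on top of the v4+ NodeMatcher API.
def find_one(self, label, property_key, property_value):
    return self.nodes.match(label, **{property_key: property_value}).first()

py2neo.Graph.find_one = find_one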

Benchmarking test sets

Hi Daniel,
My question concerns the benchmarking data sets in Fig. 3 of the paper "Systematic integration of biomedical knowledge prioritizes drugs for repurposing". Are those available for download? I tried to compile the test data myself using DrugCentral and the other datasets you made available as part of the project. However, I can't get the number of non-indications to match those in Fig 3.

I believe I understand how you compile non-indications for the "Disease Modifying" dataset: basically, 208,413 = (1552 - 14) * (137 - 1) - 755, where 1552 is the number of compounds, 14 the number of disconnected compounds, 137 the number of diseases, 1 the number of disconnected diseases, and 755 the number of DM indications. But how do you compute the set of non-indications for DrugCentral? In particular, where does 207,572 (Fig. 3) come from? The same goes for the Clinical Trials and Symptomatic datasets.
Thank you.
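For reference, a quick check of the Disease Modifying arithmetic described above:

# Non-indications for the Disease Modifying set, per the formula above:
# (connected compounds) * (connected diseases) - DM indications
non_indications = (1552 - 14) * (137 - 1) - 755
assert non_indications == 208413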

Question on permutation size

# Calculate the number of permutations: one per unique
# compound-disease combination (29,799 here)
n_perm = treatment_df.compound_id.nunique() * treatment_df.disease_id.nunique()
# scale up by a factor of 25, giving 744,975
n_perm = int(n_perm * 25)
n_perm

I wanted to know why the number of permutations for the Drug-Gene network in your prior/1-prior.ipynb notebook is 744,975 and not 29,799. Is it that the possible number of permutations is too low?
